MiMo‑V2‑Flash: Revolutionizing Long‑Context Language Modeling with Hybrid Attention

Key Takeaway

With only 15 B active parameters, an efficient hybrid attention architecture, and Multi‑Token Prediction, MiMo‑V2‑Flash combines strong long‑context modeling with high inference speed and, according to the source, outperforms earlier large models.

Summary

Model Architecture

Mixture‑of‑Experts (MoE) model with 309 B total parameters, 15 B of them active.
Hybrid attention architecture: five Sliding‑Window Attention (SWA) layers for every Global Attention (GA) layer (5:1 ratio). SWA uses a 128‑token window; GA supplies global context.
Learnable attention sink bias that preserves long‑context performance despite the aggressive window size.
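
The layer layout and the two attention rules can be illustrated with a minimal sketch. It is not the released implementation: the function names are hypothetical, placing the GA layer last in each group of six is an assumption, and the sink formulation (one extra learnable logit in the softmax denominator) is one common variant that the source does not spell out.

```python
import math

def layer_kinds(num_layers, swa_per_ga=5):
    """Label each layer 'SWA' or 'GA' so that every group of
    swa_per_ga + 1 layers holds swa_per_ga SWA layers and one GA layer.
    Placing the GA layer last in each group is an assumption."""
    group = swa_per_ga + 1
    return ["GA" if (i + 1) % group == 0 else "SWA" for i in range(num_layers)]

def can_attend(query_pos, key_pos, kind, window=128):
    """Causal visibility rule for a single query/key pair."""
    if key_pos > query_pos:
        return False                        # never look at future tokens
    if kind == "GA":
        return True                         # global layers see the whole prefix
    return query_pos - key_pos < window     # SWA: only the last `window` tokens

def softmax_with_sink(scores, sink_logit):
    """Softmax whose denominator includes one extra (learnable) sink logit.
    The sink absorbs probability mass, so the weights sum to less than 1."""
    peak = max(max(scores), sink_logit)     # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - peak)
    return [e / denom for e in exps]

if __name__ == "__main__":
    print(layer_kinds(12))
    print(can_attend(1000, 800, "SWA"), can_attend(1000, 800, "GA"))
    print(sum(softmax_with_sink([0.0, 0.0], sink_logit=0.0)))
```

The sink gives a head somewhere to put attention mass when nothing inside the 128‑token window is relevant, which is one plausible reason it helps with such a small window.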

Multi‑Token Prediction (MTP)

Lightweight MTP module (0.33 B parameters per block) with dense FFNs.
Triples output speed at inference and benefits RL training through faster rollouts.
Integrated into both training and inference, with no separate speculative‑decoding phase.
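
A rough way to see where a speed‑up of this size could come from is sketched below. It assumes that each MTP block proposes one extra token, that a proposal is only kept if all earlier proposals in the same step were kept, and that each proposal is accepted independently with the same probability. None of these assumptions is stated in the source, and `expected_tokens_per_step` is a hypothetical helper.

```python
def expected_tokens_per_step(num_draft, accept_prob):
    """Expected tokens emitted per forward step.

    One token always comes from the main model. Draft token i only counts
    if drafts 1..i were all accepted, which happens with probability
    accept_prob ** i under the independence assumption."""
    return 1 + sum(accept_prob ** i for i in range(1, num_draft + 1))

if __name__ == "__main__":
    for p in (0.5, 0.8, 0.9, 1.0):
        print(p, expected_tokens_per_step(3, p))
```

Under these assumptions three draft tokens give at most four tokens per step, so a roughly threefold speed‑up would require a high acceptance rate and low overhead from the MTP blocks.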

Efficient Pre‑Training Pipeline

27 T tokens, FP8 mixed precision, native 32k sequence length.
Supports context windows of up to 256k tokens.
Cuts KV‑cache storage by nearly 6× compared with standard MoE models (thanks to the 128‑token window).
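
The "nearly 6×" figure follows from the 5:1 layer ratio, as the back‑of‑the‑envelope sketch below shows. It counts cached token positions per layer and assumes every layer has the same KV width. The layer count of 48 is a hypothetical value chosen only because it is divisible by six, not a number from the source.

```python
def kv_cache_positions(seq_len, num_layers, swa_per_ga=5, window=128):
    """Token positions held in the KV cache, summed over all layers.
    GA layers cache the full sequence; SWA layers cache at most `window`."""
    num_ga = num_layers // (swa_per_ga + 1)
    num_swa = num_layers - num_ga
    return num_ga * seq_len + num_swa * min(seq_len, window)

def reduction_factor(seq_len, num_layers, swa_per_ga=5, window=128):
    """Cache size of an all-global model divided by the hybrid cache size."""
    full = num_layers * seq_len
    return full / kv_cache_positions(seq_len, num_layers, swa_per_ga, window)

if __name__ == "__main__":
    for seq_len in (32 * 1024, 256 * 1024):
        # 48 layers is an illustrative assumption
        print(seq_len, round(reduction_factor(seq_len, num_layers=48), 3))
```

As the sequence grows, the fixed 128‑token windows become negligible and the ratio approaches 6, the inverse of the one‑in‑six share of GA layers.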

Agentic Capabilities

Post‑training with Multi‑Teacher On‑Policy Distillation (MOPD) and large‑scale agentic reinforcement learning.
Achieves state‑of‑the‑art results on SWE‑Bench and on complex reasoning tasks (e.g. MMLU‑Pro, GPQA‑Diamond).

Model Downloads & Open Source

Two main models are available: MiMo-V2-Flash-Base and MiMo-V2-Flash (same size).
Both models (309 B total, 15 B active, 256k context) can be downloaded freely from HuggingFace.
3‑layer MTP weights are included to encourage community research.
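
One way to fetch the weights is the `snapshot_download` function from the third‑party `huggingface_hub` package, sketched below. The organization name `XiaomiMiMo` is taken from the GitHub URL and is assumed to match the HuggingFace repository owner; check the model page before relying on it. `download_kwargs` and `download` are hypothetical helpers.

```python
def download_kwargs(variant="MiMo-V2-Flash", org="XiaomiMiMo", local_dir=None):
    """Build keyword arguments for huggingface_hub.snapshot_download.
    The repo id `org/variant` is an assumption, not confirmed by the source."""
    kwargs = {"repo_id": f"{org}/{variant}"}
    if local_dir is not None:
        kwargs["local_dir"] = local_dir
    return kwargs

def download(variant="MiMo-V2-Flash", local_dir=None):
    """Download every file of the repository and return the local path."""
    # Imported lazily so the helper above works without the package installed
    # (pip install huggingface_hub).
    from huggingface_hub import snapshot_download
    return snapshot_download(**download_kwargs(variant, local_dir=local_dir))

if __name__ == "__main__":
    # Only prints the arguments; call download(...) to start the transfer.
    print(download_kwargs("MiMo-V2-Flash-Base", local_dir="./mimo-base"))
```

With 309 B total parameters, the checkpoint is very large, so make sure enough disk space is available before calling `download`.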

Benchmark Results (Base Model)

Strong results on BBH (3‑shot, 88.5 % vs. ≈ 88 % for comparison models) and MMLU (5‑shot, 86.7 %).
Improved results on MMLU‑Redux (90.6 %) and MMLU‑Pro (73.2 %)…
Further benchmarks are listed …

Benchmark Results (Post‑Training)

Building on the MTP weights, MiMo‑V2‑Flash outperforms numerous competing large models across a range of benchmarks.

Related queries

Where can I download the MiMo‑V2‑Flash weights?

How does the hybrid attention architecture of MiMo‑V2‑Flash work?

On which benchmarks does MiMo‑V2‑Flash outperform competing large models?

Source: https://github.com/XiaomiMiMo/MiMo-V2-Flash
