The StepFun AI team has released Step-Audio 2 Mini, an 8B-parameter speech-to-speech large audio language model (LALM) that delivers expressive, grounded, and real-time audio interaction. Released under the Apache 2.0 license, this open-source model achieves state-of-the-art performance on speech recognition, audio understanding, and speech conversation benchmarks, rivaling commercial systems such as GPT-4o-Audio.

Key features
1. Unified audio-text tokenization
Unlike cascaded ASR+LLM+TTS pipelines, Step-Audio 2 uses multimodal token modeling, where text and audio tokens share a single modeling stream. This enables:
- Seamless reasoning across text and audio.
- On-the-fly voice style switching during inference.
- Consistent semantic, prosodic, and emotional output.
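The idea of a shared stream can be sketched as a single vocabulary that holds both text IDs and discrete audio-codec IDs, so one autoregressive sequence can interleave reasoning tokens and speech tokens. A minimal sketch, assuming an illustrative vocabulary layout (the IDs and split below are hypothetical, not Step-Audio 2's actual scheme):

```python
# Hypothetical shared vocabulary: text IDs first, audio-codec IDs offset after.
# Sizes are illustrative only.
TEXT_VOCAB_SIZE = 32000   # text token IDs: 0 .. 31999
AUDIO_VOCAB_SIZE = 4096   # audio codec IDs: mapped to 32000 .. 36095

def audio_token(codec_id: int) -> int:
    """Map a discrete audio-codec ID into the shared vocabulary."""
    assert 0 <= codec_id < AUDIO_VOCAB_SIZE
    return TEXT_VOCAB_SIZE + codec_id

def is_audio(token: int) -> bool:
    return token >= TEXT_VOCAB_SIZE

# One interleaved sequence: the model can emit text tokens (reasoning)
# and audio tokens (speech) in the same autoregressive stream.
sequence = [101, 2045, audio_token(7), audio_token(8), 102, audio_token(9)]

text_part = [t for t in sequence if not is_audio(t)]
audio_part = [t - TEXT_VOCAB_SIZE for t in sequence if is_audio(t)]
```

Because both modalities live in one stream, the decoder can switch between "thinking in text" and "speaking in audio" token by token, which is what makes mid-response voice style changes possible.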
2. Expressive and emotional generation
The model does not merely transcribe speech – it interprets paralinguistic features such as pitch, rhythm, emotion, timbre, and style. This enables interaction with realistic emotional tones such as whispering, sadness, or excitement. On the StepEval-Audio-Paralinguistic benchmark, Step-Audio 2 achieves 83.1% accuracy, far beyond GPT-4o Audio (43.5%) and Qwen-Omni (44.2%).
3. Retrieval-augmented speech generation
Step-Audio 2 incorporates multimodal retrieval-augmented generation (RAG):
- Web search integration for factual grounding.
- Audio search – a novel capability that retrieves real voices from a large library and fuses them into responses, imitating voice timbre and style on the fly.
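The retrieval loop above can be illustrated with a toy sketch: a text query grounds the answer via a (stand-in) web index, while a voice-style embedding is matched against a voice library by nearest neighbour. All stores, function names, and vectors below are hypothetical placeholders, not StepFun's implementation:

```python
# Toy stand-ins for the two retrieval stores (illustrative only).
WEB_INDEX = {"step-audio": "Step-Audio 2 is a speech-to-speech audio LLM."}
VOICE_LIBRARY = {"calm_female": [0.1, 0.2], "energetic_male": [0.9, 0.8]}

def web_search(query: str) -> str:
    # Keyword lookup standing in for a real web search tool.
    for key, doc in WEB_INDEX.items():
        if key in query.lower():
            return doc
    return ""

def audio_search(style_vector, library=VOICE_LIBRARY):
    # Nearest-neighbour match over voice-style embeddings (toy 2-D vectors).
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(library, key=lambda name: sq_dist(library[name], style_vector))

def grounded_reply(query, style_vector):
    facts = web_search(query)           # factual grounding
    voice = audio_search(style_vector)  # timbre/style to imitate
    return {"facts": facts, "voice": voice}

reply = grounded_reply("Tell me about Step-Audio", [0.85, 0.75])
```

The design point is that both retrievals feed the same generation step: retrieved text conditions *what* is said, while the retrieved voice conditions *how* it is said.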
4. Tool calling and multimodal reasoning
The system extends beyond speech synthesis by supporting tool invocation. Benchmarks show that Step-Audio 2 matches textual LLMs in tool selection and parameter accuracy, while uniquely excelling at audio search tool calls – a capability unavailable in text-only LLMs.
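Tool calling in such systems typically works by having the model emit a structured call that the host application parses, executes, and feeds back. A minimal dispatch sketch, assuming a common JSON tool-call shape (the `get_weather` tool and its schema are hypothetical examples, not part of Step-Audio 2's released API):

```python
import json

def get_weather(city: str) -> str:
    # Stub tool: a real host would call an external service here.
    return f"Sunny in {city}"

# Registry mapping tool names the model may emit to host-side functions.
TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse a model-emitted JSON tool call and execute the matching tool."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
```

In a speech model, the same loop applies: the tool result is appended to the context and the model then verbalizes it as audio.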
Training and data scale
- Text + audio corpus: 1.356T tokens
- Audio hours: 8M+ real and synthetic hours
- Speaker diversity: ~50k voices across languages and dialects
- Pretraining pipeline: a multi-stage curriculum covering ASR, TTS, speech-to-speech translation, and emotion-labeled conversational synthesis.
This large-scale training allows Step-Audio 2 Mini to retain strong text reasoning (through its Qwen2-Audio and CosyVoice foundation) while mastering fine-grained audio modeling.
Benchmark highlights

Automatic speech recognition (ASR)
- English: average WER 3.14% (beats GPT-4o Transcribe at an average of 4.5%).
- Chinese: average CER 3.08% (much lower than GPT-4o and Qwen-Omni).
- Strong performance on dialects and accented speech.
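The WER and CER figures above are both edit-distance metrics: word error rate counts word-level substitutions, insertions, and deletions against the reference (character error rate does the same at character level). A minimal sketch of the standard computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# 1 substituted word out of 4 reference words -> 0.25 (25% WER)
score = wer("the cat sat down", "the cat sat up")
```

So a 3.14% WER means roughly 3 word errors per 100 reference words, averaged over the test sets.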
Audio understanding (MMAU benchmark)
- Step-Audio 2: 78.0 average, surpassing Omni-R1 (77.0) and Audio Flamingo 3 (73.1).
- Strongest on sound and speech reasoning tasks.
Speech translation
- CoVoST 2 (S2TT): BLEU 39.26 (highest among open and closed models).
- CVSS (S2ST): BLEU 30.87, ahead of GPT-4o (23.68).
Conversational benchmarks
- Chinese conversation: overall best with 83.3 (basic) and 68.2 (pro).
- English conversation: competitive with GPT-4o (83.9 vs. 84.5), far ahead of other open models.

Conclusion
Step-Audio 2 Mini makes advanced multimodal speech intelligence accessible to developers and the research community. By combining Qwen2-Audio's reasoning capacity with CosyVoice's tokenization pipeline, and augmenting it with retrieval-based grounding, StepFun has delivered one of the most capable open audio LLMs.
Check out the paper and the model on Hugging Face. Feel free to check our GitHub page for tutorials, code, and notebooks. Also, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our newsletter.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a broad audience. The platform boasts over 2 million monthly views, reflecting its popularity among readers.