A Robust End-to-End Framework for Multi-Speaker Speech Transcription
1 ASLP@NPU, Northwestern Polytechnical University | 2 Soul AI Lab, China | 3 Moonstep AI, China
SoulX-Transcriber is an end-to-end Speech LLM for timestamped speaker-attributed ASR. Rather than relying on a cascaded pipeline, the model directly learns speaker attribution, timestamped segmentation, and transcription in a single framework, producing coherent speaker-consistent transcripts for overlapping and fast-turn conversations.
We propose a speaker characteristics-driven audio matching pipeline that automatically selects the most suitable reference audio for each utterance, producing more natural, context-aligned simulated dialogues.
Speaker-aware multi-task Continued Pre-Training followed by Supervised Fine-Tuning strengthens speaker representations and robustness on conversational audio, mitigating same-gender speaker confusion, overlap errors, and boundary errors.
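The reference-audio selection mentioned above can be pictured as a nearest-neighbour lookup over speaker features. The following is only a minimal sketch under stated assumptions: the page does not describe the actual features or scoring, so the feature vectors, the cosine-similarity criterion, and the names `cosine`, `pick_reference`, `ref_a`, and `ref_b` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_reference(target, candidates):
    """Return the id of the candidate reference closest to the target.

    target: feature vector describing the desired speaker (assumed).
    candidates: dict mapping reference-audio id -> feature vector (assumed).
    """
    return max(candidates, key=lambda cid: cosine(target, candidates[cid]))


# Hypothetical speaker-characteristic vectors, for illustration only.
candidates = {
    "ref_a": [1.0, 0.0, 0.2],
    "ref_b": [0.0, 1.0, 0.8],
}
target = [0.1, 0.9, 0.7]
best = pick_reference(target, candidates)
print(best)
```

In this toy setup the second candidate is selected because its vector points in nearly the same direction as the target. A real pipeline would presumably use richer speaker attributes or learned embeddings.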
SoulX-Transcriber achieves superior performance on the AISHELL-4 and AliMeeting benchmarks via a unified diarization and recognition framework, which directly produces structured outputs consisting of timestamps, speaker labels, and transcripts.
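To make the idea of a structured output concrete, here is a small sketch of how timestamped, speaker-labelled segments might be represented and rendered. The `Segment` class, its field names, and the text layout are assumptions for illustration; the page does not specify the model's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # speaker label, e.g. "S1"
    text: str      # transcript of the segment


def format_transcript(segments):
    """Render segments in time order, one line per segment."""
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    return "\n".join(
        f"[{s.start:.2f}-{s.end:.2f}] {s.speaker}: {s.text}" for s in ordered
    )


def overlapping_pairs(segments):
    """Return index pairs of segments from different speakers that overlap in time."""
    pairs = []
    for i, a in enumerate(segments):
        for j in range(i + 1, len(segments)):
            b = segments[j]
            if a.speaker != b.speaker and a.start < b.end and b.start < a.end:
                pairs.append((i, j))
    return pairs


# Hypothetical two-speaker exchange with a short overlap.
segments = [
    Segment(0.0, 2.5, "S1", "hello there"),
    Segment(2.1, 4.0, "S2", "hi"),
]
print(format_transcript(segments))
print(overlapping_pairs(segments))
```

Keeping the start time, end time, speaker label, and text together in one record is what allows overlapping turns to be detected directly from the output, without a separate diarization pass.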
All metrics are lower-is-better (↓).
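
| Model | AISHELL-4 [2] ZH DER↓ | AISHELL-4 [2] ZH WER↓ | AISHELL-4 [2] ZH cpWER↓ | AISHELL-4 [2] ZH ∆cp↓ | AliMeeting [3] ZH DER↓ | AliMeeting [3] ZH WER↓ | AliMeeting [3] ZH cpWER↓ | AliMeeting [3] ZH ∆cp↓ | AMI-SDM [4] EN DER↓ | AMI-SDM [4] EN WER↓ | AMI-SDM [4] EN cpWER↓ | AMI-SDM [4] EN ∆cp↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|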
| VibeVoice-ASR[1] | 6.77 | 21.4 | 24.99 | 3.59 | 10.92 | 27.4 | 29.33 | 1.93 | 13.43 | 24.65 | 28.82 | 4.17 |
| Gemini-2.5-Pro | 36.07 | 19.81 | 25.11 | 5.30 | 56.39 | 30.16 | 39.29 | 9.13 | 50.28 | 31.66 | 39.98 | 8.32 |
| Gemini-3.1-pro-preview | 24.84 | 24.86 | 24.81 | -0.05 | 30.76 | 18.82 | 18.99 | 0.17 | 40.40 | 30.82 | 32.97 | 2.15 |
| Qwen3.5-omni | 22.33 | 15.13 | 14.71 | -0.42 | 26.46 | 12.44 | 12.79 | 0.35 | 30.05 | 28.57 | 33.46 | 4.89 |
| SoulX-Transcriber (Ours) | 2.89 | 14.16 | 13.90 | -0.26 | 5.39 | 13.07 | 13.61 | 0.54 | 11.67 | 25.28 | 32.81 | 7.53 |

| Model | AliMeeting [3] ZH DER↓ | AliMeeting [3] ZH CER↓ | AliMeeting [3] ZH cpCER↓ | AliMeeting [3] ZH ∆cp↓ | AISHELL-4 [2] ZH DER↓ | AISHELL-4 [2] ZH CER↓ | AISHELL-4 [2] ZH cpCER↓ | AISHELL-4 [2] ZH ∆cp↓ |
|---|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-Instruct | 38.36 | 25.28 | 37.54 | 12.26 | 34.71 | 15.95 | 23.63 | 7.68 |
| VibeVoice-ASR[1] | 18.00 | 29.72 | 31.94 | 2.22 | 9.17 | 19.54 | 22.95 | 3.41 |
| Gemini-2.5-Pro | 58.14 | 31.69 | 42.22 | 10.53 | 40.87 | 20.26 | 26.31 | 6.05 |
| Gemini-3.1-pro-preview | 38.75 | 26.75 | 32.84 | 6.09 | 22.03 | 22.75 | 27.43 | 4.68 |
| SoulX-Transcriber (Ours) | 5.72 | 16.22 | 16.99 | 0.77 | 7.73 | 14.49 | 17.82 | 3.33 |

| Model | Social Conversation DER↓ | Social Conversation WER↓ | Social Conversation cpWER↓ | Social Conversation ∆cp↓ | Drama DER↓ | Drama WER↓ | Drama cpWER↓ | Drama ∆cp↓ | Podcast DER↓ | Podcast WER↓ | Podcast cpWER↓ | Podcast ∆cp↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VibeVoice-ASR[1] | 2.76 | 30.34 | 31.77 | 1.43 | 27.78 | 19.60 | 31.43 | 24.01 | 4.70 | 8.88 | 14.58 | 5.70 |
| Gemini-3.1-pro-preview | 38.69 | 29.14 | 36.72 | 7.58 | 34.87 | 10.01 | 21.03 | 11.02 | 24.56 | 23.89 | 27.21 | 3.32 |
| SoulX-Transcriber (Ours) | 1.32 | 6.73 | 7.31 | 0.58 | 23.56 | 5.17 | 20.58 | 15.41 | 21.15 | 7.50 | 19.37 | 11.87 |
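
For readers unfamiliar with the speaker-attributed metrics in the tables, the sketch below illustrates concatenated minimum-permutation WER (cpWER) on a toy example. It is a simplified illustration, not the evaluation code behind the reported numbers: the function names are hypothetical, text normalization is omitted, and the brute-force search over speaker permutations is only practical for a handful of speakers. Most ∆cp entries in the tables match cpWER minus WER (for instance 13.61 − 13.07 = 0.54), which is the definition assumed here.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref_words, hyp_words):
    """Speaker-agnostic word error rate in percent."""
    return 100.0 * edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


def cp_wer(ref_by_spk, hyp_by_spk):
    """cpWER in percent; each dict maps a speaker to that speaker's concatenated words."""
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref = max(sum(len(r) for r in refs), 1)
    # Try every reference-to-hypothesis speaker assignment and keep the best one.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return 100.0 * best / total_ref


# Toy example: every word is recognized, but "you" is credited to the wrong speaker.
ref = {"A": "hello how are you".split(), "B": "fine thanks".split()}
hyp = {"spk1": "fine thanks you".split(), "spk2": "hello how are".split()}
flat = "hello how are you fine thanks".split()

plain = wer(flat, flat)
cp = cp_wer(ref, hyp)
delta_cp = cp - plain
print(plain, round(cp, 2), round(delta_cp, 2))
```

Here the speaker-agnostic WER is zero because every word was recognized, while cpWER is about 33% because one word ended up in the wrong speaker's stream. A large ∆cp therefore points to speaker-attribution errors rather than recognition errors.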

Demo: real-world audio processed by SoulX-Transcriber. The transcript scrolls in sync with playback, and overlapping speech from multiple speakers is highlighted simultaneously.