SoulX-Transcriber

Soul AILab

A Robust End-to-End Framework for Multi-Speaker Speech Transcription

Yuhang Dai1,2*, Haopeng Lin2*, Zhennan Lin1, Jiale Qian2, Jun Wu2, Hanke Xie1,2, Hao Meng2, Hanlin Wen2, Chuang Ding3, Shunshun Yin2, Ming Tao2, Lei Xie1, Xinsheng Wang2†

1 ASLP@NPU, Northwestern Polytechnical University  |  2 Soul AI Lab, China  |  3 Moonstep AI, China


SoulX-Transcriber

SoulX-Transcriber is an end-to-end Speech LLM for timestamped speaker-attributed ASR. Rather than relying on a cascaded pipeline, the model directly learns speaker attribution, timestamped segmentation, and transcription in a single framework, producing coherent speaker-consistent transcripts for overlapping and fast-turn conversations.

🔄

A more natural and authentic approach to dialogue generation

We propose a speaker characteristics-driven audio matching pipeline that automatically selects the most suitable reference audio for each utterance, producing more natural, context-aligned simulated dialogues.
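To make the matching idea concrete, here is a minimal sketch under strong assumptions: each utterance carries a target speaker profile, each candidate reference clip carries the same kind of attributes, and the clip sharing the most attributes is selected. The attribute names (`gender`, `age`, `emotion`), the scoring rule, and both function names are hypothetical; the page does not describe how the actual pipeline scores candidates.

```python
def match_score(target, candidate):
    """Fraction of the target attributes that the candidate clip shares."""
    if not target:
        return 0.0
    hits = sum(1 for key, value in target.items() if candidate.get(key) == value)
    return hits / len(target)


def select_reference(target, candidates):
    """Pick the best-matching clip id.

    candidates: list of (clip_id, attributes) pairs.
    On ties, max() keeps the first candidate in list order.
    """
    if not candidates:
        raise ValueError("no candidate reference clips")
    best_id, _ = max(candidates, key=lambda item: match_score(target, item[1]))
    return best_id


# Hypothetical profiles, for illustration only.
target_profile = {"gender": "female", "age": "young", "emotion": "happy"}
clips = [
    ("clip_a", {"gender": "male", "age": "young", "emotion": "happy"}),
    ("clip_b", {"gender": "female", "age": "young", "emotion": "happy"}),
]
chosen = select_reference(target_profile, clips)
```

A real system would more plausibly compare learned speaker embeddings than hand-written attributes; the attribute overlap is used here only to keep the example dependency-free.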

📈

Speaker-aware multi-stage training

Speaker-aware multi-task Continual Pre-Training combined with Supervised Fine-Tuning strengthens speaker representations and robustness on conversational audio, mitigating same-gender speaker confusion, overlapped-speech errors, and boundary errors.

🏆

Outstanding performance

SoulX-Transcriber achieves the lowest DER among the compared systems on the AISHELL-4 and AliMeeting benchmarks (2.89 and 5.39 on the open-source benchmark) via a unified diarization and recognition framework, which directly produces structured outputs consisting of timestamps, speaker labels, and transcripts.
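As an illustration of what such a structured output could look like once parsed, the sketch below models each segment as a timestamp pair, a speaker label, and a transcript. The `Segment` class, its field names, and the printed line format are assumptions made for this example; the page does not specify the model's actual output syntax.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # speaker label, e.g. "S1"
    text: str      # transcript of the segment


def format_transcript(segments):
    """Render segments in time order, one line per segment."""
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    return "\n".join(
        f"[{s.start:.2f}-{s.end:.2f}] {s.speaker}: {s.text}" for s in ordered
    )


def overlapping_pairs(segments):
    """Return pairs of segments from different speakers that overlap in time."""
    ordered = sorted(segments, key=lambda s: s.start)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # sorted by start, so no later segment can overlap
            if second.speaker != first.speaker:
                pairs.append((first, second))
    return pairs


# Hypothetical three-segment conversation with one overlap.
example = [
    Segment(3.0, 4.0, "S1", "how are you"),
    Segment(0.0, 2.5, "S1", "hello there"),
    Segment(2.0, 3.0, "S2", "hi"),
]
print(format_transcript(example))
```

Keeping timestamps and speaker labels on every segment is what lets a downstream consumer detect overlapped speech, as `overlapping_pairs` does here.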

Comprehensive Multi-Domain Evaluation

All metrics are lower-is-better (↓).
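For readers unfamiliar with cpWER, the following is a minimal sketch of the usual concatenated minimum-permutation definition: each speaker's words are concatenated, and the error is taken under the speaker mapping that minimizes the total edit distance. The function names are illustrative, and this is not the scoring script used for the results on this page; for CER and cpCER the tokens would be characters instead of words.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation WER.

    Both arguments map a speaker label to that speaker's tokens,
    concatenated in time order. Labels need not agree between the two.
    """
    refs = list(ref_by_speaker.values())
    hyps = list(hyp_by_speaker.values())
    # Pad with empty streams so both sides have the same speaker count.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference is empty")
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref


reference = {"A": "hello how are you".split(), "B": "fine thanks".split()}
hypothesis = {"spk1": "fine thanks".split(), "spk2": "hello how you".split()}
score = cp_wer(reference, hypothesis)  # one deletion over six reference words
```

The brute-force permutation search is only practical for a handful of speakers; an assignment solver is the usual choice beyond that. In the benchmark rows, ∆cp appears to equal cpWER minus WER in nearly every case, so it can be read as the extra error attributable to speaker assignment.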

Open-Source Benchmark
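Datasets: AISHELL-4 [2] (ZH), AliMeeting [3] (ZH), AMI-SDM [4] (EN).

| Model | AISHELL-4 DER↓ | AISHELL-4 WER↓ | AISHELL-4 cpWER↓ | AISHELL-4 ∆cp↓ | AliMeeting DER↓ | AliMeeting WER↓ | AliMeeting cpWER↓ | AliMeeting ∆cp↓ | AMI-SDM DER↓ | AMI-SDM WER↓ | AMI-SDM cpWER↓ | AMI-SDM ∆cp↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VibeVoice-ASR [1] | 6.77 | 21.4 | 24.99 | 3.59 | 10.92 | 27.4 | 29.33 | 1.93 | 13.43 | 24.65 | 28.82 | 4.17 |
| Gemini-2.5-Pro | 36.07 | 19.81 | 25.11 | 5.30 | 56.39 | 30.16 | 39.29 | 9.13 | 50.28 | 31.66 | 39.98 | 8.32 |
| Gemini-3.1-pro-preview | 24.84 | 24.86 | 24.81 | -0.05 | 30.76 | 18.82 | 18.99 | 0.17 | 40.40 | 30.82 | 32.97 | 2.15 |
| Qwen3.5-omni | 22.33 | 15.13 | 14.71 | -0.42 | 26.46 | 12.44 | 12.79 | 0.35 | 30.05 | 28.57 | 33.46 | 4.89 |
| SoulX-Transcriber (Ours) | 2.89 | 14.16 | 13.90 | -0.26 | 5.39 | 13.07 | 13.61 | 0.54 | 11.67 | 25.28 | 32.81 | 7.53 |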
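Longform Benchmark (5-min segments)

Datasets: AliMeeting [3] (ZH), AISHELL-4 [2] (ZH).

| Model | AliMeeting DER↓ | AliMeeting CER↓ | AliMeeting cpCER↓ | AliMeeting ∆cp↓ | AISHELL-4 DER↓ | AISHELL-4 CER↓ | AISHELL-4 cpCER↓ | AISHELL-4 ∆cp↓ |
|---|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-Instruct | 38.36 | 25.28 | 37.54 | 12.26 | 34.71 | 15.95 | 23.63 | 7.68 |
| VibeVoice-ASR [1] | 18.00 | 29.72 | 31.94 | 2.22 | 9.17 | 19.54 | 22.95 | 3.41 |
| Gemini-2.5-Pro | 58.14 | 31.69 | 42.22 | 10.53 | 40.87 | 20.26 | 26.31 | 6.05 |
| Gemini-3.1-pro-preview | 38.75 | 26.75 | 32.84 | 6.09 | 22.03 | 22.75 | 27.43 | 4.68 |
| SoulX-Transcriber (Ours) | 5.72 | 16.22 | 16.99 | 0.77 | 7.73 | 14.49 | 17.82 | 3.33 |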
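The long-form setting evaluates 5-minute segments. As a rough sketch of how a recording might be cut into such windows, the helper below returns consecutive, non-overlapping time ranges. Whether the benchmark uses non-overlapping windows, and how the final partial window is handled, are assumptions of this example; the function name is hypothetical.

```python
def five_minute_windows(duration_s, window_s=300.0):
    """Split [0, duration_s) into consecutive windows of at most window_s seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        # The last window is shortened to end at the recording's end.
        windows.append((start, min(start + window_s, duration_s)))
        start += window_s
    return windows


# A 12-minute recording yields two full windows and one 2-minute remainder.
segments = five_minute_windows(720.0)
```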
Internal Multi-Domain Benchmark
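| Model | Social Conversation DER↓ | Social Conversation WER↓ | Social Conversation cpWER↓ | Social Conversation ∆cp↓ | Drama DER↓ | Drama WER↓ | Drama cpWER↓ | Drama ∆cp↓ | Podcast DER↓ | Podcast WER↓ | Podcast cpWER↓ | Podcast ∆cp↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VibeVoice-ASR [1] | 2.76 | 30.34 | 31.77 | 1.43 | 27.78 | 19.60 | 31.43 | 24.01 | 4.70 | 8.88 | 14.58 | 5.70 |
| Gemini-3.1-pro-preview | 38.69 | 29.14 | 36.72 | 7.58 | 34.87 | 10.01 | 21.03 | 11.02 | 24.56 | 23.89 | 27.21 | 3.32 |
| SoulX-Transcriber (Ours) | 1.32 | 6.73 | 7.31 | 0.58 | 23.56 | 5.17 | 20.58 | 15.41 | 21.15 | 7.50 | 19.37 | 11.87 |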

Live Transcription Examples

Real-world audio processed by SoulX-Transcriber, with the transcript scrolling in sync with playback. 🗣️ Overlapping speech from multiple speakers is highlighted simultaneously.

References

[1] Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, Yujie Tu, Chenyu Yang, Wenhui Wang, Songchen Xu, et al., “VIBEVOICE-ASR Technical Report,” CoRR, 2026.
[2] Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, Jingdong Chen, “AISHELL-4,” Proc. Interspeech, 2021.
[3] Fan Yu, Shiliang Zhang, Yihui Fu, Lei Xie, Siqi Zheng, Zhihao Du, Weilong Huang, Pengcheng Guo, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu, “M2MeT 2.0,” Proc. ASRU, 2023.
[4] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnes Lisowska, Iain McCowan, Wilfried Post, Dennis Reidsma, Pierre Wellner, “The AMI Meeting Corpus,” MLMI, Springer, 2005.