SoulX-Transcriber

Soul AI Lab

A Robust End-to-End Framework for Multi-Speaker Speech Transcription

Yuhang Dai¹,²*, Haopeng Lin²*, Zhennan Lin¹, Jiale Qian², Jun Wu², Hanke Xie¹,², Hao Meng², Hanlin Wen², Chuang Ding³, Shunshun Yin², Ming Tao², Lei Xie¹, Xinsheng Wang²†

¹ ASLP@NPU, Northwestern Polytechnical University | ² Soul AI Lab, China | ³ Moonstep AI, China



Overview

SoulX-Transcriber

SoulX-Transcriber is an end-to-end Speech LLM for timestamped, speaker-attributed ASR. Rather than relying on a cascaded pipeline, the model learns speaker attribution, timestamped segmentation, and transcription jointly in a single framework, producing coherent, speaker-consistent transcripts for conversations with overlapping speech and fast turn-taking.

🔄

A more natural and authentic approach to dialogue generation

We propose an audio matching pipeline driven by speaker characteristics that automatically selects the most suitable reference audio for each utterance, producing more natural, context-aligned simulated dialogues.

📈

Speaker-aware multi-stage training

Speaker-aware multi-task Continued Pre-Training plus Supervised Fine-Tuning strengthens speaker representations and robustness on conversational speech, mitigating same-gender speaker confusion, overlap errors, and boundary errors.

🏆

Outstanding performance

SoulX-Transcriber achieves the lowest DER among the compared systems on the AISHELL-4 and AliMeeting benchmarks via a unified diarization and recognition framework, which directly produces structured outputs consisting of timestamps, speaker labels, and transcripts.
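As a rough illustration of what a structured output with timestamps, speaker labels, and transcripts could look like once parsed, here is a minimal sketch. The `Segment` class, its field names, and the printed line layout are assumptions made for this example; the page does not specify the model's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript segment; field names are illustrative assumptions."""
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # speaker label, e.g. "S1"
    text: str      # transcribed words for this segment


def format_transcript(segments):
    """Render segments in start-time order, one line per segment."""
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    return [f"[{s.start:.2f}-{s.end:.2f}] {s.speaker}: {s.text}" for s in ordered]


# Toy data: two speakers whose turns overlap between 2.10 s and 2.50 s.
segments = [
    Segment(2.10, 4.80, "S2", "sure, go ahead"),
    Segment(0.00, 2.50, "S1", "can I start with the agenda"),
]
lines = format_transcript(segments)
for line in lines:
    print(line)
```

Keeping each segment as its own record, rather than merging turns into one text stream, is what lets overlapping speech be represented without losing who said what.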

Results

Comprehensive Multi-Domain Evaluation

All metrics are lower-is-better (↓).
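For readers unfamiliar with cpWER (concatenated minimum-permutation WER), it is commonly computed by concatenating each speaker's utterances, trying every mapping between reference and hypothesis speakers, and keeping the mapping with the fewest word errors. The sketch below follows that common definition. It is not taken from the SoulX-Transcriber evaluation code, and the function names and toy data are illustrative assumptions.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Best-permutation word errors divided by total reference words."""
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    # Brute force over speaker mappings; fine for a handful of speakers.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref_words


ref = {"A": ["hello there", "how are you"], "B": ["fine thanks"]}
hyp_ok = {"spk0": ["fine thanks"], "spk1": ["hello there", "how are you"]}
hyp_bad = {"spk0": ["fine thanks how are you"], "spk1": ["hello there"]}

score_ok = cp_wer(ref, hyp_ok)    # speaker labels differ, content matches
score_bad = cp_wer(ref, hyp_bad)  # one utterance attributed to the wrong speaker
```

In `hyp_bad` every word is recognized correctly, yet the score is nonzero because an utterance is assigned to the wrong speaker, which is the kind of error cpWER is meant to expose. For Chinese data the same procedure would typically run on characters rather than whitespace-separated words, giving cpCER.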

Open-Source Benchmark

Model	AISHELL-4 [2] (ZH)				AliMeeting [3] (ZH)				AMI-SDM [4] (EN)
Model	DER↓	WER↓	cpWER↓	∆cp↓	DER↓	WER↓	cpWER↓	∆cp↓	DER↓	WER↓	cpWER↓	∆cp↓
VibeVoice-ASR [1]	6.77	21.4	24.99	3.59	10.92	27.4	29.33	1.93	13.43	24.65	28.82	4.17
Gemini-2.5-Pro	36.07	19.81	25.11	5.30	56.39	30.16	39.29	9.13	50.28	31.66	39.98	8.32
Gemini-3.1-pro-preview	24.84	24.86	24.81	-0.05	30.76	18.82	18.99	0.17	40.40	30.82	32.97	2.15
Qwen3.5-omni	22.33	15.13	14.71	-0.42	26.46	12.44	12.79	0.35	30.05	28.57	33.46	4.89
SoulX-Transcriber (Ours)	2.89	14.16	13.90	-0.26	5.39	13.07	13.61	0.54	11.67	25.28	32.81	7.53
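The ∆cp column is not defined on this page, but the reported values in the table above are consistent with ∆cp = cpWER − WER, i.e. the extra error that appears once words must also be attributed to the right speaker. The check below is a small sketch under that inferred definition, using the SoulX-Transcriber row; the helper name `delta_cp` is hypothetical.

```python
# (WER, cpWER, reported delta) for SoulX-Transcriber in the open-source table.
rows = {
    "AISHELL-4": (14.16, 13.90, -0.26),
    "AliMeeting": (13.07, 13.61, 0.54),
    "AMI-SDM": (25.28, 32.81, 7.53),
}


def delta_cp(wer, cp_wer):
    """Inferred definition: speaker-attributed error minus plain error."""
    return round(cp_wer - wer, 2)


deltas = {name: delta_cp(wer, cp) for name, (wer, cp, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(name, deltas[name], reported)
```

Under this reading, a small ∆cp means most of the remaining error is recognition error rather than speaker attribution error.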

Longform Benchmark (5-min segments)

Model	AliMeeting [3] (ZH)				AISHELL-4 [2] (ZH)
Model	DER↓	CER↓	cpCER↓	∆cp↓	DER↓	CER↓	cpCER↓	∆cp↓
Qwen3-Omni-30B-Instruct	38.36	25.28	37.54	12.26	34.71	15.95	23.63	7.68
VibeVoice-ASR [1]	18.00	29.72	31.94	2.22	9.17	19.54	22.95	3.41
Gemini-2.5-Pro	58.14	31.69	42.22	10.53	40.87	20.26	26.31	6.05
Gemini-3.1-pro-preview	38.75	26.75	32.84	6.09	22.03	22.75	27.43	4.68
SoulX-Transcriber (Ours)	5.72	16.22	16.99	0.77	7.73	14.49	17.82	3.33

Internal Multi-Domain Benchmark

Model	Social Conversation				Drama				Podcast
Model	DER↓	WER↓	cpWER↓	∆cp↓	DER↓	WER↓	cpWER↓	∆cp↓	DER↓	WER↓	cpWER↓	∆cp↓
VibeVoice-ASR [1]	2.76	30.34	31.77	1.43	27.78	19.60	31.43	24.01	4.70	8.88	14.58	5.70
Gemini-3.1-pro-preview	38.69	29.14	36.72	7.58	34.87	10.01	21.03	11.02	24.56	23.89	27.21	3.32
SoulX-Transcriber (Ours)	1.32	6.73	7.31	0.58	23.56	5.17	20.58	15.41	21.15	7.50	19.37	11.87

Demo

Live Transcription Examples

Real-world audio processed by SoulX-Transcriber. Press play and watch the transcript scroll in sync. 🗣️ Overlapping speech from multiple speakers is highlighted simultaneously.
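One simple way a player could decide which transcript lines to highlight at a given playback time is to select every segment whose time span contains that moment, so overlapping speakers light up together. This is a sketch of that idea only; the segment dictionaries and the `active_segments` helper are assumptions, not the demo's actual code.

```python
def active_segments(segments, t):
    """Return segments whose [start, end) interval contains playback time t."""
    return [s for s in segments if s["start"] <= t < s["end"]]


# Toy transcript: S1 and S2 overlap between 2.1 s and 2.5 s.
segments = [
    {"start": 0.0, "end": 2.5, "speaker": "S1", "text": "can I start with the agenda"},
    {"start": 2.1, "end": 4.8, "speaker": "S2", "text": "sure, go ahead"},
]

solo = [s["speaker"] for s in active_segments(segments, 1.0)]
overlap = [s["speaker"] for s in active_segments(segments, 2.3)]
silence = [s["speaker"] for s in active_segments(segments, 5.0)]
```

Here `overlap` contains both speakers, while `solo` contains only the first and `silence` is empty.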

Cite

References

[1] Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, Yujie Tu, Chenyu Yang, Wenhui Wang, Songchen Xu, et al., “VIBEVOICE-ASR Technical Report,” CoRR, 2026.

[2] Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, Jingdong Chen, “AISHELL-4,” Proc. Interspeech, 2021.

[3] Fan Yu, Shiliang Zhang, Yihui Fu, Lei Xie, Siqi Zheng, Zhihao Du, Weilong Huang, Pengcheng Guo, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu, “M2MeT 2.0,” Proc. ASRU, 2023.

[4] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnes Lisowska, Iain McCowan, Wilfried Post, Dennis Reidsma, Pierre Wellner, “The AMI Meeting Corpus,” MLMI, Springer, 2005.