📢: The latest AI podcast model is here! Soul AI Lab open-sources SoulX-Podcast, a state-of-the-art multi-speaker text-to-speech model.
SoulX-Podcast
SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity
Hanke Xie1,2,*, Haopeng Lin2,*, Wenxiao Cao2, Dake Guo1, Wenjie Tian1, Jun Wu2, Hanlin Wen2, Ruixuan Shang2, Hongmei Liu2, Zhiqi Jiang2, Yuepeng Jiang1, Wenxi Chen2,3, Ruiqi Yan2,3, Jiale Qian2, Yichao Yan2, Shunshun Yin2, Ming Tao2, Xie Chen3, Lei Xie1,‡, Xinsheng Wang2,‡
1 Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi’an, China
2 Soul AI Lab, China
3 X-LANCE Lab, Shanghai Jiao Tong University, China
📑 Paper | 🐙 GitHub | 🤗 HuggingFace
🎤 Demo Page | 💬 Contact Us
Abstract
Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style, multi-turn, multi-speaker dialogic speech generation that also achieves state-of-the-art performance in conventional TTS tasks. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting the natural changes in rhythm and intonation that occur as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
Demo Video
Model Overview
SoulX-Data-Pipeline
In contrast to monologue speech, processing dialogue speech requires not only obtaining aligned transcripts but also explicitly distinguishing between speakers. As shown in the figure below, the overall workflow comprises speech enhancement, audio segmentation and speaker diarization, text transcription, and quality filtering. Additionally, to enable paralinguistic and dialectal controllability, further information is extracted and annotated.
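The released pipeline code is not shown on this page; the sketch below is only a rough outline, under our own assumptions, of how the stages above (enhancement, diarization and segmentation, transcription, filtering, and annotation) might be chained. Every helper function is a hypothetical placeholder, not the actual implementation.

```python
# Illustrative sketch of the dialogue-data pipeline stages described above.
# Every helper here is a hypothetical placeholder, not the released code.

def enhance_speech(audio):
    """Placeholder: denoise / dereverberate the raw recording."""
    return audio

def diarize_and_segment(audio):
    """Placeholder: split a long recording into utterances, each with a speaker label."""
    return [(audio, "S1")]

def transcribe(utterance):
    """Placeholder: produce an aligned transcript for one utterance."""
    return ""

def passes_quality_filter(utterance, text):
    """Placeholder: keep only clean, well-aligned utterances."""
    return bool(text)

def annotate(utterance, text):
    """Placeholder: attach dialect and paralinguistic labels for controllability."""
    return {"text": text, "dialect": None, "paralinguistic_events": []}

def build_dialogue_corpus(raw_recordings):
    """Chain the stages: enhance -> diarize/segment -> transcribe -> filter -> annotate."""
    corpus = []
    for recording in raw_recordings:
        enhanced = enhance_speech(recording)
        for utterance, speaker in diarize_and_segment(enhanced):
            text = transcribe(utterance)
            if passes_quality_filter(utterance, text):
                sample = annotate(utterance, text)
                sample["speaker"] = speaker
                corpus.append(sample)
    return corpus
```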
Multi-Speaker Podcast Generation
These examples demonstrate the naturalness and coherence of the model in multi-turn, multi-speaker podcast dialogue generation.
| Dialogue Script (Speaker Turns) | Reference Podcast Audio | SoulX-Podcast Generated Audio |
|---|---|---|
| [S1] 嗯嗯,我想要再 call back 一下,你之前刚刚讲的是,在这个临床上面其实并不能够让它提效。这个是基于 Research 呢?还是基于政策在这个 Policy 这个 level 方面呢,不能提效。我不知道这个能不能帮我们 break down 一下?嗯。 <br> [S2] 就一方面的话,其实还是技术本身。当然这个次技术,它是一个我觉得是有史以来最伟大的人工智能,在生物学的一个,就是进展和发现和一个提升。但是它还是会出现错误,还是会出现原子重叠的现象。所以这些错误在 biology 这个领域或者药物研发的领域,它是不能容错的,这是一个问题。 | Speaker 1: (audio) <br> Speaker 2: (audio) | (audio) |
| [S1] 对对对,物物源其实还是挺不错的。就比如说你早,你晚上去可能就没货了,但你早晨去一般都是有货的。像我今天早晨,嗯,他们一看了我就冲进去了。然后基本上平时想买买不到东西全部都有货。 <br> [S2] 对,说到那个厕纸啊,你说就美国人为什么开始囤厕纸啊?我一开始以为这是谣言。然后呢,其实美国人一直没有反应,直到这上上周,就是呃那个股市熔断啊,然后再加上就是呃,就是宣布,就是各州紧急状况这样子发生。然后呢进超市要排队,结果呢我有,我在去超市的路上看到一对,就是中年夫妇吧,然后两个人两只手都捆满了厕纸。然后我才知道原来这件事情是真的。 | Speaker 1: (audio) <br> Speaker 2: (audio) | (audio) |
| [S1] I'm sorry for snapping. That was very unprofessional of me. Let's move to the recap of the match. <br> [S2] Right it was it was kind of a great one though I mean you gotta admit Alice even though your surrogate did not claim the victory it was really a legendary match almost <br> [S1] It really was It was one for the books absolutely Survay and Cerceiv were supposed to meet uh in only four bouts, but there was a bit of a disturbance early on in the match. <br> [S2] Yeah, somebody got onto the field. They were shouting something. I couldn't hear it, Alice. And honestly, I was just glad when the guards finally dragged them away. I just wanted to get back to the joust, but it did disturb the entire thing. And it riled the crowd up too, which was surprising. <br> [S1] It's true, it later reports came out that the person who stormed the field to cause such a ruckus was a protester against Against your uncle, the emperor. <br> [S2] Why would anybody want to protest against my uncle, the emperor? The single greatest living human person, maybe even greater than human person on the face of this planet or any other, you know. <br> [S2] Of course, the emperor is benevolent and generous and A very very kind ruler. | Speaker 1: (audio) <br> Speaker 2: (audio) | (audio) |
| [S1] OK, so the question I've been asking everybody to start with is how did you get into magic? <br> [S2] So, this is phenomenal, actually. I have a fun little tidbit, just recently. Isn't a few days ago, I got a little pop up on my Facebook, um, in my Facebook memories, if that makes sense. Where it actually reminded me of the exact day, literally the exact day that I learned how to play Magic. And it was 9 years ago, and I learned how to play Magic. I'm from Southern California. I had just moved to San Diego, and I didn't know anyone. I knew like four people in the entire city. | Speaker 1: (audio) <br> Speaker 2: (audio) | (audio) |
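All of the scripts above follow the same convention: plain text with [S1]/[S2] turn tags, plus a short reference clip per speaker that supplies the voice prompt. The snippet below only illustrates that input format; `synthesize_podcast` and the prompt paths are hypothetical stand-ins, and the actual inference interface lives in the GitHub repository.

```python
# The turn-tagged script format used by the podcast demos above.
# `synthesize_podcast` is a hypothetical placeholder, not the released API.

dialogue_script = (
    "[S1] OK, so the question I've been asking everybody to start with is "
    "how did you get into magic?\n"
    "[S2] So, this is phenomenal, actually. I have a fun little tidbit, "
    "just recently."
)

# One short reference clip per speaker supplies the voice prompt (paths are illustrative).
speaker_prompts = {
    "S1": "prompts/speaker1_reference.wav",
    "S2": "prompts/speaker2_reference.wav",
}

def synthesize_podcast(script: str, prompts: dict[str, str]) -> bytes:
    """Hypothetical wrapper: a turn-tagged script plus per-speaker reference
    audio in, one long-form dialogue waveform out."""
    raise NotImplementedError("See the official SoulX-Podcast repository for inference code.")

# audio = synthesize_podcast(dialogue_script, speaker_prompts)
```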
Cross-Dialect Controls
These examples demonstrate the model's ability to generate speech in different Chinese dialects (Cantonese, Sichuanese, and Henanese), including dialect switching while preserving the source speaker's timbre.
| Dialogue Script | Target Dialect | Reference Audio (Source Speaker) | SoulX-Podcast Generated Audio |
|---|---|---|---|
| [S1] 哈囉大家好啊,歡迎收聽我哋嘅節目。喂,我今日想問你樣嘢啊,你覺唔覺得,嗯,而家揸電動車,最煩,最煩嘅一樣嘢係咩啊? <br> [S2] 梗係充電啦。大佬啊,搵個位都已經好煩,搵到個位仲要喺度等,你話快極都要半個鐘一個鐘,真係,有時諗起都覺得好冇癮。 <br> [S1] 係咪先。如果我而家同你講,充電可以快到同入油差唔多時間,你信唔信先?喂你平時喺油站入滿一缸油,要幾耐啊?五六分鐘? <br> [S2] 差唔多啦,七八分鐘,點都走得啦。電車喎,可以做到咁快?你咪玩啦。 | Cantonese (粤语) | Speaker 1: (audio) <br> Speaker 2: (audio) | (audio) |
| [S1]各位《巴适得板》的听众些,大家好噻!我是你们主持人晶晶。今儿天气硬是巴适,不晓得大家是在赶路嘛,还是茶都泡起咯,准备跟我们好生摆一哈龙门阵喃? <br> [S2]晶晶好哦,大家安逸噻!我是李老倌。你刚开口就川味十足,"摆龙门阵"几个字一甩出来,我鼻子头都闻到茶香跟火锅香咯! <br> [S1]就是得嘛!李老倌,我前些天带个外地朋友切人民公园鹤鸣茶社坐了一哈。他硬是搞不醒豁,为啥子我们一堆人围到杯茶就可以吹一下午壳子,从隔壁子王嬢嬢娃儿耍朋友,扯到美国大选,中间还掺几盘斗地主。他说我们四川人简直是把"摸鱼"刻进骨子里头咯! <br> [S2]哈哈,你那个朋友说得倒是有点儿趣,但他莫看到精髓噻。"摆龙门阵"哪是摸鱼嘛,这是我们川渝人特有的交际方式,更是一种活法。外省人天天说的"松弛感",根根儿就在这龙门阵里头。今天我们就要好生摆一哈,为啥子四川人活得这么舒坦。就先从茶馆这个老窝子说起,看它咋个成了我们四川人的魂儿! | Sichuanese (四川话) | Speaker 1: (audio) <br> Speaker 2: (audio) | (audio) |
| [S1]哎,大家好啊,欢迎收听咱这一期嘞《瞎聊呗,就这么说》,我是恁嘞老朋友,燕子。 <br> [S2]大家好,我是老张。燕子啊,今儿瞅瞅你这个劲儿,咋着,是有啥可得劲嘞事儿想跟咱唠唠? <br> [S1]哎哟,老张,你咋恁懂我嘞!我跟你说啊,最近我刷手机,老是刷住些可逗嘞方言视频,特别是咱河南话,咦~我哩个乖乖,一听我都憋不住笑,咋说嘞,得劲儿哩很,跟回到家一样。 <br> [S2]哈哈哈哈,你这回可算说到根儿上了!河南话,咱往大处说说,中原官话,它真嘞是有一股劲儿搁里头。它可不光是说话,它脊梁骨后头藏嘞,是咱一整套、鲜鲜活活嘞过法儿,一种活人嘞道理。 <br> [S1]活人嘞道理?哎,这你这一说,我嘞兴致"腾"一下就上来啦!觉住咱这嗑儿,一下儿从搞笑视频蹿到文化顶上了啊。那你赶紧给我白话白话,这里头到底有啥道道儿?我特别想知道——为啥一提起咱河南人,好些人脑子里"蹦"出来嘞头一个词儿,就是实在?这个实在,骨子里到底是啥嘞? | Henanese (河南话) | Speaker 1: (audio) <br> Speaker 2: (audio) | (audio) |
Paralinguistic Controls
These examples demonstrate the model's ability to control paralinguistic events (such as laughter, sighs, and throat clearing) in speech.
| Text Input (with Non-Verbal Tags) | Target Style | Reference Audio | SoulX-Podcast Generated Audio |
|---|---|---|---|
| [S1]哈喽,AI时代的冲浪先锋们!欢迎收听《AI生活进行时》。啊,一个充满了未来感,然后,还有一点点,<\|laughter\|>神经质的播客节目,我是主持人小希。 <br> [S2]哎,大家好呀!我是能唠,爱唠,天天都想唠的唠嗑! <br> [S1]最近活得特别赛博朋克哈!以前觉得AI是科幻片里的,<\|sigh\|> 现在,现在连我妈都用AI写广场舞文案了。 <br> [S2]<\|laughter\|>这个例子很生动啊。是的,特别是生成式AI哈,感觉都要炸了! 诶,那我们今天就聊聊AI<\|breathing\|>AI是怎么走进我们的生活的哈! <br> [S1]没错。 <br> [S2]<\|coughing\|>,比如ChatGPT的写作能力啊,我有个程序员朋友,现在用ChatGPT三分钟<\|breathing\|>三分钟就能写出感情充沛的周报,<\|laughter\|>把老板都看傻了都。 | Non-Verbal Sounds (Laughter, Sigh, Clearing Throat, etc.) | Speaker 1: (audio) <br> Speaker 2: (audio) | (audio) |
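In the example above, paralinguistic events are written inline as tags such as <|laughter|>, <|sigh|>, <|breathing|>, and <|coughing|>. The snippet below is a small, assumption-based illustration of assembling and sanity-checking such a tagged script; the script content is made up, and this is not the official API.

```python
# Inline paralinguistic tags, copied from the demo script above.
PARALINGUISTIC_TAGS = ["<|laughter|>", "<|sigh|>", "<|breathing|>", "<|coughing|>"]

# Illustrative tagged script (content is invented; only the tag syntax is from the demo).
tagged_script = (
    "[S1] Welcome back to the show! <|laughter|> I still can't believe that happened.\n"
    "[S2] <|coughing|> Right, so, <|breathing|> let's recap what we saw, <|sigh|> one more time."
)

def contains_paralinguistic_tags(text: str) -> bool:
    """Check whether a script already carries inline non-verbal event tags."""
    return any(tag in text for tag in PARALINGUISTIC_TAGS)

assert contains_paralinguistic_tags(tagged_script)
```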
Long-form Podcast: Coherence and Stability
This example demonstrates the model's ability to generate a long-form podcast conversation (approximately 60 minutes) with stable voice consistency and coherent emotional continuity.
Long-form Podcast Conversation (~60 Minutes)
Topic: Psychological Decoding. (View the full script.)
Reference Audio:
Speaker 1 Prompt: (audio)
Speaker 2 Prompt: (audio)
SoulX-Podcast Generated Audio (Full):
Note: The transcript of the long demo audio is generated by our pipeline's recognition system, so there may be discrepancies with the actual audio.