SoulX-FlashTalk
Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation

*Equal Contribution, Corresponding Author
AIGC Team, Soul AI Lab
China



Abstract

We present SoulX-FlashTalk, a 14B-parameter system optimized for high-fidelity streaming. By employing Bidirectional Streaming Distillation, we retain intra-chunk bidirectional attention to preserve spatiotemporal correlations. This design significantly simplifies training: the model converges with only 1,000 steps of supervised fine-tuning (SFT) and 200 steps of distillation, a 23x efficiency improvement. To keep infinite-length generation stable, we incorporate a Multi-step Retrospective Self-Correction Mechanism. Combined with our full-stack acceleration suite, SoulX-FlashTalk becomes the first 14B-scale system to achieve a start-up latency of 0.87s and a real-time throughput of 32 FPS.

Real-time Interaction Demo

SoulX-FlashTalk supports real-time inference with minimal latency.

Long-term Generation Stability

Cartoon Generation

Multilingual Generation

Comparison with Other Methods

Ours

LiveAvatar

InfiniteTalk

Ditto

Method

To meet the strict latency constraints of real-time inference, we employ a two-stage training strategy. The first stage, Latency-Aware Spatiotemporal Adaptation, adapts the model to reduced spatial resolutions and shorter frame sequences. The second stage, Self-Correcting Bidirectional Distillation, further reduces the number of sampling steps and removes classifier-free guidance. Together, the two stages enable rapid model responses while preserving high generation quality.
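To illustrate why cutting sampling steps and removing classifier-free guidance (CFG) both matter for latency, the sketch below counts network forward passes per generated chunk. The guidance formula is the standard CFG combination; the step counts (40 and 4) are hypothetical values chosen for illustration and are not taken from SoulX-FlashTalk.

```python
def guided_prediction(cond, uncond, scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one, element by element."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def forward_passes(num_steps, use_cfg):
    """Network evaluations per generated chunk. With CFG, every sampling step
    needs both a conditional and an unconditional pass."""
    return num_steps * (2 if use_cfg else 1)


# Hypothetical step counts, for illustration only.
teacher_passes = forward_passes(num_steps=40, use_cfg=True)
student_passes = forward_passes(num_steps=4, use_cfg=False)
speedup = teacher_passes / student_passes

print(teacher_passes, student_passes, speedup)
```

Under these assumed numbers, the distilled model needs a small fraction of the teacher's forward passes, because the savings from fewer steps and from dropping the second CFG pass multiply.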

Training Architecture

SoulX-FlashTalk is the first 14B-parameter framework that maintains a real-time throughput of 32 FPS while achieving a start-up latency of 0.87 seconds.
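As a rough sanity check on what these two figures mean for a viewer, the sketch below models a stream that starts playing once the first frames arrive and then generates faster than it plays. The 25 FPS playback rate and the constant-rate model are assumptions made for illustration; only the 0.87 s latency and 32 FPS throughput come from the text above.

```python
def real_time_factor(generation_fps, playback_fps):
    """Values above 1.0 mean frames are produced faster than they are shown."""
    return generation_fps / playback_fps


def buffered_seconds(elapsed_s, startup_latency_s, generation_fps, playback_fps):
    """Seconds of not-yet-played video in the buffer after `elapsed_s`.

    Simplified model: nothing is shown before the start-up latency has passed,
    then generation and playback both run at constant rates.
    """
    streaming_s = max(0.0, elapsed_s - startup_latency_s)
    generated_frames = streaming_s * generation_fps
    played_frames = streaming_s * playback_fps
    return (generated_frames - played_frames) / playback_fps


STARTUP_LATENCY_S = 0.87   # from the text
GENERATION_FPS = 32.0      # from the text
PLAYBACK_FPS = 25.0        # assumed playback rate, not stated in the source

factor = real_time_factor(GENERATION_FPS, PLAYBACK_FPS)
headroom = buffered_seconds(10.87, STARTUP_LATENCY_S, GENERATION_FPS, PLAYBACK_FPS)
print(f"real-time factor: {factor:.2f}, buffer after 10 s of streaming: {headroom:.2f} s")
```

In this model the buffer grows as long as the real-time factor stays above 1.0, which is the condition for uninterrupted streaming.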

Streaming Strategy
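The abstract states that attention remains bidirectional inside each chunk. One way to picture this is a chunk-wise attention mask, sketched below in plain Python. The function name, the chunk size, and the choice to let frames also attend to all earlier chunks are assumptions made for illustration; the source does not specify how context from previous chunks is handled.

```python
def chunk_attention_mask(num_frames, chunk_size):
    """Build a boolean mask where mask[q][k] is True if frame q may attend to frame k.

    Frames attend to every frame in their own chunk (bidirectional, including
    later frames of the same chunk) and, as an assumption of this sketch, to
    all frames of earlier chunks. Frames of later chunks are hidden.
    """
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        row = [(k // chunk_size) <= q_chunk for k in range(num_frames)]
        mask.append(row)
    return mask


# Hypothetical sizes: 4 frames split into chunks of 2.
mask = chunk_attention_mask(num_frames=4, chunk_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Compared with a purely causal mask, frame 0 here can see frame 1 because both sit in the same chunk, while neither can see frames 2 and 3 from the next chunk.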

Contributions

All contributors are listed in no particular order.

Project Sponsor: Ming Tao, Shunshun Yin

Project Leader: Siyuan Liu

Algorithm: Le Shen, Qian Qiao, Tan Yu

Deployment & Acceleration: Ke Zhou, Tianhang Yu, Yu Zhan

Data & Evaluation: Tan Yu, Tianhang Yu, Dingcheng Zhen

BibTeX

@article{soulx2025flashtalk,
      title={SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation}, 
      author={Le Shen and Qian Qiao and Tan Yu and Ke Zhou and Tianhang Yu and Yu Zhan and Zhenjie Wang and Ming Tao and Shunshun Yin and Siyuan Liu},
      year={2025},
      eprint={2512.23379},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.23379}, 
}