SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads

Tan Yu^*, Qian Qiao^*, Le Shen^*, Ke Zhou,
Jincheng Hu, Dian Sheng, Bo Hu, Haoming Qin, Jun Gao, Changhai Zhou, Shunshun Yin, Siyuan Liu^#

^*Equal Contribution, Corresponding Author, ^#Project Leader
AIGC Team, Soul AI Lab


SoulX-FlashHead is a unified 1.3B-parameter framework for high-fidelity, infinite-length, real-time streaming portrait video generation. By integrating Streaming-Aware Spatiotemporal Pre-training and Oracle-Guided Bidirectional Distillation, it ensures robust feature extraction from short audio fragments while eliminating identity drift in long sequences. Trained on our 782-hour VividHead dataset, the model achieves state-of-the-art performance on the HDTF and VFHQ benchmarks. Notably, our Lite variant runs inference at 96 FPS on a single RTX 4090, enabling seamless, low-latency digital human interactions.

Overall Framework of SoulX-FlashHead.

(a) Stage 1: Streaming-Aware Spatiotemporal Pre-training. We employ a Temporal Audio Context Cache to stabilize feature extraction from short streaming audio and utilize channel-wise concatenation for robust reference image injection. (b) Stage 2: Oracle-Guided Bidirectional Distillation. To mitigate error accumulation, the Student generates autoregressively conditioned on its own historical predictions, while the Teacher utilizes Ground Truth motion frames as an "Oracle" guide. The model is optimized via a Stochastic Truncation Strategy using DMD and latent regression losses.
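
To make the caching idea in Stage 1 concrete, here is a minimal sketch, not the authors' implementation. It assumes the Temporal Audio Context Cache behaves like a rolling buffer that prepends recently seen audio to each new short chunk before feature extraction, so the extractor always sees a longer window than the chunk alone. The class name `AudioContextCache`, the `push` method, and the `context_samples` parameter are hypothetical and do not come from the paper or its code.

```python
from collections import deque
from typing import Deque, List, Sequence


class AudioContextCache:
    """Hypothetical rolling buffer holding the most recent audio samples."""

    def __init__(self, context_samples: int) -> None:
        if context_samples < 0:
            raise ValueError("context_samples must be non-negative")
        self.context_samples = context_samples
        # deque(maxlen=...) silently drops the oldest samples once full.
        self._history: Deque[float] = deque(maxlen=context_samples)

    def push(self, chunk: Sequence[float]) -> List[float]:
        """Return cached context followed by the new chunk, then update the cache."""
        window = list(self._history) + list(chunk)
        self._history.extend(chunk)
        return window


# Toy usage with integers standing in for audio samples.
cache = AudioContextCache(context_samples=4)
first = cache.push([1, 2, 3])   # no history yet: [1, 2, 3]
second = cache.push([4, 5, 6])  # history [1, 2, 3] is prepended
third = cache.push([7])         # only the last 4 samples are kept as context
```

In a real pipeline the returned window would be passed to an audio encoder and only the features aligned with the new chunk would be kept; that step is omitted here because the source does not describe it.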

VividHead Dataset Construction

VividHead consists of 330,000 high-quality short clips, each 3 to 60 seconds long, totaling 782 hours. Each sample provides: (i) an image sequence at $512\times 512$ resolution; (ii) strictly time-aligned speech audio; and (iii) rich metadata including language, ethnicity, and age. We keep only samples that contain a single visible speaker with an active head region.
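
The dataset description above can be read as a per-clip record plus a set of inclusion rules. The sketch below is illustrative only: the `ClipRecord` fields and the `passes_filters` helper are hypothetical names, and the released dataset may use a different schema. It also works out the mean clip length implied by the reported totals, which comes to about 8.5 seconds.

```python
from dataclasses import dataclass

# Figures reported in the dataset description.
NUM_CLIPS = 330_000
TOTAL_HOURS = 782
MIN_CLIP_SECONDS = 3.0
MAX_CLIP_SECONDS = 60.0
FRAME_SIZE = (512, 512)


@dataclass(frozen=True)
class ClipRecord:
    """Hypothetical per-clip record; field names are assumptions."""
    width: int
    height: int
    duration_seconds: float
    num_visible_speakers: int
    head_region_active: bool
    language: str
    ethnicity: str
    age_group: str


def passes_filters(clip: ClipRecord) -> bool:
    """Apply the inclusion rules stated in the text."""
    return (
        (clip.width, clip.height) == FRAME_SIZE
        and MIN_CLIP_SECONDS <= clip.duration_seconds <= MAX_CLIP_SECONDS
        and clip.num_visible_speakers == 1
        and clip.head_region_active
    )


def mean_clip_seconds(total_hours: float, num_clips: int) -> float:
    """Average clip length in seconds implied by the dataset totals."""
    return total_hours * 3600 / num_clips


average = mean_clip_seconds(TOTAL_HOURS, NUM_CLIPS)
print(f"Mean clip length: {average:.2f} s")
```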

Comparison with Other Methods

We further evaluate our model on the challenging long video generation task (60 seconds at 25 fps) and compare it with state-of-the-art methods. Our model maintains high-fidelity generation throughout the entire duration, which covers most practical application scenarios. The following videos show the output of each method together with its FPS on a single NVIDIA RTX 4090 GPU. Our method achieves superior long-term consistency without error accumulation. Compared to models based on abstract motion representations, such as Ditto and SadTalker, it also shows better lip-sync consistency. Notably, SadTalker fails to keep the headgear attached to the subject during motion because it lacks a holistic representation, whereas our method preserves robust holistic consistency.

Ours-Pro (10.81 FPS)

Ours-Lite (96.00 FPS)

SadTalker (2.17 FPS)

Ditto (45.04 FPS)

Hallo3 (0.16 FPS)

EchoMimic_V3 (0.81 FPS)
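
The reported FPS figures above are easier to compare when converted to a real-time factor and to the wall-clock time needed for the 60-second, 25 fps test clip. The short calculation below is a sketch under two assumptions: the reported FPS is generated frames per wall-clock second, and playback runs at the 25 fps used in the evaluation. The function names are illustrative.

```python
PLAYBACK_FPS = 25   # evaluation frame rate stated in the text
CLIP_SECONDS = 60   # evaluation clip length stated in the text

# Throughput as reported on a single NVIDIA RTX 4090.
REPORTED_FPS = {
    "Ours-Pro": 10.81,
    "Ours-Lite": 96.00,
    "SadTalker": 2.17,
    "Ditto": 45.04,
    "Hallo3": 0.16,
    "EchoMimic_V3": 0.81,
}


def real_time_factor(generation_fps: float, playback_fps: float = PLAYBACK_FPS) -> float:
    """Values of 1.0 or more mean frames are produced at least as fast as they play."""
    return generation_fps / playback_fps


def generation_seconds(
    generation_fps: float,
    clip_seconds: float = CLIP_SECONDS,
    playback_fps: float = PLAYBACK_FPS,
) -> float:
    """Wall-clock seconds needed to generate every frame of the clip."""
    total_frames = clip_seconds * playback_fps
    return total_frames / generation_fps


for name, fps in sorted(REPORTED_FPS.items(), key=lambda item: item[1], reverse=True):
    print(
        f"{name:<13} {real_time_factor(fps):6.2f}x real time, "
        f"{generation_seconds(fps):8.1f} s for the 60 s clip"
    )
```

Under these assumptions, Ours-Lite runs at 3.84 times real time and Ditto at roughly 1.8 times, while the remaining methods, including Ours-Pro at about 0.43 times, are slower than 25 fps playback on this hardware.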

Short Video Results

Top: FlashHead-Pro; Bottom: FlashHead-Lite.

Long Video Results

Top: FlashHead-Pro; Bottom: FlashHead-Lite.

Ethics Statement

This research aims to advance digital human synthesis for beneficial applications. We confirm that all datasets utilized in this study are derived from publicly accessible academic repositories. The visual demonstrations presented in this report are fully synthetic and do not contain Personally Identifiable Information (PII) of private individuals.
We acknowledge the dual-use nature of high-fidelity video generation technology and the potential risks associated with its misuse, such as the creation of deepfakes or the spread of misinformation. We firmly condemn any malicious application of this technology and advocate for the principles of Responsible AI. To mitigate these risks, we support the development of robust forgery detection algorithms and the implementation of invisible watermarking mechanisms to ensure content transparency and traceability. We remain committed to adhering to ethical guidelines and ensuring that our contributions promote the safe and positive evolution of the field.

Citations

BibTeX

@misc{yu2026soulxflashheadoracleguidedgenerationinfinite,
      title={SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads}, 
      author={Tan Yu and Qian Qiao and Le Shen and Ke Zhou and Jincheng Hu and Dian Sheng and Bo Hu and Haoming Qin and Jun Gao and Changhai Zhou and Shunshun Yin and Siyuan Liu},
      year={2026},
      eprint={2602.07449},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.07449}, 
}

This page was built using a modified version of the Academic Project Page Template from vinthony. You are free to borrow the source code of this website; we just ask that you link back to this page in the footer. This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.