QinYu: A Family of High-Fidelity Zero-Shot TTS

Abstract

We introduce QinYu, a family of high-fidelity text-to-speech systems capable of generating 32 kHz studio-quality speech with exceptional naturalness. For the audiobook scenario, we achieve fine-grained emotion control through text instructions written in natural language, which significantly enhances emotional expressiveness. We have also researched text-to-timbre technology: given a description of the gender, age, and personality of a desired voice, the system generates the corresponding timbre, which addresses the problems of limited timbre options and difficult timbre matching. For the podcast dialogue scenario, we support spontaneous colloquial expression (automatically adding pauses, hesitations, and moments of thinking) as well as enhanced paralinguistic expression, producing a more realistic and human-like result. The goal of these systems is to close the gap between synthetic and human speech across diverse TTS applications.

Instructed Voice Generation

Instruct	Generated

Text-to-Timbre

Category	Text-to-Timbre Prompt	Instruct	Generated

Full-cast Audiobook Generation

Text	Generated

Zero-Shot Auto-Oral Voice Generation

Text	Prompt	F5-TTS	CosyVoice2	QinYu-Cast

Podcast Generation

Prompt 1	Prompt 2	MOSS-TTSD	VibeVoice Large	QinYu-Cast

Disclaimer

The content presented above is solely for academic purposes and serves to showcase technical capabilities. Some of the examples were sourced from the internet. If any of the content infringes upon your rights, please contact us promptly so that we can remove it. We respect intellectual property rights and are committed to ensuring that all materials used comply with relevant regulations.

BibTeX

@article{qinyu-2025,
  title={QinYu: A Family of High-Fidelity Zero-Shot TTS},
  author={QinYu Team},
  year={2025}
}