QinYu: A Family of High-Fidelity Zero-Shot TTS with High Naturalness, Spontaneous Colloquialism, and Emotional Control

Abstract

We introduce QinYu, a family of high-fidelity text-to-speech (TTS) systems designed to deliver 32 kHz studio-quality speech with exceptional naturalness, surpassing the 22 kHz limit of most open-source TTS tools. The family targets two settings: spontaneous colloquialism for conversational scenarios (e.g., podcasts) and fine-grained emotional control for narrative contexts (e.g., audiobooks). It can also generate authentic paralinguistic elements such as natural laughter and precise prosodic pauses. Its scenario-specific variants include QinYuCast, which automatically inserts colloquial artifacts (e.g., pauses, hesitations) for lifelike dialogue, and QinYuInstruct, which lets users specify emotion through simple descriptors such as "warm" or "excited." Future iterations will advance an "ALL-in-One" architecture built on a million-hour-scale base model, integrating controllable paralinguistic tagging, adjustable colloquialism strength, large-model enhancements, and novel voice generation, with the goal of closing the gap between synthetic and human speech across diverse TTS applications.

Paralanguage Voice Generation

Text	Prompt	CosyVoice2	GPT-SoVITS	ChatTTS	QinYu

Instructed Voice Generation

Speaker	Instruct	Generated

Text-to-Timbre

Category	Text-to-Timbre Prompt	Instruct	Generated

Audiobook Generation

Text	Generated

Zero-Shot Auto-Oral Voice Generation

Text	Prompt	F5-TTS	CosyVoice2	QinYu-Cast

Podcast Generation

Prompt 1	Prompt 2	MOSS-TTSD	Doubao (豆包)	VibeVoice Large	QinYu-Cast

BibTeX

@article{qinyu-2025,
  title={QinYu: A Family of High-Fidelity Zero-Shot TTS with High Naturalness, Spontaneous Colloquialism, and Emotional Control},
  author={QinYu Team},
  year={2025}
}