QinYu: A Family of High-Fidelity Zero-Shot TTS with High Naturalness, Spontaneous Colloquialism, and Emotional Control

QinYu Team
TME Lyra Lab

Abstract

We introduce QinYu, a family of high-fidelity text-to-speech (TTS) systems designed to deliver 32 kHz studio-quality speech with exceptional naturalness, surpassing the 22 kHz limit of most open-source TTS tools. The family is built for versatility: it handles spontaneous colloquialism for conversational scenarios (e.g., podcasts) and fine-grained emotional control for narrative contexts (e.g., audiobooks), and it can generate authentic paralinguistic elements such as natural laughter and precise prosodic pauses. Its scenario-specific variants include QinYuCast, which automatically inserts colloquial artifacts (e.g., pauses, hesitations) for lifelike dialogue, and QinYuInstruct, which lets users specify emotion via simple descriptors such as "warm" or "excited." Future iterations will advance an "ALL-in-One" architecture built on a million-hour-scale base model, integrating controllable paralinguistic tagging, adjustable colloquialism strength, large-model enhancements, and novel voice generation, with the goal of closing the gap between synthetic and human speech across diverse TTS applications.

Paralanguage Voice Generation

Text | Prompt | CosyVoice2 | GPT-SoVITS | ChatTTS | QinYu

Instructed Voice Generation

Speaker | Instruct | Generated

Text-to-Timbre

Category | Text-to-Timbre Prompt | Instruct | Generated

Audiobook Generation

Text | Generated

Zero-Shot Auto-Oral Voice Generation

Text | Prompt | F5-TTS | CosyVoice2 | QinYu-Cast

Podcast Generation

Prompt 1 | Prompt 2 | MOSS-TTSD | Doubao (豆包) | VibeVoice Large | QinYu-Cast

BibTeX

@article{qinyu-2025,
  title={QinYu: A Family of High-Fidelity Zero-Shot TTS with High Naturalness, Spontaneous Colloquialism, and Emotional Control},
  author={QinYu Team},
  year={2025}
}