
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset, as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single- and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.
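
To make the "style as a latent random variable sampled by diffusion" idea concrete, here is a minimal, self-contained PyTorch sketch: starting from Gaussian noise, a small denoiser conditioned on a text embedding is run through a toy DDPM-style reverse loop to produce a style vector, with no reference audio involved. All module names, dimensions, the noise schedule, and the stand-in text embedding are illustrative assumptions, not the repository's code, and the paper's actual sampler uses its own formulation.

```python
import torch
import torch.nn as nn

class StyleDenoiser(nn.Module):
    """Predicts the noise added to a style vector, given the noisy style,
    the diffusion timestep, and a summary embedding of the input text."""
    def __init__(self, style_dim=128, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim + text_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, style_dim),
        )

    def forward(self, noisy_style, t, text_emb):
        # Crude scalar timestep conditioning; real models use sinusoidal
        # or learned timestep embeddings.
        t = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([noisy_style, t, text_emb], dim=-1))

@torch.no_grad()
def sample_style(denoiser, text_emb, steps=50, style_dim=128):
    """Bare-bones DDPM-style ancestral sampling: start from Gaussian
    noise and iteratively denoise into a style vector for this text."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    s = torch.randn(text_emb.size(0), style_dim)
    for i in reversed(range(steps)):
        t = torch.full((text_emb.size(0),), i)
        eps = denoiser(s, t, text_emb)
        # Standard DDPM reverse-step mean, plus noise except at the end.
        s = (s - betas[i] / torch.sqrt(1.0 - alpha_bar[i]) * eps) \
            / torch.sqrt(alphas[i])
        if i > 0:
            s = s + torch.sqrt(betas[i]) * torch.randn_like(s)
    return s  # one style vector per utterance, no reference audio needed

denoiser = StyleDenoiser()
text_emb = torch.randn(2, 512)  # stand-in for a real text encoder output
print(sample_style(denoiser, text_emb).shape)  # torch.Size([2, 128])
```

Because the style is sampled rather than copied from a reference clip, each run can yield a different but plausible style for the same text, which is the "diverse speech synthesis" benefit the abstract refers to.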

Paper: https://arxiv.org/abs/2306.07691

Audio samples: https://styletts2.github.io/

Training

Under construction
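
While the training code is being documented, the adversarial component the abstract describes (a frozen, pre-trained SLM such as WavLM used as a discriminator) can be sketched roughly as below. The checkpoint name, head architecture, and least-squares GAN objective are assumptions chosen for illustration; the repository's actual training code defines its own discriminator and loss, along with the differentiable duration modeling.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel

class SLMDiscriminator(nn.Module):
    """Frozen WavLM encoder plus a small trainable head that scores
    whether a waveform sounds like real speech."""
    def __init__(self, name="microsoft/wavlm-base-plus"):
        super().__init__()
        self.slm = WavLMModel.from_pretrained(name)
        self.slm.requires_grad_(False)  # keep the SLM's weights frozen
        hidden = self.slm.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1)
        )

    def forward(self, wav_16khz):
        # wav_16khz: (batch, samples) raw waveform at WavLM's 16 kHz rate
        feats = self.slm(wav_16khz).last_hidden_state  # (B, T, hidden)
        return self.head(feats).squeeze(-1)            # per-frame scores

def lsgan_d_loss(disc, real_wav, fake_wav):
    # Least-squares GAN losses, one common adversarial objective
    # (an assumption here; the paper specifies its own formulation).
    return ((disc(real_wav) - 1) ** 2).mean() + (disc(fake_wav) ** 2).mean()

def lsgan_g_loss(disc, fake_wav):
    return ((disc(fake_wav) - 1) ** 2).mean()
```

Note that freezing the SLM's parameters does not block gradients from flowing through it to the synthesized waveform, so the generator can still be trained end to end against the SLM's judgment, which is the point of the adversarial setup in the abstract.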

Inference

Under construction
