
StyleTTS2: A Quest To Improve Zero-Shot Performance

Audio Mar 27, 2024

Introduction

Text-to-speech (TTS) technology is an ever-improving field, with every year bringing new and impressive gains in naturalness and quality. StyleTTS2, introduced in the paper StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models, marks a significant advancement in TTS technology, with an architecture that allows for efficient yet natural TTS inference.

With StyleTTS2, users can provide a short 5-10 second reference voice, and the model will generate audio that remarkably captures the speaker's timbre, prosody, and pronunciation. This is achieved with a memory footprint of just 2GB of VRAM and inference times of around 2-3 seconds on an RTX 3050M, positioning StyleTTS2 as a promising option for the open-source community, or for situations where GPU compute is limited.

In this blog we'll share our experience finetuning StyleTTS2 to achieve a model with impressive zero-shot performance. We'll cover the challenges we faced and how we overcame them, our dataset processing pipeline, and the catastrophic model errors that may also appear in other finetunes of StyleTTS2 or of any other large AI model.

StyleTTS 2 on DagsHub

Dataset Curation

Dataset curation is particularly important for StyleTTS2, as the wider the variety of voices (and the more voices you have), the better its zero-shot capabilities become. Despite this, StyleTTS2 is a rather robust model: it can be finetuned on as little as 30 minutes of audio from a single speaker to get a decent replica, with 4 hours of data achieving near perfection!

We utilized a diverse array of multi-speaker English TTS datasets in our training pipeline, comprising LibriSpeech-R-100, VCTK, and a quality-curated version of the AniSpeech dataset by ShoukanLabs. This amalgamated data was converted into a format compatible with the StyleTTS2 trainer. Leveraging DagsHub, our dataset was efficiently uploaded and stored, facilitating convenient access across multiple cloud-based platforms. DagsHub also allowed for easy versioning of our dataset via DVC, so if a specific file or transcription was broken we could easily correct it.
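To give a feel for that conversion step, here is a rough sketch of the kind of script involved. The pipe-delimited "path|transcript|speaker" layout is an assumption based on the lines our filtering script (shown further below) parses; the helper name and the metadata.csv input are illustrative, not our exact pipeline code.

import csv

def write_train_list(rows, out_path):
    # rows: iterable of (wav_path, transcript, speaker_id) tuples
    with open(out_path, "w", encoding="utf-8") as fh:
        for wav_path, transcript, speaker_id in rows:
            # transcripts must not contain the "|" separator themselves
            transcript = transcript.replace("|", " ")
            fh.write(f"{wav_path}|{transcript}|{speaker_id}\n")

# Example usage with a hypothetical metadata.csv laid out as path,transcript,speaker
with open("metadata.csv", newline="", encoding="utf-8") as fh:
    rows = [(r[0], r[1], r[2]) for r in csv.reader(fh)]
write_train_list(rows, "wavs.txt")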

See the dataset in DagsHub

Whilst the AniSpeech dataset features a wide range of expressiveness and unique speakers, it suffers from a large amount of low quality data due to its largely automated pipeline (around 90% automated). Whilst we got decent results using the original AniSpeech dataset, it became very evident that the dataset in its current state had to be cleaned to ensure a high quality model. We employed WVMOS, a finetuned wav2vec 2.0 model designed to predict the Mean Opinion Score (MOS) of audio, a measure of quality normally judged by human listeners.

We leveraged this model to predict the MOS scores of the AniSpeech dataset using the following code:

import os
from wvmos import get_wvmos
from tqdm import tqdm

# Load the WVMOS (wav2vec 2.0 MOS predictor) model onto the GPU
model = get_wvmos(cuda=True)

# Path to the pipe-delimited transcript list ("path|transcript|speaker" per line)
wavstxt = input("Input wavs.txt: ").replace('"', "")

with open(wavstxt, "r", encoding="utf8") as fh:
    lines = fh.readlines()

base_dir = os.path.dirname(wavstxt)

nlines = []
for line in tqdm(lines):
    datas = line.replace("\n", "").split("|")
    fpath = datas[0]  # first column is the relative path to the wav file
    full_path = os.path.join(base_dir, fpath)

    mos = model.calculate_one(full_path)  # infer the MOS score for one audio file

    # Keep only entries that score at least 2.5 out of 5
    if mos >= 2.5:
        nlines.append(line.replace("\n", ""))

# Write the surviving entries to a cleaned transcript list
transcript = "\n".join(nlines)
with open("wavs_cleaned.txt", "w", encoding="utf-8") as text_file:
    text_file.write(transcript)

We discarded audio files with an MOS score under 2.5 (out of 5). This threshold retains enough data to remain generalisable, yet removes the majority of the poor quality data. As a side effect it also increased the accuracy of transcriptions, as poor audio files would often have completely hallucinated transcriptions from Whisper.

Here are some examples of datapoints removed entirely through automatic quality ratings. (Working with such a large dataset makes it difficult to manually review data, and WVMOS may also treat voices with certain overlayed effects as low quality, even though those effects are an inherent property of the reference speaker.)



Training the model requires carefully crafting the configuration file, providing paths to the data required for training: the Train_list, Val_list, and OOD_Data. The OOD (out-of-distribution) data is essential for adversarial training, as it allows for better generalisation in zero-shot settings. Then there's the batch size: oddly enough, for finetuning StyleTTS2 the minimum batch size is 2 as opposed to the typical 1. This has been resolved in the pre-training code, but not in the finetuning code we are using to train this model.

Most important is the max_len setting, the maximum length of audio passed at each training step. Ideally, the longer the max_len, the better and more natural sounding your result will be (depending on dataset quality, of course). You can convert the max_len argument to seconds by multiplying it by 300 and then dividing by the sample rate (e.g. 500 * 300 / 24000). However, higher max lengths become extremely expensive to train: a max_len of 500 (which is only 6.25 seconds) uses almost all the VRAM on a 24GB RTX 3090 at a 24kHz sample rate. To put this into perspective, if we were to train at 44kHz with the same max_len, we would only see 3.4 seconds of audio per step whilst using more VRAM. For more information regarding configuration for training, you can find it here
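As a quick sanity check on those numbers, here's a tiny helper of our own (not part of the StyleTTS2 codebase) that converts a max_len value into seconds of audio per training step:

def max_len_to_seconds(max_len: int, sample_rate: int) -> float:
    # Each max_len unit corresponds to 300 audio samples,
    # so seconds = max_len * 300 / sample_rate
    return max_len * 300 / sample_rate

print(max_len_to_seconds(500, 24000))   # 6.25 seconds at 24kHz
print(max_len_to_seconds(1000, 24000))  # 12.5 seconds at 24kHz
print(max_len_to_seconds(500, 44100))   # ~3.4 seconds at 44.1kHz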

We used a modified version of the original repo's training script to better support accelerate, which you can find on our DagsHub repo here. This allows for efficient training with mixed precision, and for proper experiment tracking on DagsHub via MLflow. It also features some minor improvements that make the training process more user friendly.
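For reference, pointing MLflow at a DagsHub repository only takes a couple of lines. The snippet below is a minimal sketch rather than our exact setup; the repo owner/name, experiment name, and logged values are placeholders.

import dagshub
import mlflow

# Connect MLflow to the DagsHub-hosted tracking server for a given repo
dagshub.init(repo_owner="your-user", repo_name="your-repo", mlflow=True)

mlflow.set_experiment("styletts2-finetune")
with mlflow.start_run():
    # Inside the training loop you would log hyperparameters and losses
    mlflow.log_param("max_len", 500)
    mlflow.log_metric("mel_loss", 0.42, step=1)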

See the project experiments on DagsHub

We trained on a single H100 80GB for 300 GPU hours with a max_len of 1000 (12.5 seconds), on approximately 7 days of audio spanning 629 speakers. Whilst this is still fewer speakers than the base model was trained on, we see a significant improvement in the naturalness of prosody and timbre, and in zero-shot performance across a wide range of tasks, compared to the original base model. We believe this is due to a wider range of speakers, from high to low pitched and in a variety of accents and prosodies, expanding the vector space despite the smaller speaker count. We then swapped to a single 3090 for an additional 600 hours with a max_len of 500 to extend our training time.

However, we encountered several issues during training, such as catastrophic forgetting and completely unusable results. As such, we had to restart training multiple times, which heavily set back our progress; PyTorch was also no stranger to undocumented CUDA errors. At first the problem was a lack of clean data, which was easily remedied thanks to DVC and DagsHub, allowing us to quickly swap out our dataset for the quality-rated version, which produced significantly better outputs. Some results from these early models can be found below.

See the model on DagsHub



Despite these setbacks, we were able to resolve the remaining issues fairly quickly. The NVIDIA H100 GPU is seemingly incompatible with StyleTTS2 (at least, not without extreme effort) for no apparent reason, and only after we scaled our dataset beyond the initial 48 hours used in our extremely early training. This was temporarily worked around by using a lower max_len of around 500 towards the end of training on the H100 (only 1 or 1.5 epochs), before eventually swapping to a 3090, which seemingly has no issue with StyleTTS2 and eventually produced the results below.



Interesting Findings

StyleTTS2, leveraging PL-BERT alongside style diffusion and adversarial training, exhibits some responsiveness to prompt engineering. Notably, during our model testing phase, we observed that TTS prompts closely aligned with the training data yield more natural-sounding outputs and better speaker reference similarity. This was particularly evident with the AniSpeech dataset, where animated character reference voices performed best when given TTS prompts closely mirroring the transcripts of anime or cartoons. In contrast, generic prompts often resulted in lower-quality outputs, likely due to the influence the style diffusion component has within the model architecture.

Swapping the English PL-BERT for a multilingual PL-BERT also has an interesting effect. Using a multilingual PL-BERT with a non-multilingual StyleTTS2 model and giving it phonemes in another language allows synthesis in different languages, even on unfinetuned models. The unfinetuned models do not possess any accents and their pronunciation is quite poor, but overall the language capabilities still increase, likely because the model is trained on phonemes rather than words.
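To experiment with this, you only need to phonemize the prompt in the target language before handing it to the model. Here's a minimal sketch using the phonemizer library; the language code and text are just examples, and how the phonemes are fed in depends on your inference wrapper.

from phonemizer.backend import EspeakBackend

# espeak backend configured for Spanish instead of the usual 'en-us'
phonemizer_es = EspeakBackend(language="es", preserve_punctuation=True, with_stress=True)

text = "Hola, ¿cómo estás?"
phonemes = phonemizer_es.phonemize([text])[0]
print(phonemes)  # IPA-style phonemes that can be passed to the StyleTTS2 text encoder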

Combining speaker style references is also possible, and rather trivial! Simply averaging the style embeddings you input is enough to output a blend between two or more speakers, and can be done using this code:

# Compute style embeddings for each reference clip
ref_s1 = styletts.compute_style(audio_path1)
ref_s2 = styletts.compute_style(audio_path2)

# audio_weight is the percentage (0-100) given to the first speaker
audio_weight = 50
weight2 = 100 - audio_weight

# Weighted average of the two style embeddings
weights = [audio_weight, weight2]
values = [ref_s1, ref_s2]
weighted_average = sum(weight * value for weight, value in zip(weights, values)) / sum(weights)

ref_s = weighted_average  # blended style to pass to inference

print(ref_s.size())

s_prev = None  # reset any previously carried style (used by long-form inference loops)

Here's a sample of Donald Trump and David Attenborough merged with equal weighting to each voice.



The most important finding, however, is the use of longer audio references. Whilst they take longer to compute the initial style vector, the longer and cleaner they are, the more profoundly they improve quality. Despite StyleTTS2 being a 24kHz model by default, higher sample rate references tend to be cleaner and possess a higher similarity to the target speaker, despite still being downsampled. This is likely caused by something similar to supersampling in graphics processing, but applied to audio. Oftentimes, pairing StyleTTS2 with a speech enhancement model will output results of exceptional quality and may entirely eliminate the need for a clean reference audio. Perhaps in the future a speech enhancement model will be added to the pipeline officially.
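In practice, this just means handing the style extractor as long and clean a clip as you can, resampled to the model's 24kHz. A rough sketch follows; the file paths are placeholders, and styletts.compute_style is the same wrapper used in the blending example above.

import librosa
import soundfile as sf

# Load a long, clean, high-sample-rate reference and resample it to 24kHz
reference_path = "long_reference_48khz.wav"  # placeholder path
audio, _ = librosa.load(reference_path, sr=24000)

# Save the downsampled copy and compute the style vector from it
sf.write("long_reference_24khz.wav", audio, 24000)
ref_s = styletts.compute_style("long_reference_24khz.wav")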

Conclusion

Overall, StyleTTS2 holds significant promise for TTS and voice cloning in the open-source community, especially in scenarios where GPU resources are limited. Continued research and development in this field are likely to further enhance the capabilities of text-to-speech technology, bringing us closer to human-level audio synthesis, and hopefully in the future we'll have an open-source alternative to proprietary voice cloning/TTS technologies!

Despite its robustness, it still needs a lot more training on a lot more data (likely more than 5000 hours of audio and an extremely large number of speakers) to perform at its best. This is where you, the reader, come in: if you have any large single-speaker datasets you'd like to contribute, in any language, you can contribute to our Vokan dataset, a large community dataset that combines many smaller single-speaker datasets into one big multi-speaker one. You can upload your Uberduck- or FakeYou-compliant datasets via the Vokan bot on the ShoukanLabs Discord Server. The more data we have, the better the models we produce will be!

Citations:

@misc{li2023styletts,
      title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models}, 
      author={Yinghao Aaron Li and Cong Han and Vinay S. Raghavan and Gavin Mischler and Nima Mesgarani},
      year={2023},
      eprint={2306.07691},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

@misc{zen2019libritts,
      title={LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech}, 
      author={Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
      year={2019},
      eprint={1904.02882},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald,
"CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",
The Centre for Speech Technology Research (CSTR), University of Edinburgh
