A Text-to-Speech Transformer in TensorFlow 2
Implementation of a non-autoregressive Transformer-based neural network for Text-to-Speech (TTS).
This repo is based, among others, on the following papers:
Our pre-trained LJSpeech model is compatible with the pre-trained vocoders:
(older versions are also available for WaveRNN)
For quick inference with these vocoders, check out the Vocoding branch.
Being non-autoregressive, this Transformer model is:
These samples' spectrograms are converted using the pre-trained MelGAN vocoder.
Try it out on Colab:
Make sure you have:
Install espeak as the phonemizer backend (on macOS use brew: `brew install espeak`):

```bash
sudo apt-get install espeak
```
Then install the rest with pip:

```bash
pip install -r requirements.txt
```
Read the individual scripts for more command line arguments.
Use our pre-trained model (with Griffin-Lim) from the command line with:

```bash
python predict_tts.py -t "Please, say something."
```
Or in a Python script:

```python
from data.audio import Audio
from model.factory import tts_ljspeech

model = tts_ljspeech()
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert spectrogram to wav (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
```
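`reconstruct_waveform` returns the waveform as a float numpy array. If you want to write it to disk without extra audio dependencies, here is a minimal sketch using only the standard library's `wave` module — `save_wav` is an illustrative helper, not part of this repo, and the 22050 Hz default assumes the LJSpeech sampling rate (adjust it to match your audio config):

```python
import wave

import numpy as np


def save_wav(wav: np.ndarray, path: str, sample_rate: int = 22050) -> None:
    """Write a float waveform in [-1, 1] to a 16-bit PCM mono wav file."""
    pcm = (np.clip(wav, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```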
You can specify the model step with the `--step` flag (command line) or the `step` parameter (script). Steps from 60000 to 100000 are available at a frequency of 5K steps (60000, 65000, ..., 95000, 100000).
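The valid checkpoint steps described above form a simple grid, sketched here for reference:

```python
# Pre-trained checkpoints are published every 5000 steps
# between step 60000 and step 100000 (inclusive): 9 in total.
available_steps = list(range(60000, 100001, 5000))
print(available_steps)  # [60000, 65000, 70000, ..., 100000]
```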
IMPORTANT: make sure to check out the correct repository version to use the API.
Currently `493be63453`
You can directly use LJSpeech to create the training dataset.

- Use `config/training_config.yaml` to create MelGAN- or HiFiGAN-compatible models.
- Swap in the content of `data_config_wavernn.yaml` in `config/training_config.yaml` to create models compatible with WaveRNN.
- In `config/training_config.yaml`, edit the paths to point at your dataset and log folders.

Prepare a folder containing your metadata and wav files, for instance:
```
|- dataset_folder/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...
```
If `metadata.csv` has the following format:

```
wav_file_name|transcription
```

you can use the ljspeech preprocessor in `data/metadata_readers.py`; otherwise, add your own under the same file.
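A custom reader boils down to a function that maps the metadata file to a `{wav_file_name: transcription}` dictionary. A minimal sketch for the pipe-separated format above — `read_metadata` is an illustrative name, and the actual readers in `data/metadata_readers.py` may differ in their details:

```python
from pathlib import Path
from typing import Dict


def read_metadata(metadata_path: str, column_sep: str = "|") -> Dict[str, str]:
    """Parse a '<wav_file_name>|<transcription>' file into a
    {wav_file_name: transcription} dictionary."""
    text_dict = {}
    for line in Path(metadata_path).read_text(encoding="utf-8").splitlines():
        if column_sep not in line:
            continue  # skip blank or malformed lines
        file_name, transcription = line.split(column_sep, maxsplit=1)
        text_dict[file_name.strip()] = transcription.strip()
    return text_dict
```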
Make sure that:

- the metadata reader function name matches the `data_name` field in `training_config.yaml`;
- the `metadata_path` field in `training_config.yaml` points to your metadata file.
Change the `--config` argument based on the configuration of your choice:

```bash
python create_training_data.py --config config/training_config.yaml
```
This will populate the training data directory (default `transformer_tts_data.ljspeech`).
Train the aligner model:

```bash
python train_aligner.py --config config/training_config.yaml
```
First use the aligner model to create the durations dataset:

```bash
python extract_durations.py --config config/training_config.yaml
```

This will add the `durations.<session name>` folder, as well as the char-wise pitch folders, to the training data directory.
Then train the TTS model:

```bash
python train_tts.py --config config/training_config.yaml
```
To resume training, reuse the same `training_config.yaml`. To restart instead, clear the logs and/or the weights with the `--reset_dir` flag (clears both), or `--reset_logs`, `--reset_weights`.
Monitor training with TensorBoard:

```bash
tensorboard --logdir /logs/directory/
```
From the command line with:

```bash
python predict_tts.py -t "Please, say something." -p /path/to/weights/
```

Or in a Python script:

```python
from model.models import ForwardTransformer
from data.audio import Audio

model = ForwardTransformer.load_model('/path/to/weights/')
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert spectrogram to wav (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
```
Access the pre-trained models with the API call.
Old weights
MelGAN and WaveRNN: data normalization and samples' vocoders are from these repos.
Erogol and the Mozilla TTS team for the lively exchange on the topic.
See LICENSE for details.