
Seamless - Speech to Speech and Speech to Text Metadata

This document contains metadata information for reconstructing the dataset we used for training our models.

Format

The metadata format is similar to the NLLB bitext format, with some small differences.

The metadata files are tab-separated, gzip-compressed files. Each file corresponds to one alignment direction.

File naming convention:

  • for text, we use 3 letters: e.g. fra, eng, tur
  • for audio, we use 2 letters followed by 'A': e.g. frA, enA, trA

For example, the direction eng-trA corresponds to information for reconstructing English text with Turkish speech alignments. Similarly, enA-jpn corresponds to "English speech with Japanese text", and enA-frA corresponds to "English speech with French speech".
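Under these conventions, the two sides of a direction code can be told apart mechanically. A minimal sketch (the helper names are ours, not part of the release):

```python
def side_kind(side: str) -> str:
    """Classify a side code: 'enA' -> audio (2 letters + 'A'), 'jpn' -> text."""
    if len(side) != 3:
        raise ValueError(f"unexpected side code: {side!r}")
    if side.endswith("A") and side[:2].islower():
        return "audio"
    if side.islower():
        return "text"
    raise ValueError(f"unexpected side code: {side!r}")

def describe_direction(direction: str) -> tuple:
    """Split a direction such as 'enA-jpn' into (source kind, target kind)."""
    src, tgt = direction.split("-")
    return side_kind(src), side_kind(tgt)
```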

Each line has 11 columns.

For Audio, the columns correspond to:

- `cc_warc`: The warc file reference containing the public audio url
- `cc_sha`: not used
- `audio_speech_segment_url`: space-separated audio reference. See below.
- `cc_lineno`: not used
- `paragraph_digest`: expected duration of the whole audio file (without start/end frame trimming)
- `sentence_digest`: not used
- `text_lid_score`: not used
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
- `line_no`: alignment number
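The 11 columns above can be loaded into a small record type. A sketch, assuming the gzip files are read as text (e.g. with `gzip.open(path, "rt")`); the field names follow the column descriptions:

```python
from typing import NamedTuple

class AudioRow(NamedTuple):
    # Field names follow the 11 audio columns described above.
    cc_warc: str
    cc_sha: str
    audio_speech_segment_url: str
    cc_lineno: str
    paragraph_digest: str
    sentence_digest: str
    text_lid_score: str
    laser_score: float
    direction: str
    side: str
    line_no: int

def parse_audio_row(line: str) -> AudioRow:
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 11:
        raise ValueError(f"expected 11 columns, got {len(cols)}")
    cols[7] = float(cols[7])   # laser_score
    cols[10] = int(cols[10])   # line_no (alignment number)
    return AudioRow(*cols)
```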

`audio_speech_segment_url` is a space-separated audio reference with the format `<url> <start_frame> <end_frame>`, where `start_frame` and `end_frame` delimit the segment to extract from the audio file referenced at `<url>`, after resampling to 16000 Hz.
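Parsing such a reference is straightforward; dividing the frame indices by the 16000 Hz sample rate gives times in seconds (a sketch, not part of the release):

```python
SAMPLE_RATE = 16_000  # frame indices assume audio resampled to 16000 Hz

def parse_segment_ref(ref: str):
    """Parse '<url> <start_frame> <end_frame>' into (url, start_s, end_s)."""
    url, start_frame, end_frame = ref.split(" ")
    return url, int(start_frame) / SAMPLE_RATE, int(end_frame) / SAMPLE_RATE
```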

For text, the columns are similar to the NLLB format (except that they are tab-separated here):

  • If the metadata comes from Common Crawl:

    • `cc_warc`: the reference to the Common Crawl WET file
    • `cc_sha`: the sha1 of the document in the WET file
    • `cc_document_url`: the url of the document referenced in the WET file
    • `cc_lineno`: the line number in the document referenced in the WET file
    • `paragraph_digest`: xxhash.xxh3_64_intdigest of the paragraph
    • `sentence_digest`: xxhash.xxh3_64_intdigest of the sentence
    • `text_lid_score`: language identification score, when available
    • `laser_score`: score of the alignment
    • `direction`: direction, e.g. `enA-jpn`
    • `side`: side, e.g. `enA` or `jpn`
    • `line_no`: alignment number
  • If the metadata comes from another corpus:

    • `corpus`: corpus name
    • `cc_sha`: not used
    • `cc_document_url`: not used
    • `lineno`: line number in the document
    • `paragraph_digest`: xxhash.xxh3_64_intdigest of the paragraph
    • `sentence_digest`: xxhash.xxh3_64_intdigest of the sentence
    • `text_lid_score`: language identification score, when available
    • `laser_score`: score of the alignment
    • `direction`: direction, e.g. `enA-jpn`
    • `side`: side, e.g. `enA` or `jpn`
    • `line_no`: alignment number
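The two text layouts can be told apart by the first column: Common Crawl rows reference a WET file whose path starts with crawl-data (the same prefix the download example in this document filters on with egrep), while other corpora put the corpus name there. A sketch with an illustrative helper name:

```python
def text_row_source(cols: list) -> str:
    """Return 'common_crawl' if the row references a WET file, else 'other'."""
    return "common_crawl" if cols[0].startswith("crawl-data") else "other"
```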

Data

Update: 30 Nov 2023

We are publishing an extension of the previous speech-to-speech release.

afA-enA amA-enA arA-enA asA-enA azA-enA beA-enA bgA-enA bnA-enA bsA-enA caA-enA csA-enA cyA-enA daA-enA deA-enA elA-enA enA-esA enA-etA enA-fiA enA-frA enA-gaA enA-glA enA-guA enA-heA enA-hiA enA-hrA enA-huA enA-hyA enA-idA enA-isA enA-itA enA-jaA enA-jvA enA-kaA enA-kiA enA-kkA enA-knA enA-koA enA-kyA enA-lgA enA-loA enA-ltA enA-lvA enA-mkA enA-mlA enA-mnA enA-mrA enA-msA enA-mtA enA-neA enA-nlA enA-noA enA-orA enA-paA enA-pbA enA-plA enA-psA enA-ptA enA-rnA enA-ruA enA-sdA enA-skA enA-slA enA-srA enA-svA enA-swA enA-taA enA-teA enA-tgA enA-thA enA-trA enA-ukA enA-urA enA-uzA enA-viA enA-yoA enA-zhA


Update: 25 Sep 2023

We are publishing updated metadata with the expected duration of the original audio file in the `paragraph_digest` column (originally not used for audio).

arb-enA ben-enA cat-enA dan-enA enA-est enA-fin enA-jpn enA-mlt enA-nld enA-pol enA-por enA-ron enA-slk enA-swe enA-swh enA-tur enA-ukr enA-urd enA-vie arA-enA arA-eng beA-enA caA-enA caA-eng csA-enA csA-eng cyA-enA cyA-eng daA-enA daA-eng deA-enA deA-eng enA-esA enA-fiA enA-frA enA-hiA enA-idA enA-itA enA-knA enA-koA enA-mtA enA-nlA enA-plA enA-ptA enA-rnA enA-ruA enA-skA enA-svA enA-swA enA-taA enA-teA enA-tgA enA-thA enA-trA enA-ukA enA-urA enA-uzA enA-viA enA-zhA eng-esA eng-fiA eng-frA eng-hiA eng-idA eng-itA eng-knA eng-koA eng-mtA eng-nlA eng-plA eng-ptA eng-rnA eng-ruA eng-skA eng-swA eng-taA eng-teA eng-tgA eng-thA eng-trA eng-ukA eng-urA eng-uzA eng-viA eng-zhA


You can find the legacy metadata (without duration information) here:

Legacy Data

arb-enA ben-enA cat-enA dan-enA enA-est enA-fin enA-jpn enA-mlt enA-nld enA-pol enA-por enA-ron enA-slk enA-swe enA-swh enA-tur enA-ukr enA-urd enA-vie arA-enA arA-eng beA-enA caA-enA caA-eng csA-enA csA-eng cyA-enA cyA-eng daA-enA daA-eng deA-enA deA-eng enA-esA enA-fiA enA-frA enA-hiA enA-idA enA-itA enA-knA enA-koA enA-mtA enA-nlA enA-plA enA-ptA enA-rnA enA-ruA enA-skA enA-svA enA-swA enA-taA enA-teA enA-tgA enA-thA enA-trA enA-ukA enA-urA enA-uzA enA-viA enA-zhA eng-esA eng-fiA eng-frA eng-hiA eng-idA eng-itA eng-knA eng-koA eng-mtA eng-nlA eng-plA eng-ptA eng-rnA eng-ruA eng-skA eng-swA eng-taA eng-teA eng-tgA eng-thA eng-trA eng-ukA eng-urA eng-uzA eng-viA eng-zhA

Download script

You can use the `wet_lines` script to download and gather aligned text information from the metadata. This script can be found here.

Example usage:

```
zcat seamless.dataset.metadata.public.enA-swA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | wet_lines
```

Based on the metadata it receives on stdin, `wet_lines` downloads the referenced WET files, locates each paragraph, and prints the input with an additional column containing the text of the paragraph.

In order to retrieve the sentences from these paragraphs, one can use the sentence splitter available here. It will print the input (metadata + paragraph) with an additional column which corresponds to the text of the sentence.

Reconstructing sentences from metadata:

```
xzcat metadatafile.xz | egrep ^crawl-data | wet_lines | python -c "from sentence_cleaner_splitter.cleaner_splitter import *; split_clean()"
```
