Pierre Andrews 07279891ea
Mutox Classifier Model (#332)
3 months ago
README.md

Tool to compute toxicity in speech (ASR-ETOX) and text (ETOX)

This tool combines an ASR model (M4T or Whisper) with the ETOX toxicity detection tool to compute a toxicity score for speech segments.

ETOX was developed as part of the NLLB project and provides a wordlist detection mechanism for 200 languages. By applying ASR on top of the ETOX detection, we can detect the toxicity in speech. You can find a description of the toxicity detection wordlists in the paper cited below.
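As a rough illustration of what wordlist-based detection means, here is a minimal Python sketch. It is not the ETOX implementation: the function name `count_listed_words`, the regex tokenization, and the placeholder wordlist are all assumptions made for this example, and the real tool may tokenize and match differently, especially for languages that do not separate words with spaces.

```python
import re
from typing import Iterable, List, Tuple


def count_listed_words(text: str, wordlist: Iterable[str]) -> Tuple[int, List[str]]:
    """Return (number of matches, matched tokens) for a toy wordlist lookup.

    Hypothetical helper: lowercases the text, splits it into word tokens,
    and keeps every token that appears in the wordlist.
    """
    listed = {word.lower() for word in wordlist}
    tokens = re.findall(r"\w+", text.lower())
    hits = [token for token in tokens if token in listed]
    return len(hits), hits


# Placeholder wordlist, not taken from the real ETOX lists.
count, words = count_listed_words("Foo met bar, then foo again.", ["foo", "bar"])
print(count, ",".join(words))
```

Repeated occurrences are counted separately in this sketch; whether the actual tool does the same is not stated here.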

ASR-ETOX Usage

The script takes a TSV file as input. The TSV needs a header row with column names and can have multiple columns. By default the script reads the name of the audio file to load from the audio column; this can be overridden with --audio_column. File paths in the TSV can be absolute or relative to a root directory specified by --audio_root_dir. They can also reference entries in an audiozip file with the appropriate byte offset and length, e.g.: fleurs_en_us_ogg_16khz.zip:89474600:49079.
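The path conventions above can be sketched in Python as follows. This is an illustration only, not the script's own parsing code: `AudioRef` and `parse_audio_ref` are hypothetical names, and the `path:byteoffset:length` layout is inferred from the example value above.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class AudioRef(NamedTuple):
    path: Path
    offset: Optional[int]  # byte offset inside the zip, None for plain files
    length: Optional[int]  # byte length inside the zip, None for plain files


def parse_audio_ref(value: str, root_dir: Optional[str] = None) -> AudioRef:
    """Split 'archive.zip:offset:length', or treat the value as a plain path."""
    parts = value.rsplit(":", 2)
    if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
        raw_path, offset, length = parts[0], int(parts[1]), int(parts[2])
    else:
        raw_path, offset, length = value, None, None
    path = Path(raw_path)
    # Relative paths are resolved against the root directory, if one is given.
    if root_dir is not None and not path.is_absolute():
        path = Path(root_dir) / path
    return AudioRef(path, offset, length)


ref = parse_audio_ref("fleurs_en_us_ogg_16khz.zip:89474600:49079", root_dir="/data")
print(ref.path, ref.offset, ref.length)
```

Splitting from the right keeps any colons that appear earlier in the path intact.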

You can choose the ASR model to use; the default is seamlessM4T_v2_large. To use Whisper instead, pass a --model_name that starts with whisper_ and ends with the Whisper model name (e.g. whisper_large).
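The naming rule for --model_name can be expressed as a small Python sketch. The function `resolve_asr_backend` and the backend labels `"m4t"` and `"whisper"` are hypothetical and only illustrate the prefix convention described above; the script's actual dispatch logic may differ.

```python
from typing import Tuple

WHISPER_PREFIX = "whisper_"


def resolve_asr_backend(model_name: str = "seamlessM4T_v2_large") -> Tuple[str, str]:
    """Map a --model_name value to (backend label, model identifier)."""
    if model_name.startswith(WHISPER_PREFIX):
        # Everything after the prefix is taken as the Whisper model name.
        return "whisper", model_name[len(WHISPER_PREFIX):]
    return "m4t", model_name


print(resolve_asr_backend())
print(resolve_asr_backend("whisper_large"))
```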

Outputs

The output of the script is a new TSV file with three columns:

  • text: the transcription
  • toxicity: the number of toxic words detected
  • bad_words: a list of toxic words, separated by ,
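For downstream use, the output file can be read with Python's standard csv module. The sketch below assumes that the output TSV has a header row naming the three columns listed above, which is not confirmed here; `load_toxic_rows` and the sample rows are hypothetical.

```python
import csv
import io
from typing import Dict, List


def load_toxic_rows(tsv_text: str) -> List[Dict[str, object]]:
    """Return only the rows in which at least one toxic word was detected."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        count = int(row["toxicity"])
        if count == 0:
            continue
        # bad_words is a comma-separated list; drop empty entries.
        words = [word for word in row["bad_words"].split(",") if word]
        rows.append({"text": row["text"], "toxicity": count, "bad_words": words})
    return rows


# Made-up sample data with placeholder words.
sample = "text\ttoxicity\tbad_words\nhello there\t0\t\nsome foo and bar\t2\tfoo,bar\n"
for entry in load_toxic_rows(sample):
    print(entry["toxicity"], entry["bad_words"], entry["text"])
```

To read a real output file, pass its contents instead of `sample`, for example `Path("etox.tsv").read_text(encoding="utf-8")`.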

Sample Command

ASR-ETOX

  • using M4T:
python -m seamless_communication.cli.toxicity.asr_etox --lang deu --audio_column ref_tgt_audio s2t/en-xx/deu.tsv ~/etox.tsv
  • using Whisper:
python -m seamless_communication.cli.toxicity.asr_etox --model_name whisper_large --lang fra --audio_column ref_tgt_audio s2t/en-xx/fra.tsv ~/etox.test.tsv

ETOX

If you only need the toxicity of text, use the etox.py script. It reads one text per line and takes the language as its first argument.

cut -f 4 fleurs/s2t/en-xx/deu.tsv | python -m seamless_communication.cli.toxicity.etox deu > deu.toxicity.txt

You can also specify an input and output file:

python -m seamless_communication.cli.toxicity.etox deu deu.txt deu.toxicity.txt

Citation

If you use ETOX, ASR-ETOX and SeamlessM4T in your work, please cite:

@misc{costajussà2023toxicity,
      title={Toxicity in Multilingual Machine Translation at Scale},
      author={Marta R. Costa-jussà and Eric Smith and Christophe Ropers and Daniel Licht and Jean Maillard and Javier Ferrando and Carlos Escolano},
      year={2023},
      eprint={2210.03070},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

and

@article{seamlessm4t2023,
  title={SeamlessM4T—Massively Multilingual \& Multimodal Machine Translation},
  author={{Seamless Communication}, Lo\"{i}c Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-juss\`{a}, Onur \c{C}elebi, Maha Elbayad, Cynthia Gao, Francisco Guzm\'an, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang},
  journal={ArXiv},
  year={2023}
}