text_utils.py 1.0 KB

# IPA Phonemizer: https://github.com/bootphon/phonemizer
_pad = "$"
_punctuation = ';:,.!?¡¿—…"«»“” '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
_vokan_symbols = "()♪😂💨😮‍🥱😱😡😭"

# Export all symbols:
symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa) + list(_vokan_symbols)

# Map each symbol to its index. If a symbol appears more than once,
# the last index wins, as in the original loop.
dicts = {symbol: i for i, symbol in enumerate(symbols)}


class TextCleaner:
    """Convert text into a list of symbol indexes."""

    def __init__(self, dummy=None):
        self.word_index_dictionary = dicts
        print(len(dicts))  # number of distinct symbols in the table

    def __call__(self, text):
        indexes = []
        for char in text:
            try:
                indexes.append(self.word_index_dictionary[char])
            except KeyError:
                # Unknown characters are skipped; report which one and where.
                print(f"Unknown character {char!r} in text: {text}")
        return indexes
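The character-to-index lookup above can be illustrated with a minimal, self-contained sketch. It assumes a much smaller symbol table than the real one, and the `encode` helper and `symbol_to_index` name are hypothetical, not part of text_utils.py. Unlike `TextCleaner.__call__`, it returns the unknown characters instead of printing them, which may be easier to test.

```python
# Reduced symbol table for illustration only (assumption: NOT the full table
# from text_utils.py, so the indexes differ from the real ones).
_pad = "$"
_punctuation = ";:,.!? "
_letters = "abc"

symbols = [_pad] + list(_punctuation) + list(_letters)
symbol_to_index = {s: i for i, s in enumerate(symbols)}


def encode(text):
    """Return (indexes, unknown): indexes of known characters, plus the
    characters that are missing from the table (hypothetical helper)."""
    indexes, unknown = [], []
    for char in text:
        if char in symbol_to_index:
            indexes.append(symbol_to_index[char])
        else:
            unknown.append(char)
    return indexes, unknown


indexes, unknown = encode("ab, c!z")
print(indexes)  # [8, 9, 3, 7, 10, 5]
print(unknown)  # ['z']
```

Returning the unknown characters leaves it to the caller to decide whether to warn, substitute a placeholder, or fail, whereas the original class skips them after printing.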