text_utils.py 1.1 KB

# IPA Phonemizer: https://github.com/bootphon/phonemizer
_pad = "$"
_punctuation = ';:,.!?¡¿—…"«»“” '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
_vokan_symbols = "()♪😂💨😮‍🥱😱😡😭1234567890`-"
_ligature = "͡"

# Export all symbols:
symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa) + list(_vokan_symbols) + list(_ligature)

# Map each symbol to its position in `symbols`
# (a symbol that appears more than once keeps its last index).
dicts = {symbol: i for i, symbol in enumerate(symbols)}


class TextCleaner:
    def __init__(self, dummy=None):
        self.word_index_dictionary = dicts
        print(len(dicts))

    def __call__(self, text):
        indexes = []
        for char in text:
            try:
                indexes.append(self.word_index_dictionary[char])
            except KeyError:
                # Unknown character: skip it and print the whole input text.
                print(text)
        return indexes
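
The lookup performed by `TextCleaner.__call__` can be sketched on a much smaller table. The block below is a minimal illustration, not the file's own code: the three-letter alphabet and the names `index_of` and `encode` are hypothetical, and unknown characters are skipped silently here, whereas the class above also prints the input text when it meets one.

```python
# Reduced, hypothetical symbol table for illustration only; the real file
# builds a much larger one from punctuation, Latin letters and IPA symbols.
symbols = ["$"] + list(" ,.") + list("abc")

# Same idea as `dicts` above: each symbol maps to its position in the list.
index_of = {symbol: i for i, symbol in enumerate(symbols)}


def encode(text):
    """Map each known character to its index; skip characters not in the table."""
    return [index_of[ch] for ch in text if ch in index_of]


print(encode("a b,c"))  # every character is in the table
print(encode("abz"))    # "z" is not in the table, so it is dropped
```

Because index 0 is reserved for the pad symbol `$`, every ordinary character in this sketch encodes to a non-zero index.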