descriptive_analysis.py
# ---
# jupyter:
#   jupytext:
#     formats: py:light
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.5'
#     jupytext_version: 1.13.6
#   kernelspec:
#     display_name: Python [conda env:text-data-class]
#     language: python
#     name: conda-env-text-data-class-py
# ---
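# The header above is jupytext "light" pairing metadata: this .py file
# round-trips with a notebook. A usage sketch, assuming the jupytext CLI
# is installed in the environment:
#   jupytext --to notebook descriptive_analysis.py   # materialize the .ipynb
#   jupytext --sync descriptive_analysis.py          # keep .py and .ipynb in sync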
import pandas as pd
import re
import nltk
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
import yaml
import janitor as pj  # importing janitor registers its methods (e.g. filter_column_isin) on DataFrame
import matplotlib.pyplot as plt
import numpy as np
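# One-time NLTK data setup this script relies on (a sketch; these are the
# standard NLTK corpus identifiers, uncomment on a fresh environment):
# nltk.download('punkt')      # tokenizer models behind word_tokenize
# nltk.download('stopwords')  # English stopword list used below
# nltk.download('wordnet')    # lexicon backing WordNetLemmatizer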
# +
## Load up data
df = pd.read_feather("data/cleaned_speeches")
# +
## Bring in parameters
with open("params.yaml", "r") as fd:
    params = yaml.safe_load(fd)
punctuation = params['preprocessing']['punctuation']
stopwords = params['preprocessing']['stopwords'] + nltk.corpus.stopwords.words('english')
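# The params.yaml shape the two lookups above assume (an illustration
# inferred from the keys read here, not the project's actual file):
#   preprocessing:
#     punctuation: ['.', ',', ';', '!', '?']
#     stopwords: ['applause', 'laughter']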
# +
## Tokenize and explode
df['speech'] = df.speech.apply(nltk.tokenize.word_tokenize)
df = df.explode('speech').reset_index()
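# explode() turns the list-of-tokens column into one row per token while
# repeating the other columns, e.g. (made-up values):
#   id  speech                        id  speech
#   0   ['we', 'the', 'people']  ->   0   'we'
#                                     0   'the'
#                                     0   'people'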
# +
## Remove stopwords (to capture compounds like the United Nations)
df = df.filter_column_isin('speech',
                           stopwords,
                           complement=True)
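# filter_column_isin(..., complement=True) keeps the rows whose 'speech'
# value is NOT in the list; equivalent to df[~df['speech'].isin(stopwords)].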
# +
## Lemmatize! We don't need tense, focus is on the topics we're covering,
## and this way we'll reduce the likelihood that we're conflating meanings.
wnl = WordNetLemmatizer()
df['speech'] = [wnl.lemmatize(w) for w in df.speech]  # one lemma per exploded token
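# Note: lemmatize() defaults to pos='n', so verb forms ("running", "said")
# pass through unchanged. A POS-aware call would look like
# wnl.lemmatize(w, pos=wn.VERB); the wn alias imported above provides the
# constants (wn.NOUN, wn.VERB, wn.ADJ, wn.ADV).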
# +
## Remove stopwords and punctuation
df = df.filter_column_isin('speech',
                           stopwords,
                           complement=True)
df = df.filter_column_isin('speech',
                           punctuation,
                           complement=True)
## Peek at the 30 most frequent remaining tokens
df.speech.value_counts().head(30)
# +
## Keep only the columns downstream stages need and persist
df = df[['id', 'speaker', 'date', 'title', 'speech', 'decade']].reset_index(drop=True)
df.to_feather('data/descriptive')
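# to_feather() requires a default RangeIndex (pandas raises a ValueError on a
# non-default index), hence the reset_index(drop=True) above. Downstream
# stages can reload the result with pd.read_feather('data/descriptive').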