topic_model.py 754 B

#!/usr/bin/env python
# coding: utf-8
# %%
# Use TF-IDF embeddings to train a topic model
from bertopic import BERTopic
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load in data
df = pd.read_csv('data/tweets.csv')

# Drop tweets not in English
df = df.loc[df['language'] == 'en']

# Strip URLs. regex=True is needed because pandas >= 2.0 treats the
# pattern as a literal string by default.
df['tweet'] = df['tweet'].str.replace(r'http\S+', '', regex=True).str.strip()

# Drop tweets that are empty once URLs are removed
df = df.loc[df['tweet'] != '']
docs = df['tweet'].reset_index(drop=True)

# Create vectorizer; min_df=5 ignores terms found in fewer than 5 tweets
vectorizer = TfidfVectorizer(min_df=5)
embeddings = vectorizer.fit_transform(docs)

# Train our topic model, passing the sparse TF-IDF matrix as the embeddings
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)

# Save model
topic_model.save('project_BERTopic')
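
The URL-cleaning step in the script can be checked in isolation before running the full pipeline. The following is a minimal sketch using only the standard library `re` module with the same `http\S+` pattern. The helper name `strip_urls` and the sample tweets are hypothetical and not part of the original script. Whitespace trimming is included so that link-only tweets come out as empty strings.

```python
import re

# Same pattern the script uses to strip links from tweets.
URL_PATTERN = re.compile(r"http\S+")


def strip_urls(text: str) -> str:
    """Remove http/https links and trim leftover surrounding whitespace."""
    return URL_PATTERN.sub("", text).strip()


# Hypothetical sample tweets, not taken from data/tweets.csv.
sample_tweets = [
    "Great read https://example.com/post",
    "http://example.com",
    "no links here",
]

cleaned = [strip_urls(t) for t in sample_tweets]

# Drop tweets that are empty once links are removed, mirroring the script.
non_empty = [t for t in cleaned if t != ""]
```

Here the second sample tweet consists only of a link, so it is reduced to an empty string and filtered out, which is the case the `df['tweet'] != ''` filter is meant to catch.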