merge_crawled_data.py 992 B

"""
Merge the CSV files in the crawled_data folder into a single export file.
"""
from pathlib import Path
import re

import pandas as pd

# Project root: one directory above this script.
home_path = Path(__file__).parents[1]


def merge_all_data(folder_path, website):
    dfs = []
    for file_path in Path(folder_path).iterdir():
        clsnm = file_path.name  # Cluster name, taken from the file name
        df = pd.read_csv(file_path)
        df["cluster"] = re.sub(r"\.csv$", "", clsnm)
        dfs.append(df)
    # Concatenate all DataFrames into one
    combined_df = pd.concat(dfs, ignore_index=True)
    export_path = home_path.joinpath("crawled_data/export/ALL")
    export_path.mkdir(parents=True, exist_ok=True)
    combined_df.to_csv(export_path.joinpath(f"{website}.csv"), index=False)


if __name__ == "__main__":
    # Relative path of the input folder
    website = "TCI_20240507"
    folder_path = home_path.joinpath(f"crawled_data/export/{website}")
    # Call the merge function
    merge_all_data(folder_path, website)