split_data.py 1.2 KB

# Split the raw dataset into train and test sets
# and save them in the data/processed folder.
import os
import argparse

import pandas as pd
from sklearn.model_selection import train_test_split

from get_data import read_params


def split_and_saved_data(config_path):
    # Read the split settings and file paths from the config file.
    config = read_params(config_path)
    test_data_path = config["split_data"]["test_path"]
    train_data_path = config["split_data"]["train_path"]
    raw_data_path = config["load_data"]["raw_dataset_csv"]
    split_ratio = config["split_data"]["test_size"]
    random_state = config["base"]["random_state"]

    # Load the raw CSV and split it; random_state makes the split reproducible.
    df = pd.read_csv(raw_data_path, sep=",")
    train, test = train_test_split(df, test_size=split_ratio, random_state=random_state)

    # Store each part at its configured path, without the DataFrame index column.
    test.to_csv(test_data_path, sep=",", index=False, encoding="utf-8")
    train.to_csv(train_data_path, sep=",", index=False, encoding="utf-8")


# The main guard lets this file be run on its own to check that it executes.
if __name__ == "__main__":
    args = argparse.ArgumentParser()
    # Path to the config file; defaults to params.yaml.
    args.add_argument("--config", default="params.yaml")
    parsed_args = args.parse_args()
    split_and_saved_data(config_path=parsed_args.config)
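
The script reads five settings from the config: `base.random_state`, `load_data.raw_dataset_csv`, and `split_data.train_path`, `split_data.test_path`, `split_data.test_size`. The following is a minimal sketch of a pre-flight check on those keys, assuming the config has already been loaded into a nested dict. The helper name `missing_config_keys` and the example values are hypothetical and not part of this repository; the real `params.yaml` is not shown here.

```python
# Keys that split_data.py reads from the loaded config, as (section, key) pairs.
REQUIRED_KEYS = [
    ("base", "random_state"),
    ("load_data", "raw_dataset_csv"),
    ("split_data", "train_path"),
    ("split_data", "test_path"),
    ("split_data", "test_size"),
]


def missing_config_keys(config):
    """Return the dotted names of required keys absent from the config dict."""
    missing = []
    for section, key in REQUIRED_KEYS:
        section_values = config.get(section)
        if not isinstance(section_values, dict) or key not in section_values:
            missing.append(f"{section}.{key}")
    return missing


# Hypothetical example values (assumption): paths and numbers are illustrative only.
example_config = {
    "base": {"random_state": 42},
    "load_data": {"raw_dataset_csv": "data/raw/dataset.csv"},
    "split_data": {
        "train_path": "data/processed/train.csv",
        "test_path": "data/processed/test.csv",
        "test_size": 0.2,
    },
}

print(missing_config_keys(example_config))
```

Running a check like this before `split_and_saved_data` would turn a `KeyError` deep inside the function into a readable list of missing settings.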