download_xml.py 1.2 KB

"""
Download XML data.

Routine Listings
----------------
get_params()
    Get the DVC stage parameters.
download_xml(output_folder_path)
    Download the XML data archive and extract it into the output folder.
"""
import pathlib
import tarfile

import dask
import dask.distributed
import requests

import conf


def get_params():
    """Get the DVC stage parameters."""
    return {}


@dask.delayed
def download_xml(output_folder_path):
    """Download XML data file."""
    url = 'https://s3-us-west-2.amazonaws.com/dvc-share/so/100K/Posts.xml.tgz'
    tgz_file_path = output_folder_path/url.split('/')[-1]
    r = requests.get(url=url, stream=True)
    # Fail loudly on a bad response instead of trying to open a missing archive.
    r.raise_for_status()
    with open(tgz_file_path, 'wb') as f:
        for chunk in r.iter_content(1024):
            f.write(chunk)
    # The context manager closes the archive once extraction finishes.
    with tarfile.open(tgz_file_path) as tf:
        tf.extractall(path=output_folder_path)


if __name__ == '__main__':
    client = dask.distributed.Client('localhost:8786')
    # str.strip('.py') removes any leading or trailing '.', 'p', 'y'
    # characters rather than the suffix, so take the file stem instead.
    dvc_stage_name = pathlib.Path(__file__).stem
    STAGE_OUTPUT_PATH = conf.data_dir/dvc_stage_name
    conf.remote_mkdir(STAGE_OUTPUT_PATH).compute()
    OUTPUT_DATASET_TSV_PATH = STAGE_OUTPUT_PATH/'Posts.xml'
    download_xml(STAGE_OUTPUT_PATH).compute()