DVC and Dask use case

This repository contains the description and code for setting up DVC to use a remote compute server via Dask. Note that this use case relies on the original DVC tutorial and its code, found at https://dvc.org/doc/tutorial.

How to set up the use case

The use case has the following prerequisites:

  1. A remote server with:
    1. SSH installed.
    2. A Unix user for which you have the username and password.
    3. A folder for your remote shared DVC cache; mine is at /scratch/dvc_data_cache/.
    4. A folder for your remote DVC data directories; mine is at /scratch/dvc_users/[REMOTE_USERNAME]/.
    5. A Dask scheduler installed and listening on port 8786.
  2. A local SSH key pair (created with ssh-keygen) whose public key has been copied to the remote server: ssh-copy-id [REMOTE_USERNAME]@[REMOTE_IP].
  3. An open SSH port forward from your local machine to the Dask scheduler on the remote machine: ssh -L 8786:localhost:8786 [REMOTE_USERNAME]@[REMOTE_IP] (a small connectivity check is sketched right after this list).
  4. Set up a local DVC development repository (following https://dvc.org/doc/user-guide/contributing/) with a conda environment:
    1. Fork https://github.com/iterative/dvc on Github.
    2. git clone git@github.com:<GITHUB_USERNAME>/dvc.git
    3. cd dvc
    4. conda create -n py36_open_source_dvc python=3.6
    5. conda activate py36_open_source_dvc
    6. pip install -r requirements.txt
    7. pip install -r tests/requirements.txt
    8. pip install -e .
    9. pip install pre-commit
    10. pre-commit install
    11. which dvc should print [HOME]/anaconda3/envs/py36_open_source_dvc/bin/dvc, and dvc --version should print the exact version available in your local DVC development repository.
  5. Configure your local DVC installation globally on your local machine; note that I call my remote server "ahsoka":
    1. conda activate py36_open_source_dvc
    2. dvc config core.analytics false --global
    3. dvc remote add ahsoka ssh://[REMOTE_IP]/scratch/dvc_users/[REMOTE_USERNAME]/ --global
    4. dvc remote modify ahsoka user [REMOTE_USERNAME] --global
    5. dvc remote modify ahsoka port 22 --global
    6. dvc remote modify ahsoka keyfile [PATH_TO_YOUR_PUBLIC_SSH_KEY] --global
    7. dvc remote add ahsoka_cache ssh://[REMOTE_IP]/scratch/dvc_data_cache --global
    8. dvc remote modify ahsoka_cache user [REMOTE_USERNAME] --global
    9. dvc remote modify ahsoka_cache port 22 --global
    10. dvc remote modify ahsoka_cache keyfile [PATH_TO_YOUR_PUBLIC_SSH_KEY] --global
    11. dvc config cache.ssh ahsoka_cache --global
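With the prerequisites in place, you can check that the SSH port forward (prerequisite 3) and the remote Dask scheduler (prerequisite 1.5) are reachable before running anything with DVC. Below is a minimal sketch of such a check, assuming the dask and distributed packages are installed in your local environment; it is not part of this repository:

```python
# Minimal connectivity check: connect to the remote Dask scheduler through the
# SSH tunnel on localhost:8786 and print what the scheduler reports.
from dask.distributed import Client

client = Client("localhost:8786", timeout=10)  # fails fast if the tunnel or scheduler is down
info = client.scheduler_info()
print("Scheduler address:", info["address"])
print("Number of workers:", len(info["workers"]))
client.close()
```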

This use case of DVC and Dask has been set up as follows:

  1. On the remote server, do the following to create the remote DVC data directory for this project (i.e., this use case):
    1. cd /scratch/dvc_users/[REMOTE_USERNAME]
    2. mkdir dvc_dask_use_case
    3. mkdir data
    4. wget -P data/ https://s3-us-west-2.amazonaws.com/dvc-share/so/100K/Posts.xml.tgz
    5. tar zxf data/Posts.xml.tgz -C data/
  2. Clone this test repository from my GitHub: git clone git@github.com:PeterFogh/dvc_dask_use_case.git
  3. Install the conda environment for this repository; note that the new environment must point to your local DVC development repository:
    1. conda env create -f conda_env.yml, which has been created by the following commands (executed on 2019-03-16):
      1. conda create --name py36_open_source_dvc_dask_use_case --clone py36_open_source_dvc
      2. conda install -n py36_open_source_dvc_dask_use_case dask scikit-learn
      3. conda env export -n py36_open_source_dvc_dask_use_case > conda_env.yml
    2. Check that the dvc version matches your development repository version: conda activate py36_open_source_dvc && which dvc && dvc --version and conda activate py36_open_source_dvc_dask_use_case && which dvc && dvc --version
  4. Reproduce the DVC pipeline with dvc repro. The pipeline has been specified by the following DVC stages (an illustrative sketch of such a Dask-based stage follows after this list):
    1. conda activate py36_open_source_dvc_dask_use_case
    2. dvc run -d code/xml_to_tsv.py -d code/conf.py -d remote://ahsoka/dvc_dask_use_case/data/Posts.xml -o remote://ahsoka/dvc_dask_use_case/data/Posts.tsv python code/xml_to_tsv.py
    3. dvc run -d code/split_train_test.py -d code/conf.py -d remote://ahsoka/dvc_dask_use_case/data/Posts.tsv -o remote://ahsoka/dvc_dask_use_case/data/Posts-test.tsv -o remote://ahsoka/dvc_dask_use_case/data/Posts-train.tsv python code/split_train_test.py 0.33 20180319
    4. dvc run -d code/featurization.py -d code/conf.py -d remote://ahsoka/dvc_dask_use_case/data/Posts-train.tsv -d remote://ahsoka/dvc_dask_use_case/data/Posts-test.tsv -o remote://ahsoka/dvc_dask_use_case/data/matrix-train.p -o remote://ahsoka/dvc_dask_use_case/data/matrix-test.p python code/featurization.py
    5. dvc run -d code/train_model.py -d code/conf.py -d remote://ahsoka/dvc_dask_use_case/data/matrix-train.p -o remote://ahsoka/dvc_dask_use_case/data/model.p python code/train_model.py 20180319
    6. dvc run -d code/evaluate.py -d code/conf.py -d remote://ahsoka/dvc_dask_use_case/data/model.p -d remote://ahsoka/dvc_dask_use_case/data/matrix-test.p -m eval.txt -f Dvcfile python code/evaluate.py
  5. Show the DVC metrics: dvc metrics show -a.
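To make the division of labour concrete, here is a rough, hypothetical sketch of what a Dask-based stage such as code/split_train_test.py could look like. It is not the actual script from this repository; the paths, the column handling, and the single_file output option (which requires a reasonably recent Dask) are assumptions for illustration only:

```python
# Hypothetical stage sketch: DVC tracks the remote:// inputs and outputs, while
# the computation runs on the remote Dask workers reached through the SSH
# tunnel on localhost:8786.
import sys

import dask.dataframe as dd
from dask.distributed import Client


def main(test_ratio, seed):
    # All dataframe operations below are executed by the remote workers.
    client = Client("localhost:8786")

    # Paths as seen from the remote workers, inside the DVC-managed data directory.
    base = "/scratch/dvc_users/[REMOTE_USERNAME]/dvc_dask_use_case/data"
    posts = dd.read_csv(f"{base}/Posts.tsv", sep="\t")

    # Split into train and test sets with a fixed seed for reproducibility.
    train, test = posts.random_split([1 - test_ratio, test_ratio], random_state=seed)

    train.to_csv(f"{base}/Posts-train.tsv", sep="\t", index=False, single_file=True)
    test.to_csv(f"{base}/Posts-test.tsv", sep="\t", index=False, single_file=True)

    client.close()


if __name__ == "__main__":
    main(float(sys.argv[1]), int(sys.argv[2]))
```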