Peter Fogh ddede7a848
Moved code and data on ahsoka to the main folder and renamed the .dvc files to the stage python file.
5 years ago

README.md


DVC and Dask use case

This repository contains the description and code for setting up DVC to run a pipeline on a remote compute server using Dask. Note that this use case relies on the original DVC tutorial and its code, found at https://dvc.org/doc/tutorial.

How to set up the use case

Prerequisites

The use case has the following prerequisites:

  1. A remote server with:
    1. SSH installed.
    2. A unix user you have the username and password for.
    3. A folder for your remote shared DVC cache; mine is at /scratch/dvc_data_cache/.
    4. A folder for your remote DVC data directories; mine is at /scratch/dvc_users/[REMOTE_USERNAME]/.
    5. A Dask scheduler installed and listening on port 8786.
  2. A local SSH key pair (created with ssh-keygen) whose public key has been copied to the remote server: ssh-copy-id [REMOTE_USERNAME]@[REMOTE_IP].
  3. An open SSH port-forward to the Dask scheduler from your local machine to the remote machine: ssh -L 8786:localhost:8786 [REMOTE_USERNAME]@[REMOTE_IP]. Here localhost is resolved on the remote side, so the local port 8786 is forwarded to port 8786 on the remote server.
  4. Set up local DVC development repository (following https://dvc.org/doc/user-guide/contributing/) with a conda environment:
    1. Fork https://github.com/iterative/dvc on Github.
    2. git clone git@github.com:<GITHUB_USERNAME>/dvc.git
    3. cd dvc
    4. conda create -n py36_open_source_dvc python=3.6
    5. conda activate py36_open_source_dvc
    6. pip install -r requirements.txt
    7. pip install -r tests/requirements.txt
    8. pip install -e .
    9. pip install pre-commit
    10. pre-commit install
    11. which dvc should say [HOME]/anaconda3/envs/py36_open_source_dvc/bin/dvc and dvc --version should say the exact version available in your local DVC development repository.
  5. Configure DVC globally on your local machine. Note that I call my remote server "ahsoka":
    1. conda activate py36_open_source_dvc
    2. dvc remote add ahsoka ssh://[REMOTE_IP]/scratch/dvc_users/[REMOTE_USERNAME]/ --global
    3. dvc remote modify ahsoka user [REMOTE_USERNAME] --global
    4. dvc remote modify ahsoka port 22 --global
    5. dvc remote modify ahsoka keyfile [PATH_TO_YOUR_PRIVATE_SSH_KEY] --global
    6. dvc remote add ahsoka_cache ssh://[REMOTE_IP]/scratch/dvc_data_cache --global
    7. dvc remote modify ahsoka_cache user [REMOTE_USERNAME] --global
    8. dvc remote modify ahsoka_cache port 22 --global
    9. dvc remote modify ahsoka_cache keyfile [PATH_TO_YOUR_PRIVATE_SSH_KEY] --global
    10. dvc config cache.ssh ahsoka_cache --global
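
Before running anything against the scheduler, it may help to confirm that the SSH port-forward from the prerequisites is actually open. The following is a minimal sketch using only the Python standard library; the function name scheduler_port_is_open is hypothetical, and the check only verifies that something accepts TCP connections on the port, not that it is a Dask scheduler.

```python
import socket


def scheduler_port_is_open(host="localhost", port=8786, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only tests reachability of the forwarded port,
    it does not speak the Dask protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if scheduler_port_is_open():
        print("Port 8786 is reachable; the SSH port-forward appears to be open.")
    else:
        print("Port 8786 is not reachable; check the ssh -L command.")
```

If the check fails, the most likely causes are that the ssh -L session has been closed or that the scheduler is not running on the remote server.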

Use case

This use case of DVC and Dask has been set up as follows.

On your remote server do the following:

  1. To create the remote DVC data directory for this project (i.e. this use case):
    1. cd /scratch/dvc_users/[REMOTE_USERNAME]
    2. mkdir dvc_dask_use_case
    3. wget -P ./dvc_dask_use_case/ https://s3-us-west-2.amazonaws.com/dvc-share/so/100K/Posts.xml.tgz
    4. tar zxf ./dvc_dask_use_case/Posts.xml.tgz -C ./dvc_dask_use_case/

On your local machine do the following:

  1. Clone this test repository from my Github: git clone git@github.com:PeterFogh/dvc_dask_use_case.git
  2. Install the Conda environment for this repository. Note that the new environment must point to your local DVC development repository:
    1. conda env create -f conda_env.yml, which was created by the following commands (executed on 16-03-2019):
      1. conda create --name py36_open_source_dvc_dask_use_case --clone py36_open_source_dvc
      2. conda install -n py36_open_source_dvc_dask_use_case dask scikit-learn
      3. conda env export -n py36_open_source_dvc_dask_use_case > conda_env.yml
    2. Check that the dvc version matches your development repository version in both environments: conda activate py36_open_source_dvc && which dvc && dvc --version and conda activate py36_open_source_dvc_dask_use_case && which dvc && dvc --version
  3. Reproduce the DVC pipeline with dvc repro. The pipeline was specified by the following DVC stages:
    1. conda activate py36_open_source_dvc_dask_use_case
    2. dvc run -d xml_to_tsv.py -d conf.py -d remote://ahsoka/dvc_dask_use_case/Posts.xml -o remote://ahsoka/dvc_dask_use_case/Posts.tsv -f xml_to_tsv.dvc python xml_to_tsv.py
    3. dvc run -d split_train_test.py -d conf.py -d remote://ahsoka/dvc_dask_use_case/Posts.tsv -o remote://ahsoka/dvc_dask_use_case/Posts-test.tsv -o remote://ahsoka/dvc_dask_use_case/Posts-train.tsv -f split_train_test.dvc python split_train_test.py
    4. dvc run -d featurization.py -d conf.py -d remote://ahsoka/dvc_dask_use_case/Posts-train.tsv -d remote://ahsoka/dvc_dask_use_case/Posts-test.tsv -o remote://ahsoka/dvc_dask_use_case/matrix-train.p -o remote://ahsoka/dvc_dask_use_case/matrix-test.p -f featurization.dvc python featurization.py
    5. dvc run -d train_model.py -d conf.py -d remote://ahsoka/dvc_dask_use_case/matrix-train.p -o remote://ahsoka/dvc_dask_use_case/model.p -f train_model.dvc python train_model.py
    6. dvc run -d evaluate.py -d conf.py -d remote://ahsoka/dvc_dask_use_case/model.p -d remote://ahsoka/dvc_dask_use_case/matrix-test.p -m eval.txt -f Dvcfile python evaluate.py
  4. Show the DVC metrics: dvc metrics show -a.
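
The -d and -o options in the dvc run commands above link the stages into a dependency graph: a stage depends on another stage when one of its -d paths is an -o path of that stage. The following sketch writes that graph out by hand and derives an order in which the stages could be run. It is only an illustration of the idea, not DVC's implementation; the STAGES dictionary and the execution_order function are names made up for this example, and the function assumes the graph has no cycles.

```python
# Stage name -> names of the stages whose outputs it depends on,
# read off the -d and -o options of the dvc run commands above.
STAGES = {
    "xml_to_tsv": set(),
    "split_train_test": {"xml_to_tsv"},
    "featurization": {"split_train_test"},
    "train_model": {"featurization"},
    "evaluate": {"train_model", "featurization"},
}


def execution_order(stages):
    """Return the stage names so that every stage comes after its dependencies.

    Depth-first traversal; assumes the dependency graph is acyclic.
    """
    order = []
    done = set()

    def visit(name):
        if name in done:
            return
        for dependency in sorted(stages[name]):
            visit(dependency)
        done.add(name)
        order.append(name)

    for name in sorted(stages):
        visit(name)
    return order


if __name__ == "__main__":
    print(execution_order(STAGES))
    # ['xml_to_tsv', 'split_train_test', 'featurization', 'train_model', 'evaluate']
```

The resulting order matches the order in which the stages were defined above, ending with the evaluate stage that writes the eval.txt metric.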