Peter Fogh a7cdcf1e7a
Added link to DVC tutorial in README
5 years ago

DVC and Dask use case

This repository contains the description and code for setting up DVC to use a remote compute server through Dask. Note that this use case relies on the original DVC tutorial and its code, found here

How to set up the use case

The use case has the following prerequisites:

  1. A remote server with:
    1. SSH installed.
    2. A unix user you have the username and password for.
    3. A folder for your remote shared DVC cache; mine is at /scratch/dvc_data_cache/.
    4. A folder for your remote DVC data directories; mine is at /scratch/dvc_users/[REMOTE_USERNAME]/.
    5. Dask scheduler installed and open at port 8786.
  2. A local SSH keyfile (ssh-keygen), which has been copied to the remote server: ssh-copy-id [REMOTE_USERNAME]@[REMOTE_IP].
  3. An open SSH port-forward to the Dask scheduler from your local machine to the remote machine: ssh -L 8786:localhost:8786 [REMOTE_USERNAME]@[REMOTE_IP].
  4. Set up a local DVC development repository with a conda environment:
    1. Fork on Github.
    2. git clone<GITHUB_USERNAME>/dvc.git
    3. cd dvc
    4. conda create -n py36_open_source_dvc python=3.6
    5. conda activate py36_open_source_dvc
    6. pip install -r requirements.txt
    7. pip install -r tests/requirements.txt
    8. pip install -e .
    9. pip install pre-commit
    10. pre-commit install
    11. which dvc should say [HOME]/anaconda3/envs/py36_open_source_dvc/bin/dvc and dvc --version should say the exact version available in your local DVC development repository.
  5. Configure your local DVC globally for your local machine. Note that I call my remote server "ahsoka":
    1. conda activate py36_open_source_dvc
    2. dvc config false --global
    3. dvc remote add ahsoka ssh://[REMOTE_IP]/scratch/dvc_users/[REMOTE_USERNAME]/ --global
    4. dvc remote modify ahsoka user [REMOTE_USERNAME] --global
    5. dvc remote modify ahsoka port 22 --global
    6. dvc remote modify ahsoka keyfile [PATH_TO_YOUR_PUBLIC_SSH_KEY] --global
    7. dvc remote add ahsoka_cache ssh://[REMOTE_IP]/scratch/dvc_data_cache --global
    8. dvc remote modify ahsoka_cache user [REMOTE_USERNAME] --global
    9. dvc remote modify ahsoka_cache port 22 --global
    10. dvc remote modify ahsoka_cache keyfile [PATH_TO_YOUR_PUBLIC_SSH_KEY] --global
    11. dvc config cache.ssh ahsoka_cache --global
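
Before running anything that talks to Dask, it can help to confirm that the SSH port-forward from prerequisite 3 is actually up. The following is a minimal sketch using only the Python standard library; the function name is_port_open is hypothetical, and it assumes the forwarded scheduler port is exposed locally on localhost:8786. It only checks that something accepts TCP connections on that port, not that the listener is a Dask scheduler.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; we close it right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: the SSH port-forward exposes the scheduler on localhost:8786.
    if is_port_open("localhost", 8786):
        print("Port 8786 is reachable; the port-forward appears to be up.")
    else:
        print("Port 8786 is not reachable; check the SSH port-forward.")
```

Once the port is reachable, a Dask client can typically be connected with Client("tcp://localhost:8786") from dask.distributed.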

This use case of DVC and Dask has been set up as follows:

  1. On the remote server, do the following to create the remote DVC data directory for this project (i.e. this use case):
    1. cd /scratch/dvc_users/[REMOTE_USERNAME]
    2. mkdir dvc_dask_use_case
    3. mkdir data
    4. wget -P data/
    5. tar zxf data/Posts.xml.tgz -C data/
  2. Clone this test repository from my Github: git clone
  3. Install the Conda environment for this repository. Note that the new environment must point to your local DVC development repository:
    1. conda env create -f conda_env.yml, which was created by the following commands (executed on 16-03-2019):
      1. conda create --name py36_open_source_dvc_dask_use_case --clone py36_open_source_dvc
      2. conda install -n py36_open_source_dvc_dask_use_case dask scikit-learn
      3. conda env export -n py36_open_source_dvc_dask_use_case > conda_env.yml
    2. Check that the dvc version matches your development repository version: conda activate py36_open_source_dvc && which dvc && dvc --version and conda activate py36_open_source_dvc_dask_use_case && which dvc && dvc --version
  4. Reproduce the DVC pipeline with dvc repro. The pipeline has been specified by the following DVC stages:
    1. conda activate py36_open_source_dvc_dask_use_case
    2. dvc run -d code/ -d code/ -d remote://ahsoka/dvc_dask_use_case/data/Posts.xml -o remote://ahsoka/dvc_dask_use_case/data/Posts.tsv python code/xml_to_tsv.p
    3. dvc run -d code/ -d code/ -d remote://ahsoka/dvc_dask_use_case/data/Posts.tsv -o remote://ahsoka/dvc_dask_use_case/data/Posts-test.tsv -o remote://ahsoka/dvc_dask_use_case/data/Posts-train.tsv python code/ 0.33 20180319
    4. dvc run -d code/ -d code/ -d remote://ahsoka/dvc_dask_use_case/data/Posts-train.tsv -d remote://ahsoka/dvc_dask_use_case/data/Posts-test.tsv -o remote://ahsoka/dvc_dask_use_case/data/matrix-train.p -o remote://ahsoka/dvc_dask_use_case/data/matrix-test.p python code/
    5. dvc run -d code/ -d code/ -d remote://ahsoka/dvc_dask_use_case/data/matrix-train.p -o remote://ahsoka/dvc_dask_use_case/data/model.p python code/ 20180319
    6. dvc run -d code/ -d code/ -d remote://ahsoka/dvc_dask_use_case/data/model.p -d remote://ahsoka/dvc_dask_use_case/data/matrix-test.p -m eval.txt -f Dvcfile python code/
  5. Show the DVC metrics: dvc metrics show -a.
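
The stages above all follow the same shape: dependencies passed with -d, outputs with -o, an optional metrics file with -m, an optional stage file name with -f, and then the command to run. As a rough illustration of that shape (not the way this repository defines its stages), the sketch below assembles such a dvc run argument list in Python. The helper names remote_path and build_dvc_run, and the script name code/train_model.py, are hypothetical placeholders; the script names in the listing above are not shown in full, so the real names may differ.

```python
import shlex
from typing import List, Optional, Sequence

# Prefix taken from the stage definitions above: DVC remote "ahsoka",
# project folder "dvc_dask_use_case", sub-folder "data".
REMOTE_DATA = "remote://ahsoka/dvc_dask_use_case/data"


def remote_path(filename: str) -> str:
    """Build a remote:// path for a file in the project's remote data folder."""
    return f"{REMOTE_DATA}/{filename}"


def build_dvc_run(
    deps: Sequence[str],
    outs: Sequence[str],
    command: Sequence[str],
    metrics: Sequence[str] = (),
    stage_file: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for one `dvc run` stage."""
    args = ["dvc", "run"]
    for dep in deps:
        args += ["-d", dep]      # stage dependency
    for out in outs:
        args += ["-o", out]      # stage output
    for metric in metrics:
        args += ["-m", metric]   # metrics file
    if stage_file is not None:
        args += ["-f", stage_file]  # explicit stage file name
    return args + list(command)


if __name__ == "__main__":
    # Hypothetical script name; only the data paths and the seed come from the README.
    cmd = build_dvc_run(
        deps=["code/train_model.py", remote_path("matrix-train.p")],
        outs=[remote_path("model.p")],
        command=["python", "code/train_model.py", "20180319"],
    )
    # Print a shell-quoted version for inspection; nothing is executed here.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The sketch only prints the command, so it can be reviewed before being run by hand or passed to subprocess.run.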

