Peter Fogh d621076749
Added Initial README file
5 years ago

DVC and Dask setup test

This repository contains the code for setting up DVC to use a remote compute server through Dask.

Setup

The setup has the following prerequisites:

  1. A remote server with:
    1. SSH installed.
    2. A Unix user whose username and password you know.
    3. A folder for your remote DVC cache; mine is at /scratch/dvc_data_cache/.
    4. A folder for the remote shared DVC cache; mine is at /scratch/dvc_users/[REMOTE_USERNAME]/.
    5. A Dask scheduler installed and listening on port 8786.
  2. A local SSH key pair (created with ssh-keygen) whose public key has been copied to the remote server: ssh-copy-id [REMOTE_USERNAME]@[REMOTE_IP].
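  3. An open SSH port forward from your local machine to the Dask scheduler on the remote machine: ssh -L 8786:localhost:8786 [REMOTE_USERNAME]@[REMOTE_IP]. The forwarding target (here localhost) is resolved on the remote side and must be a plain host name without a user@ prefix.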
  4. Set up a local DVC development repository (following https://dvc.org/doc/user-guide/contributing/) with a conda environment:
    1. Fork https://github.com/iterative/dvc on GitHub.
    2. git clone git@github.com:<GITHUB_USERNAME>/dvc.git
    3. cd dvc
    4. conda create -n py36_open_source_dvc python=3.6
    5. conda activate py36_open_source_dvc
    6. pip install -r requirements.txt
    7. pip install -r tests/requirements.txt
    8. pip install -e .
    9. pip install pre-commit
    10. pre-commit install
    11. conda install -n py36_open_source_dvc dask
    12. which dvc should print [HOME]/anaconda3/envs/py36_open_source_dvc/bin/dvc, and dvc --version should print the exact version available in your local DVC development repository.
  5. Configure DVC globally on your local machine. Note that I call my remote server "ahsoka":
    1. conda activate py36_open_source_dvc
    2. dvc config core.analytics false --global
    3. dvc remote add ahsoka ssh://[REMOTE_IP]/scratch/dvc_users/[REMOTE_USERNAME]/ --global
    4. dvc remote modify ahsoka user [REMOTE_USERNAME] --global
    5. dvc remote modify ahsoka port 22 --global
    6. dvc remote modify ahsoka keyfile [PATH_TO_YOUR_PRIVATE_SSH_KEY] --global
    7. dvc remote add ahsoka_cache ssh://[REMOTE_IP]/scratch/dvc_data_cache --global
    8. dvc remote modify ahsoka_cache user [REMOTE_USERNAME] --global
    9. dvc remote modify ahsoka_cache port 22 --global
    10. dvc remote modify ahsoka_cache keyfile [PATH_TO_YOUR_PRIVATE_SSH_KEY] --global
    11. dvc config cache.ssh ahsoka_cache --global
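
Step 5 repeats the same four commands for each remote, so it can be wrapped in a small shell helper. The following is a minimal sketch, not part of the original repository: the function names run, configure_remote and configure_all, the DRY_RUN switch, and the KEYFILE variable are assumptions made for illustration. By default it only prints the commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Sketch only: helper names, variables, and defaults are hypothetical.
REMOTE_IP="${REMOTE_IP:-[REMOTE_IP]}"
REMOTE_USERNAME="${REMOTE_USERNAME:-[REMOTE_USERNAME]}"
KEYFILE="${KEYFILE:-[PATH_TO_YOUR_SSH_KEY]}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = execute them

# Print a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# configure_remote NAME PATH: register an SSH remote and its credentials.
configure_remote() {
    name="$1"
    path="$2"
    run dvc remote add "$name" "ssh://${REMOTE_IP}${path}" --global
    run dvc remote modify "$name" user "$REMOTE_USERNAME" --global
    run dvc remote modify "$name" port 22 --global
    run dvc remote modify "$name" keyfile "$KEYFILE" --global
}

# Issue the same commands as step 5, in the same order.
configure_all() {
    run dvc config core.analytics false --global
    configure_remote ahsoka "/scratch/dvc_users/${REMOTE_USERNAME}/"
    configure_remote ahsoka_cache "/scratch/dvc_data_cache"
    run dvc config cache.ssh ahsoka_cache --global
}

configure_all
```

To apply the configuration rather than print it, set the variables in the environment and run the script with DRY_RUN=0 inside the activated conda environment.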

The setup of this DVC and Dask test was performed as follows:

  1. On the remote server do the following:
    1. cd /scratch/dvc_users/[REMOTE_USERNAME]
    2. mkdir dvc_test_classify
    3. mkdir data
    4. wget -P data/ https://s3-us-west-2.amazonaws.com/dvc-share/so/100K/Posts.xml.tgz
    5. tar zxf data/Posts.xml.tgz -C data/
  2. Clone this test repository from my GitHub: git clone git@github.com:PeterFogh/dvc_test_classify.git
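
The download and extraction commands of step 1 can be made safe to re-run on the remote server. The sketch below is an illustration under stated assumptions: the DATA_DIR variable, the helper names archive_path and fetch_and_extract, and the --run switch are hypothetical and not part of the repository. It skips the download when the archive is already present.

```shell
#!/bin/sh
# Sketch only: DATA_DIR, the helper names, and --run are illustrative.
DATA_DIR="${DATA_DIR:-data}"
ARCHIVE_URL="https://s3-us-west-2.amazonaws.com/dvc-share/so/100K/Posts.xml.tgz"

# archive_path URL DIR: the path where `wget -P DIR URL` stores the download.
archive_path() {
    echo "${2%/}/$(basename "$1")"
}

# Download the archive if it is missing, then unpack it into DATA_DIR.
fetch_and_extract() {
    archive="$(archive_path "$ARCHIVE_URL" "$DATA_DIR")"
    mkdir -p "$DATA_DIR"
    [ -f "$archive" ] || wget -P "$DATA_DIR" "$ARCHIVE_URL"
    tar zxf "$archive" -C "$DATA_DIR"
}

# Only act when explicitly asked to, so sourcing the file has no side effects.
if [ "${1:-}" = "--run" ]; then
    fetch_and_extract
fi
```

Saved under a name of your choice, the script would be run with the --run argument from the directory that should contain data/.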