
1 - DVC Basics

1. DVC's Core Feature Is Data Versioning

Data versioning is enabled by replacing large files, dataset directories, machine learning models, etc. with small metafiles (easy to handle with Git). These placeholders point to the original data, which is decoupled from source code management.

[Figure: DVC principle]

Let's try it! Run the following Python code:

import pandas as pd
import numpy as np

N = int(1e6)

pd.DataFrame({"col_a": range(N), "col_b": np.random.random(N)}).to_csv(
    "big_csv_file.csv"
)

The file big_csv_file.csv is about 30 MB. Let's add it with DVC:

dvc add big_csv_file.csv

[Figure: dvc add]

The dvc add command creates or updates two files:

  1. .gitignore: lists big_csv_file.csv so that it is not tracked with Git;
  2. big_csv_file.csv.dvc: the DVC metafile containing a reference to the file. This one should be tracked with Git (git add).

If you cat big_csv_file.csv.dvc:

outs:
- md5: 15c5bb5aecb5fb33e43291322514d1da  # Your hash will differ (the data is random)
  size: 33047824
  path: big_csv_file.csv

You can check the hash with md5sum big_csv_file.csv (or md5 on macOS).
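The same check can be done from Python. This is a minimal sketch, assuming the recorded md5 is the digest of the file's raw bytes (which is what md5sum reports); `file_md5` and `describe` are hypothetical helper names, not part of DVC.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe(path):
    """Collect the three fields shown in the metafile: md5, size and path."""
    return {"md5": file_md5(path), "size": os.path.getsize(path), "path": path}
```

Calling `describe("big_csv_file.csv")` should give values you can compare by eye with the md5 and size fields in big_csv_file.csv.dvc.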

Track the data:

git add big_csv_file.csv.dvc .gitignore
git commit -m "Track big_csv_file.csv with DVC"

Useful commands:

  • dvc status big_csv_file.csv: checks whether big_csv_file.csv is up to date with the metafile big_csv_file.csv.dvc
  • dvc commit big_csv_file.csv: updates the metafile (try it after modifying the CSV file)
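To see what these two commands amount to conceptually, here is a rough Python sketch. It does not call DVC: `status` and `commit` are hypothetical stand-ins that compare a file's current hash with the recorded one and refresh the recorded fields.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Hex MD5 digest of a file's raw bytes (fine for small demo files)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def commit(path):
    """Return fresh metafile-like fields for `path` (stand-in for updating the .dvc file)."""
    return {"md5": file_md5(path), "size": os.path.getsize(path), "path": path}


def status(path, meta):
    """Compare the file on disk with the recorded hash."""
    if not os.path.exists(path):
        return "deleted"
    return "up to date" if file_md5(path) == meta["md5"] else "modified"


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.csv")
        with open(path, "w") as f:
            f.write("col_a,col_b\n0,0.5\n")
        meta = commit(path)
        before = status(path, meta)
        with open(path, "a") as f:  # modify the tracked file
            f.write("1,0.7\n")
        after_edit = status(path, meta)
        meta = commit(path)  # record the new hash
        return before, after_edit, status(path, meta)


print(demo())  # ('up to date', 'modified', 'up to date')
```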

2. The Local Cache Directory

What if we "lose" the CSV file? rm big_csv_file.csv

  • recover the file: dvc checkout big_csv_file.csv.dvc

[Figure: dvc checkout]

What's happening under the hood:

"The DVC cache is a content-addressable storage": files are stored under a key derived from their content, so files with different names but identical content are cached only once.

cp big_csv_file.csv big_csv_file_bis.csv
dvc add big_csv_file_bis.csv
git add big_csv_file_bis.csv.dvc .gitignore
git commit -m "Track another file with DVC"
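The idea of a content-addressable cache can be sketched in a few lines of Python. This is an illustration only: the cache layout used here (first two hex characters as a subdirectory) and the helper names are assumptions, and files are plainly copied, which may differ from how DVC links files in practice.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path):
    """Hex MD5 digest of a file's raw bytes."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def add_to_cache(path, cache_dir):
    """Copy `path` into the cache under a name derived from its content hash."""
    digest = md5_of(path)
    # Illustrative layout: <cache>/<first 2 hex chars>/<remaining chars>
    target = os.path.join(cache_dir, digest[:2], digest[2:])
    if not os.path.exists(target):  # identical content is stored only once
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copyfile(path, target)
    return digest


def checkout(digest, cache_dir, dest):
    """Restore a workspace file from the cache, given the hash from its metafile."""
    shutil.copyfile(os.path.join(cache_dir, digest[:2], digest[2:]), dest)


def count_cached(cache_dir):
    """Number of objects currently stored in the cache."""
    return sum(len(files) for _, _, files in os.walk(cache_dir))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        cache = os.path.join(tmp, "cache")
        a = os.path.join(tmp, "big_csv_file.csv")
        b = os.path.join(tmp, "big_csv_file_bis.csv")
        with open(a, "w") as f:
            f.write("col_a,col_b\n0,0.1\n")
        shutil.copyfile(a, b)
        h_a = add_to_cache(a, cache)
        h_b = add_to_cache(b, cache)
        os.remove(a)             # "lose" the file
        checkout(h_a, cache, a)  # recover it from the cache
        return h_a == h_b, count_cached(cache), os.path.exists(a)


print(demo())  # (True, 1, True)
```

Both files hash to the same key, so the cache holds a single object, and that object is enough to restore either file.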

It works with directories!

mkdir big_dir
mv big_csv_file*.csv big_dir
dvc add big_dir
git add big_dir.dvc .gitignore
git commit -m "Track a directory with DVC"

If you cat big_dir.dvc:

outs:
- md5: 20756b033f98bc1767c575a0e8e15417.dir  # Your hash will differ
  size: 66095648
  nfiles: 2
  path: big_dir

If you look at 20756b033f98bc1767c575a0e8e15417.dir in the cache:

[
  {
    "md5": "15c5bb5aecb5fb33e43291322514d1da",
    "relpath": "big_csv_file.csv"
  },
  {
    "md5": "15c5bb5aecb5fb33e43291322514d1da",
    "relpath": "big_csv_file_bis.csv"
  }
]
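A listing like this one can be rebuilt by hand. The sketch below walks a directory and records each file's md5 and relative path; the way the listing itself is hashed (`manifest_id`) is an assumption for illustration and is not expected to reproduce DVC's .dir hash.

```python
import hashlib
import json
import os
import tempfile


def md5_of(path):
    """Hex MD5 digest of a file's raw bytes."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def build_manifest(directory):
    """List every file under `directory` with its content hash and relative path."""
    entries = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, directory).replace(os.sep, "/")
            entries.append({"md5": md5_of(full), "relpath": rel})
    return sorted(entries, key=lambda e: e["relpath"])


def manifest_id(manifest):
    # Assumption: hash of a JSON dump; DVC's exact serialization may differ.
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest() + ".dir"


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        big_dir = os.path.join(tmp, "big_dir")
        os.makedirs(big_dir)
        for name in ("big_csv_file.csv", "big_csv_file_bis.csv"):
            with open(os.path.join(big_dir, name), "w") as f:
                f.write("col_a,col_b\n0,0.1\n")
        manifest = build_manifest(big_dir)
        return manifest, manifest_id(manifest)


manifest, dir_id = demo()
print(json.dumps(manifest, indent=2))
print(dir_id)
```

Since the two files have the same content, both entries carry the same md5, just as in the listing above.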

The .dvcignore file

There are files you'll never want to track, e.g., .DS_Store files (macOS).

touch big_dir/.DS_Store  # Some file you do not want to track
dvc status big_dir

[Figure: dvc status]

The .dvcignore file works like .gitignore: adding .DS_Store to it does the trick.
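As a rough picture of what an ignore file does, the sketch below filters a file list with Python's fnmatch. It only approximates simple patterns like .DS_Store or *.tmp; the full .gitignore-style syntax (negation, directory-only patterns, `**`) is not covered, and the helper names are hypothetical.

```python
import fnmatch


def load_patterns(text):
    """Parse ignore-file text: skip blank lines and # comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_ignored(relpath, patterns):
    """True if the file name or the full relative path matches any pattern."""
    name = relpath.rsplit("/", 1)[-1]
    return any(
        fnmatch.fnmatch(name, p) or fnmatch.fnmatch(relpath, p) for p in patterns
    )


patterns = load_patterns("# macOS metadata\n.DS_Store\n")
files = [
    "big_dir/big_csv_file.csv",
    "big_dir/big_csv_file_bis.csv",
    "big_dir/.DS_Store",
]
tracked = [f for f in files if not is_ignored(f, patterns)]
print(tracked)  # ['big_dir/big_csv_file.csv', 'big_dir/big_csv_file_bis.csv']
```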

3. The Remote Storage

  • remotes are configured in the DVC config file (.dvc/config)
  • to update the local cache from remote storage: dvc fetch
  • dvc pull = dvc fetch + dvc checkout
  • to push to remote storage: dvc push
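How these commands relate to each other can be sketched with toy in-memory stores. Nothing here talks to a real remote: `Store`, `transfer` and the function names are hypothetical, and the only point is that pull is a fetch (remote to cache) followed by a checkout (cache to workspace).

```python
import hashlib


class Store:
    """A toy content-addressable store: maps md5 hex digest -> bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, data):
        digest = hashlib.md5(data).hexdigest()
        self.objects[digest] = data
        return digest


def transfer(src, dst, digests):
    """Copy the objects missing in `dst`; return how many were copied."""
    copied = 0
    for d in digests:
        if d not in dst.objects:
            dst.objects[d] = src.objects[d]
            copied += 1
    return copied


def push(cache, remote, digests):
    return transfer(cache, remote, digests)  # local cache -> remote storage


def fetch(cache, remote, digests):
    return transfer(remote, cache, digests)  # remote storage -> local cache


def checkout(cache, workspace, metafiles):
    for path, digest in metafiles.items():  # local cache -> workspace
        workspace[path] = cache.objects[digest]


def pull(cache, remote, workspace, metafiles):
    fetch(cache, remote, metafiles.values())
    checkout(cache, workspace, metafiles)


# Machine A: track a file, then push its object to the remote.
remote = Store()
cache_a = Store()
digest = cache_a.put(b"col_a,col_b\n0,0.1\n")
metafiles = {"big_csv_file.csv": digest}  # path -> hash, as recorded in a .dvc file
pushed = push(cache_a, remote, metafiles.values())

# Machine B: empty cache and workspace; only the metafiles came through Git.
cache_b, workspace_b = Store(), {}
pull(cache_b, remote, workspace_b, metafiles)
print(pushed, workspace_b["big_csv_file.csv"] == cache_a.objects[digest])  # 1 True
```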

Next: 2 - Pipelines
