

1 - DVC Basics

1. DVC's core feature is data versioning

Data versioning is enabled by replacing large files, dataset directories, machine learning models, etc. with small metafiles (easy to handle with Git). These placeholders point to the original data, which is decoupled from source code management.

[Figure: DVC principle]

Let's try it:

import numpy as np
import pandas as pd

N = int(1e6)  # one million rows

# Write a CSV with a running integer column and a column of random floats.
pd.DataFrame({"col_a": range(N), "col_b": np.random.random(N)}).to_csv(
    "huge_csv_file.csv"
)

The file huge_csv_file.csv is roughly 30 MB. Let's add it with DVC:

dvc add huge_csv_file.csv

[Figure: dvc add]

It creates two files:

  1. .gitignore, so that huge_csv_file.csv is not tracked by Git;
  2. huge_csv_file.csv.dvc, a DVC metafile containing a reference to the file. This one should be tracked with Git (git add).

If you cat huge_csv_file.csv.dvc:

outs:
- md5: 15c5bb5aecb5fb33e43291322514d1da  # your hash will differ (the data is random)
  size: 33047824
  path: huge_csv_file.csv

You can check the hash with md5sum huge_csv_file.csv (or md5 on macOS).
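The same check can be done from Python. The sketch below streams the file through hashlib.md5 in chunks, so the whole file never has to sit in memory. The helper name file_md5 and the chunk size are illustrative choices, not part of DVC, and the result is only expected to match the metafile under the assumption (implied by the md5sum comparison above) that the hash is taken over the raw bytes of the file.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    target = Path("huge_csv_file.csv")
    if target.exists():
        # Compare this value with the md5 field of huge_csv_file.csv.dvc.
        print(file_md5(target))
```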

2. The Local Cache Directory

What if we "lose" the CSV file (rm huge_csv_file.csv)?

  • recover the file with: dvc checkout huge_csv_file.csv.dvc

[Figure: dvc checkout]

What's happening under the hood:

"The DVC cache is a content-addressable storage": files are stored under a key derived from their content, so files with different names but the same content are cached only once.

cp huge_csv_file.csv huge_csv_file_bis.csv
dvc add huge_csv_file_bis.csv
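To make the idea of content-addressable storage concrete, here is a toy cache in Python. It is a sketch of the general technique, not DVC's implementation: the functions add_to_cache and checkout are hypothetical names, and the two-character prefix layout is an assumption modeled on how such caches are commonly organized.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_to_cache(path, cache_dir):
    """Copy a file into a toy content-addressable cache and return its hash."""
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    # The storage location depends only on the content hash, never on the name.
    target = Path(cache_dir) / digest[:2] / digest[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)
    return digest


def checkout(digest, cache_dir, destination):
    """Restore a file from the toy cache, given only its hash."""
    shutil.copyfile(Path(cache_dir) / digest[:2] / digest[2:], destination)


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    cache = work / "cache"
    (work / "a.csv").write_text("col_a,col_b\n0,0.5\n")
    shutil.copyfile(work / "a.csv", work / "a_bis.csv")

    # Two names, one content: both calls resolve to the same cache entry.
    hash_a = add_to_cache(work / "a.csv", cache)
    hash_bis = add_to_cache(work / "a_bis.csv", cache)
    cached_files = [p for p in cache.rglob("*") if p.is_file()]
    print(hash_a == hash_bis, len(cached_files))

    # "Lose" the file, then restore it from the cache using its hash.
    (work / "a.csv").unlink()
    checkout(hash_a, cache, work / "a.csv")
    restored_ok = (work / "a.csv").read_text() == (work / "a_bis.csv").read_text()
    print(restored_ok)
```

The toy version reads each file fully into memory, which is fine for a demonstration but not for files of the size used in this tutorial.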

It works with directories!

mkdir huge_dir
mv huge_csv_file*.csv huge_dir
dvc add huge_dir

If you cat huge_dir.dvc:

outs:
- md5: 20756b033f98bc1767c575a0e8e15417.dir  # your hash will differ (the data is random)
  size: 66095648
  nfiles: 2
  path: huge_dir

If you look at 20756b033f98bc1767c575a0e8e15417.dir in the cache:

[
  {
    "md5": "15c5bb5aecb5fb33e43291322514d1da",
    "relpath": "huge_csv_file.csv"
  },
  {
    "md5": "15c5bb5aecb5fb33e43291322514d1da",
    "relpath": "huge_csv_file_bis.csv"
  }
]
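A listing of this shape can be rebuilt in a few lines of Python, which may help when inspecting a tracked directory by hand. The function build_manifest below is a hypothetical helper, and the sketch only reproduces the md5/relpath entries; how DVC derives the .dir hash from the listing is not covered here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(directory):
    """List every file under `directory` with its MD5 and relative path."""
    root = Path(directory)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append(
            {
                "md5": hashlib.md5(path.read_bytes()).hexdigest(),
                "relpath": path.relative_to(root).as_posix(),
            }
        )
    return entries


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "huge_dir"
    root.mkdir()
    # Small stand-ins for the two identical CSV files of the tutorial.
    (root / "huge_csv_file.csv").write_text("col_a,col_b\n0,0.5\n")
    (root / "huge_csv_file_bis.csv").write_text("col_a,col_b\n0,0.5\n")

    manifest = build_manifest(root)
    print(json.dumps(manifest, indent=2))
```

Because both files have the same content, both entries carry the same md5, just as in the listing above.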

The .dvcignore file

There are files you will never want to track, e.g. .DS_Store files (macOS).

touch huge_dir/.DS_Store  # Some file you do not want to track
dvc status huge_dir

[Figure: dvc status]

The .dvcignore file works like .gitignore: adding a .DS_Store line to it does the trick.
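If you set up projects from a script, the pattern can be added programmatically. The helper below is a minimal sketch (add_ignore_pattern is a hypothetical name): it appends a pattern to an ignore file only when that exact line is not already there, so running it twice is harmless. It is demonstrated on a temporary directory; in a real project you would point it at the .dvcignore at the repository root.

```python
import tempfile
from pathlib import Path


def add_ignore_pattern(pattern, ignore_file=".dvcignore"):
    """Append `pattern` to the ignore file unless it is already listed."""
    path = Path(ignore_file)
    existing = path.read_text().splitlines() if path.exists() else []
    if pattern in existing:
        return False  # nothing to do, the pattern is already present
    existing.append(pattern)
    path.write_text("\n".join(existing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    ignore_file = Path(tmp) / ".dvcignore"
    first = add_ignore_pattern(".DS_Store", ignore_file)
    second = add_ignore_pattern(".DS_Store", ignore_file)
    final_content = ignore_file.read_text()
    print(first, second, repr(final_content))
```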

3. The Remote Storage

  • the remote is declared in the DVC config file (see "The config file" in the DVC documentation)
  • to update the local cache from remote storage: dvc fetch
  • dvc pull = dvc fetch + dvc checkout
  • to push to remote storage: dvc push
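The relationship between these commands can be pictured with three directories: the remote storage, the local cache, and the workspace. The toy model below is only a sketch of that data flow, not of how DVC works internally; every function name, the flat cache layout, and the "abc123" hash are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def sync(source, destination):
    """Copy every file from `source` that `destination` does not have yet."""
    copied = []
    for path in sorted(p for p in Path(source).rglob("*") if p.is_file()):
        target = Path(destination) / path.relative_to(source)
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(path, target)
            copied.append(target.name)
    return copied


def push(cache, remote):
    return sync(cache, remote)  # local cache -> remote storage


def fetch(remote, cache):
    return sync(remote, cache)  # remote storage -> local cache


def checkout(cache, workspace, tracked):
    # `tracked` maps a workspace file name to the hash of its content.
    for name, digest in tracked.items():
        shutil.copyfile(Path(cache) / digest, Path(workspace) / name)


def pull(remote, cache, workspace, tracked):
    # pull = fetch + checkout
    fetch(remote, cache)
    checkout(cache, workspace, tracked)


with tempfile.TemporaryDirectory() as tmp:
    remote, cache, workspace = (Path(tmp) / n for n in ("remote", "cache", "ws"))
    for folder in (remote, cache, workspace):
        folder.mkdir()
    (remote / "abc123").write_text("col_a,col_b\n")  # hypothetical hash as file name

    pull(remote, cache, workspace, {"data.csv": "abc123"})
    restored = (workspace / "data.csv").read_text()
    pushed_again = push(cache, remote)  # the remote already has everything
    print(repr(restored), pushed_again)
```

Because objects are addressed by hash, both push and fetch only need to transfer what the other side is missing.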

Next: 2 - Pipelines
