Data versioning is enabled by replacing large files, dataset directories, machine learning models, etc. with small metafiles (easy to handle with Git). These placeholders point to the original data, which is decoupled from source code management.
Let's try it! Run the following Python code:
import pandas as pd
import numpy as np
N = int(1e6)
pd.DataFrame({"col_a": range(N), "col_b": np.random.random(N)}).to_csv(
"big_csv_file.csv"
)
The file big_csv_file.csv is about 30 MB. Let's add it with DVC:
dvc add big_csv_file.csv
It creates two files:
- .gitignore, so that big_csv_file.csv is not tracked with Git;
- big_csv_file.csv.dvc, a DVC metafile containing a reference to the file. This one should be tracked with Git (git add).

If you cat big_csv_file.csv.dvc:
outs:
- md5: 15c5bb5aecb5fb33e43291322514d1da  # Your hash will differ
  size: 33047824
  path: big_csv_file.csv
You can check the hash with md5sum big_csv_file.csv (or md5 on macOS).
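The same check can be sketched in Python with the standard library. This assumes the value stored in the metafile is the plain MD5 of the file's bytes, which is what md5sum reports; file_md5 is a helper name made up for this example, not part of DVC.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for large files.
    file_md5 is a hypothetical helper, not a DVC function.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_md5("big_csv_file.csv") should then print the same value as the md5 field of big_csv_file.csv.dvc, under the assumption above.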
Track the data:
git add big_csv_file.csv.dvc .gitignore
git commit -m "Track big_csv_file.csv with DVC"
Useful commands:
- dvc status big_csv_file.csv: checks whether the file big_csv_file.csv is up to date with the metafile big_csv_file.csv.dvc
- dvc commit big_csv_file.csv: updates the metafile (try it by modifying the CSV file)

What if we "lose" the CSV file?
rm big_csv_file.csv
dvc checkout big_csv_file.csv.dvc
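To get an intuition for what a status check has to decide, here is a minimal sketch. It is not DVC's implementation: the function name status, its return strings, and the idea of passing the recorded hash by hand are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def status(data_path, recorded_md5):
    """Compare a data file with the hash recorded in its metafile.

    Toy check only (hypothetical helper, not DVC code). The whole file is
    read into memory, which is fine for a sketch but not for huge files.
    """
    path = Path(data_path)
    if not path.exists():
        return "deleted"
    current_md5 = hashlib.md5(path.read_bytes()).hexdigest()
    return "up to date" if current_md5 == recorded_md5 else "modified"
```

In this picture, "modified" is the case where you would update the metafile, and "deleted" is the case where you would restore the file from the cache.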
What's happening under the hood:
"The DVC cache is a content-addressable storage": this means that files with different names but the same content are cached only once.
cp big_csv_file.csv big_csv_file_bis.csv
dvc add big_csv_file_bis.csv
git add big_csv_file_bis.csv.dvc .gitignore
git commit -m "Track another file with DVC"
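The content-addressable idea can be sketched as a toy cache in which each object is stored under the hash of its content. ToyCache, its flat directory layout, and its method names are assumptions for this example; the real DVC cache is organized differently.

```python
import hashlib
import shutil
from pathlib import Path


class ToyCache:
    """Toy content-addressable store (hypothetical, not DVC's cache)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def add(self, path):
        """Copy a file into the cache unless its content is already there."""
        md5 = hashlib.md5(Path(path).read_bytes()).hexdigest()
        target = self.root / md5  # the object name is the content hash
        if not target.exists():
            shutil.copyfile(path, target)
        return md5

    def checkout(self, md5, path):
        """Restore a file into the workspace from its content hash."""
        shutil.copyfile(self.root / md5, path)
```

Adding big_csv_file.csv and its copy big_csv_file_bis.csv to such a cache returns the same hash twice and leaves a single object on disk, because the file name plays no role in where the content is stored.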
It works with directories!
mkdir big_dir
mv big_csv_file*.csv big_dir
dvc add big_dir
git add big_dir.dvc .gitignore
git commit -m "Track a directory with DVC"
If you cat big_dir.dvc:
outs:
- md5: 20756b033f98bc1767c575a0e8e15417.dir  # Your hash will differ
  size: 66095648
  nfiles: 2
  path: big_dir
If you look at 20756b033f98bc1767c575a0e8e15417.dir in the cache:
[
  {
    "md5": "15c5bb5aecb5fb33e43291322514d1da",
    "relpath": "big_csv_file.csv"
  },
  {
    "md5": "15c5bb5aecb5fb33e43291322514d1da",
    "relpath": "big_csv_file_bis.csv"
  }
]
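A listing with this shape can be rebuilt with a few lines of Python. This is only a sketch of the idea (one entry per file, each with its content hash and relative path); dir_manifest is a made-up name, and how DVC actually builds and hashes the .dir object is not covered here.

```python
import hashlib
from pathlib import Path


def dir_manifest(directory):
    """List every file under `directory` with its MD5 and relative path.

    Hypothetical helper mirroring the shape of the listing above.
    """
    root = Path(directory)
    entries = []
    # Sort for a stable order regardless of filesystem listing order
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append(
            {
                "md5": hashlib.md5(path.read_bytes()).hexdigest(),
                "relpath": path.relative_to(root).as_posix(),
            }
        )
    return entries
```

Since big_csv_file_bis.csv is a copy of big_csv_file.csv, dir_manifest("big_dir") returns two entries that share the same md5 and differ only in relpath.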
.dvcignore file

There are files you'll never want to track, e.g., .DS_Store files (macOS).
touch big_dir/.DS_Store # Some file you do not want to track
dvc status big_dir
The .dvcignore file works like .gitignore: adding .DS_Store to it does the trick.
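The effect of an ignore file can be illustrated with a small filter. This is a simplified sketch: it only matches patterns against base names using shell-style wildcards, whereas real ignore files support richer pattern rules. The names is_ignored and tracked_files are assumptions for this example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(relpath, patterns):
    """Return True if the file's base name matches any ignore pattern.

    Simplified on purpose: directory-specific and negated patterns
    are not handled.
    """
    name = PurePosixPath(relpath).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def tracked_files(relpaths, patterns):
    """Keep only the paths that no ignore pattern matches."""
    return [p for p in relpaths if not is_ignored(p, patterns)]
```

With patterns set to [".DS_Store"], big_dir/.DS_Store is filtered out while the two CSV files remain.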
dvc fetch
dvc pull = dvc fetch + dvc checkout
dvc push
Next: 2. Pipelines