Let's write a simple pipeline: generate a `huge_csv_file.csv` with a fixed seed for reproducibility.

Create a `generate_huge_csv_file.py` script:
```python
import hashlib

import numpy as np
import pandas as pd

# 1. Inputs
YOUR_NAME = "antoinet"  # Should be unique in the audience
N_DATA = int(1e6)
HUGE_FILE_NAME = "huge_csv_file.csv"

# 2. Generate the file
seed = int(hashlib.md5(YOUR_NAME.encode()).hexdigest(), 16) % 2**32
np.random.seed(seed)
data_df = pd.DataFrame({"col_a": range(N_DATA), "col_b": np.random.random(N_DATA)})

# 3. Write result
data_df.to_csv(HUGE_FILE_NAME, index=False)
```
Execute the script and check the resulting file:

```bash
python generate_huge_csv_file.py
md5sum huge_csv_file.csv
```
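Why hash your name? It gives every attendee a distinct yet fully deterministic seed. A standalone sketch to convince yourself (not part of the workshop files; `seed_for` is just an illustrative helper):

```python
import hashlib

def seed_for(name: str) -> int:
    # Same derivation as in generate_huge_csv_file.py: MD5 the name and
    # keep the low 32 bits so the value fits np.random.seed.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % 2**32

assert seed_for("antoinet") == seed_for("antoinet")  # stable across runs
assert seed_for("antoinet") != seed_for("alice")     # distinct per name (with high probability)
```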
`git add` / `git commit` the script.

Now, let's execute the script with DVC!
Have a look at the available options with `dvc run --help`, then create the first stage:

```bash
dvc run \
  --name generate_huge_csv_file \
  --deps generate_huge_csv_file.py \
  --outs huge_csv_file.csv \
  python generate_huge_csv_file.py
```
It creates:
- `dvc.yaml`: description of the pipeline
- `dvc.lock`: metafile for the deps and outs of the pipeline
- `.gitignore`: adds the outs files to avoid them being tracked with git

Save the pipeline execution:
```bash
git add dvc.yaml dvc.lock .gitignore
git commit -m "Add my first pipeline"
```
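For reference, the `dvc.yaml` generated by the `dvc run` call above should look roughly like this (a sketch based on the flags used; the exact layout may vary with your DVC version):

```yaml
stages:
  generate_huge_csv_file:
    cmd: python generate_huge_csv_file.py
    deps:
    - generate_huge_csv_file.py
    outs:
    - huge_csv_file.csv
```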
:bulb: Bonus: enable auto staging with `dvc config core.autostage true`.
Create a `compute_metrics.py` script:
```python
import json

import pandas as pd

# 1. Inputs
HUGE_FILE_NAME = "huge_csv_file.csv"
METRICS_FILE_NAME = "metrics.json"

# 2. Open input files
data_df = pd.read_csv(HUGE_FILE_NAME)

# 3. Process the data
huge_metrics = 10e3 * (0.5 - data_df.col_b.mean())

# 4. Write result
with open(METRICS_FILE_NAME, "w") as metrics_file:
    json.dump({"metrics": huge_metrics}, metrics_file)
```
Execute the script:

```bash
python compute_metrics.py
```
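Since `col_b` is drawn uniformly from [0, 1), its mean should be close to 0.5, so the metric should land close to 0. A rough back-of-the-envelope check (a standalone sketch, not part of the workshop files):

```python
# The standard error of the mean of N uniform samples is 1/sqrt(12 * N):
# with N = 1e6 that is about 2.9e-4, so 10e3 * (0.5 - mean) typically
# stays within a few units of 0.
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
sample = rng.random(1_000_000)
print(10e3 * (0.5 - sample.mean()))  # usually a single-digit value
```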
`git add` / `git commit` the script, then add a second stage:

```bash
dvc run \
  --name compute_metrics \
  --deps compute_metrics.py \
  --deps huge_csv_file.csv \
  --outs metrics.json \
  python compute_metrics.py
```
Save the new stage:
```bash
git add dvc.yaml dvc.lock .gitignore
git commit -m "Add compute_metrics stage to the pipeline"
```
Some useful ways to reproduce the pipeline:

```bash
dvc repro                                         # re-run stages whose deps changed
dvc repro --force                                 # re-run everything, even if up to date
dvc repro dvc.yaml:compute_metrics                # re-run one stage and its upstream deps
dvc repro --single-item dvc.yaml:compute_metrics  # re-run only this stage
```
Create a `params.yaml` file:

```yaml
huge_file_name: huge_csv_file.csv
metrics_file_name: metrics.json
your_name: antoinet
n_data: 1000000
```
To inject it in the scripts, create a `params.py` file:

```python
import yaml

with open("params.yaml") as params_file:
    params = yaml.safe_load(params_file)

YOUR_NAME = params["your_name"]
N_DATA = params["n_data"]
HUGE_FILE_NAME = params["huge_file_name"]
METRICS_FILE_NAME = params["metrics_file_name"]
```
Then inject the parameters in the scripts:

`generate_huge_csv_file.py`:

```python
import hashlib

import numpy as np
import pandas as pd

# 1. Inputs
from params import YOUR_NAME, N_DATA, HUGE_FILE_NAME

# 2. Generate the file
seed = int(hashlib.md5(YOUR_NAME.encode()).hexdigest(), 16) % 2**32
np.random.seed(seed)
data_df = pd.DataFrame({"col_a": range(N_DATA), "col_b": np.random.random(N_DATA)})

# 3. Write result
data_df.to_csv(HUGE_FILE_NAME, index=False)
```
`compute_metrics.py`:

```python
import json

import pandas as pd

# 1. Inputs
from params import HUGE_FILE_NAME, METRICS_FILE_NAME

# 2. Open input files
data_df = pd.read_csv(HUGE_FILE_NAME)

# 3. Process the data
huge_metrics = 10e3 * (0.5 - data_df.col_b.mean())

# 4. Write result
with open(METRICS_FILE_NAME, "w") as metrics_file:
    json.dump({"metrics": huge_metrics}, metrics_file)
```
Then, we need to modify the `dvc.yaml` file:
- add a `params` key to the stages: when a parameter changes, it triggers the execution of the stage, like a regular `deps`;
- template the `outs` filenames, so that the single source of truth is the `params.yaml` file.

```yaml
stages:
  generate_huge_csv_file:
    cmd: python generate_huge_csv_file.py
    params:
      - huge_file_name
      - your_name
      - n_data
    deps:
      - generate_huge_csv_file.py
    outs:
      - ${huge_file_name}
  compute_metrics:
    cmd: python compute_metrics.py
    params:
      - huge_file_name
      - metrics_file_name
    deps:
      - compute_metrics.py
      - ${huge_file_name}
    outs:
      - ${metrics_file_name}
```
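To double-check how the two stages are wired together, you can print the dependency graph (assuming your DVC version ships the `dvc dag` command):

```bash
dvc dag
```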
Save the parametrized pipeline:

```bash
git add params.yaml params.py dvc.yaml *.py
git commit -m "Add parameters to the pipeline"
```
Run `dvc repro`, then `git add` / `git commit` the changes.

Compare parameter values between the workspace and the last commit:

```bash
dvc params diff
```
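To see a parameter change ripple through the pipeline, you can try something like this (a hypothetical walk-through; the `sed -i` syntax below assumes GNU sed):

```bash
# Shrink the dataset: both stages should re-run, because
# generate_huge_csv_file depends on n_data and compute_metrics
# depends on the regenerated CSV.
sed -i 's/^n_data: .*/n_data: 100000/' params.yaml
dvc repro
dvc params diff   # should report the n_data change against HEAD
```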
Modify the `compute_metrics` stage in `dvc.yaml`, replacing the `outs` key with `metrics` so DVC tracks the JSON file as a metrics file:
```yaml
stages:
  generate_huge_csv_file: (...)
  compute_metrics:
    cmd: python compute_metrics.py
    params:
      - huge_file_name
      - metrics_file_name
    deps:
      - compute_metrics.py
      - ${huge_file_name}
    metrics:
      - ${metrics_file_name}
```
`git add` / `git commit` the change, then inspect the metrics:

```bash
dvc metrics show   # display the current metrics
dvc metrics diff   # compare metrics with the last Git commit
```
DVC has two concepts for metrics, which represent different results of machine learning training or data processing: scalar metrics (`dvc metrics`, as above) and plots (`dvc plots`).
Plot the huge file:

```bash
dvc plots show -x col_a -y col_b huge_csv_file.csv
```
Compare plots:

```bash
dvc plots diff -t smooth -x col_a -y col_b --targets huge_csv_file.csv -- HEAD^
```
The file `huge_csv_file.csv` is probably too big: reduce the `n_data` parameter!
Delete the lock file and the outputs, then reproduce:

```bash
rm dvc.lock huge_csv_file.csv metrics.json
dvc repro
```

The outputs come back without the commands actually re-executing: DVC restores them from the run cache in `.dvc/cache/runs/`.
- `deps` and `params` are declarative: do not forget them, or the cache might be corrupted.
- Use `dvc repro --no-run-cache` to force the commands to run again.
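As a final sanity check, you can verify that the restored file is bit-for-bit identical to the one you generated earlier (same idea as the `md5sum` check at the top of this section):

```bash
md5sum huge_csv_file.csv   # should print the same checksum as before the rm
```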
Next: :three: Experiments