Let's write a simple pipeline: generate huge_csv_file.csv
- with a fixed seed for reproducibility

Create a generate_huge_csv_file.py script:
import hashlib
import pandas as pd
import numpy as np
# 1. Inputs
YOUR_NAME = "antoinet" # Should be unique in the audience
N = int(1e6)
HUGE_FILE_NAME = "huge_csv_file.csv"
# 2. Generate the file
seed = int(hashlib.md5(YOUR_NAME.encode()).hexdigest(), 16) % 2**32
np.random.seed(seed)
data_df = pd.DataFrame({"col_a": range(N), "col_b": np.random.random(N)})
# 3. Write result
data_df.to_csv(HUGE_FILE_NAME)
python generate_huge_csv_file.py
md5sum huge_csv_file.csv
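Since the seed is derived from YOUR_NAME, two runs with the same name produce bit-identical data, and therefore the same md5sum. A minimal sketch of that reproducibility check, reusing the script's seed recipe:

```python
import hashlib
import numpy as np

def seed_from_name(name: str) -> int:
    # Same recipe as the script: hash the name, clamp into numpy's 32-bit seed range
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % 2**32

np.random.seed(seed_from_name("antoinet"))
first = np.random.random(5)
np.random.seed(seed_from_name("antoinet"))
second = np.random.random(5)
assert (first == second).all()  # identical draws for the same name
```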
Now, let's execute the script with DVC!
dvc run --help
dvc run \
--name generate_huge_csv_file \
--deps generate_huge_csv_file.py \
--outs huge_csv_file.csv \
python generate_huge_csv_file.py
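dvc run records the stage in dvc.yaml (and its state in dvc.lock). For the command above, the generated stage should look roughly like this:

```yaml
stages:
  generate_huge_csv_file:
    cmd: python generate_huge_csv_file.py
    deps:
      - generate_huge_csv_file.py
    outs:
      - huge_csv_file.csv
```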
Create a compute_metrics.py script:
import json
import pandas as pd
# 1. Inputs
HUGE_FILE_NAME = "huge_csv_file.csv"
METRICS_FILE_NAME = "metrics.json"
# 2. Open input files
data_df = pd.read_csv(HUGE_FILE_NAME)
# 3. Process the data
huge_metrics = 10e3 * (0.5 - data_df.col_b.mean())
# 4. Write result
with open(METRICS_FILE_NAME, "w") as metrics_file:
    json.dump({"metrics": huge_metrics}, metrics_file)
python compute_metrics.py
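The metric is just a scaled deviation of col_b's mean from 0.5, so it stays close to 0 for uniform draws. A toy check of the formula, on made-up data rather than the real CSV:

```python
import pandas as pd

# Toy DataFrame whose col_b mean is 0.5, so the metric should be ~0
toy_df = pd.DataFrame({"col_a": [0, 1], "col_b": [0.4, 0.6]})
huge_metrics = 10e3 * (0.5 - toy_df.col_b.mean())
assert abs(huge_metrics) < 1e-9
```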
dvc run \
--name compute_metrics \
--deps compute_metrics.py \
--deps huge_csv_file.csv \
--outs metrics.json \
python compute_metrics.py
dvc repro
Create a params.yaml file:
huge_file_name: huge_csv_file.csv
metrics_file_name: metrics.json
your_name: antoinet
n_data: 1000000
To inject these parameters into the scripts, create a params.py file:
import yaml
with open("params.yaml") as params_file:
    params = yaml.safe_load(params_file)
YOUR_NAME = params["your_name"]
N_DATA = params["n_data"]
HUGE_FILE_NAME = params["huge_file_name"]
METRICS_FILE_NAME = params["metrics_file_name"]
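The loading logic can be sanity-checked without touching the file, by parsing an inline copy of params.yaml (the string below mirrors the file above, for illustration only):

```python
import yaml

# Inline copy of params.yaml, for illustration only
params = yaml.safe_load("""
huge_file_name: huge_csv_file.csv
metrics_file_name: metrics.json
your_name: antoinet
n_data: 1000000
""")
assert params["n_data"] == 1000000
assert params["your_name"] == "antoinet"
```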
Then inject the parameters into the scripts:

generate_huge_csv_file.py:
import hashlib
import pandas as pd
import numpy as np
# 1. Inputs
from params import YOUR_NAME, N_DATA, HUGE_FILE_NAME
# 2. Generate the file
seed = int(hashlib.md5(YOUR_NAME.encode()).hexdigest(), 16) % 2**32
np.random.seed(seed)
data_df = pd.DataFrame({"col_a": range(N_DATA), "col_b": np.random.random(N_DATA)})
# 3. Write result
data_df.to_csv(HUGE_FILE_NAME)
compute_metrics.py:
import json
import pandas as pd
# 1. Inputs
from params import HUGE_FILE_NAME, METRICS_FILE_NAME
# 2. Open input files
data_df = pd.read_csv(HUGE_FILE_NAME)
# 3. Process the data
huge_metrics = 10e3 * (0.5 - data_df.col_b.mean())
# 4. Write result
with open(METRICS_FILE_NAME, "w") as metrics_file:
    json.dump({"metrics": huge_metrics}, metrics_file)
Then, we need to modify the dvc.yaml file to declare the params:
stages:
  generate_huge_csv_file:
    cmd: python generate_huge_csv_file.py
    params:
      - huge_file_name
      - your_name
      - n_data
    deps:
      - generate_huge_csv_file.py
    outs:
      - huge_csv_file.csv
  compute_metrics:
    cmd: python compute_metrics.py
    params:
      - huge_file_name
      - metrics_file_name
    deps:
      - compute_metrics.py
      - huge_csv_file.csv
    outs:
      - metrics.json
dvc repro
dvc params diff
Modify the compute_metrics stage in dvc.yaml:
stages:
  generate_huge_csv_file: (...)
  compute_metrics:
    cmd: python compute_metrics.py
    params:
      - huge_file_name
      - metrics_file_name
    deps:
      - compute_metrics.py
      - huge_csv_file.csv
    metrics:
      - metrics.json
dvc metrics show
dvc metrics diff
Remove the lock file and the outputs, then reproduce:
rm dvc.lock huge_csv_file.csv metrics.json
dvc repro
Thanks to the run cache stored in .dvc/cache/runs/, dvc repro restores the outputs without re-executing the stages. To force a full re-execution:
dvc repro --no-run-cache
Next: :three: Experiments