# 2 - Pipelines

## 1. A simple pipeline

Let's write a simple pipeline made of two stages:

1. A script that generates `huge_csv_file.csv` - with a fixed seed for reproducibility
2. A script that computes metrics from it

### 1.1 Generate the CSV file

Create a `generate_huge_csv_file.py` script:

```python
import hashlib

import pandas as pd
import numpy as np

# 1. Inputs
YOUR_NAME = "antoinet"  # Should be unique in the audience
N = int(1e6)
HUGE_FILE_NAME = "huge_csv_file.csv"

# 2. Generate the file: derive a deterministic seed from the name
seed = int(hashlib.md5(YOUR_NAME.encode()).hexdigest(), 16) % 2**32
np.random.seed(seed)
data_df = pd.DataFrame({"col_a": range(N), "col_b": np.random.random(N)})

# 3. Write result
data_df.to_csv(HUGE_FILE_NAME)
```
- try the script: `python generate_huge_csv_file.py`
- check the reproducibility: execute it several times and compare `md5sum huge_csv_file.csv`
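If `md5sum` is not available on your machine, the same check can be done from Python (a small throwaway helper, not part of the pipeline):

```python
import hashlib

def file_md5(path):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running the generation script twice and calling `file_md5("huge_csv_file.csv")` after each run should give the same digest.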

Now, let's execute the script with DVC!

- check the available options: `dvc run --help`
- create the first stage:

```shell
dvc run \
  --name generate_huge_csv_file \
  --deps generate_huge_csv_file.py \
  --outs huge_csv_file.csv \
  python generate_huge_csv_file.py
```
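`dvc run` records the stage in a `dvc.yaml` file at the repository root. After this first command it should contain roughly the following (a sketch - the exact content is generated by DVC for you):

```yaml
stages:
  generate_huge_csv_file:
    cmd: python generate_huge_csv_file.py
    deps:
    - generate_huge_csv_file.py
    outs:
    - huge_csv_file.csv
```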

### 1.2 Compute metrics

Create a `compute_metrics.py` script:

```python
import json

import pandas as pd

# 1. Inputs
HUGE_FILE_NAME = "huge_csv_file.csv"
METRICS_FILE_NAME = "metrics.json"

# 2. Open input files
data_df = pd.read_csv(HUGE_FILE_NAME)

# 3. Process the data
huge_metrics = 10e3 * (0.5 - data_df.col_b.mean())

# 4. Write result
with open(METRICS_FILE_NAME, "w") as metrics_file:
    json.dump({"metrics": huge_metrics}, metrics_file)
```
- try the script: `python compute_metrics.py`
- add the new stage to the DVC pipeline:

```shell
dvc run \
  --name compute_metrics \
  --deps compute_metrics.py \
  --deps huge_csv_file.csv \
  --outs metrics.json \
  python compute_metrics.py
```

- reproduce the pipeline: `dvc repro`
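As a sanity check on the metric itself (a side note, not part of the exercise): `col_b` is drawn uniformly on [0, 1), so its mean is close to 0.5 and `huge_metrics` should be small compared to the `10e3` scale factor:

```python
import numpy as np

np.random.seed(42)  # any fixed seed works for this check
col_b = np.random.random(1_000_000)
huge_metrics = 10e3 * (0.5 - col_b.mean())

# The mean of N uniform samples deviates from 0.5 by roughly
# sqrt(1/12) / sqrt(N) ~= 3e-4, so huge_metrics is on the order of a few units.
```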

## 2. Improve the pipeline!

### 2.1 Factorize parameters

Create a `params.yaml` file:

```yaml
huge_file_name: huge_csv_file.csv
metrics_file_name: metrics.json
your_name: antoinet
n_data: 1000000
```

To inject it in the scripts, create a `params.py` file:

```python
import yaml

with open("params.yaml") as params_file:
    params = yaml.safe_load(params_file)

YOUR_NAME = params["your_name"]
N_DATA = params["n_data"]
HUGE_FILE_NAME = params["huge_file_name"]
METRICS_FILE_NAME = params["metrics_file_name"]
```
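If you want to convince yourself of what `yaml.safe_load` returns, here is a quick standalone check (loading the same content from a string instead of the file):

```python
import yaml

params_text = """\
huge_file_name: huge_csv_file.csv
metrics_file_name: metrics.json
your_name: antoinet
n_data: 1000000
"""

params = yaml.safe_load(params_text)
# YAML scalars come back as native Python types: str for the names, int for n_data.
assert params["n_data"] == 1000000
assert params["your_name"] == "antoinet"
```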

Then inject the parameters in the scripts:

- `generate_huge_csv_file.py`:

```python
import hashlib

import pandas as pd
import numpy as np

# 1. Inputs
from params import YOUR_NAME, N_DATA, HUGE_FILE_NAME

# 2. Generate the file
seed = int(hashlib.md5(YOUR_NAME.encode()).hexdigest(), 16) % 2**32
np.random.seed(seed)
data_df = pd.DataFrame({"col_a": range(N_DATA), "col_b": np.random.random(N_DATA)})

# 3. Write result
data_df.to_csv(HUGE_FILE_NAME)
```

- `compute_metrics.py`:

```python
import json

import pandas as pd

# 1. Inputs
from params import HUGE_FILE_NAME, METRICS_FILE_NAME

# 2. Open input files
data_df = pd.read_csv(HUGE_FILE_NAME)

# 3. Process the data
huge_metrics = 10e3 * (0.5 - data_df.col_b.mean())

# 4. Write result
with open(METRICS_FILE_NAME, "w") as metrics_file:
    json.dump({"metrics": huge_metrics}, metrics_file)
```

Then, we need to modify the `dvc.yaml` file to declare the params:

```yaml
stages:
  generate_huge_csv_file:
    cmd: python generate_huge_csv_file.py
    params:
    - huge_file_name
    - your_name
    - n_data
    deps:
    - generate_huge_csv_file.py
    outs:
    - huge_csv_file.csv
  compute_metrics:
    cmd: python compute_metrics.py
    params:
    - huge_file_name
    - metrics_file_name
    deps:
    - compute_metrics.py
    - huge_csv_file.csv
    outs:
    - metrics.json
```

- run `dvc repro`
- see the parameter values: `dvc params diff`
- try to change a parameter value and relaunch the pipeline
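Conceptually, `dvc params diff` is a dictionary diff between the committed parameter values and the ones in the workspace. A toy sketch of the idea (not DVC's actual implementation):

```python
def params_diff(old, new):
    """Return {param: (old_value, new_value)} for every changed, added or removed key."""
    changed = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed

old = {"your_name": "antoinet", "n_data": 1000000}
new = {"your_name": "antoinet", "n_data": 2000000}
print(params_diff(old, new))  # -> {'n_data': (1000000, 2000000)}
```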

### 2.2 Track metrics.json as DVC Metrics

Modify the `compute_metrics` stage in `dvc.yaml` to declare `metrics.json` as a metrics file instead of a plain output:

```yaml
stages:
  generate_huge_csv_file: (...)
  compute_metrics:
    cmd: python compute_metrics.py
    params:
    - huge_file_name
    - metrics_file_name
    deps:
    - compute_metrics.py
    - huge_csv_file.csv
    metrics:
    - metrics.json
```

- show the metrics: `dvc metrics show`
- diff the metrics against the last commit: `dvc metrics diff`

## 3. DVC pipeline under the hood

- remove the lock and output files: `rm dvc.lock huge_csv_file.csv metrics.json`
- reproduce the pipeline: `dvc repro`

Notice that the outputs come back without the stages being re-executed - this is the run cache at work:

- DVC "runs" are stored in `.dvc/cache/runs/`
- to avoid using the run cache: `dvc repro --no-run-cache`
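The idea behind the run cache can be sketched in a few lines: each run is keyed by a hash of its command, dependency hashes and parameters, and a run with a known key is restored instead of re-executed (a toy model, not DVC's on-disk format):

```python
import hashlib
import json

run_cache = {}  # run signature -> recorded stage outputs

def run_key(cmd, dep_hashes, params):
    """Hash everything that determines the result of a stage."""
    payload = json.dumps(
        {"cmd": cmd, "deps": dep_hashes, "params": params}, sort_keys=True
    )
    return hashlib.md5(payload.encode()).hexdigest()

def repro(cmd, dep_hashes, params, execute):
    """Re-run a stage only if its signature has never been seen before."""
    key = run_key(cmd, dep_hashes, params)
    if key in run_cache:  # same code + data + params: reuse the cached outputs
        return run_cache[key]
    outputs = execute()  # otherwise actually run the stage
    run_cache[key] = outputs
    return outputs
```

Calling `repro` twice with identical arguments executes the stage only once; changing any parameter or dependency hash produces a new key and triggers a real run.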

Next: 3 - Experiments
