2 - Pipelines

0. Prerequisites

1. A simple pipeline

Let's write a simple pipeline (diagram: dvc-simple-pipeline) with two stages:

  1. A script that generates huge_csv_file.csv - with a fixed seed for reproducibility
  2. A script that computes metrics from the CSV

1.1 Generate the CSV file

Create a generate_huge_csv_file.py script:

```python
import hashlib

import pandas as pd
import numpy as np

# 1. Inputs
YOUR_NAME = "antoinet"  # Should be unique in the audience
N_DATA = int(1e6)
HUGE_FILE_NAME = "huge_csv_file.csv"

# 2. Generate the file (the seed is derived from YOUR_NAME, so the
# output is reproducible for a given name)
seed = int(hashlib.md5(YOUR_NAME.encode()).hexdigest(), 16) % 2**32
np.random.seed(seed)
data_df = pd.DataFrame({"col_a": range(N_DATA), "col_b": np.random.random(N_DATA)})

# 3. Write result
data_df.to_csv(HUGE_FILE_NAME, index=False)
```
  • try the script: `python generate_huge_csv_file.py`
  • check the reproducibility: run it several times and compare `md5sum huge_csv_file.csv` (or use the Python sketch below)
  • `git add` / `git commit` the script
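
If you prefer to check the checksum from Python (for example on a machine without `md5sum`), here is a minimal sketch; the filename is the one generated above:

```python
import hashlib

# Compute the MD5 of the generated CSV; run the generator twice and
# compare the two digests -- they should be identical.
with open("huge_csv_file.csv", "rb") as csv_file:
    print(hashlib.md5(csv_file.read()).hexdigest())
```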

Now, let's execute the script with DVC!

  • get help on the command: `dvc run --help`

```bash
dvc run \
  --name generate_huge_csv_file \
  --deps generate_huge_csv_file.py \
  --outs huge_csv_file.csv \
  python generate_huge_csv_file.py
```

This command:

  • creates dvc.yaml: the description of the pipeline
  • creates dvc.lock: the metafile recording the deps and outs of the run
  • updates .gitignore: adds the outs so git does not track them
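
To peek at the two files DVC just wrote (purely for inspection, DVC manages them for you), a quick sketch:

```python
import yaml

# Print the parsed contents of the files created by `dvc run`.
for name in ("dvc.yaml", "dvc.lock"):
    with open(name) as yaml_file:
        print(f"--- {name} ---")
        print(yaml.safe_load(yaml_file))
```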

Save the pipeline execution:

```bash
git add dvc.yaml dvc.lock .gitignore
git commit -m "Add my first pipeline"
```

:bulb: Bonus: enable auto-staging with `dvc config core.autostage true`

1.2 Compute metrics

Create a compute_metrics.py script:

```python
import json

import pandas as pd

# 1. Inputs
HUGE_FILE_NAME = "huge_csv_file.csv"
METRICS_FILE_NAME = "metrics.json"

# 2. Open input files
data_df = pd.read_csv(HUGE_FILE_NAME)

# 3. Process the data: scaled deviation of col_b's mean from its
# expected value of 0.5 (col_b is uniform on [0, 1])
huge_metrics = 10e3 * (0.5 - data_df.col_b.mean())

# 4. Write result
with open(METRICS_FILE_NAME, "w") as metrics_file:
    json.dump({"metrics": huge_metrics}, metrics_file)
```
  • try the script: `python compute_metrics.py`
  • `git add` / `git commit` the script
  • add the new stage to the DVC pipeline:

```bash
dvc run \
  --name compute_metrics \
  --deps compute_metrics.py \
  --deps huge_csv_file.csv \
  --outs metrics.json \
  python compute_metrics.py
```

Save the new stage:

```bash
git add dvc.yaml dvc.lock .gitignore
git commit -m "Add compute_metrics stage to the pipeline"
```
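
The metric is designed to hover around zero: col_b is drawn from a uniform distribution on [0, 1], whose mean is 0.5, so `10e3 * (0.5 - mean)` is a scaled deviation from the expected mean. A standalone sanity check (the seed below is arbitrary, unrelated to the one derived from your name):

```python
import numpy as np

# For 1e6 uniform draws, the sample mean is very close to 0.5,
# so the scaled deviation should be within a few units of zero.
rng = np.random.default_rng(0)
sample = rng.random(int(1e6))
print(10e3 * (0.5 - sample.mean()))
```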

1.3 Re-execute the pipeline

  • reproduce the pipeline: `dvc repro`
  • force reproduction: `dvc repro --force`
  • reproduce one stage and its upstream dependencies: `dvc repro dvc.yaml:compute_metrics`
  • reproduce one stage alone, ignoring its dependencies: `dvc repro --single-item dvc.yaml:compute_metrics`

2. Improve the pipeline!

2.1 Factorize parameters

Create a params.yaml file:

```yaml
huge_file_name: huge_csv_file.csv
metrics_file_name: metrics.json
your_name: antoinet
n_data: 1000000
```

To inject these values into the scripts, create a params.py file:

```python
import yaml

with open("params.yaml") as params_file:
    params = yaml.safe_load(params_file)

YOUR_NAME = params["your_name"]
N_DATA = params["n_data"]
HUGE_FILE_NAME = params["huge_file_name"]
METRICS_FILE_NAME = params["metrics_file_name"]
```

Then inject the parameters into the scripts:

  • generate_huge_csv_file.py:

```python
import hashlib

import pandas as pd
import numpy as np

# 1. Inputs
from params import YOUR_NAME, N_DATA, HUGE_FILE_NAME

# 2. Generate the file
seed = int(hashlib.md5(YOUR_NAME.encode()).hexdigest(), 16) % 2**32
np.random.seed(seed)
data_df = pd.DataFrame({"col_a": range(N_DATA), "col_b": np.random.random(N_DATA)})

# 3. Write result
data_df.to_csv(HUGE_FILE_NAME, index=False)
```
  • compute_metrics.py:

```python
import json

import pandas as pd

# 1. Inputs
from params import HUGE_FILE_NAME, METRICS_FILE_NAME

# 2. Open input files
data_df = pd.read_csv(HUGE_FILE_NAME)

# 3. Process the data
huge_metrics = 10e3 * (0.5 - data_df.col_b.mean())

# 4. Write result
with open(METRICS_FILE_NAME, "w") as metrics_file:
    json.dump({"metrics": huge_metrics}, metrics_file)
```

Then, we need to modify the dvc.yaml file:

  • to declare params: add a params key to the stages. When a parameter changes, it triggers the execution of the stage, like a regular dep;
  • to "parametrize" the outs filenames with ${...} templating, so that params.yaml stays the single source of truth (a sanity-check sketch follows at the end of this section).
```yaml
stages:
  generate_huge_csv_file:
    cmd: python generate_huge_csv_file.py
    params:
    - huge_file_name
    - your_name
    - n_data
    deps:
    - generate_huge_csv_file.py
    outs:
    - ${huge_file_name}
  compute_metrics:
    cmd: python compute_metrics.py
    params:
    - huge_file_name
    - metrics_file_name
    deps:
    - compute_metrics.py
    - ${huge_file_name}
    outs:
    - ${metrics_file_name}
```
  • save the changes:

```bash
git add params.yaml params.py dvc.yaml *.py
git commit -m "Add parameters to the pipeline"
```

  • run `dvc repro`, then `git add` / `git commit`
  • see the parameter values: `dvc params diff`
  • try changing a parameter value and relaunch the pipeline
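
The ${...} placeholders in dvc.yaml are resolved against params.yaml, so a mismatched name in either file will break template resolution at `dvc repro` time. A minimal sanity-check sketch (a hypothetical helper, not part of DVC):

```python
import re

import yaml

# Check that every ${...} placeholder in dvc.yaml has a matching
# top-level key in params.yaml.
with open("params.yaml") as params_file:
    params = yaml.safe_load(params_file)
with open("dvc.yaml") as dvc_file:
    placeholders = set(re.findall(r"\$\{([\w.]+)\}", dvc_file.read()))

missing = placeholders - set(params)
print("missing params:", ", ".join(sorted(missing)) or "none")
```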

2.2 Track metrics.json as DVC Metrics

Modify the compute_metrics stage in dvc.yaml, replacing the outs entry with metrics:

```yaml
stages:
  generate_huge_csv_file: (...)
  compute_metrics:
    cmd: python compute_metrics.py
    params:
    - huge_file_name
    - metrics_file_name
    deps:
    - compute_metrics.py
    - ${huge_file_name}
    metrics:
    - ${metrics_file_name}
```
  • save the change: `git add` / `git commit`
  • show metrics: `dvc metrics show`
  • diff metrics: `dvc metrics diff`
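
`dvc metrics show` displays the values stored in metrics.json; you can also inspect the file directly:

```python
import json

# metrics.json was written by compute_metrics.py as {"metrics": <float>}
with open("metrics.json") as metrics_file:
    print(json.load(metrics_file))
```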

2.3 Plots!

DVC has two concepts for representing the results of machine learning training or data processing:

  • `dvc metrics` represent scalar numbers, such as AUC or true positive rate;
  • `dvc plots` can be used to visualize data series, such as AUC curves, loss functions, or confusion matrices.

Plot the huge file:

```bash
dvc plots show -x col_a -y col_b huge_csv_file.csv
```

Compare plots across revisions:

```bash
dvc plots diff -t smooth -x col_a -y col_b --targets huge_csv_file.csv -- HEAD^
```

huge_csv_file.csv is probably too big to plot comfortably: reduce the n_data parameter!
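
For a quick local look at the same data without DVC, a minimal sketch (assumes matplotlib is installed alongside pandas):

```python
import matplotlib.pyplot as plt
import pandas as pd

# col_b is uniform noise, so expect a flat band of points around 0.5.
data_df = pd.read_csv("huge_csv_file.csv")
data_df.plot(x="col_a", y="col_b", kind="scatter", s=1, alpha=0.1)
plt.show()
```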

3. DVC pipeline under the hood

  • remove the lock and output files: `rm dvc.lock huge_csv_file.csv metrics.json`
  • reproduce the pipeline: `dvc repro`

Notice that the stages are not re-executed: DVC recognizes the previous runs and restores the outputs from its cache.

(diagram: dvc-run-cache)

  • DVC "runs" are stored in .dvc/cache/runs/
  • see the :link: DVC run-cache documentation
  • :warning: deps and params are declarative: omit one and DVC may wrongly reuse a cached run
  • to bypass the cache: `dvc repro --no-run-cache`
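
To see what the run-cache actually stores, you can simply list it; a small sketch (the exact layout is an internal detail of DVC and may change between versions):

```python
from pathlib import Path

# Each file under .dvc/cache/runs/ records one memoized stage execution.
for path in sorted(Path(".dvc/cache/runs").rglob("*")):
    if path.is_file():
        print(path)
```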

Next: :three: Experiments
