# 2 - Pipelines

## 1. A simple pipeline

Let's write a simple pipeline made of two stages:

1. A script that generates `huge_csv_file.csv` - with a fixed seed for reproducibility
2. A script that computes metrics from it

### 1.1 Generate the CSV file

Create a `generate_huge_csv_file.py` script:

```python
import hashlib

import pandas as pd
import numpy as np

# 1. Inputs
YOUR_NAME = "antoinet"  # Should be unique in the audience
N = int(1e6)
HUGE_FILE_NAME = "huge_csv_file.csv"

# 2. Generate the file: derive a deterministic seed from the name
seed = int(hashlib.md5(YOUR_NAME.encode()).hexdigest(), 16) % 2**32
np.random.seed(seed)
data_df = pd.DataFrame({"col_a": range(N), "col_b": np.random.random(N)})

# 3. Write result
data_df.to_csv(HUGE_FILE_NAME)
```
- try the script: `python generate_huge_csv_file.py`
- check the reproducibility: execute it several times and compare `md5sum huge_csv_file.csv`
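If `md5sum` is not available on your machine, the same check can be done from Python (a small throwaway helper, not part of the pipeline):

```python
import hashlib

def file_md5(path):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running the generation script twice and calling `file_md5("huge_csv_file.csv")` after each run should give the same digest.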

Now, let's execute the script with DVC!

- check the available options: `dvc run --help`
- create the first stage:

```shell
dvc run \
  --name generate_huge_csv_file \
  --deps generate_huge_csv_file.py \
  --outs huge_csv_file.csv \
  python generate_huge_csv_file.py
```
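`dvc run` records the stage in a `dvc.yaml` file at the repository root. After this first command it should contain roughly the following (a sketch - the exact content is generated by DVC for you):

```yaml
stages:
  generate_huge_csv_file:
    cmd: python generate_huge_csv_file.py
    deps:
    - generate_huge_csv_file.py
    outs:
    - huge_csv_file.csv
```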

### 1.2 Compute metrics

Create a `compute_metrics.py` script:

```python
import json

import pandas as pd

# 1. Inputs
HUGE_FILE_NAME = "huge_csv_file.csv"
METRICS_FILE_NAME = "metrics.json"

# 2. Open input files
data_df = pd.read_csv(HUGE_FILE_NAME)

# 3. Process the data
huge_metrics = 10e3 * (0.5 - data_df.col_b.mean())

# 4. Write result
with open(METRICS_FILE_NAME, "w") as metrics_file:
    json.dump({"metrics": huge_metrics}, metrics_file)
```
- try the script: `python compute_metrics.py`
- add the new stage to the DVC pipeline:

```shell
dvc run \
  --name compute_metrics \
  --deps compute_metrics.py \
  --deps huge_csv_file.csv \
  --outs metrics.json \
  python compute_metrics.py
```

- reproduce the pipeline: `dvc repro`
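As a sanity check on the metric itself (a side note, not part of the exercise): `col_b` is drawn uniformly on [0, 1), so its mean is close to 0.5 and `huge_metrics` should be small compared to the `10e3` scale factor:

```python
import numpy as np

np.random.seed(42)  # any fixed seed works for this check
col_b = np.random.random(1_000_000)
huge_metrics = 10e3 * (0.5 - col_b.mean())

# The mean of N uniform samples deviates from 0.5 by roughly
# sqrt(1/12) / sqrt(N) ~= 3e-4, so huge_metrics is on the order of a few units.
```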

## 2. Improve the pipeline!

### 2.1 Factorize parameters

Create a `params.yaml` file:

```yaml
huge_file_name: huge_csv_file.csv
metrics_file_name: metrics.json
your_name: antoinet
n_data: 1000000
```

To inject it in the scripts, create a `params.py` file:

```python
import yaml

with open("params.yaml") as params_file:
    params = yaml.safe_load(params_file)

YOUR_NAME = params["your_name"]
N_DATA = params["n_data"]
HUGE_FILE_NAME = params["huge_file_name"]
METRICS_FILE_NAME = params["metrics_file_name"]
```
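If you want to convince yourself of what `yaml.safe_load` returns, here is a quick standalone check (loading the same content from a string instead of the file):

```python
import yaml

params_text = """\
huge_file_name: huge_csv_file.csv
metrics_file_name: metrics.json
your_name: antoinet
n_data: 1000000
"""

params = yaml.safe_load(params_text)
# YAML scalars come back as native Python types: str for the names, int for n_data.
assert params["n_data"] == 1000000
assert params["your_name"] == "antoinet"
```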

Then inject the parameters in the scripts:

- `generate_huge_csv_file.py`:

```python
import hashlib

import pandas as pd
import numpy as np

# 1. Inputs
from params import YOUR_NAME, N_DATA, HUGE_FILE_NAME

# 2. Generate the file
seed = int(hashlib.md5(YOUR_NAME.encode()).hexdigest(), 16) % 2**32
np.random.seed(seed)
data_df = pd.DataFrame({"col_a": range(N_DATA), "col_b": np.random.random(N_DATA)})

# 3. Write result
data_df.to_csv(HUGE_FILE_NAME)
```

- `compute_metrics.py`:

```python
import json

import pandas as pd

# 1. Inputs
from params import HUGE_FILE_NAME, METRICS_FILE_NAME

# 2. Open input files
data_df = pd.read_csv(HUGE_FILE_NAME)

# 3. Process the data
huge_metrics = 10e3 * (0.5 - data_df.col_b.mean())

# 4. Write result
with open(METRICS_FILE_NAME, "w") as metrics_file:
    json.dump({"metrics": huge_metrics}, metrics_file)
```

Then, we need to modify the `dvc.yaml` file to declare the params:

```yaml
stages:
  generate_huge_csv_file:
    cmd: python generate_huge_csv_file.py
    params:
    - huge_file_name
    - your_name
    - n_data
    deps:
    - generate_huge_csv_file.py
    outs:
    - huge_csv_file.csv
  compute_metrics:
    cmd: python compute_metrics.py
    params:
    - huge_file_name
    - metrics_file_name
    deps:
    - compute_metrics.py
    - huge_csv_file.csv
    outs:
    - metrics.json
```

- run `dvc repro`
- see the parameter values: `dvc params diff`
- try to change a parameter value and relaunch the pipeline
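Conceptually, `dvc params diff` is a dictionary diff between the committed parameter values and the ones in the workspace. A toy sketch of the idea (not DVC's actual implementation):

```python
def params_diff(old, new):
    """Return {param: (old_value, new_value)} for every changed, added or removed key."""
    changed = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed

old = {"your_name": "antoinet", "n_data": 1000000}
new = {"your_name": "antoinet", "n_data": 2000000}
print(params_diff(old, new))  # -> {'n_data': (1000000, 2000000)}
```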

### 2.2 Track metrics.json as DVC Metrics

Modify the `compute_metrics` stage in `dvc.yaml` to declare `metrics.json` as a metrics file instead of a plain output:

```yaml
stages:
  generate_huge_csv_file: (...)
  compute_metrics:
    cmd: python compute_metrics.py
    params:
    - huge_file_name
    - metrics_file_name
    deps:
    - compute_metrics.py
    - huge_csv_file.csv
    metrics:
    - metrics.json
```

- show the metrics: `dvc metrics show`
- diff the metrics against the last commit: `dvc metrics diff`

## 3. DVC pipeline under the hood

- remove the lock and output files: `rm dvc.lock huge_csv_file.csv metrics.json`
- reproduce the pipeline: `dvc repro`

Notice that the outputs come back without the stages being re-executed - this is the run cache at work:

- DVC "runs" are stored in `.dvc/cache/runs/`
- to avoid using the run cache: `dvc repro --no-run-cache`
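The idea behind the run cache can be sketched in a few lines: each run is keyed by a hash of its command, dependency hashes and parameters, and a run with a known key is restored instead of re-executed (a toy model, not DVC's on-disk format):

```python
import hashlib
import json

run_cache = {}  # run signature -> recorded stage outputs

def run_key(cmd, dep_hashes, params):
    """Hash everything that determines the result of a stage."""
    payload = json.dumps(
        {"cmd": cmd, "deps": dep_hashes, "params": params}, sort_keys=True
    )
    return hashlib.md5(payload.encode()).hexdigest()

def repro(cmd, dep_hashes, params, execute):
    """Re-run a stage only if its signature has never been seen before."""
    key = run_key(cmd, dep_hashes, params)
    if key in run_cache:  # same code + data + params: reuse the cached outputs
        return run_cache[key]
    outputs = execute()  # otherwise actually run the stage
    run_cache[key] = outputs
    return outputs
```

Calling `repro` twice with identical arguments executes the stage only once; changing any parameter or dependency hash produces a new key and triggers a real run.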

Next: 3 - Experiments
