Tutorial Overview

Creating an awesome project using DVC and DagsHub

This tutorial walks through building a model that classifies images of handwritten digits (0 to 9), using MNIST as the dataset. This problem is often considered the "Hello, World" of machine learning, so it is relatively simple.

The focus of the tutorial is to show how we use DVC to version our data pipeline, the benefits this brings to our workflow, and the advantages of using DagsHub both as a repo for our projects and as a pipeline visualization tool.
If you want to learn about using DagsHub to track and visualize reproducible experiments, we suggest that you first go to our other tutorial.

Samples from the MNIST test data set (source: Josef Steppan on Wikimedia Commons)

DVC?

Short for Data Version Control, it's a tool that addresses the versioning and reproducibility problems of data science and machine learning. It does so by versioning both data and pipelines, which in turn enables reproducible experiments and easier collaboration. DVC is built to work synergistically¹ alongside Git, which is still used as the backbone for file versioning.
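To make the idea of versioning data alongside Git more concrete, here is a minimal Python sketch of content-addressed tracking: the large data file is copied into a cache keyed by its content hash, and a small text pointer file records that hash so it can be committed to Git instead of the data. This is only an illustration of the general idea, not DVC's actual implementation; the function names (`track`, `restore`), the `.pointer.json` suffix, and the cache layout are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    The pointer file is small and text-based, so it is the part Git would
    track; the data itself lives in the cache. Hypothetical layout.
    """
    checksum = file_sha256(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cache_dir / checksum)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "hash": checksum}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file described by a pointer file from the cache."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(cache_dir / meta["hash"], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "digits.bin"
        data.write_bytes(b"pretend this is a large dataset")
        pointer = track(data, root / "cache")
        data.unlink()  # simulate a fresh clone that has only the pointer
        restored = restore(pointer, root / "cache")
        print(restored.name, restored.stat().st_size)
```

Because the pointer file changes whenever the data changes, each Git commit pins an exact version of the data without storing the data in Git itself.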

Too slow for you?

Here is a link to the complete code repo. You can go over it or use the code as you wish.

See the project on DagsHub

The tutorial will guide you, step-by-step, to create this repo.

¹ We're almost sure that's a real word