
Tutorial Overview

This tutorial covers the basics of using DVC and DagsHub to create version-controlled data science projects, as well as tracking experiments with MLflow.

You will learn about data exploration, tracking experiment parameters and metrics, comparing experiments, and more.

Detecting questions about Machine Learning

In this tutorial, we'll create a model to predict whether a question on the Cross Validated Stack Exchange concerns Machine Learning or not.

This kind of prediction can be useful if, for example, we want to recommend that a user add the machine-learning tag to their question, which can make it more likely that they will get an answer.

This task is simple and clean enough for a tutorial but leaves room for experimentation with feature engineering, data enrichment, and model selection.
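To make the prediction target concrete, here is a minimal sketch of how the binary label could be derived from a question's tags. The record layout (a dict with `title` and `tags` keys), the helper name `is_ml_question`, and the example questions are assumptions made for illustration; they are not taken from the tutorial's actual dataset or code.

```python
# The tag name comes from the task description; the record layout below
# is a hypothetical one chosen only for this illustration.
ML_TAG = "machine-learning"


def is_ml_question(tags):
    """Return True if the question's tag list contains the machine-learning tag."""
    return ML_TAG in tags


# Made-up example questions, not taken from the real dataset.
questions = [
    {"title": "How do I tune a random forest?", "tags": ["machine-learning", "random-forest"]},
    {"title": "Interpreting a p-value", "tags": ["hypothesis-testing", "p-value"]},
]

# One boolean label per question: the value a model would learn to predict
# from the question's text.
labels = [is_ml_question(q["tags"]) for q in questions]
print(labels)
```

In a setup like this, the model would see only the question's text and try to predict the label, so a confident positive prediction on an untagged question could be turned into a tag suggestion.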

The tutorial is divided into several "levels", each of which demonstrates another workflow improvement. It's designed so that you learn something useful at each level, even if the next level is less to your liking and you choose to stop early.

The levels are:

  1. Data Exploration - Getting the data and trying to understand it, also known as exploratory data analysis.
  2. Setup - Creating a DagsHub account and project.
  3. Data Versioning - Using DVC to keep track of data and model versions.
  4. Experiments - Using MLflow to track experiments.



Screenshot: Delicious statistics 😋 (source: Cross Validated)

Too slow for you?

Here is a link to the complete code repo. You can go over it or use the code as you wish.

See the project on DagsHub