Lego Set Price Prediction with Snowflake

Introduction

Many machine learning projects begin by querying data from a large database. For time and space efficiency, it is now common practice to host such databases on a cloud-based data warehouse platform like Snowflake rather than on a local machine. This project explores how to integrate DagsHub's tracking features with data queried from a Snowflake database when building a machine learning project on DagsHub.

Prerequisites

To run the code in this project, you need to install the following packages:

  • mlflow
  • scikit-learn
  • requests
  • pillow
  • Snowflake Python connector: the installation guide can be found here.

Building a Snowflake Database

In this project I use a public dataset, with schemas, obtained from Kaggle, which collects various features and the price of Lego sets on the market. The data files are then uploaded to an AWS S3 bucket.

To host a database on Snowflake using datasets stored in AWS S3 buckets, you first need to configure your Snowflake account for S3 integration and then load the data into Snowflake. Detailed documentation for configuration and data loading is available here for you to follow along.

After loading the data into Snowflake, you can create a database and start managing your data with SQL statements. For this project, I create a table for each CSV file and build the schemas based on those provided in the Kaggle dataset, as sketched below.
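
As a rough sketch of this step with the Snowflake Python connector (the connection parameters, stage name, table name, and columns below are illustrative placeholders, not the project's actual schema), creating one table and loading its CSV file from an S3 stage might look like this:

```python
# Sketch only: create one table and load its CSV from an S3 external stage.
# Connection parameters, the stage name, and the column list are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<your_account>",
    user="<your_user>",
    password="<your_password>",
    warehouse="<your_warehouse>",
    database="LEGO_DB",   # hypothetical database name
    schema="PUBLIC",
)
cur = conn.cursor()

# table for one of the Kaggle CSV files (illustrative columns)
cur.execute("""
    CREATE TABLE IF NOT EXISTS sets (
        set_num   VARCHAR,
        name      VARCHAR,
        year      INTEGER,
        theme_id  INTEGER,
        num_parts INTEGER
    )
""")

# copy the matching CSV file from the S3 stage configured during integration
cur.execute("""
    COPY INTO sets
    FROM @lego_s3_stage/sets.csv
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

cur.close()
conn.close()
```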

Querying Data from Snowflake Database

The goal of this machine learning project is to predict the price of a published Lego set from the numerous features provided in the database.

The first step is to query the relevant information from the Snowflake database. This involves writing SQL queries and saving the query output as CSV files. With DagsHub's tracking features, we can track the SQL queries as text with Git and the query output CSV files with DVC, making it easy to revert to a previous version of the queries and output tables if necessary.
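
A minimal sketch of this step, assuming a Git-tracked query file and hypothetical table, path, and connection names, could look like the following:

```python
# Sketch only: run a Git-tracked SQL query against Snowflake and save the result
# as a CSV file that will be tracked by DVC. Paths and names are placeholders.
import csv
import snowflake.connector

conn = snowflake.connector.connect(
    account="<your_account>",
    user="<your_user>",
    password="<your_password>",
    warehouse="<your_warehouse>",
    database="LEGO_DB",
    schema="PUBLIC",
)

with open("queries/sets.sql") as f:   # SQL text versioned by Git
    query = f.read()

cur = conn.cursor()
try:
    cur.execute(query)
    with open("data/sets.csv", "w", newline="") as out:   # output versioned by DVC
        writer = csv.writer(out)
        writer.writerow([col[0] for col in cur.description])  # column headers
        writer.writerows(cur.fetchall())
finally:
    cur.close()
    conn.close()
```

The generated CSV can then be placed under DVC control (for example with `dvc add data/sets.csv`), while the `.sql` file is committed to Git.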

Training Model on Queried Dataframes

In the model training process, I collect each Lego set's category, theme, publishing year, number of parts, and average set color in RGB format, and model these features against the price. A simple Lasso regression model achieves an R2 score of 0.488 on an independent test set. This is merely a baseline for the task; since the primary objective of this project is to illustrate how to integrate data access from a Snowflake database with DagsHub's tracking features, we will not pursue a better-performing model.
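
In outline, and with illustrative column names, file paths, and hyperparameters rather than the project's exact ones, the training and MLflow metric logging look roughly like this:

```python
# Sketch only: fit a baseline Lasso regression on the queried features and log
# the test R2 to MLflow. Column names, paths, and alpha are illustrative.
import mlflow
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/sets_features.csv")   # DVC-tracked query output
X = df[["category", "theme_id", "year", "num_parts", "avg_r", "avg_g", "avg_b"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

with mlflow.start_run():
    model = Lasso(alpha=0.1)
    model.fit(X_train, y_train)
    r2 = r2_score(y_test, model.predict(X_test))
    mlflow.log_param("alpha", 0.1)
    mlflow.log_metric("test_r2", r2)
```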

Conclusion

This project shows a standard workflow for building an ML model on top of a cloud database hosted on Snowflake, with Git and DVC tracking the SQL queries and the tabular data queried from the database, and MLflow tracking the training metrics.
