PSet 2 MLOps project with Data Science Cookiecutter, DVC, CML, AWS S3 and GitHub Actions for IISc

This project covers the complete MLOps cycle of model training, deployment and serving using a full MLOps stack: DVC, CML, AWS S3 storage and GitHub Actions. The project structure is created with the Cookiecutter Data Science starter project. It uses scikit-learn's neural network module and trains a Multi-Layer Perceptron (MLP) classifier. The model is deployed to AWS S3 storage with test and production environments. The model is then retrieved from the AWS S3 bucket and used to make predictions on a feature dataset. The entire MLOps cycle is automated.
Explore the docs »

View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Results
  4. Project Structure
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

The goal of this project is to use a full MLOps stack and create a complete forward and backward cycle.


The tools used in achieving this goal are presented in the figure below, along with the flow of control/data.

Tool chain

Data Pipeline:

PSET2 IISc Project

https://dagshub.com/asheshd/pset2_cc?filter=data

(back to top)

The chain and the tools are as follows:

Dataset Access

The dataset is created with sklearn's make_classification with 100,000 samples.
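
A minimal sketch of how such a dataset can be generated is shown below; the file path, feature count and other parameters are illustrative assumptions, only the 100,000-sample size comes from the project description.

    # src/data/make_dataset.py (illustrative sketch, not the repository's exact script)
    from pathlib import Path

    import pandas as pd
    from sklearn.datasets import make_classification

    def main(output_dir: str = "data/processed") -> None:
        # Generate a synthetic classification dataset with 100,000 samples
        X, y = make_classification(n_samples=100_000, n_features=20, random_state=42)
        df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
        df["target"] = y

        # Write the dataset where the DVC pipeline expects its data
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        df.to_csv(Path(output_dir) / "dataset.csv", index=False)

    if __name__ == "__main__":
        main()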

Integration using GitHub, DVC, CML and GitHub Actions and Training

GitHub is used for version control of the project's core ML training and model-generation code. It also maintains the code that serves the updated model to the web application.

In this project, CML is used for continuous integration and continuous deployment, and DVC is used for data management.

GitHub Actions is the main orchestrator in the flow. Whenever the repository is updated, it runs the corresponding action scripts (.yaml). Whenever the main project is updated, the GitHub Actions workflow sets up a Python 3 environment and installs everything listed in requirements.txt. The requirements include libraries such as sklearn, keras, modelstore, boto3, and other common packages. sklearn is needed to run the script that trains the MLP classifier in the cloud. boto3 and modelstore are required to store the model in a format that AWS S3 supports for model hosting. The AWS access keys needed for seamless access are referenced in the YAML file and stored as GitHub secrets.
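
The training step itself is a standard scikit-learn MLP fit. A minimal sketch of what such a training script could look like is shown below; the paths and hyperparameters are assumptions, not the repository's actual values.

    # Illustrative training sketch (paths and hyperparameters are assumptions)
    import joblib
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    df = pd.read_csv("data/processed/dataset.csv")
    X, y = df.drop(columns=["target"]), df["target"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Multi-Layer Perceptron classifier from sklearn's neural network module
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200, random_state=42)
    clf.fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))

    # Persist the trained model locally; deployment to S3 is handled via modelstore (see below)
    joblib.dump(clf, "models/model.joblib")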

GitHub Actions, via cml.yaml, also triggers the back-end project for the web application on every update. To set up the environment for calling the model from the web application, another set of requirements is installed, identical to the previous set. GitHub Actions then runs the script that launches the Streamlit web application.

Model Deployment to AWS S3 bucket

The model is deployed to an Amazon S3 bucket, from where it can be consumed by any service. The modelstore library provides simple APIs to save and load the ML model from any authorised tool.
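
A minimal sketch of saving and loading a model through modelstore's S3 backend is shown below; the bucket and domain names are placeholders, and the exact calls and metadata layout used in this repository may differ.

    # Illustrative use of the modelstore library with an S3 backend
    from modelstore import ModelStore
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    # Train a small throwaway model so the example is self-contained
    X, y = make_classification(n_samples=1_000, random_state=0)
    clf = MLPClassifier(max_iter=200).fit(X, y)

    # Bucket and domain names below are placeholders, not the project's real values
    model_store = ModelStore.from_aws_s3("my-pset2-model-bucket")
    meta = model_store.upload("pset2-mlp", model=clf)
    model_id = meta["model"]["model_id"]

    # Later, e.g. from the web application, load the model back from S3
    loaded_clf = model_store.load("pset2-mlp", model_id)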

Model querying and serving on web

The web hosting application is called from GitHub Actions. The underlying script also loads the model from AWS S3. Streamlit provides a fast and easy framework with many easy-to-use APIs that can be called from a Python script; in this project, it is called from the GitHub project. It provides a simple and fast UI to query the underlying model and see the results immediately on the page.
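
A minimal sketch of such a Streamlit front end is shown below; the bucket, domain, model id and input fields are assumptions for illustration only.

    # app.py - illustrative Streamlit front end (names and loading details are assumptions)
    import numpy as np
    import streamlit as st
    from modelstore import ModelStore

    st.title("PSet 2 MLP Classifier")

    # Placeholder bucket/domain/model id; the real values live in the project's configuration
    store = ModelStore.from_aws_s3("my-pset2-model-bucket")
    model = store.load("pset2-mlp", "model-id-goes-here")

    # Collect a feature vector from the user (20 features assumed, matching the sketch above)
    values = [st.number_input(f"feature_{i}", value=0.0) for i in range(20)]

    if st.button("Predict"):
        prediction = model.predict(np.array(values).reshape(1, -1))
        st.write("Predicted class:", int(prediction[0]))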

cml.yaml

name: PSET2 CML Runner

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - uses: actions/setup-python@v2
        with:
          python-version: '3.x'

      - name: cml
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          pip install -r requirements.txt
          
          # Pull dataset with DVC 
          dvc pull data
          
          # Reproduce pipeline if any changes detected in dependencies
          dvc repro 
          
          # Use DVC metrics diff to compare metrics to main
          git fetch --prune --unshallow
          dvc metrics diff --show-md $DEFAULT_BRANCH >> report.md
          
          # Add figure to report
          dvc plots diff --target reports/loss.csv --show-vega $DEFAULT_BRANCH > vega.json
                   
          vl2png vega.json -s 1.5 > reports/plot.png
          
          cml publish --md reports/plot.png >> report.md

          cml send-comment report.md

dvc.yaml

stages:
  train:
    cmd: python src/models/train_model.py
    deps:
      - data
      - src/models/train_model.py
  predict:
    cmd: python src/models/predict_model.py
    deps:
      - data
      - src/models/predict_model.py
    metrics:
      - reports/metrics.json:
          cache: false
    plots:
      - reports/loss.csv:
          cache: false
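
The predict stage is expected to produce reports/metrics.json and reports/loss.csv, which DVC tracks as metrics and plots. A minimal sketch of how those outputs could be written is given below; the metric names and the loss-curve format are assumptions.

    # Illustrative sketch of writing the DVC-tracked outputs from the predict stage
    import json
    from pathlib import Path

    import pandas as pd
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    def write_reports(y_true, y_pred, loss_curve, out_dir="reports"):
        Path(out_dir).mkdir(parents=True, exist_ok=True)

        # Metrics consumed by `dvc metrics diff`
        metrics = {
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
        }
        with open(Path(out_dir) / "metrics.json", "w") as f:
            json.dump(metrics, f, indent=2)

        # Loss curve consumed by `dvc plots diff --target reports/loss.csv`
        pd.DataFrame({"step": range(len(loss_curve)), "loss": loss_curve}).to_csv(
            Path(out_dir) / "loss.csv", index=False
        )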

https://github.com/asheshd/pset2_cc - A project that serves the model from AWS and makes predictions with an MLP classifier.

Built With

(back to top)

Getting Started

Follow the steps below to get started with this MLOps project.

Prerequisites

Here is a list of Python packages required to run this project.

  • python
  • numpy
  • pandas
  • dvc
  • boto3
  • modelstore
  • sklearn
  • matplotlib
  • click
  • Sphinx
  • coverage
  • awscli
  • flake8
  • python-dotenv>=0.5.1
    

Installation

  1. Clone the repo

    git clone https://github.com/asheshd/pset2_cc.git
    
  2. Install Python packages

    pip install -r requirements.txt
    
  3. Generate the dataset

    python src/data/make_dataset.py
    
  4. Run model training

    python src/models/train_model.py
    
  5. Run model prediction

    python src/models/predict_model.py
    

(back to top)

Results

The following screenshot shows the metrics from model training and prediction. The improvement in accuracy, precision and recall can be seen in the indicators.

Metrics

For more details, please refer to the Web App

(back to top)

Project Structure

The project structure of this MLOps pset2_cc project is as follows:

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data from sklearn make_classification
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for classification modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train a Multi-Layer Perceptron model with sklearn's neural
│   │   │                 network module and then use the trained model to make predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Project Link: https://github.com/asheshd/pset2_cc

(back to top)

Acknowledgments

  • Prof. Sashikumaar Ganeshan
  • Thivin Anandh

(back to top)
