Project MLOps: Drug Discovery - Binary Classification

✨ Project summary:

This ML project focused on binary classification: predicting the permeability of compounds in the PAMPA assay from their SMILES strings. The goal was to classify compounds as having either high permeability (1) or low-to-moderate permeability (0). The dataset used in the project contained a set of SMILES strings and their corresponding permeability labels.

To identify the best-performing algorithm, the project evaluated 18 candidate algorithms, including traditional machine learning algorithms as well as models from PyG and DeepPurpose, two of the most common GNN frameworks used in drug discovery. A comprehensive description of each step is provided in the project files. Performance was evaluated using standard metrics, including accuracy, precision, and recall.

The primary objective was to develop a highly accurate model for predicting the permeability of compounds in the PAMPA assay, a crucial tool for guiding drug discovery and development efforts. In the second stage, an end-to-end machine learning architecture was created, incorporating model training, testing, and operationalization, as well as infrastructure and endpoint monitoring, for the ADMET_PAMPA_NCATS dataset.

Here is a summary of the information related to this dataset:

PAMPA Dataset info

  • Dataset Description: PAMPA (parallel artificial membrane permeability assay) is a commonly employed assay to evaluate drug permeability across the cellular membrane. PAMPA is a non-cell-based, low-cost, high-throughput alternative to cellular models. Although PAMPA does not model active and efflux transporters, it still provides permeability values that are useful for absorption prediction, because the majority of drugs are absorbed by passive diffusion through the membrane.
  • Dataset Statistics: NCATS set: 2035 compounds; approved drugs set: 142 drugs.
  • References: [1] Siramshetty, V.B., Shah, P., et al. "Validating ADME QSAR Models Using Marketed Drugs." SLAS Discovery 2021 Dec;26(10):1326-1336. doi: 10.1177/24725552211017520.
  • Dataset License: Not Specified. CC BY 4.0.

🚀 Stage 1: ML Algorithm Results


Here is a summary of the final results:

| Framework | Algorithm | F1 score | ROC-AUC | PR-AUC |
| --- | --- | --- | --- | --- |
| xgboost + scikit-learn | XGBClassifier_results | 92% | 0.57 | 0.85 |
| DeepPurpose | DGL_GIN_ContextPred_results | 91% | 0.76 | 0.92 |
| DeepPurpose | Transformer_results | 91% | 0.57 | 0.87 |
| PyG | GCN-GraphConv | 91.3% | 0.75 | 0.90 |

The metrics used for this project include:

  • F1 score: the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.
  • ROC-AUC: the area under the ROC curve. Values between 0.9-1 are considered excellent, 0.8-0.9 good, 0.7-0.8 fair, 0.6-0.7 poor, and 0.5-0.6 failed.
  • PR-AUC: the area under the curve that combines precision (PPV) and recall (TPR). A perfect classifier has PR-AUC = 1; the higher the curve sits on the y-axis, the better the model's performance.
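As a concrete illustration, all three metrics can be computed with scikit-learn; the labels and probabilities below are invented for the example, not results from the project:

```python
# Toy illustration of the three report metrics using scikit-learn.
# The labels and probabilities are made up for demonstration.
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                  # 1 = high permeability
y_prob = [0.9, 0.2, 0.8, 0.4, 0.6, 0.1, 0.7, 0.3]  # classifier probabilities
y_pred = [int(p >= 0.5) for p in y_prob]           # threshold at 0.5

print(f"F1:      {f1_score(y_true, y_pred):.2f}")                 # 0.75
print(f"ROC-AUC: {roc_auc_score(y_true, y_prob):.2f}")            # 0.94
print(f"PR-AUC:  {average_precision_score(y_true, y_prob):.2f}")  # 0.95
```

Note that F1 is computed from thresholded labels, while ROC-AUC and PR-AUC are computed from the raw probabilities.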

According to the results, the best algorithms in order of performance are:

  1. DGL_GIN_ContextPred
  2. GCN-GraphConv
  3. Transformer
  4. XGBClassifier

NOTE: All the files have been designed to run end-to-end on Colab + GPU.

🚀 Stage 2: MLOps

Goal: Create an end-to-end machine learning architecture that includes model training, testing, and operationalization, as well as infrastructure and endpoint monitoring.

✨ Architecture

(architecture diagram)

🔥 Technologies Used

1. Python
2. Shell scripting
3. AWS cloud provider
4. Prometheus and Grafana
5. FastAPI for the endpoint
6. S3 buckets as feature store and model registry
7. Jenkins as the CI/CD tool

👷 Initial Setup

conda create --prefix ./env python=3.9 -y
conda activate ./env
# OR
source activate ./env
pip install -r requirements.txt

💭 Setup S3 bucket

1. Feature Store: an S3 bucket with a Lambda call on put events
2. Model Registry: an S3 bucket with testing and production prefixes
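A minimal sketch of this two-bucket layout with boto3; the bucket names and helper functions below are illustrative, not taken from the repo's scripts:

```python
# Sketch of the two-bucket layout. Bucket names are hypothetical.
FEATURE_STORE_BUCKET = "pampa-feature-store"
MODEL_REGISTRY_BUCKET = "pampa-model-registry"

def registry_key(stage: str, model_file: str) -> str:
    """Build the object key for a model artifact under its stage prefix."""
    if stage not in ("testing", "production"):
        raise ValueError("stage must be 'testing' or 'production'")
    return f"{stage}/{model_file}"

def upload_model(local_path: str, stage: str) -> None:
    """Upload a trained model to the registry under testing/ or production/."""
    import boto3  # local import so the key helper works without AWS deps
    s3 = boto3.client("s3")
    s3.upload_file(local_path, MODEL_REGISTRY_BUCKET,
                   registry_key(stage, local_path.rsplit("/", 1)[-1]))
```

Keeping testing and production as prefixes in one registry bucket makes promotion a simple object copy.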

🔅 Configuration for Jenkins


To let Jenkins access your GitHub repository automatically whenever there is a push, install Jenkins on EC2 and set up a webhook. Once this is done, create three distinct jobs for training, testing, and deployment, each containing the corresponding jenkins-jobs-script. Separate scripts for all three jobs are provided to help you get started.

Create a master pipeline that runs the train, test, and deploy jobs in sequence.

📐 Develop Lambda Trigger


Create a Lambda trigger on the S3 feature store that activates on put events. Use Python 3.7 in the Lambda, as its runtime already has the requests library installed. The Lambda trigger remotely activates the master pipeline, which then executes all of the stages.
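A sketch of such a handler is below. The Jenkins URL, job name, and token are placeholders, and this version uses the stdlib urllib rather than requests; the repo's own Lambda script may differ:

```python
# Hypothetical Lambda handler: kick off the Jenkins master pipeline
# when a new object lands in the feature-store bucket.
# JENKINS_URL, JOB_NAME, and JOB_TOKEN are placeholders.
import json
import urllib.request

JENKINS_URL = "http://<jenkins-ec2-host>:8080"  # placeholder
JOB_NAME = "master-pipeline"                    # placeholder
JOB_TOKEN = "<job-token>"                       # placeholder

def build_trigger_url(base: str, job: str, token: str) -> str:
    """Jenkins remote-build URL for a job protected by an auth token."""
    return f"{base}/job/{job}/build?token={token}"

def lambda_handler(event, context):
    # The S3 put event carries the key of the new feature file.
    key = event["Records"][0]["s3"]["object"]["key"]
    req = urllib.request.Request(
        build_trigger_url(JENKINS_URL, JOB_NAME, JOB_TOKEN), method="POST")
    urllib.request.urlopen(req)  # starts the master pipeline remotely
    return {"statusCode": 200, "body": json.dumps({"triggered_by": key})}
```

Remote triggering must be enabled on the Jenkins job ("Trigger builds remotely" with an authentication token) for this URL scheme to work.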

📊 Configuration File

Maintain the configuration file. Changes are required in:
- Feature-Store
- Preprocessed dataset
- Model Registry
- Email Params
    - Put your Gmail application key here, otherwise you will get an error
- Ml_Model_params
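For orientation, here is a sketch of what the config might roughly look like, plus a small loader that fails fast on missing sections. The section names mirror the list above, but the actual keys in the repo's config.yaml may differ; this requires PyYAML:

```python
# Illustrative config shape; real keys in config.yaml may differ.
import yaml

SAMPLE_CONFIG = """
feature_store:
  bucket: pampa-feature-store        # hypothetical bucket name
preprocessed_dataset:
  path: data/preprocessed.csv
model_registry:
  bucket: pampa-model-registry
email_params:
  sender: you@gmail.com
  app_key: "<gmail-application-key>" # must be set, or sending fails
ml_model_params:
  threshold: 0.5
"""

REQUIRED_SECTIONS = {"feature_store", "preprocessed_dataset",
                     "model_registry", "email_params", "ml_model_params"}

def load_config(text: str) -> dict:
    """Parse the YAML config and verify all required sections exist."""
    cfg = yaml.safe_load(text)
    missing = REQUIRED_SECTIONS - cfg.keys()
    if missing:
        raise KeyError(f"config is missing sections: {sorted(missing)}")
    return cfg
```

Validating the config once at startup surfaces a missing Gmail application key or bucket name before any pipeline stage runs.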

✏️ Configuration for Prometheus


Install Prometheus on the EC2 machine and add the scrape jobs for the endpoints to its configuration file:

 
  - job_name: "python_endpoint"
  
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:5000"]
      
  - job_name: "wmi_exporter"
  
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9182"]
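On the application side, the endpoint must serve the /metrics text format these scrape jobs expect. A minimal sketch using prometheus_client follows; the metric name is illustrative, and in the FastAPI app this payload would back a GET /metrics route:

```python
# Minimal sketch of instrumenting the Python endpoint for the
# "python_endpoint" scrape job above. The metric name is illustrative.
from prometheus_client import Counter, generate_latest

PREDICTIONS = Counter("pampa_predictions_total",
                      "Number of permeability predictions served")

def record_prediction() -> None:
    PREDICTIONS.inc()  # call this from the predict endpoint handler

def metrics_payload() -> bytes:
    # Response body for a GET /metrics route (Prometheus text format).
    return generate_latest()
```

Prometheus then scrapes this route on localhost:5000 using the default /metrics path and http scheme noted in the config comments.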

📉 Configuration for Grafana


Install Grafana; it runs on port 3000 by default. Add Prometheus as a data source and create a monitoring dashboard.

❄️ END

Feel free to improve this project and fix any issues you find, as nothing is perfect.