This ML project focused on binary classification, with the aim of predicting the permeability of compounds in a PAMPA assay from their SMILES strings. Compounds were classified as having either high permeability (1) or low-to-moderate permeability (0). The dataset consisted of SMILES strings and their corresponding permeability labels.
To identify the best-performing algorithm combination, the project evaluated 18 candidate algorithms, spanning traditional machine learning methods as well as PyG and DeepPurpose, two of the most widely used GNN frameworks in drug discovery. A comprehensive description of each step is provided in the project files. Performance was evaluated with standard metrics, including accuracy, precision, and recall.
The primary objective of the project was to develop a highly accurate model for predicting the permeability of compounds in a PAMPA assay, to serve as a tool for guiding drug discovery and development efforts. In the second stage, an end-to-end machine learning architecture was created, incorporating model training, testing, and operationalization, together with infrastructure and endpoint monitoring for the ADMET_PAMPA_NCATS dataset.
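As a minimal sketch of the data-loading step, assuming the dataset is a CSV with columns named `Drug` (the SMILES string) and `Y` (the binary permeability label) — both column names are assumptions to adjust against the actual file:

```python
import csv
import io

def load_pampa_labels(csv_text):
    """Parse a PAMPA CSV into (smiles, label) pairs.

    Assumes columns 'Drug' (SMILES) and 'Y' (1 = high permeability,
    0 = low-to-moderate) -- adjust the names to match the real file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["Drug"], int(row["Y"])) for row in reader]

# Tiny in-memory example (fictitious rows):
sample = "Drug,Y\nCCO,1\nc1ccccc1,0\n"
pairs = load_pampa_labels(sample)
# pairs == [("CCO", 1), ("c1ccccc1", 0)]
```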
Here is a summary of the information related to this database:
| PAMPA Dataset | Info |
|---|---|
| Dataset Description | PAMPA (parallel artificial membrane permeability assay) is a commonly employed assay to evaluate drug permeability across the cellular membrane. PAMPA is a non-cell-based, low-cost, and high-throughput alternative to cellular models. Although PAMPA does not model active and efflux transporters, it still provides permeability values that are useful for absorption prediction, because the majority of drugs are absorbed by passive diffusion through the membrane. |
| Dataset Statistics | NCATS set - 2035 compounds; Approved drugs set - 142 drugs. |
| References | [1] Siramshetty, V.B., Shah, P., et al. "Validating ADME QSAR Models Using Marketed Drugs." SLAS Discovery 2021 Dec;26(10):1326-1336. doi: 10.1177/24725552211017520. |
| Dataset License | Not Specified. CC BY 4.0. |
Here is a summary of the final results:
| Framework | Algorithm | F1 score | ROC-AUC | PR-AUC |
|---|---|---|---|---|
| XGBoost + scikit-learn | XGBClassifier_results | 92% | 0.57 | 0.85 |
| DeepPurpose | DGL_GIN_ContextPred_results | 91% | 0.76 | 0.92 |
| DeepPurpose | Transformer_results | 91% | 0.57 | 0.87 |
| PyG | GCN-GraphConv | 91.3% | 0.75 | 0.90 |
The metrics used for this project are F1 score, ROC-AUC, and PR-AUC. According to these results, DeepPurpose's DGL_GIN_ContextPred is the strongest overall performer (highest ROC-AUC and PR-AUC), with PyG's GCN-GraphConv close behind; the XGBClassifier and Transformer models reach comparable F1 scores but discriminate poorly between classes (ROC-AUC of only 0.57).
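For reference, the reported metrics can be computed without any ML library; a minimal sketch of F1 and ROC-AUC (the latter via the Mann-Whitney U formulation) on plain Python lists:

```python
def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 label lists (1 = high permeability)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive is scored
    above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice the project's frameworks report these via their own evaluation utilities; the functions above just make the definitions concrete.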
NOTE: All the files have been designed end-to-end to run on Colab + GPU
Goal: Create an end-to-end machine learning architecture that includes model training, testing, and operationalization, as well as infrastructure and endpoint monitoring.
1. Python
2. Shell scripting
3. AWS (cloud provider)
4. Prometheus and Grafana
5. FastAPI (for the endpoint)
6. S3 bucket (as feature store and model registry)
7. Jenkins (CI/CD tool)
```bash
conda create --prefix ./env python=3.9 -y
conda activate ./env
# OR
source activate ./env
pip install -r requirements.txt
```
1. Feature store: S3 bucket with a Lambda call on each put event
2. Model registry: S3 bucket with separate testing and production areas
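One possible layout for those two buckets can be sketched as small key-building helpers; the prefixes (`feature-store/`, `model-registry/testing/`, `model-registry/production/`) are illustrative assumptions, not names mandated by the project:

```python
from datetime import datetime, timezone

# Illustrative S3 key layout (prefix names are assumptions):
#   feature-store/<dataset>/<timestamp>.csv   -> put event triggers the Lambda
#   model-registry/testing/<model>.pkl        -> candidate models
#   model-registry/production/<model>.pkl     -> promoted models

def feature_key(dataset, ts=None):
    """Key for a new feature-store upload, timestamped in UTC."""
    ts = ts or datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"feature-store/{dataset}/{ts}.csv"

def model_key(name, stage="testing"):
    """Key for a model artifact in the testing or production area."""
    if stage not in ("testing", "production"):
        raise ValueError(f"unknown stage: {stage}")
    return f"model-registry/{stage}/{name}.pkl"
```

Keeping testing and production under distinct prefixes lets promotion be a simple S3 copy from one prefix to the other.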
To enable Jenkins to access your GitHub repository automatically whenever there is a push, install Jenkins on EC2 and set up a webhook. Once this is done, create three distinct jobs for training, testing, and deployment, each containing the corresponding jenkins-jobs-script. I've created separate scripts for all three jobs to help you get started.
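The three jobs could also share one parameterized script; a minimal sketch, where the `src/...` entry points are placeholders standing in for this repo's actual jenkins-jobs-scripts:

```shell
#!/usr/bin/env bash
# Sketch of a shared Jenkins job script. Stage names map to
# hypothetical entry points -- substitute the repo's real scripts.
job_cmd() {
  case "$1" in
    train)  echo "python src/train.py" ;;
    test)   echo "python src/evaluate.py" ;;
    deploy) echo "python src/deploy.py" ;;
    *)      echo "unknown stage: $1" >&2; return 1 ;;
  esac
}

# Each Jenkins job would then run, e.g.:  eval "$(job_cmd train)"
```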
Create a master pipeline that runs the train, test, and deploy jobs in sequence.
Create a Lambda trigger on the S3 feature store that is activated on a put event. Use the Python 3.7 runtime for the Lambda, as it already has the requests library available. The Lambda trigger remotely starts the master pipeline, which then executes all of the stages.
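A hedged sketch of such a handler, using only the standard library's `urllib` (so it works on any runtime); `JENKINS_URL`, the job name, and the token are placeholders, and Jenkins must have its remote build trigger enabled for the job:

```python
import urllib.parse
import urllib.request

# Placeholders -- substitute your Jenkins host, job name, and token.
JENKINS_URL = "http://JENKINS_HOST:8080/job/master-pipeline/build"
JENKINS_TOKEN = "REPLACE_ME"

def build_trigger_url(event):
    """Extract bucket/key from the S3 put event and build the
    remote-trigger URL for the master pipeline."""
    record = event["Records"][0]["s3"]
    params = urllib.parse.urlencode({
        "token": JENKINS_TOKEN,
        "cause": f"s3://{record['bucket']['name']}/{record['object']['key']}",
    })
    return f"{JENKINS_URL}?{params}"

def lambda_handler(event, context):
    url = build_trigger_url(event)
    with urllib.request.urlopen(url) as resp:  # fires the pipeline
        return {"statusCode": resp.status}
```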
Maintain a configuration file. Changes are required in:
- Feature-Store
- Preprocessed dataset
- Model Registry
- Email Params
  - Put your Gmail application key here, otherwise sending will fail with an error
- Ml_Model_params
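A hypothetical shape for that configuration file; every path and value below is a placeholder, not the project's actual config:

```yaml
feature_store:
  bucket: my-feature-store-bucket          # placeholder
  prefix: feature-store/pampa/
preprocessed_dataset:
  path: data/preprocessed/pampa.csv
model_registry:
  bucket: my-model-registry-bucket
  testing_prefix: model-registry/testing/
  production_prefix: model-registry/production/
email_params:
  sender: you@gmail.com
  app_password: <gmail-application-key>    # required, or sending fails
ml_model_params:
  n_estimators: 300
  max_depth: 6
```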
Install Prometheus on an EC2 machine. In its configuration file, add scrape jobs pointing at the endpoints:
```yaml
- job_name: "python_endpoint"
  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.
  static_configs:
    - targets: ["localhost:5000"]

- job_name: "wmi_exporter"
  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.
  static_configs:
    - targets: ["localhost:9182"]
```
Install Grafana; it runs on port 3000 by default. Add Prometheus as a data source and create a monitoring dashboard.
Feel free to improve this project and fix any issues you find, as nothing is perfect.