This repository serves two purposes:
If you are using the code or data from this project, please do not forget to reference our paper. If you have any questions regarding the data, the code, or the paper, feel free to contact the Systems Biology of Aging Group or open an issue on GitHub.
This figure illustrates the core elements of the Cross-Species ML pipeline:
To download and prepare the indexes of reference genomes and transcriptomes, the species-notebooks repository can be used.

For RNA-Seq processing of the samples, the quantification pipeline can be used.

To upload Compara orthology data, as well as the quantified data of our samples, to the GraphDB database, the species-notebooks repository can be used.
To reproduce the stage I and II models, the current yspecies repository can be used (see documentation below). There are dedicated notebooks devoted to those stages:

Linear models are implemented in the cross-species-linear-models repository. Bayesian network analysis and multilevel Bayesian linear modelling are available in the bayesian_networks_and_bayesian_linear_modeling repository.

At the same time, the results of both of these models can be pulled by DVC into the current yspecies repository.
To generate a ranked table, the current yspecies repository can be used (see documentation below). There is a dedicated results_intersections notebook devoted to generating ranked tables.

To reproduce this stage, you can use the stage_three_shap_selection notebook in the notebooks folder.
The data folder keeps the input, interim and output data.

Before you start running anything, do not forget to dvc pull the data, and after committing do not forget to dvc push it!
The pipeline is run by executing DVC stages (see the dvc.yaml file).

Most of the analysis is written in Jupyter notebooks in the notebooks folder.

Each stage runs (and version-controls the inputs and outputs of) the corresponding notebooks using papermill, which also stores the executed notebooks in data/notebooks.
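As an illustration of how such a stage can be wired together (the stage name, paths and dependencies below are hypothetical, not copied from the project's dvc.yaml), a papermill-driven DVC stage might look like:

```yaml
stages:
  select_samples:                       # hypothetical stage name
    cmd: papermill notebooks/select_samples.ipynb data/notebooks/select_samples.ipynb
    deps:
      - notebooks/select_samples.ipynb
      - data/input                      # assumed input location
    outs:
      - data/notebooks/select_samples.ipynb   # executed copy stored by papermill
      - data/interim                    # assumed stage output
```

With this layout, dvc repro re-executes a notebook only when its declared dependencies change, and the executed copy in data/notebooks is itself tracked as an output.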
First, you have to create a Conda or Micromamba environment for the project. Micromamba is a superior alternative to Conda with a very similar API.
To create and activate the environment, run:

```bash
micromamba create --file environment.yaml
micromamba activate yspecies
```
If any errors occur during setup, please read the known issues at the bottom of README.md. If the problem is not mentioned there, feel free to open a GitHub issue.
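For orientation, a conda/micromamba environment file generally follows the sketch below. The package list here is only an assumption based on the tools this README mentions (Jupyter, DVC, papermill, scikit-learn, SHAP); the repository's actual environment.yaml is the source of truth:

```yaml
# Illustrative sketch only - see the real environment.yaml in the repository.
name: yspecies
channels:
  - conda-forge
dependencies:
  - python
  - jupyterlab    # for running the notebooks
  - dvc           # data and pipeline versioning
  - papermill     # notebook execution inside DVC stages
  - scikit-learn  # Pipelines the workflow is built on
  - shap          # SHAP-based feature selection
```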
Then you have to pull the data with DVC. To do this, activate the yspecies environment and run:

```bash
dvc pull
```

NOTE: we keep the data on Google Drive, so on the first run of dvc pull it may give you a link asking you to grant access to your Google Drive in order to download the project data.
We are grateful to @shcheklein and @dmpetrov for their help with the DVC configuration.
After authentication, you can run any of the pipelines with:

```bash
dvc repro
```

or run the Jupyter notebooks to explore them on your own (see the running notebooks section).
DVC stages are defined in the dvc.yaml file. To run a single stage, use:

```bash
dvc repro <stage_name>
```
Most of the stages also produce executed notebooks together with the files in their output.
There are several key notebooks in the project. All notebooks can be run either from Jupyter (by jupyter lab notebooks) or from the command line by dvc repro.
You can run notebooks manually by activating the yspecies environment and running:

```bash
jupyter lab notebooks
```

and then opening the notebook of your choice. However, keep in mind that the notebooks depend on each other. In particular, the select_samples notebook generates the data for all the others.
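The "run in the right order" constraint can be made explicit with a topological sort. In the sketch below only the fact that select_samples feeds all other notebooks comes from this README; the remaining edges are illustrative assumptions:

```python
# Hypothetical sketch: deriving a safe execution order for the notebooks.
# Only the select_samples -> everything-else dependency is documented;
# the other edges here are illustrative.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

dependencies = {
    "select_samples": set(),
    "stage_three_shap_selection": {"select_samples"},
    "results_intersections": {"select_samples", "stage_three_shap_selection"},
}

# static_order() yields notebooks so that each one's dependencies come first.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # select_samples is always first
```

Running the notebooks in this order (manually or via dvc repro, which tracks the same dependencies in dvc.yaml) avoids reading data that has not been generated yet.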
Most of the code is packed into classes. The workflow is built on top of scikit-learn Pipelines. For an in-depth description of the pipeline, read the Cross-Species paper.
The yspecies package has the following modules:

The code in the yspecies folder is a conda package that is used inside the notebooks. There is also an option to use a conda version of the package.
One of the key classes is the ExpressionDataset class:

```python
e = ExpressionDataset("5_tissues", expressions, genes, samples)
e
```
It allows indexing by genes:

```python
e[["ENSG00000073921", "ENSG00000139687"]]
# or
e.by_genes[["ENSG00000073921", "ENSG00000139687"]]
```
By samples:

```python
e.by_samples[["SRR2308103", "SRR1981979"]]
```

Or both at once:

```python
e[["ENSG00000073921", "ENSG00000139687"], ["SRR2308103", "SRR1981979"]]
```
The ExpressionDataset class has by_genes and by_samples properties which allow indexing and filtering. For instance, keeping only blood samples:

```python
e.by_samples.filter(lambda s: s["tissue"] == "Blood")
```
The class is also Jupyter-friendly, with a _repr_html_() method implemented.
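The two-axis indexing idea can be condensed into a small self-contained sketch. This is not the real yspecies implementation (which exposes by_genes/by_samples as indexable properties); it is a simplified method-based illustration of the same pattern:

```python
# Hypothetical sketch of ExpressionDataset-style two-axis subsetting:
# an expression matrix keyed by sample and gene identifiers, with
# views that restrict either axis and can be chained.
class MiniExpressionDataset:
    def __init__(self, name, expressions):
        # expressions: {sample_id: {gene_id: value}}
        self.name = name
        self.expressions = expressions

    def by_genes(self, genes):
        # keep only the requested genes in every sample row
        return MiniExpressionDataset(
            self.name,
            {s: {g: v for g, v in row.items() if g in genes}
             for s, row in self.expressions.items()},
        )

    def by_samples(self, samples):
        # keep only the requested sample rows
        return MiniExpressionDataset(
            self.name,
            {s: row for s, row in self.expressions.items() if s in samples},
        )

# Identifiers below are taken from the examples above; the values are made up.
data = MiniExpressionDataset("demo", {
    "SRR2308103": {"ENSG00000073921": 1.5, "ENSG00000139687": 0.2},
    "SRR1981979": {"ENSG00000073921": 2.1, "ENSG00000139687": 0.9},
})
subset = data.by_samples(["SRR2308103"]).by_genes(["ENSG00000073921"])
print(subset.expressions)  # -> {'SRR2308103': {'ENSG00000073921': 1.5}}
```

Each view returns a new dataset object, which is what makes the chained by_samples(...).by_genes(...) style possible.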
Key logic from the start of the pipeline until the partitioning of the data according to sorted stratification
Classes with data:
Transformers:
This module is responsible for SHAP-based selection
Classes with data:
Auxiliary classes:
Transformers:
Module that contains final results
Here we list workarounds for some typical problems encountered when running the repository:
Such an error emerges when g++ is not installed. The workaround is simple:

```bash
sudo apt install g++
```
Failures to download files: if one or more files were not downloaded, re-run dvc pull.
Windows- and Mac-specific errors:

Even though yspecies seems to work on Mac and Windows, we used Linux as our main operating system and did not test it thoroughly on Windows and Mac, so feel free to report any issues with them.