Welcome to the code repository for the project conducted under Data Science for Social Good UK 2022 (DSSGx UK) for our partner, Buckinghamshire Council. The repository will focus on documenting:
@Vanshika - this should be written out in full sentences and explained
Assuming that the data provided by the user are of the following types:
In addition, we want to allow data on characteristics and ks2. These data types are not supported yet, but they would fill in features that we pass to the model for training.
Points to remember:

1. Currently columns are renamed to `snake_case`. You may need to add more columns to the renaming dictionary if your columns have changed or are different. You can find the renaming dictionary in the `src` directory in the `[TYPE]_utils.py` file, where `[TYPE]` refers to whatever your data type is. Note that ks2 and characteristics will not currently show up in there.
2. We assume that the CCIS datasets have a `month/year of birth` column with numeric month and year values in the form `[MONTH]/[YEAR]`. We don't use the date of birth column. You can safely remove it if there is concern about data sensitivity.

This part will change slightly depending on what operating system you are using.
@Vanshika - please rewrite directions in backtick bash block (see below)
> cd # Navigate to the desired parent directory using this (change directory) command.
> ls # You can run this command to see the contents of a directory
> git clone https://github.com/DSSGxUK/s22_buckinghamshire.git # This will download the code repository to the current folder
> cd s22_buckinghamshire # This navigates to the repository folder
> python -m venv venv # Create a Python virtual environment named venv
> .\venv\Scripts\activate # Activate the virtual environment
# If you get an error that says '... s22_buckinghamshire\venv\Scripts\Activate.ps1'
# cannot be loaded because running scripts is disabled on this system. For more information,
# see about_Execution_Policies at 'https:/go.microsoft.com/fwlink/?LinkID=135170',
# then we need to enable execution of signed scripts.
# We can do this by running 'Set-ExecutionPolicy RemoteSigned -Scope CurrentUser'.
> python.exe -m pip install --upgrade pip # Update pip if necessary
> pip install -r .\requirements.txt # Install required python packages
We've published synthetic data (data that does not come from any real person) to DagsHub so you can experiment with the pipeline. To retrieve it, run the following commands:
dvc remote add origin https://dagshub.com/abhmul/s22_buckinghamshire.dvc
dvc pull -r origin
Please follow the steps below before running the workflows:
cd .\scripts\
@Vanshika - please write directions like below
To run the whole pipeline you can just run:
dvc repro
Alternatively, you could run the individual steps:
# Generate datasets for modelling
dvc repro -s --glob generate_modeling_*
# Run cross validation and hyper parameter search
dvc repro -s --glob cv_*
# Model Evaluation
dvc repro -s --glob evaluate_model_*
# Generate datasets for predictions and final output
dvc repro -s --glob prediction_*
# Generate datasets for modelling
dvc repro -s --glob generate_modeling_*
# Model Evaluation
dvc repro -s --glob evaluate_model_*
# Generate datasets for predictions and final output
dvc repro --glob prediction_*
# Generate datasets for modelling
dvc repro -s --glob generate_modeling_*
# Retrain model
dvc repro --glob retrain_*
# Model Evaluation
dvc repro -s --glob evaluate_model_*
# Generate datasets for predictions and final output
dvc repro --glob prediction_*
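If you prefer to drive the individual stages from Python rather than typing each command, the following is a minimal sketch of one way to do it. It only wraps the `dvc repro -s --glob ...` commands shown above. The function names `build_repro_command` and `run_stages` are hypothetical and not part of this repository, and the sketch assumes every stage is run with `-s`, which is not the case for every command listed above. Passing the arguments as a list also keeps the shell from expanding the `*` in the stage glob before DVC sees it.

```python
import subprocess
from typing import Callable, List, Sequence


def build_repro_command(stage_glob: str, single_stage: bool = True) -> List[str]:
    """Build the argument list for one `dvc repro` call.

    `single_stage=True` adds `-s`, matching most of the commands in this README.
    """
    command = ["dvc", "repro"]
    if single_stage:
        command.append("-s")
    command += ["--glob", stage_glob]
    return command


def run_stages(stage_globs: Sequence[str], runner: Callable = subprocess.run) -> None:
    """Run each stage glob in order, stopping at the first failing command."""
    for stage_glob in stage_globs:
        # check=True raises CalledProcessError if dvc exits with a non-zero status.
        runner(build_repro_command(stage_glob), check=True)


# Stage globs taken from the first workflow listed above.
FULL_WORKFLOW = [
    "generate_modeling_*",
    "cv_*",
    "evaluate_model_*",
    "prediction_*",
]
```

Calling `run_stages(FULL_WORKFLOW)` from the directory where you would normally run `dvc repro` executes the four stages in sequence. The `runner` parameter exists only so the function can be exercised without DVC installed.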
Below is a brief overview of what each stage within a workflow is doing:
Generate datasets for modelling
Run cross validation and hyper parameter search
Retrain Model
Model Evaluation
Generate datasets for predictions and final output
@Vanshika - we need directions on how to rerun the hyperparameter search from a checkpoint. To do this, the user has to change the LOAD_CHECKPOINTS value in the params.yaml file to true. When they rerun the pipeline with new data, they should set this to false otherwise it will use an old checkpoint for new data.
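As an illustration of the checkpoint switch described above, here is a minimal sketch that flips `LOAD_CHECKPOINTS` in `params.yaml` without reformatting the rest of the file. It assumes the setting appears on a single line of the form `LOAD_CHECKPOINTS: true` or `LOAD_CHECKPOINTS: false`. The actual layout of `params.yaml` is not shown here, so treat that as an assumption. The helper names are hypothetical.

```python
import re
from pathlib import Path

# Matches a line such as "LOAD_CHECKPOINTS: false  # comment" (assumed layout).
_LOAD_CHECKPOINTS_LINE = re.compile(
    r"^([ \t]*LOAD_CHECKPOINTS[ \t]*:[ \t]*)\S+(.*)$", re.MULTILINE
)


def set_load_checkpoints(params_text: str, enabled: bool) -> str:
    """Return the params.yaml text with LOAD_CHECKPOINTS set to true or false."""
    new_value = "true" if enabled else "false"
    updated, count = _LOAD_CHECKPOINTS_LINE.subn(
        lambda match: f"{match.group(1)}{new_value}{match.group(2)}", params_text
    )
    if count == 0:
        raise KeyError("LOAD_CHECKPOINTS was not found in the given text")
    return updated


def update_params_file(path: str, enabled: bool) -> None:
    """Rewrite the file at `path` in place with the new LOAD_CHECKPOINTS value."""
    params_path = Path(path)
    params_path.write_text(set_load_checkpoints(params_path.read_text(), enabled))
```

For example, `update_params_file("params.yaml", True)` would enable resuming the hyperparameter search from a checkpoint, and `update_params_file("params.yaml", False)` should be run before rerunning the pipeline on new data so that an old checkpoint is not reused.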
The Measures table (named `Measures_table`) contains some measured values that we need to display in the Power BI visualisations. New measures are easy to create in Power BI. You will need to implement these measures (name and formula are given):
We also need to create a few new columns for Power BI. These are listed below along with their formulas:
Column Name: MentalHealthFlag
Column Name: Age
Column Name: Attendance%
Column Name: Gender
Column Name: Parent/Preg%
Column Name: Gender
NOTE: replace fake_test_dataset with the actual file name which contains the predictions
@Vanshika this should not be in the final product
Some initial notes as we move towards a README:
In addition, we want to allow data on characteristics and ks2. This is not supported yet, but it would fill in features we are passing to the model for training.

2. If files come in as Excel (.xlsx), they must be one sheet per file. We do not support multiple sheets per Excel file. It may just be easier to export your files directly as CSV.
3. Currently columns are renamed to `snake_case`. You may need to add more columns to the renaming dictionary if your columns have changed or are different. You can find the renaming dictionary in the `src` directory in the `[TYPE]_utils.py` file, where `[TYPE]` refers to whatever your data type is. Note that ks2 and characteristics will not currently show up in there.
4. We assume that the CCIS datasets have a `month/year of birth` column with numeric month and year values in the form `[MONTH]/[YEAR]`. We don't use the date of birth column. You can safely remove it if there is concern about data sensitivity.

NOTE: No column names can include `__`! This is a reserved character sequence in our code.
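The column rules above can be checked before data is handed to the pipeline. The following is a minimal sketch, not the repository's own implementation. The helpers `to_snake_case`, `find_reserved_columns`, and `parse_month_year` are hypothetical names, and the real renaming is driven by the dictionary in `src/[TYPE]_utils.py`, which may map names differently.

```python
import re
from typing import Iterable, List, Tuple


def to_snake_case(name: str) -> str:
    """Lower-case a column name and join its words with single underscores."""
    # Insert an underscore where a lower-case letter or digit meets an upper-case letter.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore,
    # so the result can never contain the reserved "__" sequence.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def find_reserved_columns(columns: Iterable[str]) -> List[str]:
    """Return the column names that contain the reserved "__" sequence."""
    return [column for column in columns if "__" in column]


def parse_month_year(value: str) -> Tuple[int, int]:
    """Parse a "[MONTH]/[YEAR]" string with numeric parts into (month, year)."""
    month_text, year_text = value.strip().split("/")
    month, year = int(month_text), int(year_text)
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    return month, year
```

A quick pre-flight check could call `find_reserved_columns` on the header row of each file and stop if the returned list is not empty, then run `parse_month_year` over the `month/year of birth` values to catch entries that are not in numeric `[MONTH]/[YEAR]` form.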
TODO for presentation