Welcome to the code repository for the project conducted under Data Science for Social Good UK 2022 (DSSGx UK) for our partner, Buckinghamshire Council. The aim of the project was to identify students, before Year 11, at risk of becoming NEET (not in education, employment, or training) after they complete their GCSEs. This readme focuses on documenting the repository structure, the data requirements, how to set up and run the pipeline, and the outputs it produces.
```
s22_buckinghamshire
├── .dvc
│   ├── .gitignore
│   └── .config                      # links to the dagshub repo
├── communications
│   ├── upn-different-activity-codes.md
│   └── upns_with_questions.md
├── data
│   ├── interim                      # stores canonicalised, annotated and merged csv files
│   │   ├── attendance_canonicalized_csv
│   │   │   └── .gitignore
│   │   ├── ccis_canonicalized_csv
│   │   │   └── .gitignore
│   │   ├── census_canonicalized_csv
│   │   │   └── .gitignore
│   │   ├── ks4_canonicalized_csv
│   │   │   └── .gitignore
│   │   └── .gitignore
│   ├── processed
│   │   └── .gitignore
│   ├── raw                          # stores the original files; won't appear until you pull synthetic data
│   │   ├── attendance_original_csv
│   │   ├── ccis_original_csv
│   │   ├── census_original_csv
│   │   └── ks4_original_csv
│   ├── raw.dvc
│   └── .gitignore
├── example_data                     # example data to check schema
│   └── raw
│       ├── secondary_schools_original.csv
│       ├── attendance_original_csv
│       │   ├── attendance_original_jan15.csv
│       │   ├── attendance_original_jan22.csv
│       │   ├── attendance_original_may15.csv
│       │   ├── attendance_original_may22.csv
│       │   ├── attendance_original_oct15.csv
│       │   └── attendance_original_oct21.csv
│       ├── ccis_original_csv
│       │   ├── ccis_original_mar16.csv
│       │   └── ccis_original_mar22.csv
│       ├── census_original_csv
│       │   ├── census_original_jan17.csv
│       │   └── census_original_jan22.csv
│       ├── characteristics_original_csv
│       │   └── characteristics_original_mar22.csv
│       ├── ks2_original_csv
│       │   └── ks2_original_sep20.csv
│       └── ks4_original_csv
│           ├── ks4_original_sep15.csv
│           └── ks4_original_sep20.csv
├── logs
│   └── .gitignore
├── metrics                          # contains metrics and results-related values
│   ├── lgbm1_single.csv             # we've kept this one preloaded
│   └── lgbm2_single.csv             # gets created when you run the hyperparam search
├── models
│   ├── final                        # final model for prediction
│   │   └── model_single.pkl         # gets created when you retrain the model
│   └── interim
├── notebooks
│   ├── convert_synthetic.ipynb
│   └── view_csv.ipynb
├── plots                            # stores different plots and charts
│   ├── attendance_percent_box_plot.png
│   ├── common_neet_traces.png
│   ├── consequent_antecedent.png
│   ├── lgb1_feature_importance.png
│   ├── neet_cons_ante.png
│   ├── neet_infrequent_traces.png
│   ├── neet_process_map_97.html
│   ├── not_known_common_traces.png
│   ├── not_known_proess_map_98.html
│   ├── notknown_consequent_antecedent.png
│   ├── pipline.dot
│   ├── pipline.dot.svg
│   ├── pipline.md
│   ├── pipline.png
│   ├── process_map.html
│   ├── roni_catch_model_miss_feature_importance.png
│   ├── sankey.png
│   ├── sankey_debug.png
│   ├── sankey_eet.png
│   ├── sankey_full.png
│   ├── sankey_neet.png
│   ├── sankey_unknown.png
│   └── unknown_infrequent_traces.png
├── results
│   └── .gitignore
├── scripts                          # python code files for different purposes
│   ├── data
│   │   ├── additional_data.py
│   │   ├── annotate_attendance_data.py
│   │   ├── annotate_census_data.py
│   │   ├── annotate_ks4_data.py
│   │   ├── annotate_neet_data.py
│   │   ├── attendance_premerge.py
│   │   ├── build_ccis_ks_eda_dataset.py
│   │   ├── canonicalize_data.py
│   │   ├── census_premerge.py
│   │   ├── feature_selection.py
│   │   ├── feature_selection_for_predictions.py
│   │   ├── ks2_filter.py
│   │   ├── merge_data.py
│   │   ├── merge_multi_upn.py
│   │   ├── multi_upn_categorical.py
│   │   ├── multiple_to_single.py
│   │   ├── neet_premerge.py
│   │   ├── split_covid_years.py
│   │   └── xl_to_csv.py
│   ├── misc
│   │   ├── .Rhistory
│   │   ├── bupar-analysis.r
│   │   ├── compute_intersections.py
│   │   └── plot_sankey.py
│   ├── model
│   │   ├── optimization_and_cv.py
│   │   ├── predict.py
│   │   ├── retrain.py
│   │   ├── roni_tool.py
│   │   ├── split_data.py
│   │   └── test.py
│   ├── dvc.lock
│   ├── dvc.yaml
│   ├── generate_params.py
│   └── params.yaml
├── src
│   ├── constants
│   │   ├── __init__.py
│   │   ├── attendance_constants.py
│   │   ├── ccis_constants.py
│   │   ├── census_constants.py
│   │   ├── ks_constants.py
│   │   ├── modeling_constants.py
│   │   ├── school_info_constants.py
│   │   ├── script_argument_constants.py
│   │   └── shared_constants.py
│   ├── cv
│   │   ├── __init__.py
│   │   ├── cross_validator.py
│   │   ├── processing.py
│   │   ├── search_spaces.py
│   │   └── utils.py
│   ├── params
│   │   ├── __init__.py
│   │   ├── data_pipeline_arguments.py
│   │   ├── data_pipeline_params.py
│   │   ├── filepaths.py
│   │   ├── model_pipeline_arguments.py
│   │   └── model_pipeline_params.py
│   ├── aggregation_utils.py
│   ├── attendance_utils.py
│   ├── ccis_utils.py
│   ├── data_utils.py
│   ├── debug_utils.py
│   ├── error_utils.py
│   ├── file_utils.py
│   ├── log_utils.py
│   ├── merge_utils.py
│   ├── py_utils.py
│   └── roni.py
├── tests
│   ├── scripts/data
│   │   ├── test_annotate_neet_data.py
│   │   └── test_split_data.py
│   ├── src
│   │   ├── test_cv_utils.py
│   │   ├── test_data_utils.py
│   │   └── test_file_utils.py
│   └── .gitignore
├── .dvcignore
├── .gitignore
├── README.md
├── requirements.R
├── requirements.txt
└── setup.py
```
`data`
: This folder contains two sub-folders: `interim` and `raw`. After running the pipeline, an additional `processed` folder will also be present. The original dataset files are stored in their dataset sub-folder within the `raw` folder; e.g. `raw/attendance_original_csv` will contain the original csv files for the attendance datasets. These original files go through the data pipeline, which generates additional files that are canonicalized (standardised formatting), annotated, and merged across years; these are stored in the `interim` sub-folder. The `processed` sub-folder will contain the final datasets ready to be used for modeling.

`metrics`
: This folder contains outputs from the hyperparameter search (`lgbm1_single.csv`), roni tool performance results (`roni_test_results.csv`), and our model performance results (`single_test_results.csv`) on the test dataset.

`models`
: This folder contains pickle files of the models. There are two sub-folders: `interim` and `final`. `interim` holds the checkpoints; you can find more details about these in the Reloading the hyperparameter search from a checkpoint section. The final, retrained best model can be found in `models/final/model_single.pkl`.

`results`
: After running the pipeline, this folder will contain the final output CSV files: `predictions.csv`, `unknown_predictions.csv`, `unidentified_students_single.csv`, and `unidentified_unknowns_single.csv`. These files are outlined in more detail below under Outputs from running the pipeline.

`scripts`
: This folder contains the `dvc.yaml` file that outlines the different stages and steps of the pipeline. It also includes two main sub-folders: `data` and `model`. The `data` sub-folder contains python scripts that prepare the interim and final datasets used for modeling. The `model` sub-folder contains scripts that split the final dataset into train and test datasets, run the cross-validation and hyperparameter search, re-train the model, calculate roni scores, and use the trained model to generate predictions for current/unknown students.

`src`
: This folder contains helper functions (found in the `*_utils.py` scripts) as well as scripts that set various parameters. There are three sub-folders: `constants`, `cv`, and `params`. The `cv` sub-folder contains helper functions for the cross-validation and hyperparameter search stage, along with dictionaries of the hyperparameter search spaces in `search_spaces.py`. The `constants` folder contains pipeline parameters that are unlikely to need changing, whereas the `params` sub-folder contains parameters you may need or want to change. The `*_arguments.py` scripts in this sub-folder include the arguments that are sent to the `dvc.yaml` pipeline.
We assume that the data provided by the user are of the following types: attendance, CCIS, census, and KS4. In addition, we allow data on characteristics and KS2, since these can be used as features for the model before the student enters Year 11.
Points to remember:

- Column names should be in `snake_case` (lowercase, with spaces replaced by `_`). We suggest you first try to make your column names match the schemas in the `example_data/raw` folder. If you instead want to change the columns the code can process, you'll need to add entries to the renaming dictionary, found in the `src/constants/[TYPE]_constants.py` file, where `[TYPE]` refers to your data type (see the sketch after this list). The KS2 renaming dictionary is in `src/constants/ks_constants.py`, and the characteristics renaming dictionary is the same as the CCIS renaming dictionary in `src/constants/ccis_constants.py`. After renaming columns, please run `python ./generate_params.py` from your `scripts` folder. See Changing Any Other Parameters for more details on how to do this.
- The data should have a `month/year of birth` column with numeric month and year values in the form `[MONTH]/[YEAR]`. We don't use the date of birth column; you can safely remove it if there is concern about data sensitivity.
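For illustration, here is a minimal sketch of what a renaming-dictionary entry might look like. The dictionary name `CENSUS_COLUMN_RENAME` and the example column names are invented for this sketch; check the actual dictionary in the relevant `src/constants/[TYPE]_constants.py` file for the real names.

```python
# Hypothetical sketch of a renaming dictionary in src/constants/census_constants.py.
# Keys are raw column names as they appear in your csv; values are the canonical
# snake_case names the pipeline expects. These entries are illustrative only.
CENSUS_COLUMN_RENAME = {
    "Forename": "forename",
    "EnrolStatus": "enrol_status",
    "Free School Meals": "free_school_meals",  # spaces become underscores
}
```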
This part will change slightly depending on which operating system you are using.
**Windows (PowerShell):**

```powershell
> cd [DESIRED_PARENT_DIRECTORY]                    # Navigate to the desired parent directory using
                                                   # this (change directory) command. Fill in
                                                   # [DESIRED_PARENT_DIRECTORY] with your desired
                                                   # directory.
> ls                                               # You can run this command to see the contents
                                                   # of a directory
> git clone https://github.com/DSSGxUK/s22_buckinghamshire.git
                                                   # This will download the code repository to the
                                                   # current folder
> cd s22_buckinghamshire                           # This navigates to the repository folder
> python -m venv venv                              # Create a python virtual environment
> .\venv\Scripts\activate                          # Activate the virtual environment.
                                                   # If you get an error that says '...
                                                   # s22_buckinghamshire\venv\Scripts\Activate.ps1'
                                                   # cannot be loaded because running scripts is
                                                   # disabled on this system (see
                                                   # about_Execution_Policies at
                                                   # https://go.microsoft.com/fwlink/?LinkID=135170),
                                                   # then you need to enable execution of signed
                                                   # scripts, as below.
# Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
                                                   # Run the command above without the leading #.
                                                   # You may need administrator privileges to do
                                                   # this.
> python.exe -m pip install --upgrade pip          # Update pip if necessary
> python.exe -m pip install -r .\requirements.txt  # Install required python packages
> dvc config --system core.analytics false         # Turn off DVC anonymized analytics
```
**Linux/macOS:**

```bash
> cd [DESIRED_PARENT_DIRECTORY]             # Navigate to the desired parent directory using
                                            # this (change directory) command. Fill in
                                            # [DESIRED_PARENT_DIRECTORY] with your desired
                                            # directory.
> ls                                        # You can run this command to see the contents
                                            # of a directory
> git clone https://github.com/DSSGxUK/s22_buckinghamshire.git
                                            # This will download the code repository to the
                                            # current folder
> cd s22_buckinghamshire                    # This navigates to the repository folder
> python -m venv venv                       # Create a python virtual environment
> source ./venv/bin/activate                # Activate the virtual environment
> pip install --upgrade pip                 # Update pip if necessary
> pip install -r requirements.txt           # Install required python packages
> dvc config --system core.analytics false  # Turn off DVC anonymized analytics
```
We've published synthetic data to DagsHub so you can play around with the pipeline. This data is randomly generated, and any unique IDs are random strings. To retrieve it, please run the following:

```bash
dvc remote add origin https://dagshub.com/abhmul/s22_buckinghamshire.dvc -f
dvc pull -r origin
```
If you are a council with your own data, these datasets will need to be saved in the `data/raw` directory as csv files in the correct formats with the correct column names. Before adding your data, please run the steps above to get the synthetic data. Then run the following steps:
**Windows (PowerShell):**

```powershell
cd data/raw
# dvc remote remove origin                                 # Run this (without the leading #) if you
                                                           # downloaded the synthetic data
# Get-ChildItem * -Include *.csv -Recurse | Remove-Item    # Run this (without the leading #) to remove
                                                           # any synthetic data you have. Please note
                                                           # this deletes all csv files in data/raw
Get-ChildItem * -Include *.csv.dvc -Recurse | Remove-Item  # This deletes the dvc tracking files for
                                                           # the synthetic data
```
**Linux/macOS:**

```bash
cd data/raw
# dvc remote remove origin        # Run this (without the leading #) if you downloaded the synthetic data
# rm **/*.csv && rm *.csv         # Run this (without the leading #) to remove any synthetic data you have.
                                  # Please note this deletes all csv files in data/raw
rm **/*.csv.dvc && rm *.csv.dvc   # This deletes the dvc tracking files for the synthetic data
```
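If you'd rather avoid shell differences between operating systems, the following is a minimal cross-platform Python sketch that performs the same cleanup. It is not part of the repo; run it from the repository root and double-check the paths first, since it deletes files:

```python
from pathlib import Path

# Delete all synthetic csv files and their dvc tracking files under data/raw.
# WARNING: this removes every .csv and .csv.dvc file in that tree.
raw = Path("data/raw")
for pattern in ("*.csv", "*.csv.dvc"):
    for f in sorted(raw.rglob(pattern)):
        print(f"deleting {f}")
        f.unlink()
```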
For an example of what the schema of the datasets and folder structure should look like, we've kept snippets of synthetic data for you to compare against in the `example_data/raw` folder.

Within the `data/raw` directory are 6 folders that correspond to the different datasets listed above under Assumptions:

- `attendance_original_csv`
- `ccis_original_csv`
- `census_original_csv`
- `ks4_original_csv`
- `characteristics_original_csv`
- `ks2_original_csv`
The datasets in these directories should be named `[TYPE]_original_[DATE].csv`, where `[TYPE]` refers to the dataset (attendance, ccis, census, ks4, characteristics, ks2) and `[DATE]` refers to the month and year the dataset was submitted (e.g. `attendance_original_jan21.csv` corresponds to autumn 2021 attendance data, which is submitted in January). `[DATE]` should be written as the first 3 letters of the month and the last 2 digits of the year, e.g. `jan21`, `sep19`. A sketch of this convention as a regular expression follows.
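To make the convention concrete, here is a hypothetical helper (not part of the repo) that checks a file name against the `[TYPE]_original_[DATE].csv` pattern:

```python
import re

# Checks a file name against the [TYPE]_original_[DATE].csv convention
# described above. This helper is illustrative only and does not exist
# in the repository.
FILENAME_RE = re.compile(
    r"^(attendance|ccis|census|ks4|characteristics|ks2)"       # [TYPE]
    r"_original_"
    r"(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d{2}"  # [DATE], e.g. jan21
    r"\.csv$"
)

assert FILENAME_RE.match("attendance_original_jan21.csv")
assert FILENAME_RE.match("ks2_original_sep20.csv")
assert FILENAME_RE.match("census_original_2021-01.csv") is None  # wrong date format
```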
CSV files in `characteristics_original_csv` and `ks2_original_csv` contain columns from the CCIS and KS4 datasets, respectively, and should be populated with data from current Year 7-10 students that we want to generate predictions for. These are kept as separate datasets because current Year 7-10 students are not present in the CCIS and KS4 datasets until Year 11 onwards, yet these datasets contain important features that will improve the model's performance.
In addition, you should add a csv file called `data/raw/secondary_schools_original.csv`, so the code knows which schools the establishment numbers in the data correspond to. See the file in `example_data/raw` for how your csv should look; a quick way to check yours against that example is sketched below.
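The following is a minimal sketch (assuming pandas is available in your environment) that compares the columns of your file against the example. The example file, not this snippet, is the source of truth for the schema:

```python
import pandas as pd

# Compare the columns of your secondary schools file against the example
# shipped in example_data/raw. This only checks column names, not contents.
example = pd.read_csv("example_data/raw/secondary_schools_original.csv")
yours = pd.read_csv("data/raw/secondary_schools_original.csv")

missing = set(example.columns) - set(yours.columns)
extra = set(yours.columns) - set(example.columns)
print(f"Missing columns: {sorted(missing)}")
print(f"Unexpected columns: {sorted(extra)}")
```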
Once you've added your data to the `data/raw` folder, you should be good to go. You may want to incorporate more data from later years as you collect it; simply follow the procedure outlined above, and the pipeline will pick it up.

We currently do not support the addition of new columns for modeling. The code should work fine if you add new columns, but it will not use them in modeling.
Please follow the below steps before running the workflows:

```bash
cd scripts
```
Running the whole pipeline includes a hyperparameter search, which can take a few hours to complete. If you do not wish to run this stage, please follow the instructions under Output predictions on new data without re-running the hyperparameter search.
To run the whole pipeline you can run:

```bash
dvc repro
```
Alternatively, you could run the steps individually:

```bash
# Generate datasets for modelling
dvc repro -s --glob generate_modeling_*

# Run cross validation and hyper parameter search
dvc repro -s --glob cv_*

# Model Evaluation
dvc repro -s --glob model_evaluation_*

# Generate datasets for predictions and final output
dvc repro -s --glob prediction_*
```
Following these steps re-trains the model with new data using the previous best hyper-parameters:

```bash
# Generate datasets for modelling
dvc repro -s --glob generate_modeling_*

# Model Evaluation
dvc repro -s --glob model_evaluation_*

# Generate datasets for predictions and final output
dvc repro -s --glob prediction_*
```
Below is a brief overview of what each stage within a workflow is doing:

- Generate datasets for modelling
- Run cross validation and hyper parameter search
- Model Evaluation
- Generate datasets for predictions and final output
Because the hyperparameter search takes a long time, we have built in support for checkpointing progress. If for some reason the run does not complete, you can pick it up from where it left off rather than restarting it. To do this, complete the following steps (a sketch of the flag change in step 1 follows this list):

1. Open `src/params/model_pipeline_params.py` and find the `LOAD_CHECKPOINTS` parameter. Change its value to `True`.
2. From the `scripts` folder (you may already be there if you were running the pipeline), run `python generate_params.py`. This will register the change in parameters for the pipeline.
3. Rerun the search with `dvc repro -s --glob cv_*`.

Please make sure to reset `LOAD_CHECKPOINTS` to `False` (and rerun `python generate_params.py`) when you want to rerun the hyperparameter search with new data. Otherwise the search will use the old checkpoint and not rerun.
If you feel comfortable diving into the code and wish to change additional parameters, you need to do the following:

1. Edit the relevant parameter in the scripts in `src/params`.
2. Run `python generate_params.py` from the `scripts` folder.

If you do not complete step (2), the pipeline will not register your changes.
These files can be found in the `results/` directory after running the pipeline.

- `predictions.csv`: Dataset used for modeling, with additional columns containing predictions and probabilities, for current students in Years 7-10
- `unknown_predictions.csv`: Dataset used for modeling, with additional columns containing predictions and probabilities, for students with unknown destinations
- `unidentified_students_single.csv`: List of school students who had too much missing data and could not be used in the model
- `unidentified_unknowns_single.csv`: List of students with unknown destinations who had too much missing data and could not be used in the model

The datasets to be loaded into the Power BI dashboard after running the pipeline are found here:
- `results/predictions.csv`
- `results/unknown_predictions.csv`
- `results/unidentified_students_single.csv`
- `data/interim/neet_annotated.csv`
- `data/interim/census_annotated.csv`
- `data/interim/attendance_exact.csv`
The first two contain predictions for current school students in Years 7-10 and for current unknown students, respectively. `unidentified_students_single.csv` contains unidentified current school students for whom predictions could not be generated due to too much missing data. The final three files, found in the `data/interim` folder, contain NEET, census, and attendance data from previous years of students. These three datasets feed a separate page of the Power BI dashboard that looks at trends in certain factors over the years.
The Measures table (named `Measures_table`) contains some measured values we need to display on Power BI visualisations. We can easily create new measures in Power BI. You will need to implement these measures (name and formula are given):

- `Att<85%`
- `HighRisk`
- `LevelOfNeed_2%`
- `MentalHealth%`
- `Pregnant/Parent%`
- `SEND%`
- `SupByYOT%`
- `unidentified%`
We also need to create a few new columns for Power BI. These are as follows, along with the formula:

- Column Name: `MentalHealthFlag`
- Column Name: `Age`
- Column Name: `Attendance%`
- Column Name: `Gender`
- Column Name: `Parent/Preg%`
NOTE: replace `fake_test_dataset` with the actual file name that contains the predictions.