Topic Modelling & Image Classification

("Améliorez le produit IA de votre start-up", i.e. "Improve your start-up's AI product")

[This project is part of the AI Engineer curriculum on OpenClassrooms]

We are provided with a dataset from Yelp containing 6,990,280 reviews written by 1,987,897 users about 150,346 businesses of all kinds. We can also use 200,100 pictures of food, drinks, interiors, exteriors and menus taken by the users.
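The review file is far too large to load in one go; the Yelp academic dataset files are JSON Lines (one JSON object per line), so they can be streamed. A minimal stdlib sketch (the `iter_reviews` helper and the two-line sample file are illustrative, not part of the project code):

```python
import json
from pathlib import Path

def iter_reviews(path, limit=None):
    """Yield review dicts from a Yelp JSON Lines file without loading it all into memory."""
    with open(path, encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break
            yield json.loads(line)

# Small self-contained demo with a fake two-line sample file
sample = Path("sample_reviews.json")
sample.write_text(
    '{"review_id": "r1", "stars": 1.0, "text": "Too noisy"}\n'
    '{"review_id": "r2", "stars": 5.0, "text": "Great food"}\n',
    encoding="utf-8",
)
low_ratings = [r for r in iter_reviews(sample) if r["stars"] <= 2]
print(len(low_ratings))  # 1
```

The same pattern works for the real `yelp_academic_dataset_review.json`, with `limit` used to cap memory during exploration.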

The purpose of this project is to work on the following three features:

  • detect topics of dissatisfaction in the reviews published on the platform.
  • automatically label the photos posted on the platform.
  • collect new reviews using the Yelp API.
  1. First, we will conduct an EDA for the Topic Modelling and try LDA and NMF models (01_Topic_Modelling.ipynb)
    Topic Modelling steps
  2. Then we will conduct an EDA for the feature-extraction and Image Classification problem and optionally try to implement a model (04_Image_Classification.ipynb)
    Bag of Visual Words steps
  3. After that, we will develop a Python script to gather new reviews from the Yelp API (02_API_Scrapping.py)
  4. Finally, we will produce a custom dashboard to present the results from points 1 & 2 (03_Streamlit.py)

Running the notebooks online

As the notebooks use hyperlinks for navigation, which doesn't work on GitHub, they are also available on nbviewer.org for convenience.

Running the notebooks locally

In order to use this project locally, you will need to have Python and Jupyter Notebook installed. Once done, we can set up the environment using the following commands:

First,

let's clone the project's GitHub repository

>>> git clone https://github.com/Valkea/OC_AI_06
>>> cd OC_AI_06

Secondly,

let's download the datasets and put them in the 'data' folder:

data/yelp_dataset/yelp_academic_dataset_business.json
data/yelp_dataset/yelp_academic_dataset_review.json
data/yelp_photos/photos.json
data/yelp_photos/photos/[the photos].jpg

Thirdly,

let's create a virtual environment and install the required Python libraries

(Linux or Mac)

>>> python3 -m venv venvP6
>>> source venvP6/bin/activate
>>> pip install -r requirements.txt

(Windows):

>>> py -m venv venvP6
>>> .\venvP6\Scripts\activate
>>> py -m pip install -r requirements.txt

Finally,

let's configure and run the virtual environment for Jupyter notebook

Install jupyter kernel for the virtual environment using the following command:

>>> pip install ipykernel
>>> python -m ipykernel install --user --name=venvP6

Install specific tools that aren't covered by the requirements.txt file

REQUIRED: let's install the spacy model used in this project

>>> python -m spacy download en_core_web_sm

OPTIONAL: if you encounter a problem after the Streamlit installation, the following line might solve it:

>>> echo fs.inotify.max_user_watches=524288 | sudo tee /etc/sysctl.d/40-max-user-watches.conf && sudo sysctl --system

OPTIONAL: if tf-lite is not correctly installed, it might be a good idea to try the following line:

>>> pip install --extra-index-url https://google-coral.github.io/py-repo/ tflite_runtime

Select the installed kernel

In order to run the various notebooks, you will need to use the virtual environment created above. So once a notebook is opened (see below), prior to running it, select the venvP6 kernel (Kernel > Change Kernel).

Run the jupyter notebooks

  1. In order to see the Topic Modelling notebook, run:
>>> jupyter notebook 01_Topic_Modelling.ipynb
  2. Then, to see the Image Classification notebook, use the following command:
>>> jupyter notebook 02_Image_Classification.ipynb
  3. And after that, one can explore the interactive Streamlit app using:
>>> streamlit run 03_Streamlit.py

⚠️ Unfortunately, running the Streamlit app with all its features requires the models, and as they are quite heavy they aren't pushed to GitHub.

So in order to run this app, there are three options:

Clone the models from DagsHub using DVC (DVC must be installed before using the following commands):

>>> dvc remote add origin https://dagshub.com/Valkea/OC_AI_06.dvc
>>> dvc pull -r origin

or

Run both notebooks to export the models (they export automatically), then convert the produced VGG16 model to a TF-Lite model.

>>> cd models
>>> python model_convert_to_TFLite.py vgg16_clf1.h5 vgg16_clf1.epochXX-categorical_accuracyX.XX.hdf5 # replace XX with the best produced model values
>>> mv vgg16_clf1.tflite vgg_15_clf.tflite
>>> cd ..

or

Run the Streamlit app without the models... the app will work, but some features will be missing.

  4. Finally, one can collect new reviews using the following script:
>>> python 02_API_Scrapping.py # this will gather "200" new reviews from "France" and save them to "data/api_export.csv"
or
>>> python 02_API_Scrapping.py --help # this will return the full list of available parameters
or
>>> python 02_API_Scrapping.py -n=100 -l=Chicago -p=data/my_file.csv -v=3 # This is an example
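The implementation of 02_API_Scrapping.py isn't shown here, but judging from the usage lines above, its argument parsing presumably looks something like this sketch (the flag names -n, -l, -p, -v and the defaults of 200, "France" and data/api_export.csv are taken from the examples; everything else, including the help strings, is an assumption):

```python
import argparse

def build_parser():
    """Hypothetical CLI matching the usage examples in the README."""
    parser = argparse.ArgumentParser(description="Collect new reviews from the Yelp API")
    parser.add_argument("-n", type=int, default=200, help="number of reviews to collect")
    parser.add_argument("-l", default="France", help="location to search")
    parser.add_argument("-p", default="data/api_export.csv", help="output CSV path")
    parser.add_argument("-v", type=int, default=1, help="verbosity level")
    return parser

# argparse accepts the "-flag=value" form used in the examples above
args = build_parser().parse_args(["-n=100", "-l=Chicago"])
print(args.n, args.l, args.p)  # 100 Chicago data/api_export.csv
```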

Uninstalling the venv kernel

Once done with the project, the kernel can be listed and removed using the following commands:

>>> jupyter kernelspec list
>>> jupyter kernelspec uninstall venvp6