volf52

Yo

volf52 synced new reference v0.3.0 to volf52/tr_news_cat from mirror

3 years ago

volf52 synced commits to main at volf52/tr_news_cat from mirror

  • fe3e40ac67 Merge from Dev: Update Docker config, use gunicorn and remove Pytorch
    Squashed commit of the following:
    - 5c84c4a73c Bump version to 0.3.0 (Arslan <rslnkrmt2552@gmail.com>, Sat Mar 20 15:55:31 2021 +0300)
    - 0af31d0759 Update Dockerfile: 3-stage build and smaller context (Arslan <rslnkrmt2552@gmail.com>, Sat Mar 20 15:47:23 2021 +0300)
      - The Dockerfile now has three build stages to make better use of Docker's layer caching. The first stage downloads and installs the system packages and will change extremely rarely. The second stage installs the Python packages and will change more often than the first stage, but still far less often than the last. The final stage copies the package code, pulls the artifacts using DVC, installs the root as a package (using poetry), and then deletes the DVC and git files.
      - The .dockerignore file has also been improved to minimize the build context sent to the Docker build server. A main motivation for this change was enabling cloud container builds, e.g. on Docker Hub (or GitHub, GitLab, Heroku, AWS).
      - The application is now run with gunicorn managing uvicorn (ASGI) workers in async mode, defaulting to 2 workers with 8 threads each.
    - eec7a68f51 Update uvicorn, remove Pytorch and add DVC as dev dep (Arslan <rslnkrmt2552@gmail.com>, Sat Mar 20 15:35:16 2021 +0300)
    - 614420f014 Add Dockerfile, and related build instructions in README (Arslan <rslnkrmt2552@gmail.com>, Fri Mar 19 14:44:23 2021 +0300)
    - 850984aa79 Disable runtime directory validation for data, metrics and artifacts dir (Arslan <rslnkrmt2552@gmail.com>, Fri Mar 19 14:43:41 2021 +0300)

3 years ago
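
For context, the serving setup that commit describes (gunicorn supervising uvicorn workers) would look roughly like the config file below. This is a minimal sketch, not the repository's actual configuration: the bind address is an assumption, and the `workers` and `threads` values are taken verbatim from the commit message.

```python
# gunicorn.conf.py -- a sketch of the setup described in the commit above,
# NOT the repo's actual config. gunicorn supervises the worker processes;
# each worker runs uvicorn's ASGI worker class.
bind = "0.0.0.0:8000"  # listen address (assumption, not from the commit)
workers = 2            # "default 2 uvicorn workers" per the commit message
threads = 8            # "8 threads per worker" per the commit message
worker_class = "uvicorn.workers.UvicornWorker"  # ASGI worker class shipped with uvicorn
```

It would then be launched with something like `gunicorn -c gunicorn.conf.py app.main:app`, where `app.main:app` is a hypothetical module path for the ASGI application.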

volf52 synced commits to dev at volf52/tr_news_cat from mirror

  • 5c84c4a73c Bump version to 0.3.0
  • 0af31d0759 Update Dockerfile: 3-stage build and smaller context
    - The Dockerfile now has three build stages to make better use of Docker's layer caching. The first stage downloads and installs the system packages and will change extremely rarely. The second stage installs the Python packages and will change more often than the first stage, but still far less often than the last. The final stage copies the package code, pulls the artifacts using DVC, installs the root as a package (using poetry), and then deletes the DVC and git files.
    - The .dockerignore file has also been improved to minimize the build context sent to the Docker build server. A main motivation for this change was enabling cloud container builds, e.g. on Docker Hub (or GitHub, GitLab, Heroku, AWS).
    - The application is now run with gunicorn managing uvicorn (ASGI) workers in async mode, defaulting to 2 workers with 8 threads each.
  • eec7a68f51 Update uvicorn, remove Pytorch and add DVC as dev dep
  • View comparison for these 3 commits »

3 years ago

volf52 synced commits to main at volf52/tr_news_cat from mirror

  • 4bd0544299 Add Dockerfile, and related build instructions in README

3 years ago

volf52 synced commits to dev at volf52/tr_news_cat from mirror

3 years ago

volf52 synced new reference feature_api_skeleton to volf52/tr_news_cat from mirror

3 years ago

volf52 synced commits to dev at volf52/tr_news_cat from mirror

  • d19230f825 Bump version 0.2.0
  • f12b682392 Feature: Logistic Classifier (with Tf-Idf for embedding) (PR #3 squash merge)
    - Remove ml.config as dep for 'clean_and_split'
    - Tfidf vectorizer with 25k features
    - Track experiments using DVC exp feature
    - Remove clean_and_split and embed_tfidf stages
      - The earlier stages are not required to train the classifiers, only the output of the last stage.
      - Any time we need to change those outputs (the vector files), we can simply switch to the relevant branch and re-run the pipeline using `dvc checkout` followed by `dvc repro`.
      - The outputs of the last stage will be tracked using .dvc files instead of as pipeline outputs.
    - Set the outputs of embed_tfidf stage as vanilla DVC tracked files
    - Remove raw_data as well
      - The final version of this branch should be merged directly with dev from now on.
      - The changes (specifically data tracking) to the data-pipe branch should be reflected in just the embedding branches.
      - The data tracking changes to the embedding branches should reflect only in the classifier branches.
      - In short, each branch should track only the data it needs for its pipeline to work. If it uses the output from some other branch (as ml_clf_logistic uses the vectorized data from ml_embed_tfidf), the pipeline metadata should be removed, and the outputs should be tracked using vanilla DVC.
      - Any change in the outputs (e.g. as a result of running the embed_tfidf stage on the ml_embed_tfidf branch) can be produced and committed there; when switching to the *_clf_* branch, we can update the DVC-tracked metadata using `dvc add`.
    - Restore the Tfidf vectorizer for use in pipeline
    - Replace npy files with npz files following change in embed_tfidf stage
    - Update config to be the same as the earlier embed stage
    - Can train a logistic classifier now
    - Logistic Ex 1: Default parameters
    - Remove unnecessary stages, artifacts and data. Track model manually
    - Tfidf Logistic Classifier. Works great
    - Remove training and testing data, as that's not required for model deployment
  • View comparison for these 2 commits »

3 years ago
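
As an illustration of the model those commits describe (a 25k-feature TF-IDF vectorizer feeding a logistic classifier with default parameters), a minimal scikit-learn sketch might look like the following. The variable names and sample documents are invented for illustration, not taken from the repo:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder documents and labels -- the real data is the repo's
# Turkish news dataset, which is not reproduced here.
train_texts = ["ekonomi piyasa borsa haberleri", "futbol mac lig haberleri"]
train_labels = ["economy", "sports"]

model = make_pipeline(
    TfidfVectorizer(max_features=25_000),  # "Tfidf vectorizer with 25k features"
    LogisticRegression(),                  # "Logistic Ex 1: Default parameters"
)
model.fit(train_texts, train_labels)
print(model.predict(["yeni borsa rekoru"]))  # -> ['economy'] (illustrative)
```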

volf52 synced commits to ml_embed_tfidf at volf52/tr_news_cat from mirror

  • ff3aa02f98 Modify embed_tfidf stage to use save_npz
    - Saving the trainX, testX and validX sparse matrices (scipy.sparse.csr_matrix) with np.save causes a problem: np.save first converts the object into an ndarray via np.array(x). For a csr_matrix, or any non-iterable Python object, this returns a 0-dimensional array, which breaks sklearn functions.
    - The underlying object can be recovered by calling zero_dim_array.item(), but it is better to use scipy.sparse.save_npz and load_npz, which store the sparse matrix in a compressed format.

3 years ago
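
The pitfall and fix described in that commit can be sketched in a few lines. This uses a small stand-in matrix rather than the repo's real TF-IDF outputs:

```python
import numpy as np
from scipy import sparse

trainX = sparse.csr_matrix(np.eye(3))  # stand-in for the real trainX matrix

# The problem: np.save coerces the matrix via np.array(), yielding a
# 0-dimensional object array that sklearn functions cannot use.
np.save("trainX.npy", trainX)
loaded = np.load("trainX.npy", allow_pickle=True)
print(loaded.shape)    # () -- the matrix is only recoverable via loaded.item()

# The fix: scipy's dedicated savers keep the compressed sparse format.
sparse.save_npz("trainX.npz", trainX)
restored = sparse.load_npz("trainX.npz")
print(restored.shape)  # (3, 3)
```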

volf52 synced commits to ml_clf_logistic at volf52/tr_news_cat from mirror

3 years ago

volf52 synced commits to ml_clf_logistic at volf52/tr_news_cat from mirror

  • 5352427162 Remove raw_data as well
    - The final version of this branch should be merged directly with dev from now on.
    - The changes (specifically data tracking) to the data-pipe branch should be reflected in just the embedding branches.
    - The data tracking changes to the embedding branches should reflect only in the classifier branches.
    - In short, each branch should track only the data it needs for its pipeline to work. If it uses the output from some other branch (as ml_clf_logistic uses the vectorized data from ml_embed_tfidf), the pipeline metadata should be removed, and the outputs should be tracked using vanilla DVC.
    - Any change in the outputs (e.g. as a result of running the embed_tfidf stage on the ml_embed_tfidf branch) can be produced and committed there; when switching to the *_clf_* branch, we can update the DVC-tracked metadata using `dvc add`.

3 years ago

volf52 synced new reference ml_clf_logistic to volf52/tr_news_cat from mirror

3 years ago

volf52 synced commits to ml_embed_tfidf at volf52/tr_news_cat from mirror

3 years ago

volf52 synced commits to ml_embed_tfidf at volf52/tr_news_cat from mirror

3 years ago

volf52 synced commits to main at volf52/tr_news_cat from mirror

3 years ago

volf52 synced and deleted reference feature_api_skeleton at volf52/tr_news_cat from mirror

3 years ago

volf52 synced new reference feature_api_skeleton to volf52/tr_news_cat from mirror

3 years ago

volf52 synced commits to dev at volf52/tr_news_cat from mirror

3 years ago

volf52 synced commits to dev at volf52/tr_news_cat from mirror

  • 8e0c153441 Add all requirements now
    - Adding requirements one by one results in a bad commit graph, as I found out working today. Better to add all the requirements I know the application will need now, as early in the common commit history as I can.
  • 2041b0aa7d Add all requirements (I can think of) at once
  • 3835ed1044 Add base data and track it with DVC. Add associated python requirements
    Squashed commit of the following:
    - 3fb8ee4aea Add numpy and pandas (Arslan <rslnkrmt2552@gmail.com>, Wed Mar 17 19:50:14 2021 +0300)
    - c526eb38fe Configure DagsHub as the remote storage (Arslan <rslnkrmt2552@gmail.com>, Wed Mar 17 19:38:38 2021 +0300)
    - 93344ba22a Add initial raw_data (Arslan <rslnkrmt2552@gmail.com>, Wed Mar 17 19:33:39 2021 +0300)
    - 788691ecb9 Init DVC (Arslan <rslnkrmt2552@gmail.com>, Wed Mar 17 19:24:34 2021 +0300)
  • 3fb8ee4aea Add numpy and pandas
  • c526eb38fe Configure DagsHub as the remote storage
  • View comparison for these 7 commits »

3 years ago

volf52 synced commits to data-pipe at volf52/tr_news_cat from mirror

  • 36ed0a1d4d Configure DagsHub as the remote storage

3 years ago

volf52 synced new reference data-pipe to volf52/tr_news_cat from mirror

3 years ago