Alexander Levin 78e469d2ff
Logreg trained on the new regex (graph_v2)
3 years ago
README.md

Source Code Classification

This is an old repository of the NL2ML project of the Laboratory of Big Data Analysis at the Higher School of Economics (HSE LAMBDA).

Project page: https://www.notion.so/NL2ML-Corpus-1ed964c08eb049b383c73b9728c3a231

The repository is currently being migrated to the HSE LAMBDA GitLab: https://gitlab.com/lambda-hse/nl2ml

Project Goals:

The current short-term goal is to build a model that classifies a source code chunk and identifies exactly where in the chunk the detected class occurs (tag segmentation).

The long-term goal is to build a model that generates code from a task description written in English.
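To make the short-term goal concrete, here is a minimal sketch of what "tag segmentation" output could look like: a class label plus the character span where it was detected. The class names and patterns are hypothetical and chosen only for illustration, and plain regular expressions stand in for the model the project aims to train.

```python
import re
from typing import List, NamedTuple


class TaggedSpan(NamedTuple):
    label: str  # detected class
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


# Hypothetical classes and patterns, used only to illustrate the output shape.
PATTERNS = {
    "data_import": re.compile(r"\w+\.read_csv\([^)]*\)"),
    "model_train": re.compile(r"\w+\.fit\([^)]*\)"),
}


def segment(chunk: str) -> List[TaggedSpan]:
    """Return every labelled span found in the chunk, ordered by position."""
    spans = [
        TaggedSpan(label, match.start(), match.end())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(chunk)
    ]
    return sorted(spans, key=lambda span: span.start)


chunk = "df = pd.read_csv('train.csv')\nclf.fit(X, y)\n"
spans = segment(chunk)
for span in spans:
    # Show each label next to the exact text it covers.
    print(span.label, repr(chunk[span.start:span.end]))
```

Returning offsets rather than only a label is what separates segmentation from plain chunk classification: the same chunk can carry several classes, each tied to its own location.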

Contents:

nl2ml_notebook_parser.py - script that parses Kaggle notebooks and converts them to JSON/CSV/Pandas.

bert_distances.ipynb - notebook with experiments on how meaningful the distances between BERT embeddings are when the input tokens are tokenized source code chunks.

bert_classifier.ipynb - notebook with preprocessing and training pipeline.

regex.ipynb - notebook that creates labels for code chunks using regular expressions.

logreg_classifier.ipynb.ipynb - notebook that builds a logistic regression model on the regex labels using TF-IDF features.
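The first steps of the workflow listed above (parse a notebook into code chunks, label the chunks with regex, export to CSV) can be sketched with the standard library alone. This is not the code from nl2ml_notebook_parser.py or regex.ipynb: the function names, label names, patterns, and CSV columns below are all assumptions made for illustration. The only external fact relied on is the nbformat 4 layout, where a notebook is a JSON object with a `cells` list and each cell has a `cell_type` and a `source`.

```python
import csv
import io
import json
import re

# Hypothetical label patterns; the project's real regexes are not shown in this README.
LABEL_PATTERNS = {
    "data_import": re.compile(r"\bread_csv\s*\("),
    "model_train": re.compile(r"\.fit\s*\("),
    "visualization": re.compile(r"\bplt\.\w+\s*\("),
}


def extract_code_chunks(notebook_json):
    """Return the source of every non-empty code cell in an .ipynb document."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat allows "source" to be a string or a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source)
    return chunks


def label_chunk(chunk):
    """Return the sorted names of all patterns that match the chunk."""
    return sorted(name for name, pattern in LABEL_PATTERNS.items() if pattern.search(chunk))


def chunks_to_csv(chunks):
    """Serialize chunks and their regex labels to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["chunk_id", "labels", "code"])
    for chunk_id, chunk in enumerate(chunks):
        writer.writerow([chunk_id, "|".join(label_chunk(chunk)), chunk])
    return buffer.getvalue()


# A tiny in-memory notebook standing in for a downloaded Kaggle notebook.
SAMPLE_NOTEBOOK = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["import pandas as pd\n", "df = pd.read_csv('train.csv')"]},
        {"cell_type": "code", "source": "model.fit(X, y)"},
    ],
    "nbformat": 4,
})

chunks = extract_code_chunks(SAMPLE_NOTEBOOK)
labels = [label_chunk(chunk) for chunk in chunks]
csv_text = chunks_to_csv(chunks)
```

Labels produced this way could then serve as training targets for a TF-IDF plus logistic regression classifier, as logreg_classifier.ipynb.ipynb is described as doing; that step is left out here to keep the sketch dependency-free.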
