Commit History
Message Author SHA1 Date
remove "test" remote   Hlib 3 years ago
zipping devanbu small corpus: include train, valid, test, demo folders directly to the root of the zip (without parent folders)   hlibbabii 3 years ago
add stage for zipping devanbu small corpus   hlibbabii 3 years ago
add another gdrive remote to test why pulling of remote cache fails   hlibbabii 3 years ago
add allamanis corpus extraction to pipeline   hlib 3 years ago
Merge branch 'master' of https://github.com/giganticode/datasets   hlib 3 years ago
add stage for computing the stats for devanbu small corpus   hlib 3 years ago
improvements to pre-processing stage: track also the resulting vocab; use a separate venv to run codeprep; extract codeprep version with yq   hlib 3 years ago
add stage for extracting devanbu small corpus   hlib 3 years ago
rename extract-25k-vocab-corpus.sh to be able to reuse it to extract other corpora   hlib 3 years ago
add 25k-vocab corpus preprocessing stage   hlib 3 years ago
lock allamanis corpus download stage not to re-download the corpus every time   hlib 3 years ago
add extraction of 25k-vocab corpus (for lm-context experiments) to pipeline   hlib 3 years ago
remove adding 25k-vocab corpus (added without means to reproduce it)   hlib 3 years ago
add 25k-vocab (for closed-vocab lm-context experiments) dataset metadata   hlib 3 years ago
add google drive remote   hlib 3 years ago
add downloading gigatoken corpus zip to pipeline   hlib 3 years ago
init dvc   hlib 3 years ago
Initial commit   Hlib 3 years ago