

Parser

All public Kaggle kernels are distributed under the Apache 2.0 license.

Terminology

Kaggle has several kinds of references for its objects. Let's clarify the terminology.

  • A link is a full URL, e.g. https://www.kaggle.com/data-scientist/cool-solution
  • A slug is a human-readable identifier. For most Kaggle objects it is simply the part of the URL after https://www.kaggle.com/. The slug of the kernel in the previous example is data-scientist/cool-solution
  • A numerical ID is a number that Kaggle uses as an internal reference.
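To illustrate the link/slug distinction, a slug can be recovered from a link with a one-line helper (the function name slug_from_link is hypothetical, not part of the parser):

```python
from urllib.parse import urlparse

def slug_from_link(link: str) -> str:
    """Extract a Kaggle slug from a full URL.

    Illustrative helper: for most Kaggle objects the slug is simply
    the URL path with the leading slash stripped.
    """
    return urlparse(link).path.lstrip("/")

# Example from the terminology above:
slug = slug_from_link("https://www.kaggle.com/data-scientist/cool-solution")
# slug == "data-scientist/cool-solution"
```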

File descriptions

  • clean_data.py goes through every code block in the labeled dataset, removes comments, formats the code to conform to the PEP 8 standard, and removes service labels. This script has 2 CLI arguments:

    • path to the labeled dataset
    • path to the knowledge graph
  • collect_kernels_from_competitions.py traverses the popular competitions available through the Kaggle API and collects the slugs of the connected kernels.

    The script does the following:

    • Collects the links to the Kaggle competitions into competitions_ref.csv
    • Runs competition_kernels.sh, which collects the kernels of every competition
    • Writes .csv files with the Kaggle kernel links to the ./kernel_lists directory

  • competition_kernels.sh collects the kernel slugs for a single competition and is used in collect_kernels_from_competitions.py. This script has 1 CLI argument, the competition slug.

  • competition_collector.py collects the competition slugs from the Kaggle search engine. This script has 3 CLI arguments:

    • --competition_count is the maximum number of competition slugs to collect
    • --process_id is the ID of the worker
    • --total_processes is the total number of workers
  • competition_tags.py extracts competition tags. The tags include the metric as well as metadata such as the data type or subject.

  • kernel_collector.py collects kernel slugs from the Kaggle search engine. This script has 2 CLI arguments:

    • --kernel_count is the maximum number of kernel slugs to collect
    • --process_id is the ID of the worker. The script assumes that there are 3 workers in total.
  • kernel_parser.py takes the kernel slugs from the csv file named in the KERNEL_FILENAME variable, then loads each kernel and splits it into code blocks. This script has 1 CLI argument:

    • --process_id is the ID of the worker. The script assumes that there are 3 workers in total.

    Input: a table with kernel links. Output: a table with each kernel's code blocks, code block IDs, and kernel metadata: "kaggle_score", "kaggle_comments", "kaggle_upvotes", "kaggle_link", "kaggle_id", "kaggle_dataset".

  • unite_kernel_lists.py takes all csv files from a folder and appends them to the main csv file defined in the MERGED_LIST variable.
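The source of clean_data.py is not shown here, but the comment-stripping part of such a cleanup can be sketched with the standard library alone (the function name strip_comments is hypothetical; the real script additionally applies PEP 8 formatting and removes service labels):

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    """Remove '#' comments from Python source while keeping the code intact.

    Hypothetical sketch of the kind of cleaning clean_data.py performs:
    tokenize the source, drop COMMENT tokens, and rebuild the text.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)

code = "x = 1  # the answer\nprint(x)\n"
cleaned = strip_comments(code)
# 'cleaned' contains the same code with the comment removed
```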

Running kernel collection and parsing

  1. Collect the kernel links into the kernel_lists directory.

mkdir kernel_lists

python collect_kernels_from_competitions.py

The script does the following:

  • Collects the links to the Kaggle competitions into competitions_ref.csv
  • Runs competition_kernels.sh, which collects the kernels of every competition
  • Writes .csv files with the Kaggle kernel links to the kernel_lists directory

  2. Combine the table data into one .csv table.

python unite_kernel_lists_w_year.py

Input: the kernel_lists directory. Output: a .csv table with the link to each kernel and its submission year.
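The merging step can be sketched as follows (merge_kernel_lists and its arguments are hypothetical names; the actual script reads its paths from the MERGED_LIST variable, and this sketch only concatenates rows, without extracting the submission year):

```python
import csv
from pathlib import Path

def merge_kernel_lists(list_dir: str, merged_path: str) -> None:
    """Append every .csv in list_dir into one merged table.

    Illustrative sketch; merged_path should live outside list_dir
    so the output file is not re-read as an input.
    """
    with open(merged_path, "w", newline="") as out:
        writer = csv.writer(out)
        for csv_file in sorted(Path(list_dir).glob("*.csv")):
            with open(csv_file, newline="") as f:
                writer.writerows(csv.reader(f))
```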

Kaggle kernels parsing

python kernel_parser.py --process_id $process_id 1>./parser.err 2>./parser.log

process_id = {0,1,2}

The parser uses the BeautifulSoup4 library, as well as tqdm for a convenient progress bar. Selenium, with the Google Chrome webdriver, was required to collect references to the data each kernel uses.
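The block-splitting idea can be sketched without any third-party dependency, using the stdlib html.parser in place of BeautifulSoup4 and assuming, purely for illustration, that code blocks are rendered as <pre> tags:

```python
from html.parser import HTMLParser

class CodeBlockExtractor(HTMLParser):
    """Collect the text of every <pre> tag in a page.

    Dependency-free sketch: the real parser uses BeautifulSoup4 (and
    Selenium for dataset references); the <pre>-tag assumption and the
    class name are illustrative, not taken from the actual code.
    """

    def __init__(self):
        super().__init__()
        self._in_pre = False
        self.code_blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
            self.code_blocks.append("")  # start a new block

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_pre:
            self.code_blocks[-1] += data  # accumulate block text

extractor = CodeBlockExtractor()
extractor.feed("<html><pre>print('hi')</pre><p>text</p><pre>x = 1</pre></html>")
# extractor.code_blocks == ["print('hi')", "x = 1"]
```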
