

Parser

All public Kaggle kernels are distributed under the Apache 2.0 license.

Terminology

Kaggle has several kinds of references for its objects. Let's clarify the terminology.

  • A link is a full URL, e.g. https://www.kaggle.com/data-scientist/cool-solution
  • A slug is a human-readable identifier. For most Kaggle objects it is simply the part of the URL after https://www.kaggle.com/. The slug of the kernel in the previous example is data-scientist/cool-solution
  • A numerical ID is a number that Kaggle uses as an internal reference.
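To illustrate the link/slug distinction, a slug can be recovered from a link with a one-line helper (the function name slug_from_link is hypothetical, not part of the parser):

```python
from urllib.parse import urlparse

def slug_from_link(link: str) -> str:
    """Extract a Kaggle slug from a full URL.

    Illustrative helper: for most Kaggle objects the slug is simply
    the URL path with the leading slash stripped.
    """
    return urlparse(link).path.lstrip("/")

# Example from the terminology above:
slug = slug_from_link("https://www.kaggle.com/data-scientist/cool-solution")
# slug == "data-scientist/cool-solution"
```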

File descriptions

  • clean_data.py goes through every code block in the labeled dataset, removes comments, formats the code to conform to the PEP 8 standard, and removes service labels. This script has 2 CLI arguments:

    • path to the labeled dataset
    • path to the knowledge graph
  • collect_kernels_from_competitions.py traverses the popular competitions available through the Kaggle API and collects the slugs of the connected kernels.

    The script does the following:

    • Collects the links to the Kaggle competitions into competitions_ref.csv
    • Runs competition_kernels.sh, which collects the kernels of every competition
    • Writes .csv files with the Kaggle kernel links to the ./kernel_lists directory

  • competition_kernels.sh collects the kernel slugs for a single competition and is used in collect_kernels_from_competitions.py. This script has 1 CLI argument, the competition slug.

  • competition_collector.py collects the competition slugs from the Kaggle search engine. This script has 3 CLI arguments:

    • --competition_count is the maximum number of competition slugs to collect
    • --process_id is the ID of the worker
    • --total_processes is the total number of workers
  • competition_tags.py extracts competition tags. The tags include the metric as well as metadata such as the data type or subject.

  • kernel_collector.py collects kernel slugs from the Kaggle search engine. This script has 2 CLI arguments:

    • --kernel_count is the maximum number of kernel slugs to collect
    • --process_id is the ID of the worker. The script assumes that there are 3 workers in total.
  • kernel_parser.py takes the kernel slugs from the csv file named in the KERNEL_FILENAME variable, then loads each kernel and splits it into code blocks. This script has 1 CLI argument:

    • --process_id is the ID of the worker. The script assumes that there are 3 workers in total.

    Input: a table with kernel links. Output: a table with each kernel's code blocks, code block IDs, and kernel metadata: "kaggle_score", "kaggle_comments", "kaggle_upvotes", "kaggle_link", "kaggle_id", "kaggle_dataset".

  • unite_kernel_lists.py takes all csv files from a folder and appends them to the main csv file defined in the MERGED_LIST variable.
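The source of clean_data.py is not shown here, but the comment-stripping part of such a cleanup can be sketched with the standard library alone (the function name strip_comments is hypothetical; the real script additionally applies PEP 8 formatting and removes service labels):

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    """Remove '#' comments from Python source while keeping the code intact.

    Hypothetical sketch of the kind of cleaning clean_data.py performs:
    tokenize the source, drop COMMENT tokens, and rebuild the text.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)

code = "x = 1  # the answer\nprint(x)\n"
cleaned = strip_comments(code)
# 'cleaned' contains the same code with the comment removed
```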

Running kernel collection and parsing

  1. Collect the kernel links into the kernel_lists directory.

mkdir kernel_lists

python collect_kernels_from_competitions.py

The script does the following:

  • Collects the links to the Kaggle competitions into competitions_ref.csv
  • Runs competition_kernels.sh, which collects the kernels of every competition
  • Writes .csv files with the Kaggle kernel links to the kernel_lists directory

  2. Combine the table data into one .csv table.

python unite_kernel_lists_w_year.py

Input: the kernel_lists directory. Output: a .csv table with the link to each kernel and its submission year.
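The merging step can be sketched as follows (merge_kernel_lists and its arguments are hypothetical names; the actual script reads its paths from the MERGED_LIST variable, and this sketch only concatenates rows, without extracting the submission year):

```python
import csv
from pathlib import Path

def merge_kernel_lists(list_dir: str, merged_path: str) -> None:
    """Append every .csv in list_dir into one merged table.

    Illustrative sketch; merged_path should live outside list_dir
    so the output file is not re-read as an input.
    """
    with open(merged_path, "w", newline="") as out:
        writer = csv.writer(out)
        for csv_file in sorted(Path(list_dir).glob("*.csv")):
            with open(csv_file, newline="") as f:
                writer.writerows(csv.reader(f))
```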

Kaggle kernels parsing

python kernel_parser.py --process_id $process_id 1>./parser.err 2>./parser.log

process_id = {0,1,2}

The parser uses the BeautifulSoup4 library, as well as tqdm for a convenient progress bar. Selenium, with the Google Chrome webdriver, was required to collect references to the data each kernel uses.
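The block-splitting idea can be sketched without any third-party dependency, using the stdlib html.parser in place of BeautifulSoup4 and assuming, purely for illustration, that code blocks are rendered as <pre> tags:

```python
from html.parser import HTMLParser

class CodeBlockExtractor(HTMLParser):
    """Collect the text of every <pre> tag in a page.

    Dependency-free sketch: the real parser uses BeautifulSoup4 (and
    Selenium for dataset references); the <pre>-tag assumption and the
    class name are illustrative, not taken from the actual code.
    """

    def __init__(self):
        super().__init__()
        self._in_pre = False
        self.code_blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
            self.code_blocks.append("")  # start a new block

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_pre:
            self.code_blocks[-1] += data  # accumulate block text

extractor = CodeBlockExtractor()
extractor.feed("<html><pre>print('hi')</pre><p>text</p><pre>x = 1</pre></html>")
# extractor.code_blocks == ["print('hi')", "x = 1"]
```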
