Register
Login
Resources
Docs Blog Datasets Glossary Case Studies Tutorials & Webinars
Product
Data Engine LLMs Platform Enterprise
Pricing Explore
Connect to our Discord channel
Integration:  dvc git github
afe3d8fa3a
Updating the Jupyter Notebooks and the raw data
2 years ago
afe3d8fa3a
Updating the Jupyter Notebooks and the raw data
2 years ago
cfa8b37449
Updating the scripts and the DVC pipeline
2 years ago
cfa8b37449
Updating the scripts and the DVC pipeline
2 years ago
8b70f246ad
Updating the Jupyter Notebooks
3 years ago
632b013e8d
Create LICENSE
1 year ago
420f9c5ae9
Updating the README file and removing the unnecessary credentials file
2 years ago
420f9c5ae9
Updating the README file and removing the unnecessary credentials file
2 years ago
098d3aaadb
Adjusting the scripts and Definition of preprocessing DVC pipeline
3 years ago
Storage Buckets
Data Pipeline
Legend
DVC Managed File
Git Managed File
Metric
Stage File
External File

README.md

You have to be logged in to leave a comment. Sign In

DiB Covid-19 Production Dataset

Description

This repository is responsible for versioning of the required source code to generate the COVID-19 manuscripts dataset, which it was published in the DiB (Data in Brief) journal.

Input

The raw data collected by the Jupyter Notebooks, which are contained into the folder "notebooks/collect".

The data sources are:

Output

The final dataset is combination of the arXiv, bioRxiv, medRxiv, PubMed and Scopus datasets collected. It is generated by the use of DVC pipeline defined in this repository.

The features of the resulting dataset are:

  • id (identifier): the identifier key of a manuscript.
  • doi: the DOI of a manuscript.
  • title: the title of a manuscript.
  • abstract: the abstract of a manuscript.
  • publication_date: the date of publication of a manuscript.
  • citation_num: the number of citation of a manuscript.
  • language: the language/idiom of a manuscript.
  • production_type: the category/type/classification of source of a manuscript.
  • source_type: the category/type/classification of source of a manuscript. It is a short version of feature production_type.
  • auth_keywords: the list of keywords defined by the authors of a manuscript.
  • index_terms: the list of indexed terms that defined by Scopus.
  • issn: the ISSN/E-ISSN of a manuscript.
  • vehicle_name: the name of source where a manuscript was published.
  • publisher: the name of publisher that published a manuscript.
  • affiliations: the list of affiliations (ID, country and name of affiliation) contained in a manuscript.
  • subject_areas: the list of subject/study fields of a manuscript.
  • authors: the list of authors (ID and name) contained in a manuscript.
  • author_affil: the list of authors organized with their affiliations. The combination of the features authors and affiliations.
  • ref_count: the number of references contained in a manuscript.
  • references: the list of references data (authors, title, DOI and ID).
  • data_source: the source database of a manuscript.
  • period: the combination of the year and month, respectively, extracted from the feature publication_date.

Steps for generating the dataset

For the execution of the following steps, I will consider that you already cloned/downloaded this repository, as well as the steps will be executed via shell/prompt within the folder of this repository. In addition, an essential prerequisite is that DVC is already installed on your machine.

For reusing the raw data that I already collected and the pipeline created, you can do the following steps:

  1. Download the raw data, that is available on Google Drive, and put them in the data/raw folder. You can download these files from this link.

  2. Execute the preprocessing pipeline. So, you can execute the following command:

    dvc repro
    

Citation

DOI:10.1016/j.dib.2020.106178

How does it cite?

Santos, Breno Santana; Silva, Ivanovitch; Ribeiro-Dantas, Marcel da Câmara; Alves, Gisliany; Endo, Patricia Takako; Lima, Luciana. COVID-19: A scholarly production dataset report for research analysis. Data in Brief, Volume 32, 2020, DOI:10.1016/j.dib.2020.106178.

How does the article download?

You can download the article from this link.

Tip!

Press p or to see the previous file or, n or to see the next file

About

The repository of the DiB (Data in Brief) paper's project related to COVID-19 manuscripts dataset.

Collaborators 1

Comments

Loading...