This repository contains the code to import and integrate the book and rating data that we work with. It imports and integrates data from several sources in a single PostgreSQL database; import scripts are primarily in Python, with Rust code for high-throughput processing of raw data files.
If you use these scripts in any published research, cite our paper:
Michael D. Ekstrand, Mucun Tian, Mohammed R. Imran Kazi, Hoda Mehrpouyan, and Daniel Kluver. 2018. Exploring Author Gender in Book Rating and Recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, pp. 242–250. DOI:10.1145/3240323.3240373. arXiv:1808.07586v1 [cs.IR].
Note: the limitations section of the paper contains important information about the limitations of the data these scripts compile. Do not use this data or tools without understanding those limitations. In particular, VIAF's gender information is incomplete and, in a number of cases, incorrect.
To run these scripts, you will need:

- A PostgreSQL server with the `orafce` extension and `pg_prewarm` (from the PostgreSQL Contrib package) installed.
- The `psql` executable on the machine where the import scripts will run.

It is best if you do not store the data files on the same disk as your PostgreSQL database.
The `environment.yml` file defines an Anaconda environment that contains all the required packages except for the PostgreSQL server. It can be set up with:

```sh
conda env create -f environment.yml
```
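After creating the environment, activate it before running any tasks. The name `bookdata` below is a placeholder; the actual environment name is whatever `environment.yml` defines:

```sh
# Environment name is a placeholder - check the `name:` field in environment.yml
conda activate bookdata
```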
All scripts read database connection info from the standard PostgreSQL client environment variables:

- `PGDATABASE`
- `PGHOST`
- `PGUSER`
- `PGPASSWORD`
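For example, in a POSIX shell (every value here is a placeholder for your own setup):

```sh
# Standard libpq connection settings; substitute your own database, host, and credentials
export PGDATABASE=bookdata
export PGHOST=localhost
export PGUSER=books
export PGPASSWORD=secret
```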
After creating your database, initialize the extensions (as the database superuser):
```sql
CREATE EXTENSION orafce;
CREATE EXTENSION pg_prewarm;
```
The default PostgreSQL performance configuration settings will probably not be very effective; we recommend turning on parallelism and increasing work memory, at a minimum.
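As a starting point, settings along these lines in `postgresql.conf` address both; the values are illustrative sketches, not tuned recommendations, so adjust them to your cores and RAM:

```
# Illustrative values only - tune to your hardware
max_parallel_workers_per_gather = 4   # allow parallel query execution
max_parallel_workers = 8
work_mem = 256MB                      # per-operation sort/hash memory
maintenance_work_mem = 1GB            # speeds up index builds
```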
This imports the following data sets:

- Library of Congress MDS catalog records (books), in the `data/LOC` directory.
- Library of Congress name authority records, in the `data/LOC` directory.
- VIAF (Virtual International Authority File) data, in the `data` directory.
- OpenLibrary dumps (authors, works, and editions), in the `data` directory.
- GoodReads data, in the `data` directory.
- Amazon ratings, in the `data` directory.
- BookCrossing ratings, in the `data` directory.
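In terms of on-disk layout, that list amounts to the following (file names within each directory depend on the dumps you download):

```
data/       # VIAF, OpenLibrary, GoodReads, Amazon, and BookCrossing files
data/LOC/   # Library of Congress catalog records and name authorities
```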
The import process is scripted with `invoke`. The first tasks to run are the import tasks:

```sh
invoke loc.import-books
invoke loc.import-names
invoke viaf.import
invoke openlib.import-authors openlib.import-works openlib.import-editions
invoke goodreads.import
invoke ratings.import-az
invoke ratings.import-bx
```
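Invoke's standard `--list` flag prints every task the project defines, which is useful for checking task names before running anything:

```sh
invoke --list
```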
Once all the data is imported, you can begin to run the indexing and linking tasks:
```sh
invoke viaf.index
invoke loc.index-books
invoke loc.index-names
invoke openlib.index
invoke goodreads.index-books
invoke analyze.cluster --scope loc
invoke analyze.cluster --scope ol
invoke analyze.cluster --scope gr
invoke analyze.cluster
invoke ratings.index
invoke goodreads.index-ratings
invoke analyze.authors
```
The tasks keep track of the import status in an `import_status` table, and will keep you from running tasks in the wrong order.
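To check where the pipeline currently stands, you can inspect that table directly; this query assumes nothing about its columns:

```sql
SELECT * FROM import_status;
```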
The `*-schema.sql` files contain the base schemas for the data:

- `common-schema.sql` - common tables
- `loc-mds-schema.sql` - Library of Congress catalog tables
- `ol-schema.sql` - OpenLibrary book data
- `viaf-schema.sql` - VIAF tables
- `az-schema.sql` - Amazon rating schema
- `bx-schema.sql` - BookCrossing rating data schema
- `gr-schema.sql` - GoodReads data schema
- `loc-ids-schema.sql` - LOC ID schemas
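If you need to apply a schema file by hand, the standard `psql` client can load it, using the connection environment variables described above:

```sh
psql -f common-schema.sql
```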