400 Datasets from the AWS Data Registry Are Available on DagsHub

Open Source Datasets Feb 28, 2023

We're excited to share that we have added over 400 datasets from the AWS Registry to DagsHub, which you can view, stream, and use in your machine learning projects. These datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals, and cover various domains, including audio, computer vision, NLP, geology, biology, and tabular data.

DagsHub Datasets

Where can you find over 400 datasets from the AWS Registry?

You can explore the more than 400 datasets from the AWS Registry on the new DagsHub Dataset page. It lists all available, up-to-date datasets in a user-friendly interface where you can browse and filter them by category, such as Audio, Computer Vision, NLP, Geology, Biology, and Tabular data.

Each dataset card holds a brief description of the dataset, a link to the dataset repository, a code snippet for streaming it, and all of its relevant tags. With DagsHub’s Data Catalog, you can seamlessly explore the datasets, view their content, and use them in your machine learning projects.

How can you use a dataset from the AWS Registry with DagsHub?

Direct Data Access supports streaming files from an S3 bucket connected to a DagsHub repository. This means you can stream a subset of an AWS Registry dataset without downloading the entire dataset to your local storage.

Check out this example:

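from dagshub.streaming import DagsHubFilesystem

# Point the filesystem at the DagsHub repository connected to the dataset's S3 bucket
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/fast-ai-imageclas-dataset")

# List the bucket's top-level entries without downloading the files
fs.listdir("s3://fast-ai-imageclas")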
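To make the "stream a subset" idea concrete, here is a minimal sketch that narrows a bucket listing down to a few entries before anything is read. The helper `select_subset` and the function `stream_example` are hypothetical names introduced for illustration, and the `.tgz` suffix is only an assumption about what the listing contains. The helper relies on nothing beyond the `listdir` call shown in the example above.

```python
from typing import List, Sequence


def select_subset(fs, bucket_path: str, suffixes: Sequence[str], limit: int) -> List[str]:
    """Return up to `limit` entry names under `bucket_path` ending in one of `suffixes`.

    `fs` is any object with an os-like `listdir(path)` method, such as the
    DagsHubFilesystem instance created in the example above.
    """
    names = sorted(fs.listdir(bucket_path))
    wanted = tuple(s.lower() for s in suffixes)
    # Case-insensitive suffix match, then keep only the first `limit` hits.
    return [name for name in names if name.lower().endswith(wanted)][:limit]


def stream_example() -> List[str]:
    """Live usage; needs the `dagshub` package and network access."""
    from dagshub.streaming import DagsHubFilesystem  # imported lazily on purpose

    fs = DagsHubFilesystem(
        ".", repo_url="https://dagshub.com/DagsHub-Datasets/fast-ai-imageclas-dataset"
    )
    # ".tgz" is an assumed suffix; adjust it to whatever listdir actually returns.
    return select_subset(fs, "s3://fast-ai-imageclas", [".tgz"], limit=5)
```

Calling `stream_example()` would return at most five matching entry names. Because the selection works on names only, no file content is transferred until you decide which entries to open.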

Demo: How to use a dataset from the AWS Registry

To find a dataset that meets your needs, go to the Dataset DB landing page and explore the supported datasets under a specific data domain by clicking the “Click for more” button.

Audio datasets

Once you find a dataset you are interested in, click its dataset card to see more information on the dataset page.

Dataset Card

Press the link button to open the repository, or copy the Direct Data Access (DDA) snippet to stream the data.

Stream Datasets from AWS using DagsHub

How to Filter Datasets on DagsHub

Another way to explore datasets for your machine learning project is to filter the data catalog. For instance, if you are looking for computer vision datasets provided by the AWS Registry, choose open-data-registry under General and AWS s3 under Integration. Then set the data domain topic to computer vision, and you are ready to start exploring the computer vision datasets provided by the AWS Registry.
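The effect of combining these filters can be sketched in a few lines of code. This is only an illustration: the catalog's actual filtering logic is not described here, so the sketch assumes that selections in different filter groups are combined with AND. The entries in `CATALOG`, their names, and the function `filter_catalog` are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical catalog entries; names and tags are made up for illustration.
CATALOG: List[dict] = [
    {"name": "example-cv-dataset",
     "tags": {"general": {"open-data-registry"}, "integration": {"AWS s3"},
              "domain": {"computer vision"}}},
    {"name": "example-audio-dataset",
     "tags": {"general": {"open-data-registry"}, "integration": {"AWS s3"},
              "domain": {"audio"}}},
    {"name": "example-other-cv-dataset",
     "tags": {"general": {"community"}, "integration": {"AWS s3"},
              "domain": {"computer vision"}}},
]


def filter_catalog(catalog: List[dict], selection: Dict[str, Set[str]]) -> List[dict]:
    """Keep entries whose tags include every selected value in every filter group."""
    return [
        entry for entry in catalog
        if all(values <= entry["tags"].get(group, set())
               for group, values in selection.items())
    ]


selection = {
    "general": {"open-data-registry"},
    "integration": {"AWS s3"},
    "domain": {"computer vision"},
}
matches = filter_catalog(CATALOG, selection)
print([entry["name"] for entry in matches])  # prints ['example-cv-dataset']
```

Each additional selection can only narrow the result under this assumption, which is why picking all three filters leaves just the computer vision datasets from the registry.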

DagsHub Datasets

How to Build an ML Project Using a Dataset from AWS

If you're looking for an example of the Direct Data Access feature in action, check out this project on DagsHub! The project demonstrates how to train a computer vision model using the MNIST dataset provided by DagsHub-Datasets.
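The project itself is not reproduced here, but the general pattern of feeding streamed files to a training loop can be sketched as follows. The generator `iter_file_batches` is a hypothetical helper, not code from that project. It takes the file-opening function as a parameter, so it works with Python's built-in `open`. Passing a streaming filesystem's `open` method instead assumes that method accepts the same `(path, mode)` arguments.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def iter_file_batches(
    paths: Iterable[str],
    open_fn: Callable = open,
    batch_size: int = 32,
) -> Iterator[List[Tuple[str, bytes]]]:
    """Yield lists of (path, raw_bytes), reading files only as batches are requested."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Tuple[str, bytes]] = []
    for path in paths:
        # Each file is opened and read just before it is needed.
        with open_fn(path, "rb") as handle:
            batch.append((path, handle.read()))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch
```

A training loop would iterate over `iter_file_batches(image_paths, open_fn=..., batch_size=64)`, decode each batch, and run one optimization step per batch. Files beyond the current batch are never touched, so stopping early avoids fetching the rest of the dataset.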

With over 400 datasets and Direct Data Access on DagsHub, you have a broad pool of quality data to draw on for your machine learning projects. So what are you waiting for? Start exploring the Dataset DB today!
