Annotate Datasets

DagsHub provides an easy way to annotate your datasets, via DagsHub Annotations (based on the Label Studio Integration).

Annotations are also integrated into Data Engine, providing an easy way to annotate data and generate high-quality datasets for training.

This guide shows how to annotate your datasets, visualize the annotations, and use them to train a model with DagsHub. For the full list of annotation features, see the feature page.

DagsHub annotations are free for public projects, and part of the Team Tier for private projects.

Uploading or connecting your dataset

Before annotating your data, you'll need to upload or connect the first version of your dataset, and create a Data Engine datasource from it. See the Version Datasets section of the docs for all the Dataset creation approaches.

For the purpose of this guide, we'll go with the simplest option: uploading a dataset to DagsHub. We'll use a segment of the COCO 1K dataset. Make sure to create a repository first, then replace <repo_owner> and <repo_name> in the following snippet with your user name and repository name respectively. The rest of the snippet should work as is.

Start by installing DagsHub (pip install -U dagshub), and make sure you're in your workspace.

Then run the following snippet:

from dagshub import get_repo_bucket_client
from dagshub.data_engine import datasources

# This retrieves the sample dataset and puts it in a folder called "data/". If you already have your own data there, skip the next two lines
ds = datasources.get('Dean/COCO_1K', 'COCO_1K')
ds.head(50).download_files(target_dir=".")

# Upload folder contents
client = get_repo_bucket_client("<repo_owner>/<repo_name>", flavor="s3fs")
client.put("data/", "<repo_name>/data", recursive=True)

# Create a datasource
datasources.create_datasource(repo="<repo_owner>/<repo_name>", name="hello-world", path="s3://<repo_name>/data")

After this, you should see the dataset in your project's Datasets tab (the view below appears after you click Visualize on that dataset):

Visualize New Dataset
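
If the Datasets tab shows fewer files than you expected, one quick check is to count what actually landed in the local data/ folder. The sketch below uses only the Python standard library. The helper name summarize_folder is made up for this guide and is not part of the DagsHub client.

```python
from collections import Counter
from pathlib import Path


def summarize_folder(folder):
    """Count files under `folder` (recursively), grouped by lowercase extension."""
    root = Path(folder)
    if not root.is_dir():
        # The folder does not exist, so there is nothing to count
        return Counter()
    return Counter(p.suffix.lower() for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    counts = summarize_folder("data")
    print(f"{sum(counts.values())} files found in data/")
    for extension, count in sorted(counts.items()):
        print(f"  {extension or '(no extension)'}: {count}")
```

If the count is zero, the download step most likely did not write to data/ in your current working directory.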

Send data points to be annotated

There are three ways to send data points for annotation: the DagsHub UI, the DagsHub client, or the local visualization instance.

Send to annotation using the DagsHub UI

Sending through the UI is the easiest and most straightforward way for most users, as it doesn't require running any code.

  1. Go to the Datasets tab and choose the dataset or datasource you'd like to use.

     Visualize from the UI

  2. Click the Visualize button. This opens the dataset in the DagsHub UI, where you can select the data points you want to annotate.

  3. After selecting the data points you'd like to annotate, click the Annotate button at the top right.

     Send to Annotations from UI

Tip: easily select multiple data points

You can use Shift+Click to select multiple data points easily, or the checkbox at the top to select all the data points.

Send to annotation using the DagsHub client

Sending to annotations using the client gives you the flexibility to pre-filter your dataset and only send parts of it to be annotated. This can be especially useful when working on improving the quality of your dataset, where you wouldn't necessarily want to redo the entire dataset's labels, but only parts.

To send data points to annotation using the DagsHub client, use the .annotate() method:

# Get the datasource we just created
ds = datasources.get("<repo_owner>/<repo_name>", "hello-world")

# Send all the data points in the datasource to annotation
ds.annotate()

# Send only the first 5 samples to be annotated
ds.head(5).annotate()
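
To illustrate the pre-filtering mentioned above, here is a minimal sketch of a helper that narrows a datasource by one metadata field before sending it to annotation. It assumes Data Engine's bracket-style filtering (ds[ds["field"] == value]), so check that syntax against the Data Engine querying docs for your client version. The helper name and its parameters are hypothetical and not part of the DagsHub client.

```python
def send_subset_to_annotation(ds, field, value, limit=None):
    """Filter `ds` on a metadata field, optionally cap the size, then annotate.

    `ds` is expected to be a Data Engine datasource, e.g. the result of
    datasources.get("<repo_owner>/<repo_name>", "hello-world").
    Returns whatever .annotate() returns.
    """
    # Assumption: Data Engine supports bracket-style filtering on metadata fields
    subset = ds[ds[field] == value]
    if limit is not None:
        # .head() and .annotate() are the same calls used in the snippet above
        subset = subset.head(limit)
    return subset.annotate()
```

For example, send_subset_to_annotation(ds, "split", "train", limit=20) would send at most 20 data points whose split field equals train. The split field and its value are placeholders; use a metadata field that exists on your own datasource.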

Send to annotation from the local dataset visualizer (Voxel51)

To annotate selected data points from the local visualization instance, start by visualizing your dataset:

ds.visualize()

Select the images you'd like to annotate, then navigate to the DagsHub tab (if there is no DagsHub tab, click on the ‘+’ button and choose DagsHub) and click on the ‘Annotate selected in LabelStudio’ button:

Send to Annotations from Voxel

Configuring the annotation project

After sending data points for annotation, a new window with the DagsHub web platform will open. It might take a few moments to spin up the labeling server. You can play pong while waiting :)

From here you can either choose the annotation project to add the tasks to, or create a new project. This means you can manage the annotation process with multiple annotators, assigning the right tasks to the relevant annotator.

To add the selected annotation tasks to an existing annotation project, select the first option, Continue with one of the existing projects, and choose a project (you'll only see this option if you already have a labeling project set up). To create a new annotation project, select the second option, Create new, and specify a name for it.

Add Datapoints to Annotation Project

Using Existing Labeling Project Configurations

You can import existing configurations (annotation templates, auto-labelers, etc.) to your new project by checking the Use project settings of: option and choosing an existing project.

Select Existing Settings

Click start, and you will be directed to your project with the relevant tasks.

Annotation Project with Tasks

Saving annotations back to your datasource

You can update your datasource with the new annotations. To do that, annotate a data point (a task), then:

  1. Click the Submit button (or Update if you're updating an existing annotation).

  2. Click the Save button at the top right of the screen.

Save Annotations

Each annotation is saved on the corresponding data point as an enrichment field, named after the annotation project you used plus the word annotation. The annotation is stored as a blob whose content is in Label Studio JSON format.
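
As an illustration of what such a blob can contain, the sketch below parses a Label Studio task and converts its rectangle regions to pixel coordinates. It assumes the blob holds a task with an annotations list whose result entries follow Label Studio's rectanglelabels layout, where x, y, width, and height are percentages of the image size. The sample payload and the function name are made up for this example, rotation is ignored, and fetching the blob bytes from the enrichment field is not shown.

```python
import json


def rectangles_to_pixels(task):
    """Return (label, x_min, y_min, x_max, y_max) in pixels for each rectangle."""
    boxes = []
    for annotation in task.get("annotations", []):
        for region in annotation.get("result", []):
            if region.get("type") != "rectanglelabels":
                continue  # skip polygons, choices, and other region types
            value = region["value"]
            img_w = region["original_width"]
            img_h = region["original_height"]
            # Label Studio stores rectangle geometry as percentages of the image size
            x_min = value["x"] / 100 * img_w
            y_min = value["y"] / 100 * img_h
            x_max = (value["x"] + value["width"]) / 100 * img_w
            y_max = (value["y"] + value["height"]) / 100 * img_h
            for label in value["rectanglelabels"]:
                boxes.append(
                    (label, round(x_min), round(y_min), round(x_max), round(y_max))
                )
    return boxes


# Illustrative payload only; real blobs carry more fields than this
sample_blob = b"""
{
  "annotations": [
    {
      "result": [
        {
          "type": "rectanglelabels",
          "from_name": "label",
          "to_name": "image",
          "original_width": 640,
          "original_height": 480,
          "value": {
            "x": 25, "y": 50, "width": 50, "height": 25,
            "rotation": 0,
            "rectanglelabels": ["dog"]
          }
        }
      ]
    }
  ]
}
"""

boxes = rectangles_to_pixels(json.loads(sample_blob))
print(boxes)  # [('dog', 160, 240, 480, 360)]
```

Converting to pixel coordinates up front is convenient because many training pipelines expect absolute box corners rather than percentages.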

Visualizing the new annotations

After you save the annotations, your enrichment fields are updated. You can then display them within your visualization instance. Either click the Refresh Dataset button in the Voxel DagsHub tab, or re-run the .visualize() command to update and display your new annotations:

Display Annotations in Voxel

Auto Labeling with ML Models

Label Studio supports active learning through automatic labeling with existing models. If you need this, check out the tutorial blog post on the topic.

Next steps

Now that you have labeled your dataset, you might be interested in converting it into a dataloader for training, or, if you already know how to do that, learn how to track your experiments.