The data model behind DiSSCo’s Annotation service

16 January 2023

DiSSCo envisions discovering the untapped value within vast natural science collections across Europe and aims to connect these collections to related datasets and new insights using the FAIR Digital Object (FDO) framework. A key milestone toward this vision is the development of various data models and services that can help annotate and publish digital specimen data. This allows for the enrichment, correction, and improvement of digital specimen information, not only through manual annotation but also with batch and AI-assisted services.

During our design and test phase, we considered how these new annotation digital objects can integrate into the existing biodiversity data ecosystem. We align with ongoing discussions on open Digital Specimen (openDS) and MIDS, emphasising FAIR implementation. This approach ensures seamless collaboration and leverages existing models, ontologies, and specifications.

Now, let’s delve into the background, showcase an example, and explore the data model.

By Sharif Islam (DiSSCo data architect)

 

The concept of web annotation or scientific data annotation is not new. The World Wide Web Consortium (W3C) annotation data model (WADM), at the forefront of this collaborative endeavour, establishes the foundation. Drawing inspiration from the International Image Interoperability Framework (IIIF) and Europeana Annotation Data Model, both of which also build upon the W3C framework, we adopted WADM as our starting point. This ensures that annotations in the biodiversity domain align with global standards, fostering interoperability and enriching the scholarly landscape.

The topic of annotation regarding natural science collection and biodiversity data is also not new. The community has been working on this from various perspectives (see activities of TDWG Annotation IG, tools like AnnoSys, Pensoft Annotator; also see recent publications like Morris et al. 2023, Stork et al. 2019). Our approach builds upon these previous works, emphasising FAIR implementation and FDO Framework.

In Fig 1, we see the core WADM framework, which serves as the starting point for thinking about how annotation and digital specimens can be integrated. Annotation is always carried out on something — the “target” in Fig 1. In DiSSCo’s case, the target can be a digital specimen, a media object, or another annotation object. The body represents the content of the annotation and these components are linked.

Fig 1: The W3C Annotation Data Model (WADM) framework

Let’s consider an example involving a herbarium specimen (Fig 2), where a deep learning model can be employed to annotate and detect plant organs. Each of these boxes can serve as part of the annotation target. Picture a user drawing a bounding box, adding comments or making corrections to the content. This process can be applied to images or various data points such as collector’s names and locations. Additionally, we aim to accommodate both single-user interactions and batch processing. Our current sandbox implementation includes some of these features, ready for testing, and we hope to iteratively enhance them.

Fig 2: Annotation of a herbarium specimen organ detection

However, we need a framework — a data model — to capture annotations alongside specimen data and ensure FAIRness. Specifically, we need a flexible approach that accommodates commenting, editing, and data improvement. Fig 2 above illustrates the overall structure of our Annotation data model, incorporating concepts such as motivation, target, body, creator, and generator from WADM, and AggregateRating from schema.org. We introduce a few new terms as needed.

Let’s delve into this model in more detail. Picture a LEGO set, each piece designed to contribute to the grand structure. In Fig 3, observe the unique identifier (ID), type, and motivation of the annotation — defining its essence and purpose within the structure.

Fig 3: Identifier, type, and motivation

Fig 4 shows a crucial element: the target of the annotation — essentially, the object being annotated. This could be an entire image, a digital specimen, or a specific field.

 

Fig 4: The target and the selected field.

Finally, in Fig 5, there’s the body — the content encapsulated within the Annotation object. Here, we showcase the value — this could be an addition, comment, or an assessment (based on the selected motivation displayed in Fig 3). Another crucial aspect for tracking provenance is identifying who created this annotation — this could be a human or a machine agent.

Fig 5: The value and the creator

Our model is designed with flexibility in mind to capture different types of annotations. We are also planning to integrates this into approval/rejection workflows. Additionally, this structure facilitates linking and exporting of data to various systems, whether through bulk export or API operations. This adaptability ensures our annotation framework is versatile and aligns with diverse operational needs and collaborative workflows.

This process of data modeling and design is a collective effort. Many thanks to the core DiSSCo development team at Naturalis Biodiversity Center (Wouter Addink, Sam Leeflang, Soulaine Theocharides, Tom Dijkema). We also gathered responses through an RFC (Request for Comments) process, where feedback and insights shaped our thinking, paving the way for further improvement. Excitingly, several new features are in the pipeline. For more details, check out our GitHub and sandbox. Happy annotating!

Do you want to know more about the technical side of DiSSCo? DiSSCo puts different technical knowledge platforms at the scientific community’s disposal:

DiSSCoTech: Get the latest technical posts about the design of DiSSCo’s Infrastructure

DiSSCo Labs: A preview of experimental services and demonstrators by the DiSSCo community

DiSSCo GitHub: Code hosting for DiSSCo software, version control and collaboration

Categories

Archives