This lesson is still being designed and assembled (Pre-Alpha version)

Being precise

Overview

Teaching: 23 min
Exercises: 6 min
Questions
  • How can I make my metadata interoperable?

  • How can I avoid ambiguity?

Objectives
  • Use public identifiers

  • Understand the difference between a controlled vocabulary and an ontology

  • Find ontology terms

Being precise

If the purpose of metadata is to help others understand the data, it has to be recorded in a precise and “understandable” way, i.e. it has to be interoperable. To be interoperable, metadata should use a formal, accessible, shared, and broadly applicable language for knowledge representation.

One of the simplest examples is the problem of author disambiguation.

Why we need ORCID

After Library Carpentry FAIR Data

Open Researcher and Contributor ID (ORCID)

Have you ever searched for yourself in PubMed and found that you have a doppelganger? How can you uniquely associate something you created with yourself and not with another researcher who shares your name?

ORCID is a free, unique, persistent identifier that you own and control—forever. It distinguishes you from every other researcher across disciplines, borders, and time.

ORCIDs of authors of this episode are:

You can connect your iD with your professional information—affiliations, grants, publications, peer review, and more. You can use your iD to share your information with other systems, ensuring you get recognition for all your contributions, saving you time and hassle, and reducing the risk of errors.

If you do not have an ORCID, you should register to get one!

ORCID provides a registry of researchers so that they can be precisely identified. Similarly, other registries can be used to identify many biological concepts and entities.

BioPortal or NCBI are good places to start searching for a registry or a term.
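
Because public identifiers follow a fixed format, software can check them before they enter your metadata. The sketch below assumes the published ORCID iD structure (16 characters in four hyphen-separated groups, with the last character an ISO 7064 MOD 11-2 check character). The function names are our own, chosen for illustration, and the code only checks the format; it does not confirm that the iD is actually registered.

```python
import re

# Shape of an ORCID iD: four groups of four, the last character may be "X".
ORCID_SHAPE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for char in base_digits:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the iD has the right shape and a matching check character."""
    if not ORCID_SHAPE.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]


if __name__ == "__main__":
    print(is_valid_orcid("0000-0002-1825-0097"))  # sample iD used in ORCID documentation
    print(is_valid_orcid("0000-0002-1825-0098"))  # same iD with the last digit altered
```

A check like this catches typing mistakes early, which is one practical advantage of an identifier over a free-text author name.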

Exercise 1: Public ID in action (3 min)

The Wellcome Open Research journal uses ORCID to identify authors.

  • Open one of our papers, doi.org/10.12688/wellcomeopenres.15341.2, and see how public IDs such as ORCID can be used to interlink information.

  • If you have not done so yet, register yourself at ORCID.

Solution

ORCID is used to link to the authors' profiles, which list their other publications.

Exercise 2: Public ID in action 2 (3 min)

  • The second metadata example (the Excel table) contains two other types of public IDs.

    Figure: Metadata in data table example. Figure credits: Tomasz Zielinski and Andrés Romanowski

    • Can you find the public IDs?
    • Can you find the meaning behind those IDs?

Solution

The metadata example contains gene IDs from The Arabidopsis Information Resource (TAIR) and metabolite IDs from KEGG.
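
Public IDs can often be recognised by their shape alone. As a rough sketch, the code below guesses which registry a value comes from. The two patterns are assumptions written for illustration and are not taken from the lesson's table: TAIR locus identifiers typically look like AT1G01010, and KEGG compound identifiers like C00031. A match only suggests where to look the ID up; it does not prove the entry exists.

```python
import re

# Assumed identifier shapes (illustrative, not an official specification):
#   TAIR locus (AGI code): "AT", chromosome (1-5, C or M), "G", five digits
#   KEGG compound:         "C" followed by five digits
ID_PATTERNS = {
    "TAIR locus": re.compile(r"AT[1-5CM]G\d{5}", re.IGNORECASE),
    "KEGG compound": re.compile(r"C\d{5}"),
}


def classify_identifier(value: str) -> str:
    """Return the name of the first registry whose pattern matches, else 'unknown'."""
    text = value.strip()
    for registry, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(text):
            return registry
    return "unknown"


if __name__ == "__main__":
    for cell in ["AT1G01010", "C00031", "E. coli"]:
        print(cell, "->", classify_identifier(cell))
```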

Ontologies (7 min teaching)

Disambiguation

In academic disciplines we quickly run into problems of naming standards e.g.:

  • Escherichia coli
  • EColi
  • E. coli
  • E. Coli
  • Kanamycin A
  • Kanamycin
  • Kanam.
  • Kan.

Ontologies provide a standardised, formal naming system and define categories, properties, and relationships between data. They make it possible to describe the properties of a subject area and how they are related (e.g. in a taxonomy). Using such a controlled vocabulary reduces the complexity of the data.
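
To see why free-text names are a problem for software, consider mapping the spelling variants listed above onto one preferred term each. The following is a minimal sketch with a hand-written, hypothetical synonym table. In particular, treating "Kanamycin" and its abbreviations as "Kanamycin A" is an assumption made for the example, and a real project would take its synonyms from a curated resource.

```python
import re

# Hypothetical synonym table: keys are normalised spelling variants,
# values are the single preferred term to store in the metadata.
PREFERRED_TERMS = {
    "escherichia coli": "Escherichia coli",
    "ecoli": "Escherichia coli",
    "e coli": "Escherichia coli",
    "kanamycin a": "Kanamycin A",
    "kanamycin": "Kanamycin A",  # assumption: the shorter names mean Kanamycin A
    "kanam": "Kanamycin A",
    "kan": "Kanamycin A",
}


def normalise(raw: str) -> str:
    """Lower-case the text and turn every run of punctuation or spaces into one space."""
    return re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()


def to_preferred_term(raw: str):
    """Return the preferred term for a variant, or None if the variant is unknown."""
    return PREFERRED_TERMS.get(normalise(raw))


if __name__ == "__main__":
    for variant in ["EColi", "E. coli", "E. Coli", "Kanam.", "Kan.", "Ampicillin"]:
        print(repr(variant), "->", to_preferred_term(variant))
```

Every new spelling needs a new table entry, which is why agreeing on shared terms up front is cheaper than cleaning up afterwards.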

Controlled Vocabulary

Definition: Any closed prescribed list of terms

Key Features:

  • Terms are not usually defined
  • Relationships between the terms are not usually defined
  • The simplest form is a list

Example:

  • E. coli
  • Drosophila melanogaster
  • Homo sapiens
  • Mus musculus
  • Salmonella
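
In practice, a controlled vocabulary is often enforced by checking every metadata value against the closed list. The sketch below does this for the example list; the function name is our own, and the comparison is deliberately exact, so any other spelling is reported as invalid.

```python
# The closed list from the example above.
ORGANISM_VOCABULARY = frozenset({
    "E. coli",
    "Drosophila melanogaster",
    "Homo sapiens",
    "Mus musculus",
    "Salmonella",
})


def find_invalid_terms(values):
    """Return the values that are not in the controlled vocabulary, in input order."""
    return [value for value in values if value not in ORGANISM_VOCABULARY]


if __name__ == "__main__":
    column = ["E. coli", "EColi", "Homo sapiens", "human"]
    print(find_invalid_terms(column))
```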

A controlled vocabulary (a list) can be organised hierarchically into a taxonomy, a system most of us know from the taxonomy of species.

Taxonomy

Definition: Any controlled vocabulary that is arranged in a hierarchy

Key Features:

  • Terms are not usually defined
  • Relationships between the terms are not usually defined
  • Terms are arranged in a hierarchy

Example:

  • Bacteria
    • E. coli
    • Salmonella
  • Eukaryota
    • Mammalia
      • Homo sapiens
      • Mus musculus
    • Insecta
      • Drosophila melanogaster
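
A hierarchy like this can be stored as a simple child-to-parent table, which already allows queries that a flat list cannot answer. The sketch below is one possible representation, not a standard format. It places Insecta under Eukaryota, since insects are eukaryotes, and the function names are hypothetical.

```python
# Each term points to its direct parent; top-level terms have no entry.
PARENT = {
    "E. coli": "Bacteria",
    "Salmonella": "Bacteria",
    "Mammalia": "Eukaryota",
    "Insecta": "Eukaryota",
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Drosophila melanogaster": "Insecta",
}


def ancestors(term: str) -> list:
    """Return all broader terms, from the direct parent up to the top level."""
    lineage = []
    while term in PARENT:
        term = PARENT[term]
        lineage.append(term)
    return lineage


def belongs_to(term: str, group: str) -> bool:
    """Return True if `group` is found anywhere above `term` in the hierarchy."""
    return group in ancestors(term)


if __name__ == "__main__":
    print(ancestors("Mus musculus"))
    print(belongs_to("Drosophila melanogaster", "Eukaryota"))
    print(belongs_to("E. coli", "Eukaryota"))
```

With this structure, a search for "Eukaryota" can also return records annotated with "Mus musculus", even though the record never mentions the broader term.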

Ontologies add a further dimension to controlled vocabularies and taxonomies. They allow us to describe relationships between the terms in the established hierarchy, which helps with more sophisticated data queries and metadata searches.

Ontology

Definition: A formal conceptualisation of a specified domain

Key Features:

  • Terms are DEFINED
  • Relationships between the terms are DEFINED, allowing logical inference and sophisticated data queries
  • Terms are arranged in a hierarchy
  • Terms are expressed in a knowledge representation language such as RDFS, OBO, or OWL

Example:

  • Bacteria
    • E. coli
    • Salmonella
  • Eukaryota -- has_part --> nucleus
    • Mammalia -- has_part --> placenta
      • Homo sapiens
      • Mus musculus
    • Insecta
      • Drosophila melanogaster

Figure: ontology example. Figure credits: Tomasz Zielinski
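
The logical inference mentioned in the key features can be illustrated with a toy example. The sketch below assumes that a has_part relationship stated for a group also holds for everything inside that group, so Mus musculus is inferred to have both a placenta and a nucleus although neither is stated for it directly. Real ontologies are written in OWL or OBO and queried with dedicated reasoners; the plain dictionaries and function names here are stand-ins for illustration only.

```python
# Hierarchy ("is_a") relationships: child -> parent.
IS_A = {
    "E. coli": "Bacteria",
    "Salmonella": "Bacteria",
    "Mammalia": "Eukaryota",
    "Insecta": "Eukaryota",
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Drosophila melanogaster": "Insecta",
}

# Relationships stated explicitly in the example above.
HAS_PART = {
    "Eukaryota": {"nucleus"},
    "Mammalia": {"placenta"},
}


def inferred_parts(term: str) -> set:
    """Collect parts stated for the term itself and for every group above it."""
    parts = set(HAS_PART.get(term, ()))
    while term in IS_A:
        term = IS_A[term]
        parts |= HAS_PART.get(term, set())
    return parts


def terms_with_part(part: str) -> list:
    """Return, in alphabetical order, every term that has the part by inference."""
    all_terms = set(IS_A) | set(IS_A.values())
    return sorted(term for term in all_terms if part in inferred_parts(term))


if __name__ == "__main__":
    print(sorted(inferred_parts("Mus musculus")))
    print(terms_with_part("placenta"))
```

A query such as "all organisms with a nucleus" can then be answered without anyone having annotated each species with that fact.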



  1. Finding ontologies: https://bioportal.bioontology.org/
  2. List of recommended ontologies: http://www.obofoundry.org/
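
Once you have found a term, record its identifier rather than only its label. OBO Foundry ontologies conventionally publish each term under a permanent URL built from the short form of the ID. The sketch below expands a short ID such as NCBITaxon:562 (Escherichia coli in the NCBI Taxonomy) into that URL. The function name is hypothetical, and the code only builds the string; it does not check that the term exists.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def obo_id_to_url(short_id: str) -> str:
    """Turn 'PREFIX:LOCALID' into the conventional OBO permanent URL."""
    prefix, separator, local_id = short_id.strip().partition(":")
    if not separator or not prefix or not local_id:
        raise ValueError(f"expected an ID of the form PREFIX:LOCALID, got {short_id!r}")
    return f"{OBO_BASE}{prefix}_{local_id}"


if __name__ == "__main__":
    print(obo_id_to_url("NCBITaxon:562"))
```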

Key Points

  • Public identifiers and ontologies are key to knowledge discovery

  • Automatic data aggregation needs standardised metadata formats and values