Being precise
Overview
Teaching: 23 min
Exercises: 6 minQuestions
How to make my metadata interoperable?
How to avoid disambiguation?
Objectives
Using public identifiers
Understand difference between close vocabulary and ontology
Finding ontology terms
Being precise
If the metadata purpose is to help understand the data, it has to be done in a precise and “understandable” way i.e. it has to be interoperable. To be interoperable metadata should use a formal, accessible, shared, and broadly applicable terms/language for knowledge representation.
One of the easiest examples is the problem of author disambiguation.
After Libarary Carpentry FAIR Data
Open Researcher and Contributor ID (ORCID)
Have you ever searched yourself in pubmed and found that you have a doppelganger? So how can you uniquely associate something you created to just yourself and no other researcher with the same name?
ORCID is a free, unique, persistent identifier that you own and control—forever. It distinguishes you from every other researcher across disciplines, borders, and time.
ORCIDs of authors of this episode are:
You can connect your iD with your professional information—affiliations, grants, publications, peer review, and more. You can use your iD to share your information with other systems, ensuring you get recognition for all your contributions, saving you time and hassle, and reducing the risk of errors.
If you do not have an ORCID, you should register to get one!
ORCID provides the registry of researchers, so they can be precisely identified. Similarly, there are other registries that can be used to identify many of biological concepts and entities:
- species e.g. NCBI taxonomy
- chemicals e.g. ChEBI
- proteins e.g. UniProt
- genes e.g. GenBank
- metabolic reactions, enzymes e.g KEGG
BioPortal or NCBI are good places to start searching for a registry or a term.
Exercise 1: Public ID in action (3 min)
The Wellcome Open Research journal uses ORCID to identify authors.
Open one of our papers doi.org/10.12688/wellcomeopenres.15341.2 and have a look how public IDs such as ORCID can be used to interlink information.
If you have not done so yet, register yourself at ORCID*
Solution
ORCID is used to link to authors profiles which list their other publications.
Exercise 2: Public ID in action 2 (3 min)
The second metadata example (the Excel table): contains two other types of public IDs. Figure credits: Tomasz Zielinski and Andrés Romanowski
- Can you find the public IDs?
- Can you find the meaning behind those Ids?
Solution
The metadata example contains genes IDs from The Arabidopsis Information Resource TAIR and metabolites IDs from KEGG
Ontologies (7 min teaching)
Disambiguation
In academic disciplines we quickly run into problems of naming standards e.g.:
- Escherichia coli
- EColi
- E. coli
- E. Coli
- Kanamycin A
- Kanamycin
- Kanam.
- Kan.
Ontologies represent a standardised, formal naming system and define categories, properties and relationships between data. Ontologies allow to describe properties of a subject area and how they are related (e.g. taxonomy). This reduces the complexity of the data through use of controlled vocabulary.
Controlled Vocabulary
Definition: Any closed prescribed list of terms
Key Features:
- Terms are not usually defined
- Relationships between the terms are not usually defined
- the simplest form is a list
Example:
- E. coli
- Drosophila melanogaster
- Homo sapiens
- Mus musculus
- Salmonella
Use of controlled vocabulary (a list) can be organised hierarchically into a taxonomy, a system we know mostly from our species taxonomy.
Taxonomy
Definition: Any controlled vocabulary that is arranged in a hierarchy
Key Features:
- Terms are not usually defined
- Relationships between the terms are not usually defined
- Terms are arranged in a hierarchy
Example:
- Bacteria
- E. coli
- Salmonella
- Eucariota
- Mammalia
- Homo sapiens
- Mus musculus
- Insecta
- Drosophila melanogaster
Ontologies add a further dimension to controlled vocabularies and taxonomy. They allow us to conceptualise relationships between the established hierarchy which helps with more sophisticated data queries and metadata searches.
Ontology
Definition: A formal conceptualisation of a specified domain
Key Features:
- Terms are DEFINED
- Relationships between the terms are DEFINED, allowing logical inference and sophisticated data queries
- Terms are arranged in a hierarchy
- expressed in a knowledge representation language such as RDFS, OBO, or OWL
Example:
- Bacteria
- E. coli
- Salmonella
- Eucariota — has_part —> nucleas
- Mammalia — has_part —> placenta
- Homo sapiens
- Mus musculus
- Insecta
- Drosophila melanogaster
Figure credits: Tomasz Zielinski
Here some links to help you find the right ontologies for your research:
- Finding ontologies: https://bioportal.bioontology.org/
- List of recommended ontologies: http://www.obofoundry.org/
Key Points
Public identifiers and ontologies are key to knowledge discovery
Automatic data aggregations needs standarised metadata formats and values