Tagging biomedical packages with EDAM
The file debian/upstream/edam allows a formal categorisation of a package with concepts from the EDAM ontology (https://bioportal.bioontology.org/ontologies/EDAM?p=classes).
It is formatted in YAML, like the other files in the debian/upstream folder. Use the command-line YAML Lint for consistent validation. NB: the command-line version is not as strict as the online version, whose reformatted output is also rather ugly.
For source packages with multiple binary packages that each need a different EDAM annotation, or whose main binary package is not named like the source package, it is suggested to name the edam file packagename.edam.
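The naming rule above can be sketched as a small lookup. This is only an illustration: the function name edam_file_for is hypothetical, and it assumes that the per-package files live next to the default file in debian/upstream/, which this page does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper: print the EDAM file that applies to a binary package.
# Assumption: per-package files are stored as debian/upstream/<package>.edam,
# next to the default debian/upstream/edam.
edam_file_for() {
    pkg=$1
    srcdir=${2:-.}
    if [ -f "$srcdir/debian/upstream/$pkg.edam" ]; then
        # A package-specific annotation takes precedence.
        printf '%s\n' "$srcdir/debian/upstream/$pkg.edam"
    elif [ -f "$srcdir/debian/upstream/edam" ]; then
        # Otherwise fall back to the annotation for the whole source package.
        printf '%s\n' "$srcdir/debian/upstream/edam"
    else
        return 1
    fi
}
```

Called as edam_file_for packagename /path/to/source, it prints the path of the matching file, or returns a non-zero status if the source tree has no EDAM annotation at all.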
The context of this development is the emerging bio.tools database of the European ELIXIR project. A set of scripts for the automated upload of a bio.tools-ready description from the Debian database (edam, control, copyright, changelog) has already been implemented and is available:
- https://github.com/bio-tools/biotoolsConnect/blob/master/DebianMed/edam.sh in the biotoolsConnect GitHub repository
- https://salsa.debian.org/med-team/community/infrastructure/tree/master/edam/registry-tool.py in the Debian Med Git repository on Salsa
These link the Debian package archive to entries in bio.tools. Those in search of a particular tool may thus become aware of Debian-provided binaries more quickly. Particularly in the biological and medical sciences, the confidence of using the same binary as others do is of particular value. Web services may be an alternative, but for much of today's high-throughput data, I/O is a bottleneck.
Issues
- There is not yet an automated update of entries in bio.tools; today a re-upload would create a second entry. This is not what we want, which is why there are, for example, no weekly re-uploads and no hook in the Debian Med git repository to run what has already been automated.
- Licensing: the data in bio.tools is under a Creative Commons license, which needs to be mentioned in debian/control.
The bio.tools folks have already established a range of EDAM annotations, in part together with renowned community efforts like SeqAnswers.com. How do we exchange information properly? Does the Debian package maintainer freely decide to peek at those resources and say so in the debian/copyright file?
Format description
The example below borrows from the debian/upstream/edam file of the aspiring Debian package condetri, which in turn borrowed from trimmomatic. The first line identifies the ontology and the version the file refers to. As is typical for the EDAM ontology, the whole package then has a single topic. That topic may have several scopes, but typically there is just one, named summary.
---
ontology: EDAM (1.12)
topic:
  - Sequencing
scopes:
  - name: summary
    function:
      - Sequence trimming
      - Sequencing quality control
    inputs:
      - data: Sequence
        formats: [FASTQ]
    outputs:
      - data: Sequence
        formats: [FASTQ]
For some software suites, for instance EMBOSS, it may be suitable to have several scopes to separate the binaries. A scope has functions, with inputs and outputs.
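As a quick sanity check of the layout described above, one could test that a file contains the three top-level keys used in the example (ontology, topic, scopes). The sketch below is purely textual and does not parse YAML, so it complements a YAML linter rather than replacing one. The function name check_edam_keys is hypothetical, and treating these three keys as mandatory is an assumption drawn from the example, not from a formal specification.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file has the top-level keys
# that appear in the example annotation (assumed here to be required).
check_edam_keys() {
    file=$1
    status=0
    for key in ontology topic scopes; do
        # Top-level keys start in the first column and end with a colon.
        if ! grep -q "^$key:" "$file"; then
            echo "missing top-level key: $key" >&2
            status=1
        fi
    done
    return $status
}
```

For example, check_edam_keys debian/upstream/edam returns 0 when all three keys are present and otherwise reports each missing key on standard error.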
Examples
A series of packages already features an EDAM annotation. You may decide to adopt terms from a similar program as a head start:
tophat - lost?
This list is not complete.
Tools helping to organise EDAM annotation
Andreas interlinked the EDAM files with the UDD and provides this script to access the information. Run the following for an overview of the tools that feature an EDAM annotation in Debian:
wget -O edam_query.sh https://raw.githubusercontent.com/bio-tools/biotoolsConnect/master/DebianMed/edam.sh
chmod +x edam_query.sh
# install postgresql client if not already installed
[ -x /usr/bin/psql ] || sudo apt-get install postgresql-client-9.5
./edam_query.sh
This produces a file named edam.txt with everything Debian knows today about EDAM, and more. It almost feels worthy of an upload to bio.tools:
$ head -n 3 edam.txt | tail -n 1
abacas | debian | sid | main | 1.3.1 | abacas | http://abacas.sourceforge.net/ | Algorithm Based Automatic Contiguation of Assembled Sequences | ABACAS is intended to rapidly contiguate (align, order, orientate), +| | | | | 8 / 11 / 168 | 10.1093/bioinformatics/btp347 | {"Probes and primers"} | [{"name": "summary", "inputs": [{"data": "Sequence", "formats": ["FASTA"]}], "outputs": [{"data": "Sequence", "formats": ["FASTA"]}], "function": ["PCR primer design"]}]
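Rows like the one above can be reduced to a short package/topic listing with awk. This is a minimal sketch under two assumptions taken from the sample row only: the package name is the first column, and the topic set is the second-to-last column. The function name edam_topics is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "package<TAB>topics" for each pipe-separated row.
# Assumptions: first column = package name, second-to-last column = topics.
edam_topics() {
    awk -F'|' 'NF >= 3 {
        pkg = $1
        topics = $(NF - 1)
        # Trim the padding that surrounds each column.
        gsub(/^[ \t]+|[ \t]+$/, "", pkg)
        gsub(/^[ \t]+|[ \t]+$/, "", topics)
        # Continuation lines of wrapped descriptions have an empty first column.
        if (pkg != "") print pkg "\t" topics
    }' "$@"
}
```

Run as edam_topics edam.txt, or with rows piped in on standard input. Ruler lines without a pipe character are skipped, while a header row, if present, is passed through like any other row.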
You can also create JSON output by calling the script with the -j option:
$ ./edam_query.sh -j
This script is not intended as a fully fledged tool but rather as an example of a UDD query that can be turned into a tool.
See also
UltimateDebianDatabase (UDD)
The Common Workflow Language aims to provide the means to interconnect tools in the bio.tools database and beyond.