Large dataset manager
Mentor: DebianMed
Summary: Download, process, manage and integrate large public datasets into Debian.
Required skills:
- Familiarity with one programming or scripting language.
- Familiarity with Debian packaging.
- Bioinformatics or expertise in another field using large public datasets.
Description:
Large public datasets, such as bioinformatics databases, are typically too big and too volatile to fit Debian's traditional source/binary packaging scheme. Some programs distributed in Debian, like blast and emboss, can index specialised databases, but Debian lacks a tool to install or update the datasets they need and to keep their indexes in sync. This task is traditionally performed by hand or with custom scripts by the local administrator. However, the development of cloud computing increases the need to separate the operating system and its analysis software from the repository containing the reference data used during the analysis. At the same time, one will not want to wait and pay for re-indexing all the datasets each time an instance is launched. This is why we think there is a need to distribute pre-indexed data in packages adapted to the default configuration and layout of a Debian system.
The Debian Med project is looking for a student interested in managing local copies of large datasets using the same paradigms as software management in the Debian operating system. We encourage the design of a tool that works across multiple fields of interest (not only biology) and operating systems (not only Debian).
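To illustrate what keeping a dataset and its index in sync could involve, here is a minimal sketch in Python. It is not part of getData or any existing Debian tool. The function names, the JSON manifest format and the file names are all hypothetical, and the sketch assumes that a checksum comparison is enough to decide whether a dataset must be re-indexed.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_reindex(dataset: Path, manifest: Path) -> bool:
    """Return True if the dataset differs from the checksum in the manifest.

    The manifest is a hypothetical JSON file written after the last
    successful indexing run; a missing manifest means "never indexed".
    """
    if not manifest.exists():
        return True
    recorded = json.loads(manifest.read_text())
    return recorded.get("sha256") != file_sha256(dataset)


def record_index(dataset: Path, manifest: Path) -> None:
    """Store the dataset checksum once (re-)indexing has succeeded."""
    entry = {"dataset": dataset.name, "sha256": file_sha256(dataset)}
    manifest.write_text(json.dumps(entry))
```

A package maintainer script or an update job could call needs_reindex() before running the indexer (for example the one shipped with blast or emboss) and record_index() afterwards, so that an unchanged dataset is never indexed twice.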
As a starting point or a source of inspiration, students can have a look at the getData programs, with which we are currently exploring the issues of data management.
To apply, please contact us at debian-med@lists.debian.org.