Introduction
Software changes continually, and package maintainers do their best to keep pace with upstream's progress. For large databases, however, preparing regular Debian packages seems inappropriate, since
- data is released frequently
  - some users demand weekly updates
  - others refer to official releases
- some databases are large, e.g. UniProt/Pfam are all beyond the gigabyte barrier
- updating databases will demand further operations
  - update of indices
  - ..? (this depends on the other tools and packages that are installed on the machine)
A tool is needed to help automate the update of packages. A first rudimentary skeleton, getData.pl, was prepared in the Debian-Med Subversion repository.
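The requirements listed above (weekly updates for some users, official releases for others, and follow-up operations such as rebuilding indices) could be sketched roughly as follows. This is only an illustration under stated assumptions and does not describe how getData.pl works: the names `DatabaseState`, `UpdatePolicy`, `needs_update` and `update`, the two policy modes, and the hook mechanism are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable, List, Optional


@dataclass
class DatabaseState:
    """Local knowledge about one mirrored database (hypothetical model)."""
    name: str
    installed_release: Optional[str] = None  # last official release installed
    last_update: Optional[date] = None       # day of the last successful update


@dataclass
class UpdatePolicy:
    """mode 'weekly' refreshes every 7 days; 'release' follows official releases."""
    mode: str = "release"
    # Follow-up operations (e.g. rebuilding indices), run after each fetch.
    hooks: List[Callable[[str], None]] = field(default_factory=list)


def needs_update(state: DatabaseState, policy: UpdatePolicy,
                 upstream_release: str, today: date) -> bool:
    """Decide whether the local copy should be refreshed."""
    if state.last_update is None:
        return True  # never fetched before
    if policy.mode == "weekly":
        return today - state.last_update >= timedelta(days=7)
    if policy.mode == "release":
        return upstream_release != state.installed_release
    raise ValueError(f"unknown policy mode: {policy.mode!r}")


def update(state: DatabaseState, policy: UpdatePolicy, upstream_release: str,
           today: date, fetch: Callable[[str], None]) -> bool:
    """Fetch the data if needed, then run the follow-up hooks in order."""
    if not needs_update(state, policy, upstream_release, today):
        return False
    fetch(state.name)            # download step, supplied by the caller
    for hook in policy.hooks:    # e.g. update of indices
        hook(state.name)
    state.installed_release = upstream_release
    state.last_update = today
    return True
```

The download step and the follow-up operations are passed in as plain callables, so what actually runs can depend on the tools and packages installed on the machine.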
Public Databases that may be considered for Debian
| Name | Contents | Licence | Package | Treated by getData.pl |
| | Public sequences | publicly available | | |
| | Sequence alignments version 3, and a C program for scoring | unknown, but contains a header file from ClustalW, which is not free | | |
| | Multiple alignments and scoring system | | | |
| | Pseudogenes | unknown | | |
| | Transcription factor binding sites | "Freely available" | | |
| | Regulatory sequences | LGPL | | |
| | Public repository for regulatory data | Says Open Source but not found | | |
| | A big table associating PMIDs, PMCIDs and DOIs for all the articles in PubMed Central | | | |
| | Restriction enzymes | ? | | |
| | Repeat elements | academic, registration needed | | |
| | Repeat elements (Zebrafish) | no license | | |
Many more free databases can probably be found in the database issue of Nucleic Acids Research: http://nar.oxfordjournals.org/content/vol34/suppl_1/index.dtl
We also need open-source software to warehouse the databases
| Name | Licence | Package |
| | GPL | depends on PHP and MySQL or PostgreSQL |
| | Artistic | depends on Java |
| | GPL | depends on Perl and SQL |
| | MPL 1.1 | |
| | 4-clause BSD | complex |
| | LGPLv3 | |
Here is an interesting link about tools for biological data: http://biodatamodel.org/