Introduction

Software changes repeatedly, and package maintainers do their best to keep pace with upstream's progress. It seems inappropriate, though, to prepare regular Debian packages for large databases since

A tool is needed to help automate the updating of packages. A first rudimentary skeleton was prepared with getData.pl in the Debian-Med Subversion repository.
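As an illustration of what such an update helper might do, here is a minimal sketch in Python. It is not getData.pl and does not reflect how that script works; the function names (`install_if_changed`, `update_database`) are hypothetical. It assumes the database is a single file small enough to hold in memory, which will not be true for the largest databases, where a streaming download would be needed instead.

```python
import hashlib
import os
import tempfile
import urllib.request
from pathlib import Path


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def install_if_changed(data: bytes, dest: Path) -> bool:
    """Write data to dest only if it differs from what is already there.

    Returns True if the local copy was created or replaced, False if it
    was already up to date.
    """
    dest = Path(dest)
    if dest.exists() and sha256_of(dest.read_bytes()) == sha256_of(data):
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it,
    # so that readers never see a half-written database file.
    fd, tmp_name = tempfile.mkstemp(dir=str(dest.parent))
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    os.replace(tmp_name, dest)
    return True


def update_database(url: str, dest: Path) -> bool:
    """Download url (hypothetical helper) and install it if it changed."""
    with urllib.request.urlopen(url) as response:
        return install_if_changed(response.read(), dest)
```

A caller would invoke `update_database` with the upstream URL of a database and a local destination path, and could use the boolean result to decide whether dependent indexes need rebuilding.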

Public Databases that may be considered for Debian

| Name | Contents | Licence | Package | Treated by getData.pl |
|---|---|---|---|---|
| Genbank | Public sequences | publicly available | | |
| BAliBASE3 | Sequence alignments version 3, and a C program for scoring | unknown, but contains a header file from ClustalW, which is not free | | |
| OXbench | Multiple alignments and scoring system | | | |
| www.pseudogene.org | pseudogenes | unknown | | |
| Jaspar | Transcription factor binding sites | "Freely available" | | |
| ORegAnno | Regulatory sequences | LGPL | | |
| Pazar | public repository for regulatory data | Says Open Source but not found | | |
| ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz | A big table associating PMIDs, PMCIDs and DOIs for all the articles in Pubmed Central. | | | |
| REbase | Restriction Enzymes | ? | fink | |
| Repbase | Repeat elements | academic, registration needed | | |
| Zebrafish repeats | Repeat elements (Zebrafish) | no license | | |

Many more free databases can probably be found in the database issue of Nucleic Acids Research: http://nar.oxfordjournals.org/content/vol34/suppl_1/index.dtl

We also need open-source software to warehouse the databases

| Name | Licence | Package |
|---|---|---|
| S3DB | GPL | depends on PHP and MySQL or PostgreSQL |
| BioMOBY | Artistic | depends on Java |
| hitkeeper | GPL | depends on Perl and SQL |
| BioWarehouse | MPL 1.1 | |
| mrs | 4-clause BSD | complex |
| BioSQL | LGPLv3 | |

Here is an interesting link about tools for biological data: http://biodatamodel.org/