Sequence database

In bioinformatics, sequence databases store and manage collections of DNA, RNA, or protein sequences. A sequence database may be restricted to a single organism, for example all proteins of the yeast Saccharomyces cerevisiae, or it may hold DNA sequences from all organisms whose genomes have been sequenced. There are several ways to search such databases; the most common is to look for DNA or protein sequences that are similar to a known sequence. The BLAST program supports this kind of query.
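
The idea of a similarity query can be illustrated with a small sketch. This is not BLAST and does not reproduce its algorithm; it is a deliberately simple stand-in that ranks entries by how many short subsequences (k-mers) they share with the query. The function names, the toy database, and the choice of k = 3 are all assumptions made for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_similarity(query, database, k=3):
    """Rank entries by Jaccard similarity of their k-mer sets to the query.

    This is a toy measure, not the scoring scheme used by BLAST.
    """
    query_kmers = kmers(query, k)
    scored = []
    for name, seq in database.items():
        entry_kmers = kmers(seq, k)
        union = query_kmers | entry_kmers
        score = len(query_kmers & entry_kmers) / len(union) if union else 0.0
        scored.append((name, score))
    # Highest similarity first
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy database; the identifiers and sequences are invented.
toy_db = {
    "seq_a": "ATGGCGTACGTTAGC",
    "seq_b": "ATGGCGTACGTTAGG",
    "seq_c": "TTTTTTTTTTTTTTT",
}
hits = rank_by_similarity("ATGGCGTACGTTAGC", toy_db)
```

Here the identical entry ranks first, the entry that differs in one base ranks second, and the unrelated entry ranks last. Real search tools use far more sensitive scoring and indexing than this sketch.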

The biggest problem with very large sequence databases is that entries come from many different sources, ranging from individual researchers to large genome sequencing centers. As a result, the quality of both the sequences and their biological annotations varies considerably. Redundancy is also very common, because many laboratories submit sequences that are identical or nearly identical to entries that have already been deposited.
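
As a rough illustration of the redundancy issue, the following sketch removes exact duplicates from a list of submissions by hashing a normalized form of each sequence. It is only a minimal sketch under stated assumptions: the record layout, the identifiers, and the function name are hypothetical, and nearly identical sequences, which the text also mentions, are not detected by this approach.

```python
import hashlib


def remove_exact_duplicates(records):
    """Keep the first occurrence of each distinct sequence.

    records is a list of (identifier, sequence) pairs. Sequences are
    normalized (whitespace removed, upper-cased) before comparison.
    Only exact duplicates are removed; near-duplicates are kept.
    """
    seen = set()
    unique = []
    for identifier, seq in records:
        normalized = "".join(seq.split()).upper()
        digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((identifier, normalized))
    return unique


# Hypothetical submissions from three laboratories.
submissions = [
    ("lab1_001", "ATGGCGTAC"),
    ("lab2_417", "atggcgtac"),   # same sequence, different letter case
    ("lab3_009", "ATGGCGTAA"),   # differs in the last base, so it is kept
]
unique_records = remove_exact_duplicates(submissions)
```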

In addition, many annotations are not based on laboratory experiments but on the results of sequence similarity searches against previously annotated sequences. Because a sequence annotated in this way and stored in the database can itself become the basis for later annotations, several intermediate annotations may lie between a given database entry and the information actually obtained from a laboratory experiment. This is known as the transitive annotation problem, that is, the propagation of annotations from one entry to the next. Biological annotations in the large sequence databases should therefore be treated with some skepticism unless they are supported either by references to relevant, high-quality experimental data in scientific publications or by references to a manually curated sequence database (such as Swiss-Prot).
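
The chain of inferred annotations described above can be made concrete with a small sketch that counts how many similarity-based steps separate an entry from experimental evidence. The data model is an assumption made for this example: each entry records an "evidence" type and, for inferred annotations, the "source" entry it was copied from. No real database exposes exactly this structure.

```python
def inference_steps(entry_id, annotations):
    """Count similarity-based steps from an entry back to experimental evidence.

    Returns 0 for an experimentally supported entry, a positive number for
    an inferred one, and None if the chain never reaches experimental
    evidence (missing source or circular references).
    """
    steps = 0
    visited = set()
    current = entry_id
    while current is not None and current not in visited:
        visited.add(current)
        record = annotations.get(current)
        if record is None:
            return None
        if record["evidence"] == "experimental":
            return steps
        # Follow the annotation back to the entry it was inferred from
        current = record.get("source")
        steps += 1
    return None


# Hypothetical annotation records; identifiers are invented.
toy_annotations = {
    "P1": {"evidence": "experimental"},
    "P2": {"evidence": "similarity", "source": "P1"},
    "P3": {"evidence": "similarity", "source": "P2"},
    "P4": {"evidence": "similarity", "source": "P5"},
    "P5": {"evidence": "similarity", "source": "P4"},
}
steps_p3 = inference_steps("P3", toy_annotations)
steps_p4 = inference_steps("P4", toy_annotations)
```

In this toy data, P3 is two inference steps away from the experiment behind P1, while P4 and P5 only refer to each other and never reach experimental support.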

Examples

GenBank (DNA database)
UniProt (protein database)
GISAID (sequence database)