FASTA format

from Wikipedia, the free encyclopedia

The FASTA format is a text-based format for representing and storing the primary structure of nucleic acids (nucleic acid sequences) and proteins (protein sequences) in bioinformatics. The nucleobases or amino acids are represented by a one-letter code. The format allows each sequence to be preceded by a name and comments.

The simplicity of the format makes it easy for text-processing tools and scripting languages to read and process the data.

Format

A sequence in FASTA format begins with a one-line description, followed by the sequence data. It is recommended that each line of the file contain a maximum of 80 characters. A sequence ends with the end of the file or another header line.
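The 80-character recommendation can be illustrated with a small writer. This is a minimal sketch rather than a reference implementation; the function name `format_fasta` and its `width` parameter are assumptions made for this example and do not come from any particular library.

```python
def format_fasta(header: str, sequence: str, width: int = 80) -> str:
    """Return one FASTA record with the sequence wrapped to `width` characters.

    `format_fasta` is a hypothetical helper written for illustration only.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    # The header line starts with ">" and is written on a single line.
    lines = [">" + header]
    # Cut the sequence into consecutive chunks of at most `width` characters.
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

Calling `format_fasta("seq1 example", sequence)` returns the record as a single string that can be written to a file or concatenated with further records.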

The following is a simple example of a protein sequence in FASTA format, cytochrome b of the Asian elephant:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
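Reading such a record back follows directly from the rules above: a line starting with ">" opens a record, and every following line up to the next header or the end of the file belongs to its sequence. The following is a minimal sketch under those assumptions; `read_fasta` is a name chosen for this example, and comment lines are not handled here.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; it ignores blank lines and
    any text that appears before the first header line.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    # The end of the input ends the last record.
    if header is not None:
        yield header, "".join(chunks)
```

Because the function accepts any iterable of lines, an open text file can be passed to it directly, for example `for header, seq in read_fasta(open(path)): ...`, where `path` is a placeholder for a file of your own.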

Header

The header (also called the header line) is the line that contains a (unique) name and a description of a sequence. It precedes the sequence data and begins with a greater-than sign (">"). The name and/or ID of the sequence follows immediately, without intervening spaces. Many sequence databases use standardized headers, from which various pieces of information can be extracted automatically. A header can also contain several IDs, which are then separated by a ^A (Control-A) character. A header in this standardized form is optional. What matters is that multiple sequences in one FASTA file are separated from one another by a line consisting of ">" followed by a description.
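The structure of a header line can be sketched as follows: the text up to the first whitespace is treated as the name or ID, the remainder as the description, and Control-A characters separate multiple entries. The function name `parse_header` is an assumption made for this example, and real databases may lay out their headers differently.

```python
from typing import List, Tuple


def parse_header(line: str) -> List[Tuple[str, str]]:
    """Split a header line into (identifier, description) pairs.

    Hypothetical helper for illustration. Entries separated by
    Control-A (character code 1) produce one pair each.
    """
    if not line.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    pairs = []
    for entry in line[1:].rstrip("\r\n").split("\x01"):
        # The identifier ends at the first whitespace; the rest is description.
        parts = entry.split(maxsplit=1)
        identifier = parts[0] if parts else ""
        description = parts[1].strip() if len(parts) > 1 else ""
        pairs.append((identifier, description))
    return pairs
```

Applied to the elephant example above, this yields the identifier `gi|5524211|gb|AAD44166.1|` and the description `cytochrome b [Elephas maximus maximus]`.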

Comments

The header may optionally be followed by one or more comment lines, each beginning with a semicolon (";"). The semicolon must be the first character of the line. Many databases and application programs do not recognize comments, so they are practically never found in current sequence databases. They are nevertheless part of the official format. An example of a FASTA file with multiple sequences and comment lines:

>Sequence 1
;Comment line A
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGEVAAQL
>Sequence 2
;Comment line B
;Comment line C
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
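Since many programs do not recognize comment lines, one cautious approach is to remove them before handing a file to such a program. The sketch below assumes only the rule stated above, namely that a comment is a line whose first character is a semicolon; `strip_comment_lines` is a name chosen for this example.

```python
from typing import Iterable, Iterator


def strip_comment_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line except FASTA comment lines.

    Hypothetical helper for illustration. A line counts as a comment
    only if the semicolon is its very first character.
    """
    for line in lines:
        if not line.startswith(";"):
            yield line
```

The result is again a sequence of lines, so it can be fed to any reader that understands only header and sequence lines.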

Sequence representation

The header and any comments are followed by one or more lines containing the sequence. No line should contain more than 80 characters. Sequences can be protein or nucleic acid sequences and may contain gaps and alignment characters. They should be written using the IUB/IUPAC standard codes for amino acids and nucleic acids, with the following exceptions and conventions:

  • Lower case letters are allowed, but will be converted to upper case
  • A hyphen or dash represents a gap
  • In amino acid sequences, "U" and "*" are valid characters (see below)
  • Nucleotide sequences are given in the 5' to 3' direction

Numeric characters are not allowed in the sequence itself, although some databases use them to indicate positions within the sequence.
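The conventions above (lower case is converted to upper case, digits are position markers rather than sequence data) suggest a simple clean-up step before further processing. The following is a sketch under those assumptions; `normalize_sequence` is a hypothetical name, and dropping digits silently is a design choice of this example, not something the format prescribes.

```python
def normalize_sequence(raw: str) -> str:
    """Return the sequence in upper case without digits or whitespace.

    Hypothetical helper for illustration. Gap ("-") and stop ("*")
    characters are kept unchanged.
    """
    return "".join(
        ch.upper()
        for ch in raw
        if not ch.isdigit() and not ch.isspace()
    )
```

A stricter variant could raise an error on digits instead of removing them, which may be preferable when the origin of the data is unknown.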

Table I: Permitted codes for nucleobases
Code  Meaning
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
U     Uracil
R     G or A (puRine)
Y     T or C (pYrimidine)
K     G or T (Ketone)
M     A or C (aMino group)
S     G or C (Strong interaction)
W     A or T (Weak interaction)
B     G, T or C (not A; B comes after A)
D     G, A or T (not C; D comes after C)
H     A, C or T (not G; H comes after G)
V     G, C or A (not T, not U; V comes after U)
N     A, G, C or T (aNy)
-     Gap of indefinite length

Table II: Permitted codes for amino acids
Code  Meaning
A     Alanine
B     Aspartic acid or asparagine
C     Cysteine
D     Aspartic acid
E     Glutamic acid
F     Phenylalanine
G     Glycine
H     Histidine
I     Isoleucine
K     Lysine
L     Leucine
M     Methionine
N     Asparagine
P     Proline
Q     Glutamine
R     Arginine
S     Serine
T     Threonine
U     Selenocysteine
V     Valine
W     Tryptophan
Y     Tyrosine
Z     Glutamic acid or glutamine
X     Any amino acid
*     Translation stop
-     Gap of indefinite length
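
The two code tables can be turned into a simple validity check. The sketch below encodes exactly the characters listed in the tables above; the constant and function names are assumptions made for this example, and individual tools may accept additional or fewer characters.

```python
from typing import List, Set

# Characters taken from the nucleobase table above.
NUCLEOTIDE_CODES: Set[str] = set("ACGTURYKMSWBDHVN-")
# Characters taken from the amino acid table above.
AMINO_ACID_CODES: Set[str] = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def invalid_characters(sequence: str, alphabet: Set[str]) -> List[str]:
    """Return the sorted characters of `sequence` that are not in `alphabet`.

    Hypothetical helper for illustration. Lower case letters are
    accepted because they are converted to upper case first.
    """
    return sorted(set(sequence.upper()) - alphabet)
```

An empty result means that every character of the sequence appears in the chosen table; a non-empty result lists the offending characters.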

File extension

There is no standard file extension for a text file in FASTA format. However, the following extensions are often used: .fa, .mpfa, .fna, .fsa or .fasta.

Sequence IDs

The National Center for Biotechnology Information (NCBI) has defined a standard for the IDs used for sequences. This "SeqID" is used in the header. The formatdb help page states: "formatdb will automatically parse the SeqID and create indexes, but the database identifiers in the FASTA definition line must follow the conventions of the FASTA Defline Format."

However, this does not define the header format conclusively. The possible forms are listed below:

Database                            SeqID format
GenBank                             gi|gi-number|gb|accession|locus
EMBL Data Library                   gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan         gi|gi-number|dbj|accession|locus
NBRF PIR                            pir||entry
Protein Research Foundation         prf||name
SWISS-PROT                          sp|accession|name
TrEMBL                              tr|accession|name
Brookhaven Protein Data Bank (1)    pdb|entry|chain
Brookhaven Protein Data Bank (2)    entry:chain|PDBID|CHAIN|SEQUENCE
Patents                             pat|country|number
GenInfo Backbone Id                 bbs|number
General database identifier         gnl|database|identifier
NCBI Reference Sequence             ref|accession|locus
Local sequence identifier           lcl|identifier

The vertical bars are not separators in the sense of Backus-Naur form (they do not denote alternatives) but literal characters that are part of the format.
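
Because the vertical bars are literal, a SeqID can be split on them and interpreted by its leading database tag. The following is a rough sketch based only on the forms listed above; the field names in `SEQID_FIELDS` and the function `parse_seqid` are assumptions made for this example, and the second Protein Data Bank form (`entry:chain|...`) is deliberately not handled.

```python
from typing import Dict, List

# Field names following each database tag, taken from the list above.
# "unused" marks the empty field in forms such as pir||entry.
SEQID_FIELDS: Dict[str, List[str]] = {
    "gi": ["gi_number"],
    "gb": ["accession", "locus"],
    "emb": ["accession", "locus"],
    "dbj": ["accession", "locus"],
    "pir": ["unused", "entry"],
    "prf": ["unused", "name"],
    "sp": ["accession", "name"],
    "tr": ["accession", "name"],
    "pdb": ["entry", "chain"],
    "pat": ["country", "number"],
    "bbs": ["number"],
    "gnl": ["database", "identifier"],
    "ref": ["accession", "locus"],
    "lcl": ["identifier"],
}


def parse_seqid(seqid: str) -> List[Dict[str, str]]:
    """Split a SeqID into one dictionary per database tag.

    Hypothetical helper for illustration; raises ValueError on a tag
    that does not appear in SEQID_FIELDS.
    """
    tokens = seqid.split("|")
    parsed: List[Dict[str, str]] = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag == "":
            i += 1  # stray empty token, e.g. after a trailing "|"
            continue
        if tag not in SEQID_FIELDS:
            raise ValueError(f"unknown database tag: {tag!r}")
        names = SEQID_FIELDS[tag]
        values = tokens[i + 1:i + 1 + len(names)]
        values += [""] * (len(names) - len(values))  # pad missing fields
        entry = {"db": tag}
        entry.update({n: v for n, v in zip(names, values) if n != "unused"})
        parsed.append(entry)
        i += 1 + len(names)
    return parsed
```

For the identifier of the elephant example, `gi|5524211|gb|AAD44166.1|`, this produces one entry for the `gi` number and one for the GenBank accession, with an empty locus field.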

References

  1. FASTA representation of the cytochrome b of an Asian elephant on ncbi.nlm.nih.gov, accessed on August 21, 2018