FASTA format
The FASTA format is a text-based format for representing and storing the primary structure of nucleic acids (nucleic acid sequences) and proteins (protein sequences) in bioinformatics. Nucleobases and amino acids are each represented by a one-letter code. The format allows each sequence to be preceded by a name and comments.
The simplicity of the format makes it easy for text-processing tools and scripting languages to read and process the data.
Format
A sequence in FASTA format begins with a single-line description, followed by the sequence data. It is recommended that no line of the file exceed 80 characters. A sequence ends at the end of the file or at the next header line.
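The line-length recommendation is easy to follow when writing files programmatically. The following is a minimal sketch, not a reference implementation; the function name `format_fasta` and its `width` parameter are illustrative assumptions and do not come from any particular library.

```python
def format_fasta(header: str, sequence: str, width: int = 80) -> str:
    """Return one FASTA record with the sequence wrapped at `width` characters.

    `format_fasta` is a hypothetical helper written for this example.
    """
    if width < 1:
        raise ValueError("width must be positive")
    # The header line starts with ">" and is never wrapped.
    lines = [">" + header]
    # Cut the sequence into consecutive chunks of at most `width` characters.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Made-up nucleotide sequence of 200 characters, used only for illustration.
record = format_fasta("seq1 example", "ACGT" * 50, width=80)
print(record)
```

With a sequence of 200 characters and a width of 80, the record consists of the header line followed by three sequence lines.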
The following is a simple example of a protein sequence in FASTA format, taken from cytochrome b of the Asian elephant:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
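Because a record starts at a ">" line and ends at the next ">" line or at the end of the file, a reader can be written as a simple line loop. The sketch below assumes plain text input without comment lines; the name `read_fasta` is an illustrative choice, and the two short records are made-up data.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of text lines.

    Hypothetical helper: the header is returned without the leading ">",
    and the sequence lines of a record are joined into one string.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # The end of the input closes the last record.
    if header is not None:
        yield header, "".join(chunks)


# Made-up example data with two records.
example = """>seq1 first toy record
ACGTAC
GGTT
>seq2 second toy record
TTGACA
"""
for name, seq in read_fasta(example.splitlines()):
    print(name, len(seq))
```

The same function can be applied to an open file object, since iterating over a text file also yields lines.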
Header
The header is the line that contains a (unique) name and a description of the sequence. It precedes the sequence data and begins with a greater-than sign (">"). The name and/or an ID of the sequence follows immediately, without an intervening space. Many sequence databases use standardized headers, from which various pieces of information can be extracted automatically. A header can also contain several IDs, which are then separated by a ^A (Control-A) character. A header in this form is optional; what matters is that multiple sequences in one FASTA file are separated from one another by a line consisting of ">" followed by a description.
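The structure of a header line can be illustrated with a short sketch. It assumes that the identifier is everything up to the first whitespace, that the rest is the description, and that multiple entries are joined by the Control-A character (`"\x01"`). The function name `split_header` is hypothetical.

```python
def split_header(header_line: str):
    """Split a header line into a list of (identifier, description) pairs.

    Hypothetical helper: entries are separated by Control-A ("\x01");
    within an entry, the identifier ends at the first whitespace.
    """
    # Drop the leading ">" if it is still present.
    text = header_line[1:] if header_line.startswith(">") else header_line
    entries = []
    for part in text.split("\x01"):
        fields = part.strip().split(None, 1)
        if not fields:
            continue  # skip empty entries
        identifier = fields[0]
        description = fields[1] if len(fields) > 1 else ""
        entries.append((identifier, description))
    return entries


line = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
print(split_header(line))
```

For the cytochrome b header this yields a single pair whose identifier is `gi|5524211|gb|AAD44166.1|` and whose description is `cytochrome b [Elephas maximus maximus]`.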
Comments
The header may be followed by one or more comment lines, each beginning with a semicolon (";"). The semicolon must be the first character of the line. Many databases and application programs do not recognize comments, so they appear in hardly any current sequence database. They are nevertheless part of the official format. An example of a FASTA file with multiple sequences and comment lines:
>Sequence 1
;Comment line A
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGEVAAQL
>Sequence 2
;Comment line B
;Comment line C
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
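Since many programs do not recognize comment lines, a reader that should accept them has to treat lines starting with ";" separately. The following is a minimal sketch under that assumption; the function name `read_fasta_with_comments`, the dictionary keys, and the short sequences are illustrative only.

```python
def read_fasta_with_comments(text: str):
    """Parse FASTA text into a list of dicts with header, comments and sequence.

    Hypothetical helper: a line is a comment only if ";" is its first character.
    """
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith(">"):
            # Start a new record at every header line.
            current = {"header": line[1:], "comments": [], "sequence": ""}
            records.append(current)
        elif line.startswith(";"):
            # Comment lines are kept apart from the sequence data.
            if current is not None:
                current["comments"].append(line[1:].strip())
        elif line and current is not None:
            current["sequence"] += line.strip()
    return records


# Made-up example data.
sample = """>Sequence 1
;Comment line A
MTEITAAMVK
ELREST
>Sequence 2
;Comment line B
;Comment line C
SATVSEINSE
"""
for rec in read_fasta_with_comments(sample):
    print(rec["header"], rec["comments"], rec["sequence"])
```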
Sequence display
The header and any comments are followed by one or more lines containing the sequence. No line should contain more than 80 characters. Sequences may be protein or nucleic acid sequences and may contain gaps and alignment characters. They should be written using the IUB/IUPAC standard codes for amino acids and nucleic acids. Permitted exceptions are:
- Lowercase letters are allowed but are converted to uppercase.
- A hyphen or dash represents a gap.
- In amino acid sequences, "U" and "*" are valid characters (see below).
- Nucleotide sequences are written in the 5' to 3' direction.
Numeric characters are not allowed in the sequence, although some databases use them to indicate positions within the sequence.
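These rules suggest a small normalization step before sequence data is processed further. The sketch below assumes that position numbers and whitespace should simply be dropped and that letters are converted to uppercase; `normalize_sequence` is a hypothetical name.

```python
def normalize_sequence(raw: str) -> str:
    """Remove digits and whitespace from sequence text and convert it to uppercase.

    Hypothetical helper: gap characters ("-") and "*" are left untouched.
    """
    kept = (ch for ch in raw if not ch.isdigit() and not ch.isspace())
    return "".join(kept).upper()


# Made-up input with position numbers, as printed by some databases.
print(normalize_sequence("  1 acgt-acgt 10\n"))
```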
Nucleic acid codes:
Code | Meaning |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
U | Uracil |
R | G or A (puRine) |
Y | T or C (pYrimidine) |
K | G or T (Keto) |
M | A or C (aMino) |
S | G or C (Strong interaction) |
W | A or T (Weak interaction) |
B | G, T or C (not A; B comes after A) |
D | G, A or T (not C; D comes after C) |
H | A, C or T (not G; H comes after G) |
V | G, C or A (not T, not U; V comes after U) |
N | A, G, C or T (aNy) |
- | Gap of indefinite length |
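The ambiguity codes can be expressed as a lookup table from each code to the bases it stands for. The following sketch uses such a table to list every concrete sequence that an ambiguous sequence may represent; the names `IUPAC_NUCLEOTIDES` and `expand_ambiguity` are illustrative, and keeping gaps and "U" unchanged is an assumption of this example.

```python
from itertools import product

# Each code maps to the bases it can represent (alphabetical order).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_ambiguity(sequence: str):
    """Return all concrete sequences matching a sequence with ambiguity codes.

    Hypothetical helper: gaps ("-") are kept as they are, lowercase input is
    converted to uppercase, and unknown characters raise a ValueError.
    """
    choices = []
    for ch in sequence.upper():
        if ch == "-":
            choices.append("-")
        elif ch in IUPAC_NUCLEOTIDES:
            choices.append(IUPAC_NUCLEOTIDES[ch])
        else:
            raise ValueError(f"not a nucleotide code: {ch!r}")
    # The Cartesian product enumerates one base per position.
    return ["".join(combo) for combo in product(*choices)]


print(expand_ambiguity("ARY"))  # ['AAC', 'AAT', 'AGC', 'AGT']
```

The number of results grows exponentially with the number of ambiguous positions, so this approach is only practical for short sequences such as primers or motifs.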
Amino acid codes:
Code | Meaning |
---|---|
A | Alanine |
B | Aspartic acid or asparagine |
C | Cysteine |
D | Aspartic acid |
E | Glutamic acid |
F | Phenylalanine |
G | Glycine |
H | Histidine |
I | Isoleucine |
K | Lysine |
L | Leucine |
M | Methionine |
N | Asparagine |
P | Proline |
Q | Glutamine |
R | Arginine |
S | Serine |
T | Threonine |
U | Selenocysteine |
V | Valine |
W | Tryptophan |
Y | Tyrosine |
Z | Glutamic acid or glutamine |
X | Any amino acid |
* | Translation stop |
- | Gap of indefinite length |
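A simple plausibility check for protein sequences can be built from the codes in the table above. The sketch below accepts only those codes; other sources may allow additional letters, so the alphabet is an assumption of this example, as are the names `AMINO_ACID_CODES` and `invalid_residues`.

```python
# Alphabet taken from the amino acid table above, including "*" and "-".
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def invalid_residues(sequence: str):
    """Return the sorted list of characters that are not in the table above.

    Hypothetical helper: lowercase letters are converted to uppercase first.
    """
    return sorted({ch for ch in sequence.upper() if ch not in AMINO_ACID_CODES})


print(invalid_residues("mkv*"))   # []
print(invalid_residues("MJOK1"))  # ['1', 'J', 'O']
```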
File extension
There is no standard file extension for a text file in FASTA format. However, the following extensions are often used: .fa, .mpfa, .fna, .fsa or .fasta.
Sequence IDs
The National Center for Biotechnology Information (NCBI) has defined a standard for the IDs used for sequences. This "SeqID" is used in the header. The help page of formatdb states: "formatdb will automatically parse the SeqID and create indexes, but the database identifiers in the FASTA definition line must follow the conventions of the FASTA Defline Format."
However, this does not conclusively define the header format. The possible variants are listed below:
- GenBank: gi|gi-number|gb|accession|locus
- EMBL Data Library: gi|gi-number|emb|accession|locus
- DDBJ, DNA Database of Japan: gi|gi-number|dbj|accession|locus
- NBRF PIR: pir||entry
- Protein Research Foundation: prf||name
- SWISS-PROT: sp|accession|name
- TrEMBL: tr|accession|name
- Brookhaven Protein Data Bank (1): pdb|entry|chain
- Brookhaven Protein Data Bank (2): entry:chain|PDBID|CHAIN|SEQUENCE
- Patents: pat|country|number
- GenInfo Backbone Id: bbs|number
- General database identifier: gnl|database|identifier
- NCBI Reference Sequence: ref|accession|locus
- Local sequence identifier: lcl|identifier
The vertical bars are literal characters of the format, not alternation separators in the sense of Backus-Naur form.
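Because the vertical bars are literal field separators, a SeqID can be taken apart with a plain string split. The following sketch covers only some of the variants listed above and assumes that an optional leading `gi|gi-number` pair may precede the database tag; the names `SEQID_FIELDS` and `parse_seqid` are hypothetical, and `ACC123`, `NAME_X` and `my_seq` are placeholder values.

```python
# Field names per database tag, following the variants listed above.
SEQID_FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "ref": ("accession", "locus"),
    "sp": ("accession", "name"),
    "tr": ("accession", "name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "gnl": ("database", "identifier"),
    "bbs": ("number",),
    "lcl": ("identifier",),
}


def parse_seqid(seqid: str) -> dict:
    """Split a SeqID at "|" and name its fields; empty fields are omitted.

    Hypothetical helper: unknown tags yield a dict without named fields.
    """
    tokens = seqid.split("|")
    result = {}
    pos = 0
    # An optional "gi|gi-number" pair may come first.
    if tokens[0] == "gi" and len(tokens) > 1:
        result["gi"] = tokens[1]
        pos = 2
    if pos < len(tokens) and tokens[pos] in SEQID_FIELDS:
        tag = tokens[pos]
        result["db"] = tag
        for name, value in zip(SEQID_FIELDS[tag], tokens[pos + 1:]):
            if value:
                result[name] = value
    return result


print(parse_seqid("gi|5524211|gb|AAD44166.1|"))
print(parse_seqid("sp|ACC123|NAME_X"))  # placeholder values
```

For the identifier of the cytochrome b example, the locus field is empty and is therefore left out of the result.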
Web links
- Sequence formats
- Description of the FASTA format by the NCBI
- LFasta
- Nexus to Fasta converter
- GenBank to Fasta converter
References
- FASTA representation of the cytochrome b of an Asian elephant on ncbi.nlm.nih.gov, accessed on August 21, 2018