T-Coffee (Tree-based Consistency Objective Function for Alignment Evaluation) is multiple sequence alignment software that uses a progressive approach.[1] It generates a library of pairwise alignments to guide the multiple sequence alignment. It can also combine multiple sequence alignments obtained previously, and in the latest versions it can use structural information from Protein Data Bank (PDB) files (3D-Coffee). It has advanced features for evaluating the quality of alignments and some capacity for identifying occurrences of motifs (Mocca). It produces alignments in the aln (Clustal) format by default, but can also produce PIR, MSF, and FASTA output. The most common input formats are supported (FASTA, Protein Information Resource (PIR)).
Algorithm
The T-Coffee algorithm has two main features. The first is its use of heterogeneous data sources, which provides a simple and flexible means of generating multiple alignments: T-Coffee can compute a multiple alignment from a library generated using a mixture of local and global pair-wise alignments.[1]
The second is the optimization method, which finds the multiple alignment that best fits the pair-wise alignments in the input library. It uses a progressive strategy comparable to the one used in ClustalW and has the advantage of being fast and robust. The information in the library guides the progressive alignment, so that the alignments between all pairs of sequences are taken into account at every step.[1]
Generating a primary library of alignments
The library contains a set of pair-wise alignments between all of the sequences to be aligned; these alignments are not required to be consistent with one another. The library holds information on each of the N(N-1)/2 sequence pairs, where N is the number of sequences. Two alignment sources are used for each pair of sequences, one local and one global.[1]
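The pair count can be checked with a short Python sketch. The sequence names and the helper sequence_pairs are hypothetical and serve only as an illustration; they are not part of T-Coffee.

```python
from itertools import combinations


def sequence_pairs(names):
    """Return every unordered pair of sequence names (illustrative helper)."""
    return list(combinations(names, 2))


# Hypothetical sequence names, used only for this example.
names = ["seqA", "seqB", "seqC", "seqD"]
pairs = sequence_pairs(names)

# The number of pairs matches N(N-1)/2.
n = len(names)
assert len(pairs) == n * (n - 1) // 2

for first, second in pairs:
    print(first, second)
```

With four sequences the library would therefore need entries for six pairs, and the count grows quadratically with N.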
Global alignments are constructed by running ClustalW on the sequences two at a time, which gives one full-length alignment between each pair of sequences. The local alignments are the ten top-scoring non-intersecting local alignments gathered using the Lalign program of the FASTA package.[1]
Each alignment is represented in the library as a list of pair-wise residue matches, and each matched pair acts as a constraint. Some constraints are more important than others, because some are more likely to be correct. While computing the multiple alignment, a weighting scheme gives priority to the most reliable residue pairs.[1]
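As a minimal sketch of this representation, the Python function below turns one gapped pair-wise alignment into weighted residue-pair constraints. The weighting shown (percent identity over the aligned residue pairs), the 0-based residue indices, and the name alignment_to_constraints are assumptions made for illustration; the text above does not specify the weighting scheme.

```python
def alignment_to_constraints(name_a, row_a, name_b, row_b, gap="-"):
    """Convert one gapped pair-wise alignment into residue-pair constraints.

    Returns a dict mapping ((name_a, i), (name_b, j)) to a weight, where i and
    j are 0-based residue indices in the ungapped sequences.
    Assumption: every pair from the same alignment gets the same weight,
    namely the percent identity over aligned residue pairs.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")

    matches = []
    identical = 0
    i = j = 0  # positions in the ungapped sequences
    for char_a, char_b in zip(row_a, row_b):
        has_a = char_a != gap
        has_b = char_b != gap
        if has_a and has_b:
            matches.append((i, j))
            if char_a == char_b:
                identical += 1
        if has_a:
            i += 1
        if has_b:
            j += 1

    weight = round(100 * identical / len(matches)) if matches else 0
    return {((name_a, i), (name_b, j)): weight for i, j in matches}


# Hypothetical toy alignment: five aligned residue pairs, four identical.
constraints = alignment_to_constraints("A", "GAT-ACA", "B", "GCTTA-A")
for pair, weight in sorted(constraints.items()):
    print(pair, weight)
```

Columns that contain a gap produce no constraint, so only residues that are actually aligned to each other enter the library.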
Combination of the libraries
Efficient combination of local and global alignment information is an important feature of T-Coffee. The ClustalW and Lalign primary libraries are combined by a process of addition. If a residue pair appears in both libraries, it is merged into a single entry whose weight is the sum of the two weights; otherwise, a new entry is created for the pair. Pairs with a weight of zero are not represented.[1] Each pair of aligned residues in the library can then be assigned a weight that reflects the degree to which those residues align consistently with the rest of the library. This is called library extension.
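The addition step, followed by one possible form of extension, can be sketched in Python as follows. The merge follows the description above. The extension rule (adding, for each intermediate residue, the minimum of the two weights that link a pair through it) is an assumption used for illustration, as are all function names and the toy weights.

```python
from collections import defaultdict
from itertools import combinations


def pair_key(res_1, res_2):
    """Canonical key for an unordered residue pair; a residue is (sequence, index)."""
    return tuple(sorted((res_1, res_2)))


def merge_libraries(*libraries):
    """Sum the weights of duplicated pairs and drop pairs whose weight is zero."""
    merged = defaultdict(int)
    for library in libraries:
        for pair, weight in library.items():
            merged[pair] += weight
    return {pair: weight for pair, weight in merged.items() if weight > 0}


def extend_library(library):
    """Assumed extension rule: a pair gains min(w1, w2) for every intermediate
    residue that is linked to both of its members in the primary library."""
    partners_of = defaultdict(dict)
    for (res_1, res_2), weight in library.items():
        partners_of[res_1][res_2] = weight
        partners_of[res_2][res_1] = weight

    extended = dict(library)
    for partners in partners_of.values():
        for res_1, res_2 in combinations(sorted(partners), 2):
            if res_1[0] == res_2[0]:
                continue  # both residues lie on the same sequence
            key = pair_key(res_1, res_2)
            support = min(partners[res_1], partners[res_2])
            extended[key] = extended.get(key, 0) + support
    return extended


# Hypothetical toy libraries for three sequences A, B, and C.
a0, a1, b0, b1, c0 = ("A", 0), ("A", 1), ("B", 0), ("B", 1), ("C", 0)
global_library = {pair_key(a0, b0): 88, pair_key(a0, c0): 77, pair_key(b0, c0): 100}
local_library = {pair_key(a0, b0): 50, pair_key(a1, b1): 0}

merged = merge_libraries(global_library, local_library)
extended = extend_library(merged)
print(merged)
print(extended)
```

In this toy case the pair A0/B0 appears in both libraries, so its merged weight is 88 + 50 = 138, and the zero-weight pair A1/B1 is dropped. Under the assumed extension rule, A0/B0 then gains a further min(77, 100) = 77 from its shared partner C0.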
Comparisons with other alignment software
While the default output is a Clustal-like format, it is sufficiently different from the output of ClustalW/X that many programs supporting Clustal format cannot read it; fortunately ClustalX can import T-Coffee output so the simplest fix for this issue is usually to import T-Coffee's output into ClustalX and then re-export. Another possibility is to request the strict ClustalW output format with the option "-output=clustalw_aln".
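As an illustration of the second approach, the Python sketch below builds a command line that requests strict ClustalW output. Only the "-output=clustalw_aln" option comes from the text above; the executable name t_coffee, the input file passed as the first argument, the file name sequences.fasta, and both helper functions are assumptions about a typical installation.

```python
import shutil
import subprocess


def build_tcoffee_command(fasta_path, output_format="clustalw_aln"):
    """Build an argument list requesting the given output format.

    Assumption: the executable is called t_coffee and takes the input
    sequence file as its first argument.
    """
    return ["t_coffee", fasta_path, f"-output={output_format}"]


def run_tcoffee(fasta_path):
    """Run the command if the executable is available on PATH."""
    if shutil.which("t_coffee") is None:
        raise FileNotFoundError("t_coffee executable not found on PATH")
    return subprocess.run(build_tcoffee_command(fasta_path), check=True)


# Print the command instead of running it, so the example works without
# T-Coffee installed. "sequences.fasta" is a hypothetical input file.
print(" ".join(build_tcoffee_command("sequences.fasta")))
```

Passing the arguments as a list avoids shell quoting problems when the input path contains spaces.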
A distinctive feature of T-Coffee is its ability to combine different methods and different data types. In its latest version, T-Coffee can be used to combine protein sequences and structures as well as RNA sequences and structures. It can also run and combine the output of the most common sequence and structure alignment packages.
T-Coffee comes with a sophisticated sequence reformatting utility named seq_reformat. Extensive documentation is available online.
Variations
M-Coffee: a special mode of T-Coffee that makes it possible to combine the output of the most common multiple sequence alignment packages (Muscle, ClustalW, Mafft, ProbCons, etc.). The resulting alignments are slightly better than the individual ones, but most importantly the program indicates the alignment regions on which the various packages agree. Regions of high agreement are usually well aligned.[2]
Expresso and 3D-Coffee: special modes of T-Coffee that make it possible to combine sequences and structures in an alignment. The structure-based alignments can be carried out using the most common structural aligners, such as TMalign, Mustang, and sap.[3][4][5][6]
R-Coffee: a special mode of T-Coffee making it possible to align RNA sequences while using secondary structure information.[7][8]
PSI-Coffee: aligns distantly related proteins using homology extension (slow and accurate).[9][10]
TM-Coffee: aligns transmembrane proteins using homology extension.[11]
Accurate: automatically combines the most accurate modes for DNA, RNA and proteins (experimental).[13]
Combine: combines two (or more) multiple sequence alignments into one.[1][9]
Evaluation
Transitive Consistency Score (TCS) is an extended version of the T-Coffee scoring scheme.[14] It uses T-Coffee libraries of pairwise alignments to evaluate any third-party MSA. Pairwise projections can be produced using fast or slow methods, allowing a trade-off between speed and accuracy. TCS has been shown to give significantly better estimates of structural accuracy and more accurate phylogenetic trees than Heads-or-Tails, GUIDANCE, Gblocks, and trimAl.[15]