Teiresias algorithm

The Teiresias algorithm is a combinatorial algorithm for the discovery of rigid patterns (motifs) in biological sequences. It is named after the Greek prophet Teiresias and was created in 1997 by Isidore Rigoutsos and Aris Floratos.[1]

The problem of finding sequence similarities in the primary structure of related proteins or genes arises in the analysis of biological sequences. It can be shown that pattern discovery in its general form is NP-hard.[2] The Teiresias algorithm is based on the observation that if a pattern spans many positions and appears exactly k times in the input then all fragments (sub patterns) of the pattern have to appear at least k times in the input. The algorithm is able to produce all patterns that have a user-defined number of copies in the given input, and manages to be very efficient by avoiding the enumeration of the entire space. Finally, the algorithm reports motifs that are maximal in both length and composition.

A new implementation of the Teiresias algorithm was recently made available by the Computational Medicine Center at Thomas Jefferson University. Teiresias is also accessible through an interactive web-based user interface by the same center. See external links for both.

Pattern description

The Teiresias algorithm uses regular expressions to define the patterns. This allows the patterns reported to consist not only from the characters that appear in each position (literals) but from a specific group of characters (bracketed literals) or even from any character (wild card). The patterns created by the algorithm are <L,W> patterns that have at least k instances in the input, where L ≤ W and L, W, k positive integers. A pattern is called an <L,W> pattern if and only if any L consecutive literals or bracketed literals span at most W positions (i.e. there can be no more than W-L wild cards).

The algorithm reports only maximal patterns. Given a set of sequences S, a pattern P that appears k times in S is called maximal if and only if there exists no pattern P' which is more specific than P and also appears exactly k times in S. If there exists such a pattern P' then we say that P cannot be maximal and P is considered to be subsumed by P'. A pattern P' is said to be more specific than a pattern P if and only if P' can be obtained from P by (a) dereferencing a wild card or (b) instantiating a bracketed literal to a literal, or (c) appending a string of literals, bracketed literals or/and wild cards to the right of P, or (d) prepending a string of literals, bracketed literals or/and wild cards to the left of P.[3]

Algorithm description

Teiresias consists of two phases, Scanning and Convolution. During the first phase the input is scanned for the patterns that satisfy the minimum requirements, the elementary patterns. The elementary patterns consist of exactly L literals and/or bracketed literals and includes at most W-L wild cards. During convolution, the elementary patterns are recursively combined and maximal patterns are created. The order in which the convolutions are performed is very important since it guarantees that all patterns will be generated and all maximal patterns are generated before all the patterns that are subsumed by them. The order is dictated by the following rules

  • The priority of each pattern is defined by its contents from left to right.
  • A literal has higher priority than a bracketed literal and both have higher priority than wild cards (the more specific first).
  • Longer patterns have higher priority than shorter ones.
  • Ties are resolved alphabetically.

Given the assurance that all maximal patterns will be created first, the maximality of a newly created pattern can be easily checked. If the new pattern is subsumed by any maximal patterns found so far, it is discarded. Otherwise, a new maximal pattern is found. To avoid checking against all maximal patterns found so far, maximal patterns are placed in a hash map. The hashing scheme is designed so that a maximal pattern P and all non maximal patterns subsumed by P hash to the same entry, but two different maximal patterns are unlikely to hash to the same entry. The algorithm terminates when no more patterns can be combined to form new maximal patterns. The length of any maximal pattern is bounded from above by the length of the longest input sequence.

Time complexity

The algorithm is "output-sensitive." The time complexity of the TEIRESIAS algorithm is[3]

where L and W are user-specified parameters that define the "minimum density" of a pattern (any L literals or brackets cannot span more than W positions), m is the number of characters the input includes, C is the average number of patterns found in a hash entry, tH is the time needed for locating the hash entry corresponding to any given hash value, rc(P) is number of literals in P, and it can be shown at most rc(P) patterns can be placed on the stack while building P. And the summation Σ is the maximum number of patterns that will ever be placed in the stack that keeps the patterns for extension during convolution.

References

  1. ^ Rigoutsos, I, Floratos, A (1998) Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14: 55-67
  2. ^ Maier, D., "The Complexity of Some Problems on Subsequences and Supersequences", Journal of the ACM, 322-336, 1978
  3. ^ a b Floratos A., and Rigoutsos, I., "On the time complexity of the Teiresias algorithm", IBM technical report RC 21161 (94582), IBM TJ Watson Research Center, 1998