Notes
Slide Show
Outline
1
String Algorithms
  • Sequence and string are synonymous
    • Finite number of letters1 from an alphabet, written contiguously from left to right   Si..j
      • A C T G G T C A
  • Subsequence
    • an ordered subset obtained by removing letters from a sequence
      • A C T G G T C A
      •     A T G G T A       (Both C’s are removed)
  • Substring
    • remaining letters are consecutive


2
 
3
String Algorithms
  • Algorithms for String Matching
    • No preprocessing
      • Naïve
      • Boyer-Moore
    • Preprocessing of target string
      • Aho-Corasick
    • Preprocessing of fixed DB
      • Rabin-Karp
      • Suffix Tree
  • Algorithms for String Alignment
    • No preprocessing
      • Needleman Wunsch
      • Smith Waterman
    • Preprocessing of ‘fixed’ Database
      • Complicated heuristics such as BLAST, FASTA
4
STRING MATCHING
  • Looking for an EXACT substring match
5
Naïve algorithm
Operates in quadratic time
6
Boyer-Moore
  • ‘Smarter’.  Operates in linear time
  • Doesn’t ‘walk’ the string; it ‘skips along’ the string
  • Choose the better of:
    • Jump the distance derived from the bad character heuristic
        • Or
    • Jump the distance derived from the good suffix heuristic
7
 
8
AHO-CORASICK
  • 2 Phases
  • Preprocess: Build a finite automaton based on the query strings (quadratic time)
  • Use the target string as input to the automaton (linear time)
9
AHO-CORASICK
10
Rabin Karp
  • Preprocesses database using hashing with the hash function mod n (linear time)


  • Hashing issues
  • Complexity of right-shifting the window
    • Pretty much linear
  • Hash Collisions
    • Must check (or decide not to)
  • Choice of n
11
Hash Collisions
12
For example,
13
Rabin-Karp
14
 
15
Trade-offs…..use several p’s
16
Suffix Trees
  • Preprocessing of the target (Tree building: linear)
  • Linear Search time
  • Can generalize to multiple targets
    • Allows other functions besides exact matching
    • Longest Common Substring
    • Wildcards
17
Suffix Trees : An example
18
Suffixes of CalloohCallay
19
Suffixes of CalloohCally: ordered
20
Constructing the Tree
21
Suffix Tree for ‘CalloohCallay”
22
Alignment
  • Not an exact match
  • Can be based on edit distance
  • Usually based on a similarity measure
23
Why Align Strings?
  • Find small differences between strings
    • Differences ~every 100 characters in DNA
  • See if the suffix of one sequence is a prefix of another
    • Useful in shotgun sequencing
  • Find common subsequences (cf definition)
    • Homology or identity searching
  • Find similarities of members of the same family
    • Structure prediction
24
Edit Distance
25
Similarity
  • We are more inclined to use the concept  of similarity, an alignment scoring function instead. We can then
    • deal with gaps
    • weight specific substitutions.

  • Note that similarity is NOT A METRIC.


26
Example  of a Scoring Function
for Similarity
27
Similarity Scoring of an Alignment
 Example of Two of 6 Possible Alignments
28
String (Sequence) Alignment
  • Global Alignment
    • Every character in the query (source) string lines up with a character in the target string
    • May require gap (space) insertion to make strings the same length
  • Local Alignment
    • An “internal” alignment or embedding of a substring (sic) into a target string



29
Global vs Local
30
Optimal Global Alignments
31
Dynamic Programming
to Find Optimal  Sequence Alignment
  • In sequence alignment, can piece together optimal prefix alignments to get a global solution based on optimizing a scoring function (maximizing in this case).
  • Can be applied to a wide variety of alignment problems (Max probability through a Markov Chain®Viterbi Algorithm).
32
The Basic Optimal Alignment Problem has a Complete Algorithmic Solution
Using Dynamic Programming
  • Define a scoring function
  • Find optimal alignment for prefixes of the query and target strings
    • May need to insert gaps to accomplish this
  • Extend the process to larger chunks of the problem
    • Dynamic Programming


33
Dynamic Programming
An Example
  • Consider a network of cities connected by some roads.


  • What is the shortest distance from City A to City B ? (optimal solution)


  • The answer will have the minimum cost function (in this case, distance) of all possible routes
34
Dynamic Programming
35
Dynamic Programming
36
Same Question
37
Same Question
  • We can keep asking the same question, and applying the identical answer, till we run out of remaining cities.


  • We have glued together optimal solutions to sub-problems by recursively applying the answer to the recurring question.
38
"There is a recurrence relationship..."
  • There is a recurrence relationship between a part and all smaller parts. Here is an algorithm:


  • Set starting city as Left endpoint and final city as right endpoint
  • GetDist


  • Function GetDist
    • Exit when down to the smallest


    • Compute all distances between endpoints  passing through  all trial cities and find mindist city


    • Keep the left endpoint and set the mindist city as a right endpoint


    • Set the mindist city as the left  endpoint and keep the right endpoint


    • GetDist

39
Sequence Alignment and Recursive Algorithms
  • The sequence alignment problem can be also be solved by a recursive algorithm


  • See Setabul and Meidanis, Fig 3.3 for the pseudocode.
40
Recursive Algorithms
  • Computations grow exponentially with the number of recursive calls
  • But the number of distinct recursive calls grows as a polynomial
  • Recursion is thus  a “top-down” inefficient solution to the alignment problem


41
Dynamic Programming
One Answer….
  • Build a “Bottom-up” solution for only the elements corresponding to distinct recursive calls
  • More conceptually complex,  BUT………
    • The global solution runs in time O(n2) (polynomial time)
    • Works when a problem possesses the principle of optimality.
    • If so, will always give the optimal solution (not a heuristic)



42
Bottom-Up Computation
Matrix Form
  • Start with smallest possible indices (i,j) of the two strings  ¾ (1,1) here





  • Compute best solution from 3 choices
43
Bottom-Up Computation
Matrix Form
  • Increase the size of the problem by incrementing an index and select best of all possible  smaller solutions


44
Dynamic Programming
  • Bottom-up computation
  • Traceback
    • For each increasing size, keep track of which of the (3) possible subsolutions was the optimal one
45
A Popular Scoring Rule
for Alignment on a Matrix
46
Dynamic Programming
  • We are going to look at all possible prefixes in the query (m) against all possible prefixes in the target (n).
  • At each step, we will pick out whether we get the best score with an indel, a match, or a replacement, based not on our local score alone, but also in comparison with the cumulative score presented in adjacent cells
  •  The difficulty of the problem is O(mn)


47
Global Alignment
Dynamic Programming Algorithm
Needleman-Wunsch
48
Global Alignment
Dynamic Programming Algorithm
Needleman-Wunsch
49
Global Alignment
Needleman-Wunsch
First prefix is examined
50
Global Alignment
Needleman-Wunsch
Best score is selected
51

The path(s) by which the optimum prefix was generated is kept, along with the score itself
52
Global Alignment
The next optimum prefix is determined
by finding the best score
53
Global Alignment
The path and score to that prefix are remembered, as well
54
Global Alignment
All optimum paths are determined until the m,n th position is reached
55
Global Alignment
From the m,n th position, the highest scoring continuous path(s) back are determined.  This (these) are the optimal alignments.
This score is -1
56
Global Alignment
This score is –1, also
57
Global Alignment
This score is –1, as well
58
Global Alignment
Constructing the alignment
  • Read off the alignment from end to start, beginning with the m,nth cell
    • If leaving a cell diagonally
      • Read off the query and target suffix letters
    • If leaving a cell horizontally
      • Read off the letter in the query
      • A gap in the target
    • If leaving a cell vertically
      • Read off the letter in the target
      • A gap in the query
59
Global Alignment
60
Needleman-Wunsch Algorithm
  • Runs in polynomial (mn) time
  • Unfortunately runs in mn space as well
  • Can be made to run in linear space
    • Finding max score is easy
    • Finding path is not so easy
61
LOCAL ALIGNMENT
62
"A local alignment is an..."
  • A local alignment is an alignment of the query string with a substring1 of the target string.  It is an optimal suffix alignment.
63
Smith-Waterman Algorithm
  • For a local alignment
  • Finds the highest scoring substrings (suffixes) of the query and target strings
  • Waterman: “Fitting one sequence into another”
64
Local Alignment
 Smith-Waterman Algorithm
  • Could get a complete solution in O((mn)2)•O(mn)=O((mn)3)
  • S-W runs in O(mn)
  • Aligns suffixes instead of prefixes
65
Local Alignment
 Smith-Waterman Algorithm
  • There are no initial gaps in the best local alignment, so first row and column have 0’s propagated from the origin
  • Our general algorithm is modified for a 4th case i.e. 0


66
 
67
Local Alignment
 Smith-Waterman Algorithm
  • We need to keep track of whether a zero arises from a calculation or from the choice of the fourth case.


68
Local Alignment
 Smith-Waterman Algorithm
There are no gaps; 1st row and col are 0’s
69
Local Alignment
 Smith-Waterman Algorithm
Recipe
  • Pad first row and col with 0’s
  • Apply dynamic programming (modified) algorithm
  • Distinguish between selected 0 and computed 0
    • Do traceback arrows with all computed values (incl 0)
    • No traceback arrows from selected 0
  • After matrix is completed, find maximum value
  • Trace the path back until there are no arrows out
70
Local Alignment
 Smith-Waterman Algorithm
Nuances
  • The max value is the alignment score
  • There are sub and superstrings of different scores
  • There may be more than one string with the same max value
  • There may be other unrelated strings with lower scores which nevertheless might be important to the problem under study
71
 
72
Local Alignment
 Smith-Waterman Algorithm
Results
73
Multiple Sequence Alignment
  • (MSA)
74
MSA
  • Based on similarity of the ensemble of sequences, not specific pairs
  • Shows patterns
  • Discloses families
  • Tracks changes (phylogeny)
  • Relates mutant genes to wild types or their homologs (Cystic Fibrosis story)
75
MSA
76
MSA
Sum-of-pairs (SP)
  • Go ONLY BY COLUMNS
  • Take the sum of the scores of all pairs in each column
    • for 4 rows, there would be
      • Score(1st,2nd)+Score(1st,3rd)+Score(1st,4th)
      • +Score(2nd,3rd)+Score(2nd,4th)+Score(3rd,4th)
  • In your scoring function, gap-gap is given 0


77
MSA
Sum of Pairs (SP)
78
MSA
  • This scoring part of the algorithm takes




  •   for a problem with k rows where n is the size of the longest string with no gaps.  The complexity is O(n2) i.e.,running in polynomial time
79
MSA
  • Now, how will we solve the problem in k dimensions and how hard will it be?



  • We can certainly apply the dynamic programming algorithm to the k dimensional case.
80
MSA
Dynamic programming
  • The problem now generalizes to a hypermatrix of size n+1 and of dimension k.
  • Storage space, then, goes up exponentially with k (without the use of a space-saving heuristic such  as the one available for the 2 dimensional case)


81
MSA
  • We need an heuristic!
    • Expectation Maximization
    • Gibbs Sampler
    • Simulated Annealing
    • Star Alignment
82
Star Alignment Heuristic
  • ‘Star’ Alignment  (sometimes called merge-alignment) is commonly used.
  • Like all heuristics, it is fast but does not guarantee the optimal answer
  • It is called ‘Star’ because one of the k sequences is selected to be ‘in the center’ to be a basis of comparison to the other k-1 sequences. This might be diagrammed as sequences at the ends of spokes radiating from the center sequence.
83
Star Alignment
  • Find the standard alignment scores for each of the  (k(k-1)/2) pairs of sequences and enter them into a k´k matrix.
  • Consider this set of sequences k=4, of maxlength n=5
84
Star Alignment
85
Star Alignment
86
Star Alignment
87
Star Alignment
  • Find the actual best alignments of each sequence with the center-of-the-star. In this case, s2 is selected.
  •  Using the center-of-the-star sequence, s2 , as the query, align it (global) with each of the remaining sequences, yielding n-1 best alignments.


88
Star  Alignment
Get the optimal pairwise alignments using s2 as the query
89
Final Product of the Star  Alignment
90
Star Alignment
Complexity
  • To build the table and get center of star:
      • O(k2n2)
  • To do the MSA: O(kn2+k2l) (l is length after gaps)
    •  O(kn2)for pairwise alignments
    •  O(k2l) for assembling sequences

  • Can be optimized to O(kn+kl)
91
Star Alignment Pitfalls
  • If the sequences selected at the start are widely divergent, the heuristic can begin a chase down a trail that is leading the wrong way.


  • For this reason, other MSA strategies are sometimes considered.