|
1
|
- Sequence and string are synonymous
- Finite number of letters1 from an alphabet, written
contiguously from left to right
Si..j
- Subsequence
- an ordered subset obtained by removing letters from a sequence
- A C T G G T C A
- A T G G T A (Both C’s are removed)
- Substring
- remaining letters are consecutive
|
|
2
|
|
|
3
|
- Algorithms for String Matching
- No preprocessing
- Preprocessing of target string
- Preprocessing of fixed DB
- Algorithms for String Alignment
- No preprocessing
- Needleman Wunsch
- Smith Waterman
- Preprocessing of ‘fixed’ Database
- Complicated heuristics such as BLAST, FASTA
|
|
4
|
- Looking for an EXACT substring match
|
|
5
|
|
|
6
|
- ‘Smarter’. Operates in linear
time
- Doesn’t ‘walk’ the string; it ‘skips along’ the string
- Choose the better of:
- Jump the distance derived from the bad character heuristic
- Jump the distance derived from the good suffix heuristic
|
|
7
|
|
|
8
|
- 2 Phases
- Preprocess: Build a finite automaton based on the query strings
(quadratic time)
- Use the target string as input to the automaton (linear time)
|
|
9
|
|
|
10
|
- Preprocesses database using hashing with the hash function mod n (linear
time)
- Hashing issues
- Complexity of right-shifting the window
- Hash Collisions
- Must check (or decide not to)
- Choice of n
|
|
11
|
|
|
12
|
|
|
13
|
|
|
14
|
|
|
15
|
|
|
16
|
- Preprocessing of the target (Tree building: linear)
- Linear Search time
- Can generalize to multiple targets
- Allows other functions besides exact matching
- Longest Common Substring
- Wildcards
|
|
17
|
|
|
18
|
|
|
19
|
|
|
20
|
|
|
21
|
|
|
22
|
- Not an exact match
- Can be based on edit distance
- Usually based on a similarity measure
|
|
23
|
- Find small differences between strings
- Differences ~every 100 characters in DNA
- See if the suffix of one sequence is a prefix of another
- Useful in shotgun sequencing
- Find common subsequences (cf definition)
- Homology or identity searching
- Find similarities of members of the same family
|
|
24
|
|
|
25
|
- We are more inclined to use the concept
of similarity, an alignment scoring function instead. We can then
- deal with gaps
- weight specific substitutions.
- Note that similarity is NOT A METRIC.
|
|
26
|
|
|
27
|
|
|
28
|
- Global Alignment
- Every character in the query (source) string lines up with a character
in the target string
- May require gap (space) insertion to make strings the same length
- Local Alignment
- An “internal” alignment or embedding of a substring (sic) into a target
string
|
|
29
|
|
|
30
|
|
|
31
|
- In sequence alignment, can piece together optimal prefix alignments to
get a global solution based on optimizing a scoring function (maximizing
in this case).
- Can be applied to a wide variety of alignment problems (Max probability
through a Markov Chain®Viterbi Algorithm).
|
|
32
|
- Define a scoring function
- Find optimal alignment for prefixes of the query and target strings
- May need to insert gaps to accomplish this
- Extend the process to larger chunks of the problem
|
|
33
|
- Consider a network of cities connected by some roads.
- What is the shortest distance from City A to City B ? (optimal solution)
- The answer will have the minimum cost function (in this case, distance)
of all possible routes
|
|
34
|
|
|
35
|
|
|
36
|
|
|
37
|
- We can keep asking the same question, and applying the identical answer,
till we run out of remaining cities.
- We have glued together optimal solutions to sub-problems by recursively
applying the answer to the recurring question.
|
|
38
|
- There is a recurrence relationship between a part and all smaller parts.
Here is an algorithm:
- Set starting city as Left endpoint and final city as right endpoint
- GetDist
- Function GetDist
- Exit when down to the smallest
- Compute all distances between endpoints passing through all trial cities and find mindist
city
- Keep the left endpoint and set the mindist city as a right endpoint
- Set the mindist city as the left
endpoint and keep the right endpoint
- GetDist
|
|
39
|
- The sequence alignment problem can be also be solved by a recursive
algorithm
- See Setabul and Meidanis, Fig 3.3 for the pseudocode.
|
|
40
|
- Computations grow exponentially with the number of recursive calls
- But the number of distinct recursive calls grows as a polynomial
- Recursion is thus a “top-down”
inefficient solution to the alignment problem
|
|
41
|
- Build a “Bottom-up” solution for only the elements corresponding to distinct
recursive calls
- More conceptually complex, BUT………
- The global solution runs in time O(n2) (polynomial time)
- Works when a problem possesses the principle of optimality.
- If so, will always give the optimal solution (not a heuristic)
|
|
42
|
- Start with smallest possible indices (i,j) of the two strings ¾ (1,1) here
- Compute best solution from 3 choices
|
|
43
|
- Increase the size of the problem by incrementing an index and select
best of all possible smaller
solutions
|
|
44
|
- Bottom-up computation
- Traceback
- For each increasing size, keep track of which of the (3) possible
subsolutions was the optimal one
|
|
45
|
|
|
46
|
- We are going to look at all possible prefixes in the query (m) against
all possible prefixes in the target (n).
- At each step, we will pick out whether we get the best score with an
indel, a match, or a replacement, based not on our local score alone,
but also in comparison with the cumulative score presented in adjacent
cells
- The difficulty of the problem is
O(mn)
|
|
47
|
|
|
48
|
|
|
49
|
|
|
50
|
|
|
51
|
|
|
52
|
|
|
53
|
|
|
54
|
|
|
55
|
|
|
56
|
|
|
57
|
|
|
58
|
- Read off the alignment from end to start, beginning with the m,nth
cell
- If leaving a cell diagonally
- Read off the query and target suffix letters
- If leaving a cell horizontally
- Read off the letter in the query
- A gap in the target
- If leaving a cell vertically
- Read off the letter in the target
- A gap in the query
|
|
59
|
|
|
60
|
- Runs in polynomial (mn) time
- Unfortunately runs in mn space as well
- Can be made to run in linear space
- Finding max score is easy
- Finding path is not so easy
|
|
61
|
|
|
62
|
- A local alignment is an alignment of the query string with a substring1
of the target string. It is an
optimal suffix alignment.
|
|
63
|
- For a local alignment
- Finds the highest scoring substrings (suffixes) of the query and target
strings
- Waterman: “Fitting one sequence into another”
|
|
64
|
- Could get a complete solution in O((mn)2)•O(mn)=O((mn)3)
- S-W runs in O(mn)
- Aligns suffixes instead of prefixes
|
|
65
|
- There are no initial gaps in the best local alignment, so first row and
column have 0’s propagated from the origin
- Our general algorithm is modified for a 4th case i.e. 0
|
|
66
|
|
|
67
|
- We need to keep track of whether a zero arises from a calculation or
from the choice of the fourth case.
|
|
68
|
|
|
69
|
- Pad first row and col with 0’s
- Apply dynamic programming (modified) algorithm
- Distinguish between selected 0 and computed 0
- Do traceback arrows with all computed values (incl 0)
- No traceback arrows from selected 0
- After matrix is completed, find maximum value
- Trace the path back until there are no arrows out
|
|
70
|
- The max value is the alignment score
- There are sub and superstrings of different scores
- There may be more than one string with the same max value
- There may be other unrelated strings with lower scores which
nevertheless might be important to the problem under study
|
|
71
|
|
|
72
|
|
|
73
|
|
|
74
|
- Based on similarity of the ensemble of sequences, not specific pairs
- Shows patterns
- Discloses families
- Tracks changes (phylogeny)
- Relates mutant genes to wild types or their homologs (Cystic Fibrosis
story)
|
|
75
|
|
|
76
|
- Go ONLY BY COLUMNS
- Take the sum of the scores of all pairs in each column
- for 4 rows, there would be
- Score(1st,2nd)+Score(1st,3rd)+Score(1st,4th)
- +Score(2nd,3rd)+Score(2nd,4th)+Score(3rd,4th)
- In your scoring function, gap-gap is given 0
|
|
77
|
|
|
78
|
- This scoring part of the algorithm takes
- for a problem with k rows where n
is the size of the longest string with no gaps. The complexity is O(n2) i.e.,running
in polynomial time
|
|
79
|
- Now, how will we solve the problem in k dimensions and how hard will it
be?
- We can certainly apply the dynamic programming algorithm to the k
dimensional case.
|
|
80
|
- The problem now generalizes to a hypermatrix of size n+1 and of
dimension k.
- Storage space, then, goes up exponentially with k (without the use of a
space-saving heuristic such as
the one available for the 2 dimensional case)
|
|
81
|
- We need an heuristic!
- Expectation Maximization
- Gibbs Sampler
- Simulated Annealing
- Star Alignment
|
|
82
|
- ‘Star’ Alignment (sometimes
called merge-alignment) is commonly used.
- Like all heuristics, it is fast but does not guarantee the optimal
answer
- It is called ‘Star’ because one of the k sequences is selected to be ‘in
the center’ to be a basis of comparison to the other k-1 sequences. This
might be diagrammed as sequences at the ends of spokes radiating from
the center sequence.
|
|
83
|
- Find the standard alignment scores for each of the (k(k-1)/2) pairs of sequences and
enter them into a k´k
matrix.
- Consider this set of sequences k=4, of maxlength n=5
|
|
84
|
|
|
85
|
|
|
86
|
|
|
87
|
- Find the actual best alignments of each sequence with the
center-of-the-star. In this case, s2 is selected.
- Using the center-of-the-star
sequence, s2 , as the query, align it (global) with each of
the remaining sequences, yielding n-1 best alignments.
|
|
88
|
|
|
89
|
|
|
90
|
- To build the table and get center of star:
- To do the MSA: O(kn2+k2l) (l is length after gaps)
- O(kn2)for pairwise
alignments
- O(k2l) for assembling
sequences
- Can be optimized to O(kn+kl)
|
|
91
|
- If the sequences selected at the start are widely divergent, the
heuristic can begin a chase down a trail that is leading the wrong way.
- For this reason, other MSA strategies are sometimes considered.
|