|
1
|
|
|
2
|
- We are given a bunch of sequences and we wish to discover motifs
embedded in sequences.
- There may be more or less than one motif in each sequence
- We don’t know the location of the motifs within the sequences
- We don’t know the composition of the motif
- We may or may not know the length of the motif
- There may be gaps in some of the motifs
|
|
3
|
- This is a (really) hard problem.
- In fact, we have to settle for an heuristic, and might even have to
settle for a greedy algorithm
|
|
4
|
- The content of each word to be aligned in an MSA is known (up to gap info)
- Motif-finding does not know the content- it must discover it
|
|
5
|
|
|
6
|
- We will discuss the EM approach
- For simplicity, we are going to constrain the problem:
- Exactly one motif/sequence
- No gaps
- Length is fixed, and given
- Background sequence info, as distinct from motif, will be identified
|
|
7
|
- When we find the motif, the SNR
(motif to background ) will be maximized
- Restated: The motif will be unlikely in the context of its background
- But we don’t know what content of the motif is, so how do we determine
if it is unlikely?
|
|
8
|
|
|
9
|
- A profile is a Position-Specific Scoring Matrix (PSSM)
- The PSSM is a matrix with a column for the position of each letter in
the motif, and a row for the frequency (probability) of each
possible letter in that position,
in the context of the alphabet (4
rows for DNA, 20 rows for amino acids).
- If you think about it, each column is the pdf for the letters in that
column
- The background has but one column, since there is no concept of position
|
|
10
|
|
|
11
|
- The pdf of the motif will be the most probable subsequence in the data.
- Restated, the motif pdf will look exactly like the correct PSSM. (But we
don’t know what the ‘correct’ PSSM looks like, only how we guess when we start.)
- Another viewpoint: the ‘correct’
pdf that is the one farthest from the psd of the background. (We do not
know the psd farthest from the background.)
|
|
12
|
|
|
13
|
- Set a joint probability expression=1
- Start in position 1 of the sequence data.
- If the first letter of the first sequence is an A, multiply the joint
probability by the probability of an A as the first letter of a the
motif, decided by the
probability in the first position of the PSSM corresponding to an A
- If the first letter of the first sequence is a C, multiply the joint
probability by the probability of a C as the first letter of a the
motif, decided by the
probability in the first position of the PSSM corresponding to an C in
the first position of the PSSM
- If the first letter of the first sequence is a G, multiply the joint
probability by the probability of a G as the first letter of a the
motif, decided by the
probability in the first position of the PSSM corresponding to an G
- If the first letter of the first sequence is a T, multiply the joint
probability by the probability of a T as the first letter of a the
motif, decided by the
probability in the first position of the PSSM corresponding to an T
|
|
14
|
|
|
15
|
- Do this for all m letters of the motif
- For the remaining letters in the first sequence, by hypothesis, if they
are not in the motif, they must be background, so continue to compute
the joint probability using the frequencies of the background letters
- When all the letters of the first sequence are accounted for, there is
an expression for the joint probability* of the first letter in the
first sequence being the first letter of the motif
|
|
16
|
|
|
17
|
|
|
18
|
- When we are done, we have the joint
probability of the motif if it were to start in each position of
the data
- We want a relative probability that the motif starts in a position, so
we need to normalize each of the joint probabilities, sequence by
sequence, by dividing by the sum* of the joint probabilities. This gives
us our expectation matrix
|
|
19
|
- Instead of the joint probability of the motif and background, we could
ask “how ‘far’ is the motif pdf from the background pdf?”
- The Kullback-Leibler distance measures how far pdfs are from each other
|
|
20
|
|
|
21
|
|
|
22
|
|
|
23
|
- This completes the Expectation step.
- For the Maximization step:
- Now use the expectation matrix to generate a new PSSM by adding weights
|
|
24
|
- Create an empty new PSSM
- Augment each base count in each position of the PSSM by a weight:
- For each sequence in the data
- For each cell i + j in the Expectation Matrix
- For each column j in the new PSSM
- Add the i + j th probability to the bin in the j th column
of the PSSM appropriate to the base in the i + j th position
of the original data
|
|
25
|
|
|
26
|
|
|
27
|
- Exhaust all possible starting positions in the expectation matrix
- Do this for each sequence in the data
|
|
28
|
- When all the weights are accumulated in the new PSSM, normalize them!
- The product is a refined PSSM
|
|
29
|
- Repeat
- Create a new Expectation matrix using the new PSSM
- Maximize PSSM by using the
Expectation Matrix for weights
- Stop: when things don’t change.
|
|
30
|
|
|
31
|
|
|
32
|
- The PSSM will fully describe the elicited motifs in the data
|
|
33
|
- The EM algorithm we have described discovers one motif in each sequence.
- There are sophisticated algorithms using the EM principle that can
handle gaps and can find no, one, or many motifs in each sequence MEME is such an application available
on the web from the San Diego supercomputing center http://meme.sdsc.edu/meme/intro.html
|
|
34
|
|