Notes
Slide Show
Outline
1
Motif Discovery
2
The Challenge
  • We are given a bunch of sequences and we wish to discover motifs embedded in sequences.
  • There may be more or less than one motif in each sequence
  • We don’t know the location of the motifs within the sequences
  • We don’t know the composition of the motif
  • We may or may not know the length of the motif
  • There may be gaps in some of the motifs
3
Complexity
  • This is a (really) hard problem.


  • In fact, we have to settle for an heuristic, and might even have to settle for a greedy algorithm
4
This is NOT MSA

  • The content of each word to be aligned in an  MSA is known (up to gap info)


  • Motif-finding does not know the content- it must discover it
5
Motif Elicitation
Standard Approaches
  • EM
  • HMM
  • Home-spun
6
EM: A strategy
  • We will discuss  the EM approach
  • For simplicity, we are going to constrain the problem:
    • Exactly one motif/sequence
    • No gaps
    • Length is fixed, and given
    • Background sequence info, as distinct from motif,  will be identified
7
Features of the Motif
  • When we find the motif,  the SNR (motif to background ) will be maximized


  • Restated: The motif will be unlikely in the context of its background


  • But we don’t know what content of the motif is, so how do we determine if it is unlikely?
8
 
9
Profiles (PSSM)
  • A profile is a Position-Specific Scoring Matrix (PSSM)


  • The PSSM is a matrix with a column for the position of each letter in the motif, and a row for the frequency (probability) of each possible  letter in that position, in the context of the alphabet  (4 rows for DNA, 20 rows for amino acids).
  • If you think about it, each column is the pdf for the letters in that column


  • The background has but one column, since there is no concept of position
10
 
11
The object is to  determine the pdf of the motif
  • The pdf of the motif will be the most probable subsequence in the data.
  • Restated, the motif pdf will look exactly like the correct PSSM. (But we don’t know what the ‘correct’ PSSM looks like, only how we guess  when we start.)


  • Another viewpoint:  the ‘correct’ pdf that is the one farthest from the psd of the background. (We do not know the psd farthest from the background.)
12
 
13
Expectation Phase
Recipe 1
  • Set a joint probability expression=1
  • Start in position 1 of the sequence data.
      • If the first letter of the first sequence is an A, multiply the joint probability by the probability of an A as the first letter of a the motif, decided  by the probability in the first position of the PSSM corresponding to an A
      • If the first letter of the first sequence is a C, multiply the joint probability by the probability of a C as the first letter of a the motif, decided  by the probability in the first position of the PSSM corresponding to an C in the first position of the PSSM
      • If the first letter of the first sequence is a G, multiply the joint probability by the probability of a G as the first letter of a the motif, decided  by the probability in the first position of the PSSM corresponding to an G
      • If the first letter of the first sequence is a T, multiply the joint probability by the probability of a T as the first letter of a the motif, decided  by the probability in the first position of the PSSM corresponding to an T


14
 
15
Expectation Phase
Recipe-3
  • Do this for all m letters of the motif
  • For the remaining letters in the first sequence, by hypothesis, if they are not in the motif, they must be background, so continue to compute the joint probability using the frequencies of the background letters
  • When all the letters of the first sequence are accounted for, there is an expression for the joint probability* of the first letter in the first sequence being the first letter of the motif
16
Example
Probability of an A in the first position of the first sequence
17
 
18
Expectation Phase
Recipe-5
  • When we are done, we have the joint  probability of the motif if it were to start in each position of the data


  • We want a relative probability that the motif starts in a position, so we need to normalize each of the joint probabilities, sequence by sequence, by dividing by the sum* of the joint probabilities. This gives us our expectation matrix
19
Variation
  • Instead of the joint probability of the motif and background, we could ask “how ‘far’ is the motif pdf from the background pdf?”
  • The Kullback-Leibler distance measures how far pdfs are from each other
20
 
21
 
22
 
23
Maximization Phase
  • This completes the Expectation step.


  • For the Maximization step:
  • Now use the expectation matrix to generate a new PSSM by adding weights
24
Maximization Phase
 Recipe-1
  • Create an empty new PSSM
  • Augment each base count in each position of the PSSM by a weight:
    • For each sequence in the data
      • For each cell i + j in the Expectation Matrix
      • For each column j in the new PSSM
        • Add the i + j th probability to the bin in the j th column of the PSSM appropriate to the base in the i + j th position of the original data

25
 
26
 
27
Maximization Phase
 Recipe-2
  • Exhaust all possible starting positions in the expectation matrix


  • Do this for each sequence in the data
28
Maximization Phase
 Recipe-3


  • When all the weights are accumulated in the new PSSM, normalize them!


  • The product is a refined PSSM
29
Motif Discovery by EM

  • Repeat
    • Create a new Expectation matrix using the new PSSM
    • Maximize PSSM  by using the Expectation Matrix for weights


  • Stop:  when things don’t change.
30
 
31
 
32
The Final Product
  • The PSSM will fully describe the elicited motifs in the data
33
Multiple Em Motif Elicitation
  • The EM algorithm we have described discovers one motif in each sequence.
  • There are sophisticated algorithms using the EM principle that can handle gaps and can find no, one, or many motifs in each  sequence   MEME is such an application available on the web from the San Diego supercomputing center http://meme.sdsc.edu/meme/intro.html
34
Other approaches to Motif Discovery
  • Gibbs sampler
  • HMM