Notes
Slide Show
Outline
1
HMMs
What can we use them for?
  • Parameters and
  • Probabilities
2
NOTATION
  • Emissions are observations of individual symbols drawn from an alphabet of possible symbols
  • Each symbol observed in a sequence of symbols is called an observation, oi and the set of observed symbols (emissions) is denoted O.
    • By convention
    • The observations(often time, depending on the application) are indexed by t.
      • There are t=1,T observations
    • The states Q are indexed by i
      • There are i=1,N states
    • There are k emissions in each set Bj specific to each state, sometimes denoted bj,k (note that k is not necessarily the same among the various Bj)


3
NOTATION
  • M is the Model (p,A,B) where
    • p is dead-start state distribution
    • Bi is the set of emission probabilities {bi,k} for each state i.
      • There are N  sets of B, that is,   {bjk} where k is the number of emitable symbols in each state
    • A is the set of transition probabilities from one state to the next {ai,i+1}

4
Important Issues re Interpretation of Sequences in the Context of the Model
  • Evaluation:
    • What is the full probability of the given observed sequence of symbols O, given the model [What is p(O|M)], and hence the probability of any path)?
      • The Forward Algorithm
  • Decoding:
    • What is the most probable path p* given the model? p*(Q|M)
      • The Viterbi algorithm
  • Learning:
    • Given an  observed sequence, what is the a posteriori probability that some observation came from a particular state?  In particular,
    • What is the most probable model, given the set of observations? [What is P(M|O)]
      • The Baum Welch algorithm

5
Numerical Answers
  • All of these would be hard problems
    • BUT…..


  • The evaluation and decoding problems p(O|M) can be solved with dynamic programming yielding polynomial complexity solutions


  • The p(M|O) problem can be solved in polynomial time with an heuristic based upon the expectation-maximization technique, and converges to a solution after only a few iterations
6
 Model with 3 Coding Regions (States)
7
OBSERVED SEQUENCE O
8
The Forward Algorithm
9
 
10
 
11
 
12
 
13
The recursion:
  • Define a function relating all joint probabilities up to the observation of interest


14
 
15
 
16
 
17
The Viterbi Algorithm
  • Consider the Forward Algorithm variable (function)  f , vis à vis the Viterbi variable vk(i)


    • v is an ad hoc argmax times emission probability
    • f is an  ad hoc joint probability of emission probability  times transition probability

18
Viterbi Algorithm
19
 
20
 
21
 
22
 
23
 
24
 
25
Implementations
  • Both the Forward Algorithm and the Viterbi Algorithm can be implemented with DYNAMIC PROGRAMMING
    • Otherwise they are hard problems (exponential)
  • Use logs to prevent underflow


26
THE LEARNING PROBLEM -EM
  • WHAT ABOUT THE LEARNING PROBLEM?
  • We don’t know the parameters but we are given sets of observations
27
Expectation Maximization
28
Example Gaussian Mixture
29
 
30
So, all we have is..
31
From which we need to infer the
 likelihood functions which generated the observations….
32
Types of Problems
  • Parametric
    • We need to find the parameters of the pdf
  • Non Parametric
    • We need to find pdf itself
33
Parametric Problem
Example
34
EM
  • Typically, we have data which is generated by a probability density  x1,x2,x3….,y1,y2…..
  • The difficulty is that we only observe x1,x2,x3….
  • We need to figure out the parameters Q (m,s2) of the entire system which will govern both X and Y, all of the data
35
EM
  • We are looking for the parameter(s) Q


  • We are give the observations O=x1,x2,x3
  •  but unlabelled


  • The labels are the hidden data.


36
EM
  • We need to optimize the expression
  •             P(Obs,Hidden| Q)


  • Given only
  •             P(Obs|Q)


  •          by choosing the best Q
37
The EM Algorithm
  • EM is a 2 step algorithm to get the best Q
  • EXPECTATION STEP
    • Compute the expectation of the hidden variables under the current parameter:  yi| Q
  • MAXIMIZATION STEP
    • Replace the parameters with the most likely parameter based on the values of the  H, assuming that the y1,y2….. represent the expected values of H

38
The EM Algorithm
  • Iterate the Expectation and Maximization steps until Q converges


  • Will converge to a local maximum likelihood (Dempster,Laird,Rubin)


    • The computations to compute the expectations  can be messy, as can the maximization of the likelihood.
39
EXAMPLE 
The Gaussian Mixtures already presented
40
 
41
Q:What is the probability of any xi given process (m,s2) j?
42
EXPECTATION Step
43
Since it is a Gaussian Distribution, we can define the probabilities concisely
44
But, remember for a Gaussian process, the square error is minimum when m is the mean
45
MAXIMIZATION STEP
46
Maximum Likelihood
  • Minimization of error was easy with a Gaussian
    • The ‘magic’ of the mean
    • The simplicity of deriving a probability given just the parameters
  • There are many nonparametric situations (e.g. HMM)
  • In some discrete distributions, it is necessary to compute the multinomial distribution, then set the derivative to 0.  This can get really ugly.
  • Likewise, other distributions require even more complicated calculus (LaGrangian multipliers and Heaven knows what else)
47
DEMO
48
Non Parametric
Example
49
The Task
  • Observe blood types in a bunch of people
    • Theses types are A,B, AB, and O
    • (the Blood types determined by the Blood Bank are the phenotypes)
    • The phenotypes are the observed data


  • Infer the frequencies ( ie a discrete pdf) of the blood type alleles A,B and O
50
What’s missing?
  • We need to know the genotypes which underlie the phenotypic expression
    • The possible genotypes are AA, AO, OA,AB,BB, BO,OB,OO
    • The genotypes are the hidden data
51
Experiment:  Observe Phenotypes
  • Sample 30 people and type them
    • Type A    16
    • Type B      2
    • Type AB   1
    • Type O    11
52
Create a matrix for the probabilities
53
Now create a matrix for the hidden data (the genotypes)
54
EXPECTATION
  • NORMALIZE each row so that the row entries represent a probability.


  • Specifically, each entry in a row represents the probability that the entry’s genotype (column heading) led to the entry’s phenotype (row heading).
55
Matrix  normalized
56
MAXIMIZATION
  • Now we wish to recover the Probabilities of  the alleles being drawn from the population. (Remember this was the goal)


  • The Probability matrix was set up as a 2 column matrix
    • Column 1 was the probability of recovering the ‘left’ allele from the population*
    • Column 1 was the probability of recovering the ‘left’ allele from the population

  • *In this particular example, the left and right allele probabilities are equal
57
MAXIMIZATION
  • Now, ask the question: Which entries have contributed to a specific left allele, say, A?
  • (We will ask and answer the same question for all alleles, both left and right)
  • Answer:  For the ‘left’ A, it is
  • row 1  (A phenotype) col 1(genotype AA)
  • row 1  (A phenotype) col 2 (genotype AO),
  • row 3 (AB phenotype), col 7 (AB genotype)
58
 
59
MAXIMIZATION
  • So, compute the probability of the Left A allele as follows:


    • Sum of entries in row 1 (A phenoype) .333+.333=.666
    •    There are 16 individuals with phenotype A


    • Sum of entries in row 3 (AB phenotype)= .5
    •    There is one individual with phenotype AB
60
MAXIMIZATION
  • P(Aleft)= .667´16    +   0.5 ´ 1= .372


  • Do the same calculation for Aright, for Bright,
  • for for Oright,  and for Oleft


61
MAXIMIZATION
62
EM
  • Now, use the new Probability Matrix and
  • ITERATE!
63
Many applications of EM, but we are interested in three.
  • Learning transition probabilities in a Hidden Markov Model, given only the emissions (observations) as training data. There is an efficient (polynomial) special-case algorithm, the Baum-Welch Algorithm, exploiting the Markov structure of the HMM .
  • Learning a most likely motif subsequence given many sequences of data, each of  which contains a coding subsequence for the protein function of interest.  The algorithm is called the Multiple-Sequence Expectation-Maximization Motif Elicitation (MEME)
  • Unsupervised cluster discovery (to be visited later)
64
Finding the parameters from training data
EM- The Baum-Welch Algorithm
  • We need the concept of a backward variable (an a posteriori probability)
  • We need 2 more variables
    • the probability of being in state i at time t and in state j at t+1 given the model M and the observation  O
    • the probability of being in state i at time t given the observation sequence O and the model M





65
Backward Variable
  • What is the probability of the partial observation from t+1 to the end
  • GIVEN state qi AND
  • GIVEN the parameters of the model
66
 
67
 
68
 
69
 
70
 
71
 
72
 
73
Now we are ready to do EM
  • Reestimate the parameters
74
Estimation
  • γt(i) summed over t ~ expected number of transitions out of state I
  • ξ(i,j) summed over t is expected transitions from state I to state j
75
Estimation
76
Estimation
77
Estimation
78
MAXIMIZATION
79
"ITERATE"
  • ITERATE
80
Training Data with Unknown Paths using EM: Baum-Welch
  • Expectation Step
    • Estimate transition and emission probabilities using frequency counts (the states will look alike)
    • Run the model with these estimates, using Posterior decoding


  • Maximization step
    • Re-estimate the parameters using maximum likelihood estimator
  • Iterate