|
1
|
- Parameters and
- Probabilities
|
|
2
|
- Emissions are observations of individual symbols drawn from an alphabet
of possible symbols
- Each symbol observed in a sequence of symbols is called an observation,
oi and the set of observed symbols (emissions) is denoted O.
- By convention
- The observations(often time, depending on the application) are indexed
by t.
- There are t=1,T observations
- The states Q are indexed by i
- There are k emissions in each set Bj specific to each state,
sometimes denoted bj,k (note that k is not necessarily the
same among the various Bj)
|
|
3
|
- M is the Model (p,A,B) where
- p is dead-start state
distribution
- Bi is the set of emission probabilities {bi,k}
for each state i.
- There are N sets of B, that
is, {bjk} where k
is the number of emitable symbols in each state
- A is the set of transition probabilities from one state to the next {ai,i+1}
|
|
4
|
- Evaluation:
- What is the full probability of the given observed sequence of symbols
O, given the model [What is p(O|M)], and hence the probability of any
path)?
- Decoding:
- What is the most probable path p* given the model? p*(Q|M)
- Learning:
- Given an observed sequence, what
is the a posteriori probability that some observation came from a
particular state? In particular,
- What is the most probable model, given the set of observations? [What
is P(M|O)]
|
|
5
|
- All of these would be hard problems
- The evaluation and decoding problems p(O|M) can be solved with dynamic
programming yielding polynomial complexity solutions
- The p(M|O) problem can be solved in polynomial time with an heuristic
based upon the expectation-maximization technique, and converges to a
solution after only a few iterations
|
|
6
|
|
|
7
|
|
|
8
|
|
|
9
|
|
|
10
|
|
|
11
|
|
|
12
|
|
|
13
|
- Define a function relating all joint probabilities up to the observation
of interest
|
|
14
|
|
|
15
|
|
|
16
|
|
|
17
|
- Consider the Forward Algorithm variable (function) f , vis à vis the Viterbi variable vk(i)
- v is an ad hoc argmax times emission probability
- f is an ad hoc joint probability
of emission probability times
transition probability
|
|
18
|
|
|
19
|
|
|
20
|
|
|
21
|
|
|
22
|
|
|
23
|
|
|
24
|
|
|
25
|
- Both the Forward Algorithm and the Viterbi Algorithm can be implemented
with DYNAMIC PROGRAMMING
- Otherwise they are hard problems (exponential)
- Use logs to prevent underflow
|
|
26
|
- WHAT ABOUT THE LEARNING PROBLEM?
- We don’t know the parameters but we are given sets of observations
|
|
27
|
|
|
28
|
|
|
29
|
|
|
30
|
|
|
31
|
|
|
32
|
- Parametric
- We need to find the parameters of the pdf
- Non Parametric
- We need to find pdf itself
|
|
33
|
|
|
34
|
- Typically, we have data which is generated by a probability density x1,x2,x3….,y1,y2…..
- The difficulty is that we only observe x1,x2,x3….
- We need to figure out the parameters Q (m,s2) of the
entire system which will govern both X and Y, all of the data
|
|
35
|
- We are looking for the parameter(s) Q
- We are give the observations O=x1,x2,x3
- but unlabelled
- The labels are the hidden data.
|
|
36
|
- We need to optimize the expression
- P(Obs,Hidden|
Q)
- Given only
- P(Obs|Q)
- by choosing the best Q
|
|
37
|
- EM is a 2 step algorithm to get the best Q
- EXPECTATION STEP
- Compute the expectation of the hidden variables under the current
parameter: yi| Q
- MAXIMIZATION STEP
- Replace the parameters with the most likely parameter based on the
values of the H, assuming that
the y1,y2….. represent the expected values of H
|
|
38
|
- Iterate the Expectation and Maximization steps until Q converges
- Will converge to a local maximum likelihood (Dempster,Laird,Rubin)
- The computations to compute the expectations can be messy, as can the maximization
of the likelihood.
|
|
39
|
|
|
40
|
|
|
41
|
|
|
42
|
|
|
43
|
|
|
44
|
|
|
45
|
|
|
46
|
- Minimization of error was easy with a Gaussian
- The ‘magic’ of the mean
- The simplicity of deriving a probability given just the parameters
- There are many nonparametric situations (e.g. HMM)
- In some discrete distributions, it is necessary to compute the
multinomial distribution, then set the derivative to 0. This can get really ugly.
- Likewise, other distributions require even more complicated calculus
(LaGrangian multipliers and Heaven knows what else)
|
|
47
|
|
|
48
|
|
|
49
|
- Observe blood types in a bunch of people
- Theses types are A,B, AB, and O
- (the Blood types determined by the Blood Bank are the phenotypes)
- The phenotypes are the observed data
- Infer the frequencies ( ie a discrete pdf) of the blood type alleles A,B
and O
|
|
50
|
- We need to know the genotypes which underlie the phenotypic expression
- The possible genotypes are AA, AO, OA,AB,BB, BO,OB,OO
- The genotypes are the hidden data
|
|
51
|
- Sample 30 people and type them
- Type A 16
- Type B 2
- Type AB 1
- Type O 11
|
|
52
|
|
|
53
|
|
|
54
|
- NORMALIZE each row so that the row entries represent a probability.
- Specifically, each entry in a row represents the probability that the
entry’s genotype (column heading) led to the entry’s phenotype (row
heading).
|
|
55
|
|
|
56
|
- Now we wish to recover the Probabilities of the alleles being drawn from the
population. (Remember this was the goal)
- The Probability matrix was set up as a 2 column matrix
- Column 1 was the probability of recovering the ‘left’ allele from the
population*
- Column 1 was the probability of recovering the ‘left’ allele from the
population
- *In this particular example, the left and right allele probabilities are
equal
|
|
57
|
- Now, ask the question: Which entries have contributed to a specific left
allele, say, A?
- (We will ask and answer the same question for all alleles, both left and
right)
- Answer: For the ‘left’ A, it is
- row 1 (A phenotype) col
1(genotype AA)
- row 1 (A phenotype) col 2
(genotype AO),
- row 3 (AB phenotype), col 7 (AB genotype)
|
|
58
|
|
|
59
|
- So, compute the probability of the Left A allele as follows:
- Sum of entries in row 1 (A phenoype) .333+.333=.666
- There are 16 individuals with
phenotype A
- Sum of entries in row 3 (AB phenotype)= .5
- There is one individual with
phenotype AB
|
|
60
|
- P(Aleft)= .667´16 +
0.5 ´ 1= .372
- Do the same calculation for Aright, for Bright,
- for for Oright, and
for Oleft
|
|
61
|
|
|
62
|
- Now, use the new Probability Matrix and
- ITERATE!
|
|
63
|
- Learning transition probabilities in a Hidden Markov Model, given only
the emissions (observations) as training data. There is an efficient
(polynomial) special-case algorithm, the Baum-Welch Algorithm,
exploiting the Markov structure of the HMM .
- Learning a most likely motif subsequence given many sequences of data,
each of which contains a coding
subsequence for the protein function of interest. The algorithm is called the
Multiple-Sequence Expectation-Maximization Motif Elicitation (MEME)
- Unsupervised cluster discovery (to be visited later)
|
|
64
|
- We need the concept of a backward variable (an a posteriori probability)
- We need 2 more variables
- the probability of being in state i at time t and in state j at t+1
given the model M and the observation
O
- the probability of being in state i at time t given the observation
sequence O and the model M
|
|
65
|
- What is the probability of the partial observation from t+1 to the end
- GIVEN state qi AND
- GIVEN the parameters of the model
|
|
66
|
|
|
67
|
|
|
68
|
|
|
69
|
|
|
70
|
|
|
71
|
|
|
72
|
|
|
73
|
- Reestimate the parameters
|
|
74
|
- γt(i) summed over t ~ expected number of transitions out
of state I
- ξ(i,j) summed over t is expected transitions from state I to state
j
|
|
75
|
|
|
76
|
|
|
77
|
|
|
78
|
|
|
79
|
|
|
80
|
- Expectation Step
- Estimate transition and emission probabilities using frequency counts
(the states will look alike)
- Run the model with these estimates, using Posterior decoding
- Maximization step
- Re-estimate the parameters using maximum likelihood estimator
- Iterate
|