Notes
Slide Show
Outline
1
What is Probability?
  • At least 3 ways to think about it…
  • Intuitive: A measure of one’s ‘belief’
  • Physical Science Frequency Interpretation : Empirically derived frequency ratios, which, in the limit, define a probability
  • Theoretical: A set of properties based a specific algebra



2
Intuitive
  • We are waiting for a hot molecule to come out of a small hole in a vessel full of gas. It takes a certain energy to reach the hole and escape.
  • We have an intuitive feeling that not all molecules will come out, only certain ones, and it might not happen often.
  • What are our expectations (predictions) about this awaited event?
  • Related questions are:
  • What is the chance of a hot molecule?
  • How often will we see a hot molecule?
  • Should we bet on the occurrence a hot molecule’s arrival?
  • Should we bet on the time of a hot molecule’s arrival?


3
Circular intuitive ideas
  • We end up trying to express this notion by using one undefined word to define another.  Out of our ruminations come words like
    • Chance
    • Probability
    • Frequency
    • Likelihood
    • How often
    • Odds
    • Predictability
  • and we struggle to give meaning to one without using another as part of the definition.
  • Even though we cannot seem to nail this down, the idea is simple: there is uncertainty involved and we are trying to measure this uncertainty somehow.
4
Intuitive
  • Laplace:
  • “advanced knowledge”


  • Suggested idea:
  • A measure of uncertainty
5
Intuitive
  • In reality, we are saying that we don’t know how to express the formula or law of nature in closed form.
  • When we do know the closed form, we say that the process is deterministic
  • So, maybe another  good intuitive idea of probability could be
        • A measure* of non-determinism

  • *The  word ‘measure’ here is in the everyday sense.  We will use this word more rigorously as well, later on.
6
Determinism
  • A working definition:
  • When the measure of determinism (sci probability) is 1, the process is deterministic.  When the measure is <1, it is random.
7
Determinism vis à vis randomness
  • Sometimes deterministic processes can mimic randomness – you can’t always tell by looking at the outcomes.
  • For example, here is a chaos generator (non-linear eqn)
  • used as a pseudo random number generator *
8
 
9
The Logistic Map in an ergodic regime
10
Physical Science Frequency Interpretation
 Estimating Probability

  • We can estimate it based on observations of frequency.


  • We define probability as frequency.


11
"For example,"
  • For example, we witness a set of outcomes of an experiment.  If we keep repeating the experiment, a very large number of times, and count a specific outcome x, then




12
Watch out!
  • Laplace:
  • The sun has risen 1,826,250 times in 5000 years
  • Odds are 1,826,251:1 that it will rise tomorrow


  • Right?


  • Feller points out that this may not be a probabilistic event and that records of other sun-risings are irrelevant. Restated: The sun rising or not may not be among the outcome choices.
13
Theoretical: Defining Probability
 based on Algebraic Topology
  • Why an abstract and non intuitive definition?


  • Because development of the underlying algebra (Borel Algebra) and the functions defined on this algebra provide the rationale for all of the so-called ‘laws’ of probability.  They are not really axioms; they are constraints and consequences which follow from the definition of a probability space.
14
What we need in a nutshell
  • An outcome set Ω
  • A smallest Boolean algebra on that set with countably infinite unions of subsets (ie a smallest σ-algebra ,S  )
  • A function P with the countable additivity property


  • (Ω,S,  P) defines a probability space
15
The Boolean Algebra
16
Example
17
Sigma-Algebras
18
Countable additivity
  • Define a function P on the s-Algebra
19
Look familiar?
20
Countable additivity, axioms
  • So, this last property of P is the magic that we await.
  • From this abstract idea, with a rigorous definition of this function P, we have a theoretical framework for what we always knew:


  • If a number of subsets of S   are chosen such that they are mutually exclusive, then the probability of their union is equal to the sum of their individual probabilities.
21
"On the non-negative countably additive..."
  • On the non-negative countably additive sets in the  σ-algebra, define a measure μ(A). μ(A) ¹Æ


  • μ is a probability measure, and the countable additivity property ensures that all probability measures in Ω sum to 1.0
22
So-called ‘Laws’ of Probability
23
TERMS ASSOCIATED WITH PROBABILITY

    • Random Variable:   A numerical value (label) , V, assigned  to each event  in  a collection of events , which indicates the  outcome of  a   Random  experiment.


      • Random Variables don’t necessarily sum to 1, nor are they  in [0,1]


      • RVs are associated with an underlying probability governing their occurrence









24
RV  example
  • A die has 6 possible outcomes.  The number on the face of the die is a random variable (1,2,3,4,5,6)


  • The probability underlying each random variable is .18667



25
RVs
  • So, given a set of outcomes (RVs), what is the probability of each RV?  The set of outcomes, each associated with its underlying probability, is called a probability density.
  • Example  Throw one of 2 loaded dice (choice of die  is random and equiprobable)
  • One die {1,2,3,4,5,6}
  • The other die {1,1,1,4,5,6}
  •                           The density is


26
Terms associated with RVs
      • Distribution Function:  A function, F(X),which gives the probability that  an outcome of an experiment will have its random variable  value ,V,   less than or equal to  X.  The probability that V will be in the interval [a,b ] , is given by                           prob  ( a £  V £ b)  =   F(b)  -  F(a)]


      • Density Function:   The Derivative (if it exists) of the distribution function is the Probability density function
27
 
28
Probability Density Function  of a Continuous Random Variable
29
Confusing terminology
  • The precise terms
  • Cumulative Distribution Function
  • and its derivative
  • Density Function
  • are frequently used carelessly and,  in error, interchangeably.  The density function is very often carelessly called
  • the ‘probability distribution’
  • Yet, we appropriately can call the process characterized by the density
  •  ‘the distribution’ or the ‘probability distribution’
  • …Very confusing indeed


30
Examples of density functions
  • Discrete probabilities (p(Heads), p(Tails))
  • Uniform Density p(x)=k ,  xÎ[0,1] | òp(x)dx=1
  • Exponential Densities


31
Exponential Densities
Gaussian
32
Exponential Densities
Extreme Values
Gumbel Density Functions for Min and Max
33
Exponential Densities
Boltzman density
34
Poisson: another work horse—also a limit
t is the time interval to be examined
l is the rate of ‘events’ per unit time
k is the number f events which occurred
 
 










Exponential Density: is the waiting time to the first event of a Poisson process
i.e., given l, how long is the wait until the first event.
35
The Central Limit Theorem
  • The density of the average* (or sum) of independent identically distributed RVs taken from ANY** process, regardless of its probability density, approaches a Gaussian density in the limit.
36
More about RVs…Moments
  • When we take an average (mean) of a set of RVs, we are summing the values of the RVs, then dividing by the number  of values. There is no inherent tie to the probability that each of those values might arise.
37
The Mean
38
Mean
  • Notice that the mean ‘ignores’  the underlying probabilities, giving little insight to the true character of the data.


  • As an extreme example, consider drawing 2 balls from an urn of  one million balls, each labeled with a number.
  • 999,999 balls are labeled 1, and one ball is labeled 999,999
  • The mean of the set is 1.999998
  • If we were to draw two balls, a 1 and a 999,999, the mean value of the sample subset would be 500,000, yet we ‘know’ that the mother set does not really look like that


39
Expectation (Expected Value)
  • Let us now tie the value of each RV to the probability of its occurrence
  • Suppose the value of the rv is vi.  Then, compute, for each rv, vi  × prob(vi)


  • Then the EXPECTATION, or EXPECTED VALUE,  of the set of RVs is   ∑vip(vi)


  • For a continuous RV v,
40
Expectation
  • Again consider the same set {5,1,1,1,.1,2.9}


  • The expectation of this set of RVs, given the same respective underlying probabilities,  is
41
 
42
Moments
  • Although it seems a bit contrived, the Expectation is a powerful notion.  We use Expectation to define another notion: moments.
  • Moments characterize a random process.  Under specific conditions,  they are said to be the parameters of the process
  • In many random processes that you are familiar with, if the process in normalized, these moments become equivalent to ideas you are very familiar  with such as mean and variance
43
Moments
  • More formally, the nth moment of a random process X is the Expectation of the nth power of X


  • 1st moment  E(X)
  • 2nd moment E(X2)
  • 3rd moment E(X3)
  •    ……….
  • nth moment E(Xn)
44
Central Moments
  • If the process is GAUSSIAN, the first moment is the mean.


  • The nth central moment is computed by removing the first moment (mean) before finding the Expectation
45
Moments

  • We define variance in terms of moments:
    • Var(x)=E(X2)-[E(X)]2


    • The variance is thus the second moment minus the square of the first moment.
    • Note that the variance and the second moment are the same if the process is normalized and GAUSSIAN
46
Higher order moments
  • Consider the 3rd central moment
  •     E(X3)=E([X-E(X)]3)


  • This moment is known as skewness.  It describes how ‘off-center’ the process is. Look at the Gumbel density.


  • What is this moment for a Gaussian density?
47
Higher order moments
  • The 4th moment has no real utility, but the 4th moment minus 3 times the square of the second moment,  E(X4)  - 3 [E(X2)] 2
  • is very useful.


  • This is called kurtosis. Kurtosis is a measure of ‘pointy-ness’  (+) kurtosis is ‘pointy’,  (-) kurtosis is ‘flat’


  • Aside from its general descriptiveness as a parameter, it s a value of great interest in Independent Component Analysis.


  • What is the kurtosis of a Gaussian process?
48
Moments: Manipulation
  • If X and Y are independent
  • E(X+Y)=E(X) + E(Y)
  • E(XY)=E(X) E(Y)


  • Var(X + Y)= Var(X) + Var(Y)
49
JOINT PROBABILITY
  • The probability of more than one simultaneous outcome from the same process (probability distribution)
  • Ex: Draw 2 labeled balls from an urn   What is the probability of that particular combination of labels?
  • Answer: The joint probability, derived from  of the probability of each ball
50
JOINT PROBABILITY
  • What if the outcome of, say, two events,  depends upon both events having happened.


  • Restated: What if the outcome of one must occur before the other can occur?
51
 
52
Conditional Probability Definition
53
A family has 2 children
  • Given that 1 is a boy
  • What is the probability that both are boys?




54
Probabilities of Possible Families
  • {B,B} .25
  • {B,G} .25
  • {G,B} .25
  • {G,G} .25


55
Formulation
  • Let F represent BOTH are BOYS
  • Let E represent AT LEAST ONE is a BOY
  • We then seek P(F|E)
56
Application of Formula
57
Substitution
58
INDEPENDENCE
Also Stochastic Independence, Statistical Independence
  • Events are independent iff:
      • P(A ÙB) =P(A) ´ P(B)
      • events A ,B


      • Restated:
      • P(AB)=P(A)P(B) Û A and B are independent
      • P(A|B)=P(A)        Û A and B are independent



59
Examples
(From Feller)
  • Draw a card:  Suits are independent of numbers
  • P(Ace of Spades)= P(Ace) P(Spade)
  • =(1/13)(1/4)
  • Roll 2 dice:
  • P(One on first die and even face on second die)
  •        =P(One)P(Even)
  •         = (1/6)(3/6)
60
An example of joint density of two continuous probability densities
61
Conditional Probability in a continuous joint density
62
Contingency Table
63
 
64
 
65
 
66
Computing Joint Conditional Probabilities
67
A simplifying assumption
  • The Markov property holds in many natural stochastic processes
  • The Markov assumption is that there is no memory of any step before the current step
  • A stochastic process with the Markov property is called a Markov (or memoryless) process
  • We will encounter many, many discrete Markov processes in this course
68
The Markov property
69
Joint Moments: Manipulation
  • Cov(X,Y)=E (  [X-E(X)] –  [Y-E(Y)  ]
  •                  =E(XY) - E(X)E(Y)


  • If X and Y are independent,
  •      E(XY)=E(X) E(Y), so Cov (XY) must = 0


70
Bayes’ Theorem
  • Often we are given P(A|B),
  • But we are really interested in P(B|A)
71
 
72
BAYES’  THEOREM
  • POSTERIOR  µ PRIOR * LIKELIHOOD
73
Three more densities
  • What to use in a Bayes calculation if you don’t know the prior
74
Connecticult vs Teal
  • 48 blacks and 259 whites took a test for promotion


  • 26 blacks passed, 22 failed  P(P) = 0.54


  • 206 whites passed, 53 failed P(P) = 0.795


    • Total passed = 232


75
Connecticult vs Teal
  • P(B|P) = 26/232 = 0.11
  • P(W|P) = 206/232 = .89


  • ‘Conclusion*’ : Possibility of discrimination
76
CONTINGENCY TABLE
77
WHAT WE REALLY NEED IN ORDER TO SHOW DISCRIMINATION:
  • P(P|B) vs P(P|W)
78
BAYES’ THEOREM
When you don’t have access to the marginals in the contingency table….
79
"P(P|B)"


  • P(P|B) = 25/48 = 0.54
  • P(P|W) = 206/259 = 0.80
80
"Pluralitas non est ponenda sine neccesitate"

also

Frustra fit per plura quod potest fieri per pauciora.
81
OCCAM’S RAZOR
  • When there are 2 or more possible explanations for an observed event, we want to choose the explanation with the fewest and simplest assumptions


  • This is OCCAM’S razor: it ‘cuts’ away the excess verbiage, conditions, complications, and unnecessary logic


82
So, what is ‘simple’?
  • We will always assume that a system will seek its lowest energy state


  • Equivalently


  • We will always assume that a system will always seek its highest entropy state


  • Equivalently


  • The least (Kolmogorov) complexity is the best


83
ENTROPY
84
Entropy
  • Entropy (real number ³0) is a measure of disorder
  • High entropy
    • High disorder
    • Much information needed to specify all the states
  • Low entropy
    • Well organized
    • Little or no information needed to specify all the states
85
 
86
Shannon Entropy
  • A bridge between statistical thermodynamics and information


  • If you can know (or guess) about each particle in the system (say, a gas), you can determine the entropy of the system
87
Shannon Entropy
  • Likewise, you can measure the information in a message by knowing (or guessing) the probability of each element of the message.


  • Information relates to entropy through probability as:
  • S= -p(x)log2p(x)
  •     where S is entropy, p(x) is the probability of event x
88
Shannon Entropy
  • Shannon generalized this for a set of events in a system and for letters in an alphabet.
89
CODING THEOREM
  • Shannon theorem for codes forms basis  of all data compression schemes.
  • The Entropy of an alphabet is the lowest number of average bits per character achievable in any coding scheme.
  • The ASCII code uses 8 bits per character--The entropy of English is about 5 bits per character--Therefore it pays to compress
90
Shannon Entropy
  • Example:
  • Given AATGATGCTGCAAATAAGTA
  • The frequencies (probabilities) of the bases are


91
Shannon Entropy
92
Minimum Description Length
  • The MDL is the length of the most parsimonious way to encode a message without an error
  • MDL is usually measured in bits
93
 
94
 
95
MDL
  • Suppose we have an efficient programming language, similar to, say, C#
  • We will represent the length of the program as one byte for each character and one byte for each token.
  • Any integer number N can be represented by log2(N) bytes, rather than using a byte for each symbol (digit) . We will not allow fractions of bytes in this example
96
 
97
 
98
 
99
 
100