|
1
|
- At least 3 ways to think about it…
- Intuitive: A measure of one’s ‘belief’
- Physical Science Frequency Interpretation : Empirically derived
frequency ratios, which, in the limit, define a probability
- Theoretical: A set of properties based a specific algebra
|
|
2
|
- We are waiting for a hot molecule to come out of a small hole in a
vessel full of gas. It takes a certain energy to reach the hole and
escape.
- We have an intuitive feeling that not all molecules will come out, only
certain ones, and it might not happen often.
- What are our expectations (predictions) about this awaited event?
- Related questions are:
- What is the chance of a hot molecule?
- How often will we see a hot molecule?
- Should we bet on the occurrence a hot molecule’s arrival?
- Should we bet on the time of a hot molecule’s arrival?
|
|
3
|
- We end up trying to express this notion by using one undefined word to
define another. Out of our
ruminations come words like
- Chance
- Probability
- Frequency
- Likelihood
- How often
- Odds
- Predictability
- and we struggle to give meaning to one without using another as part of
the definition.
- Even though we cannot seem to nail this down, the idea is simple: there
is uncertainty involved and we are trying to measure this uncertainty
somehow.
|
|
4
|
- Laplace:
- “advanced knowledge”
- Suggested idea:
- A measure of uncertainty
|
|
5
|
- In reality, we are saying that we don’t know how to express the formula
or law of nature in closed form.
- When we do know the closed form, we say that the process is deterministic
- So, maybe another good intuitive
idea of probability could be
- A measure* of non-determinism
- *The word ‘measure’ here is in
the everyday sense. We will use
this word more rigorously as well, later on.
|
|
6
|
- A working definition:
- When the measure of determinism (sci probability) is 1, the process is deterministic. When the measure is <1, it is random.
|
|
7
|
- Sometimes deterministic processes can mimic randomness – you can’t
always tell by looking at the outcomes.
- For example, here is a chaos generator (non-linear eqn)
- used as a pseudo random number generator *
|
|
8
|
|
|
9
|
|
|
10
|
- We can estimate it based on observations of frequency.
- We define probability as frequency.
|
|
11
|
- For example, we witness a set of outcomes of an experiment. If we keep repeating the experiment, a
very large number of times, and count a specific outcome x, then
|
|
12
|
- Laplace:
- The sun has risen 1,826,250 times in 5000 years
- Odds are 1,826,251:1 that it will rise tomorrow
- Right?
- Feller points out that this may not be a probabilistic event and that
records of other sun-risings are irrelevant. Restated: The sun rising or
not may not be among the outcome choices.
|
|
13
|
- Why an abstract and non intuitive definition?
- Because development of the underlying algebra (Borel Algebra) and the
functions defined on this algebra provide the rationale for all of the
so-called ‘laws’ of probability.
They are not really axioms; they are constraints and consequences
which follow from the definition of a probability space.
|
|
14
|
- An outcome set Ω
- A smallest Boolean algebra on that set with countably infinite unions of
subsets (ie a smallest σ-algebra ,S
)
- A function P with the countable additivity property
- (Ω,S, P) defines a
probability space
|
|
15
|
|
|
16
|
|
|
17
|
|
|
18
|
- Define a function P on the s-Algebra
|
|
19
|
|
|
20
|
- So, this last property of P is the magic that we await.
- From this abstract idea, with a rigorous definition of this function P,
we have a theoretical framework for what we always knew:
- If a number of subsets of S are chosen such that they are mutually
exclusive, then the probability of their union is equal to the sum of
their individual probabilities.
|
|
21
|
- On the non-negative countably additive sets in the σ-algebra, define a measure μ(A).
μ(A) ¹Æ
- μ is a probability measure, and the
countable additivity property ensures that all probability measures in Ω
sum to 1.0
|
|
22
|
|
|
23
|
- Random Variable: A numerical
value (label) , V, assigned to
each event in a collection of events , which
indicates the outcome of a
Random experiment.
- Random Variables don’t necessarily sum to 1, nor are they in [0,1]
- RVs are associated with an underlying probability governing their
occurrence
|
|
24
|
- A die has 6 possible outcomes.
The number on the face of the die is a random variable
(1,2,3,4,5,6)
- The probability underlying each random variable is .18667
|
|
25
|
- So, given a set of outcomes (RVs), what is the probability of each
RV? The set of outcomes, each
associated with its underlying probability, is called a probability
density.
- Example Throw one of 2 loaded
dice (choice of die is random and
equiprobable)
- One die {1,2,3,4,5,6}
- The other die {1,1,1,4,5,6}
- The
density is
|
|
26
|
- Distribution Function: A
function, F(X),which gives the probability that an outcome of an experiment will
have its random variable value
,V, less than or equal to X.
The probability that V will be in the interval [a,b ] , is
given by
prob ( a £ V £ b)
= F(b) -
F(a)]
- Density Function: The Derivative (if it exists) of the
distribution function is the Probability density function
|
|
27
|
|
|
28
|
|
|
29
|
- The precise terms
- Cumulative Distribution Function
- and its derivative
- Density Function
- are frequently used carelessly and,
in error, interchangeably.
The density function is very often carelessly called
- the ‘probability distribution’
- Yet, we appropriately can call the process characterized by the density
- ‘the distribution’ or the ‘probability
distribution’
- …Very confusing indeed
|
|
30
|
- Discrete probabilities (p(Heads), p(Tails))
- Uniform Density p(x)=k , xÎ[0,1] | òp(x)dx=1
- Exponential Densities
|
|
31
|
|
|
32
|
|
|
33
|
|
|
34
|
|
|
35
|
- The density of the average* (or sum) of independent identically
distributed RVs taken from ANY** process, regardless of its probability
density, approaches a Gaussian density in the limit.
|
|
36
|
- When we take an average (mean) of a set of RVs, we are summing the values
of the RVs, then dividing by the number
of values. There is no inherent tie to the probability that each
of those values might arise.
|
|
37
|
|
|
38
|
- Notice that the mean ‘ignores’
the underlying probabilities, giving little insight to the true
character of the data.
- As an extreme example, consider drawing 2 balls from an urn of one million balls, each labeled with a
number.
- 999,999 balls are labeled 1, and one ball is labeled 999,999
- The mean of the set is 1.999998
- If we were to draw two balls, a 1 and a 999,999, the mean value of the
sample subset would be 500,000, yet we ‘know’ that the mother set does
not really look like that
|
|
39
|
- Let us now tie the value of each RV to the probability of its occurrence
- Suppose the value of the rv is vi. Then, compute, for each rv, vi × prob(vi)
- Then the EXPECTATION, or EXPECTED VALUE,
of the set of RVs is ∑vip(vi)
- For a continuous RV v,
|
|
40
|
- Again consider the same set {5,1,1,1,.1,2.9}
- The expectation of this set of RVs, given the same respective underlying
probabilities, is
|
|
41
|
|
|
42
|
- Although it seems a bit contrived, the Expectation is a powerful
notion. We use Expectation to
define another notion: moments.
- Moments characterize a random process.
Under specific conditions,
they are said to be the parameters of the process
- In many random processes that you are familiar with, if the process in normalized,
these moments become equivalent to ideas you are very familiar with such as mean and variance
|
|
43
|
- More formally, the nth moment of a random process X is the Expectation
of the nth power of X
- 1st moment E(X)
- 2nd moment E(X2)
- 3rd moment E(X3)
- ……….
- nth moment E(Xn)
|
|
44
|
- If the process is GAUSSIAN, the first moment is the mean.
- The nth central moment is computed by removing the first moment (mean)
before finding the Expectation
|
|
45
|
- We define variance in terms of moments:
- Var(x)=E(X2)-[E(X)]2
- The variance is thus the second moment minus the square of the first
moment.
- Note that the variance and the second moment are the same if the
process is normalized and GAUSSIAN
|
|
46
|
- Consider the 3rd central moment
- E(X3)=E([X-E(X)]3)
- This moment is known as skewness.
It describes how ‘off-center’ the process is. Look at the Gumbel
density.
- What is this moment for a Gaussian density?
|
|
47
|
- The 4th moment has no real utility, but the 4th
moment minus 3 times the square of the second moment, E(X4) - 3 [E(X2)] 2
- is very useful.
- This is called kurtosis. Kurtosis is a measure of ‘pointy-ness’ (+) kurtosis is ‘pointy’, (-) kurtosis is ‘flat’
- Aside from its general descriptiveness as a parameter, it s a value of
great interest in Independent Component Analysis.
- What is the kurtosis of a Gaussian process?
|
|
48
|
- If X and Y are independent
- E(X+Y)=E(X) + E(Y)
- E(XY)=E(X) E(Y)
- Var(X + Y)= Var(X) + Var(Y)
|
|
49
|
- The probability of more than one simultaneous outcome from the same
process (probability distribution)
- Ex: Draw 2 labeled balls from an urn
What is the probability of that particular combination of labels?
- Answer: The joint probability, derived from of the probability of each ball
|
|
50
|
- What if the outcome of, say, two events,
depends upon both events having happened.
- Restated: What if the outcome of one must occur before the other can
occur?
|
|
51
|
|
|
52
|
|
|
53
|
- Given that 1 is a boy
- What is the probability that both are boys?
|
|
54
|
- {B,B} .25
- {B,G} .25
- {G,B} .25
- {G,G} .25
|
|
55
|
- Let F represent BOTH are BOYS
- Let E represent AT LEAST ONE is a BOY
- We then seek P(F|E)
|
|
56
|
|
|
57
|
|
|
58
|
- Events are independent iff:
- P(A ÙB) =P(A) ´ P(B)
- events A ,B
- Restated:
- P(AB)=P(A)P(B) Û
A and B are independent
- P(A|B)=P(A) Û A and B are
independent
|
|
59
|
- Draw a card: Suits are
independent of numbers
- P(Ace of Spades)= P(Ace) P(Spade)
- =(1/13)(1/4)
- Roll 2 dice:
- P(One on first die and even face on second die)
- =P(One)P(Even)
- = (1/6)(3/6)
|
|
60
|
|
|
61
|
|
|
62
|
|
|
63
|
|
|
64
|
|
|
65
|
|
|
66
|
|
|
67
|
- The Markov property holds in many natural stochastic processes
- The Markov assumption is that there is no memory of any step before the
current step
- A stochastic process with the Markov property is called a Markov (or
memoryless) process
- We will encounter many, many discrete Markov processes in this course
|
|
68
|
|
|
69
|
- Cov(X,Y)=E ( [X-E(X)] – [Y-E(Y) ]
- =E(XY) - E(X)E(Y)
- If X and Y are independent,
- E(XY)=E(X) E(Y), so Cov (XY)
must = 0
|
|
70
|
- Often we are given P(A|B),
- But we are really interested in P(B|A)
|
|
71
|
|
|
72
|
- POSTERIOR µ PRIOR * LIKELIHOOD
|
|
73
|
- What to use in a Bayes calculation if you don’t know the prior
|
|
74
|
- 48 blacks and 259 whites took a test for promotion
- 26 blacks passed, 22 failed P(P)
= 0.54
- 206 whites passed, 53 failed P(P) = 0.795
|
|
75
|
- P(B|P) = 26/232 = 0.11
- P(W|P) = 206/232 = .89
- ‘Conclusion*’ : Possibility of discrimination
|
|
76
|
|
|
77
|
|
|
78
|
|
|
79
|
- P(P|B) = 25/48 = 0.54
- P(P|W) = 206/259 = 0.80
|
|
80
|
|
|
81
|
- When there are 2 or more possible explanations for an observed event, we
want to choose the explanation with the fewest and simplest assumptions
- This is OCCAM’S razor: it ‘cuts’ away the excess verbiage, conditions,
complications, and unnecessary logic
|
|
82
|
- We will always assume that a system will seek its lowest energy state
- Equivalently
- We will always assume that a system will always seek its highest entropy
state
- Equivalently
- The least (Kolmogorov) complexity is the best
|
|
83
|
|
|
84
|
- Entropy (real number ³0) is a measure of disorder
- High entropy
- High disorder
- Much information needed to specify all the states
- Low entropy
- Well organized
- Little or no information needed to specify all the states
|
|
85
|
|
|
86
|
- A bridge between statistical thermodynamics and information
- If you can know (or guess) about each particle in the system (say, a
gas), you can determine the entropy of the system
|
|
87
|
- Likewise, you can measure the information in a message by knowing (or
guessing) the probability of each element of the message.
- Information relates to entropy through probability as:
- S= -p(x)log2p(x)
- where S is entropy, p(x) is
the probability of event x
|
|
88
|
- Shannon generalized this for a set of events in a system and for letters
in an alphabet.
|
|
89
|
- Shannon theorem for codes forms basis
of all data compression schemes.
- The Entropy of an alphabet is the lowest number of average bits per
character achievable in any coding scheme.
- The ASCII code uses 8 bits per character--The entropy of English is
about 5 bits per character--Therefore it pays to compress
|
|
90
|
- Example:
- Given AATGATGCTGCAAATAAGTA
- The frequencies (probabilities) of the bases are
|
|
91
|
|
|
92
|
- The MDL is the length of the most parsimonious way to encode a message
without an error
- MDL is usually measured in bits
|
|
93
|
|
|
94
|
|
|
95
|
- Suppose we have an efficient programming language, similar to, say, C#
- We will represent the length of the program as one byte for each
character and one byte for each token.
- Any integer number N can be represented by log2(N) bytes,
rather than using a byte for each symbol (digit) . We will not allow
fractions of bytes in this example
|
|
96
|
|
|
97
|
|
|
98
|
|
|
99
|
|
|
100
|
|