Probability

1  Events and probability

Example

Suppose we roll a fair six-sided die. The set of possible outcomes is W = {1,2,3,4,5,6}. We can consider many possible events, e.g. ``the result is 5'', ``the result is at least 4'', ``the result is divisible by 2''.

Since the die is fair we would say that all 6 outcomes are equally likely, with each having probability 1/6. So writing P for Probability, P(result is i) = 1/6 for i=1,2,...,6 and P(result is at least 4) = P(4,5,6) = 3/6.

1.1  Outcomes and events

Consider an experiment with a set of possible outcomes W.

Examples

  1. Tossing a coin. W = {H,T}. Event ``getting a head'': A = {H}.
  2. Rolling a die twice. W = {(i,j): i,j=1,2,...,6}. Event ``obtaining a total of 4'': A = {(1,3),(2,2),(3,1)}.
  3. Measuring the lifetime of a lightbulb. W = [0,¥). Possible event ``the bulb is still working after t time units'': A = (t,¥).
  4. Record the price of a share over a period of trading of length T. W = {f:[0,T] ® [0,¥]}. Possible event ``the price never falls below L'': A = {f:[0,T] ® [L,¥)}.
For example 2, if the outcome of the experiment is (2,2) then then event A occurs.

For events A,B Í W,

1.2  Elementary probability

If the sample space W is finite and if each sample point w Î W is equally likely then we can consider the special case in which the probability of an event A is defined to be P(A) = |A|/|W|.

Example

A random number generator generates R random digits. For k = 0,1, ...,9 find the probability that:
  1. no digit exceeds k.
  2. k is the greatest digit generated.
Take the first sentence to mean that W = {(d1,d2,...,dr): di = 0,1,...,9; i = 1,2,...,r} with each of the 10r sample points equally likely.
  1. The event of interest Ak = {(d1,d2,...,dr): di = 0,1,...,k; i = 1,2,...,r}. Here |Ak| = (k+1)r, so P(Ak) = |Ak|/|W| = (k+1)r/10r.
  2. Write Bk for the event that k is the greatest digit generated. Bk = Ak \ Ak-1 (with A-1 = Ø). Now Ak-1 Í Ak so |Bk| = |Ak|-|Ak-1| = (k+1)r - kr. Thus P(Bk) = |Bk|/|W| = (k+1)r-kr/10r.

1.3  Counting

Ordered selection

Suppose we have n balls numbered 1,...,n in a box and that we choose them sequentially. There are n! possible outcomes. If only r of the balls are chosen, the number of possible outcomes is n(n-1)...(n-r+1). This procedure is called sampling without replacement.

If the balls are retuned to the box before the next choice is made, the procedure is called sampling with replacement1. The number of possible outcomes when r choices are made is nr.

Examples

In a group of r people, what's the probability that two or more have the same birthday? Write bi for the birthday (day of year) of the ith person. Then W = {(b1,...,br) : bi=1,2,...,365; i = 1,2,...,r}. Assume that all 365r outcomes are equally likely. Write A = {two or more people share a birthday}. Then AC = {all r people have different birthdays} = 365 × 364 × ... × (365-r+1) = 365r. Since |A| = |W|-|AC| = 365r - 365r, P(A) = 1 - 365r/365r.

Unordered selection

Recall that no. of subsets of {1,...,n} with r elements is n r = n!/r!(n-r)!. More generally there are n!/n1!n2!... nm! ways of partitioning the set {1, ...,n} into a first n1-subset, a second n2-subset, ..., and an mth nm-subset, where åk=1m nk = n.

Example

What is the probability of the event A that a hand in bridge (13 cards) contains 5 spades, 4 hearts, 3 diamonds and 1 club? No. of hands of cards = 52 13. No. of hands with 5S, 4H, 3D, 1C is 13 513 413 313 1, so
P(A) =
|A|
|W|
=
13 513 4 13 313 1
52 13
.

1.4  Probability measures

A collection of subsets of W is called an event space or a s-field if
  1. W, Ø Î
  2. A Î AC Î
  3. A1, A2, ... Î Èi=1¥ Ai Î
Note that since A Ç B = (AC È BC)C, A, B Î A Ç B Î and more generally A1,A2, ... Î Çi=1¥Ai Î .

The informal idea is that in a particular application with sample space W, the event space corresponds to the collection of events ``of interest''. (We will se that ``of interest'' means those sets whose probabilities we may wish to know or calculate.)

Examples

  1. = {all subsets of W}
  2. = {Ø, W} or = {Ø, A, AC, W} for some A Î W.
Definition A function P: is called a probability measure if
  1. 0 £ P(A) £ 1    " A Î
  2. P(W) = 1
  3. If A1, A2, ... are disjoint events in then P(Èi=1¥) = åi=1¥ P(Ai). The number P(A) is called the probability of the event A. The triple (W, , P) is called a probability space.
Technical aside: formally P(A) is only defined for A Î . We will adopt the convention throughout that unless otherwise stated all subsets of interest belong to .


Proposition A probability measure P satisfies:
  1. P(AC) = 1-P(A)
  2. P(Ø) = 0
  3. If A Í B then P(A) £ P(B).
  4. P(A È B) = P(A) + P(B) - P(A Ç B)
Proof
  1. From II and III, 1 = P(W) = P(A È AC) = P(A)+P(AC), since A Ç AC = Ø.
  2. P(Ø) = P(WC) = 1-P(W) (by 1) = 1-1 (by II) =0.
  3. For A Í BB = AÈ (B Ç AC) (disjoint). From III, P(B) = P(A) + P(B Ç AC) ³ P(A), since P(B Ç AC) ³ 0 by I.
  4. Write A È B = A È (B Ç AC) and B = (B Ç A) È (B Ç AC) (disjoint). From III P(A È B) = P(A) + P(B Ç AC) (*) and P(B) = P(B Ç A) + P(B Ç AC) (**). (iv) follows by subtracting (**) from (*) and rearranging.
Theorem (Inclusion-Exclusion formula)
P(
n
È
i=1
Ai)
=
 
å
i=1
n P(Ai) -
 
å
i<j
P(Ai Ç Aj) + ...
   
+(-1)r-1
 
å
i1<i2<...<ir
P(Ai1 Ç ... Ç Air) + ... + (-1)n-1 P(A1 Ç ... Ç An)
The summation åi1<i2<...<ir means summation over all n r sets of indices which are subsets of 1, ..., n of size r.

Proof by induction
Property (iv) of the previous theorem gives the result for n=2. Further, by (iv) again,
P(A1 È ... È An) = P((A1 È ... An-1) È An)
  = P(A1 È ... È An-1) + P(An) - P((A1 È ... È An-1) Ç An)
  = P(A1 È ... È An-1) + P(An) - P((A1 Ç An) È ... È (An-1 Ç An))
Now assume the result for n-1 and substitute for first and third terms to give the result.

Example A group of n people place their coats in a pile. Later they each take one coat at random from the pile. What is the probability that at least one person has their own coat?

Take as sample space the n! possible assignments of coats to people. Let Ak be the event that person k has their own coat. To get P(Èi=1n Ai use inclusion-exclusion formula.

Note that
P(Ai1 Ç Ai2 Ç ... Ç Air =
(n-r)!
n!
Thus
 
å
i1<i2<...<ir
p(Ai1 Ç ... Air
=
n r
(n-r)!
n!
=
1
r!
,
and so P(
n
È
i=1
Ai)
=
1 -
1
2!
+
1
3!
- ··· + (-1)(n-1)
1
n!
1-e-1 as n ¥

1.5  Conditional probability and independence

Sometimes we may have partial information about the outcome of an experiment. In general this partial information will change the calculation of probabilities. For example, having thrown a fair die, we may be todl that the outcome is even. Then the probability of a 1, 3 or 5 becomes zero, and the probability of a 2, 4 or 6 becomes 1/3.

Definition Provided P(B)>0, define a conditional probability of A given B, written P(A|B), by P(A|B) = P(A Ç B)/P(B).

Remark For fixed B with P(B)>0 we can define a new function Q on by Q(A) º P(A|B). It is straightforward to check that Q is a probability measure.

Example You are playing poker. Define R º ``I have a royal flush'', E º ``My hand contains Aª''. Find P(R|E).
  1. By definition,
    P(R|E) =
    P(R Ç E)
    P(E)
    =
    1/52 5
    51 4/52 5
    =
    1
    52 4


  2. From first principles. Assume one card is Aª and consider new experiment relating only to the unknown values of the other four cards. Each of the 51 4 outcomes is equally likely and exactly one will result in a royal flush. The required condition probability is thus 1/51 4.
Note that P(R) = 4/52 5, and so P(R|E) = 13/5 P(R) > P(R). Knowledge that E has occurred changes (here increases) the probability of R occurring.

Example A hat contains three cards. One card is black on both sides, one is black on one side and white on the other, and one is white on both sides. A card is drawn at random and placed on the table. The visible side is black. What is the probability that the other side is black?

Label the faces of the cards b1,b2 for black-black, w1, w2 for white-white and b3,w3 for black-white. Sample space W = { (b1,b2),(b2,b1),(w1,w2),(w2,w1),(b3,w3),(w3,b3) } (first number is upper face). All six outcomes equally likely.

Define event BU = {(b1,b2),(b2,b1),(b3,w3)} (black uppermost) and BD = {(b1,b2),(b2,b1), (w3,b3)} (black downermost). Then
P(BD|BU) =
P(BD Ç BU)
P(BU)
=
P({(b1,b2),(b2,b1)})
P(BU)
=
2/6
3/6
=
2
3
.

Theorem (Properties of conditional probability)
  1. P(A Ç B) = P(A)P(B|A) = P(B)P(A|B)
  2. P(A Ç B Ç C) = P(A)P(B|A)P(C|A Ç B)
  3. P(A|B Ç C) = P(A Ç B|C)/P(B|C)
Proof Immediate from definition of conditional probability.


We call events A and B independent if the occurrence of one of them does not affect the probability of the other.

Definition Events A and B are independent if P(A Ç B) = P(A)P(B). More generally, a colelction of events Ai (i Î I) are independent if P(Çi Î J Ai) = Õi Î J P(Ai) for all finite subsets J of I.

Note that A and B independent P(A|B) = P(A) and P(B|A) = P(B).

Example Two dice are thrown. The sample space W = (i,j): 1 £ i,j £ 6 has 36 equally likely outcomes. Let A1 = {first die odd},   A2 = {second die odd},  A3 = { sum is odd}. Are A1, A2 independent? P(A1) = P(A2) = 18/36,   P(A1 Ç A2) = 9/36. Then P(A1 Ç A2) = P(A1)P(A2), so yes.

Similarly A2, A3 and A1,A3 are independent so the three events are pairwise independent.

Are A1, A2, A3 independent? P(A1 Ç A2 Ç A3) = 0 ¹ 1/8 = P(A1)P(A2)P(A3), so no.

Example You are asked a series of questions and you get each right with probability p. The outcomes of different questions are independent (A sequence of Bernoulli trials). The probability that the first correct answer is at the rth question is pr=(1-p)r-1p. Since år=1¥ pr=1 you've got to get a question right eventually.

Theorem (Law of Total Probability / Partition Theorem) Let B1, B2,...,Bn be a partition of W. Then P(A)=åi=1n P(A|Bi)P(Bi).

[s]Proof If i ¹ j then A Ç Bi and A Ç Bj are disjoint. Then
n
å
i=1
P(A|Bi)P(Bi)=å P(AÇ Bi) = P(
n
È
i=1
(AÇ Bi)) = P(A)

Example The probability I remember to lock my bike is 19/20. If I forget to lock it, it's stolen with probability 4/5, while if I do lock it it's stolen with probability 1/100. What is the probability that it's stolen tomorrow?

Let L={bike locked}, S={bike stolen}. Then P(L)=19/20 and P(LC)=1/20. P(S|L)=1/100, P(S|LC)=4/5. Now P(S)=P(S|L)P(L) + P(S|LC)P(LC)=99/2000=0.0495 by above theorem using the partition L, LC.

Theorem (Bayes Formula) Let B1,B2,...,Bn be a partition of W. Then
P(Bi|A)=
P(A|Bi)P(Bi)
n
å
j=1
P(A|Bj)P(Bj)
[s]Proof
P(Bi|A)=
P(BiÇ A)
P(A)
=
P(A|Bi)P(Bi)
n
å
j=1
P(A|Bj)P(Bj)
(top half by definition of conditional probability, bottom by previous result)

Example A diseae occurs by chance in one in every 200 people. A random person is tested; if they have the disease, the test will correctly say so with probability 0.95; if not, the test will wrongly say they do with probability 0.01. Find the probability that (i) The test says a person has the disease (ii) the person has the disease given that the test says they do.

Let D={person has disease}, A={test says they do}.
  1. P(A)=P(A|D)P(D)+P(A|DC)P(DC) (using partition D, DC) =0.95× (1/200) + 0.01 × (199/200) = 0.0147.
  2. By Bayes formula, P(D|A)=(P(A|D)P(D)/0.0147 =0.95× (1/200)/0.0147 = 95/294 » 0.32.
Suppose instead that the person is being examined by a doctor because they are feeling ill. After the examination but before the test, the doctor reckons P(D)=0.3. Then P(A)=0.95× (3/10) + 0.01 × (7/10) = 0.292 and P(D|A)=P(A|D)P(D)/0.292 = 0.95 × (3/10)/0.292 = 285/292 » 0.976.

2  Discrete Random Variables

Example

A roulette wheel has 38 slots: 18 red, 18 black, 2 green. Suppose the red slots are labelled 1,3,...,35, the black slots 2,4,...,36 and the green 0 and 00. A gambler bets £1 on red. She wins £1 for red and loses otherwise. So W = {00, 0, 1, ..., 36} and if the outcome is w, the gambler wins X(w) where X(1)=X(3)=...=X(35)=1, X(00)=X(0)=X(2)=...=X(36)=-1. The function X:W is a discrete random variable.

Take W to be finite or countable, with P defined on all subsets of W.

Definition A discrete random variable S is a real-valued function defined on the sample space W.

Technical aside If we do not assume that W is countable then X:W is a discrete random variable on the probability space (W, , P) if
  1. {X(w):wÎW} is a countable set
  2. {w Î W: X(w) = x} Î " x Î
(1) ensures X is discrete (takes a countable set of values). (2) needed to ensure that the event ``X takes value X'' belongs to event space .

Example

Suppose you throw two dice: W = {(i,j): 1 £ i,j £ 6}. We can define random variables Y and Z by Y(i,j)=i+j and Z(i,j)=max(i,j).

Write RX for the range of X, i.e. the set of values that the random variable X can take.

Let RX={xi: i=1,2,...} (finite or countable since W is). Write P(X=xi)=P({X=xi})=P({wÎW:X(w)=xi})
=
 
å
w:X(w)=xi
P({w})
Further, for A Ì , P(X Î A)=P({w Î W:X(w) Î A})
=
 
å
w:X(w)Î A
P({w}) =
 
å
xiÎ A
 
å
w:X(w)=xi
P({w})=
 
å
xiÎ A
P(X=xi)

Examples

In the roulette example, RX = {-1,1}. P(X=1)=p1+p2+...+p35 =18/38=9/19. pi = P{outcome is i}. P(X=-1)=p00+p0+p2 +...+p36=20/38=10/19.

Dice example: RY={2,3,...,12}. P(Y³ 11)=P(Y=11)+P(Y=12) =2/36+1/36=1/12.

Definition The probabilities P(X=xi), xiÎ RX are referred to as the (probability) distribution of the random variable X. The function pX:RX [0,1] defined by pX(Xi)=P(X=xi) is called the probability mass function of X. Sometimes we abbreviate pX(x) to p(x). Note that since P(W)=1, åxÎ RXpX(x)=1.

For roulette example, p(-1)=10/19, p(1)=9/19.

Examples

  1. A random variable X with RX={0,1} is said to have a Bernoulli distribution if P(X=1)=p, P(X=0)1-p. e.g. conduct an experiment with two outcomes called ``success'' and ``failure'' and let p be the probability of success. Associate X=1 with success, X=0 with failure.

  2. A random variable X with RX={0,...,n} is said to have a Binonmial distribution with parameters n and p if P(X=k) = n kpk(1-p) n-k k=0,...,n. e.g. X is no. of successes in n independent trials of the experiment in (1).

  3. A random variable X with RX={1,2,...} is said to have a Geometric distribution with parameter p if P(X=k)=(1-p)k-1p   k=1,2,.... e.g. repeat independent trials of experiment in (1) and define X to be number of trials required to first get a success.

  4. Suppose l>0 is fixed. A random variable X with RX={0,1,...} is said to have a Poisson distribution with parameter l if P(X=k) = (e-llk)/k! for k=0,1,2,....

    It turns out that the Poisson distribution provides a good description of the numbers of ``rare'' events over some time period, e.g. no. of fatal accidents in a region in a year or number of paricles emitted by a radioactive sauce.

2.1  Expectation

Definition The expectation (or mean) of a random variable X is the number E(X) = åx Î RXxP(X=x) = åx Î RXxpx(x) provided the sum converges absolutely. (i.e. provided åx Î RX |xP(X=x)|<¥)

Examples

  1. Bernoulli (p) distribution: P(X=0)=(1-p), P(X=1)=p, E(X)= 0.(1-p)+1.p=p.
  2. Poisson (l) distribution.
    E(X)=
    ¥
    å
    x=0
    P(X=x) =
    ¥
    å
    x=0
    x
    e-llx
    x!
    (Put x-1=k)    =e-ll
    ¥
    å
    k=0
    lk
    k!
    = e-ll el = l

2.2  Functions of random variables

If X is a discrete random variable and g:RX is any function, then Y=g(X) is also a random variable defined by Y(w)=g(X(w)) for wÎW.

e.g. for constants a,b,c, Y1=aX+b, Y2=(X-c)2 are random variables taking values aX(w)+b, (X(w)-c)2, wÎW. In general, for Y=g(X), P(Y=y)=P(g(X)=y)=åxg(x)=yP(X=x). The expectation of the random variable g(X) is
E(g(X)) =
 
å
yÎ RY
yP(g(x)=y) =
 
å
yÎ RY
y
 
å
xg(x)=y
P(X=x) =
 
å
xÎ RX
g(x)P(X=x)

Theorem (Properties of Expectation) Suppose a and b are constants.
  1. If X³ a (i.e. if X(w)³ a"wÎW) then E(X)³ a.
  2. If P(X=b)=1 then E(X)=b.
  3. E(aX+b)=aE(X)+b
  4. E(g(X)+h(X))=E(g(X))+E(h(X))
[s]Proof
  1. E(X) = åxÎ RXxP(X=x)³åxÎ RXaP(X=x)=a
  2. P(X=x) = 1if x=b so E(X)=1.b+0=b 0otherwise
  3. E(aX+b) =
     
    å
    xÎ RX
    (ax+b)P(X=x) =
    a
     
    å
    xÎ RX
    xP(X=x) +b
     
    å
    xÎ RX
    P(X=x) = aE(X)+b
  4. E(g(X)+h(X)) =
     
    å
    xÎ RX
    (g(x)+h(x))P(X=x)
    =
     
    å
    xÎ RX
    g(x)P(X=x)+
     
    å
    xÎ RX
    h(x)P(X=x) = E(g(x)) + E(h(X))
Definition The variance of a random variable X, usually written Var(X), is defined by Var(X)=E((X-E(X))2). So writing µ=E(X), Var(X) = E((X-µ)2) = åxÎ RX(x-µ)2P(X=x).

The variance of a random variable means how ``spread out'' or dispersed its distribution is around its mean.

Example

Suppose X=-1with prob. 1/21with prob. 1/2 and Y=-100with prob. 1/2100with prob. 1/2. Then E(X)=E(Y)=0. Var(X)=1/2(-1-0)2+1/2(1-0)2=1, Var(Y)=1/2(-100-0)2+1/2(100-0)2=10000.

Theorem (Properties of Variance)
  1. Var(X) ³ 0.
  2. If a,b are constants then Var(aX+b)=a2Var(X).
  3. Var(X)=E(X2)-[E(X)]2
[s]Proof
  1. Var(X)=åxÎ RX(x-µ)2P(X=x) ³ åxÎ RX 0.P(X=x) = 0.
  2. Var(aX+b) = E[(aX+b-E(aX+b))2] = E[(aX+b-aE(X)-b)2] = E[(a(X-E(X)))2] = a2Var(X)
  3. Var(X) = E((X-µ)2) = E(X2-2µ X2) = E(X2)-2µ E(X) +µ 2 = E(X2) - 2µµ + µ2 = E(X2)-µ2

Examples

  1. Bernoulli (p) distribution. P(X=1)=p=1-P(X=0). E(X2) = åxÎ RXx2P(X=x)=12.p+02(1-p)=p. Var(X) = E(X2) - (E(X))2 = p-p2 = p(1-p).

  2. Poisson (l) distribution: P(X=x)=e-l lx/x!  x=0,1,...
E(X2) =
¥
å
k=0
k2
e-llk
k!
= l e-l
¥
å
k=1
k
lk-1
(k-1)!
  =
l e-l é
ê
ê
ë
¥
å
k=1
(k-1)
l(k-1)
(k-1)!
+
¥
å
k=1
1.
lk-1
(k-1)!
ù
ú
ú
û
  =
l e-l é
ê
ê
ë
l
¥
å
k=2
lk-2
(k-2)!
+
¥
å
k=1
lk-1
(k-1)!
ù
ú
ú
û
= l e-l(l el + el) = ll


1
surprisingly enough

This document was translated from LATEX by HEVEA.