Appendix: Uncertainty, Probability and Reasoning
Uncertainty and Making Decisions
In many situations, whether in business, society, the environment, the clinic, or the laboratory, we are able to monitor a wider and wider variety of performance parameters and events. For effective knowledge distillation, interpretation, prediction, and proactive intervention we need to develop suitable methods and algorithms. When we look at a complex or complicated system and make observations, what do we learn? What do the observations tell us about the object of our attention? The pace of change of many technologies means that concepts and methods should ideally be adaptive to new inputs, should be self-learning or self-tuning, and hence may be applied even as the underlying nature of the activity and the process of observation evolve. Analysts must deal with evolving uncertainties, express their estimates of these, and incorporate reasoning to make their understanding clear: they need methods that are genuinely much smarter.
By ‘smarter’ we do not mean making algorithms (or software) easy to set up or to integrate across data sources (that is called deployability), or the ability to provide and transmit content and support for alerts. All of this is just efficient generation and use of ‘intelligence’, used in the sense of ‘military intelligence’ (the well-known oxymoron). What we mean by ‘smarter’ is that the algorithms can reason in a way that is consistent with the way that we ourselves think; that the algorithms are logical, and extend logic to common-sense inferential rules; that the algorithms are used to consistently and continuously update the current view as to what is (or is not) certain. Moreover, a one-time decision (to change, to invest, to treat, to investigate, to interrupt, or to terminate) may need to be taken for management reasons at some particular moment in time. In this case, best estimates of an evolving situation are required (What will happen under alternative decision scenarios? What is happening right now?). There is no time for the analysts to say, ‘There is not enough data’ or ‘We do not know whether something aberrant can happen: no danger thresholds have yet been exceeded’. Instead, decision-making requires the support of our best current recommendations, even if there is little data. Designing algorithms to give good decision support requires us to draw the best possible inferences we can, based on everything we know to date (our prior knowledge), and our current incoming information (observations, data) about events.
What may be surprising is that properly viewed, probability theory is really just an extension of logic. It is not simply about making estimates of chance events, but it is far more powerful. Rather than asking, ‘We know how the system works and what its uncertainties are: what is the chance of it producing an output like X?’; we want to ask ‘Having observed an output like X, what does that tell us further about how the system works, and what its uncertainties are?’
A moment’s thought and you will see that almost all plausible reasoning (from common sense to deduction) requires us to grapple with the second type of question. In business or in science this is ever more true, as we are able to capture or monitor more things. We observe events and results, and then we ask how that alters our view about what is happening.
A good introduction to plausible reasoning is a good grounding for any decision-making. Careful use of language, inferences, and knowledge of uncertainties flushes out what we do not know (where our uncertainties reside and derive from), and points us to the information that we need to make better justified decisions. The literature abounds with apparent paradoxes (legal, medical, engineering, commercial, etc.): these usually come from poor application of just a few simple principles, or from using the wrong conditional models to update our world view.
Hence an understanding of the material in this Appendix represents a skill for life.
Sources of Uncertainty
Uncertainties arise in different ways. There is the uncertainty over the truth of many propositions about the underlying system that we are addressing, and over the events that we are observing. The car is safe. The witness is lying. The economy will improve. The climate is warming. England will win the World Cup. What do our observations tell us about the system, and the truth of these statements?
Observations of any related events may change the uncertainty we have in asserting these propositions. To see this, we need to model the consequences of the proposition and alternative propositions, and see whether the observations are more consistent with one hypothesis or another.
Modellers, and especially mathematical modellers, are often uncertain as to which models they may apply to a given situation, though. They must make some decisions, some assumptions, and some omissions, based on their experience and sometimes on pragmatics. They must choose a class of model, and not just go to what they know, like men with hammers looking for nails. How can modellers compare one model against another? This uncertainty over the choice of model is called conceptual model uncertainty. Every assumption modellers make in setting out a model introduces uncertainty, and models which share assumptions are not independent: if a common underlying assumption becomes invalid then so do the models. When we observe some events that can or cannot be represented well by the behaviour of solutions to a model, we become more or less convinced of its usefulness. Our uncertainty in the conclusions or predictions from the outputs decreases or increases accordingly.
Given a particular model it may well have parameters that we do not know, or that we are uncertain of, yet we have some experience or ideas about their value. This is called parameter uncertainty. We need to use evidence or observation to make better estimates for the possible whereabouts of parameter values.
Our models may contain variable or stochastic, space- or time-dependent, processes that are parameterized. This is called volatility or variability. You could think of these quantities as sophisticated parameters if you wish. These things need to be modelled too. This introduces sub-models, with sub-model conceptual and parameter uncertainties.
So, even armed with our prior assumptions and beliefs, our models, and our methods of making them operational (solution methods, numerical calculations), we may well need to actually find some things out. These may be inferences, forecasts, predictions, or may be decisions based on our estimation of uncertain model outputs. The more that we know or observe, the more we hope to evolve our representation of the uncertainties concerning such outputs.
Sometimes we have other challenges. For example, we may be watching an animal, a person, or a machine, and trying to understand the nature of (and constraints upon) its ability to reason with uncertainty and respond to its own perceptions or observations. We might be building or hypothesizing systems that could reason with information or possible facts. In these models we have to represent a process of reasoning, where the currency itself (the state variable) is uncertainty and is represented over ‘possibility space’.
To face any or all of these challenges we should at first be very clear about the mathematics of uncertainty and reason.
Bayesian Probability
What a strange thing probability is. This point was almost entirely missed at school, where we were all taught an experimental view, with definitions derived from expectations: the probability of picking a heart from a standard pack of cards—try it out lots of times. We were encouraged to imagine doing the experiment over and over again. Then the probability was simply the number of successful events divided by the number of attempts. But this approach is only ever valid in the limit of an infinite number of experiments: we do not have the time or the imagination in complex cases. And what about situations where we cannot actually make or imagine repeated experiments because the question in mind is too unique, or we have only a very small amount of evidence?
In such cases, ‘probability’ still means something to us though: it can obviously represent our uncertainty surrounding the truth of a given proposition. A proposition is just a statement about some event which may happen, or has happened, or a statement about certain circumstances being correct:
‘The card I select will be a heart’,
‘It is raining’, ‘I am a liar’,
‘The next president of the US will be a woman’,
‘Ghosts exist’, or ‘Professor G is a murderer’.
Look at these examples: the more interesting, controversial, or salacious a proposition, the harder it is to imagine any thought experiments that could determine an agreeable absolute probability which will satisfy all onlookers. The probability that such a proposition is true is entirely subjective (I may know more about the subject or have more information than you), so our opinions, and our assessments of the residual uncertainties, are personal to us. Subjective probability is really the currency of logical reasoning. Even though our estimates for the probabilities may differ between us in absolute values, whenever any new evidence arrives both of our subjective probabilities should be updated in a consistent manner. That is, if we share the same starting point (so that our prior estimates are the same) then the new information and evidence should alter our estimates in the same way, so that we arrive at the same ‘posterior’ estimate of uncertainty after we have accounted for the new evidence. This consistent updating process is called Bayesian Updating, named after the Reverend Thomas Bayes.
A Crash Course in Plausible Reasoning
So let us start probability theory all over again. Let us leave aside what we have learned up to now. This is all we will ever need and we shall hold to these central points.
Resources: three excellent and invaluable books to dip into are Jaynes and Bretthorst (2003), D’Agostini (2003), and O’Hagan and Forster (2004) [83, 32, 109].
1. Plausibility. For any statement, called a proposition, we refer to the ‘plausibility’ of the proposition meaning our belief that it is true. Sometimes we will have alternative, mutually exclusive, propositions, usually called ‘hypotheses’, where we are certain that one of them is true, and yet we are uncertain as to which one. They may have different levels of plausibility associated with each of them.

2. A conditional and subjective quality. The plausibility of a proposition, meaning an estimation of its truth, is always conditional: it is dependent on every other piece of information that we have taken into account. Necessarily then, it is subjective too, since my plausibilities are conditional on my knowledge and yours are conditional on yours. In some fairly simple and therefore rather special circumstances (usually when we are drawing balls from urns) we can agree what the plausibilities are if we can agree on what other information (called assumptions) we will both consider.

3. Conditional notation. In order to stress the conditional nature of plausibility we write $A|X$ to represent the plausibility of the proposition that A is true given all background knowledge X, which stands for ‘everything that is known a priori’ when we determine the plausibility that A is true. Now suppose that we gain some new evidence E that was not in X. Then we will write $A|E,X$ to denote the new plausibility that A is true given X and given E.

4. A numerical scale for plausibilities. Next we wish to measure plausibility on a numerical scale: that is we want to represent the plausibility of the truth of any proposition by a real number on a scale running from zero for false or untrue, up to one for certain or true. Any numerical function measuring plausibility must satisfy some simple constraints on its manipulation and combination though. For example, if there is more than one order in which to assemble hybrid propositions (for example, A and B) then the function and its manipulation must produce consistent results. Under such simple constraints it turns out that there is a unique plausibility function, P, mapping propositions onto the interval zero to one, such that
(A4.1)$$P(A|X)+P(\text{not }A|X)=1,$$
for all propositions A; and
(A.5)$$P(A\text{ and }B|X)=P(A|B,X)\cdot P(B|X),$$
for all propositions A and B. We call this function P the probability. Notice that we have a sum rule and a product rule. The first condition (A4.1) allows us to sum up probabilities over mutually exclusive events, and if, for example, a set of m such events, $E_i$ say, exhausts the possibilities and the events are all equally probable (and thus permutable) then each must have probability $P(E_i|X)=1/m$. The second condition (A.5) says that we must use multiplication to work out probabilities for hybrid propositions (logical ‘and’s).
5. Derivation of rules. The bald fact is that these two rules can be derived from scratch (as solutions to certain functional equations given by Cox; see Jaynes and Bretthorst (2003) [83]), as a consequence of some very reasonable consistency requirements of plausible reasoning, and without any recourse to frequentist repetition of experiments. This is not splitting hairs. Now we are free to apply probability theory to situations which are not equivalent to any repeatable thought experiments, and probability (though necessarily subjective and in context) represents our (un)certainty in any and all propositions that we dare to consider in a self-consistent way. Have confidence: if we stick to (A4.1) and (A.5) and their direct descendants we shall not go wrong.

6. Combining independent probabilities. If A is independent of B, then knowledge about B has no impact on our knowledge about A. So $P(A|B,X)=P(A|X)$ and then we see that we can simply multiply up probabilities: $P(A\text{ and }B|X)=P(A|X)\cdot P(B|X)$. This is a consequence of the more general product rule (A.5), but in schools it is often taught first!

7. Bayes’ theorem. This is named after the Reverend Thomas Bayes, 1702–1761. His theory of probability was published posthumously in 1764. Newly observed evidence, an event E, meaning that ‘E is true’, updates our estimates of other probabilities in a consistent and repeatable way. Suppose we wish to consider the probability that A is true. Then with no further evidence we have $P(A|X)$. Knowing further evidence, E also, we can consider two ways to write $P(A\text{ and }E|X)$ from (A.5):
$$P(A\text{ and }E|X)=P(A|E,X)\cdot P(E|X)=P(E|A,X)\cdot P(A|X).$$
Hence we get the posterior (post E) probability, $P(A|E,X)$, updated from the prior probability, $P(A|X)$:
(A.6)$$P(A|E,X)=P(E|A,X)\cdot P(A|X)/P(E|X).$$
This last equation is known as Bayes’ theorem. It is often written in different ways. Suppose we have a set of two or more alternative, mutually exclusive, and exhaustive propositions, $A_i$ say. Then for each $A_i$, $P(A_i|X)$ is the prior probability. Once we know that ‘E is true’, we can use (A.6) to update these. Since all such equations contain the same denominator, $P(E|X)$, we often just write
(A.7)$$P(A_i|E,X)\propto P(E|A_i,X)\cdot P(A_i|X).$$
Then if the $A_i$s are exhaustive as well as exclusive, we can always normalize the right-hand sides so that they sum to unity, to obtain an expression that avoids having to deal with the term $P(E|X)$ which we may not know explicitly (though we can know it now!):
(A.8)$$P(A_i|E,X)=\frac{P(E|A_i,X)\cdot P(A_i|X)}{\sum_{j}P(E|A_j,X)\cdot P(A_j|X)}.$$
The terms $P(E|A_i,X)$ used in the updating (from prior to posterior for the $A_i$) can be thought of as model terms. Under each separate hypothesis that X and $A_i$ are true, we must have such a model available for calculating this probability for E.

8. Odds and odds notation. Sometimes it is easier and mathematically convenient to talk about odds rather than probabilities. If p denotes the probability $P(A|E,X)$ for some proposition, A, conditional on X and some event data, E, say, then the odds on A are simply given by
$$O(A|E,\dots,X)\equiv\frac{P(A|E,\dots,X)}{P(\text{not }A|E,\dots,X)}=\frac{p}{1-p}.$$
If we know the odds, $O(A|E,\dots,X)$, we can easily find $P(A|E,\dots,X)$ and vice versa. Just as we have prior and posterior probabilities, we can have the corresponding prior and posterior odds. Now consider (A.6) as it is written, and again with A replaced by its complement $\text{not }A$. Then taking the ratio we have Bayes’ theorem written for odds:
(A4.9)$$O(A|E,X)=\frac{P(E|A,X)}{P(E|\text{not }A,X)}\cdot O(A|X).$$
This formulation is very useful: it avoids the term $P(E|X)$ again since it is a ratio of two equations each with a division by that term. It says the posterior odds are equal to the prior odds multiplied by a ratio of the model terms. If we have a model $P(E|A,X)$ and a single model for $P(E|\text{not }A,X)$ then we can use these directly. If we have split $\text{not }A$ up into a greater number of mutually exclusive alternatives then this last formula gets a little more tricky. (See multiple hypothesis testing in Chapter 5).
The model ratio term in (A4.9) is often called a Bayes factor.
Before we go any further, let us imagine the kind of calculations we might make. If we observe a number of events which are all independent then we may apply (A4.9) successively. Each posterior, after each observation, becomes the prior as we update to account for the next observation. Since the events are independent we may simply multiply together all the Bayes factors so as to get the overall posterior odds. Numerically, if the $P(E|A,X)$s are very small then repeated multiplication may result in underflows. So it is very good practice to take the logarithm of the odds, so that products become sums. See Log Bayes Factors (LBFs) in Section 3.8.

9. Summaries. In any problem the priors belong to each of us: they are subjective. We may select them in a number of ways based on what we know. Sometimes we select them to make life easy for ourselves.
However, even when we differ over priors we shall all usually reason in the same direction in the light of new evidence.
The model terms (and the Bayes factors) must be fit for our purpose. This is mathematical modelling: for each hypothesis we must derive or assert a suitable distribution for observable events (over the set of possible values, continuous or discrete) under the assumption that it is true. X, our prior knowledge and skill, is required here.
Finally we have a posterior: a distribution of probabilities over a number of competing hypotheses. It is our job to summarize this posterior. This can be done in a number of ways: perhaps with one or two simple parameters, such as the modal value (corresponding to the peak, the most likely hypothesis) or the mean/expected value; or by specifying a credible set of values/hypotheses; or even by presenting the whole distribution for inspection.
We are done. Now we are ready to go to work.
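The two computational patterns in this crash course, the normalized update (A.8) over a set of exclusive and exhaustive hypotheses and the accumulation of independent evidence as a sum of log Bayes factors, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the numbers are hypothetical and are not taken from the text.

```python
import math
from fractions import Fraction


def posterior(priors, likelihoods):
    """Normalized Bayes update over exclusive, exhaustive hypotheses (cf. A.8).

    priors[i]      plays the role of P(A_i | X)
    likelihoods[i] plays the role of the model term P(E | A_i, X)
    """
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)  # this sum is P(E | X)
    return [j / total for j in joint]


def log_posterior_odds(prior_odds, bayes_factors):
    """Log posterior odds after independent observations.

    Summing logs instead of multiplying ratios avoids numerical underflow.
    """
    return math.log(prior_odds) + sum(math.log(bf) for bf in bayes_factors)


# Hypothetical numbers, chosen only for illustration.
priors = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
likelihoods = [Fraction(1, 10), Fraction(3, 10), Fraction(6, 10)]
post = posterior(priors, likelihoods)
print(post)  # exact fractions that sum to one

log_odds = log_posterior_odds(1.0, [2.0, 0.5, 4.0])
print(math.exp(log_odds))  # back on the odds scale
```

Exact fractions are used in the first part so that the normalization can be checked by eye; in practice floating-point likelihoods and the log form would be the usual choice.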
Some Examples
Application: Arresting a Suspect
Suppose a man is arrested in a New York neighbourhood suspected of committing a knife murder only one hour earlier. The NYPD believe he has a probability of $p=P(\text{guilty}|X)$ of being the murderer, where X represents everything that they know. And the NYPD knows a lot. When he is searched at the station he is found to be carrying a knife. Does that make him more likely to be guilty? At first sight you may think it does, but we really need a little more information to decide the question.
What is $P(\text{carrying a knife}|\text{not guilty},X)$? One in ten men of his age carry a knife like his in the tough neighbourhood where the suspect lives and was picked up. That is our model in this case: $P(\text{carrying a knife}|\text{not guilty},X)=1/10$.
What is $P(\text{carrying a knife}|\text{guilty},X)$? The policemen know that following a knife murder almost all murderers will discard the weapon as soon as they can. After one hour they estimate $P(\text{carrying a knife}|\text{guilty},X)=1/50$ (so only one in fifty murderers would still have the knife). That is our model in this case. Now we apply (A.8). We have
$$P(\text{guilty}|\text{carrying a knife},X)=\frac{\frac{1}{50}\,p}{\frac{1}{50}\,p+\frac{1}{10}(1-p)}=\frac{p}{5-4p}<p\quad\text{for }0<p<1.$$
Hence for any prior p, the discovery of the knife makes the suspect’s guilt less likely.
Of course, we could get a different result if we change the ‘models’ around a bit. But the lesson is clear. We cannot jump to any conclusion until we estimate the conditional probabilities for the new evidence under the alternative hypotheses, using some distinct ‘models’ to do so.
Note, if we use (A4.9) then things are much simpler in terms of odds:
$$O(\text{guilty}|\text{carrying a knife},X)=\frac{1/50}{1/10}\cdot O(\text{guilty}|X)=\frac{1}{5}\,O(\text{guilty}|X).$$
There is a vast literature on Bayesian probability in legal reasoning and other types of inference. The reason is precisely that the results can be surprising (apparent paradoxes), and that there are many cases where inferences are drawn without considering the full information necessary to derive the alternative conditional probabilities for the new evidence (that is, jumping to conclusions).
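The knife example is easy to check numerically. The sketch below uses the two model terms quoted in the text (1/50 and 1/10) and applies the normalized form of Bayes’ theorem for a few prior values; the function name and the particular priors tried are illustrative assumptions.

```python
from fractions import Fraction

# Model terms as stated in the text.
P_KNIFE_GIVEN_GUILTY = Fraction(1, 50)
P_KNIFE_GIVEN_NOT_GUILTY = Fraction(1, 10)


def posterior_guilt(prior):
    """P(guilty | carrying a knife, X) for a given prior P(guilty | X)."""
    numerator = P_KNIFE_GIVEN_GUILTY * prior
    denominator = numerator + P_KNIFE_GIVEN_NOT_GUILTY * (1 - prior)
    return numerator / denominator


# Illustrative priors: whatever the starting belief, the knife lowers it.
for prior in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    print(f"prior {prior} -> posterior {posterior_guilt(prior)}")
```

Swapping the two model terms reverses the conclusion, which is the point of the example: the direction of the update is fixed by the ratio of the conditional probabilities, not by intuition about knives.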
The Monty Hall Problem
Here is a variant of a rather famous problem. You are on a game show. There are five doors. Behind one door is a prize car; behind the others there is nothing.
The game show host asks you to select two of the doors for yourself (you can keep whatever is behind them). You do so. Then the host, Monty Hall (who knows where the car is hidden), walks up to the remaining three doors and opens up two of them revealing nothing behind them. He then invites you to stick with your original choice of two doors or to swap them both for the single remaining door. You will receive any prize behind the door or the pair of doors that you have after this. Should you swap?
The prior odds that the car is behind one of the two doors you first selected are 2/3 (the prior probability is 2/5 of course). We will use (A4.9).
What is the probability that the host could open two of the remaining three doors revealing nothing, assuming you have already got the car behind one of your doors? It is one: he can easily do it. What is the probability that the host could open two of the remaining three doors revealing nothing, assuming you have not got the car behind either of your doors? It is also one: he can easily do it since he knows which one of the three doors is hiding the car so he can open the other two. Let A denote the proposition that the car is behind either one of the two doors that you initially select. Let E denote the event whereby the game show host opens up the two empty doors.
Then (A4.9) becomes
$$O(A|E,X)=\frac{P(E|A,X)}{P(E|\text{not }A,X)}\cdot O(A|X)=\frac{1}{1}\cdot\frac{2}{3}=\frac{2}{3}.$$
So the model ratio term in (A4.9) is one: the posterior odds are the same as the prior odds. You still have a 2/5 chance of already having the car. But something has changed. The alternative ($\text{not }A$) has now been narrowed down to the car being behind a single door. The probability that the car is behind that door is 3/5 (the only remaining possibility if you have not got the car). Hence you should swap: you should give up your two doors for the final door.
Sloppy reasoning would just say that originally all of the doors had a one in five chance, so why give up two chances for just one? You should stick! Or that at the end we have three doors left; all are equally likely so two chances are better than one. Stick! Yet the single door that is offered in the swap has changed in status. Extra knowledge, that of the host, has been used to select that door for the final threesome, by opening and revealing the other two doors to be empty. That door has been selected in a special way with extra insight, and is not a randomly selected door from the five (as your original two were). Its status has changed. It now has a 3/5 probability of hiding the car because the game show host knew where the car was from the start and he selected it to remain closed.
If the host did not know where the car really was then this changes things completely. In that case, when he opened a random pair of doors, selected from the three that you had not selected, he risked revealing the car, and you would have lost immediately (game over!), but fortunately that did not happen. So the probability that the host could open two of the remaining three doors revealing nothing, assuming A, is one. He can easily do it. But the probability that the host could open two of the remaining three doors revealing nothing, assuming $\text{not }A$, is 1/3. He had a two thirds chance of revealing the car and ending the game early. So the model ratio term in (A4.9) is one divided by 1/3: equal to three. Thus the posterior odds are now two (the prior odds being the same as before, 2/3):
$$O(A|E,X)=\frac{1}{1/3}\cdot\frac{2}{3}=2.$$
Hence in the case where the idiotic host, Monty, did not know where the car was you should stick with the pair and not swap.
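Both versions of the five-door game can be checked by exact enumeration rather than by argument. The sketch below assumes, as the text does, that the car is equally likely to be behind any door, and additionally assumes that a host with a free choice picks uniformly among the pairs available to him; the function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations


def swap_win_probability(host_knows):
    """P(car is behind the one unopened door | host revealed two empty doors)."""
    doors = range(5)
    mine = {0, 1}  # by symmetry we may fix the contestant's two doors
    others = [d for d in doors if d not in mine]

    p_event = Fraction(0)           # P(E): two empty doors are revealed
    p_event_and_win = Fraction(0)   # P(E and swapping wins)

    for car in doors:
        p_car = Fraction(1, 5)
        pairs = list(combinations(others, 2))
        if host_knows:
            # A knowing host never opens the door hiding the car.
            pairs = [pair for pair in pairs if car not in pair]
        for pair in pairs:
            if car in pair:
                continue  # ignorant host revealed the car: E did not occur
            p = p_car * Fraction(1, len(pairs))
            remaining = next(d for d in others if d not in pair)
            p_event += p
            if remaining == car:
                p_event_and_win += p

    return p_event_and_win / p_event


print("knowing host, swap wins: ", swap_win_probability(True))
print("ignorant host, swap wins:", swap_win_probability(False))
```

With a knowing host the swap wins with probability 3/5, so swapping is right; with an ignorant host (conditional on the car not having been revealed) it wins with probability 1/3, so sticking with the pair is right, in agreement with the odds calculations above.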
Example Application: The Swine Flu Test
Suppose there is a new virus such as swine flu that is difficult to detect. It is believed that one in one hundred people have swine flu (SF), but there is also a test (a blood test or something similar). Before anybody takes a test their probability of having SF is therefore 0.01. The test is ninety-eight per cent accurate for those with SF: $P(\text{Positive}|SF,X)=0.98$. The test produces two per cent false positives: $P(\text{Positive}|\text{not }SF,X)=0.02$.
Suppose a person takes the test and gets a positive result: what is the probability that he or she has SF now? The prior odds of having SF are 1/99. Applying (A4.9) we have
$$O(SF|\text{Positive},X)=\frac{0.98}{0.02}\cdot\frac{1}{99}=\frac{49}{99}\approx 0.49.$$
So as a result of the prior population bias, a positive result means that the testee is still about twice as likely NOT to have SF as to have it.
This example stresses the importance of not simply focussing on the model terms, which make the test look awesome, but also considering the prior bias: how likely each hypothesis is to hold in the first place.
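The odds form of Bayes’ theorem makes this calculation a one-liner. A minimal sketch follows, taking the two per cent false-positive rate stated in the text; the helper names are hypothetical.

```python
from fractions import Fraction


def posterior_odds(prior_odds, p_e_given_a, p_e_given_not_a):
    """Posterior odds = Bayes factor x prior odds."""
    return (p_e_given_a / p_e_given_not_a) * prior_odds


def odds_to_probability(odds):
    """Convert odds o into the probability o / (1 + o)."""
    return odds / (1 + odds)


prior_odds = Fraction(1, 99)        # one person in a hundred has SF
sensitivity = Fraction(98, 100)     # P(Positive | SF, X)
false_positive = Fraction(2, 100)   # P(Positive | not SF, X)

odds = posterior_odds(prior_odds, sensitivity, false_positive)
prob = odds_to_probability(odds)
print(odds, float(prob))
```

The Bayes factor of 49 is large, but it multiplies prior odds of only 1/99, leaving posterior odds just under one half, that is, a probability of roughly one third of actually having SF.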
The Transposed Conditional
While writing these notes today, there was a good example of very sloppy plausible thinking in The Times (which had to be corrected by the President of the RSS). Drinkers look away now please! The Times reported that a high number of patients in high dependency clinics (HDCs; we write IHDC for the proposition that a patient is in one) with (near) Liver Failure (LF) were middle class drinkers (MCDs) who drink more than a few glasses of wine at home every evening. The article implied that MCDs had an increased probability of suffering from LF.
But we simply do not have enough information: we know only $P(MCD|IHDC,LF,X)$. Suppose nine out of ten patients with LF in HDCs are MCDs. We have $P(MCD|IHDC,LF,X)=0.9$.
The article sought to imply something about $P(LF|MCD,X)$, transposing the conditional and the consequence, and dropping another condition (IHDC). It implied in particular that $P(LF|MCD,X)>P(LF|X)$, the probability of some random adult suffering with LF. In fact this example is like the knife-crime example discussed earlier, and it is still possible that $P(LF|MCD,X)$ is less than $P(LF|X)$.
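Let us make some further assumptions. First we must deal with the IHDC proposition. In odds notation we already have
$$O(MCD|IHDC,LF,X)=\frac{0.9}{0.1}=9.$$
The middle classes have sharp elbows, so our model for this tendency is that middle class adults suffering LF are, say, thirty-six times more likely to be IHDC than somebody from the lower classes (not MCD) suffering from LF: the MCDs always visit their doctors when they are unwell. Perhaps the lower classes simply carry on drinking and die, or stay away from HDCs at any rate. We have
$$\frac{P(IHDC|MCD,LF,X)}{P(IHDC|\text{not }MCD,LF,X)}=36,$$
so that
$$O(MCD|LF,X)=\frac{O(MCD|IHDC,LF,X)}{36}=\frac{9}{36}=\frac{1}{4},\quad\text{that is,}\quad P(MCD|LF,X)=\frac{1}{5}.$$
Now directly from Bayes’ theorem again:
$$P(LF|MCD,X)=\frac{P(MCD|LF,X)\cdot P(LF|X)}{P(MCD|X)}.$$
Suppose that $P(MCD|X)=1/4$, i.e., MCDs make up a quarter of all adults. Then we have
$$P(LF|MCD,X)=\frac{1/5}{1/4}\cdot P(LF|X)=\frac{4}{5}\,P(LF|X)<P(LF|X).$$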
Here we just made up the extra facts that were missing in the original article. But the point is clear. Even if $P(LF|MCD,X)$ is large we must not confuse it with $P(MCD|LF,X)$. This error is so common that it has a name: the ‘transposed conditional’.
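The chain of reasoning in this example can be followed numerically. The sketch below uses only the made-up figures from the text (nine in ten, a factor of thirty-six, and one quarter) and assumes the steps are: convert to odds, divide out the clinic-attendance ratio, convert back to a probability, and then apply Bayes’ theorem. The variable names are illustrative.

```python
from fractions import Fraction

# Made-up figures quoted in the text.
p_mcd_given_ihdc_lf = Fraction(9, 10)  # P(MCD | IHDC, LF, X)
clinic_ratio = 36                      # P(IHDC | MCD, LF, X) / P(IHDC | not MCD, LF, X)
p_mcd = Fraction(1, 4)                 # P(MCD | X)

# Step 1: odds on MCD among LF patients who are in an HDC.
odds_mcd_given_ihdc_lf = p_mcd_given_ihdc_lf / (1 - p_mcd_given_ihdc_lf)

# Step 2: remove the effect of the IHDC condition (odds form of Bayes' theorem).
odds_mcd_given_lf = odds_mcd_given_ihdc_lf / clinic_ratio

# Step 3: back to a probability, P(MCD | LF, X).
p_mcd_given_lf = odds_mcd_given_lf / (1 + odds_mcd_given_lf)

# Step 4: Bayes' theorem gives P(LF | MCD, X) / P(LF | X) = P(MCD | LF, X) / P(MCD | X).
relative_risk = p_mcd_given_lf / p_mcd

print(odds_mcd_given_ihdc_lf, odds_mcd_given_lf, p_mcd_given_lf, relative_risk)
```

A ratio below one means that, under these assumed figures, being a middle class drinker makes liver failure less probable than for a random adult, despite nine in ten clinic patients being MCDs.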
Discrete Distributions: Multinomial Models
A multinomial model is simply a way of describing a set of probabilities that some (random) variable is drawn from a discrete set of mutually exclusive and exhaustive alternatives.
Suppose $\{B_j \mid j=1,\dots,m\}$ denotes such a discrete set of alternative possible classes or categories for an observable event, or quantity, b. A multinomial model for the random event b is a set of probabilities $\{P_j \mid j=1,\dots,m\}$, such that
$$P(b\in B_j|X)=P_j,\qquad \sum_{j=1}^{m}P_j=1.$$
Suppose that we observe a number of such events, none of which affects any of the others (so the likelihood of each result is given by our model): then we say they are independent. Suppose that for exactly $n_j$ of these, the result b is in $B_j$. Then we have
$$P(n_1,\dots,n_m|P_1,\dots,P_m,X)\propto\prod_{j=1}^{m}P_j^{\,n_j},$$
up to a combinatorial factor that does not depend on the $P_j$.
Hence we have a model for the likelihood of observed data, given the set of multinomial probabilities (that must sum to one).
If $m=2$ then we only talk about $P_1=p$, say, since $P_2=(1-p)$, and we have a binomial distribution for the different outcomes (combinations of events) from a number N of experiments.
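A multinomial likelihood is usually evaluated on the log scale. The sketch below is one way to do so under the independence assumption above; the function name and the `include_coefficient` switch are illustrative choices, the latter controlling whether the combinatorial factor (which does not depend on the probabilities) is included.

```python
import math


def multinomial_log_likelihood(counts, probs, include_coefficient=True):
    """Log-likelihood of category counts n_j under probabilities P_j."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to one")

    log_lik = 0.0
    for n, p in zip(counts, probs):
        if n > 0:
            if p == 0.0:
                return -math.inf  # an impossible category was observed
            log_lik += n * math.log(p)

    if include_coefficient:
        # log of N! / (n_1! ... n_m!), computed via the log-gamma function
        total = sum(counts)
        log_lik += math.lgamma(total + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_lik


# Binomial special case (m = 2): two successes and one failure with p = 1/2.
print(math.exp(multinomial_log_likelihood((2, 1), (0.5, 0.5))))
print(math.exp(multinomial_log_likelihood((2, 1), (0.5, 0.5), include_coefficient=False)))
```

When comparing hypotheses about the probabilities for fixed data, the coefficient cancels in any ratio, so either setting gives the same Bayes factors.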
Continuous Distributions
Often we will wish to run a continuum of hypotheses against one another. We will deal here with a distribution of probable values for some real parameter λ: the generalization to higher dimensions or other types of state space is obvious in almost all cases.
Suppose we have some real constant λ that is unknown. Let S be any subset of the real line and define the corresponding hypothesis, $H_S$, via
$$H_S:\ \lambda\in S.$$
Then we introduce a probability density function, often called a pdf, $f(\lambda|X)$. This is a non-negative real-valued function, and
$$P(H_S|X)=\int_S f(\lambda|X)\,d\lambda.$$
Each hypothesis is intimately linked to its set of hypothesized values, S. Clearly hypotheses are mutually exclusive if the intersection between their sets is empty (or of measure zero).
If S contains all possible values (the entire support of f) then $P(H_S|X)=1$. Hence f must be of unit mass as well as non-negative.
Bayes’ theorem still applies, as it must. So for any new event or evidence, E, we can write
$$f(\lambda|E,X)=\frac{P(E|\lambda,X)\cdot f(\lambda|X)}{P(E|X)}.$$
This last follows by applying Bayes’ theorem to $P(H_S|X)$, and its posterior counterpart, $P(H_S|E,X)$, and observing that S can be chosen arbitrarily.
Normalizing probability density functions is tedious.
We know that the total mass of any posterior pdf must be unity. So often we will prefer to deal with non-normalized density functions, and write simply
$$f(\lambda|E,X)\propto P(E|\lambda,X)\cdot f(\lambda|X),$$
with the running proviso that we will own up and normalize such fs whenever that is required.
Of course we also may stray into the area of fs which are of infinite mass (not integrable) if we wish. These are called improper density functions. For example $f(\lambda|X)=1$ for all real λ, or $f(\lambda|X)=1/\lambda$ for all positive λ. This idea can indeed be very useful, since a posterior pdf may be integrable (thus proper) even when the prior is improper.
Suppose we have absolutely no prior information to constrain our thoughts about λ: how will we set our prior then? Improperly? This is a subjective issue: a prior pdf that is constant in λ induces a prior pdf for, say, $\log\lambda$ that is certainly not constant. Which variable are you relatively indifferent about? On which scale will you spread out your prior mass uniformly? Let us put this issue into X for now. Interested readers should consult Jaynes and Bretthorst (2003) [83], especially on the Jeffreys prior.
As successive events are observed we will evolve a posterior pdf. So a prior pdf should not rule any values out that we might have otherwise accepted later. Eventually, with enough observations, the prior becomes less important. Again consult the references, especially Jaynes and Bretthorst (2003) [83], on this aspect.
Now consider an unknown real parameter λ which is used to model some observable z. We will have a model $P(z|\lambda,X)$ which yields
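The dependence on the choice of scale can be made explicit with a change of variables (a short worked step; the symbols $u$ and $g$ are notation introduced here for illustration only). If λ has prior density $f(\lambda|X)$ on $\lambda>0$, then $u=\log\lambda$ has density
$$g(u|X)=f(e^{u}|X)\,\left|\frac{d\lambda}{du}\right|=f(e^{u}|X)\,e^{u}.$$
So a constant $f$ gives $g(u|X)\propto e^{u}$, which piles prior mass onto large λ, whereas $f(\lambda|X)=1/\lambda$ gives a constant $g$, that is, a prior that is uniform in $\log\lambda$.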
• a probability density function for z, given a value for λ, if z is continuous

• a multinomial for z, given a value for λ, if z is categorical or discrete.
For example, our model may assert that z is normally distributed about λ with unit variance, say. So given λ we can estimate possible values for z.
Now suppose we observe an actual value $z^{\ast}$. What does this tell us about λ? Well, the event E is simply the observation itself: that z lies within any set S containing $z^{\ast}$.
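So we have
$$f(\lambda|E,X)\propto f(\lambda|X)\,P(z\in S|\lambda,X).$$
But S was an arbitrary set containing $z^{\ast}$. So we have
$$f(\lambda|z^{\ast},X)\propto f(\lambda|X)\,P(z^{\ast}|\lambda,X).$$
Again we can normalize if we wish and write
$$f(\lambda|z^{\ast},X)=\frac{f(\lambda|X)\,P(z^{\ast}|\lambda,X)}{\int f(\lambda'|X)\,P(z^{\ast}|\lambda',X)\,d\lambda'}.$$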
Now let us start with some pleasant distribution for our prior ($f(\lambda|X)$), and some well-chosen model $P(z|\lambda,X)$ for our observable, and suppose that we observe a set of m independent measurements: $E=\{z_1,\dots,z_m\}$.
Then we will have the (nonnormalized) posterior
$$f(\lambda|E,X)\propto f(\lambda|X)\prod_{i=1}^{m}P(z_i|\lambda,X).$$
This will soon get rather hairy! When the models are algebraically complicated these become difficult to deal with. How will we summarize the posterior? We will probably struggle to find even its mean or mode.
There are two ways around this: a traditional approach as outlined in Appendix 8, which uses some trickery to reduce the amount of algebra, by a pragmatic choice of the prior; and the use of the computer to summarize and sample from the posterior, as written. This last possibility is only thirty or forty years old and is the subject of much progress; the methods are typically referred to as Markov chain Monte Carlo (MCMC) methods.
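To give a flavour of the second route, here is a minimal sketch of a random-walk Metropolis sampler, one of the simplest members of the Markov chain Monte Carlo family. It assumes a flat prior on λ and the unit-variance normal model mentioned above; the data values, step size, seed and function names are all illustrative assumptions rather than anything prescribed by the text.

```python
import math
import random

def log_posterior(lam, data):
    """Log of the nonnormalized posterior: flat prior, unit-variance normal model."""
    return -0.5 * sum((z - lam) ** 2 for z in data)

def metropolis(data, n_samples=20000, step=0.5, start=0.0, seed=0):
    """Random-walk Metropolis sampler for the scalar parameter lam."""
    rng = random.Random(seed)
    lam = start
    logp = log_posterior(lam, data)
    samples = []
    for _ in range(n_samples):
        proposal = lam + rng.gauss(0.0, step)
        logp_new = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            lam, logp = proposal, logp_new
        samples.append(lam)
    return samples

data = [1.2, 0.7, 1.9, 1.4, 0.8]        # hypothetical measurements z_1..z_m
samples = metropolis(data)[2000:]        # discard an initial burn-in stretch
post_mean = sum(samples) / len(samples)
post_var = sum((s - post_mean) ** 2 for s in samples) / len(samples)
print(post_mean, post_var)
```

For this particular model the posterior is known in closed form (normal, centred on the sample mean with variance 1/m), so the sampled mean and variance can be checked against it. The sampler only ever needs the nonnormalized posterior, which is why normalization can be postponed.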
Algebraic Convenience: Conjugate Priors
In the days before computers could be used to summarize and sample from posteriors, in order to avoid excessively complicated functions a rather useful practice was developed: the use of conjugate priors.
The central and elegant idea is that if the model $P(z|\lambda,X)$ is given then, rather than choose an arbitrary prior $f(\lambda|X)$ for λ, we could (if we had no better reason) make life very easy for ourselves by choosing f from a particular family of functions, so that the posterior would be from the same family. Such a family of functions is called a conjugate prior distribution for the chosen model distribution.
Let $F(\lambda|\theta)$ be a family of pdfs (normalized or not) for λ parameterized by θ. Then F is conjugate to the model, $P(z|\lambda,X)$, if the posterior is given by
$$F(\lambda|\hat{\theta})\propto F(\lambda|\theta)\,P(z|\lambda,X),$$
where $\hat{\theta}=\hat{\theta}(\theta,z)$ is some well-defined function. Notice that if the model is given, then this last is a functional equation for F and $\hat{\theta}$.
For example, suppose z is a multinomial variable, with m categories. Then let λ be the vector $({P}_{1},{P}_{2},\dots ,{P}_{m})$ of the unknown probabilities in our multinomial model. Then λ lives on the simplex: $\mathrm{\lambda}\ge 0$, and ${\mathrm{\lambda}}^{T}\mathbf{\text{s}}=1$, where $\mathbf{\text{s}}=(1,1,\dots ,1{)}^{T}$.
As we observe instances of the multinomial variable z we will change our opinion as to where the $\mathrm{\lambda}=({P}_{1},{P}_{2},\dots ,{P}_{m})$ may lie.
Now for any real $\mathbf{w}=(w_1,w_2,\dots,w_m)^{T}\ge 0$ let
$$G(P_1,P_2,\dots,P_m,\mathbf{w})=\prod_{i=1}^{m}P_i^{w_i}$$
be defined on the simplex $\{P_i\ge 0,\ \sum P_i=1\}$. Strictly speaking we should have normalized G so that it integrates to one, but we can proceed with this nonnormalized form. Note that on this simplex $G(P_1,P_2,\dots,P_m,\mathbf{w})$ has a maximum (modal value) at $\mathbf{w}/|\mathbf{w}|$, where $|\mathbf{w}|=\sum_i w_i$ (hint: use a Lagrange multiplier to maximize G while constraining to the simplex). If $\mathbf{w}=0$ then G is uniform.
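The hint can be followed through in a few lines. A sketch, assuming the product form $G=\prod_i P_i^{w_i}$ (the form consistent with the coin example later in this section) and writing μ for the Lagrange multiplier, a symbol introduced here:

```latex
% Maximize \log G subject to \sum_i P_i = 1.
\mathcal{L}(P,\mu) = \sum_{i=1}^{m} w_i \log P_i \;-\; \mu\Big(\sum_{i=1}^{m} P_i - 1\Big),
\qquad
\frac{\partial\mathcal{L}}{\partial P_i} = \frac{w_i}{P_i} - \mu = 0
\;\Rightarrow\; P_i = \frac{w_i}{\mu}.
% Imposing the constraint \sum_i P_i = 1 gives \mu = \sum_j w_j, hence
P_i^{\text{mode}} = \frac{w_i}{\sum_{j=1}^{m} w_j}.
```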
Then suppose our prior ‘insight’, X, allows us to select some nonnegative real values for w and to take the prior
$$f(\lambda|X)=G(\lambda,\mathbf{w}).$$
Now suppose that we observe some evidence, E, containing a set of independent instances of the categorical variable z, with exactly $n_i$ of them within category $C_i$.
Let $\mathbf{n}=(n_1,\dots,n_m)^{T}$, then we have the posterior
$$f(\lambda|E,X)=G(\lambda,\mathbf{w}+\mathbf{n}),$$
which is of the same family as the prior. Hence G is the conjugate prior for the multinomial.
For many, many observations the posterior becomes peaked around its modal value at $\sim\mathbf{n}/|\mathbf{n}|$.
Note that if $m=2$ and the multinomial is a binomial we usually write $P_1=p$ and $P_2=1-p$, and abuse the notation to write
$$G(p,(w_1,w_2))=p^{w_1}(1-p)^{w_2}.$$
In this case, G is called a beta distribution (when normalized) and will be discussed in more detail in Appendix 9.
If a priori we think that a coin is likely to be a fair one we might select $w_1=w_2=10$. But after observing Q consecutive heads ($C_1$), and no tails, we have the posterior $G(p,(Q+10,10))=p^{Q+10}(1-p)^{10}$, which has a modal value at $p=(Q+10)/(Q+20)$.
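The coin example amounts to adding the observed counts to the prior weights. A minimal sketch, assuming integer weights so that exact fractions can be used; the function names and the choice Q = 30 are illustrative assumptions:

```python
from fractions import Fraction

def conjugate_update(w, n):
    """Posterior weights for the nonnormalized prior prod(P_i ** w_i) after counts n."""
    return [w_i + n_i for w_i, n_i in zip(w, n)]

def mode(w):
    """Modal point w / sum(w) on the simplex (requires sum(w) > 0, integer weights)."""
    total = sum(w)
    return [Fraction(w_i, total) for w_i in w]

prior_w = [10, 10]        # prior belief that the coin is roughly fair
Q = 30                    # hypothetical run of consecutive heads, no tails
post_w = conjugate_update(prior_w, [Q, 0])
p_mode = mode(post_w)[0]
print(post_w, p_mode)     # [40, 10] 4/5
```

The prior weights behave like counts from imaginary earlier observations, which is why a run of thirty heads moves the mode only to 4/5 and not to 1.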
In our next example of a conjugate prior suppose that we will observe some real quantity z. As a model for z we will choose a Poisson distribution with intensity λ, some unknown parameter. We have the model distribution
Suppose next that we feel able to choose the prior
$$f(\lambda|X)=\lambda^{a}e^{-\lambda b}$$
for some nonnegative constants a and b. Again we have not bothered to normalize this f. It has a modal peak at $\mathrm{\lambda}=a/b$ (Calculus!).
Then observing a single value of z, say at ${z}^{\ast}$, we have the posterior
If we continue adding further independent observations, so that there are m in total, then the modal value approaches the inverse of the mean of the observed z-values. We will have $\lambda_{\text{mode}}=(1+a/m)/(\bar{z}+b/m)$.
The two-parameter family $\lambda^{a}e^{-\lambda b}$ is the conjugate prior for the Poisson distribution.
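The quoted modal value is what one obtains if each observed z contributes a factor $\lambda e^{-\lambda z}$ to the likelihood, that is, if z is treated as a waiting time in a Poisson process of intensity λ. That likelihood form is an assumption made here to reproduce the formula; the sketch below, with hypothetical names and data, carries out the resulting update of (a, b).

```python
import math

def update(a, b, observations):
    """Posterior (a, b) for the prior lam**a * exp(-lam*b).

    Assumes each observation z contributes lam * exp(-lam * z) to the likelihood,
    so the exponent a grows by the number of observations and b by their sum.
    """
    return a + len(observations), b + sum(observations)

def mode(a, b):
    """Maximizer of lam**a * exp(-lam*b) over lam > 0 (requires b > 0)."""
    return a / b

a0, b0 = 2.0, 1.0                 # illustrative prior constants
zs = [0.5, 1.5, 1.0, 2.0]         # hypothetical observed z-values
a1, b1 = update(a0, b0, zs)
m = len(zs)
z_bar = sum(zs) / m
lam_mode = mode(a1, b1)
print(lam_mode, (1 + a0 / m) / (z_bar + b0 / m))  # the two expressions agree
```

As m grows, the a/m and b/m terms fade and the mode tends to the inverse of the sample mean, in line with the statement above.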
In Section 6.4 we show that the multivariate Gaussian distribution is conjugate to itself.
Laplace and Laplace’s Law of Succession
We often want to estimate probabilities by counting instances of different types of event. We should guard against setting event probabilities to one or zero, even if we have (so far) always observed them or have never yet observed them. The fact that we subjectively think alternative events are possible means we have to allow at least some small amount of probability.
Suppose we have an urn containing some red and some white balls. All probability theorists just love urns containing balls; it is the fault of the Bernoullis (look them up). Our prior information about them is that the balls are always well mixed within such urns and each has the same probability of being sampled through a single ‘draw’. But in this case we do not know how many balls of each type are in the urn, though we will assume that both types of balls are contained within the urn. We draw a ball at random from the urn and examine its colour. We replace it and draw again. Suppose after N such draws, we have drawn a red ball exactly n times and a white ball exactly $m=N-n$ times. What is the probability p that the next ball to be drawn (and any subsequent ball drawn independently) will be red?
We do not know p at this stage. But we can express our knowledge about the possible values that the parameter p might take by using a probability density distribution, $f(p|X)$ say. As before, this is a nonnegative function defined over all of the possible values of p (in this case (0,1)) such that the probability that the true value of p lies within the interval between two constants $a>0$ and $b>a$ (but less than one) is given by the integral of the density distribution over the interval:
$$P(a<p<b|X)=\int_a^b f(p|X)\,dp.$$
Note that we are using a probability distribution for the value of the parameter, p, which is itself a probability.
When probability distributions for some parameter are integrated over all of its possible values, we must obtain unity. In our present case we have always
$$\int_0^1 f(p|X)\,dp=1.$$
If f has a maximum at some value ${p}^{\ast}$, then this represents the maximum likelihood value, or mode value, for p. If the distribution is narrow and highly spiked then we must think that the true value for p lies very close to the mode value. Conversely, if the distribution is rather flat with a large variance we must be unsure as to where the true value lies, and have no great preference for one estimate over another.
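For any given function of p, say G(p), the expected value for G is given by:
$$\langle G|X\rangle=\int_0^1 G(p)\,f(p|X)\,dp.$$
In particular, our expected value of p itself is just:
$$\langle p|X\rangle=\int_0^1 p\,f(p|X)\,dp,$$
and our expected value of any power $p^q$ is just:
$$\langle p^q|X\rangle=\int_0^1 p^q\,f(p|X)\,dp.$$
Hence the variance of the distribution is
$$\sigma^2=\langle p^2|X\rangle-\langle p|X\rangle^2.$$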
Now let us return to thinking about our urn filled with some red and some white balls, where p is the probability that any ball to be drawn will be red. As our information about the possible true value of p changes, this will change the distribution, and hence our estimates. This information changes each time we draw a single ball.
At the start of this thought experiment we know only that p lies between zero and one. Let us assume a prior distribution, $f_0(p|X)\equiv 1$, that is uniform (and equal to one) for all values between zero and one. That is, before any balls have been drawn, we will assume that the probability that the model parameter p lies between $a>0$ and $b>a$, but less than one, is
$$P(a<p<b|X)=\int_a^b f_0(p|X)\,dp=b-a.$$
Using Bayes’ formula after drawing n reds and m whites we have a posterior distribution
$$f(p|(n,m),X)=\frac{P((n,m)|p,X)\,f_0(p|X)}{P((n,m)|X)}.$$
But $f_0\equiv 1$, so by integrating both sides of this equation we have
$$P((n,m)|X)=\int_0^1 P((n,m)|p,X)\,dp.$$
Now if p is considered known, then the probability of drawing each red independently is p and the probability of drawing each white independently is $1-p$. Therefore we have the simple ‘model’ for the event $(n,m)$:
$$P((n,m)|p,X)=\binom{n+m}{n}\,p^{n}(1-p)^{m}.$$
This is called the ‘binomial distribution’: the probability of drawing the various results $(n,m)$, given a value for p. Considered as a function of p, for $(n,m)$ given, it is called the ‘beta distribution’. Above we see that our posterior distribution for p is just a normalized form of this function:
$$f(p|(n,m),X)=\frac{p^{n}(1-p)^{m}}{\int_0^1 q^{n}(1-q)^{m}\,dq}.$$
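So what is our expected value $\langle p|X\rangle$ for p (the expected probability of drawing a red ball in the next draw)? We have the ratio of two integrals
$$\langle p|X\rangle=\frac{\int_0^1 p^{n+1}(1-p)^{m}\,dp}{\int_0^1 p^{n}(1-p)^{m}\,dp}.\tag{A9.2}$$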
Using integration by parts again and again we have:
$$\int_0^1 p^{n}(1-p)^{m}\,dp=\frac{n!\,m!}{(n+m+1)!}.\tag{A9.3}$$
Using (A9.3) twice in (A9.2), we obtain:
$$\langle p|X\rangle=\frac{n+1}{n+m+2}.\tag{A.13}$$
This (and its generalizations) is called Laplace’s Law of Succession. Laplace first gave it in 1774 and it has played a major role in the story of Bayesian inference. Many books on probability theory try to ignore it, but it is remarkably useful for our purposes.
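The cancellation behind this result is short enough to write out. A sketch, using $B(n,m)$ as shorthand introduced here for the integral of $p^{n}(1-p)^{m}$ over the unit interval (note this is not the conventional argument order of the beta function):

```latex
% B(n,m) := \int_0^1 p^{n}(1-p)^{m}\,dp = \frac{n!\,m!}{(n+m+1)!}  (repeated integration by parts)
\langle p|X\rangle
  = \frac{B(n+1,m)}{B(n,m)}
  = \frac{(n+1)!\,m!}{(n+m+2)!}\cdot\frac{(n+m+1)!}{n!\,m!}
  = \frac{n+1}{n+m+2}.
```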
We can also see how good an estimate for the true value the expectation $\langle p|X\rangle$ might be by considering the standard deviation, σ, the square root of the variance. We have, again using (A9.3), and some rearrangement (exercise):
$$\sigma^2=\frac{(n+1)(m+1)}{(n+m+2)^2(n+m+3)}.$$
As is often the case, the standard deviation goes to zero like $(n+m)^{-1/2}$: increase the sample size by a factor of one hundred to decrease the error by a factor of ten. But we are content with (A.13) even when $n+m$ is small, since it represents our expectation given our prior knowledge and the data we have, and now we also have an estimate, σ, as to how wide the resulting posterior distribution is.
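A small sketch of these two quantities, assuming the uniform prior used above so that the posterior is proportional to $p^{n}(1-p)^{m}$; the function names and the sample counts are illustrative choices:

```python
from fractions import Fraction
import math

def laplace_mean(n, m):
    """Expected probability of red after n reds and m whites (uniform prior)."""
    return Fraction(n + 1, n + m + 2)

def laplace_variance(n, m):
    """Variance of the posterior proportional to p**n * (1 - p)**m."""
    return Fraction((n + 1) * (m + 1), (n + m + 2) ** 2 * (n + m + 3))

# Same red/white ratio at three sample sizes: the mean settles, the spread shrinks.
for n, m in [(3, 1), (30, 10), (300, 100)]:
    mean = laplace_mean(n, m)
    sd = math.sqrt(laplace_variance(n, m))
    print(n + m, float(mean), round(sd, 4))
```

With no data at all (n = m = 0) the functions return the mean 1/2 and variance 1/12 of the uniform prior, which is a convenient sanity check.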
Now let us consider some of the surprising things about Laplace’s Law. First, it does not depend on the number of balls in the urn, which we do not know. In fact the urn is just a mechanism for producing results: we simply generate a red ball with some unknown probability p. What can we say about p? The details of the shape of the urn or the number of balls in it do not matter. In our applications we will frequently need to estimate parameters based on a limited number of calibration results. Then we can use Laplace’s Law.
Second, it is never equal to zero or one. Even if we have never seen a white ball and have drawn N red balls in succession, $\langle p|X\rangle=(N+1)/(N+2)$. There is always a chance (based on our prior information that white balls can be in such urns) that the next draw will be white.
If we have two identical urns, but with distinct mixes of balls, and have drawn one hundred red balls (no whites) from urn one and 1,000 red balls (no whites) from urn two, we can say that if we are to see a white ball appearing it is much more likely to come from urn one than urn two, because for the latter we have carried out more draws without a white appearing. This is consistent with our common sense.
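The comparison can be made concrete with the rule for N consecutive reds. A short worked calculation, in which the decimal values are rounded:

```latex
% Expected probability that the next draw is white: 1 - (N+1)/(N+2) = 1/(N+2).
\text{urn one } (N=100):\quad \frac{1}{102}\approx 0.0098,
\qquad
\text{urn two } (N=1000):\quad \frac{1}{1002}\approx 0.0010.
```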
As an exercise, find the modal value for p. This is where $p^{n}(1-p)^{m}$ has a maximum. Consider how this differs from the expected value given by (A.13). What if we have made N draws and all of them are red balls?
The general case of Laplace’s Law of Succession is given as follows. (Just think of urns containing balls of many, K, distinct colours.)
Consider a mechanism where there are a number, K, of possible mutually exclusive types of result, $A_k$ say, for $k=1,\dots,K$; each generated by a corresponding causal process that remains constant. We suppose that each type of result, $A_k$, occurs with a probability $p_k$ where $\sum_{k=1}^{K}p_k=1$. Then suppose we have obtained N results, or have made N observations, and obtained $A_k$ exactly $n_k$ times (of course $\sum_{k=1}^{K}n_k=N$). Then what estimates can we have for the $p_k$?
This involves making some tricky integrals over sets in K-dimensional space. But the argument is analogous to that given above for the case $K=2$. We obtain
$$\langle p_k|X\rangle=\frac{n_k+1}{N+K},\qquad k=1,\dots,K.\tag{A.14}$$
This generalizes (A.13) (simply put $K=2,{p}_{1}=p,{n}_{1}=n$, $N=n+m$).
Equation (A.14) is very useful to us, because whenever we want to express a probabilistic model (for use in Bayes’ theorem) as a multinomial model, we need to estimate the probabilities based on some calibration data sets. These estimates avoid absolute zero (impossibility) and absolute unity (certainty), and allow for possibilities barely, or indeed never yet, observed in the data.
Now we come to another very interesting feature. (A.14) depends upon K, the number of results that are thought to be possible, even if some have never yet been observed. For example, suppose we believe that our urn contains yellow, red, and white balls, and we take ten draws obtaining four white, six red, and zero yellow. Then we have
$$p_{red}=\frac{6+1}{10+3}=\frac{7}{13}.$$
Next, suppose that we are told that sometimes the urns may contain green balls also (so X is changed). If we believe this then we must recompute with $K=4$. So
$$p_{red}=\frac{6+1}{10+4}=\frac{1}{2}.$$
Hence ${p}_{red}$ depends not just on (${n}_{red}$ and N) but on the number of possibilities we are prepared to entertain.
Suppose next we are convinced that we can observe and discriminate between 1,000 different colours and hues.
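The dependence on K is easy to explore numerically. A minimal sketch of the general rule, assuming the counts are supplied as a dictionary that lists every result type we are prepared to entertain, including those never yet observed; the function name and the `hue_` labels are hypothetical:

```python
from fractions import Fraction

def laplace_estimates(counts):
    """Estimate p_k = (n_k + 1) / (N + K) for every admitted result type.

    counts maps each result type to its observed count; types that are
    thought possible but have not been seen must appear with a count of zero.
    """
    N = sum(counts.values())
    K = len(counts)
    return {k: Fraction(n_k + 1, N + K) for k, n_k in counts.items()}

three = laplace_estimates({"white": 4, "red": 6, "yellow": 0})
four = laplace_estimates({"white": 4, "red": 6, "yellow": 0, "green": 0})

many = {"white": 4, "red": 6}
many.update({f"hue_{i}": 0 for i in range(998)})   # 1,000 colours in total
thousand = laplace_estimates(many)

print(three["red"], four["red"], thousand["red"])  # 7/13 1/2 7/1010
```

With 1,000 admissible colours and only ten draws, almost all of the probability is held in reserve for colours not yet seen, so the estimate for red falls to 7/1010.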
But perhaps this is a desirable feature since if we want to admit the possibility of rare types of events (for which there is little or no hard evidence), we have to allocate a small chance to their occurrence. Before any data are observed, if we have no prior bias then all such events are to be deemed equally likely. Hence Laplace moves us seamlessly from the prior state of knowledge (all equally possible), through the small data situation, and on to the large data situation.
Unfortunately Laplace did not really help himself popularize this idea by applying his law to calculate the probability that the sun will rise tomorrow given that it has done so for 5,000 years (without assuming any prior knowledge of the workings of the solar system). The odds on tomorrow’s sunrise are
Good. But any additional information will also alter our knowledge, and Laplace himself knew a great deal about the mechanics of the heavens, so it is hard for us to put that aside. Our knowledge of what may or may not cause the failure of the sun to rise means that this is not a good example, and surely Laplace only meant this to represent a statement of knowledge given only the fact of n successes in N (independent) trials.
The debate that this formula stirred up has led to over 200 years of objections and woolly thinking. Nevertheless, we stress that this formula is an extremely useful rule for us, especially since it converts data into estimates for probabilistic model parameters that we can employ within models for random (biased) processes. In particular, we can use it to calibrate multinomial models and, by extension, Markov models.
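As an illustration of that last remark, each row of a Markov transition matrix can be treated as its own multinomial and calibrated with the same rule. This is a sketch under the assumption of a first-order chain over a known, finite set of states; the function name, state labels and example sequence are invented for the purpose.

```python
from fractions import Fraction

def calibrate_markov(sequence, states):
    """Estimate P(next | current) for a first-order chain using Laplace's Law.

    Each row is a multinomial over the K states, estimated as
    (count + 1) / (row total + K), so no transition is ever given
    probability zero or one.
    """
    K = len(states)
    counts = {s: {t: 0 for t in states} for s in states}
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    matrix = {}
    for s in states:
        row_total = sum(counts[s].values())
        matrix[s] = {t: Fraction(counts[s][t] + 1, row_total + K) for t in states}
    return matrix

states = ["A", "B", "C"]
sequence = list("AABABBAAAB")          # hypothetical calibration data; C never occurs
P = calibrate_markov(sequence, states)
for s in states:
    print(s, {t: str(P[s][t]) for t in states})
```

State C is never visited in the calibration data, so its row falls back to the prior (all transitions equally likely), while transitions into C from A and B keep a small nonzero probability.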