Causality in the Sciences

Phyllis McKay Illari, Federica Russo, and Jon Williamson

Print publication date: 2011

Print ISBN-13: 9780199574131

Published to Oxford Scholarship Online: September 2011

DOI: 10.1093/acprof:oso/9780199574131.001.0001


A new causal power theory

Chapter:
(p.628) 30 A new causal power theory
Source:
Causality in the Sciences
Author(s):

Kevin B. Korb

Erik P. Nyberg

Lucas Hope

Publisher:
Oxford University Press
DOI:10.1093/acprof:oso/9780199574131.003.0030


Keywords:   causal power, causal information, mutual information, causal Bayesian networks, intervention

Abstract

The causal power of C over E is (roughly) the degree to which changes in C cause changes in E. A formal measure of causal power would be very useful, as an aid to understanding and modelling complex stochastic systems. Previous attempts to measure causal power, such as those of Good (1961), Cheng (1997), and Glymour (2001), while useful, suffer from one fundamental flaw: they only give sensible results when applied to very restricted types of causal system, all of which exhibit causal transitivity. Causal Bayesian networks, however, are not in general transitive. We develop an information‐theoretic alternative, causal information, which applies to any kind of causal Bayesian network. Causal information is based upon three ideas. First, we assume that the system can be represented causally as a Bayesian network. Second, we use hypothetical interventions to select the causal from the non‐causal paths connecting C to E. Third, we use a variation on the information‐theoretic measure mutual information to summarize the total causal influence of C on E. Our measure gives sensible results for a much wider variety of complex stochastic systems than previous attempts and promises to simplify the interpretation and application of Bayesian networks.

30.1 Theories of causal power

Intuitively, causal power is the strength of the connection from a cause to an effect: the power of a drug to palliate a patient; the power of a medicine to cure a patient; the power of a carcinogen to kill a patient. Probably everyone concerned with understanding causal relationships would prefer to replace our intuitions about these powers with a formal measure of causal power. AI researchers would like a tool which could explain observed effects in terms of the most important observed causes in a causal Bayesian network. Cognitive psychologists studying human causal reasoning want a well‐founded account of causal power in order to better assess human judgment and the effectiveness of training in causal reasoning. Philosophers of science would like a criterion that could help with theories of causation (both type and token). Perhaps, eventually, we can hope for tools to help us assess moral and legal responsibility. While these aims are all distinct, they are also all related.

(p.629) Over the last century there have been several notable attempts at producing such an analysis of causal power. We begin by briefly reviewing these.

30.1.1 Wright's theory

The first such theory can fairly be attributed to Sewall Wright (1934). He developed the first formal theory of graphical causal models, namely linear path models, which gave rise to the structural equation modelling which has dominated causal analysis in the social and biological sciences ever since. Path models are standardized linear models, where every variable takes the unit normal distribution N(0,1), directed arcs between variables indicate direct causal connections (e.g. C → E), and each arc is assigned a path coefficient ρ_EC relating its cause C (or parent) to its target E (or child). Wright demonstrated a strict relation between the path coefficients and linear correlations, allowing path coefficients to be calculated from observed correlations and vice versa. Between any two variables X and Z:

  1. Each active path contributes a correlation equal to the product of the path coefficients along it (e.g. X → Y → Z contributes ρ_YX ρ_ZY).

  2. The total correlation is the sum of the contributions of all active paths.

Φ_k is an active path from X to Z if and only if it is a sequence of arcs connecting X to Z where no arc points backwards after an arc has pointed forwards. Active paths, therefore, are either chains, forwards or backwards, or paths that relate two variables via a common ancestor. They cannot include collisions (e.g. X → Y ← Z).1

Wright did not explicitly attempt to characterize causal power. However, one straightforward way of doing so with path models would be to select only those paths that are forward chains from X to Y, and calculate the amount of correlation due to these paths. Another possibility is to apply the concept of intervention. Interventions can be represented by variables new to the modelled system, intentionally introduced to influence the value of one or more system variables. What we can call perfect interventions are those which successfully target a single variable, setting it to a particular distribution in a deterministic way, without regard for the original parents. While in reality perfect interventions are rare (Korb et al., 2004), they can be very useful in developing theory; for one thing, graphically they can be represented simply by setting the target variable to its intended distribution and cutting all in‐bound arcs to it. Suppose we apply such an intervention to C in a path model, imposing the unit normal distribution N(0,1). Only the forward chains from C to E will transmit the results of our intervention, and (p.630) thus the resultant correlation with E will be equivalent to picking out these paths by hand. Whichever way we apply Wright's theory for analysing causal power, the resultant formula is:

Definition 30.1 (Wright's (implicit) causal power measure)

The causal power of C for E is:

$$\mathrm{CP}(C, E) = \sum_{k} \prod_{X_m \to X_l \in \Phi_k} \rho_{lm}$$

for all forward chains Φ_k = C → … → E and for all arcs X_m → X_l ∈ Φ_k.

We believe this is a perfectly fine theory of causal power, as far as it goes. It is limited to recursive models, since non‐recursive models lack directionality for some of their arcs. This can be interpreted as ignorance, either about arc orientation or about the possibility of unknown common causes relating the correlated variables. In either case, recursive models can be viewed as representing the underlying reality, with causal power being unknown until that reality is better revealed. So this limitation is not a defect in Wrightean power theory; it is a feature that all causal power theories ought to share. The problem is that Wright's theory is limited to linear Gaussian models, and many systems are nonlinear.
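To make Definition 30.1 concrete, here is a minimal Python sketch (not from the chapter) that enumerates the forward chains of a small hypothetical path model and sums the products of their coefficients; the variables and coefficients are invented for illustration.

```python
# Hypothetical standardized path model (all variables ~ N(0,1)).
# coeffs[child][parent] = path coefficient rho for the arc parent -> child.
coeffs = {
    "D": {"C": 0.5},
    "E": {"C": 0.3, "D": 0.4},
}

def forward_chains(coeffs, cause, effect, path=None):
    """Yield every directed chain cause -> ... -> effect."""
    path = [cause] if path is None else path
    if cause == effect:
        yield path
        return
    for child, parents in coeffs.items():
        if cause in parents and child not in path:
            yield from forward_chains(coeffs, child, effect, path + [child])

def wright_causal_power(coeffs, cause, effect):
    """Definition 30.1: sum over forward chains of the product of path coefficients."""
    total = 0.0
    for chain in forward_chains(coeffs, cause, effect):
        prod = 1.0
        for parent, child in zip(chain, chain[1:]):
            prod *= coeffs[child][parent]
        total += prod
    return total

# Chains C -> E and C -> D -> E contribute 0.3 + 0.5 * 0.4 = 0.5.
print(wright_causal_power(coeffs, "C", "E"))
```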

30.1.2 Good's theory

Good (1961) made the earliest explicit proposal for a causal power measure. He intended it to be more generally applicable than Wright's measure, by encompassing all kinds of multinomial networks (i.e. ones with discrete variables). Good made some general assumptions about the nature of a putative causal power measure, and thereby derived something equivalent to his Bayesian ‘weight of evidence’ formula (and vaguely analogous to electrical conductivity and resistance):

Definition 30.2 (Good's basic causal power measure)

The causal power of C = c to produce E = e is:

$$Q(e : c) = \log \frac{P(\neg e \mid \neg c)}{P(\neg e \mid c)}$$

provided that any dependency is entirely due to C affecting E.

Q(e : c) plays a similar role to CP(C, E) in Wright's theory. Like Wright, Good suggests a formula for calculating the causal power of a chain by combining the causal power of component arcs and also for additively calculating the causal power of multiple causal chains.
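A small Python sketch of Definition 30.2 (our illustration, with invented probabilities) shows how Q behaves when c raises the probability of e:

```python
import math

def good_Q(p_not_e_given_not_c, p_not_e_given_c):
    """Good's weight-of-evidence measure Q(e:c) = log[ P(~e|~c) / P(~e|c) ],
    meaningful only when the whole dependency is due to C affecting E."""
    return math.log(p_not_e_given_not_c / p_not_e_given_c)

# Hypothetical numbers: e occurs with probability 0.3 without c and 0.6 with c.
q = good_Q(1 - 0.3, 1 - 0.6)
print(round(q, 3))   # ~0.56 nats > 0, so c counts as promoting e
```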

Unfortunately, Good's general assumptions do not hold in many multinomial networks. (p.631)

  • He assumes he can treat all variables as if they are binary, e.g. comparing c to ¬c while ignoring any differences between sub‐states of ¬c.

  • Hence, his method of calculating the power of a causal chain, while it may hold for genuine binary variables, fails in general for discrete networks. For example, it entails causal transitivity: positive causal power from c to d and d to e implies positive causal power from c to e. This is inevitable for linear causal models, and for some others discussed below, but not for all multinomial networks. His formula can also give different answers depending upon the precise path, even if the end‐to‐end dependency is the same (Salmon, 1980), which contradicts his own assumptions.

  • Good assumes that the power of multiple chains can simply be added, a method that fails wherever there is any causal interaction (even in binary networks).

  • Good also provides no way of distinguishing causal from non‐causal dependency paths.

Clearly, then, Good's theory is unsatisfactory as a general account of causal power.

30.1.3 Cheng's theory

Patricia Cheng (1997) developed her ‘power PC’ theory as an improvement over Rescorla and Wagner (1972), and it has been further developed by Glymour and Cheng (1998), Novick and Cheng (2004), and Glymour (2001). Cheng begins with a measure of positive statistical relevance, or ‘positive probabilistic contrast’

$$\Delta P_c = P(e \mid c) - P(e \mid \neg c) \geq 0$$

which indicates ‘candidate generative causation’, echoing Suppes (1970), who called this prima facie causation. c is only a prima facie cause because the probabilistic contrast may actually be caused by a common ancestor that raises the probability of c and e occurring together. Suppes went on to lay down temporal and statistical conditions aiming to rule such cases out. These efforts have now been subsumed by developments in Bayesian network technology (cf. Twardy and Korb, 2004). Cheng rules these cases out by laying down some very stringent requirements for the causal relationships permitted in her models. First, the occurrence of c must be independent of all other parents of E. This implies either a limited causal structure in which there are no causal paths between C and these parents, or that the effect of these paths can be cancelled by fixing some background variables (which is not possible in some graphs). Second, the dependency between c and e must be independent of the dependency between e and any other parent, implying that there can be no causal interaction between C and any other parents of E.

(p.632) Given these restrictions, it is clear that the probabilistic contrast must be caused by c. In other words, the occurrences of c must be ‘generating’ the additional occurrences of e. Cheng now defines the causal power of c for e as the probability that any given occurrence of c will generate e. This causal power of c is labelled p_c, leaving e implicit. Her basic insight is that ΔP is not a fair measure of p_c. There is a specific background rate at which e occurs even without c, namely P(e | ¬c). This means that we can only detect the impact of c on the remaining instances of E, by measuring how many background occurrences of ¬e are converted to e. ¬e occurs with a background frequency of 1 − P(e | ¬c); it is converted with a frequency of ΔP; and therefore, the success rate of c must be the ratio of these two quantities.2 Hence:

$$p_c = \frac{\Delta P_c}{1 - P(e \mid \neg c)}.$$
(30.1)

In contrast, a negative ΔP indicates ‘candidate preventive causation’, in which c appears to prevent e from occurring. To analyse this, Cheng places the same stringent restrictions on the parental relationships. She then defines the causal power of c to prevent e in an analogous way, as the probability that any given occurrence of c will prevent e. To distinguish prevention from generation, we write preventive powers as p̄_c. By similar reasoning, we can only detect the success rate of c against the background rate of e, namely P(e | ¬c). Thus:

$$\bar{p}_c = \frac{-\Delta P_c}{P(e \mid \neg c)}.$$
(30.2)

Cheng claims that these formulae are a significant improvement on previous theories, such as that of Rescorla and Wagner (1972), because (among other reasons) the formula for p_c provides the correct answer when e always occurs. If e always occurs, then the value for p_c is undefined, rather than a power of zero, as Rescorla and Wagner had suggested. Cheng deems leaving p_c unspecified to be correct because we should be unable to statistically assess the candidate causes of a universal event. Similarly, the value of p̄_c is undefined when e never occurs. However, we do not see this feature of her theory as a significant advantage. Rescorla and Wagner might well reply that no candidate cause could demonstrate any statistical power over a universal event, and therefore in such cases zero is a reasonable statistical assessment of causal power.
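As an illustration of Equations (30.1) and (30.2), the following Python sketch (ours, with invented probabilities) computes generative and preventive power PC, returning None where Cheng leaves the power undefined:

```python
def cheng_generative_power(p_e_given_c, p_e_given_not_c):
    """Equation 30.1: p_c = DeltaP / (1 - P(e|~c)), for DeltaP >= 0."""
    denom = 1 - p_e_given_not_c
    if denom == 0:                      # e is universal: power undefined
        return None
    return (p_e_given_c - p_e_given_not_c) / denom

def cheng_preventive_power(p_e_given_c, p_e_given_not_c):
    """Equation 30.2: p-bar_c = -DeltaP / P(e|~c), for DeltaP <= 0."""
    if p_e_given_not_c == 0:            # e never occurs: power undefined
        return None
    return -(p_e_given_c - p_e_given_not_c) / p_e_given_not_c

# Hypothetical generative case: P(e|c) = 0.6, P(e|~c) = 0.2.
print(round(cheng_generative_power(0.6, 0.2), 3))   # 0.5: c converts half of the ~e cases
# Hypothetical preventive case: P(e|c) = 0.1, P(e|~c) = 0.4.
print(round(cheng_preventive_power(0.1, 0.4), 3))   # 0.75: c prevents three quarters of the e cases
```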

(p.633) The fundamental problem with Cheng's measure is that it has an extremely limited range of application.

  • Like Good's theory, it is only applicable to questions about causal relations between values, as opposed to the variables themselves (which were addressed by Wright's theory).

  • Like Good, Cheng treats all variables as if they are binary. Admittedly, this does not lead to the same contradictions in calculating the causal power of chains and networks with complex relations or structures. But then Cheng does not offer any way to calculate causal power in such cases.

  • The structural independence restrictions upon parents are very severe, and will not be met by many Bayesian networks. This just leaves causal power undefined, despite the fact that C is clearly affecting E.

  • Cheng's blanket ban on any causal interactions between parent variables is necessary to make her derivations of (30.1) and (30.2) work, but as Glymour (2001) has shown, it limits Cheng's theory to noisy‐OR Bayesian networks. Novick and Cheng (2004) relax this last restriction by combining interactive parents in new variables. However, that ad hoc solution is, on the one hand, computationally infeasible when many parents of large arity are involved, and, on the other hand, leaves the restriction to non‐interaction between all remaining parents untouched.3

  • A notable consequence of the restriction to noisy‐OR networks is causal transitivity. This, in fact, is a property of all the accounts discussed so far. Yet any account of causal power that entails transitivity is misleading, since causation in general is not transitive–a fact which is reflected in other types of Bayesian network. Take, for example, Richard Neapolitan's case of finasteride (Neapolitan, 2003). Finasteride reduces testosterone levels; lowered testosterone levels can lead to erectile dysfunction. However, finasteride fails to reduce testosterone levels sufficiently to cause erectile dysfunction. Such threshold effects do not occur in linear or noisy‐OR networks, but they are common elsewhere.

Cheng's theory was intended, in part, to provide a psychological model for the causal attributions made by ordinary folk. Whatever its merits may be for this purpose, it lacks the generality that we would like from a sophisticated causal power measure.

30.1.4 Desiderata for causal power

(p.634) Having briefly reviewed prior accounts of causal power, we can invert the list of their several or collective drawbacks to generate a list of features that would be desirable in a new account:

  1. Wright's implicit causal power theory for linear models appears to be fine within its domain. Hence, if any new theory is applied to linear models, we should like it to attribute powers that directly correspond to Wright's. Specifically, causal powers between variables in linear networks should be ranked in the same order.

  2. The measure should additionally be more general than Wright's theory by being applicable to all kinds of Bayesian network: even those with complex variables, structures, and dependencies.

  3. The measure should not entail transitivity–simply because causation is not, in general, transitive. Of course, the measure needs to reflect transitivity when it appears.

  4. The measure should be compatible with intervention. It should support the fundamental idea, illustrated above with Wright's theory, that interventions test causal power.

  5. The measure should have an information‐theoretic interpretation. This desideratum is not motivated by prior considerations, but we adopt it since causality gives rise to probabilistic relationships, which then ought to be interpretable using Shannon's information measure.4

Prior measures fulfil some of these requirements, but none of them successfully fulfils them all.

30.2 Causal Bayesian networks

There is a new paradigm for modelling probabilistic causal systems, arising from new technology, namely causal Bayesian networks. Such networks offer a powerful and general way to represent all kinds of stochastic causal relationships, and are being deployed in both theoretical and practical applications across a wide range of disciplines. Our own proposal for analysing causal power is (simultaneously) a proposal for reading the causal stories implicit within these networks, even if the entire network is too complex to fully comprehend.

Bayesian networks, popularized by Pearl (1988), use directed acyclic graphs to represent probabilistic relationships between random variables, e.g. C → D → E. There is an elementary conditional probability function P(D | π_D)


Fig. 30.1 (a) Fisher's smoking model; (b) Fisher's model with intervention.

associated with each node, which specifies a probability distribution for its variable, D, that depends only upon its parents, π_D (here only C), and not upon other variables (such as E). The linear models of Wright and the noisy‐OR networks of Cheng are special cases. But in general the conditional probability functions are unrestricted and can include nonlinear interactions such as XOR relationships or threshold effects.

Bayesian networks were not originally intended to be interpreted causally: they were simply maps of probabilistic dependence, in which the arcs might be oriented in an anti‐causal direction (e.g. C ← D). But in a causal Bayesian network the arcs are also supposed to reflect the direction of causation, and this interpretation has become increasingly important. Many causal discovery algorithms have been developed to learn causal Bayesian networks from data, and they have been quite successful (e.g. Verma and Pearl, 1990; Spirtes, Glymour, and Scheines, 2000; Neapolitan, 2004; Korb and Nicholson, 2004).

Given an explicitly causal Bayesian network, it becomes possible to model the implied effects of interventions–and this can be crucial for determining the causal story. For example, the fact that smoking is correlated to cancer is usually explained by the model Smoking → Cancer. But Sir Ronald Fisher proposed Figure 30.1(a) as an alternative model. His point was two‐fold: (1) observational, correlational data alone cannot distinguish between the two models; (2) interventional data can do so. If his model were right and we intervened to force people to smoke, as in Figure 30.1(b), then there would be no resulting increase in cancers–whereas the usual model implies just such an increase. Thus, interventions can't be modelled without causation, and conversely, interventions can expose the difference between correlation and causation.

30.3 Causal information

Given these tools, we can now present our solution to the problem of measuring causal power: causal information.5 Given some causal Bayesian network, (p.636) the problem is to state the causal power of one variable, C, over another variable, E, implied by that network.6

30.3.1 Background conditions: ψ_h

Causal questions are always put relative to some background conditions. The propensity of smoking to induce lung cancer, for example, is likely to be an issue for a large class of adult humans, but not, perhaps, for humans who already have cancer. A full account of how background context should be modelled is a difficult and unsolved problem and one which surely must involve a treatment of conversational implicature and psychological theory. What is counted as an appropriate context depends upon our interests, perhaps varying from moment to moment as we shift from a historical query to a counterfactual query. Without attempting to provide an analysis of such complex issues, we simply point out that Bayesian networks offer some useful resources for representing context. Here we will simply represent such conditions by identifying a set of network variables whose values should be fixed, Ψ = ψ_h.7 Thus, all the probabilities discussed in the following sections will be conditional probabilities of the form P(· | ψ_h), but for brevity we will omit this condition in our formulae for causal power.

30.3.2 Hypothetical interventions: P*(C)

As we have seen, intervention upon C provides a straightforward way to distinguish between non‐causal paths to E, e.g. those that run through common ancestors, and causal paths to E, i.e. forward chains. For this purpose we will apply hypothetical interventions that are targeted strictly at C, are independent of any other parents of C, overwhelm their effects, and impose a specific distribution on C.8 So we augment the model M to M*, with just one new intervention node and arc I_C → C, and just one new elementary conditional probability function P*(C | π_C, I_C) over C, replacing the original P(C | π_C). Since the intervention is overwhelming, when I_C = Yes all inferential paths that begin backwards from C are cut. Since the intervention is stochastic, C still varies, and therefore dependency can still be transmitted by any path that begins forwards from C. For brevity, we will assume that I_C = Yes has been added to ψ_h whenever we refer to P*(·).
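The arc‐cutting construction can be sketched in a few lines of Python. The toy network, its CPTs, and the function names below are all invented for illustration; the point is only that an overwhelming intervention replaces the target's conditional probability function and severs its in‐bound arcs.

```python
# Toy discrete causal Bayesian network X -> C -> E, stored as CPTs.
model = {
    "X": {"parents": [], "cpt": {(): {"x0": 0.5, "x1": 0.5}}},
    "C": {"parents": ["X"], "cpt": {("x0",): {"c0": 0.8, "c1": 0.2},
                                    ("x1",): {"c0": 0.3, "c1": 0.7}}},
    "E": {"parents": ["C"], "cpt": {("c0",): {"e0": 0.9, "e1": 0.1},
                                    ("c1",): {"e0": 0.4, "e1": 0.6}}},
}

def intervene(model, target, new_dist):
    """Return M*: cut all in-bound arcs to `target` and impose P*(target)."""
    m_star = dict(model)
    m_star[target] = {"parents": [], "cpt": {(): dict(new_dist)}}
    return m_star

# An overwhelming but stochastic intervention on C, here reimposing
# C's original marginal distribution (0.55 = 0.5*0.8 + 0.5*0.3).
m_star = intervene(model, "C", {"c0": 0.55, "c1": 0.45})
print(m_star["C"])   # C now has no parents, so backpaths through X are cut
```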

(p.637) But what intervention distribution should be imposed upon C? There are three alternative choices that strike us as reasonable, each serving a slightly different purpose.

Original P*(C)

The first option is to reassert the original distribution over C. The new model M* will still differ from M; however, by reimposing the original distribution on C we minimize those differences. Not only are all the causal paths between C and E preserved, but also the variation in C itself. The similarity between M and M* means that the causal power of C over E in M* reflects the original situation in M as closely as possible. For example, we can use M* to consider, ‘Given the variation in blood pressure among the general population, how much is this variable affecting heart attack outcomes?’ We should note, however, that even if we impose the original distribution upon C, the resulting distribution upon E may still be considerably different in M* than it was in M, simply because (as intended) C is no longer dependent on its original parents.

Uniform P*(C)

We may not always wish to measure causal power relative to the original distribution over C. For example, in some subpopulation which regularly exercises and mostly eats fish, there may be very little natural variation in blood pressure. In consequence, the connection between blood pressure and heart attack outcomes will be concealed by this healthy lifestyle. So one way to bring out this latent feature of M is to consider a different intervention distribution over C, even though it is not the naturally occurring distribution in M. Any investigation into the power of blood pressure in M would certainly not randomize its subjects so that they all fell into the low blood pressure group; instead, it might mimic randomized experimental design by distributing C uniformly, with equal numbers across blood pressure categories. Thus, one plausible choice is a uniform distribution on C, so that there are equal numbers of subjects in every blood pressure group. In comparing the effects of different blood pressures, this provides a ‘level playing field’ in which the results are not biased by different actual frequencies for these blood pressures. Similarly, in comparing the influence of variables, it provides a standard distribution for comparison.

To be most exact, we would impose some intervention distribution P*(C) such that after we take into account the background conditions, the resulting distribution P*(C | ψ_h) is uniform. That is, P*(c_i | ψ_h) = 1/|C| for each c_i. We note that to achieve this, P*(C) itself will not always be uniform.

Maximizing P*(C)

Another reasonable question to ask is: what is the maximum impact that C could possibly have on E, according to M? To be precise, we can search the (p.638) space of possible intervention distributions P*(C), to find those that maximize our causal power measure given the background conditions.

Note that this will not always be the uniform distribution considered above, even though for unbiased channels the uniform distribution maximizes mutual information. The ‘channel’ here–of causal power–will frequently be biased. Suppose, for example, that there are only three blood pressure categories, and while both low and medium blood pressures result in a similar risk of heart attack, high blood pressure results in a much higher risk. Then the maximum probabilistic dependence between Blood Pressure and Heart Attack will result from a distribution in which nearly 50% of subjects have high blood pressure, rather than 33%.

These three possible intervention distributions are complementary, attempting to measure three different forms of causal power of C over E: the original causal power, a uniform causal power, and the maximum causal power. Either of the latter two standardizes causal power comparisons between models in that it eliminates the influence of diversity in prior distributions over C. In the formulae that follow we leave open the choice of intervention distribution, which is simply denoted P*(C). But to make illustrative computations, we will imagine that the original distribution has been imposed upon Blood Pressure to measure its causal power over Heart Attack, as in Figure 30.2.
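The maximizing option can be approximated by a direct search over intervention distributions. The following Python sketch uses invented risk figures in the spirit of the blood‐pressure example (low and medium values carry similar risk, the high value a much larger one) and a crude grid search; it is illustrative only, not the chapter's procedure.

```python
import numpy as np

# Hypothetical P*(e | c_i) for a ternary cause and a binary effect.
p_e_given_c = np.array([0.05, 0.06, 0.40])

def ci_CE(p_c):
    """CI(C,E) in bits under intervention distribution p_c (cf. Equation 30.6)."""
    p_e = float(p_c @ p_e_given_c)
    total = 0.0
    for pc_i, pe_i in zip(p_c, p_e_given_c):
        for p_cond, p_marg in ((pe_i, p_e), (1 - pe_i, 1 - p_e)):
            if p_cond > 0:
                total += pc_i * p_cond * np.log2(p_cond / p_marg)
    return total

# Grid search over intervention distributions P*(C) on the simplex (5% steps).
best_ci, best_p = max(
    (ci_CE(np.array([i, j, 100 - i - j]) / 100.0), (i, j, 100 - i - j))
    for i in range(0, 101, 5) for j in range(0, 101 - i, 5))
print(best_p, round(best_ci, 3))
# Well over a third of the mass lands on the high-risk value:
# the uniform distribution is not the maximizer for this biased 'channel'.
```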

30.3.3 Causal information formulae

Two values: c and e

We begin with the simplest formula and then work our way through the more complicated ones. Each formula answers a slightly different causal question.

In particular, we begin with the question: what is the causal power of one value, c, to affect another value, e? Value‐to‐value questions such as ‘If I have high blood pressure, then how much does this affect my risk of having a heart attack?’ are quite common. The causal information answer is:


Fig. 30.2 An original intervention on blood pressure, to measure its causal power.

(p.639)
$$\mathrm{CI}(c, e) = P^*(e \mid c) \log \frac{P^*(e \mid c)}{P^*(e)}.$$
(30.3)

In information theory, this formula gives the information about e that is provided by the discovery that C = c compared to knowing the prior distribution P*(C). Given that only causal paths are active, we suggest that this formula can also serve as a good measure of the causal power of C = c to affect the probability of E = e. For example, suppose we observe that a patient has high blood pressure, c. This increases the probability of having a fatal heart attack, e, from the average probability P*(e) = 0.14 to P*(e | c) = 0.23. So P*(e | c)/P*(e) = 1.63. This is converted to a logarithm base 2,9 which takes the positive value 0.71. It is multiplied by the probability of having a heart attack given high blood pressure, 0.23. So the causal power of high blood pressure to promote heart attack is 0.16.

Causal information compares P*(e | c) to the marginal probability P*(e), rather than the complementary probability P*(e | ¬c). This is similar to the standard formula for statistical relevance (SR) used in philosophy, rather than the standard formula used in psychology (Δp). Causal information compares these two probabilities as a ratio rather than a difference, thus measuring the proportional change (like the Bayes factor in confirmation theory) rather than the absolute change (like SR). This proportion is converted to a logarithm, which is usual in information theory. This logarithm is then weighted by the probability P*(e | c). The CI(c,e) measure is positive for promoting causes and negative for preventive causes, just like Δp.

Note that in this formula the prior probability of high blood pressure, P*(c), does not feature as a weighting factor. C = c is treated as a given, as in the example question ‘If I have high blood pressure,…’, so we set P*(c) = 1.
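The worked example above can be checked in one line of Python; the probabilities 0.14 and 0.23 are the chapter's own illustrative figures.

```python
from math import log2

def ci_value_value(p_e_given_c, p_e):
    """Equation 30.3: CI(c,e) = P*(e|c) * log2( P*(e|c) / P*(e) )."""
    return p_e_given_c * log2(p_e_given_c / p_e)

# Blood-pressure illustration: P*(e) = 0.14, P*(e|c) = 0.23.
print(round(ci_value_value(0.23, 0.14), 2))   # 0.16 bits
```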

Variable causes: C and e

The next formula addresses the question: what is the causal power of one variable, C, to affect a particular value, e?

$$\mathrm{CI}(C, e) = \sum_{c \in C} P^*(c) \, P^*(e \mid c) \log \frac{P^*(e \mid c)}{P^*(e)}.$$
(30.4)

This formula gives the expected information about e that will be provided by discovering the value of C, whatever that turns out to be, compared to knowing the distribution P*(C). The difference between this and Equation 30.3 is that the value of C is no longer treated as a given. Instead, we take the information (or power) from each individual value c_i, and weight this by the probability P*(c_i), to calculate the expected value. We suggest that this formula can also serve as a good measure of the causal power of C to affect the probability of E = e. For example, ‘How much does variation in blood pressure affect the risk of having a heart attack?’ is a variable‐to‐value question.

Note that some of the individual figures for causal power will be positive, and other figures will be negative. If we took a weighted average of the absolute magnitudes, then this would be the expected magnitude of the causal power exerted when C takes a specific value. However, the information‐theoretic formula given above does not use absolute magnitudes, and negative individual powers will partially offset the positive ones. Therefore, CI(C, e), while useful for comparing alternative models, cannot be directly compared to the magnitude of CI(c, e). The CI(C, e) measure will always be positive, provided that C has some effect, i.e. ∃i, j : P*(e | c_i) ≠ P*(e | c_j), and otherwise it will be zero (see Appendix B).

Variable effects: c and E

What is the causal power of one particular value, c, to affect a variable, E?

$$\mathrm{CI}(c, E) = \sum_{e \in E} P^*(e \mid c) \log \frac{P^*(e \mid c)}{P^*(e)}.$$
(30.5)

This formula gives the total information about E that is provided by the discovery that C = c compared to knowing the distribution P*(C). The difference between this and Equation 30.3 is that we are interested in all the values of E, not just one e. So we take the information from c for each individual value e_i, and add them to calculate the total value. We suggest that this formula can also serve as a good measure of the total causal power of c to affect the probability of E. For example, ‘How much does having high blood pressure affect heart attack outcomes?’ is a value‐to‐variable question. Again, note that our information‐theoretic formula does not use absolute magnitudes, and the negative individual powers will partially offset the positive ones. The CI(c, E) measure is equivalent to the Kullback–Leibler divergence between P*(E | c) and P*(E), which is always positive, provided that there is some difference between the distributions.

Two variables: C and E

What is the causal power of one variable, C, to affect another variable, E?

$$\mathrm{CI}(C, E) = \sum_{c \in C} \sum_{e \in E} P^*(c) \, P^*(e \mid c) \log \frac{P^*(e \mid c)}{P^*(e)}.$$
(30.6)

This formula gives the expected information about E that will be provided by discovering the value of C, whatever that turns out to be. It uses both the weighted average over the values of C and the sum over the values of E. We suggest that this formula can also serve as a good measure of the total causal power of C to affect the probability of E. For example, ‘How much does variation in blood pressure affect heart attack outcomes?’ is a variable‐to‐variable question. Again, the negative individual powers will partially offset the positive ones, but the CI(C,E) measure will always be positive, provided that C has some effect on E.

The number of alternative formulae reflects the fact that there are several related questions about the causal power of C over E. So it is important to disambiguate informal queries such as ‘How much does blood pressure affect heart attacks?’ before attempting to find an answer.
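All four formulae can be computed together from the intervention distribution P*(C) and the conditional table P*(E | C). The Python sketch below is ours; the ternary cause, binary effect, and all the numbers are invented, and strictly positive probabilities are assumed.

```python
import numpy as np

def ci_measures(p_c, p_e_given_c):
    """Equations 30.3-30.6, given P*(C) as a vector and P*(E|C) as a table
    with one row per value of C and one column per value of E."""
    p_c = np.asarray(p_c, dtype=float)
    p_e_given_c = np.asarray(p_e_given_c, dtype=float)
    p_e = p_c @ p_e_given_c                             # marginal P*(E) under the intervention
    cell = p_e_given_c * np.log2(p_e_given_c / p_e)     # CI(c,e) for every pair   (30.3)
    ci_c_E = cell.sum(axis=1)                           # CI(c,E): sum over e      (30.5)
    ci_C_e = p_c @ cell                                 # CI(C,e): average over c  (30.4)
    ci_C_E = float(p_c @ ci_c_E)                        # CI(C,E): both            (30.6)
    return cell, ci_C_e, ci_c_E, ci_C_E

# Hypothetical ternary cause (rows: low, med, high) and binary effect.
cell, ci_C_e, ci_c_E, ci_C_E = ci_measures(
    [0.3, 0.5, 0.2],
    [[0.05, 0.95], [0.06, 0.94], [0.40, 0.60]])
print(round(ci_C_E, 3))   # total variable-to-variable causal power, in bits
```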

30.3.4 Mutual information

This last variable‐to‐variable equation can be transformed as follows:

$$\mathrm{CI}(C, E) = \sum_{c \in C} \sum_{e \in E} P^*(c) \, P^*(e \mid c) \log \frac{P^*(e \mid c)}{P^*(e)}$$
(30.7a)
$$= \sum_{c \in C} \sum_{e \in E} P^*(c, e) \log \frac{P^*(e \mid c)}{P^*(e)}$$
(30.7b)
$$= \sum_{c \in C} \sum_{e \in E} P^*(c, e) \log \frac{P^*(c, e)}{P^*(c) \, P^*(e)}$$
(30.7c)
$$= \mathrm{MI}(C, E).$$
(30.7d)

This shows that causal information is identical to the information‐theoretic quantity mutual information (MI), when applied to the two variables C and E, given the intervention upon C. The mutual information formula looks a little different. It compares the probability that c and e will occur together, P*(c,e), to the probability that they would occur together if the two variables were independent, P*(c) P*(e). Thus, it measures the amount of dependency that exists between each pair of variable values. The accumulated dependency for the two variables is obtained by weighting these ratios according to the probability that this pair of values will actually arise, P* (c, e). In fact, the causal information formula does the same thing, but it has been expressed in an asymmetrical fashion to suit the asymmetry between cause and effect.

By definition, mutual information is the expected amount of information that one variable provides about another (or the loss of information that arises by falsely assuming that they are independent).10 But, as above, it can also be interpreted as the amount of dependency between them. Therefore, it would be a good measure of causal power–except that some of this dependency can arise from non‐causal links. Causal information corrects this defect.

30.3.5 Entropy

(p.642) Mutual information is also closely related to the entropy measure of randomness. The information entropy on the variable E is defined as follows (Cover and Thomas, 1991):11

$$H(E) = -\sum_{e \in E} P(e) \log P(e).$$
(30.8)

Entropy is zero when P(E = e_i) = 1 for some value e_i, when there is no uncertainty about the value of E. It is maximized when P(E) is uniform across all the possible values of E, when uncertainty is highest.

Similarly, conditional entropy measures the randomness of one variable given knowledge of another:

$$H(E \mid C) = -\sum_{c \in C} \sum_{e \in E} P(c, e) \log P(e \mid c).$$
(30.9)

Thus:

$$\mathrm{MI}(C, E) = H(E) - H(E \mid C).$$
(30.10)

This supports the interpretation of mutual information as the reduction in the uncertainty of E due to the knowledge of C.
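Equation 30.10 is easy to verify numerically. The joint distribution in the Python sketch below is invented; the identity holds for any joint distribution.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical joint distribution over (C, E): rows are values of C, columns of E.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])
p_c = joint.sum(axis=1)
p_e = joint.sum(axis=0)

mi = sum(joint[i, j] * np.log2(joint[i, j] / (p_c[i] * p_e[j]))
         for i in range(2) for j in range(2) if joint[i, j] > 0)
h_e_given_c = -sum(joint[i, j] * np.log2(joint[i, j] / p_c[i])
                   for i in range(2) for j in range(2) if joint[i, j] > 0)

print(round(mi, 4), round(entropy(p_e) - h_e_given_c, 4))   # the two numbers agree
```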

30.4 Comparisons

Causal information has some clear advantages over the rival measures of causal power.

  • Causal information is well defined for all causal Bayesian networks. This includes all the restricted classes of network for which other measures were designed: linear models, Cheng's binary models and their extensions, and whatever models Good had in mind. But it also includes classes of network for which these rival measures are not well‐defined: e.g. ones with interactive causes, intransitivity, or multinomial (discrete) variables.

  • Causal information is well defined for a wider variety of questions. It relates any causal variable or value (either observed or observable) to any effect variable or value. It does so with a uniform approach, unlike Cheng's measure (for example), which uses a different formula for promoting and preventive causes.

  • Causal information yields appropriate results in all the restricted classes of network, where it mirrors the local properties. For example, in any network that exhibits causal transitivity, C → D → E implies that E is dependent upon C. But it follows immediately that CI(C, D) ≠ 0, CI(D, E) ≠ 0, and CI(C, E) ≠ 0. So causal information itself exhibits causal transitivity, simply by accurately summarizing the true amount of dependency. Similarly, in linear path models, causal information is a monotonically increasing function of the magnitude of correlation (Hope, 2008, Chap. 6). Therefore, the fact that other measures are necessarily transitive offers no advantage, even when they are applied to their own preferred class of network.

  • Causal information yields appropriate results in the other classes of network, where it does not impose inappropriate properties. For example, in any network that exhibits causal intransitivity or interaction, causal information itself exhibits intransitivity or interaction, since it is based directly upon the corresponding probability distributions. It follows that causal information can be applied uniformly, without making restrictive assumptions about the structure of the network or its dependencies.

30.4.1 Quantitative comparisons

We will now provide some quantitative comparisons of the competing causal power measures, by applying them to appropriate graphs.

30.4.2 CI versus Wright

Since Wright's coefficients are only defined for linear models, we have constructed a linear approximation of our discrete blood pressure graph, depicted in Figure 30.3. This includes just the variables Exercise, Blood Pressure, and Heart Attack. We have assumed that these variables have been converted to scalar quantities that are distributed according to the standardized Gaussian distribution N(0,1). Positive numbers therefore represent higher than average levels, and negative numbers represent lower than average levels.


Fig. 30.3 Linear approximation to the Blood Pressure graph.

(p.644) The size and direction of the dependencies in the discrete graph have been converted to similar path coefficients: −0.8 between Exercise and Blood Pressure; −0.2 between Exercise and Heart Attack; and +0.4 between Blood Pressure and Heart Attack.

The challenge is to measure the causal power of Blood Pressure over Heart Attack, despite their common cause Exercise, which forms the backpath

Blood Pressure ← Exercise → Heart Attack

Since there are two paths between Blood Pressure and Heart Attack, Wright's equations tell us that the total correlation between them is the sum of the correlations due to each of these paths:

$$r_{HB} = \rho_{HB} + \rho_{HX}\rho_{BX} = +0.4 + (-0.2 \times -0.8) = +0.56.$$
(30.11)

This figure obviously depends upon the strength of the backpath. Yet the arc directions imply that Blood Pressure does not actually exert any causal influence by this path. In fact, since this is a linear graph, either changing the path coefficients on this path or conditioning upon Exercise can have no effect upon the causal dependency between Blood Pressure and Heart Attack. So, it follows that any measure that is affected by the backpath is incorrect.

To convert Wright's approach to a measure of causal power, we can pick out just the directed causal paths from Blood Pressure to Heart Attack, and calculate their contribution alone:

$$r_{HB} = \rho_{HB} = +0.4.$$
(30.12)

This is a plausible measure since it only depends upon the causal path and increases with the size of the correlation that it induces. Ignoring causal direction in this case would result in a 40% overestimate of causal power (as measured by correlation).

To apply the CI measure, we can simulate an overwhelming intervention upon Blood Pressure with an intervention distribution that is the original marginal distribution, i.e. N(0,1). This overwhelms the arc between Exercise and Blood Pressure. Since the intervention has removed the backpath, the total correlation is now equal to the causal correlation. As noted above, CI is a strictly monotonically increasing function of this restricted causal correlation; i.e. the CI approach is equivalent to Wright's approach for determining which paths are causal and combining them to form a total causal correlation. The only substantial differences are (a) the information‐theoretic rescaling, and (b) the CI approach does not need any additional algorithm to identify the paths or make the calculations–once the intervention is added, any standard Bayesian updating algorithm can do that work. (p.645)
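The arithmetic behind Equations (30.11) and (30.12) is simple enough to check directly; the coefficients below are those of Figure 30.3.

```python
# Path coefficients from Figure 30.3:
# Exercise -> Blood Pressure: -0.8, Exercise -> Heart Attack: -0.2,
# Blood Pressure -> Heart Attack: +0.4.
rho_HB, rho_HX, rho_BX = 0.4, -0.2, -0.8

r_total = rho_HB + rho_HX * rho_BX   # causal path plus the backpath (30.11)
r_causal = rho_HB                    # forward chains only, i.e. post-intervention (30.12)

print(round(r_total, 2), r_causal)        # 0.56 vs 0.4
print(round(r_total / r_causal - 1, 2))   # 0.4: a 40% overestimate from the backpath
```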


Fig. 30.4 Noisy‐OR approximation to the blood pressure graph.

30.4.3 CI versus Cheng

Cheng's measure can be applied to any discrete graph, but, unlike CI, it will not generally be valid if her strict assumptions are not met. To illustrate, we have constructed an alternative version of our discrete blood pressure graph, which has been simplified to accommodate Cheng's restrictions (Figure 30.4):

  1. We have removed the connection between Exercise and Blood Pressure, so that they are not marginally dependent.

  2. Cheng's formulae are defined for binary variables. We can only apply the power PC theory to our ternary variables by partitioning the states into two mutually exclusive subsets. We shall let the causal event c be high blood pressure, so that ¬c is either medium or low blood pressure. We assume that the relative marginal probabilities of these two sub‐states of ¬c remain the same. We shall let the effect event e be a fatal heart attack, and the other possible cause of this event be low exercise.

  3. We have altered the effects of low exercise and high blood pressure on fatal heart attacks so that their true interaction is replaced by a noisy‐OR approximation. Specifically, there is a 20% chance that high blood pressure alone will result in a fatal attack; 12.5% for low exercise alone; 30% for both factors together; and 0% where neither factor is present. All necessary adjustments to the rates of fatal heart attack have been balanced by changes to the rates of non‐fatal heart attack.

These simplifications are not very realistic, but these or similar changes are necessary in order to satisfy Cheng's restrictions.

We can now apply Cheng's formula to assess the power of high blood pressure to promote fatal heart attacks. We are supposed to ignore Exercise, since it is assumed that this does not affect the causal power of Blood Pressure. So we shall just assume the marginal distribution over this variable. The result is (see Appendix A for computation details): (p.646)

$$p_c = \frac{\Delta p_c}{1 - P(e \mid \neg c)} = 0.20.$$

Cheng's figure of 0.20 is a measure of causal power, but it can also be given a straightforward probabilistic interpretation. It is supposed to be the probability that high blood pressure will kill someone, given that without high blood pressure they would survive. Since this is a noisy‐OR graph, this figure is perfectly accurate relative to that graph and is a perfectly reasonable way to measure how much high blood pressure promotes fatal heart attacks. Nevertheless, even in this case, CI does interestingly differ from Cheng's measure, beyond its different scaling.

The CI formula, where we assume an intervention upon Blood Pressure that imposes the original marginal distribution, reports the causal power:

$$\mathrm{CI}(c, e) = P^*(e \mid c) \log \frac{P^*(e \mid c)}{P^*(e)} = 0.37.$$

The CI figure of 0.37 is the expected number of bits saved by learning that C = c, when efficiently encoding the fatal outcome E = e. Whereas PC reflects only the proportional increase in risk, CI reflects the absolute increase in risk, through the weighting by P*(e | c): the absolute probability of a fatal heart attack given high blood pressure. In consequence, the CI measure, unlike power PC, can report different amounts of causal power depending upon the state of Exercise. For example, suppose that one does some exercise, so the chance of dying is low. In that case, high blood pressure would result in a bigger absolute increase to the chance of dying, and accordingly, CI(c, e) = 0.44. On the other hand, if one does low exercise, then the chance of dying is higher anyway, so having high blood pressure would result in a smaller absolute increase, and accordingly, CI(c, e) = 0.26. Clearly, there is a sense in which Exercise does affect the power of high blood pressure for fatal heart attacks. CI reflects this, whereas Cheng's measure does not.
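The figures quoted here (0.20 and 0.37) can be reproduced from the Appendix A quantities with a few lines of Python:

```python
from math import log2

# Quantities from Appendix A for the noisy-OR graph of Figure 30.4.
p_e_c = 0.22                       # P(e | BP = High), marginalizing over Exercise
p_e_low, p_e_med = 0.025, 0.025    # P(e | BP = Low), P(e | BP = Med)
p_e_not_c = p_e_low * (0.32 / 0.78) + p_e_med * (0.46 / 0.78)

power_pc = (p_e_c - p_e_not_c) / (1 - p_e_not_c)
print(f"{power_pc:.2f}")           # 0.20

p_star_e = 0.068                   # P*(e) under the original-distribution intervention
ci = p_e_c * log2(p_e_c / p_star_e)
print(f"{ci:.2f}")                 # 0.37
```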

So far we have found some differences between CI and power PC, without establishing any advantage over the latter. Suppose now that we reintroduce the original connection between Exercise and Blood Pressure. By violating Cheng's assumptions, this will increase the divergence between the CI and PC analyses. The power PC theory's assessment of causal power becomes:

$$p_c = \frac{\Delta p_c}{1 - P(e \mid \neg c)} = 0.22.$$

The backpath has caused Cheng's causal power estimate to increase by 10%. But in fact the noisy‐OR interaction has not changed, so the true probability that high blood pressure will kill has not changed. Regardless of one's level of exercise, high blood pressure still reduces the chance of survival by 20%. The 10% difference therefore represents an error in Cheng's estimate of (p.647)


Fig. 30.5 The original Blood Pressure graph with messy interaction and backpath.

causal power, overstating the power of high blood pressure. In contrast, the CI measure is unchanged, since the reintroduced arc is overwhelmed by the intervention distribution. This is a clear advantage of CI.

Now let us revert to our original graph, relating Exercise, Blood Pressure, and Heart Attack, as depicted in Figure 30.5. The PC calculation for it yields:

$$p_c = \frac{\Delta p_c}{1 - P(e \mid \neg c)} = 0.15.$$

While Cheng's measure can again be applied, it is now vulnerable to two sources of error: the backpath and the messy interaction. The interaction between Exercise and Blood Pressure creates an additional kind of problem for PC. Cheng's theory suggests that we can afford to ignore the state of Exercise and gives us only one figure for the power of Blood Pressure. But unlike the noisy‐OR graph, the true probability may differ depending upon one's specific level of Exercise–for example, low exercise and high blood pressure may have a synergistic effect in promoting fatal heart attacks. Given any such interaction, Cheng's measure will be unreliable for specific states of Exercise. The CI account acknowledges such possibilities and avoids errors due either to backpaths or to interaction effects.

Note also that for Figure 30.5 Cheng's causal power estimate has decreased from 0.20 to 0.15, a difference of 25%. The CI formula calculates causal power as CI(c, e) = 0.16, which is less than 50% of the 0.37 obtained from the noisy‐OR graph. This shows that on either measure, major errors resulted from using the noisy‐OR approximation–and so we are much better off using a more general measure applicable to arbitrary causal networks.

30.4.4 CI versus MI

Finally, we will contrast causal information with mutual information, again using the original graph in Figure 30.5. The MI formula applied to Blood Pressure and Heart Attack (assuming the original marginal distributions) is equivalent to: (p.648)

$$\mathrm{MI}(C, E) = \sum_{c \in C} \sum_{e \in E} P(c) \, P(e \mid c) \log \frac{P(e \mid c)}{P(e)}$$
(30.13)

which reports the mutual information as 0.28. In contrast, the CI formula is:

$$\mathrm{CI}(C, E) = \sum_{c \in C} \sum_{e \in E} P^*(c) \, P^*(e \mid c) \log \frac{P^*(e \mid c)}{P^*(e)}$$
(30.14)

calculating the causal power as 0.13, if we impose the original distribution on Blood Pressure. The difference between MI and CI is not due to scaling; it is because MI is affected by the backpath Blood Pressure ← Exercise → Heart Attack, while CI is not. So MI would be very misleading as a measure of causal power here: it doubles the realistic figure for the power of Blood Pressure over Heart Attack. It could lead to over‐prescription of blood pressure medication, rather than recommending lifelong exercise!
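The chapter's full conditional probability tables are not reproduced here, but the observational/interventional contrast is easy to sketch on an invented three‐variable network of the same shape (a common cause X of both C and E). All numbers below are hypothetical; only the second figure printed tracks causal power.

```python
import numpy as np

# Invented CPTs for X -> C, X -> E, C -> E, all variables binary.
p_x = np.array([0.5, 0.5])
p_c_given_x = np.array([[0.8, 0.2],                  # P(C | X = x0)
                        [0.3, 0.7]])                 # P(C | X = x1)
p_e_given_xc = np.array([[[0.9, 0.1], [0.6, 0.4]],   # X = x0, C = c0 / c1
                         [[0.7, 0.3], [0.2, 0.8]]])  # X = x1, C = c0 / c1

def mutual_info(joint_ce):
    p_c = joint_ce.sum(axis=1)
    p_e = joint_ce.sum(axis=0)
    return sum(joint_ce[i, j] * np.log2(joint_ce[i, j] / (p_c[i] * p_e[j]))
               for i in range(2) for j in range(2) if joint_ce[i, j] > 0)

# Observational joint over (C, E): the backpath through X is active.
obs = np.einsum('x,xc,xce->ce', p_x, p_c_given_x, p_e_given_xc)

# Interventional joint: cut X -> C and reimpose C's original marginal P*(C).
p_c_star = p_x @ p_c_given_x
p_e_given_c_star = np.einsum('x,xce->ce', p_x, p_e_given_xc)
intv = p_c_star[:, None] * p_e_given_c_star

print(round(mutual_info(obs), 3), round(mutual_info(intv), 3))   # MI vs CI(C,E)
```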

30.5 Conclusions

Causal information, our new measure of causal power, is theoretically well‐ founded. Causal Bayesian networks provide a very general and powerful way to represent complex stochastic systems. Hypothetical interventions, when properly modelled in causal Bayesian networks, provide a clear separation of causal from non‐causal paths. In mutual information, information theory provides an appropriate summary measure for cumulative causal influence, which applies to all sorts of networks and interventions, and can be tailored to specific purposes. The combination of the two, interventions and mutual information, yields causal information.

The result is a measure of causal power that has much wider application than previous accounts. Causal information can be applied to a wider variety of systems, including those with nonlinear probabilistic influences and intricate structural relationships between variables. In such cases it still yields sensible results, unlike the alternative measures put forward by Cheng (1997), Glymour (2001), and Good (1961). These alternative measures were designed for simpler cases, such as noisy‐OR networks that exhibit causal transitivity. But in these cases, too, our measure still yields appropriate results. And causal information is the only measure that is well defined for relating any combination of values and variables.

We look forward to applying causal information to theoretical problems in philosophy and AI. Causal information is also a promising measure for summarizing explanatory information encoded in a Bayesian network and so offers new means for simplifying the interpretation of complex Bayesian networks.

Acknowledgements

(p.649) We are grateful for the useful comments by the referees. This work was supported in part by a grant from Monash University.

References

Bibliography references:

Ay, N. and D. Polani (2008). Information flows in causal networks. Advances in Complex Systems 11, 17–41.

Cheng, P. W. (1997). From covariation to causation: A causal power theory. Psychological Review 104(2), 367–405.

Cover, T. M. and J. A. Thomas (1991). Elements of Information Theory. New York: John Wiley.

Glymour, C. (2001). The Mind's Arrows: Bayes Nets and Graphical Causal Models in Psychology. Cambridge, MA: MIT Press.

Glymour, C. and P. Cheng (1998). Causal mechanism and probability: a normative approach. In M. Oaksford and N. Chater (Eds.), Rational Models of Cognition. Oxford: Oxford University Press.

Good, I. J. (1961). A causal calculus. British Journal for the Philosophy of Science 11, 305–318.

Hope, L. (2008). Information Measures for Causal Explanation and Prediction. PhD thesis, Monash University.

Hope, L. R. and K. B. Korb (2005). An information‐theoretic causal power theory. In Lecture Notes in Artificial Intelligence, Volume 3809, pp. 805–811. Berlin: Springer‐Verlag.

Korb, K. B., L. R. Hope, A. E. Nicholson, and K. Axnick (2004). Varieties of causal intervention. In PRICAI'04—Proceedings of the 8th Pacific Rim International Conference on Artificial Intelligence, Auckland, New Zealand, pp. 322–331.

Korb, K. B. and A. E. Nicholson (2004). Bayesian Artificial Intelligence. Boca Raton: Chapman & Hall/CRC.

Luhmann, C. C. and W. Ahn (2005). The meaning and computation of causal power. Psychological Review 112, 685–692.

Neapolitan, R. E. (2003). Stochastic causality. In International Conference on Cognitive Science, Sydney, Australia.

Neapolitan, R. E. (2004). Learning Bayesian Networks. Englewood Cliffs: Prentice Hall.

Novick, L. and P. Cheng (2004). Assessing interactive causal influence. Psychological Review 111, 455–485.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann.

Rescorla, R. A. and A. R. Wagner (1972). A theory of Pavlovian conditioning. In A. H. Black and W. Prokasy (Eds.), Classical Conditioning II: Current Theory and Research, pp. 64–99. New York: Appleton‐Century‐Crofts.

Salmon, W. C. (1980). Probabilistic causality. Pacific Philosophical Quarterly 61, 50–74.

(p.650) Spirtes, P., C. Glymour, and R. Scheines (2000). Causation, Prediction and Search: Second Edition. Cambridge, MA: The MIT Press.

Suppes, P. (1970). A Probabilistic Theory of Causality. Amsterdam: North‐Holland.

Twardy, C. R. and K. B. Korb (2004). A criterion of probabilistic causality. Philosophy of Science 71, 241–62.

Verma, T. and J. Pearl (1990). Equivalence and synthesis of causal models. In Proceedings of the sixth conference on uncertainty in artificial intelligence, San Francisco, pp. 462–470. UAI: Morgan Kaufmann.

Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics 5, 161–215.

Appendix A: CI versus Cheng computations

The following computations were made for the examples in Section 30.4.3. The first Cheng computation:

P(e | c) = 0.22
P(e | BP = Low) = 0.025
P(e | BP = Med) = 0.025
P(e | ¬c) = P(e | BP = Low) × [0.32/0.78] + P(e | BP = Med) × [0.46/0.78]
          = 0.025 × 0.41 + 0.025 × 0.59 = 0.025
Δp_c = P(e | c) − P(e | ¬c) = 0.22 − 0.025 = 0.195
p_c = Δp_c / [1 − P(e | ¬c)] = 0.195 / [1 − 0.025] = 0.20.

The first CI computation:

P*(e | c) = 0.22
P*(e) = 0.068
CI(c, e) = P*(e | c) log₂[P*(e | c) / P*(e)] = 0.22 log₂[0.22/0.068] = 0.37.

The CI computation for high and medium exercisers:

P*(e | c) = 0.20
P*(e) = 0.0440
CI(c, e) = P*(e | c) log₂[P*(e | c) / P*(e)] = 0.20 log₂[0.20/0.0440] = 0.44.

(p.651) The CI computation for low exercisers:

P*(e | c) = 0.30
P*(e) = 0.164
CI(c, e) = P*(e | c) log₂[P*(e | c) / P*(e)] = 0.30 log₂[0.30/0.164] = 0.26.

The Cheng computation with the backpath reintroduced:

P(e | c) = 0.24
P(e | BP = Low) = 0.0078
P(e | BP = Med) = 0.027
P(e | ¬c) = P(e | BP = Low) × [0.32/0.78] + P(e | BP = Med) × [0.46/0.78]
          = 0.0078 × 0.41 + 0.027 × 0.59 = 0.0032 + 0.016 = 0.019
Δp_c = P(e | c) − P(e | ¬c) = 0.24 − 0.019 = 0.22
p_c = Δp_c / [1 − P(e | ¬c)] = 0.22 / [1 − 0.019] = 0.22.

The Cheng computation with both the backpath and interaction reintroduced:

P(e | c) = 0.24
P(e | BP = Low) = 0.038
P(e | BP = Med) = 0.16
P(e | ¬c) = P(e | BP = Low) × [0.32/0.78] + P(e | BP = Med) × [0.46/0.78]
          = 0.038 × 0.41 + 0.16 × 0.59 = 0.016 + 0.095 = 0.11
Δp_c = P(e | c) − P(e | ¬c) = 0.24 − 0.11 = 0.13
p_c = Δp_c / [1 − P(e | ¬c)] = 0.13 / [1 − 0.11] = 0.15.

Appendix B: CI for variables is non‐negative

Theorem 30.1. CI(C, E), CI(c, E) and CI(C, e) are always non‐negative.

(p.652) Proof.

(a) CI(C, E) is equivalent to the mutual information MI(C, E) under the specified interventions, as shown in Section 30.3.4. Mutual information is always non‐negative (Cover and Thomas, 1991).

(b) CI(c, E) is equivalent to Kullback–Leibler divergence between P*(Eǀc) and P*(E). Kullback–Leibler divergence is always non‐negative (Cover and Thomas, 1991).

(c) The Kullback–Leibler divergence between P*(C | e) and P*(C) is

$$\sum_{c} P^*(c \mid e) \log \frac{P^*(c \mid e)}{P^*(c)} = \sum_{c} \frac{P^*(c, e)}{P^*(e)} \log \frac{P^*(c, e)}{P^*(c) P^*(e)}$$
$$= \frac{1}{P^*(e)} \sum_{c} P^*(c, e) \log \frac{P^*(c, e)}{P^*(c) P^*(e)}$$
$$= \frac{1}{P^*(e)} \sum_{c} P^*(c) \, P^*(e \mid c) \log \frac{P^*(e \mid c)}{P^*(e)} = \frac{\mathrm{CI}(C, e)}{P^*(e)}.$$

Since P*(e) and the Kullback–Leibler divergence are always non‐negative, so is CI(C,e).                   ◻

Notes:

(1) So, Wright's rule for identifying paths that contribute to the correlation between pairs of variables is essentially the definition of d‐connection (Pearl, 1988), although Wright did not consider conditioning upon a collider, which will activate the path through it.

(2) Strictly speaking, this only gives us the success rate where ¬ e would otherwise have occurred. The success rate of c where e would have occurred anyway is a moot point. Cheng can either assume that it is the same, or else ignore these cases altogether, as being unimportant for assessing causal power.

(3) Luhmann and Ahn (2005) make the curious objection to power PC theory that it implies that all causes have powers of 0 or 1 to bring about their effects. They claim this follows from Cheng's assumptions ‘unless for some inexplicable reason, the causal link between c and e is intrinsically indeterminate’ (Luhmann and Ahn, 2005, p. 686). It is true that if all other causes of e are included in the model, and the model is deterministic, then powers of 0 or 1 will result. But this would not be unreasonable, nor is it a special problem for Cheng's account. The complaint amounts to the observation that Cheng plus determinism implies determinism!

(4) For an excellent introduction to information theory, see Cover and Thomas (1991).

(5) Causal information was first introduced by Hope and Korb (2005); here we further develop the account and compare it to earlier theories. Ay and Polani (2008) have subsequently developed a similar information‐theoretic approach.

(6) Thus, we are here concerned with accounting for the total power of one variable to influence another, under various background conditions, across all causal paths connecting them. It would also be of interest to account for the causal powers of individual paths and relate them to the total causal power of all paths; that is a matter of current research.

(7) We will assume that Ψ does not include any common effect variable that activates a non‐causal path from X to Y.

(8) We emphasize that these interventions are only hypothetical. Their purpose is simply to reveal features already implicit in the given causal Bayesian network. So it is not necessary that they be practical or even physically possible; it is sufficient that they can be modeled by augmentation.

(9) The base is unimportant, and frequently natural logs are used. We use base 2 here to simplify the interpretation of causal information as code length.

(10) From Shannon (1948), the negative log of the probability of an event is the optimal code length to describe that event. Hence, mutual information can also be interpreted as the expected excess code length involved in recording the values of X and Y while wrongly assuming that they are independent.

(11) Entropy is defined subject to the common assumption that 0 log 0 = 0, which is justified by continuity arguments.