Neural Networks and Brain Function

Edmund Rolls and Alessandro Treves

Print publication date: 1997

Print ISBN-13: 9780198524328

Published to Oxford Scholarship Online: March 2012

DOI: 10.1093/acprof:oso/9780198524328.001.0001



A2 Information theory

Oxford University Press

Information theory provides the means for quantifying how much neurons communicate to other neurons, and thus provides a quantitative approach to fundamental questions about information processing in the brain. If asking what in neural activity carries information, one must compare the amounts of information carried by different codes, that is different descriptions of the same activity, to provide the answer. If asking about the speed of information transmission, one must define and measure information rates from neuronal responses. If asking to what extent the information provided by different cells is redundant or instead independent, again one must measure amounts of information in order to provide a quantitative study.

This appendix briefly introduces the fundamental elements of information theory in the first section. A more complete treatment can be found in many books on the subject (e.g. Abramson, 1963; Hamming, 1990; Cover and Thomas, 1991), among which the recent one by Rieke, Warland, de Ruyter van Steveninck and Bialek (1996) is specifically about information transmitted by neuronal firing. The second section discusses the extraction of information measures from neuronal activity, in particular in experiments with mammals, in which the central issue is how to obtain accurate measures in conditions of limited sampling. The third section summarizes some of the results obtained so far. The material in the second and third sections has not yet been treated in any book, and as concerns the third section, it is hoped that it will be considerably updated in future editions of this volume, to include novel findings. The essential terminology is summarized in a Glossary at the end of the appendix.

A2.1 Basic notions and their use in the analysis of formal models

Although information theory was a surprisingly late starter as a mathematical discipline, having been developed and formalized in 1948 by C. Shannon, the intuitive notion of information is immediate to us. It is also very easy to understand why, in order to quantify this intuitive notion of how much we know about something, we use logarithms, and why the resulting quantity is always defined in relative rather than absolute terms. An introduction to information theory is provided next, with a more formal summary given in the third subsection.

A2.1.1 The information conveyed by definite statements.

Suppose somebody, who did not know, is told that Reading is a town west of London. How much information is he given? Well, that depends. He may have known it was a town in England, but not whether it was east or west of London; in which case the new information amounts to the fact that of two a priori (i.e. initial) possibilities (E or W), one holds (W). It is also possible to interpret the statement in the more precise sense, that Reading is west of London, rather than east, north or south, i.e. one out of four possibilities; or else, west rather than north-west, north, etc. Clearly, the larger the number k of a priori possibilities, the more one is actually told, and a measure of information must take this into account. Moreover, we would like independent pieces of information to just add together. For example, our fellow may also be told that Cambridge is, out of l possible directions, north of London. Provided nothing was known on the mutual location of Reading and Cambridge, there were overall k × l a priori possibilities, only one of which remains a posteriori (after receiving the information). We then define the amount I of information gained, such that

(A2.1) $I(k) = \log_2 k$

because the log function has the basic property that

(A2.2) $I(k \times l) = \log_2 (k \times l) = \log_2 k + \log_2 l = I(k) + I(l)$

i.e. the information about Cambridge adds up to that about Reading. We choose to take logarithms in base 2 as a mere convention, so that the answer to a yes/no question provides one unit, or bit, of information. Here it is just for the sake of clarity that we used different symbols for the number of possible directions with respect to which Reading and Cambridge are localized; if both locations are specified for example in terms of E, SE, S, SW, W, NW, N, NE, then obviously k = l = 8, I(k) = I(l) = 3 bits, and I(kl) = 6 bits. An important point to note is that the resolution with which the direction is specified determines the amount of information provided, and that in this example, as in many situations arising when analysing neuronal codings, the resolution could be made progressively finer, with a corresponding increase in information proportional to the log of the number of possibilities.
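The additivity just described can be checked numerically. The following is a minimal sketch; the function name `information_bits` is an illustrative choice, not notation from the text.

```python
import math

def information_bits(n_possibilities: int) -> float:
    """Bits gained when one of n equiprobable a priori possibilities is confirmed."""
    if n_possibilities < 1:
        raise ValueError("need at least one a priori possibility")
    return math.log2(n_possibilities)

k = l = 8  # eight compass directions for each town, as in the text
i_reading = information_bits(k)
i_cambridge = information_bits(l)
i_joint = information_bits(k * l)  # k x l joint possibilities

print(i_reading, i_cambridge, i_joint)
```

Doubling the angular resolution (16 directions instead of 8) would add exactly one bit per town, which is the logarithmic growth with resolution noted above.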

A2.1.2 The information conveyed by probabilistic statements.

The situation becomes slightly less trivial, and closer to what happens among neurons, if information is conveyed in less certain terms. Suppose for example that our friend is told, instead, that Reading has odds of 9 to 1 to be west, rather than east, of London (considering now just two a priori possibilities). He is certainly given some information, albeit less than in the previous case. We might put it this way: out of 18 equiprobable a priori possibilities (9 west + 9 east), 8 (east) are eliminated, and 10 remain, yielding

(A2.3) $I = \log_2 \frac{18}{10} = \log_2 \frac{9}{5}$

as the amount of information given. It is simpler to write this in terms of probabilities

(A2.4) $I = \log_2 \frac{P^{\text{posterior}}(W)}{P^{\text{prior}}(W)} = \log_2 \frac{9/10}{1/2} = \log_2 \frac{9}{5}$

This is of course equivalent to saying that the amount of information given by an uncertain statement is equal to the amount given by the absolute statement

$I = -\log_2 P^{\text{prior}}(W)$

minus the amount of uncertainty remaining after the statement, $I = -\log_2 P^{\text{posterior}}(W)$. A successive clarification that Reading is indeed west of London carries

(A2.5) $I' = \log_2 \frac{1}{9/10} = \log_2 \frac{10}{9}$

bits of information, because 9 out of 10 are now the a priori odds, while a posteriori there is certainty, $P^{\text{posterior}}(W) = 1$. In total we would seem to have

(A2.6) $I_{\text{TOTAL}} = I + I' = \log_2 \frac{9}{5} + \log_2 \frac{10}{9} = 1 \text{ bit}$

as if the whole information had been provided at one time. This is strange, given that the two pieces of information are clearly not independent, and only independent information should be additive. In fact, we have cheated a little. Before the clarification, there was still one residual possibility (out of 10) that the answer was ‘east’, and this must be taken into account by writing

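(A2.7) $I = P^{\text{posterior}}(W) \log_2 \frac{P^{\text{posterior}}(W)}{P^{\text{prior}}(W)} + P^{\text{posterior}}(E) \log_2 \frac{P^{\text{posterior}}(E)}{P^{\text{prior}}(E)}$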

as the information contained in the first message. This little detour should serve to emphasize two aspects which it is easy to forget when reasoning intuitively about information, and that in this example cancel each other. In general, when uncertainty remains, that is there is more than one possible a posteriori state, one has to average information values for each state with the corresponding a posteriori probability measure. In the specific example, the sum I + I′ totals slightly more than 1 bit, and the amount in excess is precisely the information ‘wasted’ by providing correlated messages.

A2.1.3 Information sources, information channels, and information measures

In summary, the expression quantifying the information provided by a definite statement that event s, which had an a priori probability P(s), has occurred is

(A2.8) $I(s) = \log_2 \frac{1}{P(s)} = -\log_2 P(s)$

whereas if the statement is probabilistic, that is several a posteriori probabilities remain nonzero, the correct expression involves summing over all possibilities with the corresponding probabilities:

(A2.9) $I = \sum_s P^{\text{posterior}}(s) \log_2 \frac{P^{\text{posterior}}(s)}{P(s)}$
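As a rough illustration of the probabilistic expression (Eq. A2.9), the sketch below weights each log ratio of posterior to prior by the posterior probability. The function name `information_gain` and the dictionary layout are assumptions made for this example only; the probabilities are those of the Reading example (a prior of 1/2 each for W and E, and a posterior of 9/10 for W).

```python
import math

def information_gain(prior, posterior):
    """Bits gained in moving from prior to posterior probabilities (Eq. A2.9)."""
    total = 0.0
    for outcome, p_post in posterior.items():
        if p_post > 0.0:  # outcomes ruled out a posteriori contribute nothing
            total += p_post * math.log2(p_post / prior[outcome])
    return total

prior = {"W": 0.5, "E": 0.5}
hedged = {"W": 0.9, "E": 0.1}    # "odds of 9 to 1 to be west"
definite = {"W": 1.0, "E": 0.0}  # "Reading is west of London"

gain_hedged = information_gain(prior, hedged)
gain_definite = information_gain(prior, definite)
print(f"hedged statement:   {gain_hedged:.3f} bits")
print(f"definite statement: {gain_definite:.3f} bits")
```

When a single posterior probability equals 1, the sum collapses to the definite-statement expression of Eq. A2.8, which is why the second call returns exactly 1 bit.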

When considering a discrete set of mutually exclusive events, it is convenient to use the metaphor of a set of symbols comprising an alphabet S. The occurrence of each event is then referred to as the emission of the corresponding symbol by an information source. The entropy of the source, H, is the average amount of information per source symbol, where the average is taken across the alphabet, with the corresponding probabilities

(A2.10) $H(S) = \sum_s P(s) \log_2 \frac{1}{P(s)} = -\sum_s P(s) \log_2 P(s)$

An information channel receives symbols s from an alphabet S and emits symbols s’ from alphabet S’. If the joint probability of the channel receiving s and emitting s’ is given by the product

(A2.11) $P(s, s') = P(s)\, P(s')$

for any pair s, s’ then the input and output symbols are independent of each other, and the channel transmits zero information. Instead of joint probabilities, this can be expressed with conditional probabilities: the conditional probability of s’ given s is written P(s’ | s), and if the two variables are independent, it is just equal to the unconditional probability P(s’). In general, and in particular if the channel does transmit information, the variables are not independent, and one can express their joint probability in two ways in terms of conditional probabilities

(A2.12) $P(s, s') = P(s' \mid s)\, P(s) = P(s \mid s')\, P(s')$

from which it is clear that

(A2.13) $P(s \mid s') = \frac{P(s' \mid s)\, P(s)}{P(s')}$

which is called Bayes’ theorem (although when expressed as here in terms of probabilities it is strictly speaking an identity rather than a theorem). The information transmitted by the channel conditional on its having emitted symbol s’ (or specific transinformation, I(s’)) is given by Eq. A2.9, once the unconditional probability P(s) is inserted as the prior, and the conditional probability P(s | s’) as the posterior:

(A2.14) $I(s') = \sum_s P(s \mid s') \log_2 \frac{P(s \mid s')}{P(s)}$

Symmetrically, one can define the transinformation conditional on the channel having received symbol s

(A2.15) $I(s) = \sum_{s'} P(s' \mid s) \log_2 \frac{P(s' \mid s)}{P(s')}$

Finally, the average transinformation, or mutual information, can be expressed in fully symmetrical form


(A2.16) $I = \sum_{s, s'} P(s, s') \log_2 \frac{P(s, s')}{P(s)\, P(s')}$

The mutual information can also be expressed as the entropy of the source using alphabet S minus the equivocation of S with respect to the new alphabet S’ used by the channel, written

(A2.17) $I = H(S) - H(S \mid S') = H(S) - \sum_{s'} P(s')\, H(S \mid s')$

A channel is characterized, once the alphabets are given, by the set of conditional probabilities for the output symbols, P(s’|s), whereas the unconditional probabilities of the input symbols P(s) depend of course on the source from which the channel receives. Then, the capacity of the channel can be defined as the maximal mutual information across all possible sets of input probabilities P(s). Thus, the information transmitted by a channel can range from zero to the lower of two independent upper bounds: the entropy of the source, and the capacity of the channel.
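The quantities defined in this subsection can be sketched in a few lines. The binary channel below (each symbol transmitted correctly with probability 0.9) is a hypothetical example, not one taken from the text, and the capacity is found by a crude grid search over input probabilities rather than by any method the authors describe.

```python
import math

def entropy(probs):
    """Source entropy in bits (Eq. A2.10), skipping zero-probability symbols."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def output_probs(p_in, channel):
    """P(s') = sum_s P(s) P(s'|s), where channel[s][t] holds P(s'=t | s)."""
    return [sum(p_in[s] * channel[s][t] for s in range(len(p_in)))
            for t in range(len(channel[0]))]

def mutual_information(p_in, channel):
    """Symmetric form (Eq. A2.16), using P(s,s') / (P(s) P(s')) = P(s'|s) / P(s')."""
    p_out = output_probs(p_in, channel)
    info = 0.0
    for s, p_s in enumerate(p_in):
        for t, p_t_given_s in enumerate(channel[s]):
            joint = p_s * p_t_given_s
            if joint > 0.0:
                info += joint * math.log2(p_t_given_s / p_out[t])
    return info

def equivocation(p_in, channel):
    """H(S|S') = -sum P(s,s') log2 P(s|s'), with P(s|s') from Bayes' theorem."""
    p_out = output_probs(p_in, channel)
    h = 0.0
    for s, p_s in enumerate(p_in):
        for p_t_given_s, p_t in zip(channel[s], p_out):
            joint = p_s * p_t_given_s
            if joint > 0.0:
                h -= joint * math.log2(joint / p_t)
    return h

noisy = [[0.9, 0.1], [0.1, 0.9]]    # hypothetical noisy binary channel
useless = [[0.5, 0.5], [0.5, 0.5]]  # output independent of input
uniform = [0.5, 0.5]

info_noisy = mutual_information(uniform, noisy)
info_useless = mutual_information(uniform, useless)

# Capacity: maximal mutual information across input probabilities (grid search).
grid = [i / 100 for i in range(1, 100)]
best_q = max(grid, key=lambda q: mutual_information([q, 1.0 - q], noisy))
capacity = mutual_information([best_q, 1.0 - best_q], noisy)

print(f"I = {info_noisy:.3f} bits, capacity = {capacity:.3f} bits at q = {best_q}")
```

For this symmetric channel the uniform input already achieves the capacity, and the transmitted information stays below the 1 bit entropy of the source, in line with the two upper bounds stated above.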

A2.1.4 The information carried by a neuronal response and its averages

Considering the processing of information in the brain, we are often interested in the amount of information the response r of a neuron, or of a population of neurons, carries about an event happening in the outside world, for example a stimulus s shown to the animal. Once the inputs and outputs are conceived of as sets of symbols from two alphabets, the neuron(s) may be regarded as an information channel. We may denote with P(s) the a priori probability that the particular stimulus s out of a given set was shown, while the conditional probability P(s | r) is the a posteriori probability, that is updated by the knowledge of the response r. The response-specific transinformation

(A2.18) $I(r) = \sum_s P(s \mid r) \log_2 \frac{P(s \mid r)}{P(s)}$

takes the extreme values of $I(r) = -\log_2 P(s(r))$ if r unequivocally determines s(r) (that is, P(s | r) equals 1 for that one stimulus and 0 for all others); and $I(r) = \sum_s P(s) \log_2 [P(s)/P(s)] = 0$ if there is no relation between s and r, that is they are independent, so that the response tells us nothing new about the stimulus.

This is the information conveyed by each particular response. One is usually interested in further averaging this quantity over all possible responses r,

(A2.19) $\langle I \rangle = \sum_r P(r)\, I(r) = \sum_r P(r) \sum_s P(s \mid r) \log_2 \frac{P(s \mid r)}{P(s)}$

The angular brackets 〈 〉 are used here to emphasize the averaging operation, in this case over responses. Denoting with P(s,r) the joint probability of the pair of events s and r, and using Bayes’ theorem, this reduces to the symmetric form (Eq. A2.16) for the mutual information

(A2.20) $\langle I \rangle = \sum_{s, r} P(s, r) \log_2 \frac{P(s, r)}{P(s)\, P(r)}$

which emphasizes that responses tell us about stimuli just as much as stimuli tell us about responses. This is, of course, a general feature, independent of the two variables being in this instance stimuli and neuronal responses. In fact, what is of interest, besides the mutual information of Eqs A2.19 and A2.20, is often the information specifically conveyed about each stimulus,

(A2.21) $I(s) = \sum_r P(r \mid s) \log_2 \frac{P(r \mid s)}{P(r)}$

which is a direct quantification of the variability in the responses elicited by that stimulus, compared to the overall variability. Since P(r) is the probability distribution of responses averaged across stimuli, it is again evident that the stimulus-specific information measure of Eq. A2.21 depends not only on the stimulus s, but also on all other stimuli used. Likewise, the mutual information measure, despite being of an average nature, is dependent on what set of stimuli has been used in the average. This emphasizes again the relative nature of all information measures. More specifically, it underscores the relevance of using, while measuring the information conveyed by a given neuronal population, stimuli that are either representative of real-life stimulus statistics, or of particular interest for the properties of the population being examined.

A numerical example

To make these notions clearer, we can consider a specific example in which the response of a cell to the presentation of, say, one of four odours (A, B, C, D) is recorded for 10 ms, during which the cell emits either 0, 1, or 2 spikes, but no more. Imagine that the cell tends to respond more vigorously to smell B, less to C, even less to A, and never to D, as described by the following table of conditional probabilities P(r | s):

[Table: conditional probabilities P(r | s); rows are the odours s = A, B, C, D, columns the spike counts r = 0, 1, 2]

then, if different odours are presented with equal probability, the table of joint probabilities P(s, r) will be

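[Table: joint probabilities P(s, r) = P(s) P(r | s); rows are the odours s = A, B, C, D, columns the spike counts r = 0, 1, 2]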

From these two tables one can compute various information measures by directly applying the definitions above. Since odours are presented with equal probability, P(s) = 1/4, the entropy of the stimulus set, which corresponds to the maximum amount of information any transmission channel, no matter how efficient, could convey about the identity of the odours, is $H_s = -\sum_s P(s) \log_2 P(s) = -4[(1/4) \log_2 (1/4)] = \log_2 4 = 2$ bits. There is a more stringent upper bound on the mutual information that this cell’s responses convey about the odours, however, and this second bound is the channel capacity T of the cell. Calculating this quantity involves maximizing the mutual information across prior odour probabilities, and it is a bit complicated to do, in general. In our particular case the maximum information is obtained when only odours B and D are presented, each with probability 0.5. The resulting capacity is T = 1 bit. We can easily calculate, in general, the entropy of the responses. This is not an upper bound characterizing the source, like the entropy of the stimuli, nor an upper bound characterizing the channel, like the capacity, but simply a bound on the mutual information for this specific combination of source (with its related odour probabilities) and channel (with its conditional probabilities). Since only three response levels are possible within the short recording window, and they occur with uneven probability, their entropy is considerably lower than $H_s$, at $H_r = -\sum_r P(r) \log_2 P(r) = -P(0) \log_2 P(0) - P(1) \log_2 P(1) - P(2) \log_2 P(2) = -0.5 \log_2 0.5 - 0.275 \log_2 0.275 - 0.225 \log_2 0.225 = 1.496$ bits. The actual average information I that the responses transmit about the stimuli, which is a measure of the correlation in the variability of stimuli and responses, does not exceed the absolute variability of either stimuli (as quantified by the first bound) or responses (as quantified by the last bound), nor the capacity of the channel. An explicit calculation inserting the joint probabilities of the second table into expression A2.20 yields I = 0.733 bits. This is of course only the average value, averaged both across stimuli and across responses.

The information conveyed by a particular response can be larger. For example, when the cell emits two spikes it indicates with a relatively large probability odour B, and this is reflected in the fact that it then transmits, according to expression A2.18, I(r = 2) = 1.497 bits, more than double the average value.

Similarly, the amount of information conveyed about each individual odour varies with the odour, depending on the extent to which it tends to elicit a differential response. Thus, expression A2.21 yields that only I(s = C) = 0.185 bits are conveyed on average about odour C, which tends to elicit responses with similar statistics to the average statistics across odours, and therefore not easily interpretable. On the other hand, exactly 1 bit of information is conveyed about odour D, since this odour never elicits any response, and when the cell emits no spike there is a probability of 1/2 that the stimulus was odour D.
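The calculations in this example can be reproduced with a short script. Note that the conditional probabilities entered below are an assumption: they were chosen because they respect the ordering of response vigour described in the text (B, then C, then A, never D) and reproduce every figure quoted there, but they should be checked against the printed table. All function names are illustrative.

```python
import math

# ASSUMED values of P(r | s) for r = 0, 1, 2 spikes (see the note above).
P_R_GIVEN_S = {
    "A": [0.6, 0.4, 0.0],
    "B": [0.0, 0.2, 0.8],
    "C": [0.4, 0.5, 0.1],
    "D": [1.0, 0.0, 0.0],
}
P_S = {s: 0.25 for s in P_R_GIVEN_S}  # odours presented with equal probability

# Response probabilities averaged across stimuli: P(r) = sum_s P(s) P(r|s).
P_R = [sum(P_S[s] * P_R_GIVEN_S[s][r] for s in P_S) for r in range(3)]

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def stimulus_specific_information(s):
    """Eq. A2.21: information conveyed about stimulus s."""
    return sum(p * math.log2(p / P_R[r])
               for r, p in enumerate(P_R_GIVEN_S[s]) if p > 0.0)

def response_specific_information(r):
    """Eq. A2.18: information conveyed by response r, using Bayes' theorem."""
    total = 0.0
    for s, p_s in P_S.items():
        posterior = p_s * P_R_GIVEN_S[s][r] / P_R[r]
        if posterior > 0.0:
            total += posterior * math.log2(posterior / p_s)
    return total

stimulus_entropy = entropy(P_S.values())
response_entropy = entropy(P_R)
mutual_info = sum(P_S[s] * stimulus_specific_information(s) for s in P_S)

print(f"H_s = {stimulus_entropy:.3f}, H_r = {response_entropy:.3f}")
print(f"<I> = {mutual_info:.3f}, I(r=2) = {response_specific_information(2):.3f}")
for s in P_S:
    print(f"I(s={s}) = {stimulus_specific_information(s):.3f}")
```

Averaging the response-specific values over P(r) gives the same mutual information as averaging the stimulus-specific values over P(s), which is the symmetry expressed by Eq. A2.20.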

A2.1.5 The information conveyed by continuous variables

A general feature, relevant also to the case of neuronal information, is that if, among a continuum of a priori possibilities, only one, or a discrete number, remains a posteriori, the information is strictly infinite. This would be the case if one were told, for example, that Reading is exactly 10′ west, 1′ north of London. The a priori probability of precisely this set of coordinates among the continuum of possible ones is zero, and then the information diverges to infinity. The problem is only theoretical, because in fact, with continuous distributions, there is always one or several factors that limit the resolution in the a posteriori knowledge, rendering the information finite. Moreover, when considering the mutual information in the conjoint probability of occurrence of two sets, e.g. stimuli and responses, it suffices that at least one of the sets is discrete to make matters easy, that is, finite. Nevertheless, the identification and appropriate consideration of these resolution-limiting factors in practical cases may require careful analysis.

Example: the information retrieved from an autoassociative memory

One example is the evaluation of the information that can be retrieved from an autoassociative memory. Such a memory stores a number of firing patterns, each one of which can be considered, as in Chapter 3, as a vector $\mathbf{r}^\mu$ with components the firing rates $r_i^\mu$, where the subscript i indexes the cell (and the superscript μ indexes the pattern). In retrieving pattern μ, the network in fact produces a distinct firing pattern, denoted for example simply as $\mathbf{r}$. The quality of retrieval, or the similarity between $\mathbf{r}^\mu$ and $\mathbf{r}$, can be measured by the average mutual information

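(A2.22) $\langle I(\mathbf{r}^\mu, \mathbf{r}) \rangle = \sum_{\mathbf{r}^\mu, \mathbf{r}} P(\mathbf{r}^\mu, \mathbf{r}) \log_2 \frac{P(\mathbf{r}^\mu, \mathbf{r})}{P(\mathbf{r}^\mu)\, P(\mathbf{r})} \approx \sum_i \sum_{r_i^\mu, r_i} P(r_i^\mu, r_i) \log_2 \frac{P(r_i^\mu, r_i)}{P(r_i^\mu)\, P(r_i)}$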

In this formula the ‘approximately equal’ sign ≈ marks a simplification which is not necessarily a reasonable approximation. If the simplification is valid, it means that in order to extract an information measure, one need not compare whole vectors (the entire firing patterns) with each other, and may instead compare the firing rates of individual cells at storage and retrieval, and sum the resulting single-cell information values. The validity of the simplification is a matter which will be discussed later and which has to be verified, in the end, experimentally, but for the purposes of the present discussion we can focus on the single-cell terms. If either $r_i$ or $r_i^\mu$ has a continuous distribution of values, as it will if it represents not the number of spikes emitted in a fixed window, but more generally the firing rate of neuron i computed by convolving the firing train with a smoothing kernel, then one has to deal with probability densities, which we denote as p(r)dr, rather than the usual probabilities P(r). Substituting $p(r)\,dr$ for $P(r)$ and $p(r^\mu, r)\,dr\,dr^\mu$ for $P(r^\mu, r)$, one can write for each single-cell contribution (omitting the cell index i)

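(A2.23) $\langle I(r^\mu, r) \rangle_i = \int dr^\mu \int dr\; p(r^\mu, r) \log_2 \frac{p(r^\mu, r)\, dr^\mu\, dr}{p(r^\mu)\, dr^\mu\; p(r)\, dr}$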

and we see that the differentials $dr^\mu\, dr$ cancel out between numerator and denominator inside the logarithm, rendering the quantity well defined and finite. If, however, $r^\mu$ were to exactly determine r, one would have

(A2.24) $p(r^\mu, r)\, dr^\mu\, dr = p(r^\mu)\, \delta(r - r(r^\mu))\, dr^\mu\, dr = p(r^\mu)\, dr^\mu$

and, by losing one differential on the way, the mutual information would become infinite. It is therefore important to consider what prevents $r^\mu$ from fully determining r in the case at hand, in other words, to consider the sources of noise in the system. In an autoassociative memory storing an extensive number of patterns (see Appendix A4), one source of noise always present is the interference effect due to the concurrent storage of all other patterns. Even neglecting other sources of noise, this produces a finite resolution width ρ, which allows one to write some expression of the type $p(r \mid r^\mu)\, dr = \exp[-(r - r(r^\mu))^2 / 2\rho^2]\, dr$, which ensures that the information is finite as long as the resolution ρ is larger than zero.
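The role of the resolution width can be illustrated with a deliberately simplified case that goes beyond what the text specifies: assume that the stored rate itself is Gaussian distributed with standard deviation `sigma_stored`, and that the retrieved rate equals the stored rate plus independent Gaussian noise of width ρ. Under these assumptions the mutual information has the standard closed form (1/2) log2(1 + σ²/ρ²), which the sketch below evaluates.

```python
import math

def gaussian_channel_information(sigma_stored: float, rho: float) -> float:
    """Bits shared by a Gaussian variable and a copy corrupted by Gaussian noise.

    Assumes r = r_mu + noise, with r_mu ~ N(0, sigma_stored^2) and
    noise ~ N(0, rho^2); both distributions are illustrative assumptions.
    """
    if rho <= 0.0:
        return math.inf  # perfect resolution: the information diverges
    return 0.5 * math.log2(1.0 + (sigma_stored / rho) ** 2)

for rho in (5.0, 1.0, 0.1, 0.01):
    bits = gaussian_channel_information(5.0, rho)
    print(f"rho = {rho:5.2f} Hz -> {bits:6.2f} bits")
```

The information grows without bound as ρ shrinks, but only logarithmically, so any non-zero interference noise keeps it finite.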

One further point that should be noted, in connection with estimating the information retrievable from an autoassociative memory, is that the mutual information between the current distribution of firing rates and that of the stored pattern does not coincide with the information gain provided by the memory device. Even when firing rates, or spike counts, are all that matters in terms of information carriers, as in the networks considered in this book, one more term should be taken into account in evaluating the information gain. This term, to be subtracted, is the information contained in the external input that elicits the retrieval. This may vary a lot from the retrieval of one particular memory to the next, but of course an efficient memory device is one that is able, when needed, to retrieve much more information than it requires to be present in the inputs, that is, a device that produces a large information gain.

Finally, one should appreciate the conceptual difference between the information a firing pattern carries about another one (that is, about the pattern stored), as considered above, and two different notions: (a) the information produced by the network in selecting the correct memory pattern and (b) the information a firing pattern carries about something in the outside world. Quantity (a), the information intrinsic to selecting the memory pattern, is ill defined when analysing a real system, but is a well-defined and particularly simple notion when considering a formal model. If p patterns are stored with equal strength, and the selection is errorless, this amounts to log2 p bits of information, a quantity often, but not always, small compared to the information in the pattern itself. Quantity (b), the information conveyed about some outside correlate, is not defined when considering a formal model that does not include an explicit account of what the firing of each cell represents, but is well defined and measurable from the recorded activity of real cells. It is the quantity considered in the numerical example with the four odours, and it can be generalized to the information carried by the activity of several cells in a network, and specialized to the case that the network operates as an associative memory. One may note, in this case, that the capacity to retrieve memories with high fidelity, or high information content, is only useful to the extent that the representation to be retrieved carries that amount of information about something relevant—or, in other words, that it is pointless to store and retrieve with great care largely meaningless messages. This type of argument has been used to discuss the role of the mossy fibres in the operation of the CA3 network in the hippocampus (Treves and Rolls, 1992; and Chapter 6).

A2.2 Estimating the information carried by neuronal responses

A2.2.1 The limited sampling problem

We now discuss in more detail the application of these general notions to the information transmitted by neurons. Suppose, to be concrete, that an animal has been presented with stimuli drawn from a discrete set 𝕊, and that the responses of a set of C cells have been recorded following the presentation of each stimulus. We may choose any quantity or set of quantities to characterize the responses; for example, let us assume that we consider the firing rate of each cell, $r_i$, calculated by convolving the spike response with an appropriate smoothing kernel. The response space is then the C-fold product of the continuous set of all positive real numbers, $(\mathbb{R}/2)^C$. We want to evaluate the average information carried by such responses about which stimulus was shown. In principle, it is straightforward to apply the above formulas, e.g. in the form

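(A2.25) $\langle I \rangle = \sum_{s \in \mathbb{S}} P(s) \int \prod_i dr_i\; p(\mathbf{r} \mid s) \log_2 \frac{p(\mathbf{r} \mid s)}{p(\mathbf{r})}$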

where it is important to note that p(r) and p(r | s) are now probability densities defined over the high-dimensional vector space of multi-cell responses. The product sign Π signifies that this whole vector space has to be integrated over, along all its dimensions. p(r) can be calculated as Σs p(r | s) P(s), and therefore, in principle, all one has to do is to estimate, from the data, the conditional probability densities p(r | s), that is, the distributions of responses following each stimulus. In practice, however, in contrast to what happens with formal models, in which there is usually no problem in calculating the exact probability densities, real data come in limited amounts, and thus sample only sparsely the vast response space. This limits the accuracy with which we can estimate the probability of each possible response from its experimental frequency, which in turn seriously impairs our ability to estimate 〈I〉 correctly. We refer to this as the limited sampling problem. This is a purely technical problem that arises, typically when recording from mammals, because of external constraints on the duration or number of repetitions of a given set of stimulus conditions. With computer simulation experiments, but also with recordings from, for example, insects, enough data can usually be obtained that straightforward estimates of information are accurate enough (Strong, Koberle, de Ruyter van Steveninck and Bialek, 1996; Golomb, Kleinfeld, Reid, Shapley and Shraiman, 1994). The problem is, however, so serious in connection with recordings from monkeys and rats, that it is worthwhile to discuss it, in order to appreciate the scope and limits of applying information theory to neuronal processing.

In particular, if the responses are continuous quantities, the probability of observing exactly the same response twice is infinitesimal. In the absence of further manipulation, this would imply that each stimulus generates its own set of unique responses, therefore any response that has actually occurred could be associated unequivocally with one stimulus, and the mutual information would always equal the entropy of the stimulus set. This absurdity shows that in order to estimate probability densities from experimental frequencies one has to resort to some regularizing manipulation, such as smoothing the point-like response values by convolution with suitable kernels, or binning them into a finite number of discrete bins.

A numerical example

The problem is illustrated in Fig. A2.1. In the figure, we have simulated a simple but typical situation, in which we try to estimate how much information, about which of a set of two stimuli was shown, can be extracted from the firing rate of a single cell. We have assumed that the true underlying probability density of the response to each stimulus is Gaussian, with width σ = 5 Hz and mean value $r_1$ = 8 Hz and $r_2$ = 14 Hz respectively; and that we have sampled each Gaussian 10 times (i.e. 10 trials with each stimulus). How much information is present in the response, on average? If we had the underlying probability density available, we could easily calculate 〈I〉, which with our choice of parameters would turn out to be 0.221 bits. Our problem, however, is how to estimate the underlying probability distributions from the 10 trials available for each stimulus. Several strategies are possible. One is to discretize the response space into bins, and estimate the probability density as the histogram of the fraction of trials falling into each bin. Choosing bins of width ∆r = 1 Hz would yield 〈I〉 = 0.900 bits; the bins are so narrow that almost every response is in a different bin, and then the estimated information is close to its maximum value of 1 bit. Choosing bins of larger width, ∆r = 5 Hz, one would end up with 〈I〉 = 0.324 bits, a value closer to the true one but still overestimating it by half, even though the bin width now matches the standard deviation of each underlying distribution. Alternatively, one may try to ‘smooth’ the data by convolving each response with a Gaussian kernel whose width equals the standard deviation measured for each stimulus. In our particular case the measured standard deviations are very close to the true value of 5 Hz, but one would get 〈I〉 = 0.100 bits, i.e. one would underestimate the true information value by more than half, by oversmoothing. Yet another possibility is to make a bold assumption as to what the general shape of the underlying densities should be, for example a Gaussian. Even in our case, in which the underlying density was indeed chosen to be Gaussian, this procedure yields a value, 〈I〉 = 0.187 bits, off by 16% (and the error would be much larger, of course, if the underlying density had been of very different shape).
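The binning strategy can be explored with a small simulation. The sketch below is not the authors' code: it computes the true information by numerical integration over the two Gaussians, then averages a plain histogram ("plug-in") estimate over many random sets of 10 trials per stimulus. The integration limits, the number of repetitions and the random seed are arbitrary choices, so the averages need not match the single-sample values quoted in the text.

```python
import math
import random

SIGMA = 5.0          # width of each response distribution (Hz)
MEANS = (8.0, 14.0)  # mean rates for the two stimuli (Hz)

def gaussian_pdf(r, mean, sigma=SIGMA):
    z = (r - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def true_information(step=0.01, lo=-40.0, hi=60.0):
    """Midpoint-rule integral of the information for two equiprobable stimuli."""
    info = 0.0
    for i in range(int(round((hi - lo) / step))):
        r = lo + (i + 0.5) * step
        densities = [gaussian_pdf(r, m) for m in MEANS]
        p_r = 0.5 * sum(densities)
        for d in densities:
            if d > 0.0:
                info += 0.5 * d * math.log2(d / p_r) * step
    return info

def binned_information(samples_by_stimulus, bin_width):
    """Plug-in estimate: histogram frequencies are used as if they were probabilities."""
    total = sum(len(samples) for samples in samples_by_stimulus)
    histograms = []
    for samples in samples_by_stimulus:
        hist = {}
        for r in samples:
            b = math.floor(r / bin_width)
            hist[b] = hist.get(b, 0) + 1
        histograms.append(hist)
    info = 0.0
    for samples, hist in zip(samples_by_stimulus, histograms):
        p_s = len(samples) / total
        for b, count in hist.items():
            p_r_given_s = count / len(samples)
            p_r = sum(h.get(b, 0) for h in histograms) / total
            info += p_s * p_r_given_s * math.log2(p_r_given_s / p_r)
    return info

def mean_binned_estimate(bin_width, n_trials=10, n_repeats=200, seed=0):
    """Average the plug-in estimate over many random sets of trials."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        samples = [[rng.gauss(m, SIGMA) for _ in range(n_trials)] for m in MEANS]
        estimates.append(binned_information(samples, bin_width))
    return sum(estimates) / n_repeats

exact = true_information()
narrow = mean_binned_estimate(bin_width=1.0)
wide = mean_binned_estimate(bin_width=5.0)
print(f"true: {exact:.3f}  1 Hz bins: {narrow:.3f}  5 Hz bins: {wide:.3f}")
```

With so few trials, both bin widths overestimate the true value on average, and the narrower bins do so far more severely.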



Fig. A2.1 The probability distributions used in the example in the text. (a) The two original Gaussian distributions (solid curve for the one centred at 8 Hz, long-dashed curve for the one centred at 14 Hz), the 10 points which provided the sampling of each (◊ and +, respectively), and the distributions obtained by binning the samples into 5 Hz bins. (b) The two original distributions compared with those obtained by smoothing the samples with Gaussian kernels (the short-dash curve applies to the lower distribution and the dotted curve to the higher one). (c) The original distributions compared with those obtained after a Gaussian fit (with the same curves as in (b)).

The effects of limited sampling

The crux of the problem is that, whatever procedure one adopts, limited sampling tends to produce distortions in the estimated probability densities. The resulting mutual information estimates are intrinsically biased. The bias, or average error of the estimate, is upward if the raw data have not been regularized much, and is downward if the regularization procedure chosen has been heavier. The bias can be, if the available trials are few, much larger than the true information values themselves. This is intuitive, as fluctuations due to the finite number of trials available would tend, on average, to either produce or emphasize differences among the distributions corresponding to different stimuli, differences that are preserved if the regularization is ‘light’, and that are interpreted in the calculation as carrying genuine information. This is particularly evident if one considers artificial data produced by using the same distribution for both stimuli, as illustrated in Fig. A2.2. Here, the ‘true’ information should be zero, yet all of our procedures (many more are possible, but the same effect would emerge) yield finite amounts of information. The actual values obtained with the four methods used above are, in order, 〈I〉 = 0.462, 0.115, 0.008, and 0.007 bits. Clearly, ‘heavier’ forms of regularization produce less of such artefactual information; however, if the regularization is heavy, even the underlying meaningful differences between distributions can be suppressed, and information is underestimated. Choosing the right amount of regularization, or the best regularizing procedure, is not possible a priori. Hertz et al. (1992) have proposed the interesting procedure of using an artificial neural network to regularize the raw responses. The network can be trained on part of the data using backpropagation, and then used on the remaining part to produce what is in effect a clever data-driven regularization of the responses. This procedure is, however, rather computer intensive and not very safe, as shown by some self-evident inconsistency in the results (Heller et al., 1995). Obviously, the best way to deal with the limited sampling problem is to try and use as many trials as possible. The improvement is slow, however, and generating as many trials as would be required for a reasonably unbiased estimate is often, in practice, impossible.



Fig. A2.2 The same manipulations applied in Fig. A2.1 are applied here to samples derived from the same distribution, a Gaussian centred at 11 Hz. Notation as in Fig. A2.1. The information values obtained from the various manipulations are reported in the text.
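The upward bias described above can be reproduced with a small simulation. The following is only an illustrative sketch: it draws ten responses for each of two stimuli from one and the same Gaussian centred at 11 Hz, bins them at 1 Hz and at 5 Hz, and averages the raw (plug-in) information over many instantiations. The 3 Hz standard deviation, the clipping at zero, the function names and the number of repeats are assumptions made for the example, not values taken from the figures.

```python
import math
import random
from collections import Counter


def plugin_information(pairs):
    """Plug-in mutual information, in bits, from (stimulus, response bin) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    n_s = Counter(s for s, _ in pairs)
    n_r = Counter(r for _, r in pairs)
    # Sum of P(s,r) log2[P(s,r) / (P(s) P(r))], probabilities taken as raw frequencies
    return sum((c / n) * math.log2(c * n / (n_s[s] * n_r[r]))
               for (s, r), c in joint.items())


def mean_spurious_information(bin_width, trials=10, repeats=500, seed=0):
    """Average plug-in information when both stimuli share one response distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        pairs = []
        for stimulus in (0, 1):
            for _ in range(trials):
                # Assumed: Gaussian centred at 11 Hz, s.d. 3 Hz, clipped at zero
                rate = max(0.0, rng.gauss(11.0, 3.0))
                pairs.append((stimulus, int(rate // bin_width)))
        total += plugin_information(pairs)
    return total / repeats


fine = mean_spurious_information(bin_width=1.0)
coarse = mean_spurious_information(bin_width=5.0)
print(f"1 Hz bins: {fine:.3f} bits, 5 Hz bins: {coarse:.3f} bits (true value: 0)")
```

Both averages come out above zero although the two stimuli are statistically identical, and the finer binning, being the lighter regularization, yields the larger artefactual value.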

A2.2.2 Correction procedures for limited sampling

The above observation, that data drawn from a single distribution, when artificially paired, at random, to different stimulus labels, result in ‘spurious’ amounts of apparent information, (p.305) suggests a simple way of checking the reliability of estimates produced from real data (Optican et al., 1991). One can disregard the true stimulus associated with each response, and generate a randomly reshuffled pairing of stimuli and responses, which should therefore, being not linked by any underlying relationship, carry no mutual information about each other. Calculating, with some procedure of choice, the spurious information obtained in this way, and comparing it with the information value estimated with the same procedure for the real pairing, one can get a feeling for how far the procedure goes towards eliminating the apparent information due to limited sampling. Although this spurious information, I_s, is only indicative of the amount of bias affecting the original estimate, a simple heuristic trick (called ‘bootstrap’¹) is to subtract the spurious from the original value, to obtain a somewhat ‘corrected’ estimate. The following table shows to what extent this trick (or a similar correction based on subtracting the ‘square’ of the spurious fraction of information; Optican et al., 1991) works on our artificial data of Fig. A2.1, when averaged over 1000 different random instantiations of the two sets of ten responses. The correct value, again, is 〈I〉 = 0.221 bits.


Procedure:  Binning ∆r = 1 Hz  |  Binning ∆r = 5 Hz  |  Gaussian smoothing  |  Gaussian fit
I (raw):    (table values not recoverable from this copy)
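The reshuffling check and the subtractive ‘bootstrap’ trick can be sketched as follows. This is a minimal illustration rather than the procedure actually used to fill the table: the binned responses are hypothetical, the function names are invented for the example, and the spurious information is averaged over several reshufflings (an assumption made here to reduce the scatter of a single reshuffle).

```python
import math
import random
from collections import Counter


def plugin_information(stimuli, responses):
    """Plug-in mutual information, in bits, between two equally long label lists."""
    n = len(stimuli)
    joint = Counter(zip(stimuli, responses))
    n_s = Counter(stimuli)
    n_r = Counter(responses)
    return sum((c / n) * math.log2(c * n / (n_s[s] * n_r[r]))
               for (s, r), c in joint.items())


def shuffle_corrected(stimuli, responses, n_shuffles=200, seed=0):
    """Return (raw, spurious, corrected) information values in bits."""
    rng = random.Random(seed)
    raw = plugin_information(stimuli, responses)
    labels = list(stimuli)
    spurious = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # destroy any link between stimuli and responses
        spurious += plugin_information(labels, responses)
    spurious /= n_shuffles
    return raw, spurious, raw - spurious


# Hypothetical binned responses: ten trials for each of two stimuli
stimuli = [0] * 10 + [1] * 10
responses = [1, 2, 2, 1, 3, 2, 1, 2, 2, 3, 2, 3, 3, 2, 4, 3, 3, 2, 4, 3]
raw, spurious, corrected = shuffle_corrected(stimuli, responses)
print(f"raw {raw:.3f}, spurious {spurious:.3f}, corrected {corrected:.3f} bits")
```

The corrected value is only as good as the assumption that the bias of the real pairing equals the spurious information of the reshuffled one, which is why the text calls the trick heuristic.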















A different correction procedure (called ‘jack-knife’) is based on the assumption that the bias is proportional to 1/N, where N is the number of responses (data points) used in the estimation. One computes, beside the original estimate 〈I_N〉, N auxiliary estimates 〈I_{N−1}〉_k, by taking out from the data set response k, where k runs across the data set from 1 to N. The corrected estimate

(A2.26) $\langle I \rangle = N \langle I_N \rangle - \frac{1}{N} \sum_{k=1}^{N} (N-1) \langle I_{N-1} \rangle_k$

is free from bias (to leading order in 1/N), if the proportionality factor is more or less the same in the original and auxiliary estimates. This procedure is very time-consuming, and it suffers from the same imprecision as any algorithm that tries to determine a quantity as the result of the subtraction of two large and nearly equal terms; in this case the terms have been made large on purpose, by multiplying them by N and N−1.
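A generic jack-knife correction takes only a few lines. The sketch below is illustrative: `jackknife`, `plugin_information` and `biased_variance` are names chosen for the example, and the data are hypothetical. The variance with 1/N normalization is included as a sanity check because its bias is exactly proportional to 1/N, so the jack-knife removes it completely; for information estimates the correction holds only to leading order.

```python
import math
from collections import Counter


def jackknife(estimator, data):
    """Corrected estimate N*<I_N> - (1/N) * sum_k (N-1)*<I_{N-1}>_k."""
    n = len(data)
    full = estimator(data)
    # One auxiliary estimate per data point, each with that point removed
    leave_one_out = [estimator(data[:k] + data[k + 1:]) for k in range(n)]
    return n * full - (n - 1) * sum(leave_one_out) / n


def plugin_information(pairs):
    """Plug-in mutual information, in bits, from (stimulus, response bin) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    n_s = Counter(s for s, _ in pairs)
    n_r = Counter(r for _, r in pairs)
    return sum((c / n) * math.log2(c * n / (n_s[s] * n_r[r]))
               for (s, r), c in joint.items())


def biased_variance(xs):
    """Variance with 1/N normalization (bias exactly proportional to 1/N)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


# Sanity check: the jack-knife recovers the unbiased 1/(N-1) variance
sample = [1.0, 2.0, 3.0]
variance_check = jackknife(biased_variance, sample)

# Hypothetical (stimulus, response bin) pairs, ten trials per stimulus
pairs = [(0, r) for r in (1, 2, 2, 1, 3, 2, 1, 2, 2, 3)] + \
        [(1, r) for r in (2, 3, 3, 2, 4, 3, 3, 2, 4, 3)]
raw_information = plugin_information(pairs)
corrected_information = jackknife(plugin_information, pairs)
print(variance_check, raw_information, corrected_information)
```

The N + 1 evaluations of the estimator are what makes the procedure time-consuming for large data sets.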

A more fundamental approach (Miller, 1955) is to derive an analytical expression for the bias (or, more precisely, for its leading terms in an expansion in 1/N, the inverse of the sample size). This allows the estimation of the bias from the data itself, and its subsequent subtraction, as discussed in Treves and Panzeri (1995) and Panzeri and Treves (1996). Such a procedure produces satisfactory results, thereby lowering the size of the sample required for a given accuracy in the estimate by about an order of magnitude (Golomb et al., 1997). However, it does not, in itself, make possible measures of the information contained in very complex responses with few trials. As a rule of thumb, the number of trials per stimulus required for a reasonable estimate of information, once the subtractive correction is applied, is of the order of the effectively independent (and utilized) bins in which the response space can be partitioned (Panzeri and Treves, 1996).
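The leading term of the analytical bias can be evaluated directly from a table of response counts. The sketch below assumes the first-order expression given by Panzeri and Treves (1996), in which the bias is [Σ_s (R_s − 1) − (R − 1)] / (2N ln 2), with R_s the number of relevant response bins for stimulus s and R the number of relevant bins overall. Counting the occupied bins as the ‘relevant’ ones is a simplification adopted for this example only, and the function name and counts are hypothetical.

```python
import math


def first_order_bias(counts):
    """Leading 1/N bias term, in bits, for a table counts[s][r] of trial counts."""
    n_trials = sum(sum(row) for row in counts)
    n_bins = len(counts[0])
    # Simplifying assumption: a bin is 'relevant' if it is occupied at least once
    bins_per_stimulus = [sum(1 for c in row if c > 0) for row in counts]
    bins_overall = sum(1 for r in range(n_bins)
                       if any(row[r] > 0 for row in counts))
    numerator = sum(k - 1 for k in bins_per_stimulus) - (bins_overall - 1)
    return numerator / (2.0 * n_trials * math.log(2))


# Hypothetical counts: 2 stimuli, 3 response bins, 10 trials per stimulus
counts = [[3, 3, 4],
          [5, 5, 0]]
bias = first_order_bias(counts)
print(f"estimated bias: {bias:.4f} bits, to be subtracted from the raw estimate")
```

Because the term scales as the number of bins divided by the number of trials, it also illustrates the rule of thumb quoted above: the correction stays small only when trials per stimulus are at least comparable in number to the bins in use.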

(p.306) A2.2.3 Decoding procedures for multiple cell responses

The bias of information measures grows with the dimensionality of the response space, and for all practical purposes the limit on the number of dimensions that can lead to reasonably accurate direct measures, even when applying a correction procedure, is quite low, two to three. This implies, in particular, that it is not possible to apply Eq. A2.25 to extract the information content in the responses of several cells (more than two to three) recorded simultaneously. One way to address the problem is then to apply some strong form of regularization to the multiple cell responses. Smoothing has already been mentioned as a form of regularization which can be tuned from very soft to very strong, and which preserves the structure of the response space. Binning is another form, which changes the nature of the responses from continuous to discrete, but otherwise preserves their general structure, and which can also be tuned from soft to strong. Other forms of regularization involve much more radical transformations, or changes of variables. Of particular interest for information estimates is a change of variables that transforms the response space into the stimulus set, by applying an algorithm that derives a predicted stimulus from the response vector, i.e. the firing rates of all the cells, on each trial. Applying such an algorithm is called decoding. Of course, the predicted stimulus is not necessarily the same as the actual one. Therefore the term decoding should not be taken to imply that the algorithm works successfully, each time identifying the actual stimulus. The predicted stimulus is simply a function of the response, as determined by the algorithm considered. Just as with any regularizing transform, it is possible to compute the mutual information between actual and predicted stimuli, instead of the original one between stimuli and responses. 
Since information about (real) stimuli can only be lost and not be created by the transform, the information measured in this way cannot exceed the real information in the responses. If the decoding algorithm is efficient, it manages to preserve nearly all the information contained in the raw responses, while if it is poor, it loses a large portion of it. If the responses themselves provided all the information about stimuli, and the decoding is optimal, then predicted stimuli coincide with the actual stimuli, and the information extracted equals the entropy of the stimulus set.

The procedure of extracting information values after applying a decoding algorithm is schematized below, with s the actual stimuli, r the responses and s’ the predicted stimuli:

$s \Longrightarrow r \longrightarrow s'$

(p.307) where the double line indicates the transformation from stimuli to responses operated by the nervous system, while the single line indicates the further transformation operated by the decoding procedure.

A slightly more complex variant of this procedure is a decoding step that extracts from the response on each trial not a single predicted stimulus, but rather probabilities that each of the possible stimuli was the actual one. The joint probabilities of actual and posited stimuli can be averaged across trials, and information computed from the resulting probability matrix (S × S). Computing information in this way takes into account the relative uncertainty in assigning a predicted stimulus to each trial, an uncertainty that is instead not considered by the previous procedure based solely on the identification of the maximally likely stimulus (Treves, 1997). Maximum likelihood information values I_ml tend therefore to be higher than probability information values I_p, although in very specific situations the reverse could also be true.

The same correction procedures for limited sampling can be applied to information values computed after a decoding step. Values obtained from maximum likelihood decoding, I_ml, suffer from limited sampling more than those obtained from probability decoding, I_p, since each trial contributes a whole ‘brick’ of weight 1/N (N being the total number of trials), whereas with probabilities each brick is shared among several slots of the (S × S) probability matrix. The neural network procedure devised by Hertz et al. (1992) can in fact be thought of as a decoding procedure based on probabilities, which deals with limited sampling not by applying a correction but rather by strongly regularizing the original responses. When decoding is used, the rule of thumb becomes that the minimal number of trials per stimulus required for accurate information measures is roughly equal to the size of the stimulus set, if the subtractive correction is applied (Panzeri and Treves, 1996).

Decoding algorithms

Any transformation from the response space to the stimulus set could be used in decoding, but of particular interest are the transformations that either approach optimality, so as to minimize information loss and hence the effect of decoding, or else are implementable by mechanisms that could conceivably be operating in the real system, so as to extract information values that could be extracted by the system itself.

The optimal transformation is in theory well-defined: one should estimate from the data the conditional probabilities P(r | s), and use Bayes’ rule to convert them into the conditional probabilities P(s’ | r). Having these for any value of r, one could use them to estimate I_p, and, after selecting for each particular real response the stimulus with the highest conditional probability, to estimate I_ml. To avoid biasing the estimation of conditional probabilities, the responses used in estimating P(r | s) should not include the particular response for which P(s’ | r) is going to be derived (jack-knife cross validation). In practice, however, the estimation of P(r | s) in usable form involves the fitting of some simple function to the responses. This need for fitting, together with the approximations implied in the estimation of the various quantities, prevents us from defining the really optimal decoding, and leaves us with various algorithms, depending essentially on the fitting function used, which are hopefully close to optimal in some conditions. We have experimented extensively with (p.308) two such algorithms, both of which approximate Bayesian decoding (Rolls, Treves and Tovee, 1997). Both these algorithms fit the response vectors produced over several trials by the cells being recorded to a product of conditional probabilities for the response of each cell given the stimulus. In one case the single cell conditional probability is assumed to be Gaussian (truncated at zero), in the other it is assumed to be Poisson (with an additional weight at zero). Details of these algorithms are given by Rolls, Treves and Tovee (1997).
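The structure of such a decoder, a product of single-cell conditional probabilities inverted with Bayes’ rule, can be sketched in a few lines. The example below is a deliberately simplified stand-in and not the algorithm of Rolls, Treves and Tovee (1997): it assumes a plain Poisson model for spike counts without the additional weight at zero, a small floor on the mean counts to avoid taking the logarithm of zero, and hypothetical data and function names. The trial being decoded is assumed to have been excluded from the training set by the caller, as required by the cross validation.

```python
import math


def poisson_log_likelihood(counts, mean_counts, floor=1e-3):
    """log P(r | s) as a product over cells of Poisson terms (assumed model)."""
    total = 0.0
    for n, lam in zip(counts, mean_counts):
        lam = max(lam, floor)  # assumed floor: avoids log(0) for silent cells
        total += n * math.log(lam) - lam - math.lgamma(n + 1)
    return total


def decode_posterior(response, training, prior=None):
    """P(s' | r) from Bayes' rule; training maps stimulus -> list of count vectors."""
    stimuli = sorted(training)
    if prior is None:
        prior = {s: 1.0 / len(stimuli) for s in stimuli}
    log_post = {}
    for s in stimuli:
        trials = training[s]
        means = [sum(t[c] for t in trials) / len(trials)
                 for c in range(len(response))]
        log_post[s] = math.log(prior[s]) + poisson_log_likelihood(response, means)
    peak = max(log_post.values())  # subtract the peak for numerical stability
    unnormalized = {s: math.exp(v - peak) for s, v in log_post.items()}
    z = sum(unnormalized.values())
    return {s: u / z for s, u in unnormalized.items()}


# Hypothetical spike counts of two cells, three training trials per stimulus
training = {"A": [[8, 1], [10, 0], [9, 2]],
            "B": [[1, 9], [0, 11], [2, 8]]}
posterior = decode_posterior([9, 1], training)
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

The full posterior feeds the probability-based measure, while `predicted` alone is what the maximum likelihood measure retains.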

Biologically plausible decoding algorithms are those that limit the algebraic operations used to types that could be easily implemented by neurones, e.g. dot product summations, thresholding and other single-cell nonlinearities, competition and contrast enhancement among the outputs of nearby cells. There is then never any need for fitting functions or other sophisticated approximations, but of course the degree of arbitrariness in selecting a particular algorithm remains substantial, and a comparison among different choices based on which yields the higher information values may favour one choice in a given situation and another choice with a different data set.

To summarize, the key idea in decoding, in our context of estimating information values, is that it allows substitution of a possibly very high-dimensional response space (which is difficult to sample and regularize) with a reduced object much easier to handle, that is with a discrete set equivalent to the stimulus set. The mutual information between the new set and the stimulus set is then easier to estimate even with limited data, and if the assumptions about population coding, underlying the particular decoding algorithm used, are justified, the value obtained approximates the original target, the mutual information between stimuli and responses. For each response recorded, one can use all the responses except for that one to generate estimates of the average response vectors (the response for each unit in the population) to each stimulus. Then one considers how well the selected response vector matches the average response vectors, and uses the degree of matching to estimate, for all stimuli, the probability that they were the actual stimuli. The form of the matching embodies the general notions about population encoding, for example the ‘degree of matching’ might be simply the dot product between the current vector and the average vector (r_av), suitably normalized over all average vectors to generate probabilities

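(A2.27) $P(s' \mid \mathbf{r}(s)) = \frac{\mathbf{r}(s) \cdot \mathbf{r}_{av}(s')}{\sum_{s''} \mathbf{r}(s) \cdot \mathbf{r}_{av}(s'')}$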

where s″ is a dummy variable. One ends up, then, with a table of conjoint probabilities P(s, s’), and another table obtained by selecting for each trial the most likely (or predicted) stimulus s_p, P(s, s_p). Both s’ and s_p stand for all possible stimuli, and hence belong to the same set S. These can be used to estimate mutual information values based on probability decoding (I_p) and on maximum likelihood decoding (I_ml)

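(A2.28) $I_p = \sum_{s \in S} \sum_{s' \in S} P(s, s') \log_2 \frac{P(s, s')}{P(s) P(s')}$

(A2.29) $I_{ml} = \sum_{s \in S} \sum_{s_p \in S} P(s, s_p) \log_2 \frac{P(s, s_p)}{P(s) P(s_p)}$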
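The whole chain, from leave-one-out average vectors through dot-product matching to the two information measures, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy firing rates are hypothetical, every stimulus is assumed to have at least two trials (so that an average vector exists after one trial is left out), firing rates are assumed non-negative, and no limited sampling correction is applied.

```python
import math


def mutual_information(joint):
    """Mutual information, in bits, of a joint probability table joint[s][s2]."""
    p_row = [sum(row) for row in joint]
    p_col = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (p_row[i] * p_col[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)


def decode_information(trials):
    """trials: list of (stimulus index, response vector). Returns (I_p, I_ml)."""
    n_stim = 1 + max(s for s, _ in trials)
    n = len(trials)
    joint_p = [[0.0] * n_stim for _ in range(n_stim)]
    joint_ml = [[0.0] * n_stim for _ in range(n_stim)]
    for k, (s, r) in enumerate(trials):
        # Average response vector to each stimulus, leaving out trial k
        averages = []
        for s2 in range(n_stim):
            others = [v for j, (t, v) in enumerate(trials) if t == s2 and j != k]
            averages.append([sum(col) / len(others) for col in zip(*others)])
        # Degree of matching: dot product, normalized over stimuli
        match = [sum(a * b for a, b in zip(r, avg)) for avg in averages]
        total = sum(match)
        probs = [m / total for m in match] if total > 0 else [1.0 / n_stim] * n_stim
        for s2, p in enumerate(probs):
            joint_p[s][s2] += p / n          # each trial's 'brick' is shared out
        predicted = max(range(n_stim), key=lambda s2: probs[s2])
        joint_ml[s][predicted] += 1.0 / n    # a whole 'brick' of weight 1/N
    return mutual_information(joint_p), mutual_information(joint_ml)


# Hypothetical firing rates of two cells: overlapping responses ...
noisy = [(0, [9, 2]), (0, [7, 4]), (0, [8, 1]),
         (1, [3, 8]), (1, [2, 9]), (1, [5, 6])]
noisy_p, noisy_ml = decode_information(noisy)
# ... and perfectly separable (orthogonal) responses
clean = [(0, [5, 0]), (0, [6, 0]), (0, [4, 0]),
         (1, [0, 5]), (1, [0, 7]), (1, [0, 6])]
clean_p, clean_ml = decode_information(clean)
print(noisy_p, noisy_ml, clean_p, clean_ml)
```

With orthogonal responses both measures reach the entropy of the stimulus set (1 bit for two equiprobable stimuli), whereas with overlapping responses the probability-based value is lowered by the uncertainty retained in each trial’s probabilities.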

(p.309) A2.3 Main results obtained from applying information-theoretic analyses

Although information theory provides the natural mathematical framework for analysing the performance of neuronal systems, its applications in neuroscience have been for many years rather sparse and episodic (e.g. MacKay and McCulloch, 1952; Eckhorn and Pöpel, 1974, 1975; Eckhorn et al., 1976). One reason for this lukewarm interest in information theory applications has certainly been the great effort that was apparently required, due essentially to the limited sampling problem, in order to obtain reliable results. Another reason has been the hesitation towards analysing as a single complex ‘black-box’ large neuronal systems all the way from some external, easily controllable inputs, down to neuronal outputs in some central cortical area of interest, for example including all visual stations from the periphery to the end of the ventral visual stream in the temporal lobe. In fact, two important bodies of work, that have greatly helped revive interest in applications of the theory in recent years, both sidestep these two problems. The problem with analysing a huge black-box is avoided by considering systems at the sensory periphery; the limited sampling problem is avoided either by working with insects, in which sampling can be extensive (Bialek et al., 1991; de Ruyter van Steveninck and Laughlin, 1996; Rieke et al., 1996) or by utilizing a formal model instead of real data (Atick and Redlich, 1990; Atick, 1992). Both approaches have provided insightful quantitative analyses that are in the process of being extended to more central mammalian systems (see e.g. Atick et al., 1996).

A2.3.1 Temporal codes versus rate codes (at the single unit level)

In the third of a series of papers which analyse the response of single units in the primate inferior temporal cortex to a set of static visual stimuli, Optican and Richmond (1987) have applied information theory in a particularly direct and useful way. To ascertain the relevance of stimulus-locked temporal modulations in the firing of those units, they have compared the amount of information about the stimuli that could be extracted from just the firing rate, computed over a relatively long interval of 384 ms, with the amount of information that could be extracted from a more complete description of the firing, that included temporal modulation. To derive this latter description (the temporal code) they have applied principal component analysis (PCA) to the temporal response vectors recorded for each unit on each trial. A temporal response vector was defined as a vector whose components are the firing rates in each of 64 successive 6 ms time bins. The (64 × 64) covariance matrix was calculated across all trials of a particular unit, and diagonalized. The first few eigenvectors of the matrix, those with the largest eigenvalues, are the principal components of the response, and the weights of each response vector on these four to five components can be used as a reduced description of the response, which still preserves, unlike the single value giving the mean firing rate along the entire interval, the main features of the temporal modulation within the interval. Thus a four to five-dimensional temporal code could be contrasted with a one-dimensional rate code, and the comparison made quantitative by measuring the respective values for the mutual information with the stimuli. 
Although the initial claim, that the temporal code carried nearly three times as much information as the rate code, was later found to be an artefact of limited sampling, and more recent analyses tend to minimize the additional information in (p.310) the temporal description (Optican et al., 1991; Eskandar et al., 1992; Tovee et al., 1993; Heller et al., 1995), this type of application has immediately appeared straightforward and important, and it has led to many developments. By concentrating on the code expressed in the output rather than on the characterization of the neuronal channel itself, this approach is not affected much by the potential complexities of the preceding black box. Limited sampling, on the other hand, is a problem, particularly because it affects codes with a larger number of components, for example the four to five components of the PCA temporal description, much more than the one-dimensional firing rate code. This is made evident in the Heller et al. (1995) paper, in which the comparison is extended to several more detailed temporal descriptions, including a binary vector description in which the presence or not of a spike in each 1 ms bin of the response constitutes a component of a 320-dimensional vector. Obviously, this binary vector must contain at least all the information present in the reduced descriptions, whereas in the results of Heller et al. (1995), despite the use of the sophisticated neural network procedure to control limited sampling biases, the binary vector appears to be the code that carries the least information of all. In practice, with the data samples available in the experiments that have been done, and even when using the most recent procedures to control limited sampling, reliable comparison can be made only with up to two- to three-dimensional codes.
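The construction of a reduced temporal code can be illustrated on a toy scale. The sketch below is not the analysis of Optican and Richmond (1987): it uses hypothetical four-bin response vectors instead of 64 bins, extracts only the first principal component by power iteration (further components would require deflation or a full eigendecomposition), and takes the weight of a vector on the component as a plain dot product. All names and numbers are assumptions made for the example.

```python
import math
import random


def covariance_matrix(vectors):
    """Covariance, across trials, of the temporal response vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    centred = [[v[i] - mean[i] for i in range(d)] for v in vectors]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_component(cov, iterations=200, seed=0):
    """Unit eigenvector with the largest eigenvalue, found by power iteration."""
    rng = random.Random(seed)
    d = len(cov)
    v = [rng.random() + 0.1 for _ in range(d)]  # arbitrary positive start
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def weight_on_component(vector, component):
    """Weight of a response vector on a principal component (dot product)."""
    return sum(a * b for a, b in zip(vector, component))


# Hypothetical firing rates (Hz) in four successive time bins, one row per trial
responses = [[40.0, 20.0, 10.0, 5.0],
             [60.0, 30.0, 12.0, 6.0],
             [20.0, 12.0, 9.0, 5.0],
             [80.0, 38.0, 14.0, 7.0]]
pc1 = leading_component(covariance_matrix(responses))
weights = [weight_on_component(r, pc1) for r in responses]
print(pc1, weights)
```

Each trial is thereby summarized by one number per retained component, and it is the information carried by these few numbers that is compared with that carried by the mean firing rate.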

Overall, the main result of these analyses applied to the responses to static stimuli in the temporal visual cortex of primates is that not much more information (perhaps only up to 10% more) can be extracted from temporal codes than from the firing rate measured over a judiciously chosen interval (Tovee et al., 1993; Heller et al., 1995). In earlier visual areas this additional fraction of information can be larger, due especially to the increased relevance, earlier on, of precisely locked transient responses (Kjaer et al., 1994; Golomb et al., 1994; Heller et al., 1995). This is because if the responses to some stimuli are more transient and to others more sustained, this will result in more information in the temporal modulation of the response. A similar effect arises from differences in the mean response latency to different stimuli (Tovee et al., 1993). However, the relevance of more substantial temporal codes for static visual stimuli remains to be demonstrated. For non-static visual stimuli and for other cortical systems, similar analyses have largely yet to be carried out, although clearly one expects to find much more prominent temporal effects e.g. in the auditory system (Nelken et al., 1994; deCharms and Merzenich, 1996).

A2.3.2 The speed of information transfer

It is intuitive that if short periods of firing of single cells are considered, there is less time for temporal modulation effects. The information conveyed about stimuli by the firing rate and that conveyed by more detailed temporal codes become similar in value. When the firing periods analysed become shorter than roughly the mean interspike interval, even the statistics of firing rate values on individual trials cease to be relevant, and the information content of the firing depends solely on the mean firing rates across all trials with each stimulus. This is expressed mathematically by considering the amount of information provided as a function of the length t of the time window over which firing is analysed, and taking the limit for t→0 (Skaggs et al., 1993; Panzeri et al., 1996b). To first order in t, only two responses can occur in (p.311) a short window of length t: either the emission of an action potential, with probability t r_s, where r_s is the mean firing rate calculated over many trials using the same window and stimulus; or no action potential, with probability 1 − t r_s. Inserting these conditional probabilities into Eq. A2.21, taking the limit and dividing by t, one obtains for the derivative of the stimulus-specific transinformation

(A2.30) $\frac{dI(s)}{dt} = r_s \log_2 \frac{r_s}{\langle r \rangle} + \frac{\langle r \rangle - r_s}{\ln 2}$

where 〈 r 〉 is the grand mean rate across stimuli. This formula thus gives the rate, in bits/s, at which information about a stimulus begins to accumulate when the firing of a cell is recorded. Such information rate depends only on the mean firing rate to that stimulus and on the grand mean rate across stimuli. As a function of rs, it follows the U-shaped curve in Fig. A2.3. The curve is universal, in the sense that it applies irrespective of the detailed firing statistics of the cell, and it expresses the fact that the emission or not of a spike in a short window conveys information inasmuch as the mean response to a given stimulus is above or below the overall mean rate. No information is conveyed about those stimuli the mean response to which is the same as the overall mean. In practice, although the curve describes only the universal behaviour of the initial slope of the specific information as a function of time, it approximates well the full specific information computed even over rather long periods (Rolls, Critchley and Treves, 1996; Rolls, Treves, Tovee and Panzeri, 1997).


Fig. A2.3 Time derivative of the stimulus specific information as a function of firing rate, for a cell firing at a grand mean rate of 50 Hz. For different grand mean rates, the graph would be simply rescaled.
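The shape of the curve is easy to tabulate. The sketch below assumes the first-order expression obtained from the two-response argument in the text, dI(s)/dt = r_s log2(r_s/〈r〉) + (〈r〉 − r_s)/ln 2; the function name and the sampled firing rates are chosen for the example only.

```python
import math


def specific_information_rate(r_s, mean_rate):
    """Initial rate (bits/s) at which information about stimulus s accumulates."""
    # The term r_s * log2(r_s / <r>) tends to zero as r_s tends to zero
    log_term = r_s * math.log2(r_s / mean_rate) if r_s > 0 else 0.0
    return log_term + (mean_rate - r_s) / math.log(2)


grand_mean = 50.0  # Hz, as in Fig. A2.3
for r_s in (0.0, 25.0, 50.0, 75.0, 100.0):
    rate = specific_information_rate(r_s, grand_mean)
    print(f"r_s = {r_s:5.1f} Hz -> {rate:6.2f} bits/s")
```

The rate vanishes when r_s equals the grand mean and is positive on either side of it, which gives the U shape; a silent response to a stimulus (r_s = 0) is informative at a rate of 〈r〉/ln 2 bits/s.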

Averaging Eq. A2.30 across stimuli one obtains the time derivative of the mutual information. Further dividing by the overall mean rate yields the dimensionless quantity

(A2.31) $\chi = \sum_s P(s) \frac{r_s}{\langle r \rangle} \log_2 \frac{r_s}{\langle r \rangle}$

which measures, in bits, the mutual information per spike provided by the cell (Bialek et al., 1991; Skaggs et al., 1993). One can prove that this quantity can range from 0 to log2 (1/a)

(A2.32) $0 \le \chi \le \log_2 (1/a)$

(p.312) where a is the sparseness defined in Chapter 1. For mean rates r_s distributed in a nearly binary fashion, χ is close to its upper limit log2(1/a), whereas for mean rates that are nearly uniform, or at least unimodally distributed, χ is relatively close to zero (Panzeri et al., 1996b). In practice, whenever a large number of more or less ‘ecological’ stimuli are considered, mean rates are not distributed in arbitrary ways, but rather tend to follow stereotyped distributions (Panzeri et al., 1996a), and as a consequence χ and a (or, equivalently, its logarithm) tend to covary rather than to be independent variables (Skaggs and McNaughton, 1992). Therefore, measuring sparseness is in practice nearly equivalent to measuring information per spike, and the rate of rise in mutual information, χ 〈 r 〉, is largely determined by the sparseness a and the overall mean firing rate 〈 r 〉. Another quantity whose measurement is equivalent to measuring the information per spike is the breadth of tuning B (Smith and Travers, 1979; see Section 10.4, where the breadth of tuning index is defined), at least when n stimuli are presented with equal frequency. It is easy to show that

(A2.33) $\chi = (1 - B) \log_2 n$

so that extreme selectivity corresponds to B = 0 and χ = log2 n, whereas absence of any tuning corresponds to B = 1 and χ = 0. Equations A2.32 and A2.33 can be turned into an inequality for the breadth of tuning as a function of sparseness, or vice versa, e.g.

(A2.34) $B \ge 1 - \frac{\log_2 (1/a)}{\log_2 n}$
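The relations among information per spike, sparseness and breadth of tuning can be checked numerically. The sketch below rests on definitions that are assumed here rather than quoted from the text: the sparseness is taken as a = 〈r〉² / 〈r²〉 with averages over stimuli, and the breadth of tuning as the entropy of the normalized rate profile divided by log2 n (the form usually attributed to Smith and Travers, 1979). The mean rates are hypothetical.

```python
import math


def sparseness(rates, probs):
    """Assumed definition: a = <r>^2 / <r^2>, averages taken over stimuli."""
    mean = sum(p * r for p, r in zip(probs, rates))
    mean_sq = sum(p * r * r for p, r in zip(probs, rates))
    return mean * mean / mean_sq


def information_per_spike(rates, probs):
    """chi = sum_s P(s) (r_s/<r>) log2(r_s/<r>), in bits per spike."""
    mean = sum(p * r for p, r in zip(probs, rates))
    return sum(p * (r / mean) * math.log2(r / mean)
               for p, r in zip(probs, rates) if r > 0)


def breadth_of_tuning(rates):
    """Assumed definition: entropy of the normalized rates divided by log2 n."""
    total = sum(rates)
    entropy = -sum((r / total) * math.log2(r / total) for r in rates if r > 0)
    return entropy / math.log2(len(rates))


# Hypothetical mean rates (Hz) to four equiprobable stimuli
rates = [20.0, 6.0, 3.0, 1.0]
probs = [0.25] * 4
chi = information_per_spike(rates, probs)
a = sparseness(rates, probs)
b = breadth_of_tuning(rates)
print(f"chi = {chi:.3f} bits/spike, upper limit log2(1/a) = {math.log2(1 / a):.3f}")
print(f"B = {b:.3f}, (1 - B) log2 n = {(1 - b) * math.log2(len(rates)):.3f}")
```

For a strictly binary profile such as rates of 20, 0, 0 and 0 Hz the same functions give χ equal to its upper limit log2(1/a) and B equal to zero.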

The important point to note about the single-cell information rate χ 〈 r 〉 is that, to the extent that different cells express non-redundant codes, as discussed below, the instantaneous information flow across a population of C cells can be taken to be simply Cχ 〈 r 〉, and this quantity can easily be measured directly without major limited sampling biases, or else inferred indirectly through measurements of the sparseness a. Values for the information rate χ 〈 r 〉 that have been published range from 2–3 bits/s for rat hippocampal cells (Skaggs et al., 1993) to 10–30 bits/s for primate temporal cortex visual cells (Rolls, Treves, Tovee and Panzeri, 1997), and could be compared with analogous measurements in the sensory systems of frogs and crickets, in the 100–300 bits/s range (Rieke et al., 1993).

If the first time-derivative of the mutual information measures information flow, successive derivatives characterize, at the single-cell level, different firing modes. This is because whereas the first derivative is universal and depends only on the mean firing rates to each stimulus, the next derivatives depend also on the variability of the firing rate around its mean value, across trials, and take different forms in different firing regimes. Thus they can serve as a measure of discrimination among firing regimes with limited variability, for which, for example, the second derivative is large and positive, and firing regimes with large variability, for which the second derivative is large and negative. Poisson firing, in which in every short period of time there is a fixed probability of emitting a spike irrespective of previous firing, is an example of large variability, and the second derivative of the mutual information can be calculated to be

(A2.35) $\frac{d^2 I}{dt^2} = \frac{\langle r \rangle^2}{a \ln 2} \left[ \ln a + (1 - a) \right]$

(p.313) where a is the sparseness defined in Chapter 1. This quantity is always negative. Strictly periodic firing is an example of zero variability, and in fact the second time-derivative of the mutual information becomes infinitely large in this case (although actual information values measured in a short time interval remain of course finite even for exactly periodic firing, because there is still some variability, ± 1, in the number of spikes recorded in the interval). Measures of mutual information from short intervals of firing of temporal cortex visual cells have revealed a degree of variability intermediate between that of periodic and of Poisson regimes (Rolls, Treves, Tovee and Panzeri, 1997). Similar measures can also be used to contrast the effect of the graded nature of neuronal responses, once they are analysed over a finite period of time, with the information content that would characterize neuronal activity if it reduced to a binary variable (Panzeri et al., 1996b). A binary variable with the same degree of variability would convey information at the same instantaneous rate (the first derivative being universal), but in amounts reduced by, for example, 20–30% when analysed over times of the order of the interspike interval or longer.

A2.3.3 Redundancy versus independence across cells

The rate at which a single cell provides information translates into an instantaneous information flow across a population (with a simple multiplication by the number of cells) only to the extent that different cells provide different (independent) information. To verify whether this condition holds, one cannot extend to multiple cells the simplified formula for the first time-derivative, because it is made simple precisely by the assumption of independence between spikes, and one cannot even measure directly the full information provided by multiple (more than two to three) cells, because of the limited sampling problem discussed above. Therefore one has to analyse the degree of independence (or conversely of redundancy) either directly among pairs—at most triplets—of cells, or indirectly by using decoding procedures to transform population responses. Obviously, the results of the analysis will vary a great deal with the particular neural system considered and the particular set of stimuli, or in general of neuronal correlates, used. For many systems, before undertaking to quantify the analysis in terms of information measures, it takes only a simple qualitative description of the responses to realize that there is a lot of redundancy and very little diversity in the responses. For example, if one selects pain-responsive cells in the somatosensory system and uses painful electrical stimulation of different intensities, most of the recorded cells are likely to convey pretty much the same information, signalling the intensity of the stimulation with the intensity of their single-cell response. Therefore, an analysis of redundancy makes sense only for a neuronal system which functions to represent, and enable discriminations between, a large variety of stimuli, and only when using a set of stimuli representative, in some sense, of that large variety.

Redundancy can be defined with reference to a multiple channel of capacity T(C) which can be decomposed into C separate channels of capacities Ti, i= 1,…, C:

(A2.36) $R = 1 - \frac{T(C)}{\sum_{i=1}^{C} T_i}$

so that when the C channels are multiplexed with maximal efficiency, T(C) = Σ_i T_i and R = 0. What is measured more easily, in practice, is the redundancy defined with reference to a (p.314) specific source (the set of stimuli with their probabilities). Then in terms of mutual information

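(A2.37) $R' = 1 - \frac{I(C)}{\sum_{i=1}^{C} I_i}$, with I(C) the mutual information conveyed by the C cells together and I_i that conveyed by cell i alone.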

Gawne and Richmond (1993) have measured the redundancy R’ among pairs of nearby primate inferior temporal cortex visual neurons, in their response to a set of 32 Walsh patterns. They have found values with a mean 〈R’〉 = 0.1 (and a mean single-cell transinformation of 0.23 bits). Since to discriminate 32 different patterns takes 5 bits of information, in principle one would need at least 22 cells providing each 0.23 bits of strictly orthogonal information to represent the full entropy of the stimulus set. Gawne and Richmond have reasoned, however, that, because of the overlap, y, in the information they provided, more cells would be needed than if the redundancy had been zero. They have constructed a simple model based on the notion that the overlap, y, in the information provided by any two cells in the population always corresponds to the average redundancy measured for nearby pairs. A redundancy R’ = 0.1 corresponds to an overlap y = 0.2 in the information provided by the two neurons, since, counting the overlapping information only once, two cells would yield 1.8 times the amount transmitted by one cell alone. If a fraction 1 − y = 0.8 of the information provided by a cell is novel with respect to that provided by another cell, a fraction (1 − y)^2 of the information provided by a third cell will be novel with respect to what was known from the first pair, and so on, yielding an estimate of I(C) = I(1) Σ_{i=0}^{C−1} (1 − y)^i for the total information conveyed by C cells. However such a sum saturates, in the limit of an infinite number of cells, at the level I(∞) = I(1)/y, implying in their case that even with very many cells, no more than 0.23/0.2 = 1.15 bits could be read off their activity, or less than a quarter of what was available as entropy in the stimulus set! Gawne and Richmond have concluded, therefore, that the average overlap among non-nearby cells must be considerably lower than that measured for cells close to each other.
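The saturation of the geometric sum is quickly verified with the numbers quoted above. The sketch below merely evaluates I(C) = I(1) Σ (1 − y)^i for a few population sizes; the function name is chosen for the example.

```python
import math


def population_information(single_cell_bits, overlap, n_cells):
    """I(C) = I(1) * sum_{i=0}^{C-1} (1 - y)^i, in bits."""
    return single_cell_bits * sum((1.0 - overlap) ** i for i in range(n_cells))


single, overlap = 0.23, 0.2      # bits per cell and pairwise overlap y
entropy = math.log2(32)          # 5 bits for 32 equiprobable patterns

for c in (1, 2, 5, 22, 100):
    print(c, round(population_information(single, overlap, c), 3))

pair = population_information(single, overlap, 2)
pair_redundancy = 1.0 - pair / (2 * single)          # recovers R' = 0.1
ceiling = single / overlap                            # I(1)/y = 1.15 bits
cells_if_independent = math.ceil(entropy / single)    # at zero redundancy
print(pair_redundancy, ceiling, cells_if_independent)
```

Two cells give 1.8 times the single-cell value, while one hundred cells already sit at the 1.15 bit ceiling, far short of the 5 bits that 22 fully independent cells would supply.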

The model above is simple and attractive, but experimental verification of the actual scaling of redundancy with number of cells entails collecting the responses of several cells interspersed in a population of interest. Gochin et al. (1994) have recorded from up to 58 cells in the primate temporal visual cortex, using sets of two to five visual stimuli, and have applied decoding procedures to measure the information content in the population response. The recordings were not simultaneous, but comparison with simultaneous recordings from a reduced number of cells indicated that the effect of recording the individual responses on separate trials was minor. The results were expressed in terms of the novelty N in the information provided by C cells, which, being defined as the ratio of such information to C times the average single-cell information, can be expressed as

(A2.38) $N(C) = \dfrac{I(C)}{\sum_{i} I_i} = 1 - R'$

and is thus the complement of the redundancy. An analysis of two different data sets, which included three information measures per data set, indicated a behaviour N(C) ≈ 1/√C, reminiscent of the improvement in the overall noise-to-signal ratio characterizing C independent processes contributing to the same signal. The analysis neglected, however, to consider limited sampling effects, and more seriously it neglected to consider saturation effects due to the information content approaching its ceiling, given by the entropy of the stimulus set. Since this ceiling was quite low, for 5 stimuli at log_2 5 = 2.32 bits, relative to the mutual information values measured from the population (an average of 0.26 bits, or 1/9 of the ceiling, was provided by single cells), it is conceivable that the novelty would have taken much larger values if larger stimulus sets had been used.

A simple formula describing the approach to the ceiling, and thus the saturation of information values as they come close to the entropy of the stimulus set, can be derived from a natural extension of the Gawne and Richmond (1993) model. In this extension, the information provided by single cells, measured as a fraction of the ceiling, is taken to coincide with the average overlap among pairs of randomly selected, not necessarily nearby, cells from the population. The actual value measured by Gawne and Richmond would have been, again, 1/22 = 0.045, below the overlap among nearby cells, y = 0.2. The assumption that y, measured across any pair of cells, would have been as low as the fraction of information provided by single cells is equivalent to conceiving of single cells as ‘covering’ a random portion y of information space, and thus of randomly selected pairs of cells as overlapping in a fraction y^2 of that space, and so on, as postulated by the Gawne and Richmond (1993) model, for higher numbers of cells. The approach to the ceiling is then described by the formula

(A2.39) $I(C) = H\left[1 - (1-y)^{C}\right] = H\left\{1 - \exp\left[C \ln(1-y)\right]\right\}$

that is, a simple exponential saturation to the ceiling. This simple law indeed describes remarkably well the trend in the data analysed by Rolls, Treves and Tovee (1997). Although the model has no reason to be exact, and therefore its agreement with the data should not be expected to be accurate, the crucial point it embodies is that deviations from a purely linear increase in information with the number of cells analysed are due solely to the ceiling effect. Aside from the ceiling, due to the sampling of an information space of finite entropy, the information contents of different cells’ responses are independent of each other. Thus, in the model, the observed redundancy (or indeed the overlap) is purely a consequence of the finite size of the stimulus set. If the population were probed with larger and larger sets of stimuli, or more precisely with sets of increasing entropy, and the amount of information conveyed by single cells were to remain approximately the same, then the fraction of space ‘covered’ by each cell, again y, would get smaller and smaller, tending to eliminate redundancy for very large stimulus entropies (and a fixed number of cells). The actual data were obtained with limited numbers of stimuli, and therefore cannot probe directly the conditions in which redundancy might reduce to zero. The data are consistent, however, with the hypothesis embodied in the simple model, as shown also by the near exponential approach to lower ceilings found for information values calculated with reduced subsets of the original set of stimuli (Rolls, Treves and Tovee, 1997).
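
A minimal numerical sketch of the ceiling effect follows. It assumes that the saturation law takes the form I(C) = H[1 − (1 − y)^C] with y = I(1)/H, where H is the entropy of the stimulus set; this is what the geometric sum of the Gawne and Richmond model gives once I(1)/y is identified with the ceiling. The function names are hypothetical, and the numbers used (a 5 bit ceiling, 0.23 bits per cell) are the ones quoted in the text.

```python
import math


def saturating_information(entropy, i_single, n_cells):
    """I(C) = H * (1 - exp(C * ln(1 - y))), with y = I(1) / H (assumed form)."""
    if not 0.0 < i_single < entropy:
        raise ValueError("single-cell information must lie between 0 and H")
    y = i_single / entropy
    return entropy * (1.0 - math.exp(n_cells * math.log(1.0 - y)))


def novelty(entropy, i_single, n_cells):
    """Ratio of I(C) to C times the single-cell information."""
    return saturating_information(entropy, i_single, n_cells) / (n_cells * i_single)


# For small C the growth is nearly linear; the novelty falls only as the
# ceiling H is approached.
for n_cells in (1, 2, 5, 10, 22, 50):
    print(n_cells,
          round(saturating_information(5.0, 0.23, n_cells), 3),
          round(novelty(5.0, 0.23, n_cells), 3))
```

Raising the ceiling while keeping the single-cell information fixed makes y smaller and keeps the novelty close to 1 over a wider range of C, which is the point made in the text about larger stimulus sets.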

The picture emerging from this set of analyses, all performed towards the end of the ventral visual stream of the monkey, is that the representation of at least some classes of objects in those areas is achieved with minimal redundancy by cells that are allocated each to analyse a different aspect of the visual stimulus. This minimal redundancy is what would be expected of a self-organizing system in which different cells acquired their response selectivities through a random process, with or without local competition among nearby cells (see Chapter 4). At the same time, such low redundancy could also very well result in a system that is organized under some strong teaching input, so that the emerging picture is compatible with a simple random process, but by no means represents evidence in favour of its occurrence. What appears to be more solid evidence is that towards the end of one part of the visual system redundancy may be effectively minimized, a finding consistent with the general idea that one of the functions of the early visual system is indeed that of progressively minimizing redundancy in the representation of visual stimuli (Attneave, 1954; Barlow, 1961).

A2.3.4 The metric structure of neuronal representations

Further analyses can be made on the results obtained by extracting information measures from population responses, using decoding algorithms. One such analysis is that of the metric content of the representation of a set of stimuli, and is based on the comparison of information and percentage correct measures. The percentage correct, or fraction of correct decodings, is immediately extracted from decoding procedures by collating trials in which the decoded stimulus coincided with the actual stimulus presented. Mutual information measures, as noted above, take into account further aspects of the representation provided by the population of cells than percent correct measures, because they depend on the distribution of wrong decodings among all the other stimuli in the set (the mutual information I_p based on the distribution of probabilities also takes into account the likelihood with which stimuli are decoded, but we focus here on maximum likelihood mutual information measures I_ml, see above). For a given value of percentage correct f_corr, the mutual information, which can be written

(A2.40) $I = \sum_{s,s'} P(s,s') \log_2 \dfrac{P(s,s')}{P(s)\,P(s')}$

can range from a minimum to a maximum value. The minimum value I_min is attained when wrong decodings are distributed equally among all stimuli except the correct one, thus providing maximum entropy to the distribution of wrong decodings, and in this sense interpreting all stimuli as being equally similar or dissimilar from each other. The maximum transinformation value I_max, in contrast, is attained when all wrong decodings are concentrated, with minimal entropy, on a subset of stimuli which thus comprise a neighbourhood of the correct stimulus, while all remaining stimuli are then more distant, or dissimilar, from the correct one. The position of the actual value found for the mutual information within this range can be parametrized by the metric content

(A2.41) $\lambda_m = \dfrac{I - I_{\min}}{I_{\max} - I_{\min}}$

which measures the extent to which relations of similarity or dissimilarity, averaged across the stimulus set, are relevant in generating the distribution of wrong decodings found in the analysis. λ_m thus ranges from 0 to 1, and represents a global measure of the entropy in decoding, which can be extracted whatever the value found for f_corr. Of particular interest are the values of metric content found for the representations of the same set of stimuli by different cortical areas, as they indicate, beyond the accuracy with which stimuli are represented (measured by f_corr), the extent to which the representation has a tight or instead a loose metric structure. For example, a comparison of the representations of spatial views by hippocampal and parahippocampal cells indicates more metric content for the latter, consistent with a more semantic, possibly less episodic encoding of space ‘out there’ by the neocortical cells in comparison with their hippocampal counterparts (Treves, Panzeri, et al., 1996).
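
The following sketch shows how a metric content could be obtained from a table of decoding outcomes. It is an illustration under stated assumptions rather than the procedure used in the cited studies: the function names are hypothetical, stimuli are taken to be equiprobable, and the closed forms used for the two bounds follow from that assumption (errors spread evenly over the S − 1 wrong stimuli for the minimum; decodings spread evenly over a neighbourhood of 1/f_corr stimuli for the maximum).

```python
import math


def mutual_information(joint):
    """Mutual information in bits from joint[s][d] = P(actual s, decoded d)."""
    p_actual = [sum(row) for row in joint]
    p_decoded = [sum(col) for col in zip(*joint)]
    info = 0.0
    for s, row in enumerate(joint):
        for d, p in enumerate(row):
            if p > 0.0:
                info += p * math.log2(p / (p_actual[s] * p_decoded[d]))
    return info


def information_bounds(f_corr, n_stimuli):
    """(I_min, I_max) at fixed fraction correct, assuming equiprobable stimuli."""
    log_s = math.log2(n_stimuli)
    i_max = log_s + math.log2(f_corr)
    i_min = log_s + f_corr * math.log2(f_corr)
    if f_corr < 1.0:
        i_min += (1.0 - f_corr) * math.log2((1.0 - f_corr) / (n_stimuli - 1))
    return i_min, i_max


def metric_content(joint):
    """Position of the measured information between I_min and I_max."""
    f_corr = sum(joint[s][s] for s in range(len(joint)))
    i_min, i_max = information_bounds(f_corr, len(joint))
    if i_max <= i_min:
        raise ValueError("bounds coincide (f_corr at chance or at 1)")
    return (mutual_information(joint) - i_min) / (i_max - i_min)


# Four equiprobable stimuli decoded correctly half of the time.
# Errors confined to one 'neighbour' (stimuli paired as 0-1 and 2-3):
clustered = [[0.125, 0.125, 0.0, 0.0], [0.125, 0.125, 0.0, 0.0],
             [0.0, 0.0, 0.125, 0.125], [0.0, 0.0, 0.125, 0.125]]
# Errors spread evenly over the three wrong stimuli:
e = 0.125 / 3
uniform = [[0.125 if s == d else e for d in range(4)] for s in range(4)]
print(metric_content(clustered), metric_content(uniform))
```

The two tables have the same fraction correct but sit at opposite ends of the range, which is the distinction the metric content is meant to capture.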

Further descriptors of the detailed structure of the representations embodied in neuronal activity can be derived from the analysis of the decoding afforded by population responses (Treves, 1997).

Information theory terms—a short glossary

  1. The amount of information, or surprise, in the occurrence of an event (or symbol) s_i of probability P(s_i) is

    $I(s_i) = \log_2 \dfrac{1}{P(s_i)} = -\log_2 P(s_i)$

    (the measure is in bits if logs to the base 2 are used). This is also the amount of uncertainty removed by the occurrence of the event.

  2. The average amount of information per source symbol over the whole alphabet (S) of symbols is the entropy,

    $H(S) = \sum_{i} P(s_i) \log_2 \dfrac{1}{P(s_i)}$

    (or a priori entropy).

  3. The probability of the pair of symbols s_i and s_j is denoted P(s_i, s_j), and is P(s_i) P(s_j) only when the two symbols are independent.

  4. Bayes’ theorem (given the output s’, what was the input s?) states that

    $P(s|s') = \dfrac{P(s'|s)\,P(s)}{P(s')}$

    where P(s’|s) is the forward conditional probability (given the input s, what will be the output s’?), and P(s|s’) is the backward conditional probability (given the output s’, what was the input s?).

  5. Mutual information. Prior to reception of s’, the probability of the input symbol s was P(s). This is the a priori probability of s. After reception of s’, the probability that the input symbol was s becomes P(s|s’), the conditional probability that s was sent given that s’ was received. This is the a posteriori probability of s. The difference between the a priori and a posteriori uncertainties measures the gain of information due to the reception of s’. Once averaged across the values of both symbols s and s’ this is the mutual information, or transinformation

    $I(S,S') = \sum_{s,s'} P(s,s') \log_2 \dfrac{P(s|s')}{P(s)}$

    $I(S,S') = H(S) - H(S|S')$

    H(S|S’) is sometimes called the equivocation (of S with respect to S’).
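
The glossary quantities can be illustrated on a small joint probability table. The sketch below is only an example: the function names are hypothetical, and the table describes an assumed binary channel with equiprobable inputs and a 10% chance of error, which does not come from the text.

```python
import math


def surprise(p):
    """Information (bits) in an event of probability p."""
    return -math.log2(p)


def entropy(probs):
    """Average surprise of a distribution, skipping zero-probability symbols."""
    return sum(p * surprise(p) for p in probs if p > 0.0)


def equivocation(joint):
    """H(S|S') from joint[s][s_out] = P(s, s'), via Bayes: P(s|s') = P(s,s')/P(s')."""
    p_out = [sum(col) for col in zip(*joint)]
    h = 0.0
    for row in joint:
        for j, p in enumerate(row):
            if p > 0.0:
                h -= p * math.log2(p / p_out[j])
    return h


def mutual_information(joint):
    """Average over s and s' of log2[P(s|s') / P(s)]."""
    p_in = [sum(row) for row in joint]
    p_out = [sum(col) for col in zip(*joint)]
    info = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:
                info += p * math.log2((p / p_out[j]) / p_in[i])
    return info


# Assumed example: two equiprobable inputs, each flipped with probability 0.1.
joint = [[0.45, 0.05],
         [0.05, 0.45]]
p_in = [sum(row) for row in joint]
print("H(S)    =", entropy(p_in))
print("H(S|S') =", equivocation(joint))
print("I(S,S') =", mutual_information(joint))
```

The printed values satisfy I(S,S’) = H(S) − H(S|S’), the second form of the mutual information given in item 5.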


(1) In technical usage bootstrap procedures utilize random pairings of responses with stimuli with replacement, while shuffling procedures utilize random pairings of responses with stimuli without replacement.