(p.337) A4 Autoassociators
(p.337) A4 Autoassociators
The formal analysis of the operation of autoassociative memories is more developed than for other types of network, which are either simpler (and mathematically not as interesting) to describe quantitatively, or, at the opposite end, tend to be mathematically intractable, at least with sufficient generality. Autoassociative memories, instead, combine conceptual simplicity with a requirement for some sophisticated mathematics, and have therefore provided material for a rather large amount of formal, analytical work. At a deeper level, some intriguing analogies with notions arising in statistical physics, such as the reduction of the dynamical behaviour, over long times, to a description of attractor states, have appealed to physicists, who have been able to apply to the study of model networks analytical techniques developed in their field.
A4.1 Autoassociation versus short-term memory: the role of feedback
As described in Chapter 3, the prototypical autoassociative memory network is based on recurrent collaterals, and thus its neurons can sustainedly activate each other even after the termination of any external input. The network can thus serve as a short-term memory, where the attribute of short-term implies, more than a definite time-scale, precisely the fact that it is implemented as the sustained activation of one specific neuronal firing pattern. This is in addition to the network functioning as a long-term memory (for many patterns simultaneously), implemented via specific modifications of a matrix of synaptic efficacies.
There are, however, several different possible architectures for an autoassociative memory, and not all, or even most of them, can operate as short-term memories. What is required, to produce sustained activation, is obviously a substantial degree of internal feedback, and one element that differentiates between autoassociative networks is therefore the relative amount of feedback.
We have suggested that many, of all possible autoassociative architectures, can be synthetically described as points on a two-dimensional continuum (Treves and Rolls, 1991; Fig. A4.1). In the lower part of Fig. A4.1, one dimension (vertical) corresponds to the degree to which there exists a preferred direction for information flow within the network, from none at the top to fully directional at the bottom. The second dimension (horizontal in the lower part of Fig. A4.1) corresponds to the degree to which synaptic connections exist, among all possible pairs of cells in the network. This is referred to as the degree of ‘dilution’ and ranges from the full connectivity of the prototypical architecture (Ml) to the extremely (p.338) sparse one (M3). One may realize that the existence of a preferred direction (in the connectivity, without regard to the values of the synaptic weights) in itself implies that not quite all the possible connections are present; and also that in the limit of extreme dilution, the extent to which a preferred direction is imposed a priori becomes less relevant, since the dilution already limits the possible paths for information flow. Hence, the continuum is depicted in the lower part of Fig. A4.1 as a triangle. The vertices of the triangle are three limiting (extreme) cases, which have been studied extensively in the literature, and which we refer to as the prototypical architecture (Ml), the fully connected but layered architecture (M2), and the one with extremely sparse connectivity (M3). These are, again, conceptual extremes and any biologically relevant architecture is likely to fall somewhere in the middle. At the latter two vertices, and on the line joining them, there is no feedback, that is they are purely feedforward networks. Some feedback is present as soon as there are loops whereby a cell can indirectly affect itself, that is anywhere in the inside of the triangle (in the lower part of Fig. A4.1). Substantial feedback, sufficient for sustained activation, is only present towards the top left corner.
Two other effects are caused by feedback, beyond its potential for subserving short-term memory, and in addition a third effect is related in part to feedback. The first effect is that it makes a substantially more complex analysis required, in order to understand the operation of the network quantitatively. Any degree of feedback is enough to produce this complication. The second effect is that interference effects among concurrently stored memories are more serious, increasingly with more feedback but decreasingly as the coding becomes sparser. The third effect, related only in part to feedback, is that the learning phase is much simpler to implement when all units in the network receive external inputs setting, at least in broad terms, a firing level for them to encode in the memory. Recapitulating, feedback
(1)makes short-term memory possible, but only if it is dominant;
(2)helps make learning easy, with the appropriate afferent connectivity;
(3)makes interference, or cross-talk among memories, more serious;
(4)requires more sophisticated formal analyses.
Effect (1) is self-evident. Effect (2) is intuitive, but has not been properly quantified in general. Effect (3) has been quantified, and will be illustrated at the end of this appendix. Effect (4) is addressed here in the following.
A4.2 Feedback requires the use of self-consistent statistics
Analysing the retrieval operation performed by a feedforward memory model amounts to nothing more than elementary statistics. The system is fully described by equations of the type
and to find the output firing rates one must simply start from the input rates, work through the intermediate units (if any), and get to the output by successive multiplications, summations, and applications of the single-unit transform, or activation function, f (h). The need for statistics is just in the fact that usually one is not interested in the output given an input exactly specified in each incoming firing rate (as was the case with the illustratory examples of Chapter 2), but rather in the likely output given some more generic specification of the input. For example, the input may be specified only in its correlation or overlap with one or several input firing patterns previously learned by the net. These correlations or overlaps act as ‘macroscopic’ parameters, while ‘microscopic’ details (individual input firing rates) are left free to vary, provided they are consistent with the specified parameters. Doing statistics means, then, averaging over those unspecified details. Also on the output side, global parameters, such as the overlap of the output rates with a previously learned pattern, are usually more interesting than individual rates, and to find them one needs only to carry out the corresponding sum.
(p.340) When feedback loops are present, some of the units whose rates are given as the left side of Eq. A4.1 are the same as some whose rates appear on the right side of Eq. A4.2. At steady state, their rates are also the same. (We have mentioned in Chapter 3 why the operation of autoassociators is best considered at steady state, which, if the system has functioned properly, is just a recall state.) Therefore, the same variables appear on both sides of a system of equations, and as a consequence the system has to be solved, not just read out from right to left as before. One has to find a solution, that is a set of rates that are mutually consistent with each other (and with the true inputs, those which are never on the left side). Again, one normally wants answers for a generic specification of macroscopic input parameters, and also in terms of macroscopic output parameters. An example of a macroscopic input parameter that one may want to specify is the correlation of the cue, implemented by the firing r’ of those axons that are true input lines, and not recurrent ones, with a specific stored pattern. We may denote such input parameters symbolically as IP(r’), since they depend on the input rates r’, and the specific values they are required to take as IP . An example, instead, of a macroscopic output parameter, which characterizes the steady recall state reached by the network, is the final correlation with that same stored pattern that the cue was correlated with. We may denote such output parameters symbolically as OP(r), since they are functions of the output firing rates r, and the values they end up taking as OP .
The way, in the formal analysis, that one obtains the solutions is quite brutal: one sums up (integrates) all the possible values of each firing rate (times all the units in the system), and also over all values of the global output parameters, and one just throws away any set of rates that does not satisfy the following types of constraints:
(2)that the input parameters take the required values, that is IP(r’) = IP; and also one throws away inconsistent values for output parameters, that is enforces
(3)that output parameters take the values corresponding to solutions, OP = OP(r).
The constraints are enforced by inserting in the overall integration δ-functions:(A4.3)
where the δ-functions are by definition zero when their argument differs from zero, but yield one when integrated over values that allow the argument to be zero. This is the basic step in (p.341) performing self-consistent statistics; the remaining steps are largely complicated technicalities. For example, usually the next step is to write the 5-functions, in turn, as integrals over some new variables, for example
using a standard procedure of mathematical physics. Another usual ingredient of the analysis is the use of saddle-point approximations. Whereas all values of microscopic variables have the potential for appearing in a solution, and therefore have to be duly integrated over, generally values of the macroscopic ones, that correspond to solutions, are all tightly concentrated around one or more points (the saddle-points in the complex plane) which give the dominating contribution to the integrals over OP . Those integrals are then not carried out, in practice, but estimated with the value of the integrand at the saddle point(s), using again a standard technique. Details of these procedures can be learned from the Amit (1989) and Hertz, Krogh and Palmer (1991) books, or, even better, by following the calculations in the original articles, such as the seminal one by Amit, Gutfreund and Sompolinsky (1987). A collection of important papers on the physics of disordered systems, which includes some on neural networks and is complete with explicatory text by the authors, is the book by Mezard, Parisi and Virasoro (1987).
A4.3 Averaging over fixed parameters and the replica trick
Apart from the need to check the stability of the solutions found, the above would more or less describe the methods required, were it not for the need to average over microscopic parameters, whose values are fixed but unspecified. In autoassociative memories such parameters are the input firing rates r’, if they are sustained during the retrieval operation, but also the values of the synaptic weights w. The latter are in turn taken to be determined by the firing rates produced in the cells during the learning of each memory pattern. One may give some statistics of these learned firing patterns; for example, if they are binary, the probability that each unit is ‘on’ in a pattern. Much more than that, it is in general not possible to specify. Even if it were possible, one would be in a situation in which the formal analysis would yield answers that are not valid in general, for a class of networks described by certain overall parameters, but are only valid for a specific network, which has gone through its own specific history of learning, and is therefore loaded with its own ‘baggage’ of microscopic parameters. A new calculation would have to be performed for each individual network, much as, in the minuscule-size examples of Chapters 2 and 3, one has to calculate the properties for each example separately, in order to yield numbers.
Fortunately, what saves the day for analytical calculations, and allows them to be carried out only once (or rather, once for each type of network, instead of once for each individual network) is that some of the quantitative properties of systems characterized by a large number of unspecified parameters are known to self-average. This term means that the values attained by those quantities, if the system is sufficiently large, coincide with their average over many equal systems, that differ only in the unspecified details. Therefore, the answer produced by a calculation, which includes an average over those details, is not only an (p.342) average answer for the type of systems (as such it would be of limited use), but it is the answer for each system, provided each system is large enough (in what is referred in the physicists’ jargon as the thermodynamic limit). Not all quantities self-average, however; in general, those that are expected to self-average are extensive quantities, that is quantities whose magnitude is proportional to the size of the system.
A4.3.1 Free energy and mutual information
Some models of autoassociative memory networks (those of the class introduced by Hopfield (1982), and which elicited the interest of many statistical physicists) are designed to operate, at retrieval, in close analogy with physical systems that reach thermal equilibrium. In particular, their synaptic connections are chosen to be always reciprocal, and with equal weights, which results in the possibility to define an energy function that is minimal when the system has reached equilibrium. To complete the correspondence, such model networks can be considered to operate under the influence of a temperature, T, which represents a special way to account for some noise in the input-output transform operated by each unit. In such networks, extensive quantities are the value E taken by the energy function, and the so called free-energy, F, which is defined for each equilibrium state
In this formula, the fact that noise is present (T not zero) implies that several (for low noise very similar) configurations of the system concur in the equilibrium state (a configuration is the set of values taken by all the variables in the system, in the case of the network by the firing rates); therefore, the energy function needs to be averaged over those configurations, and to obtain the free-energy one has to further subtract the temperature times the entropy S of the state, which is roughly the logarithm of the number of configurations that contribute to it. We refer for example to the Amit (1989) book for a thorough discussion of the analogy between certain network models and physical systems. The only point to be made here is that the retrieval operation in such models can be described using thermodynamic notions, and that the free-energy is the important quantity to consider, not only because thermodynamic theory shows that many other interesting quantities can be obtained by taking derivatives of the free-energy, but also because it is an extensive quantity (like, in fact, the energy and the entropy) that, as such, is expected to self-average.
Another quantity that is extensive, and therefore should self-average, is mutual information, for example in an autoassociative network the mutual information between the current firing pattern and some previously stored pattern that the net is retrieving. Mutual information is also related to entropy, in fact it can be written simply as an entropy difference. Some type of mutual information can usually be defined for any associative network model, independently of whether the analogy with physical systems holds, and moreover, mutual information is, more directly than free-energy, a measure of the performance of the network as an information processing system.
In any case, what makes both free-energy (when definable) and mutual information extensive quantities also makes them depend logarithmically on probabilities. For mutual (p.343) information this can be immediately traced to its being an entropy difference. To see that the same applies to the free-energy, which is an energy minus an entropy, one has to remember that the probabilities of different configurations in an equilibrium state are exponential in their energies, and therefore the energy, and the free-energy, are logarithmic in the probabilities.
A4.3.2 Averaging the logarithms
The bottom line is that in systems with ’microscopic’ unspecified parameters one wants to calculate self-averaging quantities, and that these are generally logarithmic functions of probabilities. Now, averaging over probabilities is usually something that can be done directly, whereas averaging over logarithms of probabilities cannot be done directly. One uses therefore the replica trick, which amounts to writing the log as the limit of a power
where the power of the probability implies that one must fictitiously consider n identical copies (’replicas’) of the system, calculate the required probabilities, average them over microscopic details, and, at the end, take the limit in which the number of replicas tends to zero! The replica trick was originally introduced in physics as an analytical expedient, but it yields answers that appear to be correct, and may in some cases be confirmed with independent analytical methods (and with computer simulations). Many technicalities are involved in the use of the replica trick, which would take us beyond the scope of this book (see Mezard et al., 1987; Amit, 1989; Hertz et al., 1991).
A last remark on the analogy between models of associative memory networks and certain physical systems is the special role, among the latter, historically played by spin-glasses. A spin-glass is a system of magnetic moments that interact with each other with coupling energies that are positive or negative as a complicated function of the exact location of each pair of spins. As a result, for all practical purposes each coupling can be taken to be random (but fixed in value) and the system is said to be disordered. As such it must be described in generic terms, that is in terms of macroscopic parameters and probability distributions for microscopic ones, leaving out what we called unspecified details (the value of each and every coupling). Spin-glasses, because of these characteristics, resisted a theoretical understanding based on ordinary notions and techniques of thermodynamics, and stimulated the refinement of new concepts (and techniques) that, following the Hopfield (1982) paper, have been found applicable to associative networks models as well. The mathematics of spin-glasses has thus proven very useful for developing formal analyses of autoassociative nets. Their physics, however, that is the peculiar properties of equilibrium spin-glass states at low temperature, which initially had seemed to be relevant to neuronal nets too (Amit, Gutfreund and Sompolinsky, 1987; Crisanti, Amit and Gutfreund, 1986), has turned out to be less relevant once elements of realism such as graded response are introduced in the models previously strictly adherent to the physical analogy (Treves, 1991a,b). An example is what is referred to as the peculiar way in which spin-glasses freeze into one of several disordered states, unrelated by any symmetry, due to the frustration in their couplings (Toulouse, 1977). This was thought (p.344) to apply also to autoassociative nets, that would, whenever failing to reach a recall state, get stuck, as it were, into a meaningless spin-glass state, in which each unit would carry on firing at its own steady-state firing rate, but the overall firing pattern would not match any stored pattern. This indeed occurs, but only in autoassociators that are fairly similar to the original Hopfield model, with binary units which have both excitatory and inhibitory influences on other units. In autoassociators made up of graded-response excitatory units, typically the network, whenever it fails to retrieve a stored pattern, falls into a unique uniform state, in which no firing pattern is discernible (Treves, 1990, 1991a,b).
A4.4 Results from the analysis: storage capacity and information content
Several quantitative relationships can be extracted from the analysis of any particular model of autoassociator. Of these, three represent fundamental properties that characterize the performance of the network, and turn out to be relatively independent of the exact model of autoassociator analysed. One of these three basic properties is the time required for the operation of the network. Its quantitative analysis requires, however, a discussion of the appropriate level of realism at which to model individual processing units, and is the subject of Appendix A5. The other basic properties are the number of memories that can be stored in the network, and the amount of information each of them contains.
We show here graphs that describe the results of our analyses (Treves and Rolls, 1991) based on the use of threshold linear units as models of single neurons, and in which a number of other minor specifications of the models have been made with biological realism in mind. These specifications concern, for example, the way the thresholds are set, uniform across units, or the way the learning of different patterns produces purely superimposed modifications on a common baseline value for synaptic weights. Very similar results, qualitatively identical, were obtained with earlier analyses based on binary units (Amit, Gutfreund and Sompolinsky, 1987; Derrida, Gardner and Zippelius, 1987; Tsodyks and Feigel’man, 1988, Buhmann, Divko and Schulten, 1989; Evans, 1989). Minor differences in the specifications do produce minor differences in the results. For example, optimizing the threshold independently for each unit results in some improvements over what can be achieved with uniform thresholds, equal for all the units (Perez-Vicente and Amit, 1989). Even relatively major differences, such as constraining synaptic efficacies to take only two values, as in the Willshaw model of a pattern associator (Willshaw et al., 1969), yield much smaller differences in performance than might have been expected, once the appropriate comparisons are made (Nadal and Toulouse, 1990; Nadal, 1991). The deep reason for this relative insensitivity to exact specifications appears to be that in all the models the amount of information each synapse can store and contribute at retrieval is nearly the same, a fraction of a bit, and largely independent of the range of values the synaptic efficacy can take, continuous or discrete or binary. This aspect will be considered again in the following.
A4.4.1 The number of memories that can be stored
For partly historical reasons, this has traditionally been the main aim of physicists’ analyses, the target that could not be reached with connectionist approaches based on the use of (p.345) examples (McClelland and Rumelhart, 1986) and even not quite with quantitative approaches based on less suitable mathematics (Weisbuch and Fogelman-Soulie, 1985; McEliece, Posner, Rodemich and Venkatesh, 1987). It is referred to as the storage capacity calculation, although the information content of the memories (see below) is also an element to be taken into account in discussing the capacity of the network—sometimes referred to as the information capacity.
The number of memories that can be simultaneously in storage, and that can be individually retrieved from the autoassociative net, is usually calculated in the situation, discussed in Appendix A3, in which all memories are discrete representations, independent of each other, and each unit codes for independent information, that is its firing rate is statistically independent from that of other units, in every aspect except for the correlation induced by the memory firing patterns themselves. These conditions simplify the calculation, and also fill the need, in the absence of more structured a priori hypotheses, for a complete definition of the model. Exceptions that have been considered include the situations in which memories are organized into classes (Dotsenko, 1985; Parga and Virasoro, 1986), or in which they are correlated spatially or temporally (Monasson, 1992; Tarkowski and Loewenstein 1993) or, but then one deals with something different from an autoassociator, are organized into temporal sequences (see Chapter 3). In addition, (a) the firing of each unit across memory patterns is taken to follow the same statistics, that is it is described by the same probabilities or probability distributions for different firing levels; and (b) each memory pattern is taken to be encoded with the same mean strength. These are just convenient assumptions, which could be lifted without effort and substituted with (a) several different probability distributions for different cells, or even with a probability distribution of probability distributions; and (b) a distribution of coding strengths or, as one does to model the effects of forgetting, with coding strengths that decay in time.
As for pattern associators, if one demands absolute or nearly absolute fidelity in the retrieval of a previously stored memory pattern, the number of patterns that can be retrieved is a function of the fidelity required. Absolute fidelity is in any case an undefined concept in the case of continuously graded firing rates, and even for, for example, binary rates it is defined only with certain qualifications. Contrary to pattern associators, however, in autoassociators there exists a critical loading level, that is a critical number of memories in storage, beyond which no memory can be retrieved at all, in the sense that the retrieval operation, instead of completing the information present in the cue, erases it down to zero. Thus one can define a storage capacity in terms of this critical load. Of course it remains to be analysed how well completion operates, if it does, when the network is loaded close to the critical level. This second aspect will be discussed in the following subsection.
Within these conditions, the most important factor that determines the storage capacity is the sparseness a of the distribution of firing rates in each memory pattern, as defined in Appendix A3. Recall that for binary firing patterns, in which one of the two levels corresponds to no firing, the sparseness is just the fraction of the distribution at the other level, that is the firing one; whereas in general, if 〈…〉 denotes averages over the distribution of rates, the sparseness was defined as
(p.346) that is simply in terms of the first two moments of the distribution. Sparser encodings lead to higher capacity. The reason can be approximately understood by considering again the signal-to-noise analysis sketched in Appendix A3 for pattern associators, which can be reproduced here for autoassociators: In conditions of efficient storage, that is when the learning rule is such as to make the mean of the presynaptic G factor vanish, the (mean square) noise is proportional to a power of the sparseness higher by one than the (mean square) signal (the precise powers depend on the normalization). This can be seen by substituting in Eqs. A3.11 and A3.12 for example F(r) = r and G(r) = r- 〈r〉. Apart from constant factors, the mean square signal S2 is then proportional to C 2 〈 r 2 2(1 -a)2, while the mean square noise N2 is proportional to pC〈r 2 〉 3(1 -a)∼apS2/C. Therefore, to maintain a constant signal-to-noise ratio the number of memories p has to be of the order of the number of inputs per cell C divided by the sparseness a. The full analysis yields
in the limit in which a≪ 1 (Treves and Rolls, 1991).
The analysis based on the signal-to-noise ratio can be refined and used for architectures with no feedback, in particular the M2 and M3 models of Fig. A4.1. When feedback is present, a self-consistent analysis is required, as discussed above. This could be theoretically obtained for any architecture that includes feedback (see the analysis developed by Fukai and Shiino, 1992), but in practice it has been done extensively only in cases in which one can use the thermodynamics formalism, that is when intrinsic connections are always reciprocal and of equal strength (the so-called symmetric networks). Using a common notation one can see, however, that the thermodynamics formalism also reduces to similar equations yielding the storage capacity, with differences in some terms that express the effect of reverberating noise through feedback loops (Treves and Rolls, 1991). Formula Eq. A4.8 holds for all the architectures in the two-dimensional continuum introduced above, in the sparse coding limit. In this limit the architectures do not differ in the number of patterns they can store; whereas differences, which can be large, appear if the coding is not sparse. The reason is that the reverberation of the noise is negligible for very sparse coding, while it increases interference and therefore lowers the critical value of p/C for non-sparse coding. When a ∼ 1, architecture Ml has a substantially lower capacity than M2, and this in turn is lower than M3, with intermediate cases falling in the middle (Amit, Gutfreund and Sompolinsky, 1987; Derrida, Gardner and Zippelius, 1987; Domany et al., 1989; Treves and Rolls, 1991). A plot of the critical values of p/C versus a for the extreme cases Ml and M3 is given in Fig. A4.2. The common nearly inverse proportionality for a ∼ 0 is evident, along with the marked differences for larger values of a. It should also be noted that for very distributed codes, not only the exact type of connectivity, but also the exact type of pattern statistics strongly influences the storage capacity. As an example of ‘realistic’ statistics, we report in the graph the result obtained for an exponential distribution of firing rates, as shown in Fig. A3.1. However, probability distributions with a ∼ 0.7, a single peak around the spontaneous firing rate, and a long nearly exponential tail, which are common in the primate temporal cortex (Rolls and Tovee, 1995a; Rolls, Treves, Tovee and Panzeri, 1997) cannot quite be modelled by the purely exponential form of Eq. A3.10 (the one used in these graphs), which is peaked at (p.347) strictly zero rate, and which has a maximal possible a = 0.5. Also, the exact way in which the threshold is set would be important in this situation. One can easily insert any specification of the statistics, of the threshold setting, etc., in the formulas determining the capacity (see Treves and Rolls, 1991), but the main point here is that for non-sparse codes only a relatively limited number of memories can be stored in an autoassociator. Therefore, an exact calculation of the capacity for a non-sparse coding situation is likely to be not very relevant, since one is considering a network unlikely to be optimized in terms of the number of patterns it could store.
It should be noted that, while the calculations above were carried out using the independent pattern model, essentially the same results are expected to apply to models in which the representations stored are not independent firing patterns, but are instead organized in multiple spatial maps, or charts (Samsonovich and McNaughton, 1996). Work now in progress (Battaglia and Treves, unpublished) confirms that the maximum number of such charts that can be stored remains proportional to C and inversely proportional to the appropriate parameter defining the sparseness of each chart.
A4.4.2 Information content of each memory, and information capacity
The notion of the information in a representation has been introduced in Appendix A2, and discussed again in Appendix A3. In this appendix on autoassociators, in discussing the information in a memory pattern, we assume, without further justification, that different cells (p.348) code independent aspects of the representation, and the information they provide just adds up. Thus the total information is proportional to the number of cells present, I = Ni, and what is considered is the proportionality constant i. For a binary unit, the information it contributes on average to each representation is just a function of the entropy of the states the unit can take, i = alog2(1/a) + (1 -a) log2[l/(l -a)]. The information retrieved is not identical to this quantity, as it depends also on the imprecision in the retrieval. If, on average, retrieval results in a fraction ab of bits that should be ‘on’ and instead are ‘off’, and in a fraction (1 -a)c of bits that should be ‘off’ and instead are ‘on’, the information retrieved is (see for example Nadal and Toulouse, 1990)
which coincides with the information stored if b = c = 0. In the more realistic situation of continuously-valued units, the information stored in the representation cannot in general even be defined, as it is strictly infinite. The value obtained after subtracting the average information lost in retrieval is, however, finite, and it is just the mutual information between the distribution of stored rates and that of firing rates at retrieval
The amount of information retrieved for each memory representation depends on the storage load, but what turns out to be the case for networks with a substantial load is that this amount, per cell, when graded response units are considered, is again in the order of what would be provided by binary units, or approximately
which is the trend plotted in Fig. A4.3 as a function of a. The reason lies in the fact that the total information retrievable from the net is relatively insensitive to the type of units considered, and amounts to 0.2–0.3 bits per synapse. The total information retrievable, without considering how much information is provided with the retrieval cue, is just what can be retrieved of each pattern times the number of patterns stored simultaneously,
and its maximal value per synapse (that is, divided by CN) is an indication of the efficiency, in information terms, of the network as a storage device. One can write ir = (𝛪/CN)/(p/C), and use Eq. A4.8 together with the approximately constant value for Im (the maximum value of 𝛪/CN) to derive Eq. A4.11, apart from a numerical factor which turns out to be of order unity.
Again, the important point is that from all quantitative analyses performed so far on autoassociators that function on the same basic lines we have been discussing, the total information per synapse has always resulted to be, at its maximum, a similar fraction (0.2–0.3) of a bit. It should be noted, though, that it is only for binary-valued synapses that a bit is a self-evident upper limit; apart from binary synapses, which are of very remote interest in the context of the brain, it is only for binary-valued units (again, a rather artificial situation) that an upper bound (of 2 bits per synapse) can in some sense be derived using the Gardner (1987) approach. In this situation, therefore, simple binary models that are not in direct correspondence with biological networks have been used to demonstrate with mathematical rigour quantitative properties that have been later, and with much less rigour, found to apply also to more realistic models.
(p.350) Generally, across models, as the storage load p increases, 𝛪 increases initially proportionally to p, but eventually saturates and then goes down, as more load results in more interference, less precision at retrieval, and a greater fraction of the stored information i lost in the retrieved information ir. The maximum for 𝛪 is generally attained when the number of representations stored is of the order, but not very close to, the critical load value corresponding to the storage capacity. This is true independently of the sparseness.
In summary, therefore, the information retrieved, even when representations are encoded in terms of continuous variables, is roughly the same as in the case of binary variables. This is illustrated in Fig. A4.4, in which Im is shown as a function of the sparseness a. It is evident how the effects of sparse coding on the critical value p c for p, Fig. A4.2 and Eq. A4.8, and the effects of sparse coding on ir, described by Fig. A4.3 and 4.11, nearly cancel out in the what is essentially their product, the total information encoded by the net (Fig. A4.4; Treves and Rolls, 1991).