Appendix 1 Neural networks and emotion-related learning
A.1 Neurons in the brain, the representation of information, and neuronal learning mechanisms
In Chapters 3 and 4, the type of learning that is very important in learned emotional responses was characterized as stimulus–reinforcer association learning. This is a particular case of pattern-association learning, in which the to-be-associated or conditioned stimulus is the potential secondary reinforcer, and the unconditioned stimulus is the primary reinforcer (see Fig. 4.5). In Chapter 4, it was indicated that many of the properties required of emotional learning (e.g. generalization and graceful degradation) arise in pattern associators if the correct type of distributed representation is present (Sections 4.4.1 and 4.4). In this Appendix the relevant properties of biologically plausible pattern-association memories (such as may be present in the orbitofrontal cortex and amygdala and used for stimulus–reinforcer association learning) are presented more formally, to provide a foundation for future research into the neural basis of emotional learning. In Section A.3 an introduction to autoassociation or attractor networks is given, as this type of network may be relevant to understanding how mood states are maintained. In Section A.4 an introduction to how attractor networks can interact is given, as this may be relevant to understanding how mood states influence cognitive processing, and vice versa. A fuller analysis of these neural networks and their properties, and of other neural networks which, for example, build representations of sensory stimuli, is provided by Rolls and Treves (1998) in Neural Networks and Brain Function and by Rolls and Deco (2002) in Computational Neuroscience of Vision.
Before starting the description of pattern-association neuronal networks, a brief review is provided of the evidence on synaptic plasticity, and of the rules by which synaptic strength is modified, much of it based on studies of long-term potentiation.
After describing pattern-association and autoassociation neural networks, an overview of another learning algorithm, called reinforcement learning, which might be relevant to learning in systems that receive rewards and punishers and which has been supposed to be implemented using the dopamine pathways (Barto 1995, Schultz et al. 1995b, Houk et al. 1995), is provided in Section A.5.
A.1.2 Neurons in the brain, and their representation in neuronal networks
Neurons in the vertebrate brain typically have, extending from the cell body, large dendrites which receive inputs from other neurons through connections called synapses.
Once firing is initiated in the cell body (or axon initial segment of the cell body), the action potential is conducted in an all-or-nothing way to reach the synaptic terminals of the neuron, whence it may affect other neurons. Any inputs the neuron receives that cause it to become hyperpolarized make it less likely to fire (because the membrane potential is moved away from the critical threshold at which an action potential is initiated), and are described as inhibitory. The neuron can thus be thought of in a simple way as a computational element that sums its inputs within its time constant and, whenever this sum, minus any inhibitory effects, exceeds a threshold, produces an action potential that propagates to all of its outputs. This simple idea is incorporated in many neuronal network models using a formalism of a type described in the next Section.
A.1.3 A formalism for approaching the operation of single neurons in a network
Let us consider a neuron i as shown in Fig. A.2, which receives inputs from axons that we label j through synapses of strength wij. The first subscript (i) refers to the receiving neuron, and the second subscript (j) to the particular input. j counts from 1 to C, where C is the number of synapses or connections received by the neuron. The firing rate of the ith neuron is denoted as yi, and that of the jth input to the neuron as xj. To express the idea that the neuron makes a simple linear summation of the inputs it receives, we can write the activation of neuron i, denoted hi, as

hi = Σj xj wij,     (A.1)

where Σj indicates the sum over the C inputs indexed by j.
A property implied by equation A.1 is that the postsynaptic membrane is electrically short, and so summates its inputs irrespective of where on the dendrite the input is received. In real neurons, the transduction of current into firing frequency (the analogue of the transfer function of equation A.2) is generally studied not with synaptic inputs but by applying a steady current through an electrode into the soma. Examples of the resulting curves, which illustrate the additional phenomenon of firing rate adaptation, are reproduced in Fig. A.4.
A.1.4 Synaptic modification
For a neuronal network to perform useful computation, that is to produce a given output when it receives a particular input, the synaptic weights must be set up appropriately. This is often performed by synaptic modification occurring during learning.
A simple learning rule that was originally presaged by Donald Hebb (1949) proposes that synapses increase in strength when there is conjunctive presynaptic and postsynaptic activity. The Hebb rule can be expressed more formally as

δwij = α yi xj,

where δwij is the change of the synaptic weight wij, yi is the postsynaptic firing, xj the presynaptic firing, and α a learning-rate constant.
The Hebb rule is expressed in this multiplicative form to reflect the idea that both presynaptic and postsynaptic activity must be present for the synapses to increase in strength. The multiplicative form also reflects the idea that strong pre- and postsynaptic firing will produce a larger change of synaptic weight than smaller firing rates. The Hebb rule thus captures what is typically found in studies of associative Long-Term Potentiation (LTP) in the brain, described in Section A.1.5.
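The multiplicative rule can be sketched in a few lines of code; the function name and the learning-rate value are illustrative choices, not taken from the text:

```python
# Minimal sketch of the multiplicative Hebb rule: dw_ij = alpha * y_i * x_j.
# alpha (the learning rate) and all values here are illustrative.

def hebb_update(weights, x, y, alpha=0.1):
    """Return a new weight matrix after one Hebbian step.

    weights[i][j] is the synapse from input axon j onto neuron i,
    x[j] is the presynaptic firing rate, y[i] the postsynaptic rate.
    """
    return [[w_ij + alpha * y_i * x_j for w_ij, x_j in zip(row, x)]
            for row, y_i in zip(weights, y)]

# Both pre- and postsynaptic activity must be nonzero for a weight to change:
w = hebb_update([[0.0, 0.0], [0.0, 0.0]], x=[1.0, 0.0], y=[0.0, 1.0])
# only the synapse with active pre (axon 0) and active post (neuron 1) grows
assert w == [[0.0, 0.0], [0.1, 0.0]]
```

Note how the multiplication enforces conjunction: if either yi or xj is zero, the product, and hence the weight change, is zero.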
One useful property of large neurons in the brain, such as cortical pyramidal cells, is that with their short electrical length, the postsynaptic term, yi, is available on much of the dendrite of a cell. The implication of this is that once sufficient postsynaptic activation has been produced, any active presynaptic terminal on the neuron will show synaptic strengthening. This enables associations between coactive inputs, or correlated activity in input axons, to be learned by neurons using this simple associative learning rule.
A.1.5 Long-Term Potentiation and Long-Term Depression as models of synaptic modification
Long-Term Potentiation (LTP) and Long-Term Depression (LTD) provide useful models of some of the synaptic modifications that occur in the brain. The synaptic changes found appear to be synapse–specific, and to depend on information available locally at the synapse. LTP and LTD may thus provide a good model of the biological synaptic modifications involved in real neuronal network operations in the brain. We next therefore describe some of the properties of LTP and LTD, and evidence that implicates them in learning in at least some brain systems. Even if they turn out not to be the basis for the synaptic modifications that occur during learning, they have many of the properties that would be needed by some of the synaptic modification systems used by the brain.
Long-term potentiation (LTP) is a use-dependent and sustained increase in synaptic strength that can be induced by brief periods of synaptic stimulation. It is usually measured as a sustained increase in the amplitude of electrically evoked responses in specific neural pathways following brief trains of high-frequency stimulation (see Fig. A.5b). For example, high-frequency stimulation of the Schaffer collateral inputs to the hippocampal CA1 cells results in a larger response recorded from the CA1 cells to single test pulse stimulation of the pathway. LTP is long-lasting, in that its effect can be measured for hours in hippocampal slices, and in chronic in vivo experiments in some cases it may last for months. LTP becomes evident rapidly, typically in less than 1 minute. LTP is in some brain systems associative. This is illustrated in Fig. A.5c, in which a weak input to a group of cells (e.g. the commissural input to CA1) does not show LTP unless it is given at the same time as (i.e. associatively with) another input (which could be weak or strong) to the cells. The associativity arises because it is only when sufficient activation of the postsynaptic neuron to exceed the threshold of NMDA receptors (see below) is produced that any learning can occur. The two weak inputs summate to produce sufficient depolarization to exceed the threshold.
These spatiotemporal properties of LTP can be understood in terms of actions of the inputs on the postsynaptic cell, which in the hippocampus has two classes of receptor, NMDA (N-methyl-D-aspartate) and K–Q (kainate–quisqualate), both activated by the glutamate released by the presynaptic terminals. The NMDA receptor channels are normally blocked by Mg2+, but this block is relieved when the cell is strongly depolarized by strong tetanic stimulation of the type necessary to induce LTP.
There are a number of possibilities about what change is triggered by the entry of Ca2+ to the postsynaptic cell to mediate LTP. One possibility is that somehow a messenger reaches the presynaptic terminals from the postsynaptic membrane and, if the terminals are active, causes them to release more transmitter in future whenever they are activated by an action potential. Consistent with this possibility is the observation that, after LTP has been induced, more transmitter appears to be released from the presynaptic endings. Another possibility is that the postsynaptic membrane changes just where Ca2+ has entered, so that K–Q receptors become more responsive to glutamate released in future. Consistent with this possibility is the observation that after LTP, the postsynaptic cell may respond more to locally applied glutamate (using a microiontophoretic technique).
The rule that underlies associative LTP is thus that synapses connecting two neurons become stronger if there is conjunctive presynaptic and (strong) postsynaptic activity. This learning rule for synaptic modification is sometimes called the Hebb rule, after Donald Hebb of McGill University who drew attention to this possibility, and its potential importance in learning (Hebb 1949).
In that LTP is long-lasting, develops rapidly, is synapse-specific, and is in some cases associative, it is of interest as a potential synaptic mechanism underlying some forms of memory. Evidence linking it directly to some forms of learning comes from experiments in which it has been shown that the drug AP5, infused so that it reaches the hippocampus to block NMDA receptors, blocks spatial learning mediated by the hippocampus (see Morris (1989), Martin, Grimwood and Morris (2000)). The task learned by the rats was to find the location relative to cues in a room of a platform submerged in an opaque liquid (milk). Interestingly, if the rats had already learned where the platform was, then the NMDA infusion did not block performance of the task. This is a close parallel to LTP, in that the learning, but not the subsequent expression of what had been learned, was blocked by the NMDA antagonist AP5. Although there is still some uncertainty about the experimental evidence that links LTP to learning (see for example Martin, Grimwood and Morris (2000)), there is a need for a synapse-specific modifiability of synaptic strengths on neurons if neuronal networks are to learn (see Section A.2). If LTP is not always an exact model of the synaptic modification that occurs during learning, then something with many of the properties of LTP is nevertheless needed, and is likely to be present in the brain given the functions known to be implemented in many brain regions (see Rolls and Treves (1998)).
In another test of the role of LTP in memory, Davis (2000) studied the role of the amygdala in learning associations to fear-inducing stimuli. He showed that blockade of NMDA receptors in the amygdala interferes with this type of learning, consistent with the idea that LTP also provides a useful model of this type of learning (see further Chapter 4).
Long-Term Depression (LTD) can also occur. It can in principle be associative or non-associative. In associative LTD, the alteration of synaptic strength depends on the pre- and postsynaptic activities. There are two types. Heterosynaptic LTD occurs when the postsynaptic neuron is strongly activated, and there is low presynaptic activity (see Fig. A.5b input B, and Table A.1). Heterosynaptic LTD is so-called because the synapse that weakens is other than (hetero-) the one through which the postsynaptic neuron is activated. Heterosynaptic LTD is important in associative neuronal networks, and in competitive neuronal networks (see Chapter 7 of Rolls and Deco (2002)). In competitive neural networks it would be helpful if the degree of heterosynaptic LTD depended on the existing strength of the synapse, and there is some evidence that this may be the case (see Chapter 7 of Rolls and Deco (2002)). Homosynaptic LTD occurs when the presynaptic neuron is strongly active, and the postsynaptic neuron has some, but low, activity (see Fig. A.5d and Table A.1). Homosynaptic LTD is so-called because the synapse that weakens is the same as (homo-) the one that is active. Heterosynaptic and homosynaptic LTD are found in the neocortex (Artola and Singer 1993, Singer 1995, Frégnac 1996) and hippocampus (Christie 1996), and in many cases are dependent on activation of NMDA receptors (see also Fazeli and Collingridge (1996)). LTD in the cerebellum is evident as weakening of active parallel fibre to Purkinje cell synapses when the climbing fibre connecting to a Purkinje cell is active (Ito 1984, Ito 1989, Ito 1993b, Ito 1993a).
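One simple way to obtain both heterosynaptic and homosynaptic LTD alongside LTP within a single rule is to subtract thresholds from the pre- and postsynaptic terms of the Hebb rule. The covariance-style rule below, and the threshold values in it, are an illustrative assumption, not the rule given in the text:

```python
# Illustrative modified Hebb rule in which weights can decrease as well as
# increase: dw_ij = alpha * (y_i - theta_y) * (x_j - theta_x).  The thresholds
# theta_x, theta_y (here set to mean-rate-like values) are assumptions.

def covariance_update(w_ij, x_j, y_i, theta_x=0.5, theta_y=0.5, alpha=0.1):
    return w_ij + alpha * (y_i - theta_y) * (x_j - theta_x)

# Heterosynaptic LTD: postsynaptic neuron strongly active, this presynaptic
# axon silent -> the synapse weakens.
assert covariance_update(1.0, x_j=0.0, y_i=1.0) < 1.0
# Homosynaptic LTD: presynaptic axon strongly active, postsynaptic neuron
# only weakly active -> the synapse also weakens.
assert covariance_update(1.0, x_j=1.0, y_i=0.0) < 1.0
# Conjunctive strong pre- and postsynaptic activity -> LTP.
assert covariance_update(1.0, x_j=1.0, y_i=1.0) > 1.0
```

A limitation of this simple covariance form, worth noting, is that it also predicts potentiation when both neurons are silent, so it is a sketch of the sign structure in Table A.1 rather than a complete biological rule.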
An interesting time-dependence of LTP and LTD has been observed, with LTP occurring especially when the presynaptic spikes precede by a few ms the postsynaptic activation, and LTD occurring when the presynaptic spikes follow the postsynaptic activation by a few ms (Markram, Lübke, Frotscher and Sakmann 1997, Bi and Poo 1998). This type of temporally asymmetric Hebbian learning rule, demonstrated in the neocortex and the hippocampus, can induce associations over time, and not just between simultaneous events. Networks of neurons with such synapses can learn sequences (Minai and Levy 1993), enabling them to predict the future state of the postsynaptic neuron based on past experience (Abbott and Blum 1996) (see further Koch (1999), Markram, Pikus, Gupta and Tsodyks (1998) and Abbott and Nelson (2000)). This mechanism, because of its apparent time-specificity for periods in the range of tens of ms, could also encourage neurons to learn to respond to temporally synchronous presynaptic firing (Gerstner, Kreiter, Markram and Herz 1997), and indeed to decrease the synaptic strengths from neurons that fire at random times with respect to the synchronized group. This mechanism might also play a role in the normalization of the strength of synaptic connection strengths onto a neuron. Under the somewhat steady-state conditions of the firing of neurons in the higher parts of the ventral visual system on the 10 ms timescale that are observed not only when single stimuli are presented for 500 ms (see Fig. 4.11), but also when macaques have found a search target and are looking at it (in the experiments described in Section 184.108.40.206), the average presynaptic and postsynaptic firing rates are likely to be the important determinants of synaptic modification.
Part of the reason for this is that correlations between the firing of simultaneously recorded inferior temporal cortex neurons are not common, and if present are not very strong, and are typically restricted to a short time window of the order of 10 ms (see Rolls and Deco (2002), Franco, Rolls, Aggelopoulos and Treves (2004) and Aggelopoulos, Franco and Rolls (2005)). This point is also made in the context that each neuron has thousands of inputs, several tens of which are normally likely to be active when a cell is firing above its spontaneous rate and is strongly depolarized. This may make it statistically unlikely that there will be a strong correlation between a particular presynaptic spike and postsynaptic firing, and thus unlikely that such correlations are a main determinant of synaptic strength under these natural conditions.
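The temporally asymmetric rule described above can be sketched as a weight-change window. The amplitudes and the time constant below (20 ms, in the tens-of-ms range mentioned in the text) are illustrative values, not fitted parameters:

```python
# Sketch of a temporally asymmetric ("spike-timing-dependent") weight-change
# window of the kind reported by Markram et al. (1997) and Bi and Poo (1998).
# a_plus, a_minus and tau_ms are illustrative, not fitted values.
import math

def stdp_dw(dt_ms, a_plus=1.0, a_minus=1.0, tau_ms=20.0):
    """Weight change for a spike-pair interval dt_ms = t_post - t_pre.

    Pre before post (dt_ms > 0) -> potentiation; post before pre -> depression.
    """
    if dt_ms >= 0:
        return a_plus * math.exp(-dt_ms / tau_ms)
    return -a_minus * math.exp(dt_ms / tau_ms)

assert stdp_dw(5.0) > 0                    # pre a few ms before post: LTP
assert stdp_dw(-5.0) < 0                   # pre a few ms after post: LTD
assert stdp_dw(5.0) > stdp_dw(40.0) > 0    # effect decays over tens of ms
```

Averaging this window over spikes arriving at random times relative to postsynaptic firing gives roughly zero net change, which is one way to see why, under the steady-state rate conditions described above, mean rates rather than individual spike pairings would dominate.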
A.1.6 Distributed representations
When considering the operation of many neuronal networks in the brain, it is found that many useful properties arise if each input to the network (arriving on the axons as a firing rate vector x) is encoded in the activity of an ensemble or population of the axons or input lines (distributed encoding), and is not signalled by the activity of a single input, which is called local encoding. We start with some definitions, and then highlight some of the differences, and summarize some evidence that shows the type of encoding used in some brain regions. Then in Section A.2.8 (e.g. Table A.2), we show how many of the useful properties of the neuronal networks described depend on distributed encoding. Rolls and Deco (2002) (in Chapter 5) review evidence on the encoding actually found in visual cortical areas.
A local representation is one in which all the information that a particular stimulus or event occurred is provided by the activity of one of the neurons. In a famous example, a single neuron might be active only if one's grandmother was being seen. An implication is that most neurons in the brain regions where objects or events are represented would fire only very rarely. A problem with this type of encoding is that a new neuron would be needed for every object or event that has to be represented. There are many other disadvantages of this type of encoding, many of which are made apparent in this book. Moreover, there is evidence that objects are represented in the brain by a different type of encoding.
A fully distributed representation is one in which all the information that a particular stimulus or event occurred is provided by the activity of the full set of neurons. If the neurons are binary (e.g. either active or not), the most distributed encoding is when half the neurons are active for any one stimulus or event.
A sparse distributed representation is a distributed representation in which a small proportion of the population of neurons is active at any one time. In a sparse representation with binary neurons, less than half of the neurons are active for any one stimulus or event. For binary neurons, we can use as a measure of the sparseness the proportion of neurons in the active state. For neurons with real, continuously variable, values of firing rates, the sparseness a of the representation can be measured, by extending the binary notion of the proportion of neurons that are firing, as

a = (Σi yi / n)² / (Σi yi² / n),

where yi is the firing rate of the ith neuron in the set of n neurons.
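The sparseness measure just described can be computed directly. This sketch assumes non-negative firing rates with at least one neuron active:

```python
# Sparseness a = (sum_i y_i / n)^2 / (sum_i y_i^2 / n), extending the binary
# "proportion of neurons active" to continuous firing rates.

def sparseness(rates):
    n = len(rates)
    mean_rate = sum(rates) / n
    mean_sq = sum(r * r for r in rates) / n   # assumes some neuron is active
    return (mean_rate ** 2) / mean_sq

# For binary neurons the measure reduces to the proportion of active neurons:
assert sparseness([1, 1, 0, 0]) == 0.5    # half active -> a = 0.5
assert sparseness([1, 0, 0, 0]) == 0.25   # sparser -> smaller a
```

For graded rates the measure behaves similarly: a representation in which a few neurons fire fast and the rest are nearly silent yields a small value of a, whereas roughly equal rates across the population drive a towards 1.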
Coarse coding utilizes overlaps of receptive fields, and can compute positions in the input space using differences between the firing levels of coactive cells (e.g. colour-tuned cones in the retina). The representation implied is very distributed. Fine coding (in which for example a neuron may be ‘tuned’ to the exact orientation and position of a stimulus) implies more local coding.
A.1.6.2 Advantages of different types of coding
One advantage of distributed encoding is that the similarity between two representations can be reflected by the correlation between the two patterns of activity that represent the different stimuli. We have already introduced the idea that the input to a neuron is represented by the activity of its set of input axons xj, where j indexes the axons, numbered from j = 1 to C (see Fig. A.2 and equation A.1). Now the set of activities of the input axons is a vector (a vector is an ordered set of numbers; Appendix 1 of Rolls and Treves (1998) and of Rolls and Deco (2002) provides a summary of some of the concepts involved). We can denote as x1 the vector of axonal activity that represents stimulus 1, and x2 the vector that represents stimulus 2. Then the similarity between the two vectors, and thus the two stimuli, is reflected by the correlation between the two vectors. The correlation will be high if the activity of each axon in the two representations is similar, and will become lower as the activity of more and more of the axons differs between the two representations. Thus the similarity of two inputs can be represented in a graded or continuous way if (this type of) distributed encoding is used. This enables generalization to similar stimuli, or to incomplete versions of a stimulus (if it is for example partly seen or partly remembered), to occur. With a local representation, either one stimulus or another is represented, and similarities between different stimuli are not encoded.
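The point about graded similarity can be illustrated with a dot product between firing-rate vectors; the example patterns below are invented for illustration:

```python
# With distributed patterns, similarity is graded: the dot product between two
# firing-rate vectors is high for overlapping patterns, zero for disjoint ones.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

stim1 = [1, 1, 1, 0, 0, 0]          # distributed pattern for stimulus 1
stim1_partial = [1, 1, 0, 0, 0, 0]  # incomplete version of stimulus 1
stim2 = [0, 0, 0, 1, 1, 1]          # a quite different stimulus

# The partial cue is still measurably closer to stimulus 1 than stimulus 2 is:
assert dot(stim1, stim1_partial) > dot(stim1, stim2)

# With a local code, any two distinct stimuli are simply orthogonal, so no
# graded similarity is available:
local1, local2 = [1, 0, 0], [0, 1, 0]
assert dot(local1, local2) == 0
```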
Another advantage of distributed encoding is that the number of different stimuli that can be represented by a set of C components (e.g. the activity of C axons) can be very large. A simple example is provided by the binary encoding of an 8-element vector. One component can code for which of two stimuli has been seen, 2 components (or bits in a computer byte) for 4 stimuli, 3 components for 8 stimuli, 8 components for 256 stimuli, etc. That is, the number of stimuli increases exponentially with the number of components (or in this case, axons) in the representation. (In this simple binary illustrative case, the number of stimuli that can be encoded is 2^C.) Put the other way round, even if a neuron has only a limited number of inputs (e.g. a few thousand), it can nevertheless receive a great deal of information about which stimulus was present. This ability of a neuron with a limited number of inputs to receive information about which of potentially very many input events is present is probably one factor that makes computation by the brain possible. With local encoding, the number of stimuli that can be encoded increases only linearly with the number C of axons or components (because a different component is needed to represent each new stimulus). (In our example, only 8 stimuli could be represented by 8 axons.)
In the real brain, there is now good evidence that in a number of brain systems, including the high-order visual and olfactory cortices, and the hippocampus, distributed encoding with the above two properties, of representing similarity, and of exponentially increasing encoding capacity as the number of neurons in the representation increases, is found (Rolls and Tovee 1995b, Abbott, Rolls and Tovee 1996a, Rolls, Treves and Tovee 1997b, Rolls, Treves, Robertson, Georges-François and Panzeri 1998b, Rolls, Aggelopoulos, Franco and Treves 2004). For example, in the primate inferior temporal visual cortex, the number of faces or objects that can be represented increases approximately exponentially with the number of neurons in the population (see Chapter 4). If we consider instead the information about which stimulus is seen, we see that this rises approximately linearly with the number of neurons in the representation (see Chapter 4). This corresponds to an exponential rise in the number of stimuli encoded, because information is a log measure (see Appendix B of Rolls and Deco (2002)). A similar result has been found for the encoding of position in space by the primate hippocampus (Rolls, Treves, Robertson, Georges-François and Panzeri 1998b). It is particularly important that the information can be read from the ensemble of neurons using a simple measure of the similarity of vectors, the correlation (or dot product) between two vectors. The importance of this is that it is essentially vector similarity operations that characterize (p.466) the operation of many neuronal networks (see Section A.2). The neurophysiological results show that both the ability to reflect similarity by vector correlation, and the utilization of exponential coding capacity, are properties of real neuronal networks found in the brain.
To emphasize one of the points being made here, although the binary encoding used in the 8-bit vector described above has optimal capacity for binary encoding, it is not optimal for vector similarity operations. For example, the two very similar numbers 127 and 128 are represented by 01111111 and 10000000 with binary encoding, yet the correlation or bit overlap of these vectors is 0. The brain in contrast uses a code that has the attractive property of exponentially increasing capacity with the number of neurons in the representation, though it is different from the simple binary encoding of numbers used in computers; and at the same time the brain codes stimuli in such a way that the code can be read off with simple dot product or correlation-related decoding, which is what is specified for the elementary neuronal network operation shown in equation A.1 (see Rolls and Deco (2002) Chapter 5).
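The 127/128 example can be checked directly:

```python
# 127 and 128 are numerically adjacent, yet their 8-bit binary patterns share
# no active bits, so their overlap (dot product) is 0: simple binary number
# encoding has exponential capacity but does not support similarity decoding.

def bits(n, width=8):
    return [(n >> k) & 1 for k in reversed(range(width))]

def overlap(u, v):
    return sum(a * b for a, b in zip(u, v))

assert bits(127) == [0, 1, 1, 1, 1, 1, 1, 1]
assert bits(128) == [1, 0, 0, 0, 0, 0, 0, 0]
assert overlap(bits(127), bits(128)) == 0      # no shared active bits at all

# while 8 binary components can still distinguish 2**8 = 256 stimuli:
assert 2 ** 8 == 256
```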
A.2 Pattern association memory
A fundamental operation of most nervous systems is to learn to associate a first stimulus with a second that occurs at about the same time, and to retrieve the second stimulus when the first is presented. The first stimulus might be the sight of food, and the second stimulus the taste of food. After the association has been learned, the sight of food would enable its taste to be retrieved. In classical conditioning, the taste of food might elicit an unconditioned response of salivation, and if the sight of the food is paired with its taste, then the sight of that food would by learning come to produce salivation. Pattern associators are thus used where the outputs of the visual system interface to learning systems in the orbitofrontal cortex and amygdala that learn associations between the sight of objects and their taste or touch in stimulus–reinforcer association learning (see Chapter 4). Pattern association is also used throughout the visual processing cortical areas, as it is the architecture that describes the backprojection connections from one cortical area to the preceding cortical area (Rolls and Deco 2002). Pattern association thus contributes to implementing top-down influences in vision, including the effects of attention from higher to lower cortical areas, and thus between the object and spatial processing streams (Rolls and Deco 2002); the effects of mood on memory and visual information processing (see Section 4.10); the recall of visual memories; and the operation of visual short-term memory (Rolls and Deco 2002).
A.2.1 Architecture and operation
The essential elements necessary for pattern association, forming what could be called a prototypical pattern associator network, are shown in Fig. A.7. What we have called the second or unconditioned stimulus pattern is applied through unmodifiable synapses generating an input to each neuron, which, being external with respect to the synaptic matrix we focus on, we can call the external input ei for the ith neuron. (We can also treat this as a vector, e, as indicated in the legend to Fig. A.7. Vectors and simple operations performed with them are summarized in Appendix A of Rolls and Deco (2002).) This unconditioned stimulus is dominant in producing or forcing the firing of the output neurons (yi for the ith neuron, or the vector y). At the same time, the first or conditioned stimulus pattern, consisting of the set of firings on the horizontally running input axons in Fig. A.7 (xj for the jth axon, or equivalently the vector x), is applied through the modifiable synapses wij.
Next we introduce a more precise description of the above by writing down explicit mathematical rules for the operation of the simple network model of Fig. A.7, which will help us to understand how pattern association memories in general operate. (In this description we introduce simple vector operations, and, for those who are not familiar with these, refer the reader, for example, to Appendix 1 of Rolls and Deco (2002).) We have denoted above a conditioned stimulus input pattern as x. Each of the axons has a firing rate, and if we count or index through the axons using the subscript j, the firing rate of the first axon is x1, of the second x2, of the jth xj, etc. The whole set of axons forms a vector, which is just an ordered (1, 2, 3, etc.) set of elements. The firing rate of each axon xj is one element of the firing rate vector x. Similarly, using i as the index, we can denote the firing rate of any output neuron as yi, and the firing rate output vector as y. With this terminology, we can then identify any synapse onto neuron i from neuron j as wij (see Fig. A.7). In this book, the first index, i, always refers to the receiving neuron (and thus signifies a dendrite), while the second index, j, refers to the sending neuron (and thus signifies a conditioned stimulus input axon in Fig. A.7). We can now specify the learning and retrieval operations as follows:
The firing rate of every output neuron is forced to a value determined by the unconditioned (or external, or forcing) stimulus input. In our simple model this means that for any one neuron i,

yi = ei.
The Hebb rule can then be written as

δwij = α yi xj,     (A.6)

where δwij is the change of the synaptic weight wij and α is a learning-rate constant.
The Hebb rule is expressed in this multiplicative form to reflect the idea that both presynaptic and postsynaptic activity must be present for the synapses to increase in strength. The multiplicative form also reflects the idea that strong pre- and postsynaptic firing will produce a larger change of synaptic weight than smaller firing rates. It is also assumed for now that before any learning takes place, the synaptic strengths are small in relation to the changes that can be produced during Hebbian learning. We will see that this assumption can be relaxed later when a modified Hebb rule is introduced that can lead to a reduction in synaptic strength under some conditions.
When the conditioned stimulus is present on the input axons, the total activation hi of a neuron i is the sum of all the activations produced through each strengthened synapse wij by each active neuron xj. We can express this as

hi = Σj xj wij,     (A.7)

where Σj indicates that the sum is over the C input axons indexed by j.
The multiplicative form here indicates that activation should be produced by an axon only if it is firing, and only if it is connected to the dendrite by a strengthened synapse. It also indicates that the strength of the activation reflects how fast the axon xj is firing, and how strong the synapse wij is. The sum of all such activations expresses the idea that summation (of synaptic currents in real neurons) occurs along the length of the dendrite, to produce activation at the cell body, where the activation hi is converted into firing yi. This conversion can be expressed as

yi = f(hi),

where f is the activation function of the neuron, which might for example be a binary threshold function.
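The activation and the threshold conversion just described can be sketched as follows; the threshold value of 2 anticipates the worked example in Section A.2.2, and the weight vector is illustrative:

```python
# Activation h_i = sum_j x_j * w_ij followed by a binary threshold transfer
# function.  The threshold of 2 matches the worked example in the text.

def activation(x, w_i):
    # summation of synaptic currents along the dendrite
    return sum(x_j * w_ij for x_j, w_ij in zip(x, w_i))

def fire(h_i, threshold=2):
    # binary threshold activation function: y_i = 1 if h_i >= threshold else 0
    return 1 if h_i >= threshold else 0

w_i = [1, 0, 1, 0, 1, 0]                       # weights learned from a binary CS
assert activation([1, 0, 1, 0, 1, 0], w_i) == 3
assert fire(3) == 1 and fire(1) == 0
```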
A.2.2 A simple model
An example of these learning and recall operations is provided in a simple form as follows. The neurons will have simple firing rates, which can be 0 to represent no activity, and 1 to indicate high firing. They are thus binary neurons, which can assume one of two firing rates. If we have a pattern associator with six input axons and four output neurons, we could represent the network before learning, with the same layout as in Fig. A.7, as shown in Fig. A.8:
After pairing the CS with the UCS during one learning trial, some of the synaptic weights will be incremented according to Eqn A.6, so that after learning this pair the synaptic weights will become as shown in Fig. A.9:
We can represent what happens during recall, when, for example, we present the CS that has been learned, as shown in Fig. A.10:
(p.471) We can now illustrate how a number of different associations can be stored in such a pattern associator, and retrieved correctly. Let us associate a new CS pattern 110001 with the UCS 0101 in the same pattern associator. The weights will become as shown next in Fig. A.11 after learning:
This illustration shows the value of some threshold non-linearity in the activation function of the neurons. In this case, the activations did reflect some small cross-talk or interference from the previous pattern association of CS1 with UCS1, but this was removed by the threshold operation, to clean up the recall firing. The example also shows that when further associations are learned by a pattern associator trained with the Hebb rule, Eqn A.6, some synapses will reflect increments above a synaptic strength of 1. It is left as an exercise to the reader to verify that recall is still perfect to CS1, the vector 101010. (The activation vector h is 3401, and the output firing vector y with the same threshold of 2 is 1100, which is perfect recall.)
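The worked example can be checked in a few lines of code. The CS and UCS vectors below follow the text (CS2 = 110001 with UCS 0101; CS1 = 101010, with its UCS taken to be 1100, as implied by the perfect-recall output quoted above); a unit learning rate for the binary Hebb rule is assumed:

```python
# Pattern associator with six input axons and four binary output neurons,
# trained with the Hebb rule (unit learning rate) on two CS-UCS pairs, then
# tested for recall with the threshold of 2 used in the text.

def learn(weights, cs, ucs):
    # dw_ij = y_i * x_j with binary rates and unit learning rate
    for i, y_i in enumerate(ucs):
        for j, x_j in enumerate(cs):
            weights[i][j] += y_i * x_j

def recall(weights, cs, threshold=2):
    h = [sum(x_j * w_ij for x_j, w_ij in zip(cs, row)) for row in weights]
    y = [1 if h_i >= threshold else 0 for h_i in h]
    return h, y

w = [[0] * 6 for _ in range(4)]
learn(w, cs=[1, 0, 1, 0, 1, 0], ucs=[1, 1, 0, 0])   # CS1 101010 -> UCS 1100
learn(w, cs=[1, 1, 0, 0, 0, 1], ucs=[0, 1, 0, 1])   # CS2 110001 -> UCS 0101

h1, y1 = recall(w, [1, 0, 1, 0, 1, 0])
assert h1 == [3, 4, 0, 1] and y1 == [1, 1, 0, 0]    # the 3401 -> 1100 of the text
h2, y2 = recall(w, [1, 1, 0, 0, 0, 1])
assert y2 == [0, 1, 0, 1]                           # CS2 also recalled perfectly
```

Note how the threshold removes the cross-talk: the raw activations to CS1 include interference from the second association (the 1 in h1), but the output firing is nevertheless exactly the stored UCS.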
(p.472) A.2.3 The vector interpretation
The way in which recall is produced, equation A.7, consists, for each output neuron i, of multiplying each input firing rate xj by the corresponding synaptic weight wij and summing the products to obtain the activation hi. Now we can consider the firing rates xj, where j varies from 1 to N′, the number of axons, to be a vector. (A vector is simply an ordered set of numbers – see Appendix 1 of Rolls and Deco (2002).) Let us call this vector x. Similarly, on a neuron i, the synaptic weights can be treated as a vector, wi. (The subscript i here indicates that this is the weight vector on the ith neuron.) The operation we have just described to obtain the activation of an output neuron can now be seen to be a simple multiplication operation of two vectors to produce a single output value (called a scalar output). This is the inner product or dot product of two vectors, and can be written
It can now be seen that a fundamental operation many neurons perform is effectively to compute how similar an input pattern vector x is to their stored weight vector wi. The similarity measure they compute, the dot product, is a very good measure of similarity, and indeed, the standard (Pearson product-moment) correlation coefficient used in statistics is the same as a normalized dot product with the mean subtracted from each vector, as shown in Appendix 1 of Rolls and Deco (2002). (The normalization used in the correlation coefficient results in the coefficient varying always between +1 and −1, whereas the actual scalar value of a dot product clearly depends on the length of the vectors from which it is calculated.)
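The relation between the dot product and the correlation coefficient can be checked numerically. The following sketch uses only the standard library; the two vectors are arbitrary illustrative choices, the second being a scaled version of the first.

```python
# The dot product a neuron effectively computes, and its relation to the
# Pearson correlation coefficient: a normalized dot product with the mean
# subtracted from each vector.

import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pearson(a, b):
    """Correlation = dot product of mean-subtracted, length-normalized vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [x - mb for x in b]
    return dot(da, db) / (math.sqrt(dot(da, da)) * math.sqrt(dot(db, db)))

x  = [1, 0, 1, 0, 1, 0]          # an input firing-rate vector
wi = [2, 0, 2, 0, 2, 0]          # a weight vector pointing the same way

print(dot(x, wi))        # 6: the scalar value depends on vector lengths
print(pearson(x, wi))    # 1.0: identical direction, regardless of scaling
```

Doubling wi would double the dot product but leave the correlation at 1.0, which is the normalization point made in the text.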
With these concepts, we can now see that during learning, a pattern associator adds to its weight vector a vector δwi that has the same pattern as the input pattern x, if the postsynaptic neuron i is strongly activated. Indeed, we can express equation A.6 in vector form as
During recall, pattern associators generalize, and produce appropriate outputs if a recall cue vector xr is similar to a vector that has been learned already. This occurs because the recall (p.473) operation involves computing the dot (inner) product of the input pattern vector xr with the synaptic weight vector wi, so that the firing produced, yi, reflects the similarity of the current input to the previously learned input pattern x. (Generalization will occur to input cue or conditioned stimulus patterns xr that are incomplete versions of an original conditioned stimulus x, although the term completion is usually applied to the autoassociation networks described in Section A.3.)
This is an extremely important property of pattern associators, for input stimuli during recall will rarely be absolutely identical to what has been learned previously, and automatic generalization to similar stimuli is extremely useful, and has great adaptive value in biological systems.
Generalization can be illustrated with the simple binary pattern associator considered above. (Those who have appreciated the vector description just given might wish to skip this illustration.) Instead of the second CS, pattern vector 110001, we will use the similar recall cue 110100, as shown in Fig. A.13:
A.2.4.2 Graceful degradation or fault tolerance
If the synaptic weight vector wi (or the weight matrix, which we can call W) has synapses missing (e.g. during development), or loses synapses, then the activation hi or h is still reasonable, because hi is the dot product (correlation) of x with wi. The result, especially after passing through the activation function, can frequently be perfect recall. The same property arises if, for example, one or some of the conditioned stimulus (CS) input axons are lost or damaged. This is a very important property of associative memories, and is not a property of conventional computer memories, which produce incorrect data if even a single storage location (holding one bit, or binary digit, of data) is damaged or cannot be accessed. This property of graceful degradation is of great adaptive value for biological (p.474) systems.
We can illustrate this with a simple example. If we damage two of the synapses in Fig. A.12 to produce the synaptic matrix shown in Fig. A.14 (where x indicates a damaged synapse which has no effect, but was previously 1), and now present the second CS, the retrieval is as follows:
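Graceful degradation can also be sketched in code on the weight matrix of the worked example above. Which two synapses are "damaged" here is an arbitrary choice made for illustration (the text's Fig. A.14 makes its own choice); the point is that the thresholded dot product survives the loss.

```python
# Graceful degradation: two learned synapses are zeroed ("damaged"), yet
# recall of CS2 through the threshold is still perfect, because the dot
# product with the weight vector degrades only a little.

def recall(weights, cs, threshold=2):
    h = [sum(w * x for w, x in zip(row, cs)) for row in weights]
    return [1 if hi >= threshold else 0 for hi in h]

# Weight matrix after learning CS1 -> 1100 and CS2 -> 0101, as in the
# worked example above
W = [[1, 0, 1, 0, 1, 0],
     [2, 1, 1, 0, 1, 1],
     [0, 0, 0, 0, 0, 0],
     [1, 1, 0, 0, 0, 1]]

W[1][0] = 0   # a damaged synapse (arbitrary choice)
W[3][5] = 0   # another damaged synapse (arbitrary choice)

CS2, UCS2 = [1, 1, 0, 0, 0, 1], [0, 1, 0, 1]
print(recall(W, CS2))   # [0, 1, 0, 1]: still perfect recall of UCS2
```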
A.2.4.3 The importance of distributed representations for pattern associators
A distributed representation is one in which the firing or activity of all the elements in the vector is used to encode a particular stimulus. For example, in a conditioned stimulus vector CS1 that has the value 101010, we need to know the state of all the elements to know which stimulus is being represented. Another stimulus, CS2, is represented by the vector 110001. We can represent many different events or stimuli with such overlapping sets of elements, and because in general any one element cannot be used to identify the stimulus, but instead the information about which stimulus is present is distributed over the population of elements or neurons, this is called a distributed representation (see Section A.1.6). If, for binary neurons, half the neurons are in one state (e.g. 0), and the other half are in the other state (e.g. 1), then the representation is described as fully distributed. The CS representations above are thus fully distributed. If only a smaller proportion of the neurons is active to represent a stimulus, as in the vector 100001, then this is a sparse representation. For binary representations, we can quantify the sparseness by the proportion of neurons in the active (1) state.
In contrast, a local representation is one in which all the information that a particular stimulus or event has occurred is provided by the activity of one of the neurons, or elements in the vector. One stimulus might be represented by the vector 100000, another stimulus by the vector 010000, and a third stimulus by the vector 001000. The activity of neuron or element 1 would indicate that stimulus 1 was present, and of neuron 2, that stimulus 2 was present. The representation is local in that if a particular neuron is active, we know that the stimulus (p.475) represented by that neuron is present. In neurophysiology, if such cells were present, they might be called ‘grandmother cells’ (cf. Barlow (1972), (1995)), in that one neuron might represent a stimulus in the environment as complex and specific as one's grandmother. Where the activity of a number of cells must be taken into account in order to represent a stimulus (such as an individual taste), then the representation is sometimes described as using ensemble encoding.
The properties just described for associative memories, generalization and graceful degradation, are only implemented if the representation of the CS or x vector is distributed. This occurs because the recall operation involves computing the dot (inner) product of the input pattern vector xr with the synaptic weight vector wi. This allows the activation hi to reflect the similarity of the current input pattern to a previously learned input pattern x only if several or many elements of the x and xr vectors are in the active state to represent a pattern. If local encoding were used, e.g. 100000, then if the first element of the vector (which might be the firing of axon 1, i.e. x1, or the strength of synapse 1 on neuron i, wi1) is lost, the resulting vector is not similar to any other CS vector, and the activation is 0. With local encoding, the important properties of associative memories, generalization and graceful degradation, thus do not emerge. Graceful degradation and generalization are dependent on distributed representations, for then the dot product can reflect similarity even when some elements of the vectors involved are altered. If we think of the correlation between Y and X in a graph, then this correlation is affected only a little if a few X, Y pairs of data are lost (see Appendix 1 of Rolls and Deco (2002)).
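The contrast between distributed and local coding can be seen in a toy computation, using the illustrative codes introduced above: silencing one active element of a distributed cue leaves a cue that still overlaps the stored vector, whereas silencing the single active element of a local cue leaves no similarity at all.

```python
# Why distributed coding supports generalization and graceful degradation
# while local coding does not: after losing one active element, a
# distributed cue still has a non-zero dot product with the stored vector;
# a local cue does not.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

distributed = [1, 0, 1, 0, 1, 0]   # three active elements carry the code
local       = [1, 0, 0, 0, 0, 0]   # one active element carries the code

# Damage each cue by silencing its first (active) element
damaged_distributed = [0] + distributed[1:]
damaged_local       = [0] + local[1:]

print(dot(damaged_distributed, distributed))  # 2: substantial overlap remains
print(dot(damaged_local, local))              # 0: no similarity left
```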
A.2.5 Prototype extraction, extraction of central tendency, and noise reduction
If a set of similar conditioned stimulus vectors x are paired with the same unconditioned stimulus ei, the weight vector wi becomes (or points towards) the sum (or with scaling, the average) of the set of similar vectors x. This follows from the operation of the Hebb rule in equation A.6. When tested at recall, the output of the memory is then best to the average input pattern vector denoted < x >. If the average is thought of as a prototype, then even though the prototype vector < x > itself may never have been seen, the best output of the neuron or network is to the prototype. This produces ‘extraction of the prototype’ or ‘central tendency’. The same phenomenon is a feature of human memory performance (see McClelland and Rumelhart (1986) Chapter 17), and this simple process with distributed representations in a neural network accounts for the psychological phenomenon.
If the different exemplars of the vector x are thought of as noisy versions of the true input pattern vector < x > (with incorrect values for some of the elements), then the pattern associator has performed ‘noise reduction’, in that the output produced by any one of these vectors will represent the output produced by the true, noiseless, average vector < x >.
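Prototype extraction can be sketched as follows. The prototype and its noisy exemplars are hypothetical toy vectors (each exemplar has one active element deleted), and the postsynaptic UCS term yi is taken as 1 throughout, so the weight vector simply accumulates the sum of the exemplars.

```python
# Prototype extraction with a single Hebbian neuron: several noisy
# exemplars of a prototype CS are each paired with the same UCS, so the
# weight vector becomes the sum (with scaling, the average) of the
# exemplars. At recall the activation is largest for the prototype itself,
# even though the prototype was never presented during learning.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

prototype = [1, 1, 1, 0, 0, 0]
exemplars = [[1, 1, 0, 0, 0, 0],   # noisy versions: one element deleted each
             [1, 0, 1, 0, 0, 0],
             [0, 1, 1, 0, 0, 0]]

w = [0] * 6
for x in exemplars:                 # Hebb rule with the UCS term y_i = 1
    w = [wi + xi for wi, xi in zip(w, x)]

h_proto = dot(prototype, w)         # activation to the unseen prototype
h_exemp = [dot(x, w) for x in exemplars]
print(h_proto, h_exemp)             # 6 versus [4, 4, 4]
```

The neuron responds best to the average of what it was shown, which is the extraction of the central tendency, and the same computation shows the noise reduction: each noisy exemplar retrieves the output that the noiseless average would produce.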
Recall is very fast in a real neuronal network, because the conditioned stimulus input firings xj (j = 1, …, C axons) can be applied simultaneously to the synapses wij, and the activation hi can be accumulated in one or two time constants of the dendrite (e.g. 10–20 ms). Whenever the threshold of the cell is exceeded, it fires. Thus, in effectively one step, which takes the brain no more than 10–20 ms, all the output neurons of the pattern associator can be firing (p.476) with rates that reflect the input firing of every axon. This is very different from a conventional digital computer, in which computing hi in equation A.7 would involve C multiplication and addition operations occurring one after another, or 2C time steps.
The brain performs parallel computation in at least two senses in even a pattern associator. One is that for a single neuron, the separate contributions of the firing rate xj of each axon j multiplied by the synaptic weight wij are computed in parallel and added in the same time step. The second is that this can be performed in parallel for all neurons i = 1, N in the network, where there are N output neurons in the network. It is these types of parallel processing that enable these classes of neuronal network in the brain to operate so fast, in effectively so few steps.
Learning is also fast (‘one-shot’) in pattern associators, in that a single pairing of the conditioned stimulus x and the unconditioned stimulus (UCS) e which produces the unconditioned output firing y enables the association to be learned. There is no need to repeat the pairing in order to discover over many trials the appropriate mapping. This is extremely important for biological systems, in which a single co-occurrence of two events may lead to learning that could have life-saving consequences. (For example, the pairing of a visual stimulus with a potentially life-threatening aversive event may enable that event to be avoided in future.) Although repeated pairing with small variations of the vectors is used to obtain the useful properties of prototype extraction, extraction of central tendency, and noise reduction, the essential properties of generalization and graceful degradation are obtained with just one pairing. The actual time scales of the learning in the brain are indicated by studies of associative synaptic modification using long-term potentiation paradigms (LTP, see Section A.1.5). Co-occurrence or near simultaneity of the CS and UCS is required for periods of as little as 100 ms, with expression of the synaptic modification being present within typically a few seconds.
A.2.7 Local learning rule
The simplest learning rule used in pattern association neural networks, a version of the Hebb rule, is, as shown in equation A.6 above,
Evidence that a learning rule with the general form of equation A.6 is implemented in at least some parts of the brain comes from studies of long-term potentiation, described in Section A.1.5. Long-term potentiation (LTP) has the synaptic specificity defined by equation A.6, in that only synapses from active afferents, not those from inactive afferents, become strengthened. Synaptic specificity is important for a pattern associator, and most other types of neuronal network, to operate correctly. The number of independently modifiable synapses (p.477)
Another useful property of real neurons in relation to equation A.6 is that the postsynaptic term, yi, is available on much of the dendrite of a cell, because the electrotonic length of the dendrite is short. In addition, active propagation of spiking activity from the cell body along the dendrite may help to provide a uniform postsynaptic term for the learning. Thus if a neuron is strongly activated with a high value for yi, then any active synapse onto the cell will be capable of being modified. This enables the cell to learn an association between the pattern of activity on all its axons and its postsynaptic activation, which is stored as an addition to its weight vector wi. Then later on, at recall, the output can be produced as a vector dot product operation between the input pattern vector x and the weight vector wi, so that the output of the cell can reflect the correlation between the current input vector and what has previously been learned by the cell.
The question of the storage capacity of a pattern associator is considered in detail in Appendix A3 of Rolls and Treves (1998). It is pointed out there that, for this type of associative network, the number of memories that it can hold simultaneously in storage has to be analysed together with the retrieval quality of each output representation, and then only for a given quality of the representation provided in the input. This is in contrast to autoassociative nets (Section A.3), in which a critical number of stored memories exists (as a function of various parameters of the network), beyond which attempting to store additional memories results in it becoming impossible to retrieve essentially anything. With a pattern associator, instead, one will always (p.478) retrieve something, but this something will be very small (in information or correlation terms) if too many associations are simultaneously in storage and/or if too little is provided as input.
The conjoint quality-capacity input analysis can be carried out, for any specific instance of a pattern associator, by using formal mathematical models and established analytical procedures (see e.g. Treves (1995)). This, however, has to be done case by case. It is anyway useful to develop some intuition for how a pattern associator operates, by considering what its capacity would be in certain well-defined simplified cases.
Linear associative neuronal networks These networks are made up of units with a linear activation function, which appears to make them unsuitable to represent real neurons with their positive-only firing rates. However, even purely linear units have been considered as provisionally relevant models of real neurons, by assuming that the latter operate sometimes in the linear regime of their transfer function. (This implies a high level of spontaneous activity, and may be closer to conditions observed early on in sensory systems rather than in areas more specifically involved in memory.) As usual, the connections are trained by a Hebb (or similar) associative learning rule. The capacity of these networks can be defined as the total number of associations that can be learned independently of each other, given that the linear nature of these systems prevents anything more than a linear transform of the inputs. This implies that if input pattern C can be written as the weighted sum of input patterns A and B, the output to C will be just the same weighted sum of the outputs to A and B. If there are N′ input axons, then there can be at most N′ mutually independent input patterns (i.e. none able to be written as a weighted sum of the others), and therefore the capacity of linear networks, defined above, is just N′, or equal to the number of inputs to each neuron. In general, a random set of less than N′ vectors (the CS input pattern vectors) will tend to be mutually independent but not mutually orthogonal (at 90° to each other) (see Appendix 1 of Rolls and Deco (2002)). If they are not orthogonal (the normal situation), then their dot product is not 0, and the output pattern activated by one of the input vectors will be partially activated by other input pattern vectors, in accordance with how similar they are (see equations A.9 and A.10).
This amounts to interference, which is therefore the more serious the less orthogonal, on the whole, is the set of input vectors.
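The linear superposition argument can be verified directly. In the sketch below the weight matrix and patterns are arbitrary illustrative numbers; the point is that whatever the weights are, a linear unit's response to C = A + 2B is forced to be the same combination of its responses to A and B.

```python
# Linearity in a purely linear associator: the response to a weighted sum
# of input patterns is the same weighted sum of the responses, so at most
# N' mutually independent input patterns can be treated independently.

def linear_output(W, x):
    """y_i = sum_j w_ij x_j: a linear unit, with no threshold non-linearity."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

W = [[0.2, 0.5, 0.1],     # arbitrary learned weights, 2 outputs x 3 inputs
     [0.4, 0.0, 0.3]]

A = [1.0, 0.0, 2.0]
B = [0.0, 3.0, 1.0]
C = [a + 2 * b for a, b in zip(A, B)]      # C = A + 2B: not independent

yA = linear_output(W, A)
yB = linear_output(W, B)
yC = linear_output(W, C)

# The output to C is constrained to be yA + 2*yB, whatever was learned
assert all(abs(yc - (ya + 2 * yb)) < 1e-9
           for yc, ya, yb in zip(yC, yA, yB))
```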
Since input patterns are made of elements with positive values, if a simple Hebbian learning rule like the one of equation A.6 is used (in which the input pattern enters directly with no subtraction term), the output resulting from the application of a stored input vector will be the sum of contributions from all other input vectors that have a non-zero dot product with it (see Appendix 1 of Rolls and Deco (2002)), and interference will be disastrous. The only situation in which this would not occur is when different input patterns activate completely different input lines, but this is clearly an uninteresting circumstance for networks operating with distributed representations. A solution to this issue is to use a modified learning rule of the following form:
Table A.1 Effects of pre- and post-synaptic activity on synaptic modification
This modified learning rule can also be described in terms of a contingency table (Table A.1) showing the synaptic strength modifications produced by different types of learning rule, where LTP indicates an increase in synaptic strength (called Long-Term Potentiation in neurophysiology), and LTD indicates a decrease in synaptic strength (called Long-Term Depression in neurophysiology). Heterosynaptic long-term depression is so-called because it is the decrease in synaptic strength that occurs to a synapse that is other than that through which the postsynaptic cell is being activated. This heterosynaptic long-term depression is the type of change of synaptic strength that is required (in addition to LTP) for effective subtraction of the average presynaptic firing rate, in order, as it were, to make the CS vectors appear more orthogonal to the pattern associator. The rule is sometimes called the Singer–Stent rule, after work by Singer (1987) and Stent (1973), and was discovered in the brain by Levy (Levy (1985); Levy and Desmond (1985); see Brown, Kairiss and Keenan (1990)). Homosynaptic long-term depression is so-called because it is the decrease in synaptic strength that occurs to a synapse which is (the same as that which is) active. For it to occur, the postsynaptic neuron must simultaneously be inactive, or have only low activity. (This rule is sometimes called the BCM rule after the paper of Bienenstock, Cooper and Munro (1982); see Rolls and Deco (2002), Chapter 7).
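The effect of subtracting the mean presynaptic firing rate can be sketched numerically. The two CS patterns below are arbitrary half-active binary vectors chosen for illustration, and the learning-rate and postsynaptic terms are set to 1; the subtraction stores a zero-mean vector, so an unrelated positive-only cue produces far less cross-talk at recall.

```python
# The modified Hebb rule discussed above: subtracting the mean presynaptic
# rate (delta_w_ij proportional to y_i * (x_j - x_mean)) makes the stored
# vectors effectively more orthogonal, reducing interference between
# positive-only CS patterns.

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

cs1 = [1, 1, 1, 0, 1, 0, 0, 0]    # stored CS (positive-only, half active)
cs2 = [1, 1, 0, 0, 0, 1, 1, 0]    # an unrelated CS used as a recall cue

# Plain Hebb rule: the stored weight vector is cs1 itself (alpha = y = 1)
plain_crosstalk = dot(cs2, cs1)

# Modified rule: store cs1 with its mean subtracted; synapses whose
# presynaptic rate is below the mean are weakened (heterosynaptic LTD)
mean1 = sum(cs1) / len(cs1)
w = [x - mean1 for x in cs1]
modified_crosstalk = dot(cs2, w)

print(plain_crosstalk, modified_crosstalk)   # 2 versus 0.0
```

With the plain rule the unrelated cue partially activates the output; with the mean subtracted, the chance overlap between the two patterns cancels, which is the "making the CS vectors appear more orthogonal" described in the text.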
Associative neuronal networks with non-linear neurons With non-linear neurons, that is with at least a threshold in the activation function so that the output firing yi is 0 when the activation hi is below the threshold, the capacity can be measured in terms of the number of different clusters of output pattern vectors that the network produces. This is because the non-linearities now present (one per output neuron) result in some clustering of the outputs produced by all possible (conditioned stimulus) input patterns x. Input patterns that are similar to a stored input vector can produce, due to the non-linearities, output patterns even closer to the stored output; and, vice versa, sufficiently dissimilar inputs can be assigned to different output clusters, thereby increasing their mutual dissimilarity. As with the linear counterpart, in order to remove the correlation that would otherwise occur between the patterns because the elements can take only positive values, it is useful to use a modified Hebb rule of the form shown in equation A.11.
(p.480) With fully distributed output patterns, the number p of associations that leads to different clusters is of order C, the number of input lines (axons) per output neuron (that is, of order N′ for a fully connected network), as shown in Appendix A3 of Rolls and Treves (1998). If sparse patterns are used in the output, or alternatively if the learning rule includes a non-linear postsynaptic factor that is effectively equivalent to using sparse output patterns, the coefficient of proportionality between p and C can be much higher than one, that is, many more patterns can be stored than inputs onto each output neuron (see Appendix A3 of Rolls and Treves (1998)). Indeed, the number of different patterns or prototypes p that can be stored can be derived for example in the case of binary units (Gardner 1988) to be
The non-linearity inherent in the NMDA receptor-based Hebbian plasticity present in the brain may help to make the stored patterns more sparse than the input patterns, and this may be especially beneficial in increasing the storage capacity of associative networks in the brain by allowing participation in the storage of especially those relatively few neurons with high firing rates in the exponential firing rate distributions typical of neurons in sensory systems (see Rolls and Deco (2002)).
Interference occurs in linear pattern associators if two vectors are not orthogonal, and is simply dependent on the angle between the originally learned vector and the recall cue or CS vector (see Appendix 1 of Rolls and Deco (2002)), for the activation of the output neuron depends simply on the dot product of the recall vector and the synaptic weight vector (equation A.9). Also in non-linear pattern associators (the interesting case for all practical purposes), interference may occur if two CS patterns are not orthogonal, though the effect can be controlled with sparse encoding of the UCS patterns, effectively by setting high thresholds for the firing of output units. In other words, the CS vectors need not be strictly orthogonal, but if they are too similar, some interference will still be likely to occur.
The fact that interference is a property of neural network pattern associator memories is of interest, for interference is a major property of human memory. Indeed, the fact that interference is a property of human memory and of neural network association memories is entirely consistent with the hypothesis that human memory is stored in associative memories of the type described here, or at least that network associative memories of the type described represent a useful exemplar of the class of parallel distributed storage network used in human memory.
It may also be suggested that one reason that interference is tolerated in biological memory is that it is associated with the ability to generalize between stimuli, which is an invaluable feature of biological network associative memories, in that it allows the memory to cope with (p.481)
A.2.7.3 Expansion recoding
If patterns are too similar to be stored in associative memories, then one solution that the brain seems to use repeatedly is to expand the encoding to a form in which the different stimulus patterns are less correlated, that is, more orthogonal, before they are presented as CS stimuli to a pattern associator. The problem can be highlighted by a non-linearly separable mapping (which captures part of the eXclusive OR (XOR) problem), in which the mapping that is desired is as shown in Fig. A.16. The neuron has two inputs, A and B.
This is a mapping of patterns that is impossible for a one-layer network, because the patterns are not linearly separable. A solution is to remap the two input lines A and B to three input lines 1–3, that is, to use expansion recoding, as shown in Fig. A.17. This can be performed by a competitive network (see Rolls and Deco (2002) Chapter 7). The synaptic
Rolls and Treves (1998) show that competitive networks could help with this type of recoding, and could provide very useful preprocessing for a pattern associator in the brain. It is possible that the lateral nucleus of the amygdala performs this function, for it receives inputs from the temporal cortical visual areas, and may preprocess them before they become the inputs to associative networks at the next stage of amygdala processing (see Fig. 4.52).
A.2.8 Implications of different types of coding for storage in pattern associators
Throughout this Section, we have made statements about how the properties of pattern associators – such as the number of patterns that can be stored, and whether generalization and graceful degradation occur – depend on the type of encoding of the patterns to be associated. (The types of encoding considered, local, sparse distributed, and fully distributed, are described above.) We draw together these points in Table A.2.
Table A.2 Coding in associative memories*

                                    Local            Sparse distributed             Fully distributed
Generalization, completion,         No               Yes                            Yes
graceful degradation
Number of patterns that             N                of order C/[a_o log(1/a_o)]    of order C
can be stored                                        (can be larger)                (usually smaller than N)
Amount of information in each       Minimal          Intermediate                   Large
pattern (values if binary)          (log(N) bits)    (N a_o log(1/a_o) bits)        (N bits)

(*) N refers here to the number of output units, and C to the average number of inputs to each output unit. a_o is the sparseness of output patterns, or roughly the proportion of output units activated by a UCS pattern. Note: logs are to the base 2.
The amount of information that can be stored in each pattern in a pattern associator is considered in Appendix A3 of Rolls and Treves (1998).
In conclusion, the architecture and properties of pattern association networks make them very appropriate for stimulus–reinforcer association learning. Their high capacity enables them to learn the correct reinforcement associations for very large numbers of different stimuli.
(p.483) A.3 Autoassociation memory: attractor networks
In this Section an introduction to autoassociation or attractor networks is given, as this type of network may be relevant to understanding how mood states are maintained.
Autoassociative memories, or attractor neural networks, store memories, each one of which is represented by a different set of the neurons firing. The memories are stored in the recurrent synaptic connections between the neurons of the network, for example in the recurrent collateral connections between cortical pyramidal cells. Autoassociative networks can then recall the appropriate memory from the network when provided with a fragment of one of the memories. This is called completion. Many different memories can be stored in the network and retrieved correctly. A feature of this type of memory is that it is content addressable: that is, the information in the memory can be accessed if just the contents of the memory (or a part of the contents of the memory) are used. This is in contrast to a conventional computer, in which the address of what is to be accessed must be supplied, and used to access the contents of the memory. Content addressability is an important simplifying feature of this type of memory, which makes it suitable for use in biological systems. The issue of content addressability will be amplified below.
An autoassociation memory can be used as a short-term memory, in which iterative processing round the recurrent collateral connections between the principal neurons in the network keeps a representation active by continuing, persistent, neuronal firing. Used in this way, attractor networks provide the basis for the implementation of short-term memory in the dorsolateral prefrontal cortex. In this cortical area, the short-term memory provides the basis for keeping a memory active even while perceptual areas such as the inferior temporal visual cortex must respond to each incoming visual stimulus in order for it to be processed, to produce behavioural responses, and for it to be perceived (Renart, Moreno, Rocha, Parga and Rolls 2001). The implementation of short-term memory in the prefrontal cortex which can maintain neuronal firing even across intervening stimuli provides an important foundation for attention, in which an item or items must be held in mind for a period and during this time bias other brain areas by top-down processing using cortico-cortical backprojections (Rolls and Deco 2002, Deco and Rolls 2004, Deco and Rolls 2005b), or determine how stimuli are mapped to responses (Deco and Rolls 2003) or to rewards (see Appendix 2 and Deco and Rolls (2005d)) with rapid, one-trial, task switching and decision making. This dorsolateral prefrontal cortex short-term memory system also provides a computational foundation for executive function, in which several items must be held in a working memory so that they can be performed with the correct priority and order (Rolls and Deco 2002). In brain areas involved in emotion, attractor networks may play a role in maintaining a mood state, at least in the short-term after for example frustrative non-reward (see Chapters 2 and 3), and possibly in the longer term. 
Other functions for autoassociation networks including perceptual short-term memory which may be used in the learning of invariant representations, constraint satisfaction, and episodic memory are described by Rolls and Treves (1998) and Rolls and Deco (2002).
A.3.1 Architecture and operation
The prototypical architecture of an autoassociation memory is shown in Fig. A.19. The external input ei is applied to each neuron i by unmodifiable synapses. This produces firing yi of each neuron, or a vector of firing on the output neurons y. Each output neuron i is connected by a recurrent collateral connection to the other neurons in the network, via modifiable connection (p.484)
Next we introduce a more precise and detailed description of the above, and describe the properties of these networks. Ways to analyse formally the operation of these networks are introduced in Appendix A4 of Rolls and Treves (1998) and by Amit (1989).
The firing of every output neuron i is forced to a value yi determined by the external input ei. Then a Hebb-like associative local learning rule is applied to the recurrent synapses in the network:
A factor that is sometimes overlooked is that there must be a mechanism for ensuring that during learning yi does approximate ei, and is not influenced much by activity in the recurrent collateral connections; otherwise the new external pattern e will not be stored in the network, but instead something influenced by the previously stored memories will be stored. Mechanisms that may facilitate this are described by Rolls and Treves (1998) and Rolls and Deco (2002).
(p.485) A.3.1.2 Recall
During recall, the external input ei is applied, and produces output firing, operating through the non-linear activation function described below. The firing is fed back by the recurrent collateral axons shown in Fig. A.19 to produce activation of each output neuron through the modified synapses on each output neuron. The activation hi produced by the recurrent collateral effect on the ith neuron is, in the standard way, the sum of the activations produced in proportion to the firing rate of each axon yj operating through each modified synapse wij,
The output firing yi is a function of the activation hi produced by the recurrent collateral effect (internal recall) and by the external input (ei):
A.3.2 Introduction to the analysis of the operation of autoassociation networks
With complete connectivity in the synaptic matrix, and the use of a Hebb rule, the matrix of synaptic weights formed during learning is symmetric. The learning algorithm is fast, ‘one-shot’, in that a single presentation of an input pattern is all that is needed to store that pattern.
During recall, a part of one of the originally learned stimuli can be presented as an external input. The resulting firing is allowed to iterate repeatedly round the recurrent collateral system, gradually on each iteration recalling more and more of the originally learned pattern. Completion thus occurs. If a pattern is presented during recall that is similar but not identical to any of the previously learned patterns, then the network settles into a stable recall state in which the firing corresponds to that of the most similar previously learned pattern. The (p.486) network can thus generalize in its recall to the most similar previously learned pattern. The activation function of the neurons should be non-linear, since a purely linear system would not produce any categorization of the input patterns it receives, and therefore would not be able to effect anything more than a trivial (i.e. linear) form of completion and generalization.
Recall can be thought of in the following way, relating it to what occurs in pattern associators. The external input e is applied, produces firing y, which is applied as a recall cue on the recurrent collaterals as yT. (The notation yT signifies the transpose of y, which is implemented by the application of the firing of the neurons y back via the recurrent collateral axons as the next set of inputs to the neurons.) The activity on the recurrent collaterals is then multiplied with the synaptic weight vector stored during learning on each neuron to produce the new activation hi which reflects the similarity between yT and one of the stored patterns. Partial recall has thus occurred as a result of the recurrent collateral effect. The activations hi after thresholding (which helps to remove interference from other memories stored in the network, or noise in the recall cue) result in firing yi, or a vector of all neurons y, which is already more like one of the stored patterns than, at the first iteration, the firing resulting from the recall cue alone, y = f(e). This process is repeated a number of times to produce progressive recall of one of the stored patterns.
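The learning-followed-by-iterative-recall procedure just described can be illustrated with a small simulation, using binary ±1 patterns and a sign activation function as a simple stand-in for the non-linear activation function; the network size, number of patterns, and corruption level are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 5                          # neurons and stored patterns
patterns = rng.choice([-1.0, 1.0], size=(P, N))

# One-shot Hebbian storage on the recurrent collateral synapses
W = patterns.T @ patterns / N
np.fill_diagonal(W, 0.0)

# Recall cue: pattern 0 with 20% of its elements flipped (an incomplete, noisy cue)
cue = patterns[0].copy()
flip = rng.choice(N, size=N // 5, replace=False)
cue[flip] *= -1.0

# Iterative recall: each cycle forms the activations h_i (inner products of the
# firing vector with each neuron's weight vector) and applies the non-linearity
y = cue
for _ in range(10):
    y = np.sign(W @ y)

overlap = float(y @ patterns[0]) / N   # correlation with the stored pattern
assert overlap > 0.9                   # the network has completed the pattern
```

Each pass of the loop corresponds to one iteration round the recurrent collaterals, and the growing overlap is the progressive recall described above.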
Autoassociation networks operate by effectively storing associations between the elements of a pattern. Each element of the pattern vector to be stored is simply the firing of a neuron. What is stored in an autoassociation memory is a set of pattern vectors. The network operates to recall one of the patterns from a fragment of it. Thus, although this network implements recall or recognition of a pattern, it does so by an association learning mechanism, in which associations between the different parts of each pattern are learned. These memories have sometimes been called autocorrelation memories (Kohonen 1977), because they learn correlations between the activity of neurons in the network, in the sense that each pattern learned is defined by a set of simultaneously active neurons. Effectively each pattern is associated by learning with itself. This learning is implemented by an associative (Hebb-like) learning rule.
The internal recall in autoassociation networks involves multiplication of the firing vector of neuronal activity by the vector of synaptic weights on each neuron. This inner product vector multiplication allows the similarity of the firing vector to previously stored firing vectors to be provided by the output (as effectively a correlation), if the patterns learned are distributed. As a result of this type of ‘correlation computation’ performed if the patterns are distributed, many important properties of these networks arise, including pattern completion (because part of a pattern is correlated with the whole pattern), and graceful degradation (because a damaged synaptic weight vector is still correlated with the original synaptic weight vector). Some of these properties are described next.
A.3.3.1 Completion
One important and useful property of these memories is that they complete an incomplete input vector, allowing recall of a whole memory from a small fraction of it. The memory recalled in response to a fragment is that stored in the memory that is closest in pattern similarity (as measured by the dot product, or correlation). Because the recall is iterative and (p.487) progressive, the recall can be perfect.
A.3.3.2 Generalization
The network generalizes in that an input vector similar to one of the stored vectors will lead to recall of the originally stored vector, provided that distributed encoding is used. The principle by which this occurs is similar to that described for a pattern associator.
A.3.3.3 Graceful degradation or fault tolerance
If the synaptic weight vector wi on each neuron (or the weight matrix) has synapses missing (e.g. during development), or loses synapses (e.g. with brain damage or aging), then the activation hi (or vector of activations h) is still reasonable, because hi is the dot product (correlation) of yT with wi. The same argument applies if whole input axons are lost. If an output neuron is lost, then the network cannot itself compensate for this, but the next network in the brain is likely to be able to generalize or complete if its input vector has some elements missing, as would be the case if some output neurons of the autoassociation network were damaged.
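The robustness of the underlying dot-product computation can be seen directly in a small numerical sketch: deleting a fraction of the synapses on a neuron attenuates its activation in proportion, but the activation remains far larger than that produced by an unlearned pattern (the 30% deletion figure and network size are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
pattern = rng.choice([-1.0, 1.0], size=N)
w = pattern.copy()               # schematic weight vector storing the pattern

h_full = float(pattern @ w)      # activation with all synapses intact: equals N

w_damaged = w.copy()             # now delete 30% of the synapses at random
w_damaged[rng.choice(N, size=N * 30 // 100, replace=False)] = 0.0
h_damaged = float(pattern @ w_damaged)

h_unrelated = float(rng.choice([-1.0, 1.0], size=N) @ w_damaged)  # an unlearned pattern

assert h_full == 200.0
assert h_damaged == 140.0               # each surviving synapse still contributes +1
assert abs(h_unrelated) < h_damaged     # degradation is graceful, not catastrophic
```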
A.3.3.4 Speed
The recall operation is fast on each neuron on a single iteration, because the pattern yT on the axons can be applied simultaneously to the synapses wi, and the activation hi can be accumulated in one or two time constants of the dendrite (e.g. 10–20 ms). If a simple implementation of an autoassociation net such as that described by Hopfield (1982) is simulated on a computer, then 5–15 iterations are typically necessary for completion of an incomplete input cue e. This might be taken to correspond to 50–200 ms in the brain, rather too slow for any one local network in the brain to function. However, it transpires (see Rolls and Deco (2002), Treves (1993), Battaglia and Treves (1998), Appendix A5 of Rolls and Treves (1998), and Panzeri, Rolls, Battaglia and Lavis (2001)) that if the neurons are treated not as McCulloch-Pitts neurons which are simply ‘updated’ at each iteration, or cycle of time steps (and assume the active state if the threshold is exceeded), but instead are analysed and modelled as ‘integrate-and-fire’ neurons in real continuous time, then the network can effectively ‘relax’ into its recall state very rapidly, in one or two time constants of the synapses. This corresponds to perhaps 20 ms in the brain. One factor in this rapid dynamics of autoassociative networks with brain-like ‘integrate-and-fire’ membrane and synaptic properties is that with some spontaneous activity, some of the neurons in the network are close to threshold already before the recall cue is applied, and hence some of the neurons are very quickly pushed by the recall cue into firing, so that information starts to be exchanged very rapidly (within 1–2 ms of brain time) through the modified synapses by the neurons in the network.
The progressive exchange of information starting early on within what would otherwise be thought of as an iteration period (of perhaps 20 ms, corresponding to a neuronal firing rate of 50 spikes/s), is the mechanism accounting for rapid recall in an autoassociative neuronal network made biologically realistic in this way. Further analysis of the fast dynamics of these networks if they are implemented in a biologically plausible way with ‘integrate-and-fire’ neurons is provided in Appendix A5 of Rolls and Treves (1998), by Rolls and Deco (2002), and by Treves (1993).
Learning is fast, ‘one-shot’, in that a single presentation of an input pattern e (producing y) enables the association between the activation of the dendrites (the postsynaptic term hi) (p.488) and the firing of the recurrent collateral axons yT, to be learned. Repeated presentation with small variations of a pattern vector is used to obtain the properties of prototype extraction, extraction of central tendency, and noise reduction, because these arise from the averaging process produced by storing very similar patterns in the network.
A.3.3.5 Local learning rule
The simplest learning used in autoassociation neural networks, a version of the Hebb rule, is (as in equation A.13)
Evidence that a learning rule with the general form of equation A.13 is implemented in at least some parts of the brain comes from studies of long-term potentiation, described in Section A.1.5. One of the important potential functions of heterosynaptic long-term depression is its ability to allow in effect the average of the presynaptic activity to be subtracted from the presynaptic firing rate (see Appendix A3 of Rolls and Treves (1998) and Rolls and Treves (1990)).
A.3.3.6 Capacity
One measure of storage capacity is to consider how many orthogonal (i.e. uncorrelated) patterns could be stored, as with pattern associators. If the patterns are orthogonal, there will be no interference between them, and the maximum number p of patterns that can be stored will be the same as the number N of output neurons in a fully connected network.
With non-linear neurons used in the network, the capacity can be measured in terms of the number of input patterns y (produced by the external input e, see Fig. A.19) that can be stored in the network and recalled later whenever the network settles within each stored pattern's basin of attraction. The first quantitative analysis of storage capacity (Amit, Gutfreund and Sompolinsky 1987) considered a fully connected Hopfield (1982) autoassociator model, in which units are binary elements with an equal probability of being ‘on’ or ‘off’ in each pattern, and the number C of inputs per unit is the same as the number N of output units. (Actually it is equal to N−1, since a unit is taken not to connect to itself.) Learning is taken to occur by clamping the desired patterns on the network and using a modified Hebb rule, in (p.489) which the mean of the presynaptic and postsynaptic firings is subtracted from the firing on any one learning trial (this amounts to a covariance learning rule, and is described more fully in Appendix A4 of Rolls and Treves (1998)). With such fully distributed random patterns, the number of patterns that can be learned is (for C large) p ≈ 0.14C = 0.14N.
Treves and Rolls (1991) have extended this analysis to autoassociation networks that are much more biologically relevant in the following ways. First, some or many connections between the recurrent collaterals and the dendrites are missing (this is referred to as diluted connectivity, and results in a non-symmetric synaptic connection matrix in which wij does not equal wji, violating one of the original assumptions made in order to introduce the energy formalism in the Hopfield model). Second, the neurons need not be restricted to binary threshold neurons, but can have a threshold linear activation function (see Fig. A.3). This enables the neurons to assume real continuously variable firing rates, which are what is found in the brain (Rolls and Tovee 1995b, Treves, Panzeri, Rolls, Booth and Wakeman 1999). Third, the representation need not be fully distributed (with half the neurons ‘on’, and half ‘off’), but instead can have a small proportion of the neurons firing above the spontaneous rate, which is what is found in parts of the brain such as the hippocampus that are involved in memory (see Treves and Rolls (1994), and Chapter 6 of Rolls and Treves (1998)). Such a representation is defined as being sparse, and the sparseness a of the representation can be measured by extending the binary notion of the proportion of neurons that are firing, as
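The sparseness measure referred to here (equation A.18) can be reconstructed, following the form used by Treves and Rolls (1991), as, for the firing rates yi of the N neurons:

```latex
a = \frac{\left( \sum_{i=1}^{N} y_i / N \right)^{2}}{\sum_{i=1}^{N} y_i^{2} / N}
```

For a binary pattern in which a fraction of the neurons fire at some common rate and the remainder are silent, a reduces to exactly that fraction, recovering the binary notion of the proportion of neurons that are firing.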
The main factors that determine the maximum number of memories that can be stored in an autoassociative network are thus the number of connections on each neuron devoted to the recurrent collaterals, and the sparseness of the representation. For example, for C^RC = 12,000 and a = 0.02, the number of memories p is calculated to be approximately 36,000. This storage capacity can be realized, with little interference between patterns, if the learning rule includes some form of heterosynaptic long-term depression that counterbalances the effects of associative long-term potentiation (Treves and Rolls (1991); see Appendix A4 of Rolls and Treves (1998)). It should be noted that the number of neurons N (which is greater than C^RC, the number of recurrent collateral inputs received by any neuron in the network from the other neurons in the network) is not a parameter that influences the number of different memories that can be stored in the network. The implication of this is that increasing the number of neurons (p.490) (without increasing the number of connections per neuron) does not increase the number of different patterns that can be stored (see Rolls and Treves (1998) Appendix A4), although it may enable simpler encoding of the firing patterns, for example more orthogonal encoding, to be used. This latter point may account in part for why there are generally in the brain more neurons in a recurrent network than there are connections per neuron. Another advantage of having many neurons in the network may be related to the fact that within any integration time period of 20 ms not all neurons will have fired a spike if the average firing rate is less than 50 Hz. Having large numbers of neurons may enable the vector of neuronal firing to contribute to recall efficiently even though not every neuron can contribute in a short time period.
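The figure of approximately 36,000 follows from the Treves and Rolls (1991) capacity expression p ≈ k C^RC / (a ln(1/a)), in which k is a factor of order 0.2–0.3 that depends only weakly on the details of the network; the value k = 0.23 used below is an illustrative choice within that range:

```python
import math

def capacity(c_rc, a, k=0.23):
    """Approximate number p of storable patterns in a sparse autoassociator,
    following Treves and Rolls (1991): p ~ k * C / (a * ln(1/a)).
    k is a weakly varying factor of order 0.2-0.3 (0.23 is an illustrative choice)."""
    return k * c_rc / (a * math.log(1.0 / a))

p = capacity(12000, 0.02)
assert 30000 < p < 40000    # approximately 36,000, as quoted in the text
```

The expression also makes the point in the text explicit: p depends on C^RC and on the sparseness a, but not on the total number of neurons N.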
The non-linearity inherent in the NMDA receptor-based Hebbian plasticity present in the brain may help to make the stored patterns more sparse than the input patterns, and this may be especially beneficial in increasing the storage capacity of associative networks in the brain by allowing participation in the storage of especially those relatively few neurons with high firing rates in the exponential firing rate distributions typical of neurons in sensory systems (see Rolls and Treves (1998) and Rolls and Deco (2002)).
A.3.3.7 Context
The environmental context in which learning occurs can be a very important factor that affects retrieval in humans and other animals. Placing the subject back into the same context in which the original learning occurred can greatly facilitate retrieval.
Context effects arise naturally in association networks if some of the activity in the network reflects the context in which the learning occurs. Retrieval is then better when that context is present, for the activity contributed by the context becomes part of the retrieval cue for the memory, increasing the correlation of the current state with what was stored. (A strategy for retrieval arises simply from this property. The strategy is to keep trying to recall as many fragments of the original memory situation, including the context, as possible, as this will provide a better cue for complete retrieval of the memory than just a single fragment.)
The effects that mood has on memory including visual memory retrieval may be accounted for by backprojections from brain regions such as the amygdala in which the current mood, providing a context, is represented, to brain regions involved in memory such as the perirhinal cortex, and in visual representations such as the inferior temporal visual cortex (see Rolls and Stringer (2001b) and Section 4.10). The very well-known effects of context in the human memory literature could arise in the simple way just described. An implication of the explanation is that context effects will be especially important at late stages of memory or information processing systems in the brain, for there information from a wide range of modalities will be mixed, and some of that information could reflect the context in which the learning takes place. One part of the brain where such effects may be strong is the hippocampus, which is implicated in the memory of recent episodes, and which receives inputs derived from most of the cortical information processing streams, including those involved in spatial representations (see Chapter 6 of Rolls and Treves (1998), Rolls (1996b), and Rolls (1999c)).
It is now known that reward-related information is associated with place-related information in the primate hippocampus, and this provides a particular neural system in which mood context can influence memory retrieval (Rolls and Xiang 2004, Rolls and Xiang 2005a).
A.3.3.8 Memory for sequences
One of the first extensions of the standard autoassociator paradigm that has been explored in the literature is the capability to store and retrieve not just individual patterns, but whole (p.491) sequences of patterns. Hopfield (1982) suggested that this could be achieved by adding to the standard connection weights, which associate a pattern with itself, a new, asymmetric component, that associates a pattern with the next one in the sequence. In practice this scheme does not work very well, unless the new component is made to operate on a slower time scale than the purely autoassociative component (Kleinfeld 1986, Sompolinsky and Kanter 1986). With two different time scales, the autoassociative component can stabilize a pattern for a while, before the heteroassociative component moves the network, as it were, into the next pattern. The heteroassociative retrieval cue for the next pattern in the sequence is just the previous pattern in the sequence. A particular type of ‘slower’ operation occurs if the asymmetric component acts after a delay τ. In this case, the network sweeps through the sequence, staying for a time of order τ in each pattern.
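The two weight components described can be written down directly: a symmetric autoassociative term associating each pattern with itself, and an asymmetric heteroassociative term associating each pattern with its successor. A sketch of the weight construction for binary ±1 patterns follows; the delayed, slower operation of the asymmetric component, which is what allows controlled stepping, is not simulated here, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 200, 4
seq = rng.choice([-1.0, 1.0], size=(P, N))   # the sequence of patterns to be stored

W_auto = seq.T @ seq / N                      # symmetric term: each pattern with itself
np.fill_diagonal(W_auto, 0.0)
W_hetero = seq[1:].T @ seq[:-1] / N           # asymmetric term: pattern mu -> pattern mu+1

# Acting alone, as it would after the delay tau, the asymmetric component
# maps each stored pattern onto the next pattern in the sequence
y_next = np.sign(W_hetero @ seq[0])
assert float(np.mean(y_next == seq[1])) > 0.95
```

During operation, W_auto holds the network in the current pattern while the delayed contribution through W_hetero supplies the retrieval cue for the next pattern, as described above.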
If implemented with integrate-and-fire neurons with biologically plausible dynamics, this type of sequence memory will either step through its remembered sequence with uncontrollable speed, or not step through the sequence. A proposal that attractor networks with adapting synapses could be used to retain memory sequences is an interesting alternative (Deco and Rolls 2005c).
A.4 Coupled attractor networks
In this Section A.4 an introduction to how attractor networks can interact is given, as this may be relevant to understanding how mood states influence cognitive processing, and vice versa.
It is prototypical of the cerebral neocortical areas that there are recurrent collateral connections between the neurons within an area or module, and forward connections to the next cortical area in the hierarchy, which in turn sends backprojections (see Rolls and Deco (2002)). This architecture, made explicit in Fig. 4.66 on page 199, immediately suggests, given that the recurrent connections within a module, and the forward and backward connections, are likely to be associatively modifiable, that the operation incorporates at least to some extent interactions between coupled attractor (autoassociation) networks. For these reasons, it is important to analyse the rules that govern the interactions between coupled attractor networks. This has been done using the formal type of model described in Section 12.1.2 of Rolls and Deco (2002) and introduced here (see also Renart, Parga and Rolls (1999b), Renart, Parga and Rolls (1999a), Renart, Parga and Rolls (2000), Renart, Moreno, Rocha, Parga and Rolls (2001), and Deco and Rolls (2003)).
One boundary condition is when the coupling between the networks is so weak that there is effectively no interaction. This holds when the coupling parameter g between the networks is less than approximately 0.002, where the coupling parameter indicates the relative strength of the intermodular to the intramodular connections, and measures effectively the relative strengths of the currents injected into the neurons by the inter-modular relative to the intramodular (recurrent collateral) connections (Renart, Parga and Rolls 1999b). At the other extreme, if the coupling parameter is strong, all the networks will operate as a single attractor network, together able to represent only one state (Renart, Parga and Rolls 1999b). This critical value of the coupling parameter (at least for reciprocally connected networks with symmetric synaptic strengths) is relatively low, in the region of 0.024 (Renart, Parga and Rolls 1999b). This is one reason why cortico-cortical backprojections are predicted to be quantitatively relatively weak, and it is for this reason, it is suggested, that they end on the apical parts of the dendrites of cortical pyramidal cells (see Section 1.11 of Rolls and Deco (2002)). In (p.492) the strongly coupled regime when the system of networks operates as a single attractor, the total storage capacity (the number of patterns that can be stored and correctly retrieved) of all the networks will be set just by the number of synaptic connections received from other neurons in the network, a number in the order of a few thousand. This is one reason why connected cortical networks are thought not to act in the strongly coupled regime, because the total number of memories that could be represented in the whole of the cerebral cortex would be so small, in the order of a few thousand, depending on the sparseness of the patterns (see equation A.18) (O'Kane and Treves 1992).
Between these boundary conditions, that is in the region where the inter-modular coupling parameter g is in the range 0.002–0.024, it has been shown that interesting interactions can occur (Renart, Parga and Rolls 1999b, Renart, Parga and Rolls 1999a). In a bimodular architecture, with forward and backward connections between the modules, the capacity of one module can be increased, and an attractor is more likely to be found under noisy conditions, if there is a consistent pattern in the coupled attractor. By consistent we mean a pattern that during training was linked associatively by the forward and backward connections with the pattern being retrieved in the first module. This provides a quantitative model for understanding some of the effects that backprojections can produce by supporting particular states in earlier cortical areas (Renart, Parga and Rolls 1999b). The total storage capacity of the two networks is, however, in line with O'Kane and Treves (1992), not a great deal greater than the storage capacity of one of the modules alone. Thus the help provided by the attractors in falling into a mutually compatible global retrieval state (in e.g. the scenario of a hierarchical system) is where the utility of such coupled attractor networks must lie. Another interesting application of such weakly coupled attractor networks is in coupled perceptual and short-term memory systems in the brain, described in Section 12.1 of Rolls and Deco (2002). Thus the most interesting scenario for coupled attractor networks is when they are weakly coupled, for then interactions occur whereby how well one module responds to its own inputs can be influenced by the states of the other modules, but it can retain partly independent representations. This emphasizes the importance of weak interactions between coupled modules in the brain (Renart, Parga and Rolls 1999b, Renart, Parga and Rolls 1999a, Renart, Parga and Rolls 2000).
If a multimodular architecture is trained with each of many patterns (which might be visual stimuli) in one module associated with one of a few patterns (which might be mood states) in a connected module, then interesting effects due to this asymmetry are found, as described in Section 4.10 and by Rolls and Stringer (2001b).
An interesting issue that arises is how rapidly a system of interacting attractor networks such as that illustrated in Fig. 4.66 settles into a stable state. Is it sufficiently rapid for the interacting attractor effects described to contribute to cortical information processing? It is likely that the settling of the whole system is quite rapid, if it is implemented (as it is in the brain) with synapses and neurons that operate with continuous dynamics, where the time constant of the synapses dominates the retrieval speed, and is in the order of 15 ms for each module, as described in Section 7.6 of Rolls and Deco (2002) and by Panzeri, Rolls, Battaglia and Lavis (2001). It is shown there that a multimodular attractor network architecture can process information in approximately 15 ms per module (assuming an inactivation time constant for the synapses of 10 ms).
Attractor networks can be coupled together with stronger forward than backward connections. This provides a model of how the prefrontal cortex could map sensory inputs (in one attractor), through intermediate attractors that respond to combinations of sensory inputs and (p.493) the behavioural responses being made, to further attractors that encode the response to be made (Deco and Rolls 2003). Having attractors at each stage enables the prefrontal cortex to bridge delays between parts of a task. The hierarchical organization of the attractors achieved by the stronger forward than backward connections enables the mapping to be from sensory input to motor output. The presence of intermediate attractors with neurons that respond to combinations of the stimuli and the behavioural responses to be made allows a top-down attentional input to bias the competition implemented by the intermediate level attractors to enable the behaviour to be switched from one cognitive mapping to another (Deco and Rolls 2003). The whole architecture has been modelled at the integrate-and-fire neuronal level, and simulates the activity of the different populations of neurons just described which are types of neuron recorded in the prefrontal cortex when monkeys are performing this decision task (see Deco and Rolls (2003)).
The cortico-cortical backprojection connectivity described can be interpreted as a system that allows the forward-projecting neurons in one cortical area to be linked autoassociatively with the backprojecting neurons in the next cortical area (see Fig. 4.66 and Rolls and Deco (2002)). It is interesting to note that if the forward and backprojection synapses were associatively modifiable, but there were no recurrent connections in each of the modules, then the whole system could still operate (with the right parameters) as an attractor network.
A.5 Reinforcement learning
In supervised networks, an error signal is provided for each output neuron in the network, and whenever an input to the network is provided, the error signals specify the magnitude and direction of the error in the output produced by each neuron. These error signals are then used to correct the synaptic weights in the network in such a way that the output errors for each input pattern to be learned gradually diminish over trials (see Rolls and Deco (2002)). These networks have an architecture that might be similar to that of the pattern associator shown in Fig. A.7, except that instead of an unconditioned stimulus, there is an error correction signal provided for each output neuron. Such a network trained by an error-correcting (or delta) rule is known as a one-layer perceptron. The architecture is not very plausible for most brain regions, in that it is not clear how an individual error signal could be computed for each of thousands of neurons in a network, and fed into each neuron as its error signal and then used in a delta rule synaptic correction (see Rolls and Treves (1998) and Rolls and Deco (2002)).
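A one-layer network trained with an error-correcting (delta) rule can be sketched as follows, with a separate error signal (target minus output) supplied to each output neuron. The targets here are generated from a known linear mapping purely so that the problem is learnable; all sizes and rates are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_out = 8, 3
X = rng.choice([0.0, 1.0], size=(20, n_in))    # 20 input patterns
W_true = rng.normal(size=(n_out, n_in))        # hidden mapping, used only to make
T = X @ W_true.T                               # targets that are actually learnable

W = np.zeros((n_out, n_in))
alpha = 0.1                                    # learning rate (illustrative)
for _ in range(500):                           # errors diminish gradually over trials
    for x, t in zip(X, T):
        y = W @ x                              # linear output for simplicity
        W += alpha * np.outer(t - y, x)        # delta rule: per-neuron error times input

mse = float(np.mean((X @ W.T - T) ** 2))
assert mse < 1e-3                              # the output errors have become small
```

The biologically problematic step, as the text notes, is that each row of the update requires its own externally supplied error term t − y for that particular output neuron.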
The architecture can be generalized to a multilayer feedforward architecture with many layers between the input and output (Rumelhart, Hinton and Williams 1986), but the learning is very non-local and rather biologically implausible (see Rolls and Treves (1998) and Rolls and Deco (2002)), in that an error term (magnitude and direction) for each neuron in the network must be computed from the errors and synaptic weights of all subsequent neurons in the network that any neuron influences, usually on a trial-by-trial basis, by a process known as error backpropagation. Thus, although computationally powerful, perceptrons and multilayer perceptrons are generally biologically implausible for many brain regions, because a separate error signal must be supplied for each output neuron, and because with multilayer perceptrons, computed error backpropagation must occur.
When operating in an environment, usually a simple binary or scalar signal representing success or failure of the whole network or organism is received. This is usually action-dependent feedback that provides a single evaluative measure of the success or failure. Evaluative (p.494)
A class of problems to which such reinforcement networks might be applied is motor-control problems. It was to such a problem that Barto and Sutton (Barto 1985, Sutton and Barto 1981) applied a reinforcement learning algorithm, the associative reward–penalty algorithm described next. The algorithm can in principle be applied to multilayer networks, although the learning is relatively slow. The algorithm is summarized by Rolls and Treves (1998) and by Hertz, Krogh and Palmer (1991). More recent developments in reinforcement learning are described by Sutton and Barto (1998) and reviewed by Dayan and Abbott (2001), and some of these developments are described in Section A.5.3.
A.5.1 Associative reward–penalty algorithm of Barto and Sutton
The terminology of Barto and Sutton is followed here (see Barto (1985)).
The architecture, shown in Fig. A.20, uses a single reinforcement signal r, which is +1 for reward and –1 for penalty. The inputs xi take real (continuous) values. The output of a neuron, y, is binary, +1 or –1. The weights on the output neuron are designated wi.
(p.495) A.5.1.2 Operation
1. An input vector is applied to the network, and produces activation, h, in the normal way as follows:
2. The output y is calculated from the activation with a noise term η included. The principle of the network is that if the added noise on a particular trial helps performance, then whatever change it leads to should be incorporated into the synaptic weights, in such a way that the next time that input occurs, the performance is improved.
3. Learning rule. The weights are changed as follows:
This network combines an associative capacity with its properties of generalization and graceful degradation, with a single ‘critic’ or error signal for the whole network (Barto 1985). [The term y − E[y|h] in Equation A.21 can be thought of as an error for the output of the neuron: it is the difference between what occurred, and what was expected to occur. The synaptic weight is adjusted according to the sign and magnitude of the error of the postsynaptic firing, multiplied by the presynaptic firing, and depending on the reinforcement r received. The rule is similar to a Hebb synaptic modification rule (Equation A.6), except that the postsynaptic term is an error instead of the postsynaptic firing rate, and the learning is modulated by the reinforcement.] The network can solve difficult problems (such as balancing a pole by moving a trolley that supports the pole from side to side, as the pole starts to topple). Although described for single-layer networks, the algorithm can be applied to multilayer networks. The learning is very slow, for there is a single reinforcement signal on each trial for the whole network, not a separate error signal for each neuron in the network as is the case in a perceptron trained with an error rule (see Rolls and Deco (2002) and Rolls and Treves (1998)).
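A sketch of the associative reward–penalty scheme on a toy task follows. With standard logistic noise added to the activation, E[y|h] = tanh(h/2); the task, the learning rates, and the use of a smaller rate on penalty trials are illustrative assumptions in the spirit of Barto (1985), not a reproduction of his simulations:

```python
import numpy as np

rng = np.random.default_rng(4)

def expected_output(h):
    # With standard logistic noise added to h, E[y|h] = tanh(h/2)
    return np.tanh(h / 2.0)

n_in = 4
w = np.zeros(n_in)
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
desired = np.array([1.0, -1.0])   # the environment rewards these outputs

rho, lam = 0.5, 0.05              # learning rates; penalty trials use the smaller rate
for _ in range(200):
    for x, d in zip(X, desired):
        h = w @ x
        y = 1.0 if h + rng.logistic() > 0 else -1.0   # noisy binary output
        r = 1.0 if y == d else -1.0                   # single scalar reinforcement
        if r > 0:
            w += rho * (y - expected_output(h)) * x        # reward: reinforce what occurred
        else:
            w += lam * rho * (-y - expected_output(h)) * x # penalty: move away from it

# After learning, the noise-free output is correct for both input patterns
assert all(np.sign(X @ w) == desired)
```

The noise term plays the role described in the text: when a chance fluctuation produces a rewarded output, the weight change makes that output more likely the next time the input occurs.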
This associative reward–penalty reinforcement-learning algorithm is certainly a move towards biological relevance, in that learning with a single reinforcer can be achieved. That (p.496) single reinforcer might be broadcast throughout the system by a general projection system. It is not clear yet how a biological system might store the expected output E[y|h] for comparison with the actual output when noise has been added, and might take into account the sign and magnitude of this difference. Nevertheless, this is an interesting algorithm, which is related to the temporal difference reinforcement learning algorithm described in Section A.5.3.
A.5.2 Error correction or delta rule learning, and classical conditioning
In classical or Pavlovian associative learning, a number of different types of association may be learned (see Section 220.127.116.11). This type of associative learning may be performed by networks with the general architecture and properties of pattern associators (see Section A.2 and Fig. A.7). However, the time course of the acquisition and extinction of these associations can be expressed concisely by a modified type of learning rule in which an error correction term is used (introduced in Section A.5.1), rather than the postsynaptic firing y itself as in Equation A.6. Use of this modified, error correction, type of learning also enables some of the properties of classical conditioning to be explained (see Dayan and Abbott (2001) for review), and this type of learning is therefore described briefly here. The rule is known in learning theory as the Rescorla–Wagner rule, after Rescorla and Wagner (1972).
The Rescorla–Wagner rule is a version of error correction or delta-rule learning, and is based on a simple linear prediction of the expected reward value, denoted by v, associated with a stimulus representation x (x = 1 if the stimulus is present, and x = 0 if the stimulus is absent). The expected reward value v is expressed as the input stimulus variable x multiplied by a weight w,

v = w x.

On each trial the weight is updated in proportion to the prediction error δ, the difference between the reward r actually delivered and the reward v that was predicted:

δ = r − v,
w → w + ε δ x,

where ε is a learning-rate constant. When the prediction becomes accurate, δ approaches 0 and the weight stops changing, which captures the negatively accelerated acquisition curve of classical conditioning; when an expected reward is omitted, δ is negative and the association extinguishes.
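The acquisition and extinction time courses produced by the Rescorla–Wagner rule can be sketched in a few lines. The learning rate ε = 0.1 and the trial counts are illustrative choices, not values from the text.

```python
epsilon = 0.1   # learning rate (illustrative)
w = 0.0         # association strength, initially zero

# Acquisition: CS present (x = 1) and reward delivered (r = 1)
for trial in range(100):
    x, r = 1.0, 1.0
    v = w * x               # predicted reward value
    delta = r - v           # Rescorla-Wagner prediction error
    w += epsilon * delta * x

w_acq = w   # approaches the asymptote 1 with negative acceleration

# Extinction: CS present but reward omitted (r = 0)
for trial in range(100):
    x, r = 1.0, 0.0
    v = w * x
    delta = r - v           # now negative: less reward than predicted
    w += epsilon * delta * x

w_ext = w   # decays back towards 0
```

The same code with several stimuli (a weight vector w and input vector x) reproduces phenomena such as blocking, because stimuli share a single prediction error.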
How this functionality is implemented in the brain is not yet clear. We consider one suggestion (Schultz et al. 1995b, Schultz 2004) after we introduce a further sophistication of (p.497) reinforcement learning which allows the time course of events within a trial to be taken into account.
A.5.3 Temporal Difference (TD) learning
An important advance in the area of reinforcement learning was the introduction of algorithms that allow for learning to occur when the reinforcement is delayed or received over a number of time steps, and which allow effects within a trial to be taken into account (Sutton and Barto 1998, Sutton and Barto 1990). A solution to these problems is the addition of an adaptive critic that learns through a temporal difference (TD) algorithm how to predict the future value of the reinforcer. The temporal difference algorithm takes into account not only the current reinforcement just received, but also a temporally weighted average of errors in predicting future reinforcements. The temporal difference error is the extent to which any two temporally adjacent predictions of reward are inconsistent (see Barto (1995)). The output of the critic is used as an effective reinforcer instead of the instantaneous reinforcement being received (see Sutton and Barto (1998), Sutton and Barto (1990) and Barto (1995)). This is a solution to the temporal credit assignment problem, and enables future rewards to be predicted. Summaries are provided by Doya (1999), Schultz et al. (1997) and Dayan and Abbott (2001).
In reinforcement learning, a learning agent takes an action u(t) in response to the state x(t) of the environment, which results in a change of the state to x(t + 1) and in the delivery of a reward r(t).
The goal is to find a policy function G which maps sensory inputs x to actions, u(t) = G(x(t)), in such a way that the cumulative future reward is maximized.
The current action u(t) affects all future states and accordingly all future rewards. The maximization is realized by the use of the value function V of the states to predict, given the sensory inputs x, the cumulative sum (possibly discounted as a function of time) of all future rewards V(x) (possibly within a learning trial) as follows:

V(x(t)) = E[r(t) + γ r(t + 1) + γ² r(t + 2) + …],

where γ ≤ 1 is a discount factor that weights rewards expected further in the future less strongly, and E[·] denotes the expected value.
The basic algorithm for learning the value function is to minimize the temporal difference (TD) error Δ(t) for time t within a trial, and this is computed by a ‘critic’ for the estimated value predictions V̂(x(t)) at successive time steps as

Δ(t) = r(t) + γ V̂(x(t + 1)) − V̂(x(t)).    (A.29)
Δ(t) is used to improve the estimates V̂(x(t)) by the ‘critic’, and can also be used (by an ‘actor’) to choose appropriate actions.
For example, when the value function is represented (in the critic) as a weighted sum of the sensory inputs,

V̂(x(t)) = Σj wj xj(t),

the TD error itself can serve as the teaching signal for the weights:

wj → wj + α Δ(t) xj(t),

where α is a learning-rate constant.
A simple way of improving the policy of the actor is to take a stochastic action, for example u(t) = G(x(t)) + n(t) where n(t) is a noise term, and then to make those actions that are followed by a positive TD error Δ(t), that is by more reward than predicted, more likely to be selected in future.
(p.499) Thus, the TD error Δ(t), which signals the error in the reward prediction at time t, works as the main teaching signal in both learning the value function (implemented in the critic), and the selection of actions (implemented in the actor). The usefulness of a separate critic is that it enables the TD error to be calculated based on the difference in reward value predictions at two successive time steps as shown in Equation A.29.
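The interplay of critic and actor described above can be sketched for a minimal one-step decision task, in which the same TD error trains both the critic's value estimate and the actor's action preferences. The task (one rewarded action out of two), the softmax action-selection rule, and all parameter values are illustrative assumptions, not details from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_critic = 0.1
alpha_actor = 0.5
V = 0.0                     # critic's value estimate for the single state
pref = np.zeros(2)          # actor's action preferences

rewards = {0: 0.0, 1: 1.0}  # action 1 is the rewarded action (assumed task)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for trial in range(2000):
    p = softmax(pref)
    a = rng.choice(2, p=p)      # stochastic action selection by the actor
    r = rewards[a]
    # The trial ends after one step, so the next-state value term is 0:
    delta = r + 0.0 - V         # TD error of Equation A.29
    V += alpha_critic * delta           # critic update
    pref[a] += alpha_actor * delta      # actor update: same teaching signal
```

After training, the actor selects the rewarded action with high probability and the critic's prediction V approaches the reward obtained, so that the TD error, and hence further learning, vanishes.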
The algorithm has been applied to modelling the time course of classical conditioning (Sutton and Barto 1990). The algorithm effectively allows the future reinforcement predicted from past history to influence the responses made, and in this sense allows behaviour to be guided not just by immediate reinforcement, but also by ‘anticipated’ reinforcements. Different types of temporal difference learning are described by Sutton and Barto (1998). An application is to the analysis of decisions when future rewards are discounted with respect to immediate rewards (Dayan and Abbott 2001, Tanaka, Doya, Okada, Ueda, Okamoto and Yamawaki 2004). Another application is to the learning of sequences of actions to take within a trial (Suri and Schultz 1998).
The possibility that dopamine neuron firing may provide an error signal useful in training neuronal systems to predict reward has been discussed in Section 8.3.4. It has been proposed that the firing of the dopamine neurons can be thought of as an error signal about reward prediction, in that early in training the firing occurs when a reward is given, but then, as learning proceeds, transfers within the trial to the earlier time at which a stimulus is presented that can be used to predict when the taste reward will be obtained (Schultz et al. 1995b) (see Fig. 8.5). The argument is that there is no prediction error when the taste reward is obtained if it has been signalled by a preceding conditioned stimulus, and that is why the midbrain dopamine neurons do not respond at the time of taste reward delivery, but instead, at least during training, to the onset of the conditioned stimulus (Waelti, Dickinson and Schultz 2001). If a different conditioned stimulus is shown that normally predicts that no taste reward will be given, there is no firing of the dopamine neurons to the onset of that conditioned stimulus.
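This transfer of the error signal from the time of reward delivery to the onset of the conditioned stimulus falls out of temporal difference learning itself, as a small simulation can show. The interval length, discount factor, and learning rate are illustrative, and the pre-CS prediction is held at zero on the assumption that CS onset is unpredictable from earlier events.

```python
import numpy as np

gamma = 0.98   # discount factor (illustrative)
alpha = 0.1    # critic learning rate (illustrative)
T = 5          # time steps from CS onset to reward delivery

V = np.zeros(T + 1)    # value prediction at each time step after CS onset
delta_at_cs = []
delta_at_reward = []

for trial in range(1000):
    # CS onset is unpredictable, so the preceding prediction is 0 and the
    # TD error at CS onset is the newly available prediction gamma * V[0]:
    delta_at_cs.append(gamma * V[0] - 0.0)
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0   # reward at the end of the interval
        delta = r + gamma * V[t + 1] - V[t]
        if t == T - 1:
            delta_at_reward.append(delta)
        V[t] += alpha * delta
```

On early trials the TD error is large at reward delivery and zero at CS onset; after training the pattern reverses, mirroring the reported behaviour of the dopamine neurons.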
This hypothesis has been built into models of learning in which the error signal is used to train synaptic connections in dopamine pathway recipient regions (such as presumably the striatum and orbitofrontal cortex) (Houk et al. 1995, Schultz 2004, Schultz et al. 1997, Waelti et al. 2001, Dayan and Abbott 2001). Some difficulties with the hypothesis are discussed in Section 8.3.4. The difficulties include the fact that dopamine is released in large quantities by aversive stimuli (see Section 8.3.4); that error computations for differences between the expected reward and the actual reward received on a trial are computed in the primate orbitofrontal cortex, where expected reward, actual reward, and error neurons are all found, and lesions of which impair the ability to use changes in reward contingencies to reverse behaviour (see Section 18.104.22.168); that the tonic, sustained, firing of the dopamine neurons in the delay period of a task with probabilistic rewards reflects reward uncertainty, and not the expected reward, nor the magnitude of the prediction error (see Section 8.3.4 and Shizgal and Arvanitogiannis (2003)); and that reinforcement learning is suited to setting up connections that might be required in fixed tasks such as motor habit or sequence learning, for reinforcement learning algorithms seek to set weights correctly in an ‘actor’, but are not suited to tasks where rules must be altered flexibly, as in rapid one-trial reversal, for which a very different type of mechanism is described in Appendix B.
Overall, reinforcement learning algorithms are certainly a move towards biological relevance, in that learning with a single reinforcer can be achieved in systems that might learn motor habits or fixed sequences. Whether a single prediction error is broadcast throughout a (p.500) neural system by a general projection system, such as the dopamine pathways in the brain, which distribute to large parts of the striatum and the prefrontal cortex, remains to be clearly established.
(48) In fact, the terms in which Hebb put the hypothesis were a little different from an association memory, in that he stated that if one neuron regularly comes to elicit firing in another, then the strength of the synapses should increase. He had in mind the building of what he called cell assemblies. In a pattern associator, the conditioned stimulus need not, before learning, produce any significant activation of the output neurons. The connections must simply increase if there is associated pre- and postsynaptic firing when, in pattern association, most of the postsynaptic firing is being produced by a different input.
(49) See Appendix 1 of Rolls and Deco (2002). There is no set of synaptic weights in a one-layer net that could solve the problem shown in Fig. A.16. Two classes of patterns are not linearly separable if no hyperplane can be positioned in their N-dimensional space so as to separate them (see Appendix 1 of Rolls and Deco (2002)). The XOR problem has the additional constraint that A = 0, B = 0 must be mapped to Output = 0.
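The non-separability of XOR can be checked directly: a single-layer perceptron trained with the perceptron learning rule classifies all four OR patterns correctly, but can never get all four XOR patterns right, since no hyperplane separates the two XOR classes. The epoch count and use of the classical perceptron rule are illustrative choices.

```python
import numpy as np

def train_perceptron(targets, epochs=100):
    """Single-layer perceptron on the four binary input patterns."""
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    w = np.zeros(2)
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, targets):
            y = 1.0 if (w @ x + b) > 0 else 0.0
            w += (t - y) * x        # perceptron learning rule
            b += (t - y)
    y = np.array([1.0 if (w @ x + b) > 0 else 0.0 for x in X])
    return int((y == targets).sum())  # patterns classified correctly

n_or  = train_perceptron(np.array([0, 1, 1, 1]))  # OR: linearly separable
n_xor = train_perceptron(np.array([0, 1, 1, 0]))  # XOR: not separable
```

The perceptron convergence theorem guarantees a solution for OR; for XOR the weights cycle indefinitely and at least one pattern is always misclassified.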