Appendix 1 Neural networks and emotion-related learning
A.1 Neurons in the brain, the representation of information, and neuronal learning mechanisms
A.1.1 Introduction
In Chapters 3 and 4, the type of learning that is important in learned emotional responses was characterized as stimulus–reinforcer association learning. This is a particular case of pattern-association learning, in which the to-be-associated or conditioned stimulus is the potential secondary reinforcer, and the unconditioned stimulus is the primary reinforcer (see Fig. 4.5). In Chapter 4, it was indicated that many of the properties required of emotional learning (e.g. generalization and graceful degradation) arise in pattern associators if the correct type of distributed representation is present (Section 4.4). In this Appendix the relevant properties of biologically plausible pattern-association memories (such as may be present in the orbitofrontal cortex and amygdala and used for stimulus–reinforcer association learning) are presented more formally, to provide a foundation for research into the neural basis of emotional learning. In Section A.3 an introduction to autoassociation or attractor networks is given, as this type of network is involved in decision-making, memory, attention, and in maintaining mood states. In Section A.4 an introduction to how attractor networks can interact is given, as this may be relevant to understanding how mood states influence cognitive processing, and vice versa. A fuller analysis of these neural networks, and of other neural networks that, for example, by competitive learning build representations of sensory stimuli, is provided by Rolls & Treves (1998) in Neural Networks and Brain Function, by Rolls & Deco (2002) in Computational Neuroscience of Vision, and by Rolls (2008b) in Memory, Attention, and Decision-Making.
Before starting the description of pattern-association neuronal networks, a brief review is provided of the evidence on synaptic plasticity, and of the rules by which synaptic strength is modified, much of it based on studies of long-term potentiation.
After describing pattern-association and autoassociation neural networks, an overview of another learning algorithm, called reinforcement learning, which might be relevant to learning in systems that receive rewards and punishers and which has been supposed to be implemented using the dopamine pathways (Barto 1995, Schultz et al. 1995b, Houk et al. 1995, Schultz 2013), is provided in Section A.5.
A.1.2 Neurons in the brain, and their representation in neuronal networks
Neurons in the vertebrate brain typically have, extending from the cell body, large dendrites which receive inputs from other neurons through connections called synapses. The synapses operate by chemical transmission. When a synaptic terminal receives an all-or-nothing action potential from the neuron of which it is a terminal, it releases a transmitter that crosses the synaptic cleft and produces either depolarization or hyperpolarization in the postsynaptic neuron, by opening particular ionic channels. (A textbook such as Kandel, Schwartz, Jessell, Siegelbaum & Hudspeth (2012) gives further information on this process.) Summation of a number of such depolarizations or excitatory inputs within the time constant of the receiving neuron, which is typically 15–25 ms, produces sufficient depolarization that the neuron fires an action potential. There are often 5,000–20,000 inputs per neuron. Examples of cortical neurons are shown in Fig. A.1, and further examples are shown elsewhere (Rolls 2008b, Shepherd 2004, Shepherd & Grillner 2010).
Once firing is initiated in the cell body (or axon initial segment of the cell body), the action potential is conducted in an all-or-nothing way to reach the synaptic terminals of the neuron, whence it may affect other neurons. Any inputs the neuron receives that cause it to become hyperpolarized make it less likely to fire (because the membrane potential is moved away from the critical threshold at which an action potential is initiated), and are described as inhibitory. The neuron can thus be thought of in a simple way as a computational element that sums its inputs within its time constant and, whenever this sum, minus any inhibitory effects, exceeds a threshold, produces an action potential that propagates to all of its outputs. This simple idea is incorporated in many neuronal network models using a formalism of a type described in the next Section.
A.1.3 A formalism for approaching the operation of single neurons in a network
Let us consider a neuron i as shown in Fig. A.2, which receives inputs from axons that we label j through synapses of strength ${w}_{ij}$.
The first subscript (i) refers to the receiving neuron, and the second subscript (j) to the particular input. j counts from 1 to C, where C is the number of synapses or connections received by the neuron. The firing rate of the ith neuron is denoted as ${y}_{i}$, and that of the jth input to the neuron as ${x}_{j}$. To express the idea that the neuron makes a simple linear summation of the inputs it receives, we can write the activation of neuron i, denoted ${h}_{i}$, as

${h}_{i}=\sum _{j=1}^{C}{x}_{j}{w}_{ij}$     (A.1)

where $\sum _{j=1}^{C}$ indicates that the sum is over the C input axons (or synaptic connections) indexed by j to each neuron. The multiplicative form here indicates that activation should be produced by an axon only if it is firing, and depending on the strength of the synapse ${w}_{ij}$ from input axon j onto the dendrite of the receiving neuron i. Equation A.1 indicates that the strength of the activation reflects how fast each axon j is firing (that is ${x}_{j}$), and how strong its synapse ${w}_{ij}$ is. The sum of all such activations expresses the idea that summation (of synaptic currents in real neurons) occurs along the length of the dendrite, to produce activation at the cell body, where the activation ${h}_{i}$ is converted into firing ${y}_{i}$. This conversion can be expressed as

${y}_{i}=f({h}_{i})$     (A.2)
which indicates that the firing rate is a function (f) of the activation. The function is called the activation function in this case. (The activation is equivalent to the depolarization of the neuron measured electrophysiologically.) The function at its simplest could be linear, so that the firing rate would be proportional to the activation (see Fig. A.3). Real neurons have thresholds, with firing occurring only if the activation is above the threshold. A threshold linear activation function is shown in Fig. A.3b.
This has been useful in formal analysis of the properties of neural networks. Neurons also have firing rates that become saturated at a maximum rate, and we could express this as the sigmoid activation function shown in Fig. A.3c. Another simple activation function, used in some models of neural networks, is the binary threshold function (Fig. A.3d), which indicates that if the activation is below threshold, there is no firing, and that if the activation is above threshold, the neuron fires maximally. Some nonlinearity in the activation function is an advantage, for it enables many useful computations to be performed in neuronal networks, including removing interfering effects of similar memories, and enabling neurons to perform logical operations, such as firing only if several inputs are present simultaneously.
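The linear summation of Equation A.1 and the activation functions of Fig. A.3 can be sketched in a few lines of code. This is an illustrative sketch only; the function names and the parameter values are our own choices, not taken from the text:

```python
import numpy as np

def activation(x, w):
    """Linear summation of presynaptic rates x weighted by synaptic
    strengths w, as in Equation A.1: h_i = sum_j x_j * w_ij."""
    return np.dot(w, x)

def threshold_linear(h, threshold=0.0):
    """Fire in proportion to activation above a threshold (Fig. A.3b)."""
    return np.maximum(h - threshold, 0.0)

def sigmoid(h, beta=1.0):
    """Saturating activation function (Fig. A.3c)."""
    return 1.0 / (1.0 + np.exp(-beta * h))

def binary_threshold(h, threshold=0.0):
    """All-or-nothing firing (Fig. A.3d)."""
    return (h > threshold).astype(float)

x = np.array([5.0, 0.0, 10.0])   # presynaptic firing rates x_j
w = np.array([0.2, 0.9, 0.1])    # synaptic weights w_ij of neuron i
h = activation(x, w)             # h_i = 5*0.2 + 0*0.9 + 10*0.1 = 2.0
y = threshold_linear(h, threshold=0.5)   # firing rate y_i = 1.5
```

Note that only the firing axons (here the first and third) contribute to the activation, as the multiplicative form of Equation A.1 requires.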
A property implied by Equation A.1 is that the postsynaptic membrane is electrically short, and so summates its inputs irrespective of where on the dendrite the input is received. In real neurons, the transduction of current into firing frequency (the analogue of the transfer function of Equation A.2) is generally studied not with synaptic inputs but by applying a steady current through an electrode into the soma. Examples of the resulting curves, which illustrate the additional phenomenon of firing rate adaptation, are reproduced in Fig. A.4.
A.1.4 Synaptic modification
For a neuronal network to perform useful computation, that is to produce a given output when it receives a particular input, the synaptic weights must be set up appropriately. This is often performed by synaptic modification occurring during learning.
A simple learning rule that was originally presaged by Donald Hebb (1949) proposes that synapses increase in strength when there is conjunctive presynaptic and postsynaptic activity. The Hebb rule can be expressed more formally as follows:

$\mathrm{\delta}{w}_{ij}=\mathrm{\alpha}\,{y}_{i}{x}_{j}$
where $\mathrm{\delta}{w}_{ij}$ is the change of the synaptic weight ${w}_{ij}$ that results from the simultaneous (or conjunctive) presence of presynaptic firing ${x}_{j}$ and postsynaptic firing ${y}_{i}$ (or strong depolarization), and $\mathrm{\alpha}$ is a learning rate constant that specifies how much the synapses alter on any one pairing. The presynaptic and postsynaptic activity must be present approximately simultaneously (to within perhaps 100–500 ms in the real brain).
The Hebb rule is expressed in this multiplicative form to reflect the idea that both presynaptic and postsynaptic activity must be present for the synapses to increase in strength. The multiplicative form also reflects the idea that strong pre- and postsynaptic firing will produce a larger change of synaptic weight than smaller firing rates. The Hebb rule thus captures what is typically found in studies of associative Long-Term Potentiation (LTP) in the brain, described in Section A.1.5.
One useful property of large neurons in the brain, such as cortical pyramidal cells, is that with their short electrical length, the postsynaptic term, ${y}_{i}$, is available on much of the dendrite of a cell. The implication of this is that once sufficient postsynaptic activation has been produced, any active presynaptic terminal on the neuron will show synaptic strengthening. This enables associations between coactive inputs, or correlated activity in input axons, to be learned by neurons using this simple associative learning rule.
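The Hebb rule above can be written as an outer product over a whole layer of neurons, so that every synapse with conjunctive pre- and postsynaptic activity is strengthened at once. A minimal sketch, in which the learning rate and the pattern values are illustrative assumptions:

```python
import numpy as np

def hebb_update(w, x, y, alpha=0.5):
    """Apply the Hebb rule delta_w_ij = alpha * y_i * x_j to the whole
    weight matrix at once, as an outer product of the postsynaptic
    firing vector y and the presynaptic firing vector x."""
    return w + alpha * np.outer(y, x)

w = np.zeros((2, 3))              # 2 output neurons, 3 input axons
x = np.array([1.0, 0.0, 1.0])     # presynaptic firing x_j
y = np.array([1.0, 0.0])          # postsynaptic firing y_i
w = hebb_update(w, x, y)
# Only synapses with both pre- and postsynaptic activity change:
# w == [[0.5, 0.0, 0.5],
#       [0.0, 0.0, 0.0]]
```

Because ${y}_{i}$ is available along much of the dendrite, every active presynaptic terminal on an activated neuron is strengthened, which is exactly what the outer product expresses.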
A.1.5 Long-Term Potentiation and Long-Term Depression as models of synaptic modification
Long-Term Potentiation (LTP) and Long-Term Depression (LTD) provide useful models of some of the synaptic modifications that occur in the brain (Feldman 2009). The synaptic changes found appear to be synapse-specific, and to depend on information available locally at the synapse. LTP and LTD may thus provide a good model of the biological synaptic modifications involved in real neuronal network operations in the brain. We next therefore describe some of the properties of LTP and LTD, and evidence that implicates them in learning in at least some brain systems. Even if they turn out not to be the basis for the synaptic modifications that occur during learning, they have many of the properties that would be needed by some of the synaptic modification systems used by the brain.
Long-term potentiation (LTP) is a use-dependent and sustained increase in synaptic strength that can be induced by brief periods of synaptic stimulation. It is usually measured as a sustained increase in the amplitude of electrically evoked responses in specific neural pathways following brief trains of high-frequency stimulation (see Fig. A.5b).
For example, high-frequency stimulation of the Schaffer collateral inputs to the hippocampal CA1 cells results in a larger response recorded from the CA1 cells to single test-pulse stimulation of the pathway. LTP is long-lasting, in that its effect can be measured for hours in hippocampal slices, and in chronic in vivo experiments in some cases it may last for months. LTP becomes evident rapidly, typically in less than 1 minute. LTP is in some brain systems associative. This is illustrated in Fig. A.5c, in which a weak input to a group of cells (e.g. the commissural input to CA1) does not show LTP unless it is given at the same time as (i.e. associatively with) another input (which could be weak or strong) to the cells. The associativity arises because learning can occur only when sufficient activation of the postsynaptic neuron is produced to exceed the threshold of the NMDA receptors (see below). The two weak inputs summate to produce sufficient depolarization to exceed the threshold. This associative property is shown very clearly in experiments in which LTP of an input to a single cell occurs only if the cell membrane is depolarized by passing current through it at the same time as the input arrives at the cell. The depolarization alone or the input alone is not sufficient to produce the LTP, and the LTP is thus associative. Moreover, in that the presynaptic input and the postsynaptic depolarization must occur at about the same time (within approximately 500 ms), the LTP requires temporal contiguity. LTP is also synapse-specific, in that for example an inactive input to a cell does not show LTP even if the cell is strongly activated by other inputs (Fig. A.5b, input B).
These spatiotemporal properties of LTP can be understood in terms of actions of the inputs on the postsynaptic cell, which in the hippocampus has two classes of receptor, NMDA (N-methyl-D-aspartate) and K–Q (kainate–quisqualate), both activated by the glutamate released by the presynaptic terminals. The NMDA receptor channels are normally blocked by Mg${}^{2+}$, but when the cell is strongly depolarized by strong tetanic stimulation of the type necessary to induce LTP, the Mg${}^{2+}$ block is removed, and Ca${}^{2+}$ entering via the NMDA receptor channels triggers events that lead to the potentiated synaptic transmission (see Fig. A.6).
Part of the evidence for this is that NMDA antagonists such as AP5 (D-2-amino-5-phosphonopentanoate) block LTP. Further, if the postsynaptic membrane is voltage-clamped to prevent depolarization by a strong input, then LTP does not occur. The voltage-dependence of the NMDA receptor channels introduces a threshold and thus a non-linearity that contributes to a number of the phenomena of some types of LTP, such as cooperativity (many small inputs together produce sufficient depolarization to allow the NMDA receptors to operate), associativity (a weak input alone will not produce sufficient depolarization of the postsynaptic cell to enable the NMDA receptors to be activated, but the depolarization will be sufficient if there is also a strong input), and temporal contiguity between the different inputs that show LTP (in that if inputs occur non-conjunctively, the depolarization shows insufficient summation to reach the required level, or some of the inputs may arrive when the depolarization has decayed). Once the LTP has become established (which can be within one minute of the strong input to the cell), the LTP is expressed through the K–Q receptors, in that AP5 blocks only the establishment of LTP, and not its subsequent expression (Bliss & Collingridge 1993, Nicoll & Malenka 1995, Fazeli & Collingridge 1996, Feldman 2009).
There are a number of possibilities about what change is triggered by the entry of Ca${}^{2+}$ to the postsynaptic cell to mediate LTP. One possibility is that somehow a messenger reaches the presynaptic terminals from the postsynaptic membrane and, if the terminals are active, causes them to release more transmitter in future whenever they are activated by an action potential. Consistent with this possibility is the observation that, after LTP has been induced, more transmitter appears to be released from the presynaptic endings. Another possibility is that the postsynaptic membrane changes just where Ca${}^{2+}$ has entered, so that K–Q receptors become more responsive to glutamate released in future. Consistent with this possibility is the observation that after LTP, the postsynaptic cell may respond more to locally applied glutamate (using a microiontophoretic technique).
The rule that underlies associative LTP is thus that synapses connecting two neurons become stronger if there is conjunctive presynaptic and (strong) postsynaptic activity. This learning rule for synaptic modification is sometimes called the Hebb rule, after Donald Hebb of McGill University who drew attention to this possibility, and its potential importance in learning (Hebb 1949).
In that LTP is long-lasting, develops rapidly, is synapse-specific, and is in some cases associative, it is of interest as a potential synaptic mechanism underlying some forms of memory. Evidence linking it directly to some forms of learning comes from experiments in which it has been shown that the drug AP5, infused so that it reaches the hippocampus to block NMDA receptors, blocks spatial learning mediated by the hippocampus (see Morris (1989), Martin, Grimwood & Morris (2000)). The task learned by the rats was to find the location, relative to cues in a room, of a platform submerged in an opaque liquid (milk). Interestingly, if the rats had already learned where the platform was, then the NMDA infusion did not block performance of the task. This is a close parallel to LTP, in that the learning, but not the subsequent expression of what had been learned, was blocked by the NMDA antagonist AP5. Although there is still some uncertainty about the experimental evidence that links LTP to learning (see for example Martin, Grimwood & Morris (2000)), there is a need for synapse-specific modifiability of synaptic strengths on neurons if neuronal networks are to learn (see Section A.2). If LTP is not always an exact model of the synaptic modification that occurs during learning, then something with many of the properties of LTP is nevertheless needed, and is likely to be present in the brain given the functions known to be implemented in many brain regions (see Rolls & Treves (1998)).
In another model of the role of LTP in memory, Davis (2000) has studied the role of the amygdala in learning associations to fear-inducing stimuli. He has shown that blockade of NMDA synapses in the amygdala interferes with this type of learning, consistent with the idea that LTP also provides a useful model of this type of learning (see further Chapter 4).
Long-Term Depression (LTD) can also occur (Feldman 2009). It can in principle be associative or non-associative. In associative LTD, the alteration of synaptic strength depends on the pre- and postsynaptic activities. There are two types. Heterosynaptic LTD occurs when the postsynaptic neuron is strongly activated, and there is low presynaptic activity (see Fig. A.5b input B, and Table A.1). Heterosynaptic LTD is so called because the synapse that weakens is other than (hetero-) the one through which the postsynaptic neuron is activated. Heterosynaptic LTD is important in associative neuronal networks, and in competitive neuronal networks (see Chapter 7 of Rolls & Deco (2002)). In competitive neural networks it would be helpful if the degree of heterosynaptic LTD depended on the existing strength of the synapse, and there is some evidence that this may be the case (see Chapter 7 of Rolls & Deco (2002)). Homosynaptic LTD occurs when the presynaptic neuron is strongly active, and the postsynaptic neuron has some, but low, activity (see Fig. A.5d and Table A.1). Homosynaptic LTD is so called because the synapse that weakens is the same as (homo-) the one that is active. Heterosynaptic and homosynaptic LTD are found in the neocortex (Artola & Singer 1993, Singer 1995, Frégnac 1996) and hippocampus (Christie 1996), and in many cases are dependent on activation of NMDA receptors (see also Fazeli & Collingridge (1996)). LTD in the cerebellum is evident as weakening of active parallel fibre to Purkinje cell synapses when the climbing fibre connecting to a Purkinje cell is active (Ito 1984, Ito 1989, Ito 1993a, Ito 1993b).
An interesting time-dependence of LTP and LTD has been observed, with LTP occurring especially when the presynaptic spikes precede the postsynaptic activation by a few ms, and LTD occurring when the presynaptic spikes follow the postsynaptic activation by a few ms (Markram, Lübke, Frotscher & Sakmann 1997, Bi & Poo 1998, Feldman 2012). This type of temporally asymmetric Hebbian learning rule, demonstrated in the neocortex and the hippocampus, can induce associations over time, and not just between simultaneous events. Networks of neurons with such synapses can learn sequences (Minai & Levy 1993), enabling them to predict the future state of the postsynaptic neuron based on past experience (Abbott & Blum 1996) (see further Koch (1999), Markram, Pikus, Gupta & Tsodyks (1998) and Abbott & Nelson (2000)). This mechanism, because of its apparent time-specificity for periods in the range of tens of ms, could also encourage neurons to learn to respond to temporally synchronous presynaptic firing (Gerstner, Kreiter, Markram & Herz 1997), and indeed to decrease the synaptic strengths from neurons that fire at random times with respect to the synchronized group. This mechanism might also play a role in the normalization of the strengths of the synaptic connections onto a neuron. Under the somewhat steady-state conditions of the firing of neurons in the higher parts of the ventral visual system on the 10-ms timescale that are observed not only when single stimuli are presented for 500 ms (see Fig. 4.11), but also when macaques have found a search target and are looking at it (in the experiments described in Section 4.4.4.3), the average presynaptic and postsynaptic rates are likely to be the important determinants of synaptic modification.
Part of the reason for this is that correlations between the firing of simultaneously recorded inferior temporal cortex neurons are not common, and if present are not very strong, and are typically restricted to a short time window in the order of 10 ms (Rolls & Treves 2011, Franco et al. 2004, Aggelopoulos et al. 2005). This point is also made in the context that each neuron has thousands of inputs, several tens of which are normally likely to be active when a cell is firing above its spontaneous firing rate and is strongly depolarized. This makes it statistically unlikely that there will be a strong correlation between a particular presynaptic spike and postsynaptic firing, and thus that this is a main determinant of synaptic strength under these natural conditions. Further points are that cortical neurons often fire to effective stimuli with firing rates of 50 or more spikes/s (Rolls 2008b, Rolls & Treves 2011); that usually LTP and not LTD is observed with high firing rates; and that with these high firing rates, the issue arises of whether one neuron is firing before or after another neuron.
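The temporally asymmetric rule described above is often modelled with exponential spike-timing-dependent plasticity (STDP) windows. The following sketch uses illustrative constants; the values of a_plus, a_minus, and tau_ms are assumptions for demonstration, not parameters from the studies cited:

```python
import math

def stdp_dw(delta_t_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Weight change for a single pre/post spike pair separated by
    delta_t_ms = t_post - t_pre.  Pre-before-post (positive delta_t)
    gives LTP; post-before-pre gives LTD; both effects decay
    exponentially over tens of ms, matching the time-specificity
    discussed in the text."""
    if delta_t_ms > 0:
        return a_plus * math.exp(-delta_t_ms / tau_ms)    # potentiation
    return -a_minus * math.exp(delta_t_ms / tau_ms)       # depression

print(stdp_dw(5.0))    # small positive change (LTP)
print(stdp_dw(-5.0))   # small negative change (LTD)
```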
A.1.6 Distributed representations
When considering the operation of many neuronal networks in the brain, it is found that many useful properties arise if each input to the network (arriving on the axons as a firing rate vector $\mathbf{x}$) is encoded in the activity of an ensemble or population of the axons or input lines (distributed encoding), and is not signalled by the activity of a single input, which is called local encoding. We start with some definitions, and then highlight some of the differences, and summarize some evidence that shows the type of encoding used in some brain regions. Then in Section A.2.8 (e.g. Table A.2), we show how many of the useful properties of the neuronal networks described depend on distributed encoding. Rolls (2008b) and Rolls & Treves (2011) review evidence on the encoding actually found in cortical areas.
A.1.6.1 Definitions
A local representation is one in which all the information that a particular stimulus or event occurred is provided by the activity of one of the neurons. In a famous example, a single neuron might be active only if one’s grandmother was being seen. An implication is that most neurons in the brain regions where objects or events are represented would fire only very rarely. A problem with this type of encoding is that a new neuron would be needed for every object or event that has to be represented. There are many other disadvantages of this type of encoding, many of which are made apparent in this book. Moreover, there is evidence that objects are represented in the brain by a different type of encoding.
A fully distributed representation is one in which all the information that a particular stimulus or event occurred is provided by the activity of the full set of neurons. If the neurons are binary (e.g. either active or not), the most distributed encoding is when half the neurons are active for any one stimulus or event.
A sparse distributed representation is a distributed representation in which a small proportion of the population of neurons is active at any one time. In a sparse representation with binary neurons, less than half of the neurons are active for any one stimulus or event. For binary neurons, we can use as a measure of the sparseness the proportion of neurons in the active state. For neurons with real, continuously variable, values of firing rates, the sparseness a of the representation can be measured, by extending the binary notion of the proportion of neurons that are firing, as

$a=\frac{{\left(\sum _{i=1}^{N}{y}_{i}/N\right)}^{2}}{\sum _{i=1}^{N}{y}_{i}^{2}/N}$
where ${y}_{i}$ is the firing rate of the ith neuron in the set of N neurons (Treves & Rolls 1991, Rolls & Treves 2011, Franco, Rolls, Aggelopoulos & Jerez 2007).
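The sparseness measure just defined can be computed directly. A minimal sketch, with invented example rate vectors:

```python
import numpy as np

def sparseness(y):
    """Sparseness a of a firing-rate vector y, following Treves & Rolls
    (1991): a = (sum_i y_i / N)**2 / (sum_i y_i**2 / N)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    return (y.sum() / n) ** 2 / (np.sum(y ** 2) / n)

print(sparseness([1, 0, 0, 0]))   # 0.25: one of four binary neurons active
print(sparseness([1, 1, 1, 1]))   # 1.0: fully distributed
print(sparseness([9, 1, 1, 1]))   # a graded, fairly sparse representation
```

For binary patterns the measure reduces to the proportion of neurons active, as the first two examples show.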
Coarse coding utilizes overlaps of receptive fields, and can compute positions in the input space using differences between the firing levels of coactive cells (e.g. the colour-tuned cones in the retina). The representation implied is very distributed. Fine coding (in which for example a neuron may be ‘tuned’ to the exact orientation and position of a stimulus) implies more local coding.
A.1.6.2 Advantages of different types of coding
One advantage of distributed encoding is that the similarity between two representations can be reflected by the correlation between the two patterns of activity that represent the different stimuli. We have already introduced the idea that the input to a neuron is represented by the activity of its set of input axons ${x}_{j}$, where j indexes the axons, numbered from $j=1$ to $C$ (see Fig. A.2 and Equation A.1). Now the set of activities of the input axons is a vector (a vector is an ordered set of numbers; Appendix 1 of Rolls & Treves (1998) and of Rolls (2008b) provides a summary of some of the concepts involved). We can denote as ${\mathbf{x}}_{1}$ the vector of axonal activity that represents stimulus 1, and ${\mathbf{x}}_{2}$ the vector that represents stimulus 2. Then the similarity between the two vectors, and thus the two stimuli, is reflected by the correlation between the two vectors. The correlation will be high if the activity of each axon in the two representations is similar, and will fall progressively as the activity of more and more of the axons differs in the two representations. Thus the similarity of two inputs can be represented in a graded or continuous way if (this type of) distributed encoding is used. This enables generalization to similar stimuli, or to incomplete versions of a stimulus (if it is for example partly seen or partly remembered), to occur. With a local representation, either one stimulus or another is represented, and similarities between different stimuli are not encoded.
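The point that similarity is reflected in the correlation between firing-rate vectors can be illustrated with a normalized dot product (cosine) measure. The stimulus vectors here are invented for illustration:

```python
import numpy as np

def normalized_dot(a, b):
    """Normalized dot product (cosine) of two firing-rate vectors:
    1 for identical patterns, 0 for patterns with no active axons
    in common."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x1 = np.array([1.0, 1.0, 0.0, 0.0])   # stimulus 1
x2 = np.array([1.0, 0.8, 0.2, 0.0])   # a similar stimulus
x3 = np.array([0.0, 0.0, 1.0, 1.0])   # a dissimilar stimulus

print(normalized_dot(x1, x2))   # high: generalization would occur
print(normalized_dot(x1, x3))   # 0.0: no overlap in active axons
```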
Another advantage of distributed encoding is that the number of different stimuli that can be represented by a set of C components (e.g. the activity of C axons) can be very large. A simple example is provided by the binary encoding of an 8element vector. One component can code for which of two stimuli has been seen, 2 components (or bits in a computer byte) for 4 stimuli, 3 components for 8 stimuli, 8 components for 256 stimuli, etc. That is, the number of stimuli increases exponentially with the number of components (or in this case, axons) in the representation. (In this simple binary illustrative case, the number of stimuli that can be encoded is ${2}^{C}$.) Put the other way round, even if a neuron has only a limited number of inputs (e.g. a few thousand), it can nevertheless receive a great deal of information about which stimulus was present. This ability of a neuron with a limited number of inputs to receive information about which of potentially very many input events is present is probably one factor that makes computation by the brain possible. With local encoding, the number of stimuli that can be encoded increases only linearly with the number C of axons or components (because a different component is needed to represent each new stimulus). (In our example, only 8 stimuli could be represented by 8 axons.)
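The exponential-versus-linear capacity argument can be checked numerically. A trivial sketch (the function names are our own):

```python
import math

def capacity_local(c):
    """Local encoding: one dedicated component per stimulus."""
    return c

def capacity_distributed(c):
    """Binary distributed encoding: every combination of C binary
    components is a distinct code, giving 2**C stimuli."""
    return 2 ** c

for c in (1, 2, 3, 8):
    print(c, capacity_local(c), capacity_distributed(c))
# With 8 components: 8 stimuli locally, 256 distributed.

# Because information is a log measure, an exponential number of
# stimuli corresponds to information rising linearly with C:
print(math.log2(capacity_distributed(8)))   # 8.0 bits
```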
In the real brain, there is now good evidence that in a number of brain systems, including the high-order visual and olfactory cortices, and the hippocampus, distributed encoding with the above two properties, of representing similarity, and of exponentially increasing encoding capacity as the number of neurons in the representation increases, is found (Rolls & Tovee 1995, Abbott, Rolls & Tovee 1996, Rolls, Treves & Tovee 1997d, Rolls, Treves, Robertson, Georges-François & Panzeri 1998b, Rolls, Aggelopoulos, Franco & Treves 2004, Franco, Rolls, Aggelopoulos & Jerez 2007, Rolls 2008b, Rolls, Critchley, Verhagen & Kadohisa 2010a, Rolls & Treves 2011). For example, in the primate inferior temporal visual cortex, the number of faces or objects that can be represented increases approximately exponentially with the number of neurons in the population (see Chapter 4). If we consider instead the information about which stimulus is seen, we see that this rises approximately linearly with the number of neurons in the representation (see Chapter 4). This corresponds to an exponential rise in the number of stimuli encoded, because information is a log measure (see Rolls & Treves (2011)). A similar result has been found for the encoding of position in space by the primate hippocampus (Rolls, Treves, Robertson, Georges-François & Panzeri 1998b). Similar results have been found for the encoding of information about taste and olfactory stimuli in the orbitofrontal cortex (Rolls, Critchley, Verhagen & Kadohisa 2010a, Rolls & Treves 2011). It is particularly important that the information can be read from the ensemble of neurons using a simple measure of the similarity of vectors, the correlation (or dot product) between two vectors. The importance of this is that it is essentially vector similarity operations that characterize the operation of many neuronal networks (see Section A.2).
The neurophysiological results show that both the ability to reflect similarity by vector correlation, and the utilization of exponential coding capacity, are properties of real neuronal networks found in the brain.
To emphasize one of the points being made here, although the binary encoding used in the 8-bit vector described above has optimal capacity for binary encoding, it is not optimal for vector similarity operations. For example, the two very similar numbers 127 and 128 are represented by 01111111 and 10000000 with binary encoding, yet the correlation or bit overlap of these vectors is 0. The brain in contrast uses a code that has the attractive property of exponentially increasing capacity with the number of neurons in the representation, though it is different from the simple binary encoding of numbers used in computers; and at the same time the brain codes stimuli in such a way that the code can be read off with simple dot product or correlation-related decoding, which is what is specified for the elementary neuronal network operation shown in Equation A.1 (see Rolls (2008b)).
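The 127/128 example can be verified directly; a short sketch:

```python
def bits(n, width=8):
    """Binary code of n as a list of 0/1 components."""
    return [int(b) for b in format(n, f'0{width}b')]

def overlap(a, b):
    """Bit overlap (dot product) between two binary vectors."""
    return sum(p * q for p, q in zip(a, b))

print(bits(127))                      # [0, 1, 1, 1, 1, 1, 1, 1]
print(bits(128))                      # [1, 0, 0, 0, 0, 0, 0, 0]
print(overlap(bits(127), bits(128)))  # 0: adjacent numbers, zero overlap
```

So a code that is optimal for capacity need not support similarity-based readout; the point made in the text is that the brain's codes support both.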
A.2 Pattern association memory
A fundamental operation of most nervous systems is to learn to associate a first stimulus with a second that occurs at about the same time, and to retrieve the second stimulus when the first is presented. The first stimulus might be the sight of food, and the second stimulus the taste of food. After the association has been learned, the sight of food would enable its taste to be retrieved. In classical conditioning, the taste of food might elicit an unconditioned response of salivation, and if the sight of the food is paired with its taste, then the sight of that food would by learning come to produce salivation. Pattern associators are thus used where the outputs of the visual system interface to learning systems in the orbitofrontal cortex and amygdala that learn associations between the sight of objects and their taste or touch in stimulus–reinforcer association learning (see Chapter 4). Pattern association is also used throughout the visual processing cortical areas, as it is the architecture that describes the backprojection connections from one cortical area to the preceding cortical area (Rolls & Deco 2002, Rolls 2008b). Pattern association thus contributes to implementing topdown influences in vision, including the effects of attention from higher to lower cortical areas, and thus between the object and spatial processing streams (Rolls & Deco 2002); the effects of mood on memory and visual information processing (see Section 4.12); the recall of visual memories; and the operation of visual shortterm memory (Rolls 2008b).
(p.556) A.2.1 Architecture and operation
The essential elements necessary for pattern association, forming what could be called a prototypical pattern associator network, are shown in Fig. A.7.
What we have called the second or unconditioned stimulus pattern is applied through unmodifiable synapses generating an input to each neuron, which, being external with respect to the synaptic matrix we focus on, we can call the external input ${e}_{i}$ for the ith neuron. (We can also treat this as a vector, $\mathbf{e}$, as indicated in the legend to Fig. A.7. Vectors and simple operations performed with them are summarized in Appendix A of Rolls (2008b).) This unconditioned stimulus is dominant in producing or forcing the firing of the output neurons (${y}_{i}$ for the ith neuron, or the vector $\mathbf{y}$). At the same time, the first or conditioned stimulus pattern consisting of the set of firings on the horizontally running input axons in Fig. A.7 (${x}_{j}$ for the jth axon) (or equivalently the vector $\mathbf{x}$) is applied through modifiable synapses ${w}_{ij}$ to the dendrites of the output neurons. The synapses are modifiable in such a way that if there is presynaptic firing on an input axon ${x}_{j}$ paired during learning with postsynaptic activity on neuron i, then the strength or weight ${w}_{ij}$ between that axon and the dendrite increases. This simple learning rule is often called the Hebb rule, after Donald Hebb who in 1949 formulated the hypothesis that if the firing of one neuron was regularly associated with another, then the strength of the synapse or synapses between the neurons should increase^{46}. After learning, presenting the pattern $\mathbf{x}$ on the input axons will activate the dendrite through the strengthened synapses. If the cue or conditioned stimulus pattern is the same as that learned, the postsynaptic neurons will be activated, even in the absence of the external or unconditioned input, as each of the firing axons produces (p.557) through a strengthened synapse some activation of the postsynaptic element, the dendrite. 
The total activation ${h}_{i}$ of each postsynaptic neuron i is then the sum of such individual activations. In this way, the ‘correct’ output neurons, that is those activated during learning, can end up being the ones most strongly activated, and the second or unconditioned stimulus can be effectively recalled. The recall is best when only strong activation of the postsynaptic neuron produces firing, that is if there is a threshold for firing, just like real neurons. The advantages of this are evident when many associations are stored in the memory, as will soon be shown.
Next we introduce a more precise description of the above by writing down explicit mathematical rules for the operation of the simple network model of Fig. A.7, which will help us to understand how pattern association memories in general operate. (In this description we introduce simple vector operations, and, for those who are not familiar with these, refer the reader to for example Appendix 1 of Rolls (2008b).) We have denoted above a conditioned stimulus input pattern as $\mathbf{x}$. Each of the axons has a firing rate, and if we count or index through the axons using the subscript j, the firing rate of the first axon is ${x}_{1}$, of the second ${x}_{2}$, of the jth ${x}_{j}$, etc. The whole set of axons forms a vector, which is just an ordered (1, 2, 3, etc.) set of elements. The firing rate of each axon ${x}_{j}$ is one element of the firing rate vector $\mathbf{x}$. Similarly, using i as the index, we can denote the firing rate of any output neuron as ${y}_{i}$, and the firing rate output vector as $\mathbf{y}$. With this terminology, we can then identify any synapse onto neuron i from neuron j as ${w}_{ij}$ (see Fig. A.7). In this book, the first index, i, always refers to the receiving neuron (and thus signifies a dendrite), while the second index, j, refers to the sending neuron (and thus signifies a conditioned stimulus input axon in Fig. A.7). We can now specify the learning and retrieval operations as follows:
A.2.1.1 Learning
The firing rate of every output neuron is forced to a value determined by the unconditioned (or external or forcing stimulus) input. In our simple model this means that for any one neuron i,

$${y}_{i} = f({e}_{i}) \qquad \text{(A.5)}$$
which indicates that the firing rate is a function of the dendritic activation, taken in this case to reduce essentially to that resulting from the external forcing input (see Fig. A.7). The function f is called the activation function (see Fig. A.3), and its precise form is irrelevant, at least during this learning phase. For example, the function at its simplest could be taken to be linear, so that the firing rate would be just proportional to the activation.
The Hebb rule can then be written as follows:

$$\mathrm{\delta}{w}_{ij} = \mathrm{\alpha}\,{y}_{i}\,{x}_{j} \qquad \text{(A.6)}$$
where $\mathrm{\delta}{w}_{ij}$ is the change of the synaptic weight ${w}_{ij}$ that results from the simultaneous (or conjunctive) presence of presynaptic firing ${x}_{j}$ and postsynaptic firing or activation ${y}_{i}$, and $\mathrm{\alpha}$ is a learning rate constant that specifies how much the synapses alter on any one pairing.
The Hebb rule is expressed in this multiplicative form to reflect the idea that both presynaptic and postsynaptic activity must be present for the synapses to increase in strength. The multiplicative form also reflects the idea that strong pre- and postsynaptic firing will produce a larger change of synaptic weight than smaller firing rates. It is also assumed for now that before any learning takes place, the synaptic strengths are small in relation to the changes that can be produced during Hebbian learning. We will see that this assumption can be relaxed later when a modified Hebb rule is introduced that can lead to a reduction in synaptic strength under some conditions.
(p.558) A.2.1.2 Recall
When the conditioned stimulus is present on the input axons, the total activation ${h}_{i}$ of a neuron i is the sum of all the activations produced through each strengthened synapse ${w}_{ij}$ by each active neuron ${x}_{j}$. We can express this as

$${h}_{i} = \sum_{j=1}^{C}{x}_{j}\,{w}_{ij} \qquad \text{(A.7)}$$
where $\sum _{j=1}^{C}$ indicates that the sum is over the C input axons (or connections) indexed by j to each neuron.
The multiplicative form here indicates that activation should be produced by an axon only if it is firing, and only if it is connected to the dendrite by a strengthened synapse. It also indicates that the strength of the activation reflects how fast the axon ${x}_{j}$ is firing, and how strong the synapse ${w}_{ij}$ is. The sum of all such activations expresses the idea that summation (of synaptic currents in real neurons) occurs along the length of the dendrite, to produce activation at the cell body, where the activation ${h}_{i}$ is converted into firing ${y}_{i}$. This conversion can be expressed as

$${y}_{i} = f({h}_{i}) \qquad \text{(A.8)}$$
where the function f is again the activation function. The form of the function now becomes more important. Real neurons have thresholds, with firing occurring only if the activation is above the threshold. A threshold linear activation function is shown in Fig. A.3b. This has been useful in formal analysis of the properties of neural networks. Neurons also have firing rates that become saturated at a maximum rate, and we could express this as the sigmoid activation function shown in Fig. A.3c. Yet another simple activation function, used in some models of neural networks, is the binary threshold function (Fig. A.3d), which indicates that if the activation is below threshold, there is no firing, and that if the activation is above threshold, the neuron fires maximally. Whatever the exact shape of the activation function, some nonlinearity is an advantage, for it enables small activations produced by interfering memories to be minimized, and it can enable neurons to perform logical operations, such as to fire or respond only if two or more sets of inputs are present simultaneously.
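The three activation functions just described can be sketched in a few lines of NumPy; the threshold and slope values here are illustrative choices, not parameters taken from the text:

```python
import numpy as np

def threshold_linear(h, threshold=0.0):
    # Fig. A.3b: no firing below threshold, linear increase above it
    return np.maximum(h - threshold, 0.0)

def sigmoid(h, beta=1.0):
    # Fig. A.3c: firing rate saturates at a maximum rate (here 1)
    return 1.0 / (1.0 + np.exp(-beta * h))

def binary_threshold(h, threshold=0.0):
    # Fig. A.3d: no firing below threshold, maximal firing above it
    return (h > threshold).astype(float)
```

Any of these can serve as f in the recall step; the nonlinearity is what suppresses the small activations produced by interfering memories.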
A.2.2 A simple model
An example of these learning and recall operations is provided in a simple form as follows. The neurons will have simple firing rates, which can be 0 to represent no activity, and 1 to indicate high firing. They are thus binary neurons, which can assume one of two firing rates. If we have a pattern associator with six input axons and four output neurons, we could represent the network before learning, with the same layout as in Fig. A.7, as shown in Fig. A.8:
(p.559) where $\mathbf{x}$ or the conditioned stimulus (CS) is 101010, and $\mathbf{y}$ or the firing produced by the unconditioned stimulus (UCS) is 1100. (The arrows indicate the flow of signals.) The synaptic weights are initially all 0.
After pairing the CS with the UCS during one learning trial, some of the synaptic weights will be incremented according to Equation A.6, so that after learning this pair the synaptic weights will become as shown in Fig. A.9:
(p.560) We can represent what happens during recall, when, for example, we present the CS that has been learned, as shown in Fig. A.10:
The activation of the four output neurons is 3300, and if we set the threshold of each output neuron to 2, then the output firing is 1100 (where the binary firing rate is 0 if below threshold, and 1 if above). The pattern associator has thus achieved recall of the pattern 1100, which is correct.
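The worked example above can be reproduced in a short NumPy sketch (with the learning rate $\mathrm{\alpha}$ set to 1, as in the figures):

```python
import numpy as np

x = np.array([1, 0, 1, 0, 1, 0])   # CS vector (6 input axons), Fig. A.8
y = np.array([1, 1, 0, 0])         # output firing forced by the UCS

alpha = 1                          # learning rate constant
W = np.zeros((4, 6), dtype=int)    # weights w_ij, all 0 before learning

W += alpha * np.outer(y, x)        # Hebb rule (Equation A.6): dw_ij = alpha * y_i * x_j

h = W @ x                          # recall (Equation A.7): h_i = sum_j x_j * w_ij
y_out = (h >= 2).astype(int)       # threshold of 2 applied to the activations
print(h)       # [3 3 0 0]
print(y_out)   # [1 1 0 0]
```

The activations 3300 and the thresholded output 1100 match the values in the text.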
We can now illustrate how a number of different associations can be stored in such a pattern associator, and retrieved correctly. Let us associate a new CS pattern 110001 with the UCS 0101 in the same pattern associator. The weights will become as shown next in Fig. A.11 after learning:
(p.561) If we now present the second CS, the retrieval is as shown in Fig. A.12:
The binary output firings were again produced with the threshold set to 2. Recall is perfect.
This illustration shows the value of some threshold nonlinearity in the activation function of the neurons. In this case, the activations did reflect some small crosstalk or interference from the previous pattern association of CS1 with UCS1, but this was removed by the threshold operation, to clean up the recall firing. The example also shows that when further associations are learned by a pattern associator trained with the Hebb rule, Equation A.6, some synapses will reflect increments above a synaptic strength of 1. It is left as an exercise to the reader to verify that recall is still perfect to CS1, the vector 101010. (The activation vector $\mathbf{h}$ is 3401, and the output firing vector $\mathbf{y}$ with the same threshold of 2 is 1100, which is perfect recall.)
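A minimal sketch of both stored associations, assuming $\mathrm{\alpha} = 1$ and the threshold of 2 used in the text, reproduces Figs. A.11 and A.12 and the exercise answer:

```python
import numpy as np

cs1, ucs1 = np.array([1, 0, 1, 0, 1, 0]), np.array([1, 1, 0, 0])
cs2, ucs2 = np.array([1, 1, 0, 0, 0, 1]), np.array([0, 1, 0, 1])

# Store both associations with the Hebb rule (alpha = 1)
W = np.outer(ucs1, cs1) + np.outer(ucs2, cs2)

def recall(W, cue, threshold=2):
    h = W @ cue                          # activations (Equation A.7)
    return h, (h >= threshold).astype(int)

h1, y1 = recall(W, cs1)   # h1 = [3 4 0 1] -> y1 = [1 1 0 0], as in the text
h2, y2 = recall(W, cs2)   # h2 = [1 4 0 3] -> y2 = [0 1 0 1]: perfect recall
```

Note the weight of 2 at the synapse receiving an active input from both CS1 and CS2, and the small crosstalk in the activations that the threshold removes.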
A.2.3 The vector interpretation
The way in which recall is produced, Equation A.7, consists for each output neuron i of multiplying each input firing rate ${x}_{j}$ by the corresponding synaptic weight ${w}_{ij}$ and summing the products to obtain the activation ${h}_{i}$. Now we can consider the firing rates ${x}_{j}$ where j varies from 1 to ${N}^{\prime}$, the number of axons, to be a vector. (A vector is simply an ordered set of numbers – see Appendix 1 of Rolls (2008b).) Let us call this vector $\mathbf{x}$. Similarly, on a neuron i, the synaptic weights can be treated as a vector, ${\mathbf{w}}_{i}$. (The subscript i here indicates that this is the weight vector on the ith neuron.) The operation we have just described to obtain the activation of an output neuron can now be seen to be a simple multiplication operation of two vectors to produce a single output value (called a scalar output). This is the inner product or dot product of two vectors, and can be written

$${h}_{i} = \mathbf{x} \cdot {\mathbf{w}}_{i} \qquad \text{(A.9)}$$
The inner product of two vectors indicates how similar they are. If two vectors have corresponding elements the same, then the dot product will be maximal. If the two vectors are similar but not identical, then the dot product will be high. If the two vectors are completely different, the dot product will be 0, and the vectors are described as orthogonal. (The term orthogonal means at right angles, and arises from the geometric interpretation of vectors, which is summarized in Appendix 1 of Rolls (2008b).) Thus the dot product provides a direct measure of how similar two vectors are.
(p.562) It can now be seen that a fundamental operation many neurons perform is effectively to compute how similar an input pattern vector $\mathbf{x}$ is to their stored weight vector ${\mathbf{w}}_{i}$. The similarity measure they compute, the dot product, is a very good measure of similarity, and indeed, the standard (Pearson product-moment) correlation coefficient used in statistics is the same as a normalized dot product with the mean subtracted from each vector, as shown in Appendix 1 of Rolls (2008b). (The normalization used in the correlation coefficient results in the coefficient varying always between $+1$ and $-1$, whereas the actual scalar value of a dot product clearly depends on the length of the vectors from which it is calculated.)
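The following NumPy fragment illustrates these points with three short binary vectors (the vectors themselves are arbitrary illustrations, not patterns from the text):

```python
import numpy as np

a = np.array([1., 0., 1., 0., 1., 0.])
b = np.array([1., 0., 1., 0., 0., 1.])   # similar to a, but not identical
c = np.array([0., 1., 0., 1., 0., 1.])   # no overlap with a: orthogonal

print(np.dot(a, a))   # 3.0: maximal for a binary vector with three active elements
print(np.dot(a, b))   # 2.0: high, because the vectors are similar
print(np.dot(a, c))   # 0.0: the vectors are orthogonal

def pearson(u, v):
    # Pearson correlation as a normalized, mean-subtracted dot product
    u, v = u - u.mean(), v - v.mean()
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

The `pearson` helper makes the relation explicit: subtract each vector's mean, take the dot product, and normalize by the vector lengths, giving a value that always lies between $-1$ and $+1$.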
With these concepts, we can now see that during learning, a pattern associator adds to its weight vector a vector $\mathrm{\delta}{\mathbf{w}}_{i}$ that has the same pattern as the input pattern $\mathbf{x}$, if the postsynaptic neuron i is strongly activated. Indeed, we can express Equation A.6 in vector form as

$$\mathrm{\delta}{\mathbf{w}}_{i} = \mathrm{\alpha}\,{y}_{i}\,\mathbf{x} \qquad \text{(A.10)}$$
We can now see that what is recalled by the neuron depends on the similarity of the recall cue vector ${\mathbf{x}}_{r}$ to the originally learned vector $\mathbf{x}$. The fact that during recall the output of each neuron reflects the similarity (as measured by the dot product) of the input pattern ${\mathbf{x}}_{r}$ to each of the patterns used originally as $\mathbf{x}$ inputs (conditioned stimuli in Fig. A.7) provides a simple way to appreciate many of the interesting and biologically useful properties of pattern associators, as described next.
A.2.4 Properties
A.2.4.1 Generalization
During recall, pattern associators generalize, and produce appropriate outputs if a recall cue vector ${\mathbf{x}}_{r}$ is similar to a vector that has been learned already. This occurs because the recall operation involves computing the dot (inner) product of the input pattern vector ${\mathbf{x}}_{r}$ with the synaptic weight vector ${\mathbf{w}}_{i}$, so that the firing produced, ${y}_{i}$, reflects the similarity of the current input to the previously learned input pattern $\mathbf{x}$. (Generalization will occur to input cue or conditioned stimulus patterns ${\mathbf{x}}_{r}$ that are incomplete versions of an original conditioned stimulus $\mathbf{x}$, although the term completion is usually applied to the autoassociation networks described in Section A.3.)
This is an important property of pattern associators, for input stimuli during recall will rarely be absolutely identical to what has been learned previously, and automatic generalization to similar stimuli is extremely useful, and has great adaptive value in biological systems.
(p.563) Generalization can be illustrated with the simple binary pattern associator considered above. (Those who have appreciated the vector description just given might wish to skip this illustration.) Instead of the second CS, pattern vector 110001, we will use the similar recall cue 110100, as shown in Fig. A.13:
It is seen that the output firing rate vector, 0101, is exactly what should be recalled to CS2 (and not to CS1), so correct generalization has occurred. Although this is a small network trained with few examples, the same properties hold for large networks with large numbers of stored patterns, as described more quantitatively in the section on capacity below and in Appendix A3 of Rolls & Treves (1998).
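The same generalization test can be run numerically: with the network trained on both CS–UCS pairs ($\mathrm{\alpha} = 1$), the distorted cue retrieves the UCS2 pattern:

```python
import numpy as np

cs1, ucs1 = np.array([1, 0, 1, 0, 1, 0]), np.array([1, 1, 0, 0])
cs2, ucs2 = np.array([1, 1, 0, 0, 0, 1]), np.array([0, 1, 0, 1])
W = np.outer(ucs1, cs1) + np.outer(ucs2, cs2)   # both associations stored

cue = np.array([1, 1, 0, 1, 0, 0])   # distorted version of CS2 (Fig. A.13)
h = W @ cue                          # activations [1 3 0 2]
y = (h >= 2).astype(int)             # [0 1 0 1]: the UCS2 pattern is retrieved
```

The cue shares only two active elements with CS2, yet the dot product with the weight vectors is still largest for the neurons trained on CS2, so generalization is correct.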
A.2.4.2 Graceful degradation or fault tolerance
If the synaptic weight vector ${\mathbf{w}}_{i}$ (or the weight matrix, which we can call $\mathbf{W}$) has synapses missing (e.g. during development), or loses synapses, then the activation ${h}_{i}$ or $\mathbf{h}$ is still reasonable, because ${h}_{i}$ is the dot product (correlation) of $\mathbf{x}$ with ${\mathbf{w}}_{i}$. The result, especially after passing through the activation function, can frequently be perfect recall. The same property arises if for example one or some of the conditioned stimulus (CS) input axons are lost or damaged. This is a very important property of associative memories, and is not a property of conventional computer memories, which produce incorrect data if even only 1 storage location (for 1 bit or binary digit of data) of their memory is damaged or cannot be accessed. This property of graceful degradation is of great adaptive value for biological systems.
(p.564) We can illustrate this with a simple example. If we damage two of the synapses in Fig. A.12 to produce the synaptic matrix shown in Fig. A.14 (where x indicates a damaged synapse which has no effect, but was previously 1), and now present the second CS, the retrieval is as follows:
The binary output firings were again produced with the threshold set to 2. The recalled vector, 0101, is perfect. This illustration again shows the value of some threshold nonlinearity in the activation function of the neurons. It is left as an exercise to the reader to verify that recall is still perfect to CS1, the vector 101010. (The output activation vector $\mathbf{h}$ is 3301, and the output firing vector $\mathbf{y}$ with the same threshold of 2 is 1100, which is perfect recall.)
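This can be checked numerically. The text does not specify which two synapses are damaged, so the particular pair below is an assumption, chosen to reproduce the activation vector 3301 quoted for CS1:

```python
import numpy as np

cs1, ucs1 = np.array([1, 0, 1, 0, 1, 0]), np.array([1, 1, 0, 0])
cs2, ucs2 = np.array([1, 1, 0, 0, 0, 1]), np.array([0, 1, 0, 1])
W = np.outer(ucs1, cs1) + np.outer(ucs2, cs2)

# Damage two synapses (set their weights, previously 1, to 0). This
# particular choice is an assumption consistent with the text.
W[1, 2] = 0
W[3, 1] = 0

h2 = W @ cs2                    # [1 4 0 2]
print((h2 >= 2).astype(int))    # [0 1 0 1]: recall of UCS2 is still perfect
h1 = W @ cs1                    # [3 3 0 1], the '3301' quoted in the text
print((h1 >= 2).astype(int))    # [1 1 0 0]: recall of UCS1 is also still perfect
```

Even with two synapses silenced, the thresholded outputs are unchanged, which is the graceful degradation property.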
A.2.4.3 The importance of distributed representations for pattern associators
A distributed representation is one in which the firing or activity of all the elements in the vector is used to encode a particular stimulus. For example, in a conditioned stimulus vector CS1 that has the value 101010, we need to know the state of all the elements to know which stimulus is being represented. Another stimulus, CS2, is represented by the vector 110001. We can represent many different events or stimuli with such overlapping sets of elements, and because in general any one element cannot be used to identify the stimulus, but instead the information about which stimulus is present is distributed over the population of elements or neurons, this is called a distributed representation (see Section A.1.6). If, for binary neurons, half the neurons are in one state (e.g. 0), and the other half are in the other state (e.g. 1), then the representation is described as fully distributed. The CS representations above are thus fully distributed. If only a smaller proportion of the neurons is active to represent a stimulus, as in the vector 100001, then this is a sparse representation. For binary representations, we can quantify the sparseness by the proportion of neurons in the active (1) state.
In contrast, a local representation is one in which all the information that a particular stimulus or event has occurred is provided by the activity of one of the neurons, or elements in the vector. One stimulus might be represented by the vector 100000, another stimulus by the vector 010000, and a third stimulus by the vector 001000. The activity of neuron or element 1 would indicate that stimulus 1 was present, and of neuron 2, that stimulus 2 was present. The representation is local in that if a particular neuron is active, we know that the stimulus represented by that neuron is present. In neurophysiology, if such cells were present, they might be called ‘grandmother cells’ (cf. Barlow 1972), in that one neuron might represent a stimulus in the environment as complex and specific as one’s grandmother. Where the activity of a number of cells must be taken into account in order to represent a (p.565) stimulus (such as an individual taste), then the representation is sometimes described as using ensemble encoding.
The properties just described for associative memories, generalization, and graceful degradation are only implemented if the representation of the CS or $\mathbf{x}$ vector is distributed. This occurs because the recall operation involves computing the dot (inner) product of the input pattern vector ${\mathbf{x}}_{r}$ with the synaptic weight vector ${\mathbf{w}}_{i}$. This allows the activation ${h}_{i}$ to reflect the similarity of the current input pattern to a previously learned input pattern $\mathbf{x}$ only if several or many elements of the $\mathbf{x}$ and ${\mathbf{x}}_{r}$ vectors are in the active state to represent a pattern. If local encoding were used, e.g. 100000, then if the first element of the vector (which might be the firing of axon 1, i.e. ${x}_{1}$, or the strength of synapse $i1$, ${w}_{i1}$) is lost, the resulting vector is not similar to any other CS vector, and the activation is 0. In the case of local encoding, the important properties of associative memories, generalization and graceful degradation do not thus emerge. Graceful degradation and generalization are dependent on distributed representations, for then the dot product can reflect similarity even when some elements of the vectors involved are altered. If we think of the correlation between Y and X in a graph, then this correlation is affected only a little if a few $X,Y$ pairs of data are lost (see Appendix 1 of Rolls (2008b)).
A.2.5 Prototype extraction, extraction of central tendency, and noise reduction
If a set of similar conditioned stimulus vectors $\mathbf{x}$ are paired with the same unconditioned stimulus ${e}_{i}$, the weight vector ${\mathbf{w}}_{i}$ becomes (or points towards) the sum (or with scaling, the average) of the set of similar vectors $\mathbf{x}$. This follows from the operation of the Hebb rule in Equation A.6. When tested at recall, the output of the memory is then best to the average input pattern vector denoted $\langle \mathbf{x} \rangle$. If the average is thought of as a prototype, then even though the prototype vector $\langle \mathbf{x} \rangle$ itself may never have been seen, the best output of the neuron or network is to the prototype. This produces ‘extraction of the prototype’ or ‘central tendency’. The same phenomenon is a feature of human memory performance (see McClelland & Rumelhart (1986), Chapter 17), and this simple process with distributed representations in a neural network accounts for the psychological phenomenon.
If the different exemplars of the vector $\mathbf{x}$ are thought of as noisy versions of the true input pattern vector $\langle \mathbf{x} \rangle$ (with incorrect values for some of the elements), then the pattern associator has performed ‘noise reduction’, in that the output produced by any one of these vectors will represent the output produced by the true, noiseless, average vector $\langle \mathbf{x} \rangle$.
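A small simulation illustrates prototype extraction: a neuron trained with the Hebb rule on noisy exemplars ends up with a weight vector that matches the never-presented prototype better than it matches a typical trained exemplar. The pattern size, number of exemplars, and noise level here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
prototype = (rng.random(100) < 0.5).astype(float)   # the 'true' pattern <x>

# Train one neuron (y_i = 1, alpha = 1) on 20 noisy exemplars of the prototype
w = np.zeros(100)
exemplars = []
for _ in range(20):
    flip = rng.random(100) < 0.1                    # corrupt ~10% of the elements
    exemplar = np.where(flip, 1.0 - prototype, prototype)
    exemplars.append(exemplar)
    w += exemplar                                   # Hebb rule: dw = alpha * y_i * x

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# The weight vector points towards the average of the exemplars, so it
# matches the never-presented prototype better than the average exemplar.
print(cosine(w, prototype) > np.mean([cosine(w, e) for e in exemplars]))  # True
```

Because each corrupted element is flipped in only a minority of exemplars, the summed weight vector averages the noise away, which is exactly the noise-reduction property described above.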
A.2.6 Speed
Recall is very fast in a real neuronal network, because the conditioned stimulus input firings ${x}_{j}$ ($j=1,C$ axons) can be applied simultaneously to the synapses ${w}_{ij}$, and the activation ${h}_{i}$ can be accumulated in one or two time constants of the dendrite (e.g. 10–20 ms). Whenever the threshold of the cell is exceeded, it fires. Thus, in effectively one step, which takes the brain no more than 10–20 ms, all the output neurons of the pattern associator can be firing with rates that reflect the input firing of every axon. This is very different from a conventional digital computer, in which computing ${h}_{i}$ in Equation A.7 would involve C multiplication and addition operations occurring one after another, or $2C$ time steps.
The brain performs parallel computation in at least two senses in even a pattern associator. One is that for a single neuron, the separate contributions of the firing rate ${x}_{j}$ of each axon j multiplied by the synaptic weight ${w}_{ij}$ are computed in parallel and added in the same time step. The second is that this can be performed in parallel for all neurons $i=1,N$ in (p.566) the network, where there are N output neurons in the network. It is these types of parallel processing that enable these classes of neuronal network in the brain to operate so fast, in effectively so few steps.
Learning is also fast (‘one-shot’) in pattern associators, in that a single pairing of the conditioned stimulus $\mathbf{x}$ and the unconditioned stimulus (UCS) $\mathbf{e}$ which produces the unconditioned output firing $\mathbf{y}$ enables the association to be learned. There is no need to repeat the pairing in order to discover over many trials the appropriate mapping. This is extremely important for biological systems, in which a single co-occurrence of two events may lead to learning that could have life-saving consequences. (For example, the pairing of a visual stimulus with a potentially life-threatening aversive event may enable that event to be avoided in future.) Although repeated pairing with small variations of the vectors is used to obtain the useful properties of prototype extraction, extraction of central tendency, and noise reduction, the essential properties of generalization and graceful degradation are obtained with just one pairing. The actual time scales of the learning in the brain are indicated by studies of associative synaptic modification using long-term potentiation paradigms (LTP, see Section A.1.5). Co-occurrence or near simultaneity of the CS and UCS is required for periods of as little as 100 ms, with expression of the synaptic modification being present within typically a few seconds.
A.2.7 Local learning rule
The simplest learning rule used in pattern association neural networks, a version of the Hebb rule, is, as shown in Equation A.6 above,

$$\mathrm{\delta}{w}_{ij} = \mathrm{\alpha}\,{y}_{i}\,{x}_{j}$$
This is a local learning rule in that the information required to specify the change in synaptic weight is available locally at the synapse, as it is dependent only on the presynaptic firing rate ${x}_{j}$ available at the synaptic terminal, and the postsynaptic activation or firing ${y}_{i}$ available on the dendrite of the neuron receiving the synapse (see Fig. A.15b). This makes the learning rule biologically plausible, in that the information about how to change the synaptic weight does not have to be carried from a distant source, where it is computed, to every synapse. Such a nonlocal learning rule would not be biologically plausible, in that there are no appropriate connections known in most parts of the brain to bring in the synaptic training or teacher signal to every synapse.
Evidence that a learning rule with the general form of Equation A.6 is implemented in at least some parts of the brain comes from studies of long-term potentiation, described in Section A.1.5. Long-term potentiation (LTP) has the synaptic specificity defined by Equation A.6, in that only synapses from active afferents, not those from inactive afferents, become strengthened. Synaptic specificity is important for a pattern associator, and most other types of neuronal network, to operate correctly. The number of independently modifiable synapses on each neuron is a primary factor in determining how many different memory patterns can be stored in associative memories (see Sections A.2.7.1 and A.3.3.6).
Another useful property of real neurons in relation to Equation A.6 is that the postsynaptic term, ${y}_{i}$, is available on much of the dendrite of a cell, because the electrotonic length of the dendrite is short. In addition, active propagation of spiking activity from the cell body along the dendrite may help to provide a uniform postsynaptic term for the learning. Thus if a neuron is strongly activated with a high value for ${y}_{i}$, then any active synapse onto the cell will be capable of being modified. This enables the cell to learn an association between the pattern of activity on all its axons and its postsynaptic activation, which is stored as an addition to its weight vector ${\mathbf{w}}_{i}$. Then later on, at recall, the output can be produced as a vector dot product (p.567) operation between the input pattern vector $\mathbf{x}$ and the weight vector ${\mathbf{w}}_{i}$, so that the output of the cell can reflect the correlation between the current input vector and what has previously been learned by the cell.
It is interesting that many invertebrate neuronal systems may operate very differently from those described here, as discussed by Rolls & Treves (1998) (see Fig. A.15a).
A.2.7.1 Capacity
The question of the storage capacity of a pattern associator is considered in detail in Appendix A3 of Rolls & Treves (1998). It is pointed out there that, for this type of associative network, the number of memories that it can hold simultaneously in storage has to be analysed together with the retrieval quality of each output representation, and then only for a given quality of the representation provided in the input. This is in contrast to autoassociative nets (Section A.3), in which a critical number of stored memories exists (as a function of various parameters of the network), beyond which attempting to store additional memories results in it becoming impossible to retrieve essentially anything. With a pattern associator, instead, one will always retrieve something, but this something will be very small (in information or correlation terms) if too many associations are simultaneously in storage and/or if too little is provided as input.
The conjoint quality–capacity–input analysis can be carried out, for any specific instance of a pattern associator, by using formal mathematical models and established analytical procedures (see e.g. Treves (1995)). This, however, has to be done case by case. It is anyway useful to develop some intuition for how a pattern associator operates, by considering what its capacity would be in certain well-defined simplified cases.
Linear associative neuronal networks
These networks are made up of units with a linear activation function, which appears to make them unsuitable to represent real neurons with their positive-only firing rates. However, even purely linear units have been considered as provisionally relevant models of real neurons, by assuming that the latter operate sometimes (p.568) in the linear regime of their transfer function. (This implies a high level of spontaneous activity, and may be closer to conditions observed early on in sensory systems rather than in areas more specifically involved in memory.) As usual, the connections are trained by a Hebb (or similar) associative learning rule. The capacity of these networks can be defined as the total number of associations that can be learned independently of each other, given that the linear nature of these systems prevents anything more than a linear transform of the inputs. This implies that if input pattern C can be written as the weighted sum of input patterns A and B, the output to C will be just the same weighted sum of the outputs to A and B. If there are ${N}^{\prime}$ input axons, then there can be at most ${N}^{\prime}$ mutually independent input patterns (i.e. none able to be written as a weighted sum of the others), and therefore the capacity of linear networks, defined above, is just ${N}^{\prime}$, or equal to the number of inputs to each neuron. In general, a random set of fewer than ${N}^{\prime}$ vectors (the CS input pattern vectors) will tend to be mutually independent but not mutually orthogonal (at $90^{\circ}$ to each other) (see Appendix 1 of Rolls (2008b)). If they are not orthogonal (the normal situation), then their dot product is not 0, and the output pattern activated by one of the input vectors will be partially activated by other input pattern vectors, in accordance with how similar they are (see Equations A.9 and A.10).
This amounts to interference, which is therefore the more serious the less orthogonal, on the whole, is the set of input vectors.
Since input patterns are made of elements with positive values, if a simple Hebbian learning rule like the one of Equation A.6 is used (in which the input pattern enters directly with no subtraction term), the output resulting from the application of a stored input vector will be the sum of contributions from all other input vectors that have a nonzero dot product with it (see Appendix 1 of Rolls (2008b)), and interference will be disastrous. The only situation in which this would not occur is when different input patterns activate completely different input lines, but this is clearly an uninteresting circumstance for networks operating with distributed representations. A solution to this issue is to use a modified learning rule of the following form:
$$\delta w_{ij} = \alpha \, y_i (x_j - \bar{x}) \qquad \text{(A.11)}$$
where $\bar{x}$ is a constant, approximately equal to the average value of $x_j$. This learning rule includes (in proportion to $y_i$) increasing the synaptic weight if $(x_j - \bar{x}) > 0$ (long-term potentiation), and decreasing the synaptic weight if $(x_j - \bar{x}) < 0$ (heterosynaptic long-term depression). It is useful for $\bar{x}$ to be roughly the average activity of an input axon $x_j$ across patterns, because then the dot product between the various patterns stored on the weights and the input vector will tend to cancel out with the subtractive term, except for the pattern equal to (or correlated with) the input vector itself. Then up to $N'$ input vectors can still be learned by the network, with only minor interference (provided of course that they are mutually independent, as they will in general tend to be).
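The effect of the subtractive term can be illustrated with a small numerical sketch. Here one CS–UCS association is stored on a single output neuron, and an unrelated CS is used as a probe; the pattern sizes, the uniform rate distribution, and the value used for the average presynaptic rate are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Sketch of interference with positive-only rates, and how the
# subtractive (heterosynaptic LTD) term of the modified rule removes it.
rng = np.random.default_rng(0)
N = 1000                       # input axons onto one output neuron

x1 = rng.random(N)             # stored CS pattern (positive firing rates)
x2 = rng.random(N)             # a different, unrelated CS pattern
y1 = 1.0                       # output firing forced by the UCS

x_bar = 0.5                    # approx. average presynaptic rate (assumed)

w_hebb = y1 * x1               # simple Hebb rule
w_mod  = y1 * (x1 - x_bar)     # modified rule with the subtractive term

# Recall activations are dot products of the recall cue with the weights
self_hebb, cross_hebb = x1 @ w_hebb, x2 @ w_hebb
self_mod,  cross_mod  = x1 @ w_mod,  x2 @ w_mod

# With plain Hebb the unrelated cue x2 produces a large spurious
# activation; with the subtractive term it nearly cancels.
print(cross_hebb / self_hebb)  # ~0.75: severe interference
print(cross_mod / self_mod)    # ~0.0:  interference largely removed
```

The ratio of cross-activation to self-activation is the relevant quantity: the subtraction makes the stored weight vector nearly orthogonal, on average, to other positive-only input vectors.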
Table A.1 Effects of pre- and postsynaptic activity on synaptic modification

                                   Postsynaptic activation
                                   0                    high
Presynaptic firing      0          No change            Heterosynaptic LTD
                        high       Homosynaptic LTD     LTP
This modified learning rule can also be described in terms of a contingency table (Table A.1) showing the synaptic strength modifications produced by different types of learning rule, where LTP indicates an increase in synaptic strength (called Long-Term Potentiation in neurophysiology), and LTD indicates a decrease in synaptic strength (called Long-Term Depression in neurophysiology). Heterosynaptic (p.569) long-term depression is so-called because it is the decrease in synaptic strength that occurs at a synapse that is other than that through which the postsynaptic cell is being activated. This heterosynaptic long-term depression is the type of change of synaptic strength that is required (in addition to LTP) for effective subtraction of the average presynaptic firing rate, in order, as it were, to make the CS vectors appear more orthogonal to the pattern associator. The rule is sometimes called the Singer–Stent rule, after work by Singer (1987) and Stent (1973), and was discovered in the brain by Levy (Levy 1985; Levy & Desmond 1985; see Brown, Kairiss & Keenan 1990). Homosynaptic long-term depression is so-called because it is the decrease in synaptic strength that occurs at a synapse which is (the same as that which is) active. For it to occur, the postsynaptic neuron must simultaneously be inactive, or have only low activity. (This rule is sometimes called the BCM rule after the paper of Bienenstock, Cooper and Munro (1982); see Rolls & Deco (2002), Chapter 7.)
Associative neuronal networks with nonlinear neurons
With nonlinear neurons, that is with at least a threshold in the activation function so that the output firing $y_i$ is 0 when the activation $h_i$ is below the threshold, the capacity can be measured in terms of the number of different clusters of output pattern vectors that the network produces. This is because the nonlinearities now present (one per output neuron) result in some clustering of the outputs produced by all possible (conditioned stimulus) input patterns $\mathbf{x}$. Input patterns that are similar to a stored input vector can produce, due to the nonlinearities, output patterns even closer to the stored output; and, vice versa, sufficiently dissimilar inputs can be assigned to different output clusters, thereby increasing their mutual dissimilarity. As with the linear counterpart, in order to remove the correlation that would otherwise occur between the patterns because the elements can take only positive values, it is useful to use a modified Hebb rule of the form shown in Equation A.11.
With fully distributed output patterns, the number p of associations that leads to different clusters is of order C, the number of input lines (axons) per output neuron (that is, of order ${N}^{\prime}$ for a fully connected network), as shown in Appendix A3 of (Rolls & Treves, 1998). If sparse patterns are used in the output, or alternatively if the learning rule includes a nonlinear postsynaptic factor that is effectively equivalent to using sparse output patterns, the coefficient of proportionality between p and C can be much higher than one, that is, many more patterns can be stored than inputs onto each output neuron (see Appendix A3 of (Rolls & Treves, 1998)). Indeed, the number of different patterns or prototypes p that can be stored can be derived for example in the case of binary units (Gardner 1988) to be
$$p \approx \frac{C}{a_o \log(1/a_o)}$$
where $a_o$ is the sparseness of the output firing pattern $\mathbf{y}$ produced by the unconditioned stimulus. $p$ can in this situation be much larger than $C$ (see Rolls & Treves (1990), and Appendix A3 of Rolls & Treves (1998)). This is an important result for encoding in pattern associators, for it means that provided that the activation functions are nonlinear (which is the case with real neurons), there is a very great advantage to using sparse encoding, for then many more than $C$ pattern associations can be stored. Sparse representations may well be present in brain regions involved in associative memory (see Chapter 12 of Rolls & Treves (1998)) for this reason.
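The gain from sparse output encoding can be made concrete by evaluating the capacity expression of order $C/[a_o \log(1/a_o)]$ for a few sparseness values; $C$ and the $a_o$ values below are arbitrary example numbers, and the constant of proportionality (of order 1) is omitted.

```python
import numpy as np

# Illustrative evaluation of the sparse-coding capacity expression for
# nonlinear binary units: p of order C / (a_o * log2(1/a_o)).
C = 10_000                              # inputs per output neuron
ps = []
for a_o in (0.5, 0.1, 0.01, 0.001):
    p = C / (a_o * np.log2(1.0 / a_o))  # number of storable associations
    ps.append(p)
    print(f"a_o={a_o}: p of order {p:,.0f}")
```

The sparser the output patterns, the more associations can be stored per input connection, and for small $a_o$ the number stored greatly exceeds $C$.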
(p.570) The nonlinearity inherent in the NMDA receptor-based Hebbian plasticity present in the brain may help to make the stored patterns sparser than the input patterns. This may be especially beneficial in increasing the storage capacity of associative networks in the brain, by allowing especially those relatively few neurons with high firing rates in the exponential firing rate distributions typical of neurons in sensory systems to participate in the storage (see Rolls (2008b)).
A.2.7.2 Interference
Interference occurs in linear pattern associators if two vectors are not orthogonal, and is simply dependent on the angle between the originally learned vector and the recall cue or CS vector (see Appendix 1 of Rolls (2008b)), for the activation of the output neuron depends simply on the dot product of the recall vector and the synaptic weight vector (Equation A.9). Also in nonlinear pattern associators (the interesting case for all practical purposes), interference may occur if two CS patterns are not orthogonal, though the effect can be controlled with sparse encoding of the UCS patterns, effectively by setting high thresholds for the firing of output units. In other words, the CS vectors need not be strictly orthogonal, but if they are too similar, some interference will still be likely to occur.
The fact that interference is a property of neural network pattern associator memories is of interest, for interference is a major property of human memory. Indeed, the fact that interference is a property of human memory and of neural network association memories is entirely consistent with the hypothesis that human memory is stored in associative memories of the type described here, or at least that network associative memories of the type described represent a useful exemplar of the class of parallel distributed storage network used in human memory.
It may also be suggested that one reason that interference is tolerated in biological memory is that it is associated with the ability to generalize between stimuli, which is an invaluable feature of biological network associative memories, in that it allows the memory to cope with stimuli that will almost never be identical on different occasions, and in that it allows useful analogies that have survival value to be made.
A.2.7.3 Expansion recoding and pattern separation
If patterns are too similar to be stored in associative memories, then one solution that the brain seems to use repeatedly is to expand the encoding to a form in which the different stimulus patterns are less correlated, that is, more orthogonal, before they are presented as CS stimuli to a pattern associator. The problem can be highlighted by a nonlinearly separable mapping (which captures part of the eXclusive OR (XOR) problem), in which the mapping that is desired is as shown in Fig. A.16. The neuron has two inputs, A and B.
This is a mapping of patterns that is impossible for a one-layer network, because the patterns are not linearly separable^{47}. A solution is to remap the two input lines A and B to three input lines 1–3, that is, to use expansion recoding, as shown in Fig. A.17. This can be (p.571) performed by a competitive network (see Rolls (2008b)).
The synaptic weights on the dendrite of the output neuron could then learn the following values using a simple Hebb rule, Equation A.6, and the problem could be solved as in Fig. A.18.
The whole network would look like that shown in Fig. A.17.
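A minimal sketch of the expansion recoding idea follows. The exact mapping of Fig. A.16 is not reproduced here, so a mapping of the XOR type is assumed (A alone → 1, B alone → 1, A and B together → 0); the recoding onto three conjunction-detecting lines stands in for what a competitive network might learn.

```python
import numpy as np

patterns = np.array([[1, 0],     # A alone
                     [0, 1],     # B alone
                     [1, 1]])    # A and B together
targets = np.array([1, 1, 0])    # assumed XOR-type mapping

def expand(p):
    """Recode (A, B) onto 3 lines: A-only, B-only, A-and-B."""
    a, b = p
    return np.array([a and not b, b and not a, a and b], dtype=float)

X = np.array([expand(p) for p in patterns])   # decorrelated input lines

# Simple Hebb rule over the training set: w_j = sum of y * x_j
w = targets @ X                                # learns [1, 1, 0]

# Recall with a threshold of 0.5 reproduces the mapping exactly
out = (X @ w > 0.5).astype(int)
print(out)                                     # [1 1 0]
```

On the raw lines A and B no single threshold unit can produce this mapping; after expansion recoding each pattern activates its own line, so a one-layer Hebbian associator solves it trivially.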
Rolls & Treves (1998) show that competitive networks could help with this type of recoding, and could provide very useful preprocessing for a pattern associator in the brain. It is possible that the lateral nucleus of the amygdala performs this function, for it receives inputs from the temporal cortical visual areas, and may preprocess them before they become the inputs to associative networks at the next stage of amygdala processing (see Fig. 4.56). Pattern separation is a major function of the dentate gyrus/mossy fibre connections in the hippocampal system (Rolls 2008b, Rolls 2010b).
A.2.8 Implications of different types of coding for storage in pattern associators
Throughout this Section, we have made statements about how the properties of pattern associators – such as the number of patterns that can be stored, and whether generalization and graceful degradation occur – depend on the type of encoding of the patterns to be associated. (The types of encoding considered, local, sparse distributed, and fully distributed, are described above.) We draw together these points in Table A.2.
Table A.2 Coding in associative memories*

                                   Local              Sparse distributed                Fully distributed
Generalization, completion,
graceful degradation               No                 Yes                               Yes
Number of patterns that            N                  of order $C/[a_o \log(1/a_o)]$    of order C
can be stored                      (large)            (can be larger)                   (usually smaller than N)
Amount of information in           Minimal            Intermediate                      Large
each pattern (values if binary)    ($\log(N)$ bits)   ($N a_o \log(1/a_o)$ bits)        ($N$ bits)
* N refers here to the number of output units, and C to the average number of inputs to each output unit. ${a}_{o}$ is the sparseness of output patterns, or roughly the proportion of output units activated by a UCS pattern. Note: logs are to the base 2.
The amount of information that can be stored in each pattern in a pattern associator is considered in Appendix A3 of (Rolls & Treves, 1998).
(p.572) In conclusion, the architecture and properties of pattern association networks make them very appropriate for stimulus–reinforcer association learning. Their high capacity enables them to learn the correct reinforcement associations for very large numbers of different stimuli.
A.3 Autoassociation memory: attractor networks
In this section an introduction to autoassociation or attractor networks is given, as this type of network may be relevant to understanding how mood states are maintained.
Autoassociative memories, or attractor neural networks, store memories, each one of which is represented by a different set of the neurons firing. The memories are stored in the recurrent synaptic connections between the neurons of the network, for example in the recurrent collateral connections between cortical pyramidal cells. Autoassociative networks can then recall the appropriate memory from the network when provided with a fragment of one of the memories. This is called completion. Many different memories can be stored in the network and retrieved correctly. A feature of this type of memory is that it is content addressable: that is, the information in the memory can be accessed if just the contents of the memory (or a part of the contents of the memory) are used. This is in contrast to a conventional computer, in which the address of what is to be accessed must be supplied, and used to access the contents of the memory. Content addressability is an important simplifying feature of this type of memory, which makes it suitable for use in biological systems. The issue of content addressability will be amplified below.
An autoassociation memory can be used as a short-term memory, in which iterative processing round the recurrent collateral connections between the principal neurons in the network keeps a representation active by continuing, persistent, neuronal firing. Used in this way, attractor networks provide the basis for the implementation of short-term memory in the dorsolateral prefrontal cortex. In this cortical area, the short-term memory provides the basis for keeping a memory active even while perceptual areas such as the inferior temporal visual cortex must respond to each incoming visual stimulus in order for it to be processed, to produce behavioural responses, and for it to be perceived (Renart, Moreno, Rocha, Parga & Rolls 2001). The implementation of short-term memory in the prefrontal cortex, which can maintain neuronal firing even across intervening stimuli, provides an important foundation for attention, in which an item or items must be held in mind for a period and during this time bias other brain areas by top-down processing using corticocortical backprojections (Rolls 2008b; Deco & Rolls 2004; Deco & Rolls 2005c), or determine how stimuli are (p.573) mapped to responses (Deco & Rolls 2003) or to rewards (see Deco & Rolls (2005a)) with rapid, one-trial, task switching and decision-making. This dorsolateral prefrontal cortex short-term memory system also provides a computational foundation for executive function, in which several items must be held in a working memory so that they can be performed with the correct priority and order (Rolls 2008b). In brain areas involved in emotion, attractor networks may play a role in maintaining a mood state, at least in the short term after, for example, frustrative nonreward (see Chapters 2 and 3), and possibly in the longer term.
Other functions of autoassociation networks, including perceptual short-term memory (which may be used in the learning of invariant representations), constraint satisfaction, and episodic memory, are described by Rolls & Treves (1998) and Rolls (2008b).
A.3.1 Architecture and operation
The prototypical architecture of an autoassociation memory is shown in Fig. A.19. The external input ${e}_{i}$ is applied to each neuron i by unmodifiable synapses. This produces firing ${y}_{i}$ of each neuron, or a vector of firing on the output neurons $\mathbf{y}$. Each output neuron i is connected by a recurrent collateral connection to the other neurons in the network, via modifiable connection weights ${w}_{ij}$. This architecture effectively enables the output firing vector $\mathbf{y}$ to be associated during learning with itself. Later on, during recall, presentation of part of the external input will force some of the output neurons to fire, but through the recurrent collateral axons and the modified synapses, other neurons in $\mathbf{y}$ can be brought into activity. This process can be repeated a number of times, and recall of a complete pattern may be perfect. Effectively, a pattern can be recalled or recognized because of associations formed between its parts. This of course requires distributed representations.
Next we introduce a more precise and detailed description of the above, and describe the properties of these networks. Ways to analyse formally the operation of these networks are introduced in Appendix A4 of Rolls & Treves (1998) and by Amit (1989).
A.3.1.1 Learning
The firing of every output neuron i is forced to a value $y_i$ determined by the external input $e_i$. Then a Hebb-like associative local learning rule is applied to the recurrent synapses in the (p.574) network:
$$\delta w_{ij} = \alpha \, y_i \, y_j \qquad \text{(A.13)}$$
(The term $y_j$ in this Equation is the presynaptic term shown as $x_j$ in Fig. A.19; this is because the recurrent collateral connections feed the outputs of the network back as its inputs.) It is notable that in a fully connected network, this will result in a symmetric matrix of synaptic weights: the strength of the connection from neuron 1 to neuron 2 will be the same as the strength of the connection from neuron 2 to neuron 1 (both implemented via recurrent collateral synapses).
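The symmetry of the weight matrix under Hebbian autoassociative learning can be checked directly: each stored pattern contributes the outer product of its firing vector with itself. The sizes and the uniform rate distribution below are illustrative assumptions.

```python
import numpy as np

# With full connectivity and the Hebb rule of Equation A.13, the weight
# matrix is a sum of outer products of each pattern with itself, and is
# therefore symmetric (w_ij = w_ji).
rng = np.random.default_rng(1)
N, P = 50, 5
Y = rng.random((P, N))             # P firing-rate patterns over N neurons

W = np.zeros((N, N))
for y in Y:
    W += np.outer(y, y)            # delta_w_ij = alpha * y_i * y_j (alpha=1)
np.fill_diagonal(W, 0.0)           # a neuron does not synapse on itself

print(np.allclose(W, W.T))         # True
```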
It is a factor that is sometimes overlooked that there must be a mechanism for ensuring that during learning ${y}_{i}$ does approximate ${e}_{i}$, and must not be influenced much by activity in the recurrent collateral connections, otherwise the new external pattern $\mathbf{e}$ will not be stored in the network, but instead something will be stored that is influenced by the previously stored memories. Mechanisms that may facilitate this are described by Rolls & Treves (1998) and Rolls (2008b).
A.3.1.2 Recall
During recall, the external input ${e}_{i}$ is applied, and produces output firing, operating through the nonlinear activation function described below. The firing is fed back by the recurrent collateral axons shown in Fig. A.19 to produce activation of each output neuron through the modified synapses on each output neuron. The activation ${h}_{i}$ produced by the recurrent collateral effect on the ith neuron is, in the standard way, the sum of the activations produced in proportion to the firing rate of each axon ${y}_{j}$ operating through each modified synapse ${w}_{ij}$, that is,
$$h_i = \sum_j y_j w_{ij}$$
where $\sum_j$ indicates that the sum is over the $C$ input synapses to each neuron, each from a separate axon, indexed by $j$.
The output firing ${y}_{i}$ is a function of the activation ${h}_{i}$ produced by the recurrent collateral effect (internal recall) and by the external input (${e}_{i}$):
$$y_i = \text{f}(h_i + e_i)$$
The activation function should be nonlinear, and may be for example binary threshold, linear threshold, sigmoid, etc. (see Fig. A.3). The threshold at which the activation function operates is set in part by the effect of the inhibitory neurons in the network (not shown in Fig. A.19). The connectivity is that the pyramidal cells have collateral axons that excite the inhibitory interneurons, which in turn connect back to the population of pyramidal cells to inhibit them by a mixture of shunting (divisive) and subtractive inhibition using GABA (gamma-aminobutyric acid) synaptic terminals, as described by Rolls (2008b).
There are many fewer inhibitory neurons than excitatory neurons (in the order of 5–10%), and fewer connections to and from inhibitory neurons, and partly for this reason the inhibitory neurons are considered to perform generic functions such as threshold setting, rather than to store patterns by modifying their synapses (see Rolls (2008b)). The nonlinear activation function can minimize interference between the pattern being recalled and other patterns stored in the network, and can also be used to ensure that what is a positive feedback system remains stable. The network can be allowed to repeat this recurrent collateral loop a number of times. Each time the loop operates, the output firing becomes more like the originally stored pattern, and this progressive recall is usually complete within 5–15 iterations.
(p.575) A.3.2 Introduction to the analysis of the operation of autoassociation networks
With complete connectivity in the synaptic matrix, and the use of a Hebb rule, the matrix of synaptic weights formed during learning is symmetric. The learning algorithm is fast, ‘one-shot’, in that a single presentation of an input pattern is all that is needed to store that pattern.
During recall, a part of one of the originally learned stimuli can be presented as an external input. The resulting firing is allowed to iterate repeatedly round the recurrent collateral system, gradually on each iteration recalling more and more of the originally learned pattern. Completion thus occurs. If a pattern is presented during recall that is similar but not identical to any of the previously learned patterns, then the network settles into a stable recall state in which the firing corresponds to that of the most similar previously learned pattern. The network can thus generalize in its recall to the most similar previously learned pattern. The activation function of the neurons should be nonlinear, since a purely linear system would not produce any categorization of the input patterns it receives, and therefore would not be able to effect anything more than a trivial (i.e. linear) form of completion and generalization.
Recall can be thought of in the following way, relating it to what occurs in pattern associators. The external input $\mathbf{e}$ is applied, produces firing $\mathbf{y}$, which is applied as a recall cue on the recurrent collaterals as ${\mathbf{y}}^{\text{T}}$. (The notation ${\mathbf{y}}^{\text{T}}$ signifies the transpose of $\mathbf{y}$, which is implemented by the application of the firing of the neurons $\mathbf{y}$ back via the recurrent collateral axons as the next set of inputs to the neurons.) The activity on the recurrent collaterals is then multiplied with the synaptic weight vector stored during learning on each neuron to produce the new activation ${h}_{i}$ which reflects the similarity between ${\mathbf{y}}^{\text{T}}$ and one of the stored patterns. Partial recall has thus occurred as a result of the recurrent collateral effect. The activations ${h}_{i}$ after thresholding (which helps to remove interference from other memories stored in the network, or noise in the recall cue) result in firing ${y}_{i}$, or a vector of all neurons $\mathbf{y}$, which is already more like one of the stored patterns than, at the first iteration, the firing resulting from the recall cue alone, $\mathbf{y}=\text{f}(\mathbf{e}$). This process is repeated a number of times to produce progressive recall of one of the stored patterns.
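The storage and iterative recall just described can be sketched as follows. Sparse binary patterns are stored with a Hebb rule that subtracts the mean presynaptic rate, and recall iterates round the recurrent collaterals; a fixed number of winners stands in for the inhibitory threshold setting. All sizes and parameters are illustrative assumptions.

```python
import numpy as np

# Minimal autoassociator sketch: completion of a stored sparse pattern
# from a half-pattern cue by iterating the recurrent collateral loop.
rng = np.random.default_rng(2)
N, P, n_active = 200, 5, 20            # neurons, patterns, active/pattern
a = n_active / N                       # sparseness of the patterns

Y = np.zeros((P, N))
for p in range(P):
    Y[p, rng.choice(N, n_active, replace=False)] = 1.0

W = Y.T @ (Y - a)                      # delta_w_ij = y_i * (y_j - a)
np.fill_diagonal(W, 0.0)               # no self-connections

def recall(cue, n_iter=10):
    y = cue.copy()
    for _ in range(n_iter):
        h = W @ y                      # recurrent collateral activation
        theta = np.sort(h)[-n_active]  # threshold so n_active winners fire
        y = (h >= theta).astype(float)
    return y

cue = Y[0].copy()
cue[np.flatnonzero(cue)[n_active // 2:]] = 0.0   # keep half the pattern

overlap = recall(cue) @ Y[0] / n_active
print(overlap)                         # close to 1.0: completion
```

Each iteration pulls the firing vector closer to the stored pattern most correlated with the cue, which is the progressive recall described in the text.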
Autoassociation networks operate by effectively storing associations between the elements of a pattern. Each element of the pattern vector to be stored is simply the firing of a neuron. What is stored in an autoassociation memory is a set of pattern vectors. The network operates to recall one of the patterns from a fragment of it. Thus, although this network implements recall or recognition of a pattern, it does so by an association learning mechanism, in which associations between the different parts of each pattern are learned. These memories have sometimes been called autocorrelation memories (Kohonen 1977), because they learn correlations between the activity of neurons in the network, in the sense that each pattern learned is defined by a set of simultaneously active neurons. Effectively, each pattern is associated with itself by learning. This learning is implemented by an associative (Hebb-like) learning rule.
Formal approaches to the operation of these networks have been described by Hopfield (1982), Amit (1989), Hertz, Krogh & Palmer (1991), and Rolls & Treves (1998).
A.3.3 Properties
The internal recall in autoassociation networks involves multiplication of the firing vector of neuronal activity by the vector of synaptic weights on each neuron. This inner product vector multiplication allows the similarity of the firing vector to previously stored firing vectors to be provided by the output (as effectively a correlation), if the patterns learned are distributed. As a result of this type of ‘correlation computation’ performed if the patterns are distributed, (p.576) many important properties of these networks arise, including pattern completion (because part of a pattern is correlated with the whole pattern), and graceful degradation (because a damaged synaptic weight vector is still correlated with the original synaptic weight vector). Some of these properties are described next.
A.3.3.1 Completion
One important and useful property of these memories is that they complete an incomplete input vector, allowing recall of a whole memory from a small fraction of it. The memory recalled in response to a fragment is that stored in the memory that is closest in pattern similarity (as measured by the dot product, or correlation). Because the recall is iterative and progressive, the recall can be perfect.
A.3.3.2 Generalization
The network generalizes in that an input vector similar to one of the stored vectors will lead to recall of the originally stored vector, provided that distributed encoding is used. The principle by which this occurs is similar to that described for a pattern associator.
A.3.3.3 Graceful degradation or fault tolerance
If the synaptic weight vector ${\mathbf{w}}_{i}$ on each neuron (or the weight matrix) has synapses missing (e.g. during development), or loses synapses (e.g. with brain damage or aging), then the activation ${h}_{i}$ (or vector of activations $\mathbf{h}$) is still reasonable, because ${h}_{i}$ is the dot product (correlation) of ${\mathbf{y}}^{\text{T}}$ with ${\mathbf{w}}_{i}$. The same argument applies if whole input axons are lost. If an output neuron is lost, then the network cannot itself compensate for this, but the next network in the brain is likely to be able to generalize or complete if its input vector has some elements missing, as would be the case if some output neurons of the autoassociation network were damaged.
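Because the activation is a dot product of the cue with each neuron's weight vector, random loss of synapses attenuates the activation roughly in proportion, but preserves the correlation with the stored pattern. The sketch below uses one stored sparse pattern and an arbitrary 30% deletion level, both illustrative assumptions.

```python
import numpy as np

# Graceful degradation sketch: recall activation after random synapse
# loss is scaled down but still far larger than the activation produced
# by an unrelated pattern.
rng = np.random.default_rng(3)
N = 1000
stored = (rng.random(N) < 0.1).astype(float)   # stored sparse pattern
other  = (rng.random(N) < 0.1).astype(float)   # an unrelated pattern

w = stored.copy()                      # weight vector after Hebbian learning

damaged = w * (rng.random(N) > 0.3)    # delete ~30% of the synapses

h_intact  = stored @ w
h_damaged = stored @ damaged
h_other   = other @ damaged

print(h_damaged / h_intact)    # ~0.7: activation degrades gracefully
print(h_other / h_intact)      # far smaller: selectivity is preserved
```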
A.3.3.4 Speed
The recall operation is fast on each neuron on a single iteration, because the pattern $\mathbf{y}^{\text{T}}$ on the axons can be applied simultaneously to the synapses $\mathbf{w}_i$, and the activation $h_i$ can be accumulated in one or two time constants of the dendrite (e.g. 10–20 ms). If a simple implementation of an autoassociation net such as that described by Hopfield (1982) is simulated on a computer, then 5–15 iterations are typically necessary for completion of an incomplete input cue $\mathbf{e}$. This might be taken to correspond to 50–200 ms in the brain, rather too slow for any one local network in the brain to function. However, it transpires (see Rolls (2008b), Treves (1993), Battaglia & Treves (1998), Appendix A5 of Rolls & Treves (1998), and Panzeri, Rolls, Battaglia & Lavis (2001)) that if the neurons are treated not as McCulloch–Pitts neurons which are simply ‘updated’ at each iteration, or cycle of time steps (and assume the active state if the threshold is exceeded), but instead are analysed and modelled as ‘integrate-and-fire’ neurons in real continuous time, then the network can effectively ‘relax’ into its recall state very rapidly, in one or two time constants of the synapses^{48}. This corresponds to perhaps 20 ms in the brain. One factor in this rapid dynamics of autoassociative networks with brain-like ‘integrate-and-fire’ membrane and synaptic properties is that, with some spontaneous activity, some of the neurons in the network are close to threshold already before the recall cue is applied, and hence some of the neurons are very quickly pushed by the recall cue into firing, so that information starts to be exchanged very rapidly (within 1–2 ms of brain time) through the modified synapses by the neurons in the network.
The progressive exchange of information starting early on within what would otherwise be thought of as an iteration period (p.577) (of perhaps 20 ms, corresponding to a neuronal firing rate of 50 spikes/s) is the mechanism accounting for rapid recall in an autoassociative neuronal network made biologically realistic in this way. Further analysis of the fast dynamics of these networks if they are implemented in a biologically plausible way with ‘integrate-and-fire’ neurons is provided in Appendix A5 of Rolls & Treves (1998), by Rolls (2008b), and by Treves (1993).
Learning is fast, ‘one-shot’, in that a single presentation of an input pattern $\mathbf{e}$ (producing $\mathbf{y}$) enables the association between the activation of the dendrites (the postsynaptic term $h_i$) and the firing of the recurrent collateral axons $\mathbf{y}^{\text{T}}$ to be learned. Repeated presentation of a pattern vector with small variations is used to obtain the properties of prototype extraction, extraction of central tendency, and noise reduction, because these arise from the averaging process produced by storing very similar patterns in the network.
A.3.3.5 Local learning rule
The simplest learning used in autoassociation neural networks, a version of the Hebb rule, is (as in Equation A.13)
$$\delta w_{ij} = \alpha \, y_i \, y_j$$
The rule is a local learning rule in that the information required to specify the change in synaptic weight is available locally at the synapse, as it is dependent only on the presynaptic firing rate $y_j$ available at the synaptic terminal, and the postsynaptic activation or firing $y_i$ available on the dendrite of the neuron receiving the synapse. This makes the learning rule biologically plausible, in that the information about how to change the synaptic weight does not have to be carried to every synapse from a distant source where it is computed. As with pattern associators, since firing rates are positive quantities, a potentially interfering correlation is induced between different pattern vectors. This can be removed by subtracting the mean of the presynaptic activity from each presynaptic term, using a type of long-term depression. This can be specified as
$$\delta w_{ij} = \alpha \, y_i (y_j - z)$$
where $\alpha$ is a learning rate constant. This learning rule includes (in proportion to $y_i$) increasing the synaptic weight if $(y_j - z) > 0$ (long-term potentiation), and decreasing the synaptic weight if $(y_j - z) < 0$ (heterosynaptic long-term depression). This procedure works optimally if $z$ is the average activity $\langle y_j \rangle$ of an axon across patterns.
Evidence that a learning rule with the general form of Equation A.13 is implemented in at least some parts of the brain comes from studies of long-term potentiation, described in Section A.1.5. One of the important potential functions of heterosynaptic long-term depression is that it allows, in effect, the average of the presynaptic activity to be subtracted from the presynaptic firing rate (see Appendix A3 of Rolls & Treves (1998) and Rolls & Treves (1990)).
A.3.3.6 Capacity
One measure of storage capacity is to consider how many orthogonal (i.e. uncorrelated) patterns could be stored, as with pattern associators. If the patterns are orthogonal, there will be no interference between them, and the maximum number p of patterns that can be stored will be the same as the number N of output neurons in a fully connected network.
With nonlinear neurons used in the network, the capacity can be measured in terms of the number of input patterns $\mathbf{y}$ (produced by the external input $\mathbf{e}$, see Fig. A.19) that can be stored in the network and recalled later whenever the network settles within each stored pattern’s basin of attraction. The first quantitative analysis of storage capacity (Amit, Gutfreund & Sompolinsky 1987) considered a fully connected Hopfield (1982) autoassociator model, in (p.578) which units are binary elements with an equal probability of being ‘on’ or ‘off’ in each pattern, and the number $C$ of inputs per unit is the same as the number $N$ of output units. (Actually it is equal to $N-1$, since a unit is taken not to connect to itself.) Learning is taken to occur by clamping the desired patterns on the network and using a modified Hebb rule, in which the mean of the presynaptic and postsynaptic firings is subtracted from the firing on any one learning trial (this amounts to a covariance learning rule, and is described more fully in Appendix A4 of Rolls & Treves (1998)). With such fully distributed random patterns, the number of patterns that can be learned is (for $C$ large) $p \approx 0.14C = 0.14N$.
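This fully connected binary model can be sketched in a few lines: with the number of stored patterns well below $0.14N$, a pattern is retrieved from a corrupted cue. The network size, pattern load, and corruption level below are illustrative choices, and simple Hebbian (rather than covariance) learning is used, which suffices for these equiprobable ±1 patterns.

```python
import numpy as np

# Sketch of a fully connected binary (+/-1) autoassociator of the kind
# analysed by Amit, Gutfreund & Sompolinsky (1987).
rng = np.random.default_rng(4)
N, p = 200, 10                       # p = 0.05*N, well below 0.14*N
Y = rng.choice([-1.0, 1.0], size=(p, N))

W = Y.T @ Y / N                      # Hebbian outer-product weights
np.fill_diagonal(W, 0.0)             # no self-connections

def recall(cue, n_iter=10):
    y = cue.copy()
    for _ in range(n_iter):
        y = np.sign(W @ y)           # synchronous binary update
        y[y == 0] = 1.0              # break ties deterministically
    return y

# Corrupt 20% of the bits of pattern 0 and let the network relax
cue = Y[0].copy()
flip = rng.choice(N, size=N // 5, replace=False)
cue[flip] *= -1

overlap = recall(cue) @ Y[0] / N
print(overlap)                       # close to 1.0: pattern retrieved
```

Pushing $p$ towards and beyond $0.14N$ in this simulation would show retrieval degrading and then failing, which is the capacity limit quoted in the text.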
Treves & Rolls (1991) have extended this analysis to autoassociation networks that are much more biologically relevant in the following ways. First, some or many connections between the recurrent collaterals and the dendrites are missing (this is referred to as diluted connectivity, and results in a nonsymmetric synaptic connection matrix in which $w_{ij}$ does not equal $w_{ji}$, one of the original assumptions made in order to introduce the energy formalism in the Hopfield model). Second, the neurons need not be restricted to binary threshold neurons, but can have a threshold linear activation function (see Fig. A.3). This enables the neurons to assume real continuously variable firing rates, which are what is found in the brain (Rolls & Tovee 1995; Treves, Panzeri, Rolls, Booth & Wakeman 1999). Third, the representation need not be fully distributed (with half the neurons ‘on’, and half ‘off’), but instead can have a small proportion of the neurons firing above the spontaneous rate, which is what is found in parts of the brain such as the hippocampus that are involved in memory (see Treves & Rolls 1994; Rolls & Treves 1998; Rolls 2008b; Rolls & Treves 2011). Such a representation is defined as being sparse, and the population sparseness $a$ of the representation can be measured, by extending the binary notion of the proportion of neurons that are firing, as
where ${y}_{i}$ is the firing rate of the ith neuron in the set of N neurons. (It should be noted that this is the sparseness of the representation provided by the population of neurons to a single stimulus, though the value is similar to that of the single neuron sparseness, which defines the tuning of a single neuron to a set of stimuli, consistent with the low correlations between the tuning of different neurons (Franco, Rolls, Aggelopoulos & Jerez 2007, Rolls 2008b, Rolls & Treves 2011).) Treves and Rolls (1991) showed that such a network does operate efficiently as an autoassociative network, and can store (and recall correctly) a number of different patterns p as follows:

$$p \approx \frac{C^{\text{RC}}}{a \ln(1/a)}\, k \qquad (A.18)$$
where ${C}^{\text{RC}}$ is the number of synapses on the dendrites of each neuron devoted to the recurrent collaterals from other neurons in the network, and k is a factor that depends weakly on the detailed structure of the rate distribution, on the connectivity pattern, etc., but is roughly in the order of 0.2–0.3.
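A short illustration (hypothetical values; k = 0.235 is an assumed constant within the stated 0.2–0.3 range) of both the sparseness measure a = (Σy_i/N)²/(Σy_i²/N) and the resulting capacity estimate p ≈ k C^RC / (a ln(1/a)):

```python
import math

def population_sparseness(rates):
    """Treves-Rolls population sparseness: a = (sum(y_i)/N)^2 / (sum(y_i^2)/N)."""
    n = len(rates)
    return (sum(rates) / n) ** 2 / (sum(y * y for y in rates) / n)

def attractor_capacity(c_rc, a, k=0.235):
    """Treves-Rolls capacity estimate: p ~= k * C_RC / (a * ln(1/a))."""
    return k * c_rc / (a * math.log(1.0 / a))

# For a binary pattern the sparseness reduces to the fraction of neurons firing:
rates = [30.0] * 20 + [0.0] * 980       # 2% of 1000 neurons firing at 30 Hz
a = population_sparseness(rates)        # close to 0.02

# With 12,000 recurrent collateral connections per neuron and a = 0.02,
# roughly 36,000 patterns can be stored and correctly retrieved
p = attractor_capacity(12000, a)
```

Note how the capacity grows as the representation is made sparser (smaller a), and scales linearly with the number of recurrent collateral connections per neuron, as stated in the text.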
The main factors that determine the maximum number of memories that can be stored in an autoassociative network are thus the number of connections on each neuron devoted to the recurrent collaterals, and the sparseness of the representation. For example, for ${C}^{\text{RC}}=$ 12,000 and $a=0.02$, p is calculated to be approximately 36,000. This storage capacity can be realized, with little interference between patterns, if the learning rule includes some form of heterosynaptic long-term depression that counterbalances the effects of associative (p.579) long-term potentiation (Treves & Rolls 1991; see Appendix A4 of Rolls & Treves (1998)). It should be noted that the number of neurons N (which is greater than ${C}^{\text{RC}}$, the number of recurrent collateral inputs received by any neuron in the network from the other neurons in the network) is not a parameter that influences the number of different memories that can be stored in the network. The implication is that increasing the number of neurons (without increasing the number of connections per neuron) does not increase the number of different patterns that can be stored (see Rolls & Treves (1998), Appendix A4), although it may enable simpler encoding of the firing patterns, for example more orthogonal encoding, to be used, and simpler connectivity rules, by reducing the probability of more than one connection between any pair of neurons, which would otherwise decrease the storage capacity (Rolls 2012f). These latter points may account in part for why, in the brain, there are generally more neurons in a recurrent network than there are connections per neuron. Another advantage of having many neurons in the network may be related to the fact that within any integration time period of 20 ms not all neurons will have fired a spike if the average firing rate is less than 50 Hz.
Having large numbers of neurons may enable the vector of neuronal firing to contribute to recall efficiently even though not every neuron can contribute in a short time period.
The nonlinearity inherent in the NMDA receptor-based Hebbian plasticity present in the brain may help to make the stored patterns more sparse than the input patterns. This may be especially beneficial in increasing the storage capacity of associative networks in the brain, by ensuring that it is especially the relatively few neurons with high firing rates, in the exponential firing rate distributions typical of neurons in sensory systems, that participate in storage (see Rolls & Treves (1998) and Rolls (2008b)).
A.3.3.7 Context
The environmental context in which learning occurs can be a very important factor that affects retrieval in humans and other animals. Placing the subject back into the same context in which the original learning occurred can greatly facilitate retrieval.
Context effects arise naturally in association networks if some of the activity in the network reflects the context in which the learning occurs. Retrieval is then better when that context is present, for the activity contributed by the context becomes part of the retrieval cue for the memory, increasing the correlation of the current state with what was stored. (A strategy for retrieval arises simply from this property. The strategy is to keep trying to recall as many fragments of the original memory situation, including the context, as possible, as this will provide a better cue for complete retrieval of the memory than just a single fragment.)
The effects that mood has on memory, including visual memory retrieval, may be accounted for by backprojections from brain regions such as the amygdala, in which the current mood, providing a context, is represented, to brain regions involved in memory such as the perirhinal cortex, and in visual representations such as the inferior temporal visual cortex (see Rolls & Stringer (2001b) and Section 4.12). The very well-known effects of context in the human memory literature could arise in the simple way just described. An implication of this explanation is that context effects will be especially important at late stages of memory or information processing systems in the brain, for it is there that information from a wide range of modalities is mixed, and some of that information could reflect the context in which the learning takes place. One part of the brain where such effects may be strong is the hippocampus, which is implicated in the memory of recent episodes, and which receives inputs derived from most of the cortical information processing streams, including those involved in spatial representations (see Chapter 6 of Rolls & Treves (1998), Rolls (1996a), and Rolls (1999b)).
It is now known that rewardrelated information is associated with placerelated information in the primate hippocampus, and this provides a particular neural system in which mood context can influence memory retrieval (Rolls & Xiang 2005).
(p.580) A.3.3.8 Memory for sequences
One of the first extensions of the standard autoassociator paradigm that has been explored in the literature is the capability to store and retrieve not just individual patterns, but whole sequences of patterns. Hopfield (1982) suggested that this could be achieved by adding to the standard connection weights, which associate a pattern with itself, a new, asymmetric component, that associates a pattern with the next one in the sequence. In practice this scheme does not work very well, unless the new component is made to operate on a slower time scale than the purely autoassociative component (Kleinfeld 1986, Sompolinsky & Kanter 1986). With two different time scales, the autoassociative component can stabilize a pattern for a while, before the heteroassociative component moves the network, as it were, into the next pattern. The heteroassociative retrieval cue for the next pattern in the sequence is just the previous pattern in the sequence. A particular type of ‘slower’ operation occurs if the asymmetric component acts after a delay $\mathrm{\tau}$. In this case, the network sweeps through the sequence, staying for a time of order $\mathrm{\tau}$ in each pattern.
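A minimal sketch (illustrative only; the strength λ and delay τ of the asymmetric component, the pattern sizes, and the update scheme are assumed values) of this two-time-scale sequence memory:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 200, 4
patterns = rng.integers(0, 2, size=(p, N)).astype(float)
d = patterns - 0.5

# Symmetric autoassociative weights stabilize each pattern...
W_auto = sum(np.outer(x, x) for x in d)
# ...while an asymmetric component associates each pattern with its successor
W_seq = sum(np.outer(d[m + 1], d[m]) for m in range(p - 1))

lam, tau = 1.2, 5      # relative strength and delay of the asymmetric term

history = [patterns[0].copy()]        # start the network in the first pattern
for t in range(1, 4 * tau):
    y_now = history[-1] - 0.5
    y_delayed = history[max(0, t - tau)] - 0.5   # delayed activity drives transitions
    h = W_auto @ y_now + lam * (W_seq @ y_delayed)
    history.append((h > 0).astype(float))

# Identify which stored pattern the network occupies at each time step
overlaps = [int(np.argmax(d @ (y - 0.5))) for y in history]
```

Because the asymmetric component is driven by activity delayed by τ steps, the network dwells in each pattern for a time of order τ before the heteroassociative drive moves it to the next pattern in the sequence.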
If implemented with integrate-and-fire neurons with biologically plausible dynamics, this type of sequence memory will either step through its remembered sequence with uncontrollable speed, or not step through the sequence at all. A proposal that attractor networks with adapting synapses could be used to retain memory sequences is an interesting alternative (Deco & Rolls 2005d).
A.4 Coupled attractor networks
In this Section A.4 an introduction to how attractor networks can interact is given, as this may be relevant to understanding how mood states influence cognitive processing, and vice versa.
It is prototypical of the cerebral neocortical areas that there are recurrent collateral connections between the neurons within an area or module, and forward connections to the next cortical area in the hierarchy, which in turn sends backprojections (see Rolls (2008b)). This architecture, made explicit in Fig. 4.75 on page 217, immediately suggests, given that the recurrent connections within a module, and the forward and backward connections, are likely to be associatively modifiable, that the operation incorporates at least to some extent interactions between coupled attractor (autoassociation) networks. For these reasons, it is important to analyse the rules that govern the interactions between coupled attractor networks. This has been done using the type of formal model introduced here and described by Rolls (2008b) (see also Renart, Parga & Rolls (1999a, 1999b, 2000), Renart, Moreno, Rocha, Parga & Rolls (2001), and Deco & Rolls (2003)).
One boundary condition is when the coupling between the networks is so weak that there is effectively no interaction. This holds when the coupling parameter g between the networks is less than approximately 0.002, where the coupling parameter indicates the relative strength of the intermodular to the intramodular connections, and measures effectively the relative strengths of the currents injected into the neurons by the intermodular relative to the intramodular (recurrent collateral) connections (Renart, Parga & Rolls 1999a). At the other extreme, if the coupling parameter is strong, all the networks will operate as a single attractor network, together able to represent only one state (Renart, Parga & Rolls 1999a). This critical value of the coupling parameter (at least for reciprocally connected networks with symmetric synaptic strengths) is relatively low, in the region of 0.024 (Renart, Parga & Rolls 1999a). This is one reason why corticocortical backprojections are predicted to be quantitatively relatively weak, and for this reason it is suggested that they end on the apical parts of the dendrites of cortical pyramidal cells (see Rolls (2008b)). In the strongly coupled regime, when the system of networks operates as a single attractor, the total storage capacity (the number of patterns (p.581) that can be stored and correctly retrieved) of all the networks will be set just by the number of synaptic connections received from other neurons in the network, a number in the order of a few thousand. This is one reason why connected cortical networks are thought not to act in the strongly coupled regime: the total number of memories that could be represented in the whole of the cerebral cortex would be so small, in the order of a few thousand, depending on the sparseness of the patterns (see Equation A.18) (O’Kane & Treves 1992).
Between these boundary conditions, that is in the region where the intermodular coupling parameter g is in the range 0.002–0.024, it has been shown that interesting interactions can occur (Renart, Parga & Rolls 1999a, Renart, Parga & Rolls 1999b). In a bimodular architecture, with forward and backward connections between the modules, the capacity of one module can be increased, and an attractor is more likely to be found under noisy conditions, if there is a consistent pattern in the coupled attractor. By consistent we mean a pattern that during training was linked associatively, by the forward and backward connections, with the pattern being retrieved in the first module. This provides a quantitative model for understanding some of the effects that backprojections can produce by supporting particular states in earlier cortical areas (Renart, Parga & Rolls 1999a). The total storage capacity of the two networks is, however, in line with O’Kane & Treves (1992), not a great deal greater than the storage capacity of one of the modules alone. Thus the utility of such coupled attractor networks must lie in the help that the attractors provide in falling into a mutually compatible global retrieval state (for example, in a hierarchical system). Another interesting application of such weakly coupled attractor networks is in coupled perceptual and short-term memory systems in the brain, described by Rolls (2008b). Thus the most interesting scenario for coupled attractor networks is when they are weakly coupled, for then interactions occur whereby how well one module responds to its own inputs can be influenced by the states of the other modules, while each module retains partly independent representations. This emphasizes the importance of weak interactions between coupled modules in the brain (Renart, Parga & Rolls 1999a, Renart, Parga & Rolls 1999b, Renart, Parga & Rolls 2000).
If a multimodular architecture is trained with each of many patterns (which might be visual stimuli) in one module associated with one of a few patterns (which might be mood states) in a connected module, then interesting effects due to this asymmetry are found, as described in Section 4.12 and by Rolls & Stringer (2001b).
An interesting issue that arises is how rapidly a system of interacting attractor networks such as that illustrated in Fig. 4.75 settles into a stable state. Is it sufficiently rapid for the interacting attractor effects described to contribute to cortical information processing? It is likely that the settling of the whole system is quite rapid, if it is implemented (as it is in the brain) with synapses and neurons that operate with continuous dynamics, where the time constant of the synapses dominates the retrieval speed, and is in the order of 15 ms for each module, as described by Rolls (2008b) and by Panzeri, Rolls, Battaglia & Lavis (2001). It is shown there that a multimodular attractor network architecture can process information in approximately 15 ms per module (assuming an inactivation time constant for the synapses of 10 ms).
Attractor networks can be coupled together with stronger forward than backward connections. This provides a model of how the prefrontal cortex could map sensory inputs (in one attractor), through intermediate attractors that respond to combinations of sensory inputs and the behavioural responses being made, to further attractors that encode the response to be made (Deco & Rolls 2003). Having attractors at each stage enables the prefrontal cortex to bridge delays between parts of a task. The hierarchical organization of the attractors, achieved by the stronger forward than backward connections, enables the mapping to be from sensory input to motor output. The presence of intermediate attractors with neurons that respond to combinations of the stimuli and the behavioural responses to be made allows a top-down attentional (p.582) input to bias the competition implemented by the intermediate-level attractors, enabling the behaviour to be switched from one cognitive mapping to another (Deco & Rolls 2003). The whole architecture has been modelled at the integrate-and-fire neuronal level, and simulates the activity of the different populations of neurons just described, which are the types of neuron recorded in the prefrontal cortex when monkeys are performing this decision task (see Deco & Rolls (2003)).
The corticocortical backprojection connectivity described can be interpreted as a system that allows the forward-projecting neurons in one cortical area to be linked autoassociatively with the backprojecting neurons in the next cortical area (see Fig. 4.75 and Rolls (2008b)). It is interesting to note that if the forward and backprojection synapses were associatively modifiable, but there were no recurrent connections within each of the modules, then the whole system could still operate (with the right parameters) as an attractor network.
A.5 Reinforcement learning
In supervised networks, an error signal is provided for each output neuron in the network, and whenever an input to the network is provided, the error signals specify the magnitude and direction of the error in the output produced by each neuron. These error signals are then used to correct the synaptic weights in the network in such a way that the output errors for each input pattern to be learned gradually diminish over trials (see Rolls (2008b)). These networks have an architecture that might be similar to that of the pattern associator shown in Fig. A.7, except that instead of an unconditioned stimulus, there is an error correction signal provided for each output neuron. Such a network trained by an error-correcting (or delta) rule is known as a one-layer perceptron. The architecture is not very plausible for most brain regions, in that it is not clear how an individual error signal could be computed for each of thousands of neurons in a network, and fed into each neuron as its error signal and then used in a delta rule synaptic correction (see Rolls & Treves (1998) and Rolls (2008b)).
The architecture can be generalized to a multilayer feedforward architecture with many layers between the input and output (Rumelhart, Hinton & Williams 1986), but the learning is very nonlocal and rather biologically implausible (see Rolls & Treves (1998) and Rolls (2008b)), in that an error term (magnitude and direction) for each neuron in the network must be computed from the errors and synaptic weights of all subsequent neurons in the network that any neuron influences, usually on a trial-by-trial basis, by a process known as error backpropagation. Thus although computationally powerful, an issue with perceptrons and multilayer perceptrons that makes them generally biologically implausible for many brain regions is that a separate error signal must be supplied for each output neuron, and that with multilayer perceptrons, computed error backpropagation must occur.
When a network (or organism) is operating in an environment, usually only a simple binary or scalar signal representing the success or failure of the whole network or organism is received. This is usually action-dependent feedback that provides a single evaluative measure of the success or failure. Evaluative feedback tells the learner whether or not, and possibly by how much, its behaviour has improved; or it provides a measure of the ‘goodness’ of the behaviour. Evaluative feedback does not directly tell the learner what it should have done, and although it may provide an index of the degree (i.e. magnitude) of success, it does not include directional information telling the learner how to change its behaviour towards a target, as does error-correction learning (see Barto (1995)). Partly for this reason, there has been some interest in networks that can be taught with such a single reinforcement signal. In this Section, approaches to such networks are described. It is noted that such networks are classified as reinforcement networks in which there is a single teacher, and that these networks attempt to perform an (p.583) optimal mapping between an input vector and an output neuron or set of neurons. They thus solve the same class of problems as single-layer and multilayer perceptrons. They should be distinguished from pattern-association networks in the brain, which might learn associations between previously neutral stimuli and primary reinforcers such as taste (signals which might be interpreted appropriately by a subsequent part of the brain), but do not attempt to produce arbitrary mappings between an input and an output, using a single reinforcement signal.
A class of problems to which such reinforcement networks might be applied is motor-control problems. It was to such a problem that Barto and Sutton (Barto 1985, Sutton & Barto 1981) applied a reinforcement learning algorithm, the associative reward–penalty algorithm described next. The algorithm can in principle be applied to multilayer networks, although the learning is relatively slow. The algorithm is summarized by Rolls & Treves (1998) and Hertz, Krogh & Palmer (1991). More recent developments in reinforcement learning are described by Sutton & Barto (1998) and reviewed by Dayan & Abbott (2001), and some of these developments are described in Section A.5.3.
A.5.1 Associative reward–penalty algorithm of Barto and Sutton
A.5.1.1 Architecture
The architecture, shown in Fig. A.20, uses a single reinforcement signal r, which is +1 for reward and –1 for penalty. The inputs ${x}_{i}$ take real (continuous) values. The output of a neuron, y, is binary, +1 or –1. The weights on the output neuron are designated ${w}_{i}$.
A.5.1.2 Operation
1. An input vector is applied to the network, and produces activation, h, in the normal way as follows:

$$h = \sum_{j=1}^{C} w_j x_j \qquad (A.19)$$
where $\sum _{j=1}^{C}$ indicates that the sum is over the C input axons (or connections) indexed by j to each neuron.
2. The output y is calculated from the activation with a noise term $\eta$ included. The principle of the network is that if the added noise on a particular trial helps performance, then whatever (p.584) change it leads to should be incorporated into the synaptic weights, in such a way that the next time that input occurs, the performance is improved:

$$y = \begin{cases} +1 & \text{if } h + \eta > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (A.20)$$
where $\eta$ is the noise added on each trial.
3. Learning rule. The weights are changed as follows:

$$\delta w_j = \rho\,(y - \text{E}[y|h])\,x_j \quad \text{if } r = +1$$
$$\delta w_j = \lambda\,(-y - \text{E}[y|h])\,x_j \quad \text{if } r = -1 \qquad (A.21)$$
$\rho$ and $\lambda$ are learning-rate constants. (They are set so that the learning rate is higher when positive reinforcement is received than when negative reinforcement is received.) E$[y|h]$ is the expectation of y given h (usually a sigmoidal function of h with the range $\pm 1$). E$[y|h]$ is a (continuously varying) indication of how the neuron usually responds to the current input pattern: if the actual output y is larger than would normally be expected from $h=\sum_j {w}_{j}{x}_{j}$ (because of the noise term), and the reinforcement is +1, the weight from ${x}_{j}$ is increased; and vice versa. The expectation could be the prediction generated before the noise term is incorporated.
This network combines an associative capacity, with its properties of generalization and graceful degradation, with a single ‘critic’ or error signal for the whole network (Barto 1985). [The term $y-\text{E}[y|h]$ in Equation A.21 can be thought of as an error for the output of the neuron: it is the difference between what occurred, and what was expected to occur. The synaptic weight is adjusted according to the sign and magnitude of the error of the postsynaptic firing, multiplied by the presynaptic firing, and depending on the reinforcement r received. The rule is similar to a Hebb synaptic modification rule (Equation A.6), except that the postsynaptic term is an error instead of the postsynaptic firing rate, and the learning is modulated by the reinforcement.] The network can solve difficult problems (such as balancing a pole by moving from side to side a trolley that supports the pole, as the pole starts to topple). Although described for single-layer networks, the algorithm can be applied to multilayer networks. The learning rate is very slow, however, for there is a single reinforcement signal on each trial for the whole network, not a separate error signal for each neuron in the network as is the case in a perceptron trained with an error rule (see Rolls (2008b) and Rolls & Treves (1998)).
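A toy implementation of the associative reward–penalty scheme (an illustrative sketch, not code from the source: the task, the parameter values, and the choice E[y|h] = tanh(h) are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Task: the output neuron must learn to copy its first input, given only a
# scalar reinforcement r = +1 (correct) or -1 (incorrect) on each trial.
inputs = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], float)
targets = inputs[:, 1]                 # column 0 is a bias input fixed at +1
w = np.zeros(3)
rho, lam = 0.1, 0.01                   # learn faster on reward than on penalty

for _ in range(2000):
    i = rng.integers(4)
    x, t = inputs[i], targets[i]
    h = w @ x                                            # activation
    y = 1.0 if h + rng.normal(0.0, 0.5) > 0 else -1.0    # noisy binary output
    r = 1.0 if y == t else -1.0        # single evaluative reinforcement signal
    expected = np.tanh(h)              # E[y|h]: the usual response to this input
    if r > 0:
        w += rho * (y - expected) * x  # reward: make the noisy output more likely
    else:
        w += lam * (-y - expected) * x # penalty: push towards the opposite output

accuracy = np.mean([np.sign(w @ x) == t for x, t in zip(inputs, targets)])
```

The noise term explores alternative outputs; whenever a noise-driven output is rewarded, the weights move so that the same input produces that output more reliably, which is the principle stated in step 2 above.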
This associative reward–penalty reinforcement-learning algorithm is certainly a move towards biological relevance, in that learning can be achieved with a single reinforcer, which might be broadcast throughout the system by a general projection system. It is not clear yet how a biological system might store the expected output E$[y|h]$ for comparison with the actual output when noise has been added, and might take into account the sign and magnitude of this difference. Nevertheless, this is an interesting algorithm, which is related to the temporal difference reinforcement learning algorithm described in Section A.5.3.
A.5.2 Error correction or delta rule learning, and classical conditioning
In classical or Pavlovian associative learning, a number of different types of association may be learned (see Section 4.6.1.1). This type of associative learning may be performed by networks with the general architecture and properties of pattern associators (see Section A.2 and Fig. A.7). However, the time course of the acquisition and extinction of these associations can be expressed concisely by a modified type of learning rule in which an error correction term is used (introduced in Section A.5.1), rather than the postsynaptic firing y itself as in (p.585) Equation A.6. Use of this modified, error correction, type of learning also enables some of the properties of classical conditioning to be explained (see Dayan & Abbott (2001) for review), and this type of learning is therefore described briefly here. The rule is known in learning theory as the Rescorla–Wagner rule, after Rescorla & Wagner (1972).
The Rescorla–Wagner rule is a version of error correction or delta-rule learning, and is based on a simple linear prediction of the expected reward value, denoted by v, associated with a stimulus representation x ($x=1$ if the stimulus is present, and $x=0$ if the stimulus is absent). The expected reward value v is expressed as the input stimulus variable x multiplied by a weight w:

$$v = w\,x \qquad (A.22)$$
The error in the reward prediction is the difference between the actual reward obtained r and the expected reward v, i.e.

$$\Delta = r - v \qquad (A.23)$$
where Δ is the reward prediction error. The value of the weight w is learned by a rule designed to minimize the expected squared error $\langle (r-v)^{2}\rangle$ between the actual reward r and the predicted reward v. The angle brackets indicate an average over the presentations of the stimulus and reward. The delta rule will perform the required type of learning:

$$\delta w = k\,(r - v)\,x \qquad (A.24)$$
where $\delta w$ is the change of synaptic weight, k is a constant that determines the learning rate, and the term $(r-v)$ is the error Δ in the output (equivalent to an error in the postsynaptic firing, rather than the postsynaptic firing y itself as in Equation A.6). Application of this rule during conditioning, with the stimulus x presented on every trial, results in the weight w approaching the asymptotic limit $w=r$ exponentially over trials as the error Δ becomes zero. In extinction, when $r=0$, the weight (and thus the output of the system) exponentially decays to $w=0$. This rule thus helps to capture the time course over trials of the acquisition and extinction of conditioning. The rule also helps to account for a number of properties of classical conditioning, including blocking, inhibitory conditioning, and overshadowing (see Dayan & Abbott (2001)).
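The acquisition, blocking, and extinction properties just described follow directly from the rule, as a short simulation shows (the learning rate and trial numbers are assumed values):

```python
import numpy as np

k = 0.2                          # learning rate (assumed value)
w = np.array([0.0, 0.0])         # weights for stimuli A and B

# Phase 1: acquisition -- stimulus A alone, reward r = 1 on every trial
for _ in range(50):
    x = np.array([1.0, 0.0])
    w += k * (1.0 - w @ x) * x   # delta w = k * (r - v) * x
w_acq = w[0]                     # approaches the asymptote w = r = 1

# Phase 2: blocking -- compound A+B still rewarded; because A already
# predicts the reward, the prediction error is ~0 and B learns almost nothing
for _ in range(50):
    x = np.array([1.0, 1.0])
    w += k * (1.0 - w @ x) * x

# Phase 3: extinction -- A alone with r = 0; the weight decays exponentially
for _ in range(50):
    x = np.array([1.0, 0.0])
    w += k * (0.0 - w @ x) * x
```

After phase 1 the weight for A has converged close to the asymptote w = r = 1; after phase 2 the weight for B remains near zero (blocking); and after phase 3 the weight for A has decayed back towards zero (extinction).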
How this functionality is implemented in the brain is not yet clear. We consider one suggestion (Schultz et al. 1995b, Schultz 2004, Schultz 2013) after we introduce a further sophistication of reinforcement learning which allows the time course of events within a trial to be taken into account.
A.5.3 Temporal Difference (TD) learning
An important advance in the area of reinforcement learning was the introduction of algorithms that allow learning to occur when the reinforcement is delayed or received over a number of time steps, and that allow effects within a trial to be taken into account (Sutton & Barto 1998, Sutton & Barto 1990). A solution to these problems is the addition of an adaptive critic that learns, through a temporal difference (TD) algorithm, how to predict the future value of the reinforcer. The temporal difference algorithm takes into account not only the current reinforcement just received, but also a temporally weighted average of errors in predicting future reinforcements. The temporal difference error is the degree to which any two temporally adjacent reward predictions are inconsistent (see Barto (1995)). The output of the critic is used as an effective reinforcer instead of the instantaneous reinforcement being received (see Sutton & Barto (1998), Sutton & Barto (1990) and Barto (1995)). This is a solution to the temporal credit assignment problem, and enables future rewards to be predicted. Summaries are provided by Doya (1999), Schultz et al. (1997) and Dayan & Abbott (2001).
(p.586) In reinforcement learning, a learning agent takes an action $\mathbf{u}(t)$ in response to the state $\mathbf{x}(t)$ of the environment, which results in the change of the state

$$\mathbf{x}(t+1) = \text{F}(\mathbf{x}(t), \mathbf{u}(t)) \qquad (A.25)$$
and the delivery of the reinforcement signal, or reward,

$$r(t+1) = \text{R}(\mathbf{x}(t), \mathbf{u}(t)) \qquad (A.26)$$
In the above equations, $\mathbf{x}$ is a vector representation of the inputs ${x}_{j}$, and Equation A.25 indicates that the next state $\mathbf{x}(t+1)$ at time $(t+1)$ is a function F of the inputs and actions at the previous time step, in a closed system. In Equation A.26 the reward at the next time step is determined by a reward function R, which uses the current sensory inputs and the action taken. The time t may refer to time within a trial.
The goal is to find a policy function G that maps sensory inputs $\mathbf{x}$ to actions

$$\mathbf{u}(t) = \text{G}(\mathbf{x}(t)) \qquad (A.27)$$
which maximizes the cumulative sum of the rewards based on the sensory inputs.
The current action $\mathbf{u}(t)$ affects all future states and accordingly all future rewards. The maximization is realized by the use of the value function V of the states to predict, given the sensory inputs $\mathbf{x}$, the cumulative sum (possibly discounted as a function of time) of all future rewards $\text{V}(\mathbf{x})$ (possibly within a learning trial) as follows:

$$\text{V}(\mathbf{x}) = E\left[r(t) + \gamma\, r(t+1) + \gamma^{2}\, r(t+2) + \cdots \right] \qquad (A.28)$$
where $r(t)$ is the reward at time t, and $E[\cdot]$ denotes the expected value of the sum of future rewards up to the end of the trial. $0\le \gamma \le 1$ is a discount factor that makes rewards that arrive sooner more important than rewards that arrive later, according to an exponential decay function. (If $\gamma =1$ there is no discounting.) It is assumed that the presentation of future cues and rewards depends only on the current sensory cues and not the past sensory cues. The right hand side of Equation A.28 is evaluated for the dynamics in Equations A.25–A.27 with the initial condition $\mathbf{x}(t)=\mathbf{x}$. The two basic ingredients in reinforcement learning are the estimation (which we term $\hat{\text{V}}$) of the value function V, and then the improvement of the policy or action $\mathbf{u}$ using the value function (Sutton & Barto 1998).
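For a concrete (hypothetical) reward sequence, the value of the current state is simply the discounted sum of the rewards that follow, as in Equation A.28:

```python
# Discounted return, as in Equation A.28, for a hypothetical reward sequence
gamma = 0.9                                  # discount factor
future_rewards = [0.0, 0.0, 1.0, 0.0, 0.5]   # r(t), r(t+1), ..., to end of trial
value = sum(gamma ** i * r for i, r in enumerate(future_rewards))
# value = 0.9^2 * 1 + 0.9^4 * 0.5, approximately 1.138
```

With $\gamma < 1$, the same rewards contribute less to the value the later they arrive, which is the discounting property described above.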
The basic algorithm for learning the value function is to minimize the temporal difference (TD) error $\Delta(t)$ for time t within a trial, and this is computed by a ‘critic’ for the estimated value predictions $\hat{\text{V}}(\mathbf{x}(t))$ at successive time steps as

$$\Delta(t) = r(t) + \gamma\,\hat{\text{V}}(\mathbf{x}(t)) - \hat{\text{V}}(\mathbf{x}(t-1)) \qquad (A.29)$$
where $\hat{\text{V}}(\mathbf{x}(t)) - \hat{\text{V}}(\mathbf{x}(t-1))$ is the difference in the reward value prediction at two successive time steps, giving rise to the terminology temporal difference learning. If we introduce the term $\hat{v}$ as the estimate of the cumulated reward by the end of the trial, we can define it as a function $\hat{\text{V}}$ of the current sensory input $\mathbf{x}(t)$, i.e. $\hat{v}=\hat{\text{V}}(\mathbf{x})$, and we can also write Equation A.29 as

$$\Delta(t) = r(t) + \gamma\,\hat{v}(t) - \hat{v}(t-1) \qquad (A.30)$$
which draws out the fact that it is the differences between the reward value predictions $\hat{v}$ at successive time steps that are used to calculate Δ.
$\Delta(t)$ is used by the ‘critic’ to improve the estimates $\hat{v}(t)$, and can also be used (by an ‘actor’) to choose appropriate actions.
(p.587) For example, when the value function is represented (in the critic) as

$$\hat{\text{V}}(\mathbf{x}(t)) = \sum_j w_j^{\text{C}}\, x_j(t) \qquad (A.31)$$
the learning algorithm for the (value) weight ${w}_{j}^{\text{C}}$ in the critic is given by

$$\delta w_j^{\text{C}} = k_c\, \Delta(t)\, x_j(t-1) \qquad (A.32)$$
where $\mathrm{\delta}{w}_{j}^{\text{C}}$ is the change of synaptic weight, ${k}_{c}$ is a constant that determines the learning rate for the sensory input ${x}_{j}$, and $\mathrm{\Delta}(t)$ is the Temporal Difference error at time t. Under certain conditions this learning rule will cause the estimate $\stackrel{\u02c6}{v}$ to converge to the true value (Dayan & Sejnowski 1994).
A simple way of improving the policy of the actor is to take a stochastic action

$$u_i(t) = g\!\left(\sum_j w_{ij}^{\text{A}}\, x_j(t) + \mu_i(t)\right) \qquad (A.33)$$
where g() is a scalar version of the policy function G, ${w}_{ij}^{\text{A}}$ is a weight in the actor, and ${\mu}_{i}(t)$ is a noise term. The TD error $\Delta(t)$ as defined in Equation A.29 then signals the unexpected delivery of the reward $r(t)$, or the increase in the state value $\hat{\text{V}}(\mathbf{x}(t))$ above expectation, possibly due to the previous choice of action ${u}_{i}(t-1)$. The learning algorithm for the action weight ${w}_{ij}^{\text{A}}$ in the actor is given by

$$\delta w_{ij}^{\text{A}} = k_a\, \Delta(t)\, \left(u_i(t-1) - \langle u_i \rangle\right)\, x_j(t-1) \qquad (A.34)$$
where $\langle u_i \rangle$ is the average level of the action output, and $k_a$ is a learning rate constant in the actor.
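A matching sketch of the actor in Python, assuming the update form $\delta w_{ij}^{\text{A}} = k_a \, \Delta(t) \, (u_i - \langle u_i \rangle) \, x_j$ with the noise added after the weighted sum (the scalar $g$ is taken as the identity here, and all values are illustrative):

```python
# Minimal sketch of the actor's stochastic action and weight update, assuming
# delta_w_ij = k_a * Delta(t) * (u_i - <u_i>) * x_j; values are illustrative.
import random

def choose_action(w_row, x, noise_sd=0.1):
    """Stochastic action: weighted input plus a noise term (g taken as identity)."""
    return sum(wj * xj for wj, xj in zip(w_row, x)) + random.gauss(0.0, noise_sd)

def update_actor(w_row, x, u, u_mean, delta, k_a=0.1):
    """Strengthen weights whose above-average action was followed by a positive TD error."""
    return [wj + k_a * delta * (u - u_mean) * xj for wj, xj in zip(w_row, x)]

random.seed(0)
w_row = [0.0, 0.0]                   # action weights for one action unit
x = [1.0, 0.0]                       # sensory input: only the first element active
u = choose_action(w_row, x)          # noisy action taken on this trial
delta = 1.0                          # the action was followed by unexpected reward
w_row = update_actor(w_row, x, u, u_mean=0.0, delta=delta)
print(w_row)  # the weight from the active input moves in the direction of the action taken
```

The key design point is that the noise term does the exploring: when a noise-driven deviation of the action above its average is followed by a positive TD error, the weights move so that the deviation becomes part of the policy.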
Thus, the TD error $\Delta(t)$, which signals the error in the reward prediction at time $t$, serves as the main teaching signal both in learning the value function (implemented in the critic) and in selecting actions (implemented in the actor). The usefulness of a separate critic is that it enables the TD error to be calculated from the difference in reward value predictions at two successive time steps, as shown in Equation A.29.
The algorithm has been applied to modelling the time course of classical conditioning (Sutton & Barto 1990). The algorithm effectively allows the future reinforcement predicted from past history to influence the responses made, and in this sense allows behaviour to be guided not just by immediate reinforcement, but also by ‘anticipated’ reinforcements. Different types of temporal difference learning are described by Sutton & Barto (1998). One application is to the analysis of decisions when future rewards are discounted with respect to immediate rewards (Dayan & Abbott 2001, Tanaka, Doya, Okada, Ueda, Okamoto & Yamawaki 2004). Another application is to the learning of sequences of actions to take within a trial (Suri & Schultz 1998).
The possibility that dopamine neuron firing may provide an error signal useful in training neuronal systems to predict reward has been discussed in Section 6.2.4. It has been proposed that the firing of the dopamine neurons can be thought of as an error signal about reward prediction, in that the firing occurs in a task when a reward is given, but then moves forward in time within a trial to the time when a stimulus is presented that can be used to predict when the taste reward will be obtained (Schultz et al. 1995b) (see Fig. 6.4). The argument is that there is no prediction error when the taste reward is obtained if it has been signalled by a preceding conditioned stimulus, and that is why the dopamine midbrain neurons do not (p.588) respond at the time of taste reward delivery, but instead, at least during training, to the onset of the conditioned stimulus (Waelti, Dickinson & Schultz 2001). If a different conditioned stimulus is shown that normally predicts that no taste reward will be given, there is no firing of the dopamine neurons to the onset of that conditioned stimulus.
This hypothesis has been built into models of learning in which the error signal is used to train synaptic connections in dopamine pathway recipient regions (such as presumably the striatum and orbitofrontal cortex) (Houk et al. 1995, Schultz 2004, Schultz et al. 1997, Waelti et al. 2001, Dayan & Abbott 2001, Schultz 2013). Some difficulties with the hypothesis are discussed in Section 6.2.4. The difficulties include the fact that dopamine is released in large quantities by aversive stimuli (see Section 6.2.4); that error computations for differences between the expected reward and the actual reward received on a trial are computed in the primate orbitofrontal cortex, where expected reward, actual reward, and error neurons are all found, and lesions of which impair the ability to use changes in reward contingencies to reverse behaviour (see Section 4.5.5.5); that the tonic, sustained firing of the dopamine neurons in the delay period of a task with probabilistic rewards reflects reward uncertainty, and not the expected reward, nor the magnitude of the prediction error (see Section 6.2.4 and Shizgal & Arvanitogiannis (2003)); and that reinforcement learning is suited to setting up connections that might be required in fixed tasks such as motor habit or sequence learning, for reinforcement learning algorithms seek to set weights correctly in an ‘actor’, but are not suited to tasks where rules must be altered flexibly, as in rapid one-trial reversal, for which a very different type of mechanism is described in Section 4.5.7.
Overall, reinforcement learning algorithms are certainly a move towards biological relevance, in that learning with a single reinforcer can be achieved in systems that might learn motor habits or fixed sequences. Whether a single prediction error is broadcast throughout a neural system by a general projection system, such as the dopamine pathways in the brain, which distribute to large parts of the striatum and the prefrontal cortex, remains to be clearly established.
Notes:
(^{46}) In fact, the terms in which Hebb put the hypothesis were a little different from those of an association memory, in that he stated that if one neuron regularly comes to elicit firing in another, then the strength of the synapses should increase. He had in mind the building of what he called cell assemblies. In a pattern associator, the conditioned stimulus need not, before learning, produce any significant activation of the output neurons. The connections must simply increase if there is associated pre- and postsynaptic firing when, in pattern association, most of the postsynaptic firing is being produced by a different input.
(^{47}) See Appendix 1 of Rolls (2008b). There is no set of synaptic weights in a one-layer net that could solve the problem shown in Fig. A.16. Two classes of patterns are not linearly separable if no hyperplane can be positioned in their N-dimensional space so as to separate them (see Appendix 1 of Rolls (2008b)). The XOR problem has the additional constraint that $A=0, B=0$ must be mapped to Output = 0.
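The linear inseparability of XOR can also be checked numerically: no single threshold unit, whatever its two weights and threshold, reproduces the XOR mapping. The brute-force grid search below (in Python) is an illustration, not a proof; the algebraic argument is that the four threshold constraints on $w_1$, $w_2$, and $\theta$ are mutually contradictory:

```python
# Numerical check that XOR (with A=0, B=0 -> 0) cannot be computed by a
# single threshold unit: no (w1, w2, theta) on a coarse grid satisfies all
# four input-output constraints. The grid is illustrative, not exhaustive.

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def threshold_unit(w1, w2, theta, a, b):
    """One-layer unit: fires (1) if the weighted input exceeds the threshold."""
    return 1 if w1 * a + w2 * b > theta else 0

grid = [i / 4.0 for i in range(-8, 9)]   # weights and thresholds in [-2, 2]
solutions = [
    (w1, w2, th)
    for w1 in grid for w2 in grid for th in grid
    if all(threshold_unit(w1, w2, th, a, b) == out for (a, b), out in XOR.items())
]
print(len(solutions))   # → 0: no one-layer solution exists
```

By contrast, the same unit solves AND with, for example, $w_1 = w_2 = 1$ and $\theta = 1.5$, which is why AND, unlike XOR, is linearly separable.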