(p.283) A1 Introduction to linear algebra for neural networks
(p.283) A1 Introduction to linear algebra for neural networks
In this appendix we review some simple elements of linear algebra relevant to understanding neural networks. This will provide a useful basis for a quantitative understanding of how neural networks operate. A more extensive review is provided by Jordan (1986).
A vector is an ordered set of numbers. An example of a vector is the set of numbers
If we denote the jth element of this vector as Wj, then w 1 = 7, and w 2 = 4. We can denote the whole vector by w. This notation is very economical. If the vector has 10 000 elements, then we can still refer to it in mathematical operations as w. w might refer to the vector of 10000 synaptic weights on the dendrites of a neuron. Another example of a vector is the set of firing rates of the axons that make synapses onto a dendrite, as shown in Fig. 1.2. The rate r′ of each axon forming the input the vector can be indexed by j, and is denoted by r’j The vector would be denoted by r′. (We use the prime after the r simply because we have used r’ to refer elsewhere in this book to the firing of an input axon, and the algebra we introduce next is appropriate for understanding the operations performed on vectors made up by sets of axons.)
Certain mathematical operations can be performed with vectors. We start with the operation which is fundamental to simple models of neural networks, the inner product or dot product of two vectors.
A1.1.1 The inner or dot product of two vectors
Recall the operation of computing the activation h of a neuron from the firing rate on its input axons multiplied by the corresponding synaptic weight:
where Σj indicates that the sum is over the C input axons indexed by j. Denoting the firing rate vector as r′ and the synaptic weight vector as w, we can write
(p.284) If the weight vector is
and the firing rate input vector is
then we can write
Thus in the inner or dot product, we multiply the corresponding terms, and then sum the result. As this is the simple mathematical operation that is used to compute the activation h in the most simplified abstraction of a neuron (see Chapter 1), we see that it is indeed the fundamental operation underlying many types of neural networks. We will shortly see that some of the properties of neuronal networks can be understood in terms of the properties of the dot product. We next review a number of basic aspects of vectors and inner products between vectors.
There is a simple geometrical interpretation of vectors, at least in low-dimensional spaces. If we define, for example, x and y axes at right angles to each other in a two-dimensional space, then any two-component vector can be thought of as having a direction and length in that space which can be defined by the values of the two elements of the vector. If the first element is taken to correspond to x and the second to y, then the x axis lies in the direction [1, 0] in the space, and the y axis in the direction [0, 1], as shown in Fig. A1.1. The line to point [1, 1] in the space then lies at 45° to both axes, as shown in Fig. A1.1.
(p.285) A1.1.2 The length of a vector
Consider taking the inner product of a vector with itself. Then
This is the length of the vector. We can represent this operation in the two-dimensional graph shown in Fig. A1.1. In this case, the coordinates where vector w ends in the space are [1, 1]. The length of the vector (from [0, 0]) to [1, 1] is obtained by Pythagoras’ theorem. Pythagoras’ theorem states that the length of the vector w is equal to the square root of the sum of the squares of the two sides. Thus we define the length of the vector w as
In the [1, 1] case, this value is √2 = 1.414.
A1.1.4 The angle between two vectors: the normalized dot product
The angle between two vectors r′ and w is defined in terms of the inner product as follows:
For example, the angle between two vectors and
where the length of vector r′ is (0·0 + 1·1)1/2 = 1 and of vector w is (1·1 + 1·1)1/2 = 2, is
(p.286) The dot product reflects the similarity between two vectors. Once the length of the vectors is fixed, the higher their dot product, the more similar are the two vectors. By normalizing the dot product, that is by dividing by the lengths of each vector as shown in Eq. A1.4, we obtain a value which varies from–1 to +1. This normalized dot product is then just the cosine of the angle between the vectors, and is a very useful measure of the similarity between any two vectors, because it always lies in the range–1 to +1. It is closely related to the (Pearson product-moment) correlation between any two vectors, as we see if we write the equation in terms of its components
which is just the formula for the correlation coefficient between two sets of numbers with zero mean (or with the mean value removed by subtracting the mean of the components of each vector from each component of that vector).
Now consider two vectors which have a dot product of zero, that is where cos θ = 0 or the angle between the vectors is 90°. Such vectors are described as orthogonal (literally at right angles) to each other. If our two orthogonal vectors were r′ and w, then the activation of the neuron, measured by the dot product of these two vectors, would be zero. If our two orthogonal vectors each had a mean of zero, their correlation would also be zero: the two vectors can then be described as unrelated or independent.
If instead the two vectors had zero angle between them, that is if cos θ = 1, then the dot product would be maximal (given the vectors’ lengths), the normalized dot product would be 1, and the two vectors would be described as identical to each other apart from their length. Note that in this case their correlation would also be 1, even if the two vectors did not have zero mean components.
For intermediate similarities of the two vectors, the degree of similarity would be expressed by the relative magnitude of the dot product, or by the normalized dot product of the two vectors, which is just the cosine of the angle between them. These measures are closely related to the correlation between two vectors.
Thus we can think of the simple operation performed by neurons as measuring the similarity between their current input vector and their weight vector. Their activation, h, is this dot product. It is because of this simple operation that neurons can generalize to similar inputs; can still produce useful outputs if some of their inputs or synaptic weights are damaged or missing, that is they can show graceful degradation or fault tolerance; and can be thought of as learning to point their weight vectors towards input patterns, which is very useful in enabling neurons to categorize their inputs in competitive networks.
A1.1.5 The outer product of two vectors
Let us take a row vector having as components the firing rates of a set of output neurons in a pattern associator or competitive network, which we might denote as r, with components ri and the index i running from 1 to the number N of output neurons, r is then a shorthand for writing down each component, e.g. [7, 2, 5, 2,…], to indicate that the firing rate of neuron 1 is (p.287) 7, etc. To avoid confusion, we continue in the following to denote the firing rate of input neurons as r’j, with a prime. Now recall how the synaptic weights are formed in a pattern associator using a Hebb rule as follows:
where δwij is the change of the synaptic weight wij which results from the simultaneous (or conjunctive) presence of presynaptic firing r’j and postsynaptic firing (or activation) ri and k is a learning rate constant which specifies how much the synapses alter on any one pairing. In a more compact vector notation, this expression would be
where the firing rates on the axons form a column vector with the values, for example, as follows:
The weights are then updated by a change proportional (the k factor) to the following matrix
This multiplication of the two vectors is called the outer, or tensor, product, and forms a matrix, in this case of (alterations to) synaptic weights. In the most compact notation, among those used in linear algebra, if W refers to the matrix of synaptic weights, we can write
where the T superscript is useful because, since the normal definition of a vector we have assumed defines column vectors, if we wish to indicate instead that we are using a row vector, we should indicate the transpose operation. (The row vector [7, 2, 5] can be defined as the transpose of the column vector
Thus we see that the operation of altering synaptic weights in a network can be thought of as forming a matrix of weight changes, which can then be used to alter the existing matrix of synaptic weights.
(p.288) A1.2 Linear and non-linear systems
The operations with which we have been concerned in this appendix so far are linear operations. We should note that if two matrices operate linearly, we can form their product by matrix multiplication, and then replace the two matrices with the single matrix that is their product. We can thus effectively replace two synaptic matrices in a linear multilayer neural network with one synaptic matrix, the product of the two matrices. For this reason, multilayer neural networks if linear cannot achieve more than can be achieved in a single-layer linear network. It is only in non-linear networks that more can be achieved, in terms of mapping input vectors through the synaptic weight matrices, to produce particular mappings to output vectors. Much of the power of many networks in the brain comes from the fact that they are multilayer non-linear networks (in that the computing elements in each network, the neurons, have non-linear properties such as thresholds, and saturation with high levels of input). Because the matrix by matrix multiplication operations of linear algebra cannot be applied directly to the operation of neural networks in the brain, we turn instead back to other aspects of linear algebra, which can help us to understand which classes of pattern can be successfully learned by different types of neural network.
A1.2.1 Linear combinations of vectors, linear independence, and linear separability
We can multiply a vector by a scalar (a single value, e.g. 2) thus:
We can add two vectors thus:
The sum of the two vectors is an example of a linear combination of two vectors, which is in general a weighted sum of several vectors, component by component. Thus, the linear combination of vectors v1, v2,…to form a vector vs is expressed by the sum
where c 1 and c 2 are scalars.
By adding vectors in this way, we can produce any vector in the space spanned by a set of vectors as a linear combination of vectors in the set. If in a set of n vectors at least one can be written as a linear combination of the others, then the vectors are described as linearly dependent. If in a set of n vectors none can be written as a linear combination of the others, then the vectors are described as linearly independent. A linearly independent set of vectors (p.289) has the properties that any vector in the space spanned by the set can be written in only one way as a linear combination of the set, and the space has dimension d = n. In contrast, a vector in a space spanned by a linearly dependent set can be written in an infinite number of equivalent ways, and the dimension d of the space is less than n.
Consider a set of linearly dependent vectors and the d-dimensional space they span. Two subsets of this set are described as linearly separable if the vectors of one subset (that is, their endpoints) can be separated from those of the other by a hyperplane, that is a subspace of dimension d–1. Subsets formed from a set of linearly independent vectors are always linearly separable. For example, the four vectors
are linearly dependent, because the fourth can be formed by a linear combination of the second and third (and also because the first, being the null vector, can be formed by multiplying any other vector by zero—a specific linear combination). In fact, n = 4 and d = 2. If we split this set into subset A including the first and fourth vector, and subset B including the second and third, the two subsets are not linearly separable, because there is no way to draw a line (which is the subspace of dimension d–1 = 1) to separate the two subsets A and B. We have encountered this set of vectors in Chapter 4, and this is the geometrical interpretation of why a one-layer, one-output unit network cannot separate these patterns. Such a network (a simple perceptron) is equivalent to its (single) weight vector, and in turn the weight vector defines a set of parallel d–1 dimensional hyperplanes. (Here d = 2, so a hyperplane is simply a line, any line perpendicular to the weight vector.) No line can be found which separates the first and fourth vector from the second and third, whatever the weight vector the line is perpendicular to, and hence no perceptron exists which performs the required classification (see Fig. 5.4). To separate such patterns, a multilayer network with non-linear neurons is needed (see Chapter 5).
Any set of linearly independent vectors comprise the basis of the space they span, and they are called basis vectors. All possible vectors in the space spanned by these vectors can be formed as linear combinations of these vectors. If the vectors of the basis are in addition mutually orthogonal, the basis is an orthogonal basis, and it is, further, an orthonormal basis if the vectors are chosen to be of unit length. Given any space of vectors with a preassigned meaning to each of their components (for example the space of patterns of activation, in which each component is the activation of a particular unit) the most natural, canonical choice for a basis is the set of vectors in which each vector has one component, in turn, with value 1, and all the others with value 0. For example, in the d = 2 space considered earlier the natural choice is to take as basis vectors
If we had three vectors that were all in the same plane in a three-dimensional (x, y, z) space, then the space they spanned would be less than three-dimensional. For example, the three vectors
all he in the same z plane, and span only a two-dimensional space. (All points in the space could be shown in the plane of the paper in Fig. A1.1.)
A1.2.2 Application to understanding simple neural networks
The operation of simple one-layer networks can be understood in terms of these concepts. A network with a single binary unit can implement a classification between two subspaces of a space of possible input patterns provided that the p actual patterns given as examples of the correct classification are linearly separable. The binary output unit is equivalent to a hyperplane (the hyperplane orthogonal to its synaptic weight vector) that divides the input space in two. The input space is obviously of dimension d, if d is the number of input axons. A one-layer network with a number n of binary output units is equivalent to n hyperplanes, that could potentially divide the input space into as many as 2n regions, each corresponding to input patterns leading to a different output. However the number p of arbitrary examples of the correct classification (each example consisting of an input pattern and its required correct output) that the network may be able to implement is well below 2n, and in fact depends on d not on n. This is because for p too large it will be impossible to position the n weight vectors such that all examples of input vectors for which the first output unit is required to be ‘on’ fall on one side of the hyperplane associated with the first weight vector, all those for which it is required to be ‘off’ fall on the other side, and simultaneously the same holds with respect to the second output unit (a different dichotomy), the third, and so on. The limit on p, which can be thought of also as the number of independent associations implemented by the network, when this is viewed as a heteroassociator (i.e. pattern associator) with binary outputs, can be calculated with the Gardner method and depends on the statistics of the patterns. For input patterns that are also binary, random and with equal probability for each of the two states on every unit, the limit is pc = 2d (for a fully connected system, otherwise the limit is twice the number of connections per output unit; see further Chapter 2 and Appendix A3).
These concepts also help one to understand further the limitation of linear systems, and the power of non-linear systems. Consider the dot product operation by which the neuronal activation h is computed:
If the output firing is just a linear function of the activation, any input pattern will produce a non-zero output unless it happens to be exactly orthogonal to the weight vector. For positive-only (p.291) firing rates and synaptic weights, being orthogonal means taking non-zero values only on non-corresponding components. Since with distributed representations the non-zero components of different input firing vectors will in general be overlapping (i.e. some corresponding components in both firing rate vectors will be on, that is the vectors will overlap), this will result effectively in interference between any two different patterns that for example have to be associated to different outputs. Thus a basic limitation of linear networks is that they can perform pattern association perfectly only if the input patterns r′ are orthogonal; and for positive-only patterns that represent actual firing rates only if the different firing rate vectors are non-overlapping. Further, linear networks cannot of course perform any classification, just because they act linearly. (Classification implies producing output states that are clearly defined as being in one class, and not in other classes.) For example, in a linear network, if a pattern is presented which is intermediate between two patterns v1 and v2, such as c 1V1 + c 2v2, then the output pattern will be a linear combination of the outputs produced by v1 and v2 (e.g. c 1 o 1 + c 2 o 2), rather than being classified into o 1 or o 2. In contrast, with non-linear neurons, the patterns need not be orthogonal, only linearly separable, for a one-layer network to be able to correctly classify the patterns (provided that a sufficiently powerful learning rule is used, see Chapter 5).