Sebastian Seung 9.641 Lecture 7: September 26, 2002
So far we have discussed a number of ways of training a perceptron to perform a task. These have involved a set of input vectors x and desired outputs y. But there are other ways to train a perceptron that focus on extracting regularities in a set of input vectors, rather than learning input-output pairings. This is sometimes called unsupervised learning. What we learn today can also be regarded as part of multivariate statistics. Most of you have learned about the mean and variance of a single scalar variable. Today we'll see how to generalize these statistical quantities to vectors. We'll assume that the perceptron is learning from examples of a random vector x. We'll use the notation ⟨·⟩ to denote an average over x. Sometimes we'll assume that there is a finite training set of vectors (x^1, …, x^m), so that

⟨f(x)⟩ = (1/m) Σ_{μ=1}^m f(x^μ)
Other times we'll assume that x is drawn from some probability density P(x), so that

⟨f(x)⟩ = ∫ dx P(x) f(x)
1 Mean as the minimum of a cost function
Most of you have learned about the mean of a set of numbers or vectors. You may not have learned that the mean is the minimum of a cost function,

E(w) = (1/2) ⟨|w − x|²⟩
This is half the mean squared Euclidean distance between the vector w and the input vector x. Solving the equation ∂E/∂w = 0 yields the minimum of this cost function,

w = ⟨x⟩

A physical picture of the cost function can be given. Imagine that there are m rubber bands. One end of each band is attached to an input vector, and the other is attached to w. We'll consider the idealized case where the rest length of each rubber band is zero. Then E(w) is the average elastic energy in the rubber bands. You can think of this device as an analog computer that minimizes the elastic energy E(w). Therefore the rubber bands will pull w to a position given by the mean of the input vectors.
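As a quick sanity check, here is a small pure-Python sketch (my own toy data, not from the lecture) confirming that the mean minimizes this cost:

```python
# Numerical check that the mean minimizes E(w) = (1/2)<|w - x|^2>
# over a small 2-D data set. Data and helper names are my own.

def cost(w, xs):
    """Average elastic energy of rubber bands from w to each x."""
    return sum(0.5 * sum((wi - xi) ** 2 for wi, xi in zip(w, x))
               for x in xs) / len(xs)

xs = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
mean = tuple(sum(c) / len(xs) for c in zip(*xs))   # componentwise mean

# The cost at the mean is lower than at nearby perturbed points.
for dw in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    w = tuple(m + d for m, d in zip(mean, dw))
    assert cost(mean, xs) < cost(w, xs)
print(mean)
```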
2 Stochastic gradient descent
According to stochastic approximation theory, an online algorithm for finding the mean is

Δw = −η (∂/∂w) (1/2)|w − x|²

This is equivalent to

Δw = η(x − w)

The update rule always moves w in the direction of the latest example. If w is to converge to the mean, it is important for η to decrease with time. To see the importance of this, suppose the opposite, that η is constant in time. Then even if w is initialized exactly at the mean, the update rule will cause w to fluctuate around the mean, because each input vector tries to pull w in its direction. There is a special way of changing the learning rate η with time so that w does converge to the mean. If η = 1/t, then

w(t) = w(t − 1) + (1/t)(x(t) − w(t − 1))

This is equivalent to

w(T) = (1/T) Σ_{t=1}^T x(t)

So w(T) is the average of the first T vectors.
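The η = 1/t update can be sketched in a few lines of Python (my own toy data; `running_mean` is a hypothetical helper name):

```python
# Online mean-finding with eta(t) = 1/t, following the update
# w(t) = w(t-1) + (1/t) * (x(t) - w(t-1)).

def running_mean(stream):
    w = 0.0
    for t, x in enumerate(stream, start=1):
        w = w + (x - w) / t          # eta(t) = 1/t
    return w

data = [4.0, 8.0, 6.0, 2.0]
w = running_mean(data)
# After T steps, w equals the plain average of the first T inputs.
assert abs(w - sum(data) / len(data)) < 1e-12
print(w)
```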
3 Stochastic approximation theory
Stochastic approximation theory generalizes the above result. Consider the update rule

w(t) = w(t − 1) − η(t) (∂/∂w) e(w(t − 1), z(t))

Then every limit point of the sequence w(t) is a stationary point of E(w) = ⟨e(w, z)⟩ if the conditions
Σ_{t=1}^∞ η(t) = ∞

Σ_{t=1}^∞ η(t)² < ∞
are satisfied. In most cases, this implies that w(t) converges to a local minimum of E(w). Note that η(t) = 1/t satisfies these conditions, but other choices are possible.
The second condition requires that η decrease to zero with time. As mentioned above, this has to happen for w to converge at all. The first condition requires that η not decrease too quickly. If η decreases too quickly, then the learning will stop prematurely, so that w will converge, but not to a local minimum of E(w). If a set of examples z^1, …, z^m is presented once, it is not generally true that w ends up at a stationary point of Σ_μ e(w, z^μ). This was true for the case of the quadratic cost function presented earlier, but that was a special case. In general, the set of examples must be presented infinitely many times for convergence to a stationary point.
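The two conditions can be illustrated with a toy scalar example (my own construction): η(t) = 1/t sums to infinity, so w can travel all the way to the minimum, while η(t) = 0.5^t is summable and stalls prematurely.

```python
import itertools

# Toy comparison of step-size schedules on the quadratic cost
# (1/2)(w - x)^2, whose minimum is the mean of the data.

def sgd(etas, steps, data):
    w = 0.0
    stream = itertools.cycle(data)
    for t in range(1, steps + 1):
        x = next(stream)
        w -= etas(t) * (w - x)       # gradient of (1/2)(w - x)^2
    return w

data = [4.0, 6.0]                    # mean is 5.0
w_good = sgd(lambda t: 1.0 / t, 10000, data)   # reaches the mean
w_bad = sgd(lambda t: 0.5 ** t, 10000, data)   # stops short of it
print(w_good, w_bad)
```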
4 Covariance matrix

Given d random variables, one can define the covariance matrix,

C_{ij} = ⟨(x_i − ⟨x_i⟩)(x_j − ⟨x_j⟩)⟩   (1)
       = ⟨x_i x_j⟩ − ⟨x_i⟩⟨x_j⟩        (2)
You should be able to verify that these two forms are equivalent. The covariance quantifies how much the fluctuations in x_i and x_j vary together. If both variables tend to fluctuate in the same direction, then the covariance is positive. If they tend to fluctuate in opposite directions, then the covariance is negative. The covariance vanishes if x_i and x_j are statistically independent. Each diagonal element C_{ii} of the covariance matrix is the variance of x_i. The covariances are the off-diagonal elements. There are d(d − 1)/2 of them, since C is a symmetric matrix. This can be a large number, which gives some indication that multivariate statistics is a more complex subject than univariate statistics. Another important property of C is that it is positive semidefinite. We'll see how to prove this later. It's always possible to shift x to have zero mean by making the subtraction x − ⟨x⟩. From now on, let's assume that x has zero mean. Then the second moment is equivalent to the variance, and the correlation is equivalent to the covariance. Then the covariance matrix is defined by C_{ij} = ⟨x_i x_j⟩. This can be written without indices as

C = ⟨x xᵀ⟩

If we didn't subtract the mean first, we'd have to use the definition ⟨x xᵀ⟩ − ⟨x⟩⟨x⟩ᵀ or ⟨(x − ⟨x⟩)(x − ⟨x⟩)ᵀ⟩. Suppose that we have a set of d-dimensional column vectors (x^1, …, x^m), referred to as observations. They can be packaged into a d × m matrix X. The element X_{iμ} is the ith component of the μth column vector. We'll follow the convention that roman indices refer to components, while greek indices refer to observations. Then

C = (1/m) X Xᵀ
The MATLAB function cov can also be used to ?nd the covariance. It automatically takes care of subtracting the mean of the data.
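For readers without MATLAB, here is a minimal pure-Python version of the same computation: subtract the mean, then form C = (1/m) X Xᵀ. (My own sketch; note that MATLAB's cov normalizes by m − 1 by default rather than m.)

```python
# Covariance of m observations of a d-dimensional vector,
# stored as a list of m length-d vectors.

def covariance(observations):
    m = len(observations)
    d = len(observations[0])
    mean = [sum(x[i] for x in observations) / m for i in range(d)]
    centered = [[x[i] - mean[i] for i in range(d)] for x in observations]
    # C_ij = (1/m) * sum over observations of x_i * x_j (mean-subtracted)
    return [[sum(x[i] * x[j] for x in centered) / m for j in range(d)]
            for i in range(d)]

data = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]   # second component = 2 * first
C = covariance(data)
print(C)   # off-diagonal is positive: the components fluctuate together
```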
5 Principal component
When d is large, the number of covariances can be overwhelming. One way of summarizing the information in the covariance matrix is to extract the eigenvectors corresponding to its largest few eigenvalues. These are called the principal components of the data. An eigenvector of C satisfies the equation

Cv = λv

The eigenvalue λ is guaranteed to be real, as C is symmetric. The principal eigenvector of C is the one with the largest eigenvalue. Suppose that you are given a data set in the form of a matrix X. To find the principal components in MATLAB, you can use the cov function to compute the covariance, and then run the eig command to find the eigenvectors and eigenvalues. Alternatively, you can use the function svd to find a factorization of the form X = Q_1 S Q_2ᵀ, where S is diagonal with the same dimensions as X, and Q_1 and Q_2 are orthogonal matrices. The diagonal elements of S are called the singular values of X, and are arranged in decreasing order. An orthogonal matrix Q is defined as one that satisfies QᵀQ = QQᵀ = I. In other words, its columns form an orthonormal basis, as do its rows. If the data vectors are the columns of X, then the principal components are the first few columns of Q_1. When applied to large matrices, the eig and svd functions can take a lot of computational time. If all you want is just the first few principal components, and not all the eigenvectors, there are faster ways to find them, as we'll see later.
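One standard iterative method for just the top eigenvector is power iteration: repeatedly multiply a vector by C and renormalize. (This is my illustration of the idea, not necessarily the faster method the lecture has in mind.)

```python
# Power iteration for the principal eigenvector of a symmetric matrix C.

def power_iteration(C, steps=200):
    d = len(C)
    v = [1.0] + [0.0] * (d - 1)      # any start not orthogonal to the answer
    for _ in range(steps):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]  # renormalize to unit length
    # Rayleigh quotient v^T C v estimates the largest eigenvalue.
    lam = sum(v[i] * C[i][j] * v[j] for i in range(d) for j in range(d))
    return lam, v

C = [[2.0, 1.0], [1.0, 2.0]]         # eigenvalues 3 and 1
lam, v = power_iteration(C)
print(lam, v)                        # eigenvector is along (1, 1)
```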
6 Principal component as optimum of a cost function
There are two ways to think about the principal component as the result of an optimization. Both ways have geometric interpretations. The principal eigenvector is the maximum of

E_∥(w) = ⟨(wᵀx)²⟩

with respect to w, subject to the constraint wᵀw = 1. To see this, note that ⟨(wᵀx)²⟩ = wᵀCw. A condition for a stationary point can be found by using a Lagrange multiplier: a stationary point of wᵀCw − λwᵀw must satisfy the equation Cw = λw. You may not get much intuition from these algebraic manipulations, so let's consider the geometry of the situation. Every data vector x can be decomposed into the sum of components that are perpendicular and parallel to w,

x = x_⊥ + x_∥

In particular,

x_∥ = w(wᵀx) = (wwᵀ)x

I've written the parentheses in two ways. The first way shows that x_∥ is parallel to w and has length wᵀx. The second way shows that the matrix wwᵀ is a projection operator that extracts the component of x that is parallel to w. Now we can rewrite the cost function as

E_∥(w) = ⟨|x_∥|²⟩

So w is the direction along which the data has maximal variance. The principal eigenvector is also the minimum of

E_⊥(w) = ⟨|x − wwᵀx|²⟩

with respect to w, subject to the constraint wᵀw = 1. This is because ⟨|x − wwᵀx|²⟩ = ⟨|x|²⟩ − wᵀCw, assuming the constraint wᵀw = 1 is satisfied. (Exercise: prove that the principal eigenvector is the minimum of E_⊥ even without the constraint.) The geometrical interpretation is that the principal eigenvector defines the best-fitting straight line to the data.
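The equivalence of the two cost functions is just the Pythagorean theorem: for unit w, ⟨|x_∥|²⟩ + ⟨|x_⊥|²⟩ = ⟨|x|²⟩, so maximizing E_∥ is the same as minimizing E_⊥. A quick numerical check (my own toy data):

```python
# Check that the parallel and perpendicular costs split the total power.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def perp_sq(x, w):
    """Squared length of the residual x - (w^T x) w."""
    c = dot(w, x)
    r = [xi - c * wi for xi, wi in zip(x, w)]
    return dot(r, r)

xs = [(1.0, 2.0), (-3.0, 0.5), (2.0, -1.0)]
w = (0.6, 0.8)                       # unit vector: 0.36 + 0.64 = 1

e_par = sum(dot(w, x) ** 2 for x in xs) / len(xs)
e_perp = sum(perp_sq(x, w) for x in xs) / len(xs)
total = sum(dot(x, x) for x in xs) / len(xs)
print(e_par, e_perp, total)          # e_par + e_perp equals total
```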
7 Oja’s rule
Let's consider the case of a linear perceptron,

y = wᵀx

Then online gradient ascent on E_∥(w) can be written as

Δw = η y x

if we ignore the constraint wᵀw = 1. The problem with this learning rule is that w diverges to infinity. Oja's rule is a way of imposing the constraint:

Δw = η(y x − y² w)
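Oja's rule can be simulated directly (my own choice of data, learning rate, and iteration count): the weight vector settles at unit length along the principal axis of the data.

```python
import itertools
import math

# Oja's rule: delta_w = eta * (y * x - y^2 * w), with y = w^T x.

def oja(data, eta=0.01, steps=20000):
    w = [0.3, 0.1]
    stream = itertools.cycle(data)
    for _ in range(steps):
        x = next(stream)
        y = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (y * xi - y * y * wi) for wi, xi in zip(w, x)]
    return w

# Zero-mean data whose principal axis lies near the (1, 1) direction.
data = [(1.0, 1.0), (-1.0, -1.0), (0.5, 0.3), (-0.5, -0.3)]
w = oja(data)
norm = math.sqrt(sum(wi * wi for wi in w))
print(w, norm)   # norm settles near 1 without explicit normalization
```

Notice that the −η y² w term does the work of the constraint: it shrinks w when y is large, so w need never be explicitly renormalized.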