9512.net

# Mixed-mode learning A method for reducing the number of hidden units in cascade correlation

MIXED-MODE LEARNING: A METHOD FOR REDUCING THE NUMBER OF HIDDEN UNITS IN CASCADE CORRELATION Chin-Chi Teng and Benjamin W. Wah Center for Reliable and High Performance Computing Coordinated Science Laboratory University of Illinois at Urbana-Champaign 1308 West Main Street, Urbana, Illinois 61801, USA wah@manip.crhc.uiuc.edu

Abstract
Research supported by National Science Foundation Grant MIP 92-18715 and Joint Services Electronics Program Contract JSEP N00014-90-J-1270. Proceedings of International Symposium on Arti?cial Neural Networks, December 20-22, 1993, National Chiao Tung University, Hsinchu, Taiwan, ROC

-2-

convergence in CAS is a monotonically nondecreasing function of the number of learning epochs. This paper is organized as follows. Section 2 presents a non-iterative learning algorithm based on linear programming for training single-layer neural networks. Using this algorithm, Section 3 presents MM, our mixed-mode learning mechanism. We then present in Section 4 the application of MM in CAS. Section 5 shows our experimental results, and conclusions are drawn in Section 6. 2. Supervised Learning of a Single-Layer Neural Network as a Linear Programming Problem An application suitable for supervised learning can be modeled as a mapping of an input pattern matrix P (with k patterns, each with m values) into an output pattern matrix D (with k patterns, each with n values). P is, therefore, a k-by-m matrix, and D, a k-by-n matrix. Let P T and D T be the transposes of P and D. We have P T = [ p 0 p 1 . . . pk ?1 ] , and D T = [ d 0 d 1 . . . dk ?1 ] , (1) where pi is the i-th input pattern pi = [ pi, 0 pi, 1 . . . pi,m ?1 ] , and di is the corresponding i-th output pattern di = [ di, 0 di, 1 . . . di,n ?1 ] . The single-layer neural network to be learned performs a mapping from P to D. Assume initially that the number of output units is one (n = 1). Since the classi?cation problems that we study in this paper have binary outputs, we assume that the 40-20-40 criterion is applied; that is, an output is considered a logic ONE if it is larger than 0.6, and ZERO if smaller than 0.4. When the sigmoid function is used as the activation function, the problem of learning the weights of a singlelayer single-output neural network is the same as getting a weight matrix Wm×1 where W T = [ w 0 w 1 . . . wm ?1 ] , (2) such that pi . W =
≥ S ?1 (0.6) ≤S
?1

if di =1 if di =0 ,

(0.4)

for all i = 0, 1, 2, .., k ?1,

(3)

where S ?1 (x) is the inverse sigmoid function. Since S ?1 (0.6) = ?S ?1 (0.4), Eq. (3) can be represented in matrix form as follows: S ?1 (0.6) S ?1 (0.6) . . . S ?1 (0.6)
k×1

? P .W ≥

,

(4)

where

?T ? ? ? P = [ p 0 p 1 . . . pk ?1 ]

and

pi ? pi = ?p i

if di =1 if di =0 . (5)

The process of obtaining W that satis?es Eq. (4) is very similar to that of ?nding a feasible solution of a linear program. The only difference is that elements of W can be negative, whereas variables in linear programming problems have to be positive. To deal with the problem of unrestricted variables, we transform the elements of W into xi , where x j = w j + η, Hence, Eq. (4) becomes such that x j , η ≥ 0, and j = 0, .., m ?1. (6)

-3-

| y0 | y1 | | y2 ? P | . | | . | yk ?1 | where

S ?1 (0.6) S ?1 (0.6) X . ??? ≥ . η (m +1)×1 . S ?1 (0.6)
k×(m +1)

(7)

X T = [ x 0 x 1 . . . xm ?1 ] , and yi =

k ?1 j =0

? Σ ?pi, j .

(8)

Since variables xi and η in Eq. (6) are positive, the values of variables that satisfy Eq. (6) can be obtained by the simplex method. The following example illustrates the procedure for transforming supervised learning into a linear programming problem. Example 1. Consider a single-layer neural network for solving a problem whose input matrix P and desired output matrix D are ?15 ?5 P = ?1 1 ?1 ?3 Eq’s (3) and (4) are applied to obtain ?15 ?5 ? P = ?1 1 ?1 ?3 and a set of inequalities: x0 ?15 ?5 20 0.41 . x 1 ≥ 0.41 , ?1 1 0 ?1 ?3 4 0.41 η where x 0 , x 1 , η ≥ 0. The simplex method is then applied to obtain the following feasible solution. [ x 0 x 1 η ]T = [ 0.0 0.41 0.0 ]T . The weight matrix W is, therefore, x 0 ?η ?0.41 W = x ?η = 0.0 . 1 We can verify the result by computing Dreal = sigmoid (P . W) = [ 1.0 0.6 0.6 ]T . (14) (13) (12) , (10) and 1 D= 1 . 1 (9)

(11)

In the above example, we assume the number of output units to be one, i.e., D is a column vector. If the number of output units is n (that is, the desired output matrix D is a k-by-n matrix), we can decompose the learning ? problem into n sub-problems. In each sub-problem, one of the column vectors of D is used to get a matrix P by ? is then used to get the corresponding column vector of weight matrix W. applying Eq. (3). P

-4-

For the case of one output unit with binary output values, the linear programming formulation will allow a weight matrix W to be found such that sigmoid ( P . W ) ? D if and only if ? R k ∩ span{P} ≠ φ , D where R k = {[x 0 . . . xk ?1 ]T | sign (xi )=sign(di ?0.5)}, D where di = 0 or 1 for all i = 0,1,...,k ?1 , (16) (15)

and span{P} is the m-dimensional space spanned by the k input patterns. In supervised learning, Eq. (15) is seldom satis?ed; i.e., Eq. (4) usually has no feasible solution. However, the original learning problem only requires ?nding a set of weights such that the number of correct output patterns is maximum. Hence, we can change the objective of our simplex formulation to ?nding a set of values for all variables such that the number of inequalities that are satis?ed is maximized. Assume | y0 | y | 1 | . ? A= P | . | | . | yk ?1 |

,

X V = ??? η

,
(m +1)×1

b0 . BT = . , . bk ?1

and bi = S ?1 (0.6).

(17)

k×(m +1)

To obtain weights such that the number of correct output patterns is maximized, we can formulate the corresponding optimization problem as follows.
k ?1

Maximize such that

Σ u 0 jΣ ai, j v j ? bi i =0 =0
≥ bi

m

(18)

j =0

Σ ai, j v j

m

where v j ≥ 0, i = 0, .., k ?1, j = 0, ..., m , and u 0 ( ) is a step function with transition at 0. The overhead for solving the above nonlinear optimization problem is very high, and the process is, therefore, not practical. To reduce this overhead, we use a heuristic to obtain a proper set of weights when there is no feasible solution. The heuristic is similar to Phase I of the simplex method. First, we add a slack variable yi to every constraint inequality, changing every inequality into an equality as follows:

Σ ai, j v j ? yi ? bi = 0, j =0
zi = bi + yi ?
m

m

where yi ≥ 0, and i = 0, ..., k ?1.

(19)

Next, we attach an arti?cial variable zi to each constraint equation, where

Σ ai, j v j j =0

and

zi ≥ 0, i = 0, ..., k ?1,

(20)

in such a way that the set of equalities in Eq. (20) always has a feasible solution. Note that the trivial feasible solution is to set all vi and y j to zero. We then solve the following linear optimization problem using the simplex method.
k ?1

Minimize

i =0

Σ zi
j =0

(21)

such that zi = bi + yi ?

Σ ai, j v j

m

where v j , yi , zi ≥ 0, i = 0, ..., k ?1, j = 0, ..., m.

-5-

Since we minimize Σ zi , the optimal solution of Eq. (21) should have a small value for each zi . Therefore, it is very likely that the inequality zi ≤ yi can be satis?ed, and the original constraint inequalities listed in Eq. (18) are satis?ed with a large probability. This is true because if zi ≤ yi , then

Σ ai, j v j = yi + bi ? zi ≥ bi. j =0

m

(22)

Note that the more inequalities in Eq. (18) are satis?ed, the more correct output patterns are obtained in the output layer. 3. Mixed-Mode Learning Mechanism The objective of the mixed-mode learning mechanism (MM) is to reduce the number of training epochs. This can be achieved if the output matrix is ?exible; that is, learning is faster if there is a large pool of desired output matrices, and learning can stop whenever any one of them is found. By exploiting this property, MM ?rst transforms the original problem of ?nding a one-to-one mapping from P to D into one that ?nds a one-to-one mapping from P to one of a large set of possible output matrices Ireal . It then transforms Ireal to D by using the noniterative learning algorithms for single-layer neural networks described in Section 2. We use Figure 1 to illustrate how MM works. Given an application problem with input matrix P and desired output matrix D, an existing supervised learning algorithm is used to train the original network. During training, a monitor is used to extract intermediate output matrix Ireal periodically, and apply the linear programming method described in Section 2 to map Ireal to D, where Ireal is the set of output values of input and/or hidden units that are connected to the output units. An element (Ireal )i, j in matrix Ireal is the output of the j-th unit that is connected to the output layer when the i-th training pattern is applied. MM requires a smaller number of training epochs since it has a relaxed objective. Learning in MM only tries to get an intermediate output matrix Ireal that can satisfy R k i ∩ span{Ireal } ≠ φ for all i = 0, 1,..., k ?1 d (23)

where di is the i-th column vector of D. Since there are potentially many Ireal ’s satisfying this criterion, learning involves ?nding one of the one-to-many mappings and is much easier. The additional overhead incurred by an extra subnet involves solving a feasible solution of a set of linear inequalities.

original network trained by any supervised learning algorithm

HIDDEN LAYER

INPUT LAYER

P

I

OUTPUT LAYER

real W

D

real

MONITOR: and use real linear mapping to get W ’ Check I W’

OUTPUT LAYER

D’ real

Figure 1. Mixed-mode learning mechanism.

-6-

4. Reduction in the Number of Hidden Units for Cascade Correlation In CAS, the number of hidden units changes in a monotonically non-decreasing fashion as learning progresses. Hence, reducing the number of epochs by MM will lead to an equal or smaller number of hidden units in the resulting network. There are two training phases in CAS: TRAIN_INPUT for adding new hidden units, and TRAIN_OUTPUT for training the weights in the output layer. These two phases are executed alternatively. If a monitor is added to CAS, we note (a) that Ireal cannot be acquired in the TRAIN_INPUT phase, as the new hidden unit has not been decided upon, and (b) that Ireal is frozen in the TRAIN_OUTPUT phase. Consequently, we only have to use the monitor in the ?rst epoch of each TRAIN_OUTPUT phase, and the monitor is seldom activated when MM is applied in CAS. The detailed procedure for applying MM in CAS is summarized as follows. Procedure 1. Applying MM in CAS. Given a training pattern set (including an input matrix P and a desired output matrix D), the procedure has ?ve steps. 1. Train the primary network that includes the input units and the output units by Quickprop. If ||Dreal ? D|| (see Figure 1) is smaller than a prescribed threshold, then stop. 2. Execute the TRAIN_INPUT phase of CAS to add a new hidden unit. 3. Extract Ireal from the network obtained in Step 2, where Ireal is the input matrix to the output layer. 4. Using the linear programming method described in Section 2 to obtain the weights of connections to the output layer. If ||Dreal ? D|| (see Figure 1) is smaller than a prescribed threshold, then stop. 5. Execute the TRAIN_OUTPUT phase of CAS. If ||Dreal ? D|| is smaller than a prescribed threshold, then stop; otherwise go to Step 2. 5. Experimental Results We compare the performance of the original CAS with that of CAS with MM (CAS+MM) using the twospiral problem as a benchmark. The experiment is repeated with thirty trials, each with different initial weights. Figure 2 shows a plot of normalized number of hidden units versus normalized learning time (CPU seconds). Each point (x, y) in this ?gure is normalized with respect to the performance of the ‘‘original’’ CAS starting with an identical initial con?guration and an identical set of random weights. (Using this normalization method, point (1,1) in Figure 2 represents the performance of the ‘‘original’’ CAS.) For the set of points generated using MM+CAS, the following normalizations are performed. # hidden units for net trained by MM+CAS and # hidden units for net trained by CAS learning time for net trained by MM+CAS y = . learning time for net trained by CAS x = (24) (25)

In our experiments, we have found that it is not necessary to activate the monitor early in the learning process, since it is dif?cult for the monitor to ?nd an Ireal to map to D at that time. In Figure 2, points indicated by circles are obtained when the monitor is always activated. In contrast, points indicated by deltas are obtained for cases in which the monitor is activated when the error in the original network is smaller than 20%. We see that this results in savings in learning time. Note that the number of hidden units of networks trained by MM+CAS is never larger than those trained by the ‘‘original’’ CAS, since learning completes when either the ‘‘original’’ CAS completes or the monitor in MM+CAS ?nds a suitable mapping. Finally, Table 1 summarizes the average normalized performance of CAS+MM. When the monitor is activated when the error in training is less than 20%, 13 cases lead to 11% reduction in the number of hidden units (as compared to the ‘‘original’’ CAS), taking only 95% of the training time. There are 17 cases where no improvement in the number of hidden units is found; the penalty in these cases is an additional 16% training time.

-7-

Our empirical results show that, in most trials, MM+CAS converges with less number of hidden units, although learning time is slightly longer. This reduction in the number of hidden units is important, since a trained ANN may have to be applied many times in the future. 6. Conclusions In this paper, we present a method for reducing the number of hidden units required for convergence in the cascade-correlation learning algorithm. Our method is based on searching an intermediate output matrix that can be mapped to the desired output matrix. The validity of this mapping is veri?ed by linear programming that ?nds a feasible solution for a set of linear inequalities. From our experimental results, our proposed mechanism leads to reduced number of hidden units but increased computation time. This trade-off is important, since the resulting smaller network is able to generalize better and is faster when deployed in the target application.

References [1] C.-C. Teng, Mixed-Mode Supervised Learning Algorithms for Multilayered Feed-Forward Neural Networks, M.Sc. Thesis, Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, IL, August 1993. [2] S. E. Fahlman, ‘‘Faster-Learning Variations on Back-Propagation: An Empirical Study,’’ Proc. Connectionist Models Summer School, pp. 38-51, Morgan Kaufmann, Palo Alto, CA, 1988. [3] S. E. Fahlman and Christian Lebiere, The Cascade-Correlation Learning Architecture, Tech. Rep. CMU-CS90-100, School of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, Feb. 1990. [4] D. E. Rummelhart, J. L. McClelland, and the PDP Group (ed.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, MA 1986.

1

? ? ?? ? ? ??oo? o o o o o ?? ? ? oo o o o ?o oo oo

Table 1. Average normalized performance of MM+CAS for solving the two-spiral problem. Monitor Activated Monitor Always Number When Error < 20% Activated Of Cases Units Time Units Time 13 0.893 0.959 0.893 1.091 17 1 1.164 1 1.310

0.95 Norm. No. Of 0.9 Hidden Units 0.85
? ? o ? ? ?? ? ? ? ? o o o oo o o o

? ? o o

o o

0.8

?

0.8

1 1.2 1.4 Normalized Learning Time

Figure 2. Performance of MM+CAS for solving the twospiral problem normalized with respect to performance of the original CAS. (Circles represent results when the monitor is always activated; deltas represent results when the monitor is activated when the error in the original network is less than 20%.)