

Constructive Neural Network Learning Algorithms for Multi-Category Pattern Classification
Technical Report TR95-15. Rajesh Parekh, Jihoon Yang, and Vasant Honavar. Artificial Intelligence Research Group, Department of Computer Science, 226 Atanasoff Hall, Iowa State University, Ames, IA 50011, U.S.A.
{parekh|yang|honavar}@cs.iastate.edu

October 6, 1995
Abstract

Constructive learning algorithms offer an approach for the incremental construction of potentially near-minimal neural network architectures for pattern classification tasks. Such algorithms help overcome the need for an ad-hoc and often inappropriate choice of network topology in the use of algorithms that search for a suitable weight setting in an otherwise a-priori fixed network architecture. Several such algorithms proposed in the literature have been shown to converge to zero classification errors (under certain assumptions) on a finite, non-contradictory training set in a 2-category classification problem. This paper explores multi-category extensions of several constructive neural network learning algorithms for pattern classification. In each case, we establish convergence to zero classification errors on a multi-category classification task (under certain assumptions). Results of experiments with non-separable multi-category data sets demonstrate the feasibility of this approach to multi-category pattern classification and also suggest several interesting directions for future research.

(This research was partially supported by the National Science Foundation grant IRI-9409580 to Vasant Honavar.)

1 Introduction
Multi-layer networks of threshold logic units (TLUs), or multi-layer perceptrons (MLPs), offer a particularly attractive framework for the design of pattern classification and inductive knowledge acquisition systems for a number of reasons, including: potential for parallelism and fault tolerance; significant representational and computational efficiency over disjunctive normal form (DNF) functions and decision trees [Gallant, 93]; and simpler digital hardware realizations than their continuous counterparts.

A single TLU, also known as a perceptron, can be trained to classify a set of input patterns into one of two classes. A TLU is an elementary processing unit that computes a function of the weighted sum of its inputs. Assuming that the patterns are drawn from an N-dimensional Euclidean space, the output O^p of a TLU with weight vector W, in response to a pattern X^p, is a bipolar hard-limiting function of W · X^p:

O^p = 1 if W · X^p > 0
    = -1 otherwise

Such a TLU or threshold neuron implements an (N-1)-dimensional hyperplane given by W · X = 0, which partitions the N-dimensional Euclidean pattern space defined by the coordinates x_1, ..., x_N into two regions (or two classes). Given a set of examples S = S+ ∪ S-, where S+ = {(X^p, C^p) | C^p = 1} and S- = {(X^p, C^p) | C^p = -1} (C^p is the desired output of the pattern classifier for the input pattern X^p), the goal of a perceptron training algorithm is to find a weight vector Ŵ such that Ŵ · X^p > 0 for all X^p ∈ S+ and Ŵ · X^p ≤ 0 for all X^p ∈ S-. If such a weight vector Ŵ exists for the pattern set S, then S is said to be linearly separable. Several iterative algorithms are available for finding such a Ŵ if one exists [Nilsson, 65; Duda & Hart, 73]. Most of these are variants of the perceptron weight update rule:

W ← W + η (C^p - O^p) X^p

where η > 0 is the learning rate. However, when S is not linearly separable, such algorithms behave poorly (i.e., the classification accuracy on the training set can fluctuate wildly from iteration to iteration). Several extensions of the perceptron weight update rule, e.g., the pocket algorithm [Gallant, 93], the thermal perceptron [Frean, 90], the loss minimization algorithm [Hrycej, 92], and the barycentric correction procedure [Poulard, 95], are designed to find a reasonably good weight vector that correctly classifies a large fraction of the training set S when S is not linearly separable, and to converge to zero classification errors when S is linearly separable. For a detailed comparison of single-TLU training algorithms see [Yang et al., 95]. Recently, [Siu et al., 95] established the necessary and sufficient conditions for a training set S to be non-separable. They have also uncovered structure within a non-separable set S and have shown that the problem of identifying a largest linearly separable subset S_Sep of S is NP-complete. It is widely conjectured that no polynomial time algorithms exist for NP-complete problems [Garey & Johnson, 1979]. Thus, we rely on heuristic algorithms such as the pocket algorithm or the thermal perceptron to correctly classify as large a subset of training patterns as possible within the given constraints (such as limited training time).
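A minimal sketch of the bipolar TLU and the perceptron update rule just described is given below (the two-pattern training set, learning rate, and iteration count are illustrative choices, not taken from the report):

```python
import numpy as np

def tlu_output(w, x_aug):
    """Bipolar hard-limiting threshold unit: +1 if W.X > 0, -1 otherwise."""
    return 1.0 if w @ x_aug > 0 else -1.0

def perceptron_step(w, x_aug, c, lr=1.0):
    """One application of the perceptron rule W <- W + eta (C - O) X."""
    o = tlu_output(w, x_aug)
    return w + lr * (c - o) * x_aug

# Example: two bipolar patterns, one per class, with augmented component x_0 = 1
X = np.array([[1.0,  1.0,  1.0],
              [1.0, -1.0, -1.0]])
C = np.array([1.0, -1.0])
w = np.zeros(3)
for _ in range(10):
    for x, c in zip(X, C):
        w = perceptron_step(w, x, c)
print(w, [tlu_output(w, x) for x in X])   # the learned weights separate the two patterns
```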

When S is not linearly separable, however, a multi-layer network of TLUs is needed to learn a complex decision boundary that correctly classifies all the training examples. The focus of this paper is on constructive or generative learning algorithms that incrementally construct networks of threshold neurons to correctly classify a given (typically non-separable) training set. Some of the motivations for studying such algorithms [Honavar, 90; Honavar & Uhr, 93] include:

Limitations of learning by weight modification alone within an otherwise a-priori fixed network topology: Weight modification algorithms typically search for a solution weight vector that satisfies some desired performance criterion (e.g., classification error). For this approach to succeed, such a solution must lie within the weight space being searched, and the search procedure employed must in fact be able to locate it. This means that unless the user has adequate problem-specific knowledge that could be brought to bear upon the task of choosing an adequate network topology, the process is reduced to one of trial and error. Constructive algorithms can potentially offer a way around this problem by extending the search for a solution, in a controlled fashion, to the space of network topologies.

Matching the complexity of the network to the intrinsic complexity of the classification task: It is desirable that a learning algorithm construct networks whose complexity (as measured in terms of relevant criteria such as the number of nodes, number of links, connectivity, etc.) is commensurate with the intrinsic complexity of the classification task (implicitly specified by the training data). Smaller networks yield efficient hardware implementations, and, everything else being equal, the more compact the network, the more likely it is to exhibit better generalization properties. Constructive algorithms can potentially discover near-minimal networks for correct classification of a given data set.

Estimation of the expected-case complexity of pattern classification tasks: Many pattern classification tasks are known to be computationally hard. However, little is known about the expected-case complexity of classification tasks that are encountered, and successfully solved, by living systems, primarily because it is difficult to mathematically characterize the statistical distribution of such problem instances. Constructive algorithms, if successful, can provide useful empirical estimates of the expected-case complexity of real-world pattern classification tasks.

Trade-offs among performance measures: Different constructive learning algorithms offer natural means of trading off certain subsets of performance measures (e.g., learning time) against others (network size, generalization).

Incorporation of prior knowledge: Constructive algorithms provide a natural framework for incorporating problem-specific knowledge (e.g., in the form of production rules) into the initial network configuration, or heuristic knowledge (e.g., about general topological constraints on the network) into the network construction algorithm.

A number of constructive algorithms that incrementally construct networks of threshold neurons for 2-category pattern classification tasks have been proposed in the literature. These include the tower and pyramid [Gallant, 90], tiling [Mezard & Nadal, 89], upstart [Frean, 90], and perceptron cascade [Burgess, 94] algorithms. They are all based on the idea of transforming the hard task of determining the necessary network topology and weights into two subtasks:

1. Incremental addition of one or more threshold neurons to the network when the existing network topology fails to achieve the desired classification accuracy on the training set.

2. Training the added threshold neuron(s) using some variant of the perceptron training algorithm (e.g., the pocket algorithm).

Different constructive algorithms differ in their choices regarding: restrictions on input representation (e.g., binary, bipolar, or real-valued inputs); when to add a neuron; where to add a neuron; the connectivity of the added neuron; weight initialization for the added neuron; how to train the added neuron (or a subnetwork affected by the addition); and so on. The interested reader is referred to [Chen et al, 95] for an analysis (in geometrical terms) of the decision boundaries generated by some of these constructive learning algorithms.

Each of these algorithms can be shown to converge to networks that yield zero classification errors on any given training set in the 2-category case. The convergence proof in each case is based on the ability of the variant of the perceptron training algorithm to find a weight setting for each newly added neuron (or set of neurons) such that the number of pattern misclassifications is reduced by at least one each time a unit (or a set of units) is added and trained. We will refer to such a variant of the perceptron algorithm as L_W. In practice, the performance of a constructive algorithm depends partly on the choice of L_W (which may be the pocket algorithm, the thermal perceptron, or some similar algorithm) and its ability to find weight settings that reduce the total number of misclassifications each time a new unit is added to the network and trained.

Pattern classification tasks that arise in practice often require assigning patterns to one of M (M > 2) classes. Although in principle an M-category classification task can be reduced to an equivalent set of M 2-category classification tasks (each with its own training set constructed from the given M-category training set), a better approach might be one that takes into account the inter-relationships between the M output classes. For instance, the knowledge of membership of a pattern X^p in category i can be used by the learning algorithm to effectively rule out its membership in a different category j (j ≠ i), and any internal representations learned in inducing the structure of category i can therefore be exploited in inducing the structure of category j (j ≠ i).

Thus, extensions of 2-category constructive learning algorithms to deal with multi-category classification tasks are clearly of interest. However, in most cases such extensions have not been explored, while in other cases only some preliminary ideas (not supported by detailed theoretical or experimental analysis) for possible multi-category extensions of 2-category algorithms are available in the literature. Against this background, the focus of this paper is on provably convergent multi-category learning algorithms for the construction of networks of threshold neurons for pattern classification.

The rest of the paper is organized as follows: Section 2 explores multi-category extensions of the tower, pyramid, upstart, tiling, and perceptron cascade algorithms; in each case, convergence to zero classification errors is established. Section 3 presents some preliminary results on two classification tasks (an artificial task involving random boolean mappings, and a real-world task of classifying the iris data set). Section 4 concludes with a summary and a discussion of some directions for future research.

2 Multi-Category Constructive Learning Algorithms
This section outlines several multi-category constructive learning algorithms. Some of these are relatively straightforward extensions of the corresponding 2-category algorithms, whereas others entail non-trivial modifications and present several interesting design choices. A proof of convergence to zero classification errors on any given finite, non-contradictory training set is provided in each case.

2.1 Notation

The following notation is used in the convergence proofs of the constructive learning algorithms.

Number of input neurons: N
Number of output neurons (equal to the number of categories): M
Categories: 1, 2, ..., M
Number of units in layer A: U_A
Weight vector for neuron j: W_j
Net input for neuron j of layer A in response to pattern X^p: n^p_{A_j}
Threshold (or bias) for unit i of layer A: W_{A_i,0}
Connection weight between unit i of layer A and unit j of layer B: W_{A_i,B_j}
Indexing of the neurons of layer A: A_1, A_2, ..., A_{U_A}; for the input layer, A = I
Augmented pattern vector: X^p = <X^p_0, X^p_1, ..., X^p_N>, with X^p_0 = 1 for all p
Target output for pattern X^p: C^p = <C^p_1, C^p_2, ..., C^p_M>, where C^p_i = 1 if X^p belongs to category i (i ≤ M) and C^p_i = -1 otherwise
Observed output for pattern X^p at layer A: O^p_A = <O^p_{A_1}, O^p_{A_2}, ..., O^p_{A_k}>, where U_A = k
A pattern is said to be correctly classified at layer A when C^p = O^p_A
Number of patterns wrongly classified at layer A: e_A
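As a small illustration of this notation (a sketch only; the example pattern and class index are arbitrary), the augmented pattern vector and the bipolar target vector can be built as follows:

```python
import numpy as np

def augment(x):
    """Prepend the constant component X_0 = 1 to a pattern vector."""
    return np.concatenate(([1.0], np.asarray(x, dtype=float)))

def target_vector(category, M):
    """Bipolar one-vs-rest target: C_i = +1 for the pattern's category, -1 otherwise."""
    c = -np.ones(M)
    c[category] = 1.0
    return c

# Example: a 5-dimensional bipolar pattern belonging to category 1 of M = 3
x = np.array([1, -1, 1, 1, -1])
print(augment(x))            # [ 1.  1. -1.  1.  1. -1.]
print(target_vector(1, 3))   # [-1.  1. -1.]
```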

2.2 Tower Algorithm

The 2-category tower algorithm [Gallant, 90] constructs a tower of TLUs. The bottom-most neuron in the tower receives N inputs, one for each component of the pattern vector. The tower is built by successively adding neurons to the network and training them using L_W until the desired classification accuracy is achieved. Each newly added neuron becomes the new output neuron and receives as input each of the N components of the input pattern as well as the output of the neuron immediately below itself. The extension of the 2-category tower algorithm to deal with multiple (M) output categories is rather straightforward: it can be accomplished by simply adding M neurons each time a new layer is added to the tower. Each neuron in the newly added layer (which becomes the new output layer) receives inputs from the N input neurons as well as the M neurons in the preceding layer. The topology of the resulting multi-category tower network is shown in Fig. 1.

[Figure 1: Tower Network - input layer (N neurons), hidden layers of M neurons each, and an output layer of M neurons.]

2.2.1 Multi-Category Tower Algorithm

1. Set the current output layer index L = 0.

2. Repeat the following steps until the desired training accuracy is achieved or the maximum number of hidden layers allowed is exceeded:

3. L = L + 1. Add M output neurons to the network at layer L. This forms the new output layer of the network. Connect each neuron in layer L to each of the input neurons and to each neuron in the preceding layer L-1 (if one exists).

4. Train the weights associated with each of the newly added neurons in layer L (the rest of the weights in the network are left unchanged).
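A compact sketch of this loop is given below. It is not the authors' implementation: train_tlu_pocket is a toy stand-in for the single-unit trainer L_W (a bare perceptron loop that keeps the best weights seen, not the ratcheted pocket algorithm used in the experiments of Section 3), X_aug is assumed to hold bipolar patterns already augmented with X_0 = 1, and C holds the ±1 target vectors of Section 2.1.

```python
import numpy as np

def train_tlu_pocket(inputs, targets, epochs=200, lr=1.0, rng=None):
    """Toy stand-in for L_W: run perceptron updates and keep the weight vector
    that classified the most training patterns so far."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.choice([-1.0, 0.0, 1.0], size=inputs.shape[1])
    best_w, best_correct = w.copy(), -1
    for _ in range(epochs):
        for i in rng.permutation(len(inputs)):
            o = 1.0 if w @ inputs[i] > 0 else -1.0
            if o != targets[i]:
                w = w + lr * (targets[i] - o) * inputs[i]
        correct = int(np.sum(np.where(inputs @ w > 0, 1.0, -1.0) == targets))
        if correct > best_correct:
            best_correct, best_w = correct, w.copy()
    return best_w

def tower_outputs(x_aug, layers):
    """Forward pass through the tower: each layer sees the augmented inputs plus
    the M outputs of the layer immediately below it."""
    prev = np.zeros(0)
    for W in layers:
        prev = np.where(W @ np.concatenate([x_aug, prev]) > 0, 1.0, -1.0)
    return prev

def multi_category_tower(X_aug, C, max_layers=10):
    """Add a layer of M output units at a time until the training set is learned
    or the layer budget is exhausted (steps 1-4 above)."""
    M = C.shape[1]
    layers = []
    for _ in range(max_layers):
        # Inputs for the new layer: augmented pattern plus the current top layer's outputs.
        feats = np.array([np.concatenate([x, tower_outputs(x, layers)]) for x in X_aug])
        layers.append(np.array([train_tlu_pocket(feats, C[:, j]) for j in range(M)]))
        if np.all(np.array([tower_outputs(x, layers) for x in X_aug]) == C):
            break
    return layers
```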

2.2.2 Convergence Proof

Theorem 1:

There exists a weight setting for the neurons in a newly added layer L of the M-category tower network such that the number of patterns misclassified by the tower with L layers is less than the number of patterns misclassified by the same tower prior to the addition of the L-th layer (i.e., for all L > 1, e_L < e_{L-1}).

Proof:

Assume that a pattern X^p was not correctly classified at layer L-1 (i.e., C^p ≠ O^p_{L-1}). Consider the following weight setting for the neurons in layer L (the newly added output layer):

W_{L_j,0} = C^p_j                for j = 1, ..., M
W_{L_j,I_i} = C^p_j X^p_i        for j = 1, ..., M and i = 1, ..., N
W_{L_j,L-1_j} = N
W_{L_j,L-1_k} = 0                for k = 1, ..., M, k ≠ j

[Figure 2: Weight setting for the j-th output neuron in the tower network - bias C^p_j, input-layer connections C^p_j X^p_1, ..., C^p_j X^p_N, weight N to unit j of layer L-1, and weight 0 to the other units of layer L-1.]

For the pattern X^p, the net input n^p_{L_j} of the j-th unit in layer L is:

n^p_{L_j} = W_{L_j,0} + Σ_{i=1}^{N} W_{L_j,I_i} X^p_i + Σ_{i=1}^{M} W_{L_j,L-1_i} O^p_{L-1_i}
          = C^p_j + C^p_j N + N O^p_{L-1_j}

If C^p_j = -O^p_{L-1_j}:
    n^p_{L_j} = C^p_j
    O^p_{L_j} = sgn(n^p_{L_j}) = C^p_j

If C^p_j = O^p_{L-1_j}:
    n^p_{L_j} = (2N + 1) C^p_j
    O^p_{L_j} = sgn(n^p_{L_j}) = C^p_j

Thus we have shown that the pattern X^p is corrected at layer L. Now consider a pattern X^q ≠ X^p. Clearly, X^p · X^q ≤ N - 2 for bipolar patterns.

n^q_{L_j} = W_{L_j,0} + Σ_{i=1}^{N} W_{L_j,I_i} X^q_i + Σ_{i=1}^{M} W_{L_j,L-1_i} O^q_{L-1_i}
          = C^p_j + C^p_j (X^p · X^q) + N O^q_{L-1_j}

Since |C^p_j + C^p_j (X^p · X^q)| ≤ N - 1 < N, the sign of n^q_{L_j} is determined by the term N O^q_{L-1_j}, and hence

O^q_{L_j} = sgn(n^q_{L_j}) = O^q_{L-1_j}

Thus, for all patterns X^q ≠ X^p, the outputs produced at layers L and L-1 are identical. We have shown the existence of a weight setting that is guaranteed to yield a reduction in the number of misclassified patterns whenever a new layer is added to the tower. We rely on the algorithm L_W to find such a weight setting. Since the training set is finite in size, eventual convergence to zero errors is guaranteed. □
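To make the construction in Theorem 1 concrete, the following short numerical check (a sketch, not part of the original report; the pattern count, dimensions, and random targets are arbitrary) builds the prescribed weights for a chosen pattern X^p and verifies that X^p is corrected while every other pattern keeps its layer L-1 output.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 7, 3

# Distinct bipolar patterns (the augmented component X_0 = 1 enters via the bias term).
X = np.unique(rng.choice([-1.0, 1.0], size=(20, N)), axis=0)
C = np.where(rng.random((X.shape[0], M)) > 0.5, 1.0, -1.0)       # arbitrary targets
O_prev = np.where(rng.random((X.shape[0], M)) > 0.5, 1.0, -1.0)  # arbitrary layer L-1 outputs

p = 0
# Theorem 1 weight setting for output unit j of the new layer L:
#   bias = C_j^p, input weights = C_j^p * X^p, weight N to unit j of layer L-1, 0 elsewhere.
bias = C[p]                                     # shape (M,)
W_in = C[p][:, None] * X[p][None, :]            # shape (M, N)

net = bias[None, :] + X @ W_in.T + N * O_prev   # net input of each new output unit, all patterns
O_new = np.where(net > 0, 1.0, -1.0)

assert np.all(O_new[p] == C[p])                 # X^p is now classified correctly
others = np.arange(X.shape[0]) != p
assert np.all(O_new[others] == O_prev[others])  # all other patterns keep their old outputs
print("Theorem 1 weight setting verified on", X.shape[0], "patterns")
```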

2.3 Pyramid Algorithm

The 2-category pyramid algorithm [Gallant, 90] constructs a network in a manner similar to the tower algorithm, except that each newly added neuron receives input from each of the N input neurons as well as the outputs of all the neurons at each of the preceding layers. The newly added neuron constitutes the new output of the network. As in the case of the tower algorithm, the extension of the 2-category pyramid algorithm to handle M output categories is quite straightforward, with each newly added layer of M neurons receiving inputs from the N input neurons and the outputs of each neuron in each of the previously added layers. The resulting M-category pyramid network is shown in Fig. 3.

[Figure 3: Pyramid Network - input layer (N neurons), hidden layers of M neurons each, and an output layer of M neurons; each added layer is connected to the input layer and to every preceding layer (individual and group connections).]

2.3.1 Multi-Category Pyramid Algorithm

1. Set the current output layer index L = 0.

2. Repeat the following steps until the desired training accuracy is achieved or the maximum number of hidden layers allowed is exceeded:

3. L = L + 1. Add M neurons to the network at layer L. This forms the new output layer of the network. Connect each neuron in layer L to the N inputs and to each neuron in each of the previous layers.

4. Train the weights associated with each of the newly added neurons in layer L (the rest of the weights in the network are left unchanged).
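The only structural difference from the tower is the fan-in of a newly added neuron. The small sketch below (an illustration with assumed values, not from the report) counts the connections a layer-L neuron receives in each architecture, ignoring the bias term:

```python
def fan_in(N, M, L, architecture):
    """Number of incoming connections (ignoring the bias) of a neuron added at layer L."""
    if architecture == "tower":
        return N + (M if L > 1 else 0)   # inputs + the layer immediately below
    if architecture == "pyramid":
        return N + M * (L - 1)           # inputs + every previously added layer
    raise ValueError(architecture)

# Example with N = 22 inputs and M = 3 categories (values chosen arbitrarily)
for L in range(1, 5):
    print(L, fan_in(22, 3, L, "tower"), fan_in(22, 3, L, "pyramid"))
```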

2.3.2 Convergence Proof

Theorem 2:

There exists a weight setting for the neurons in the newly added layer L of the M-category pyramid network such that the number of patterns misclassified by the pyramid with L layers is less than the number of patterns misclassified by the same pyramid prior to the addition of the L-th layer (i.e., for all L > 1, e_L < e_{L-1}).

Proof:

Assume that a pattern X^p was not correctly classified at layer L-1 (i.e., C^p ≠ O^p_{L-1}). Consider the following weight setting for the neurons in layer L (the newly added output layer):

W_{L_j,0} = C^p_j                for j = 1, ..., M
W_{L_j,I_i} = C^p_j X^p_i        for j = 1, ..., M and i = 1, ..., N
W_{L_j,L-i_k} = 0                for i = 2, ..., L-1; j = 1, ..., M; k = 1, ..., M
W_{L_j,L-1_j} = N
W_{L_j,L-1_k} = 0                for k = 1, ..., M, k ≠ j

[Figure 4: Weight setting for the j-th output neuron in the pyramid network - bias C^p_j, input-layer connections C^p_j X^p_1, ..., C^p_j X^p_N, weight N to unit j of layer L-1, weight 0 to the other units of layer L-1, and weight 0 to all neurons in layers L-2, L-3, ..., 1.]

This choice of weights reduces an M-category pyramid network to an M-category tower network. The convergence proof follows directly from the convergence proof of the tower algorithm.

□

2.4 Upstart Algorithm

The 2-category upstart algorithm [Frean, 90] constructs a binary tree of threshold neurons. A simple extension of this idea to deal with M output categories would be to construct M independent binary trees (one for each output class). This approach fails to exploit the inter-relationships that may exist between the M different outputs. We therefore follow an alternative approach (suggested by Frean) using a single hidden layer instead of a binary tree. Since the original upstart algorithm was presented for the case of binary valued patterns and TLUs implementing the binary hardlimiter function, we present our extension of this algorithm to M classes under the same binary valued framework.^1

First, an output layer of M neurons is trained using the chosen L_W algorithm. If all the patterns are correctly classified, the procedure terminates without the addition of any hidden neurons. If that is not the case, the output neuron k that makes the largest number of errors (in the sense that C^p_k ≠ O^p_k) is identified. Depending upon whether the neuron k is wrongly-on (i.e., C^p_k = 0, O^p_k = 1) or wrongly-off (i.e., C^p_k = 1, O^p_k = 0) more often, a wrongly-on corrector daughter (X) or a wrongly-off corrector daughter (Y) is added to the hidden layer and trained to correct some of the errors of the output neuron k. For each pattern X^p in the training set, the target outputs C^p_X and C^p_Y for the X and Y daughters are determined as follows:

If C^p_k = 0 and O^p_k = 0 then C^p_X = 0, C^p_Y = 0.
If C^p_k = 0 and O^p_k = 1 then C^p_X = 1, C^p_Y = 0.
If C^p_k = 1 and O^p_k = 0 then C^p_X = 0, C^p_Y = 1.
If C^p_k = 1 and O^p_k = 1 then C^p_X = 0, C^p_Y = 0.

The daughter is trained using the L_W algorithm, and after connecting it to each of the M output units the output weights are retrained. The resulting network is shown in Fig. 5.
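The error bookkeeping and the daughter targets described above can be sketched as follows (the helper names are illustrative; C and O are assumed to be the binary target and output matrices over the whole training set):

```python
import numpy as np

def choose_daughter(C, O):
    """Pick the output unit with the most errors and decide the daughter type.
    C, O: (num_patterns, M) arrays of binary (0/1) targets and observed outputs."""
    errors = np.sum(C != O, axis=0)
    k = int(np.argmax(errors))                       # worst output unit
    wrongly_on = np.sum((C[:, k] == 0) & (O[:, k] == 1))
    wrongly_off = np.sum((C[:, k] == 1) & (O[:, k] == 0))
    kind = "X" if wrongly_on >= wrongly_off else "Y" # ties resolved toward X here
    return k, kind

def daughter_targets(C_k, O_k, kind):
    """Target outputs for the daughter of output unit k, following the table above."""
    if kind == "X":                                  # wrongly-on corrector
        return ((C_k == 0) & (O_k == 1)).astype(int)
    else:                                            # wrongly-off corrector
        return ((C_k == 1) & (O_k == 0)).astype(int)
```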

2.4.1 Multi-Category Upstart Algorithm

1. Train a single layer network with M output units and N input units using the algorithm L_W.

2. If the desired training accuracy has not been achieved, repeat the following steps until it is achieved or the maximum number of allowed daughter neurons is exceeded:

(a) Determine the unit k in the output layer that makes the most errors.

(b) Add an X or a Y daughter depending on whether the unit k is wrongly-on or wrongly-off more often. The daughter unit is connected to all the N inputs.

(c) Construct the training set for the daughter unit as described above and train it. Freeze the weights of this newly added daughter.

(d) Connect the daughter unit to each of the output neurons and retrain the output weights.

^1 The modification to handle bipolar valued patterns is straightforward, the only change being that instead of adding an X daughter or a Y daughter individually, a pair of X and Y daughters must be added each time.


[Figure 5: Upstart Network - input layer (N neurons), a hidden layer containing the previously added daughters and the current daughter, and an output layer (M neurons); individual connections between neurons and group connections (full connectivity) between blocks.]

2.4.2 Convergence Proof

Theorem 3:

There exists a weight setting for the X daughter unit and the output units that ensures that the number of misclassified patterns is reduced by at least one for the multi-category upstart network.

Proof:

Assume that at some time during training there is at least one pattern that is not correctly classified at the output layer L of M units. Thus far, the hidden layer L-1 comprises U_{L-1} daughter units. Assume also that the output neuron z (1 ≤ z ≤ M) is wrongly-on (i.e., it produces an output of 1 when the desired output is in fact 0) for a training pattern X^p. An X daughter unit is added to layer L-1 and trained so as to correct the classification of X^p at the output layer. The daughter unit is trained to output 1 for pattern X^p and to output 0 for all other patterns. Next, the newly added daughter unit is connected to all output units and the output weights are retrained.

Consider the following weight setting for the daughter unit:

W_{X,0} = -N + 1
W_{X,I_i} = X^p_i        for i = 1, ..., N

[Figure 6: Weight setting for the j-th output neuron in the upstart network - bias W_{L_j,0}, input-layer weights W_{L_j,I_1}, ..., W_{L_j,I_N}, weights W_{L_j,L-1_1}, ..., W_{L_j,L-1_{U_{L-1}}} to the previously added daughters, and weight 2(C^p_j - O^p_j)λ_j to the new X daughter, whose own weights are -N+1 (bias) and X^p_1, ..., X^p_N (input-layer connections).]

For pattern X^p:

n^p_X = -N + 1 + Σ_{k=1}^{N} W_{X,I_k} X^p_k
      = -N + 1 + N
      = 1
O^p_X = sgn(n^p_X) = 1

For any other pattern X^q ≠ X^p:

n^q_X = -N + 1 + Σ_{k=1}^{N} W_{X,I_k} X^q_k
      = -N + 1 + X^p · X^q
      ≤ -N + 1 + (N - 2) = -1
O^q_X = sgn(n^q_X) = 0

Let λ_j = |W_{L_j,0}| + Σ_{k=1}^{N} |W_{L_j,I_k}| + Σ_{k=1}^{U_{L-1}} |W_{L_j,L-1_k}| (i.e., for each output unit j, λ_j is the sum of the absolute values of all its existing weights). Consider the following weight setting for the connection between each output layer neuron and the newly trained X daughter:

W_{L_j,X} = 2 (C^p_j - O^p_j) λ_j

where O^p_j is the original output of neuron j in the output layer. Consider the new net input of each neuron j in the output layer in response to pattern X^p:

n^p_{L_j} = W_{L_j,0} + Σ_{i=1}^{N} W_{L_j,I_i} X^p_i + Σ_{k=1}^{U_{L-1}} W_{L_j,L-1_k} O^p_{L-1_k} + 2 (C^p_j - O^p_j) λ_j O^p_X
          = W_{L_j,0} + Σ_{i=1}^{N} W_{L_j,I_i} X^p_i + Σ_{k=1}^{U_{L-1}} W_{L_j,L-1_k} O^p_{L-1_k} + 2 (C^p_j - O^p_j) λ_j        (1)

We know that -λ_j ≤ W_{L_j,0} + Σ_{i=1}^{N} W_{L_j,I_i} X^p_i + Σ_{k=1}^{U_{L-1}} W_{L_j,L-1_k} O^p_{L-1_k} ≤ λ_j.

If C^p_j = O^p_j, the net input of unit L_j remains the same as before the daughter was added, and hence the output remains the same, i.e., C^p_j.

If C^p_j = 0 and O^p_j = 1, the net input of unit L_j is n^p_{L_j} ≤ λ_j - 2λ_j = -λ_j. Since λ_j ≥ 0, the new output of L_j is 0, which is C^p_j.

If C^p_j = 1 and O^p_j = 0, the net input of unit L_j is n^p_{L_j} ≥ -λ_j + 2λ_j = λ_j. Since λ_j ≥ 0, the new output of L_j is 1, which is C^p_j.

Thus pattern X^p is corrected. Consider any other pattern X^q. We know that O^q_X = 0.

n^q_{L_j} = W_{L_j,0} + Σ_{i=1}^{N} W_{L_j,I_i} X^q_i + Σ_{k=1}^{U_{L-1}} W_{L_j,L-1_k} O^q_{L-1_k} + 2 (C^p_j - O^p_j) λ_j O^q_X
          = W_{L_j,0} + Σ_{i=1}^{N} W_{L_j,I_i} X^q_i + Σ_{k=1}^{U_{L-1}} W_{L_j,L-1_k} O^q_{L-1_k}

We see that the daughter's contribution to the output neurons is zero for any pattern other than X^p. Thus the net input of each neuron in the output layer remains the same as it was before the addition of the daughter unit, and hence the outputs for patterns other than X^p remain unchanged. A similar proof can be given for the case when a wrongly-off corrector (i.e., a Y daughter) is added to the hidden layer. Thus, the addition of a daughter ensures that the number of misclassified patterns is reduced by at least one. Since the number of patterns in the training set is finite, the number of errors is guaranteed to eventually become zero. □

[Figure 7: Perceptron Cascade Network - input layer (N neurons), a cascade of hidden layers (one daughter each), and an output layer (M neurons); individual and group connections among the input layer, the hidden units, and the output layer.]

2.5 Perceptron Cascade Algorithm

The perceptron cascade algorithm^2 [Burgess, 94] draws on the ideas used in the upstart algorithm and constructs a neural network that is topologically similar to the one built by the cascade correlation algorithm [Fahlman & Lebiere, 90]. However, unlike the cascade correlation algorithm, the perceptron cascade algorithm uses TLUs. Initially an output neuron is trained using the L_W algorithm. If the output unit does not correctly classify the desired fraction of the training set, a daughter neuron is added and trained to correct some of the errors made by the output neuron. The daughter neuron receives inputs from each of the input units and from each of the previously added daughters. The targets for the daughter are determined exactly as in the case of the upstart network.

The extension to M output classes is relatively straightforward. First, the output layer of M neurons is trained. If the desired training accuracy is not achieved, the output neuron k that makes the largest number of errors (in the sense that C^p_k ≠ O^p_k) is identified, and a daughter unit (an X daughter if the unit is wrongly-on more often, or a Y daughter if the unit is wrongly-off more often) is added to the hidden layer and trained to correct some errors at the output layer. For each pattern X^p in the training set, the target outputs for the daughter unit are determined as in the upstart algorithm. The daughter receives its inputs from each of the input neurons and from the outputs of each of the previously added daughters. After the daughter is trained, it is connected to each of the M output units and the output weights are retrained. Fig. 7 shows the construction of a perceptron cascade network.

^2 Although the original two-category version of this algorithm is guaranteed to converge for real-valued patterns, we have restricted our extension to multiple output classes to binary/bipolar valued patterns only. The extension to real-valued patterns is straightforward.

2.5.1 Multi-Category Perceptron Cascade Algorithm

1. Train a single layer network with M output units and N input units using the algorithm L_W.

2. If the desired training accuracy has not been achieved, repeat the following steps until it is achieved or the maximum number of allowed daughter neurons is exceeded:

(a) Determine the unit k in the output layer that makes the most errors.

(b) Add an X or a Y daughter depending on whether the unit k is wrongly-on or wrongly-off more often. The daughter unit is connected to all the N inputs and to all previously added daughter units.

(c) Construct the training set for the daughter unit and train it. Freeze the weights of the daughter.

(d) Connect the daughter unit to each of the output neurons and retrain the output weights.
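The structural difference from the upstart network is in step 2(b): each daughter also receives the outputs of all previously added daughters. A minimal sketch of how the next daughter's input vector could be assembled is shown below (representing each daughter as a weight vector over [augmented inputs, earlier daughter outputs] is an assumption of this sketch, not the authors' data structure):

```python
import numpy as np

def cascade_daughter_inputs(X_aug, daughters):
    """Inputs seen by the next cascade daughter: the augmented pattern plus the
    (binary) outputs of every previously added daughter, computed in order."""
    feats = []
    for x in X_aug:
        outs = []
        for w in daughters:                    # each daughter was trained on inputs + earlier daughters
            z = np.concatenate([x, outs])
            outs.append(1.0 if w @ z > 0 else 0.0)
        feats.append(np.concatenate([x, outs]))
    return np.array(feats)
```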

2.5.2 Convergence Proof

Theorem 4:

There exists a weight setting for each added daughter unit and the output units that ensures that the number of misclassified patterns is reduced by at least one for the multi-category perceptron cascade network.

Proof:

The perceptron cascade is similar to the upstart algorithm except for the fact that each newly added daughter unit is connected to all the previously added daughter units in addition to all the input units. If we set the weights connecting each newly added daughter to all the previous daughter units to zero, the perceptron cascade behaves exactly as the upstart algorithm. The convergence proof for the perceptron cascade thus follows directly from the proof for the upstart algorithm.

□

[Figure 8: Weight setting for the j-th output neuron in the cascade network - bias W_{L_j,0}, input-layer weights W_{L_j,I_1}, ..., W_{L_j,I_N}, weights W_{L_j,1_1}, ..., W_{L_j,L-1_1} to the previously added daughters in layers 1, ..., L-1, and weight 2(C^p_j - O^p_j)λ_j to the new X daughter, whose own weights are -N+1 (bias), X^p_1, ..., X^p_N (input-layer connections), and 0 to the previous daughters in layers 1, ..., L-1.]

2.6 Tiling Algorithm

The tiling algorithm [Mezard & Nadal, 89] constructs a strictly layered network of threshold neurons. The bottom-most layer of neurons receives inputs from each of the N input neurons. The neurons in each subsequent layer receive inputs from the neurons in the layer immediately below it. Each layer maintains a master neuron. The network construction procedure ensures that the master neuron in a given layer correctly classifies more patterns than the master neuron of the previous layer. Ancillary units may be added to layers and trained to ensure a faithful representation of the training set. The faithfulness criterion simply ensures that no two training examples belonging to different classes produce identical outputs at any given layer. Faithfulness is clearly a necessary condition for convergence in strictly layered networks [Mezard & Nadal, 89]. The proposed extension to multiple output classes involves constructing layers with M master neurons (one for each of the output classes). Sets of one or more ancillary neurons are trained at a time in an attempt to make the current layer faithful. Fig. 9 shows the construction of a tiling network.
[Figure 9: Tiling Network - input layer (N neurons), hidden layer 1 (M + k1 neurons), hidden layer 2 (M + k2 neurons), and an output layer (M neurons); each hidden layer consists of master neurons and ancillary neurons.]

2.6.1 Algorithm

1. Train a layer of M master neurons. Each master neuron is connected to the N inputs.

2. If the master neurons of the current layer achieve the desired classification accuracy, stop.

3. Otherwise, if the current layer is not faithful, add ancillary neurons to the current layer to make it faithful as follows; else go to step 4.

(a) Among all the unfaithful output vectors at the current output layer, identify the one that the largest number of input patterns map to. (An output vector is said to be unfaithful if it is generated by input patterns belonging to different classes.)

(b) Determine the set of patterns that generate the output vector identified in step 3(a) above. This set of patterns forms the training set for the ancillary neurons.

(c) Add a set of k (1 ≤ k ≤ M) ancillary units, where k is the number of target classes represented in the set of patterns identified in the above step, and train them.

(d) Repeat the last three steps (of adding and training ancillary units) until the output layer representation of the patterns is faithful.

4. Train a new layer of M master neurons that are connected to each neuron in the previous layer and go to step 2.
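Steps 3(a)-(c) hinge on finding the unfaithful output vector (prototype) that covers the most patterns. A small sketch of that bookkeeping is given below (the helper name and data layout are assumptions of this sketch):

```python
import numpy as np

def most_unfaithful_prototype(layer_outputs, labels):
    """Group patterns by their output vector (prototype) at the current layer and
    return the prototype produced by patterns of more than one class that covers
    the largest number of patterns, or None if the layer is already faithful."""
    groups = {}
    for out, y in zip(map(tuple, layer_outputs), labels):
        groups.setdefault(out, []).append(y)
    unfaithful = {proto: ys for proto, ys in groups.items() if len(set(ys)) > 1}
    if not unfaithful:
        return None
    proto = max(unfaithful, key=lambda t: len(unfaithful[t]))
    return proto, len(set(unfaithful[proto]))   # the prototype and k, the number of classes in it
```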

2.6.2 Convergence Proof

In the tiling algorithm each hidden layer contains M master units plus several ancillary units that achieve a faithful representation of the patterns in the layer. Let τ^p = <τ^p_1, τ^p_2, ..., τ^p_{M+K}> (also called a prototype) be the representation of a subset of patterns that have the same output in a layer (say A) with U_A = M (master) + K (ancillary) units, where τ^p_i = ±1 for all i = 1, ..., (M + K).

Theorem 5:

Suppose that the representation at layer L-1 is faithful and that the number of errors of the master units (e_{L-1}) is non-zero. Then there exists a weight setting for the master units of the newly added layer L such that e_L < e_{L-1}.

Proof:

Consider a prototype τ^p for which the master units at layer L-1 do not yield the correct output, i.e., <τ^p_1, τ^p_2, ..., τ^p_M> ≠ <C^p_1, C^p_2, ..., C^p_M>. The following weight setting results in the correct output for prototype τ^p at layer L and leaves the outputs of all other prototypes unchanged.

W_{L_j,0} = 2C^p_j                   for j = 1, ..., M
W_{L_j,L-1_k} = C^p_j τ^p_k          for j = 1, ..., M and k = 1, ..., U_{L-1}, k ≠ j
W_{L_j,L-1_j} = U_{L-1}

[Figure 10: Weight setting for the j-th output neuron in the tiling network - bias 2C^p_j, weight U_{L-1} to unit j of layer L-1, and weights C^p_j τ^p_k to the other units k of layer L-1.]

For prototype τ^p:

n^p_{L_j} = W_{L_j,0} + Σ_{k=1}^{U_{L-1}} W_{L_j,L-1_k} τ^p_k
          = 2C^p_j + U_{L-1} τ^p_j + (U_{L-1} - 1) C^p_j
          = U_{L-1} τ^p_j + (U_{L-1} + 1) C^p_j
O^p_{L_j} = sgn(n^p_{L_j}) = C^p_j

For prototype τ^q ≠ τ^p:

n^q_{L_j} = W_{L_j,0} + Σ_{k=1}^{U_{L-1}} W_{L_j,L-1_k} τ^q_k
          = 2C^p_j + U_{L-1} τ^q_j + Σ_{k=1, k≠j}^{U_{L-1}} C^p_j τ^p_k τ^q_k
          ≤ 2C^p_j + U_{L-1} τ^q_j + (U_{L-1} - 3) C^p_j
          = (U_{L-1} - 1) C^p_j + U_{L-1} τ^q_j
O^q_{L_j} = sgn(n^q_{L_j}) = τ^q_j

Once again we rely on the algorithm L_W to find an appropriate weight setting. With the above weights the formerly incorrectly classified prototype τ^p is corrected and all other prototypes are unaffected. This reduces the number of incorrectly classified prototypes by one (i.e., e_L < e_{L-1}). Since the training set is finite, the number of prototypes must be finite, and with a sufficient number of layers the tiling algorithm eventually converges to zero classification errors. □

3 Experimental Results
This section presents results of some experiments with the multi-category constructive learning algorithms described in Section 2.

3.1 Data Sets

The algorithms were tested on two data sets: an artificially generated 3-category data set in which 5-bit boolean patterns were assigned randomly to one of three classes (5 such data sets were generated for the experiments), and a real-world non-separable data set (iris). The original iris data set comprises 4 real-valued attributes and 3 output classes. Since all of the constructive learning algorithms explored in this paper require a binary or bipolar representation of input patterns, we had to use a quantized version of the data set comprising 22 binary (or bipolar, as appropriate) valued attributes. (The quantization used ensures that no two patterns belonging to different categories map to the same binary or bipolar input vector.)

3.2 Training Methodology

The simulations for the tower, pyramid, and tiling algorithms were carried out using bipolar valued patterns, while those for the upstart and perceptron cascade algorithms were performed using binary valued patterns. The entire training set of 32 patterns in the case of the random mappings and 150 patterns in the case of iris was used for training. For intermediate training of a neuron or a group of neurons, the pocket algorithm with ratchet modification was used, with the learning rate set to 1 and the initial weights set randomly to -1, 0, or 1. In each case, patterns for intermediate training were randomly drawn from the training set, with 64,000 pattern presentations allowed for the random mappings and 150,000 pattern presentations allowed for iris. These choices were dictated by our choice of the pocket algorithm for training individual threshold neurons (which requires a sufficiently large number of pattern presentations randomly selected from the training set).

In the case of the upstart and perceptron cascade algorithms, several runs failed to converge to zero classification errors. Upon closer scrutiny of the experiments, this was explained by the fact that the training sets of the daughter units had very few patterns with a target output of 1. The pocket algorithm with ratchet modification, while trying to correctly classify the largest subset of training patterns, ended up assigning an output of 0 to all patterns. Thus it failed to meet the requirements imposed on L_W in this case. This resulted in the added daughter unit's failure to reduce the number of misclassified patterns by at least one, and in turn caused the upstart and perceptron cascade algorithms to keep adding daughter units without converging. To overcome this problem, the training set for the daughter unit was balanced as follows: if the fraction of training patterns with target output 1 was less than 25% of the daughter's training set, the patterns with target output 1 were replicated a sufficient number of times to create a modified training set with an equal number of patterns with targets 1 and 0. Given the tendency of the pocket algorithm to find a set of weights that correctly classifies a near-maximal subset of its training set, it was now able (with the modified training set) to at least approximately satisfy the requirements imposed on L_W. The results reported in the following subsection are based on this modification to the training procedure for the upstart and perceptron cascade algorithms.

3.3 Results

A summary of the number of units (excluding the input units) generated by each algorithm appears in Tables 1 and 2 for the 5-bit random mappings and the iris data set, respectively. The results represent an average of 5 different runs for each data set.

Algorithm   Data Set 1   Data Set 2   Data Set 3   Data Set 4   Data Set 5   Average
Tower          16.2         19.8         27.6         25.2         23.4       22.44
Pyramid        17.4         17.4         18.0         18.0         22.2       18.6
Upstart        10.8         10.8         14.6         12.2         13.0       12.8
Cascade        11.2         12.0         12.8         11.2         13.6       12.16
Tiling         21.8         19.6         21.2         18.6         20.2       20.28

Table 1: Average number of units for 3-class 5-bit random functions

Algorithm   Iris Data Set
Tower            6
Pyramid          6
Upstart          4
Cascade          4
Tiling           6

Table 2: Average number of units for the iris data set
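For reference, the training-set balancing applied to the daughter units in Section 3.2 can be sketched as follows; the 25% threshold and the equal-count goal come from the text above, while the exact replication scheme and the function name are illustrative assumptions.

```python
import numpy as np

def balance_daughter_set(X, t, min_fraction=0.25, rng=None):
    """Replicate the target-1 patterns of a daughter's training set when they make up
    less than min_fraction of it, so that targets 1 and 0 end up equally represented."""
    rng = np.random.default_rng() if rng is None else rng
    pos, neg = X[t == 1], X[t == 0]
    if len(pos) == 0 or len(pos) / len(t) >= min_fraction:
        return X, t                                     # nothing to balance
    reps = int(np.ceil(len(neg) / len(pos)))            # enough copies to match the 0-class count
    pos_rep = np.tile(pos, (reps, 1))[: len(neg)]
    X_bal = np.vstack([neg, pos_rep])
    t_bal = np.concatenate([np.zeros(len(neg), dtype=int), np.ones(len(pos_rep), dtype=int)])
    perm = rng.permutation(len(t_bal))                  # shuffle so presentations stay random
    return X_bal[perm], t_bal[perm]
```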

4 Summary and Discussion
Constructive neural network learning algorithms offer a potentially powerful approach to inductive learning for pattern classification applications. In this paper, we have focused on a family of such algorithms that incrementally construct networks of threshold neurons from binary or bipolar input patterns. Although a number of such algorithms have been proposed in the literature, most of them were limited to 2-category pattern classification tasks. This paper extends several existing constructive learning algorithms to handle multi-category classification. While the extensions are rather straightforward in the case of some of the algorithms considered, in other cases they are non-trivial and offer an interesting range of design choices (each with its own performance implications) that remain to be explored in detail. However, we have provided rigorous proofs of convergence to zero classification errors on finite, non-contradictory training sets for each of the multi-category algorithms proposed in this paper. In each case, the convergence of the proposed algorithm to zero classification errors was established by showing that each modification of the network topology guarantees the existence of a weight setting that yields a classification error smaller than that of the network before the modification, and by assuming a weight modification algorithm L_W that can find such a weight setting.

We do not have a rigorous proof that any of the graceful variants of perceptron learning algorithms that are currently available can, in practice, satisfy the requirements imposed on L_W, let alone find an optimal (in some suitable, well-defined sense of the term, e.g., so as to yield minimal networks) set of weights. The design of suitable threshold neuron training algorithms that (with high probability) satisfy the requirements imposed on L_W and are at least approximately optimal remains an open research problem. Against this background, the primary purpose of the experiments described in Section 3 was to explore the actual performance of such multi-category constructive learning algorithms on some non-separable classification tasks when a particular variant of perceptron learning for non-separable data sets, namely Gallant's pocket algorithm with ratchet modification, is used. Detailed theoretical and experimental analysis of the performance of single threshold neuron training algorithms is in progress [Yang et al., 95]. We expect such analysis to be useful in suggesting near-optimal designs of threshold neuron training algorithms for each of the constructive learning algorithms.

A few additional comments on the experimental results presented in Section 3 are in order. In particular, it must be pointed out that it is premature to draw any conclusions about the relative efficacies of the different algorithms based on the limited set of experiments described above. We have not made any attempt to optimize the performance of the proposed algorithms, and a number of improvements suggest themselves. Perhaps the most important one would involve the use of an appropriate single neuron training algorithm that works well with each of the constructive algorithms. An extensive comparison of the different members of this family of algorithms would be meaningful only after we have explored this and related issues in sufficient detail. It is also worth noting that the real-valued iris data set is known to be relatively easy to classify, with two of the three classes being separable from each other. Quantization simplifies the task further, although it does not render the pattern set linearly separable. In light of this fact, quantization algorithms designed specifically with classification rather than data compression as the objective are of interest and are currently under study.

Since our primary focus in this paper was on provably convergent multi-category constructive learning algorithms for pattern classification, we have not addressed a number of important issues in the preceding discussion. Each of the constructive algorithms has its own set of inductive and representational biases implicit in the design choices that determine when and where a new neuron is added and how it is trained. A systematic characterization of these biases would be quite useful in guiding the design of better constructive algorithms. A comparative analysis of the performance of various constructive algorithms on a broad range of real-world data sets is currently in progress. The generalization ability of this family of constructive learning algorithms also deserves systematic investigation. Extensions of constructive algorithms to work with multi-valued or real-valued inputs are of interest as well. Another potentially useful extension of multi-category constructive algorithms involves the use of winner-take-all groups of threshold neurons instead of independently trained neurons.

References
[Burgess, 94] Burgess, N. (1994). A Constructive Algorithm That Converges for Real-Valued Input Patterns. International Journal of Neural Systems, Vol. 5, No. 1, pp. 59-66.

[Chen et al, 95] Chen, C-H., Parekh, R.G., Yang, J., Balakrishnan, K., and Honavar, V.G. (1995). Analysis of Decision Boundaries Generated by Constructive Neural Network Learning Algorithms. Proceedings of the World Congress on Neural Networks, Vol. I, July 1995, pp. 628-635.

[Duda & Hart, 73] Duda, R. O., and Hart, P. E. (1973). Pattern Classification and Scene Analysis. New York: Wiley.

[Fahlman & Lebiere, 90] Fahlman, S. E., and Lebiere, C. (1990). The Cascade Correlation Learning Architecture. In Neural Information Systems 2, Touretzky, D. S. (ed.). Morgan-Kauffman, 1990, pp. 524-532.

[Frean, 90] Frean, M. (1990). Small nets and short paths: Optimizing neural computation. Ph.D. Thesis, Center for Cognitive Science, University of Edinburgh, UK.

[Gallant, 90] Gallant, S. I. (1990). Perceptron Based Learning Algorithms. IEEE Transactions on Neural Networks, Vol. I, No. 2, June 1990, pp. 179-191.

[Gallant, 93] Gallant, S. I. (1993). Neural Network Learning and Expert Systems. Cambridge, MA: MIT Press.

[Garey & Johnson, 79] Garey, M., and Johnson, D. (1979). Computers and Intractability. New York: W. H. Freeman.

[Honavar, 90] Honavar, V. (1990). Generative Learning Structures and Processes for Generalized Connectionist Networks. Ph.D. Thesis, University of Wisconsin, Madison, U.S.A.

[Honavar & Uhr, 93] Honavar, V., and Uhr, L. (1993). Generative Learning Structures and Processes for Connectionist Networks. Information Sciences 70, 75-108.

[Hrycej, 92] Hrycej, T. (1992). Modular Neural Networks. New York: Wiley.

[Mezard & Nadal, 89] Mezard, M., and Nadal, J. (1989). Learning in feed-forward networks: The tiling algorithm. J. Phys. A: Math. and Gen. 22, 2191-2203.

[Nilsson, 65] Nilsson, N. (1965). Learning Machines. New York: McGraw-Hill.

[Poulard, 95] Poulard, H. (1995). Barycentric Correction Procedure: Fast Method of Learning Threshold Elements. Laboratoire d'Analyse et d'Architecture des Systemes, Report 95062.

[Siu et al., 95] Siu, K-Y., Roychowdhury, V., and Kailath, T. (1995). Discrete Neural Computation: A Theoretical Foundation. Englewood Cliffs, NJ: Prentice Hall.

[Yang et al., 95] Yang, J., Parekh, R.G., and Honavar, V.G. (1995). Empirical Comparison of Graceful Variants of the Perceptron Learning Algorithm on Non-Separable Data Sets. In preparation.


