9512.net

甜梦文库

甜梦文库

当前位置：首页 >> >> # for Artificial Neural Networks Supervisor Prof.dr.ir. A.J. van de Goor

Constructive Supervised Learning Algorithms for Arti cial Neural Networks

Natalio Simon

1

Type Number of Pages 97 Date 4 juni 1993 Department Codenumber Author Title Supervisor Mentors

Delft University of Technology Faculty of Electrical Engineering Master Thesis

Computer Architecture and Digital Technique 1-68340-28(1993)06 Natalio Simon Constructive Supervised Learning Algorithms for Arti cial Neural Networks

Prof.dr.ir. A.J. van de Goor Dr. J. Hoekstra

2

Abstract

The eld of Arti cial Neural Networks has received in the last years a great deal of attention from the academic community and the industry. Neural Network applications are now found in a wide range of areas, including Control Systems, Medicine, Communication, Cognitive Sciences, Linguistic, Robotics, Physics and Economics. This report focuses on one of the key challenges of the eld of Arti cial Neural Networks, the creation of good general-purpose learning algorithms. As opposed to conventional programmed computing, where a series of explicit commands (a program) processes information on an associated database, Arti cial Neural Networks develop information processing capabilities by learning from examples. Learning techniques can be roughly divided into two categories: Supervised Learning, and SelfOrganization or Unsupervised Learning. Supervised learning requires a set of examples for which the desired network response is known. The learning process consists then of adapting the network in a way that it will produce the correct response for the set of examples. The resulting network should then be able to generalize (give a good response) when presented with cases not found in the set of examples. The class of Constructive Supervised Learning Algorithms has been de ned in this report. As opposed to non-constructive learning algorithms, where the topology of the network is selected before the algorithm starts and remains xed, Constructive Supervised Learning Algorithms start from an initial network topology and they add new elements (neurons) in order to learn the set of examples. The nal topology and the size of the network are then dynamically determined by the algorithm and are a function of the set of examples chosen and of the learning parameters. A division has been made between Constructive Supervised Learning Algorithms for continuous connectionist networks, and Constructive Supervised Learning Algorithms for discrete connectionist networks. The power and applicability of Constructive Supervised Learning Algorithms has been shown by using them in six applications of di erent nature. The applications are: 1. 2. 3. 4. 5. 6. Solution to the Inverse Kinematic Transformations for Robot Control. Corrective networks for damage Robot Controllers. Hamming Codes Generators and Parity Generators. A Pattern Recognition problem. Mapping of Random Boolean Functions. The Monk's problems.

The above applications have allowed an analysis and comparison of the di erent constructive learning algorithms presented. The comparison has been done in terms of learning speed, size of the network and generalization accuracy.

3

Continuous Constructive Supervised Learning Algorithms

For Continuous Constructive Supervised Learning Algorithms, the Cascade-Correlation Learning Algorithm and a modi ed version of it have been presented. The Cascade-Correlation Algorithm builds a network by adding a hidden unit at a time between the previously added hidden units and the output units, until the training set is learned. Hidden units are trained to correlate to the sum of the remaining outputs' errors before they are connected. The Modi ed Cascade-Correlation Algorithm has been introduced for problems were multiple outputs are required. In this algorithm, a new unit is added per output and is trained to correlate to the remaining error at the respective output before it is connected. Then the hidden unit is connected to all output units. The Inverse Kinematic Transformations for a two and a three degrees of freedom Robot Controller has allowed for a comparison between the Modi ed and the original Cascade-Correlation algorithms, For this problem, the Modi ed Cascade-Correlation was shown to learn quicker, scale better to more training cases and to lower tolerance, and to have a higher success rate than the original Cascade-Correlation. The Inverse Kinematic Transformation for a two degrees of freedom Robot Controller allowed also a comparison of the Modi ed Cascade-Correlation algorithm with previous results obtained with the non-constructive Backpropagation learning algorithm. In this case the accuracy of generalization achieved by the Modi ed Cascade-Correlation algorithm was equal or better and with a much smaller network than achieved by using the Backpropagation algorithm. The other advantage given by the use of the Modi ed Cascade-Correlation algorithm over the Backpropagation algorithm is that no trial and error stage is needed to determine the amount and size of the hidden layers in the network, what is the case with the non-constructive Backpropagation algorithms. The determination of an a-priori network topology when applying non-constructive learning algorithms, so as the Backpropagation algorithm, remains a di cult problem in the area of Robot Control. The results achieved for the Inverse Kinematic Transformation and for a Corrective Networks for a damage Robot Controllers, point to the applicability of the Modi ed Cascade-Correlation learning algorithm for this area of application. For the Pattern Recognition problem and for the Hamming Code Generation problem, the Modi ed Cascade-Correlation Learning Algorithm was also shown to learn faster than the original Cascade-Correlation Learning Algorithm, When tested for generalization accuracy with the Inverse Kinematic Transformations and the Corrective networks for a damage Robot Controllers, the Modi ed Cascade-Correlation algorithm also scored better.

Discrete Constructive Supervised Learning Algorithms

For Discrete Constructive Supervised Learning Algorithms, the Tilling Learning Algorithm, the Tower Algorithm, the Upstart Algorithm and a new algorithm introduced in this report, the BINCOR Algorithm, have been presented. The Tilling Learning Algorithm and the Tower Learning Algorithm build a network by adding

4 a layer at a time, where the internal representation of the input patterns in each layer remains di erent for input patterns with di erent outputs. The process goes on until the rst unit added at a new layer produces the correct mapping for all input patterns. The convergence of these algorithms for any Boolean mapping is guaranteed. The Tilling Algorithm adds a variable number of units in each layer. The Tower Algorithm adds only one unit in each layer and uses "shortcut" connections, i.e. it not only connects the unit of a new layer to the unit of the previous layer, so as in the Tilling Algorithm, but also to the input units. The Upstart Algorithm and the BINCOR Algorithm use a di erent strategy than the previous two algorithms for building a network. In the Tilling and the BINCOR algorithms, units are added between the output unit and the inputs in order to correct the remaining misclassi ed patterns at the output unit. The convergence of these algorithms for any Boolean mapping is guaranteed. In the Upstart Algorithm a recursive strategy is used, where each unit added receives connections from the input units and tries to correct the misclassi ed input patterns at the unit previously added, leading to a binary tree network structure. In the BINCOR Algorithm, two units are added at a time between the input units and the output unit, where this units try to correct the remaining misclassi ed input patterns at the output unit, leading to one of two optional network structures: a single hidden layer or a high-order network of multiple layer, where the units in a hidden layer receive connections from the input units and all previously added units. The mapping of random Boolean functions has allowed for a comparison between the Tilling, Upstart and BINCOR algorithms in terms of size of the network built. The introduced BINCOR algorithm produced the network with less units of the three algorithms. The Monk's problems has been used to compare the generalization accuracy of the BINCOR algorithm with a variety of machine learning algorithms. The generalization with the BINCOR algorithm was less accurate than with the Cascade-Correlation Learning Algorithm and the Backpropagation Learning Algorithm, but more accurate than with Decision Tree-Based Learning Algorithms, the mFOIL algorithm and Inductive Learning programs.

Contents

1 Introduction

1.1 1.2 1.3 1.4 What are Arti cial Neural Networks : : : : Current state of Arti cial Neural Networks Constructive Learning Algorithms : : : : : Organization of the Report : : : : : : : : :

8

: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

9 10 12 13

2 Non-Constructive Supervised Learning

2.1 The Perceptron Learning Algorithm : : : : 2.1.1 Basic De nitions : : : : : : : : : : : 2.1.2 Algorithm Description : : : : : : : : 2.1.3 Limitations of Perceptrons : : : : : : 2.2 The Backpropagation Learning Algorithm : 2.2.1 Basic De nitions : : : : : : : : : : : 2.2.2 Algorithm description : : : : : : : : 2.2.3 Discussion of Backpropagation : : : 2.2.4 The Quickprop Learning Algorithm 2.2.5 The Bold Driver Method : : : : : : 5

15

16 16 17 19 21 21 23 25 25 29

6

3 Continuous Constructive Supervised Learning

3.1 The Algorithms : : : : : : : : : : : : : : : : : : : : : : : : : : 3.1.1 The Cascade-Correlation Learning Algorithm : : : : : 3.1.2 Modi ed Cascade-Correlation : : : : : : : : : : : : : : 3.2 The Applications : : : : : : : : : : : : : : : : : : : : : : : : : 3.2.1 Inverse Kinematic Transformations for Robot Control 3.2.2 Hamming Code Generation : : : : : : : : : : : : : : : 3.2.3 Pattern Recognition : : : : : : : : : : : : : : : : : : : 3.3 The Simulation Results : : : : : : : : : : : : : : : : : : : : : 3.3.1 Inverse Kinematic Transformations for Robot Control 3.3.2 Hamming Code Generation : : : : : : : : : : : : : : : 3.3.3 Pattern Recognition : : : : : : : : : : : : : : : : : : : 3.4 Discussion and Future Work : : : : : : : : : : : : : : : : : : : 3.4.1 Inverse Kinematic Transformations for Robot Control 3.4.2 Hamming Code Generation : : : : : : : : : : : : : : : 3.4.3 Pattern Recognition : : : : : : : : : : : : : : : : : : : 3.4.4 Future Work : : : : : : : : : : : : : : : : : : : : : : :

30

: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

32 32 36 39 39 43 44 46 46 49 50 51 51 51 52 52

4 Discrete Constructive Supervised Learning

4.1 The Algorithms : : : : : : : : : : : 4.1.1 The Pocket Algorithm : : : 4.1.2 The Tilling Algorithm : : : 4.1.3 The Tower Algorithm : : : 4.1.4 The Upstart Algorithm : : 4.1.5 The BINCOR Algorithm : 4.2 The Applications : : : : : : : : : : 4.2.1 Random Binary Functions : 4.2.2 The Monk's Problems : : : 4.3 The Simulation Results : : : : : :

53

: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

54 54 57 65 68 76 80 80 80 85

: : : : : : : : : :

: : : : : : : : : :

: : : : : : : : : :

: : : : : : : : : :

: : : : : : : : : :

7 4.3.1 Random Binary Functions : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 85 4.3.2 The Monk's Problems : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 85 4.4 Discussion and Future Work : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 90

5 Conclusions and Future Work

91

5.1 Continuous Constructive Supervised Learning Algorithms : : : : : : : : : : : : : : : 92 5.2 Discrete Constructive Supervised Learning Algorithms : : : : : : : : : : : : : : : : : 93 5.3 General Conclusions and Future Directions : : : : : : : : : : : : : : : : : : : : : : : 94

Introduction

1

After having being forgotten for more than two decades, the eld of Arti cial Neural Networks has received in the last years a great deal of attention from the academic community and the industry. Characteristic to the eld is its multidisciplinary character. Neural Network articles are periodically found in Control Systems, Medicine, Communication, Cognitive Sciences, Linguistic, Robotics, Physics, Economics and Arti cial Intelligence journals. Added to this are the journals speci c to the eld of Neural Networks (see Table 1.1). Neural Networks have also seen their way to industry in the last two years. Table 1.2 presents a list of the major holders of U.S. Neural Network patents through 1991 Wenskay, 1992]. This list includes leading companies in Electronics so as Philips, IBM, Intel, Mitsushita and Kodak. Governments are also encouraging developments in this eld. The European Community, as an example, funds several R & D projects in Neural Networks under its ESPRIT II and ESPRIT III programmes Corsi, 1992], Angeniol et al, 1992] (see Table 1.3). These are being carried out jointly by industry and research centers. This chapter begins with an overview of Arti cial Neural Networks and a survey of the techniques for learning in Arti cial Neural Networks. Then the class of constructive learning algorithms is presented. Finally, a brief overview of the structure of the rest of this report is given.

8

CHAPTER 1. INTRODUCTION

9

1.1 What are Arti cial Neural Networks

Arti cial Neural Networks are the primary information processing structures of a technological discipline called Neurocomputing. Neurocomputing is concerned with parallel, distributed, adaptive information processing systems that develop information processing capabilities in response to exposure to an information environment Hecht-Nielsen, 1990]. Arti cial Neural Networks are applied typically in areas such as sensor processing, patterns recognition, data analysis and control, which may require information processing structures for which the algorithms or rules are not known. An Arti cial Neural Network is in the most general sense an information processing structure whose design is motivated by the design and functioning of human brains and components thereof. The formal de nition of a neural network follows. De nition A neural network is a parallel, distributed information processing structure consisting of processing elements (which can possess a local memory and can carry out localized information processing operations) interconnected via unidirectional signal channels called connections. Each processing element has a single output connection that branches ("fans out") into as many collateral connections as desired; each carries the same signal - the processing element output signal. The processing element output signal can be of any mathematical type desired. The information processing that goes on within each processing element can be de ned arbitrarily with the restriction that it must be completely local; that is, it must depend only on the current values of the input signals arriving at the processing element via incoming connections and on values stored in the processing element's local memory. A neural network can be categorized into: 1. Continuous Neural Network, where the input and output space are of a continuous nature, and 2. Discrete Neural Network, where the input and output space are of a discrete nature. As opposed to conventional programmed computing, where a series of explicit commands (a program) processes information on an associated database, Arti cial Neural Networks develop information processing capabilities by learning from examples. Learning techniques can be roughly divided into two categories: 1. Supervised Learning. 2. Self-organization or Unsupervised Learning Supervised learning requires a set of examples for which the desired network response is known. The learning process consists then in adapting the network in a way that it will produce the correct response for the set of examples. The resulting network should then be able to generalize (give a good response) when presented with cases not found in the set of examples. In self-organization or unsupervised learning the neural network is autonomous: it processes the data it is presented with, nds out about some of the properties of the data set and learns to re ect

CHAPTER 1. INTRODUCTION

10

these properties in its output. What exactly these properties are, that the network can learn to recognize, depends on the particular network model and learning method. Many of the learning techniques are closely connected with a certain (class of) network topology.

1.2 Current state of Arti cial Neural Networks

Attention in Arti cial Neural Networks is being focused primarily in four areas: 1. 2. 3. 4. Software simulation. Theoretical foundations. Hardware implementation. Applications.

Software simulation

Most studies carried out to compare the performance of the di erent learning algorithms and network topologies are being done on serial computers that simulate the the operation and ow of data of the di erent models of neural networks. For this purpose, various benchmark problems are used. Among the most common Fahlman, 1988], Fahlman & Lebiere, 1990]: The XOR/Parity Problem. The Encoder/Decoder Problem. The Two-Spirals Problem. The restricted scope of these benchmarks has led to the use of real-world problems, where higher number of inputs and outputs are used, to compare the performance of the di erent algorithms and topologies. Commonly used real-world problems are: Handwritten character set recognition. Inverse Kinematic Transformations for Robot Control. Time-Series prediction. The various algorithms are compare in terms of: Learning speed.

CHAPTER 1. INTRODUCTION

Size of the network. Generalization accuracy.

11

By learning speed, the number of connection crossing or the number of epochs (one epoch is one cycle over the training set) are used as a measure. The size of the network is given by the amount of neurons and the number of layers (and the number of neurons per layer). By generalization accuracy it is meant how well the network performs for input patterns not found in the training set. For this measurement a test set is used and the average, maximum and standard deviation of the network error for the test set is compared. For real-world problems, the training set represents a small part of the input space so the way this set is chosen to important to how well the network learns. Good learning is re ected in good generalization accuracy. Another factor that plays a roll in how well the networks learns is the error tolerance in learning the training set. If the training set is too accurately learned, then the over-training or grand-mothering problem occurs. In this case the network learn the training set very well but generalizes poorly for input patterns outside the training set.

Theoretical foundations

Neural Networks are able to learn from examples, to approximate complex functions, to make predictions of non-linear system outputs. They are used for the diagnosis of faulty equipment, to determine the trajectory of a robot, etc. Still a complete formal foundation of how the various algorithms "learn" and how to achieve optimal learning and optimal network size is lacking. Research is carried out to give a mathematical foundation to the di erent learning algorithms and network topologies.

Hardware implementation

Arti cial Neural Networks are inherently parallel. They consist of simple processing units (neurons) that process data independently of each other. Simulations on conventional sequential computers provide a functionality test. To fully exploit the high-speed potential of neural networks for problems requiring real-time response, work is done in the implementation of neural networks in physical hardware, where each processing element can operate independent and in parallel to other elements. Approaches in hardware implementation include analog, digital as well as hybrid circuits. Two options exist regarding the implementation of Arti cial Neural Networks in hardware: implementing hardware for a speci c application and for a speci c learning algorithm and topology, or a general purpose architecture suitable for di erent learning algorithms and topologies and for applications with di erent sizes of inputs and outputs. The last approach being followed by many researchers and the industry (see Barhen et al., 1992], Daud et al., 1992], Hurdle et al., 1992], Jutten et al., 1990], Corsi, 1992]).

Applications

CHAPTER 1. INTRODUCTION

12

Neural Networks are in a stage of development were researchers continue to discover new applications for this technology. From 1989 there is being an exponential increase in patent activity in Neural Networks, lead by the U.S. with Japan in the second place, followed by Korea, Germany, France and the United Kingdom Wenskay, 1992]. Neural Networks are applied in a wide range of tasks. These tasks include, among others, Quality Control: of video images, VLSI masks, leather, oranges. Forecasting: of electricity consumption, economic and nancial time-series, chip dimensions. Vision: motion detection, recognition of objects, texture. Diagnosis: of metal corrosion, machinery. Control: robot control, industrial process control. Classi cation: of patterns, spectra. The impact of Arti cial Neural Networks is likely to be in areas of arti cial perception - making sense of speech, language and visual images. This is where the promise of going beyond the capabilities of conventional programming arise, simply because conventional programming requires an algorithm which, in turn, requires a full analysis of the problem. But because the mechanisms used by humans when they carry out these tasks are not fully understood, satisfactory algorithms cannot be devised. The expectation is that Arti cial Neural Networks will build up such solutions through being trained on appropriate input-output relationships Aleksander & Morton].

1.3 Constructive Learning Algorithms

This report focuses on a class of Supervised Learning Algorithms for Arti cial Neural Networks, called Constructive or Dynamic Learning Algorithms. Constructive or Dynamic Learning Algorithms have the characteristic that they modify the neural network topology in the process of learning from a set of examples. They start from an initial network topology and they add new elements (neurons) in order to learn the set of examples. The nal topology and the size of the network are dynamically determined by the algorithm and are a function of the set of examples chosen and of the learning parameters. The aim of this report is threefold: 1. to present the class of Constructive or Dynamic Learning Algorithms, 2. to show the power and applicability of these algorithms by using them in applications of di erent nature, and 3. to analyze their advantages and disadvantages in comparison with other non-constructive learning algorithms.

CHAPTER 1. INTRODUCTION

Title: Neural Networks Publisher: Pergamon Press Year: volume 1 in 1988 Title: Neurocomputing Publisher: Elsevier Year: volume 1 in 1989 Title: International Journal of Neural Systems Publisher: World Scienti c Publishing Year: volume 1 in 1990 Title: Journal of Neural Network Computing, Technology, Design, and Applications Publisher: Auerback Publishers Year: volume 1 1989 Title: International Journal of Neural Networks Publisher: Learned Information Year: volume 1 in 1989 Title: Applied Intelligence: The International Journal of Arti cial Intelligence, Neural Networks, and Complex Problem-Solving Technologies Publisher: Kluwer Academic Publishers Year: volume 1 in 1991 Title: Neural Computation Publisher: MIT Press Year: volume 1 in 1989 Title: IEEE Transaction on Neural Networks Publisher: IEEE Year: volume 1 in 1990 Title: Network: Computation in Neural Systems Publisher: IOP Publishing Ltd Year: volume 1 1990 Title:

13

Connection Science: Journal of Neural Computing Arti cial Intelligence and Cognitive Research Publisher: Carfax Publishin Year: volume 1 in 1989

Title: Neural Computing and Applications Publisher: Springer International Year: volume 1 in 1992

Table 1.1: Neural Network Journals

1.4 Organization of the Report

The report is organized as follows: Chapter 2 presents the class of non-constructive supervised learning algorithms. Here, for discrete networks, the Perceptron Learning Algorithm is discussed. Then, for continuous networks, the Backpropagation Algorithm and two variations of it, the Quickprop Algorithm and the Bold-Drive Method, are discussed. Chapter 3 presents the class of constructive supervised learning algorithms for continuous networks. Here the Cascade Correlation Algorithm and variations on it are discussed. Then the performances of these and algorithms presented in chapter 2 are compared. The comparison is done for the following applications: a pattern recognition problem, the Hamming Code generation problem, and the inverse kinematic transformations of a two and a three degrees of freedom robot arm. Chapter 4 presents the class of constructive supervised learning algorithms for discrete networks. Here the Pocket Learning Algorithm, the Tilling Algorithm and the BINCOR Algorithm are discussed. Then the performances of these and algorithms presented in the previous chapters are compared. The comparison is done for the following applications: the Monk's Problem and random boolean functions. Chapter 5 nalizes this report with a discussion of the performances obtained by the learning algorithms of the previous chapters. Conclusions regarding the suitability of Constructive Learning Algorithms are given. An outline of future work directions is presented.

CHAPTER 1. INTRODUCTION

Miscellaneous Univ. IBM Hughes Aircraft GTE Matsushita General Electrics Kodak Hitachi AT&T U.S. Government Nestor Synaptics Intel Philips Motorola

14

Table 1.2: Major Holders of U.S. Neural Network Patents Though 1991 UNDER ESPRIT II 1988-1990 PIGMALION Neurocomputing GALATEA Neurocomputing ANNIE Applications of Neural Networks for Industry in Europe NEUFODI Neural Networks for Forecasting and Diagnosis Applications SPRINT Speech Processing and Recognition Using Integrated Neurocomputing Techniques SUBSYM Self-Organization and Analogical Modeling Using Sub-Symbolic Computing UNDER ESPRIT III 1990CONNY Robot Control Based on Neural Network Systems HIMMARNET Study of Hidden Markov Models and Neural Networks for Robust Isolated Word Recognition NEUROQUACS Neural Networks based Vision and Signal System for Industrial Quality Control WERNICKE Neural Network Based, Speaker Independent, Large Vocabulary, Continuous Speech Recognition System MUCOM Multisensory Control of Movement INSIGHT Vision Systems for a Natural Human Environment

NERVES ELENA-NERVES 2 Innovative Architectures for Neurocomputing Enhanced Learning for Machines and VLSI Neural Networks Evolutive Neural Architectures SSS Smart Sensory Systems Table 1.3: EC Neural Network Projects

Non-Constructive Supervised Learning

2

Supervised learning consists of training a neural network with a set of examples for which the desired outputs are known. The purpose of the learning algorithm is then to adjust the network so that the network produces the correct outputs for the set of examples. Here we distinguish between two classes of Learning Algorithms: Non-Constructive and Constructive. By non-constructive we refer to algorithms for which the topology of the network has to be xed a-priori, while in constructive ones the algorithm itself automatically determines the topology of the network. This chapter presents the class of non-constructive supervised learning algorithms. First the Perceptron Learning Algorithm Rosenblatt, 1962] is described and discussed. Then the Backpropagation Algorithm Rumelhart, Hinton & Williams, 1986] and two variations of it, the Quickprop Algorithm Fahlman, 1988] and the Bold-Drive Method Vogl et al.], are described and discussed.

15

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

16

x0 x1 x2 x3 . . . xn

w0 w1 w2 w3 wn

threshold unit

y

Figure 2.1: Basic Neuron

2.1 The Perceptron Learning Algorithm

In his book "Principles of Neurodynamics", Frank Rosenblatt described the perceptron as a simpli ed network in which certain properties of real nervous systems are exaggerated while others are ignored. This section starts by presenting the basic model of a Perceptron, then the perceptron learning algorithm is described and nally, the capabilities of the Perceptron are discussed.

2.1.1 Basic De nitions

The purpose of the Perceptron Learning Algorithm is to train a processing unit to produce the correct output when provided with a set of input patterns for which the output is known. The algorithm stops when the processing unit produces the correct output for each of the given input patterns. The basic processing unit (neuron) of a Perceptron is shown in Figure 2.1. In Figure 2.1, wi is the weight of the connection from input xi . The output of the unit is given by

y = fh (net) n X net = wi xi ?

=

i=1 n X i=0

(2.1) (2.2) (2.3)

wi xi where x0 = 1 and w0 =

The activation function, fh , is the Threshold function shown in Fig 2.2.

fh (x) =

(

1 if x 0 0 if x < 0

(2.4)

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

y

1

17

0

x

Figure 2.2: The Threshold Function.

2.1.2 Algorithm Description

The Perceptron Learning Algorithm can be summarized as follows Beale & Jackson, 1990]: set the weights randomly present an input calculate the actual output by taking the thresholded value of the weighted sum of the inputs alter the weights to reinforce correct decisions and discourage incorrect decisions- i.e. reduce the error present the next input etc. This leads to the following algorithm:

The Perceptron Learning Algorithm

1. Initialize weights De ne wi (t), (0 < i < n), to be the weight from input i at time t, and to the threshold value in the output node. Set w0 to be ? , and x0 to be always 1. Set wi (0) to small random values, thus initializing all the weights and the threshold. 2. Present input pattern and desired output Present binary input pattern Xp = x1; x2; x3; :::; xn and desired binary output d(t) . 3. Calculate actual output y(t) = fh ( n=0 wi(t)xi (t)) i 4. Adapt weights

P

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

if (y(t) = 0) and (d(t) = 1) for i from 0 to n do wi (t + 1) = wi(t) + if (y(t) = 1) and (d(t) = 0) for i from 0 to n do wi (t + 1) = wi(t) ? if (y(t) = d(t)) for i from 0 to n do wi (t + 1) = wi(t) xi (t) xi (t)

18

where 0 < 1, is a positive gain term that controls the adaptation rate. 5. Next input pattern Chose next input pattern and go to 2.

The algorithm stops when y (t) = d(t) for each of the patterns of the training set, i.e. when all the training set has been learned. Step 4 of the algorithm adds the input values to the weights when the output should be on (1), and subtracts the input values to the weights when the output should be o (0). The weights are unchanged if the network makes the correct decision. Also, weights are not adjusted on input lines which do not contribute to the incorrect response, since each weight is adjusted by the value of the input on that line, xi (t), which would be zero. The gain term has the e ect of slowing down the change in the weights, making the network take smaller steps towards the solutions. Empirical results show that this leads to faster convergence.

Widrow-Ho Delta Rule

A second learning rule, the Widrow-Ho delta rule Widrow, 1962], calculates the di erence between the weighted sum and the point at which the threshold function changes value and calls this the error.

error =

X

n

(2.6) Weight adjustment is then carried out in proportion to that error. This leads to big changes in the weights when the weighted sum is long way from the desired value, while altering them only slightly when the weighted sum is close to that required to give the correct solution. This method achieves faster convergence. Applying the Widrow-Ho Delta Rule means replacing step 4 with the following:

4. Adapt weights - Widrow-Ho delta rule if (y(t) = d(t))

i=1

wi xi ? =

X

n

i=0

wi xi

(2.5)

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

error(t) = 0 if (y(t) = 0) and (d(t) = 1) P error(t) = n=0 wi(t)xi (t) i if (y(t) = 1) and (d(t) = 0) P error(t) = ? n=0 wi (t)xi(t) i for i from 0 to n do wi(t + 1) = wi (t) + error(t)xi (t) where 0 < 1, is again positive gain function that controls the adaptation rate.

19

Bipolar Values

The algorithms can be applied to binary input and output spaces, i.e. the inputs and the output can have the value 0 or 1, or they can work with bipolar input and output spaces, where the inputs and the output can have the value -1 or 1. Using binary inputs means that input lines where xi = 0 are not trained, whereas bipolar values allow all input lines to be trained each time. The use of bipolar values help to speed up the convergence process.

2.1.3 Limitations of Perceptrons

The Perceptron Learning Algorithm is capable of classifying two classes of input patterns, or in the case of a layer of n Perceptron units, each unit can be trained in parallel to recognize one class of inputs from the rest, leading to the classi cation of n di erent classes of input patterns. An example of this is given in Figure 2.3, where a Percetron is used to classi ed the ciphers 0 to 9, drawn in a grid of 8x5 pixels (the gure shows the cipher 6 as example). There is, however, a limitation to the classi cation capabilities of a Perceptron, it can only correctly classify linearly separable patterns. For example, Classifying binary inputs according to their parity, i.e. the Perceptron outputs 1 when the pattern has odd parity and 0 when the parity is even, is an example of a non linearly separable problem which cannot be solve by the Perceptron. For the case of two dimensional input space, the parity problem then becomes the XOR function. From Figure 2.4 it can be seen that no line can separate one class from the other.

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

20

+1 x1 x2 x3

y0

1 if cipher is 0 0 otherwise 1 if cipher is 1 0 otherwise 1 if cipher is 2 0 otherwise

(a)

. . . . . .

x 40

y1

y2

. . .

y9

(b)

1 if cipher is 9 0 otherwise

Figure 2.3: Ciphers classi cation with a Perceptron. (a) Example input: cipher 6. (b) Perceptron Architecture.

x2 x1 0 0 1 1 x2 0 1 0 1 y 0 1 1 0

1

0 0 1

x1

Figure 2.4: The XOR Problem

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

fs

1

21

NET

fs =

0

1 1+e

NET

Figure 2.5: The Sigmoid Function

2.2 The Backpropagation Learning Algorithm

The Backpropagation Learning Algorithm was popularized in 1986 by Rumelhart, McClelland and Williams Rumelhart, Hinton & Williams, 1986]. The Backpropagation algorithm is a supervised learning algorithm that provides a method to adjust the weights in a multi-layer network of connected processing units. The purpose of the weight adjustment is again to produce the correct outputs for a given training set, where the training set consists of patterns of inputs and desired outputs. This section starts by presenting the basic model of a Backpropagation network, then the Backpropagation learning algorithm is described and nally, the advantages and disadvantages of the algorithm are discussed.

2.2.1 Basic De nitions

The basic processing unit (neuron) of a Backpropagation network is similar to the one shown in Figure 2.1. Here the activation function, fs , used is the Sigmoid function 1 o = fs (net) = 1 + e?net (2.7) where net is de ned in equation (2.2). Figure 2.5 shows the shape of a Sigmoid function. The use of the Sigmoid function means that the input and output spaces are continuous. Taking the derivative of fs (net) gives

e?net fs0 (net) = (1 + e?net )2 = fs (net)(1 ? fs (net)) 0 (net) = o(1 ? o) fs

The derivative is therefore a simple function of the output.

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

22

i n p u t p a t t e r n

o u t p u t p a t t e r n

input nodes

1st hidden layer

n-2 hidden layer

n-1 hidden layer

output layer

Figure 2.6: Backpropagation Network Topology. n-layers network. The Backpropagation works on a multi-layer network of connected processing units, where each layer contains one of more units. The determination of the number and size of the network layers is done prior to the learning, usually based on empirical criteria. Figure 2.6 shows a typical network topology. Backpropagation performs a gradient descend in the weight space in order to minimize the error at the outputs. This is done by calculating the error function for each input pattern and then backpropagating the error from one layer to the previous one. The weights of a node are adjusted in direct proportion to the error in the units to which it is connected. The error measure used in the algorithm is the sum of the Mean Square Error at the outputs of the network, for all patterns of the training set. The error for one pattern p is de ned as: X (2.8) Ep = (dpo ? opo )2

o

The total error is de ned as:

E=

X

1 X(d ? o )2 = X 1 E po po p p 2 p 2 o

(2.9)

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

23

where dpo is the desired value at output o, for input pattern p and opo is the actual value at output o, for input pattern p. The index p refers to each pattern of the training set of inputs and desired outputs. To minimize the total error, the change made to each weight is proportional to the derivative of the error measure with respect to that weight, where wij is the weight between the output of unit i and unit j in the next layer (one layer closer to the output layer).

p wij

@E = ? @w p ij

(2.10)

The derivative of the error is given by

@Ep = ? o pj pi @wij

where for an output unit

pj

(2.11) (2.12) (2.13)

= fj0 (netpj )(dpj ? opj ) = fj0 (netpj )

X

and for a hidden unit

pj k pk wjk

where pk are the error signals from one layer closer to the output layer that receive a connection from hidden unit j and k is the number of such connections.

2.2.2 Algorithm description

The Backpropagation Algorithm for a multi-layer network of processing units, where the units have a sigmoid function, is presented below:

The Backpropagation Learning Algorithm

1. Initialize weights Set all weights to small random values. 2. Present input pattern and desired output Present input pattern Xp = xp;0; xp;1; xp;2; :::; xp;n and desired output pattern Dp = dp;0 ; dp;1; dp;2; :::; dp;m?1, where n is the number of input nodes and m the number of output nodes. 3. Calculate actual output - Forward pass Starting from the rst hidden layer, calculate hidden units outputs and pass them as inputs to the next layer, until the output layer is reached.

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

for l from 0 to N do for j from 0 to P ??1 do nl 1

olpj (t) = fs

nl i=0

24

? l wij (t)0lpi 1 (t)

Where 0 l N refers to the layer number, N being the total number of layers. 0 j (nl ? 1) refers to ? the unit number within layer l, nl being the number of units in layer l, and 0lpi 1 (t) being the output of unit l i of the previous layer l ? 1 (where 00 (t) is the input xi(t)). wij is the weight from unit i (of layer l ? 1) to pi unit j of layer l. 4. Adapt weights - Backward pass Start from the output layer, and working backwards, adapt weights.

for l from N to 1 do for j from 0 to nl ? 1 do

l l l wij (t + 1) = wij (t) + pj (t)0lpj (t) l where is a gain term, and pj is an error term for pattern p on unit j of layer l.

N For output units: pj (t) = 0N (t)(1 ? 0N (t))(dpj ? 0N (t)) pj pj pj

Pn +1 l+1 l+1 l For hidden units: pj (t) = 0N (t)(1 ? 0N (t)) k=0 ?1 pk (t)wjk (t) pj pj

l

where for hidden units, the sum is over the units of one layer above the layer of unit j. 5. Next input pattern If p is not the last input pattern then p = p + 1 (chose next input pattern) and go to step 2. 6. Calculate Total Error Compute the total error by calculating the outputs (step 3) for all training patterns and using equation (2.4). If the total error is within the desired value STOP else select the rst pattern and go to step 2.

In practice, the extra forward pass in step 6 for each training input is avoided and the partial errors per pattern are calculated in step 3. This means however that the total error is not the correct one since the weight updates of the latter patterns are not taken into account for the earlier patterns, their in uence is neglected. In order to speed up the learning in the Backpropagation algorithm, a momentum term can be used in the weight update. This momentum term is a part of the previous weight change and is added to the new weight. Step 4 of the Backpropagation algorithm becomes then:

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

4. Adapt weights with momentum term - Backward pass Start from the output layer, and working backwards, adapt weights.

25

for l from N to 1 do for j from 0 to nl ? 1 do

l l l wij (t + 1) = wij (t) + pj (t)0lpj (t) +

l wji(t ? 1)

l where pj remains the same as without a momentum term and 0 < < 1.

2.2.3 Discussion of Backpropagation

The Backpropagation algorithm is a non-constructive learning algorithm that requires an a-priori network topology, i.e., the determination of the number and size of the network layers is done prior to the learning. The Backpropagation Algorithm presents the problem of being very slow at learning for realworld problems where many inputs and outputs are required Antsaklis, 1992], Fahlman, 1988], Daud et al., 1992]. Another problem of the algorithms is the step-size problem. To nd the global minimum in the overall error function, the Backpropagation Algorithm computes the rst derivative of the overall error function with respect to each weight in the network. If small steps are taken down the gradient vector, a local minimum of the error function may be reached and not the global minimum. If large steps are taken, then the network could oscillate around the global minimum, not reaching it. The Backpropagation Algorithm is based on the approximation that changes in one weight have no e ect on the error gradient at other weights. In reality when one weight is changed, the error gradient at other weights varies as well. The algorithm doesn't take this into consideration, so the descent in the error space may be sometimes wrongly directed, causing a slowdown in the speed of convergence of the algorithm. This problem is referred to as the moving target problem Fahlman & Lebiere, 1990]. Several variations has been suggested to solve the above mentioned problems and to speed up the convergence of the algorithm. What follows are two variations on the original Backpropagation Algorithm, the Quickprop Learning Algorithm and the Bold Driver Method.

2.2.4 The Quickprop Learning Algorithm

Quickprop Fahlman, 1988] is a modi cation on the Backpropagation learning algorithm presented above. The Quickprop Learning Algorithm di ers from Backpropagation in the way the weights are adapted. The algorithm assumes that the total error vs. weight curve can be approximated by a parabola and that the change in the slope of the error curve, as seen by each weight, is not a ected by all the other weights that are changing at the same time. For each weight, a copy of the previous error slope and the previous weight change is kept. Then, together with the current

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

26

error slope, a jump is done to the minimum of the parabola and the weight is updated with this minimum. The weight updating rule then becomes:

l Sji (t) l l wji (t) = S l (t ? 1) ? S l (t) wji (t ? 1) ji ji l l l where Sji (t) and Sji (t ? 1) are the current and previous error slopes with respect to weight wji (t). l Sji (t) is de ned as: P ?1 @E P ?1 X X @E l l l Sji (t) = @wl ((tt)) = @wlp((tt)) = ? pj Opi ji p=0 ji p=0 l where E (t) and pj have the same value as in standard back-propagation and P is the number of training patterns. l The error slope is the derivative of the total error measure E , so the error terms, pj , have to be accumulated over all P input patterns. This means that the weights are only adjusted after the presentation of all patterns of the training set and that extra storage is needed per weight to accumulate the error terms. Also extra storage is needed per weight to keep the previous error slope. l A way is need to start the updating rule or to continue it if the previous error slope, Sji (t ? 1), was 0. This is done in the algorithm by changing the weight updating rule to:

8 > < :

l wji (t) = >

l Sji (t) l (t?1)?S l (t) Sji ji l Sji (t) l l Sji (t?1)?Sji (t)

l l l wji (t ? 1) if (Sji (t):Sji (t ? 1) < 0) l l l l wji (t ? 1) + "Sji (t) if (Sji (t):Sji (t ? 1) 0)

Here, the current slope times a gain term, ", is added to the weight if the previous slope is 0, or if the previous and the current slope are in the same direction, i.e. have the same sign. The last from empirical results showing that this increases the speed of convergence. The parameters is also introduced in the weight updating rule, to avoid taking too big steps. The weight updating rule then becomes:

l l Wji (t) = min( wji (t); l wji (t ? 1))

where Fahlman suggests a value of = 1.75.

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

27

The Quickprop Learning Algorithm has been shown to achieves faster convergence for the Encoder/Decoder problem and the XOR problem Fahlman, 1988].

The Quickprop Learning Algorithm

1. Initialize weights Set all weights to small random values. 2. Calculate error slopes For all the training patterns, accumulate the error slopes. For all error slopes (error slopes initialization) l l Sji (t ? 1) = Sji (t) l (t) = 0 Sji for p from 0 to P-1 do 2.1 Calculate actual output - Forward pass 2.2 Accumulate error slopes - Backward pass 2.1. Calculate actual output - Forward pass Starting from the rst hidden layer, calculate hidden units outputs and pass them as inputs to the next layer, until the output layer is reached.

for l from 0 to N do for j from 0 to nl ??11 do P

l ypj (t) = fh nl i=0

? l wij (t)olpi 1 (t)

Where 0 l N refers to the layer number, N being the total number of layers. 0 j (nl ? 1) refers to the unit number within layer l, nl being the number of units in layer l, and 0li?1 (t) being the output of unit l i of the previous layer l ? 1 (where 00(t) is the input xi(t)). wij is the weight from unit i (of layer l ? 1) to i unit j of layer l.

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

2.2 Accumulate error slopes - Backward pass Start from the output layer, and working backwards,

28

for l from N to 1 do for j from 0 to nl ? 1 do If l=N (for output units) N N N N pj (t) = 0pj (t)(1 ? 0pj (t))(dpj ? 0pj (t)) l (t) = S l (t) ? l (t)Ol (t) Sji ji pj pi

If l 6= N (For hidden units) Pn +1 ?1 l+1 l+1 l N N pj (t) = 0pj (t)(1 ? 0pj (t)) k=0 pk (t)wjk (t) l l l l Sji (t) = Sji (t) ? pj (t)Opi (t)

l

3. Calculate Total Error Compute the total error using equation (2.4). If the total error is within the desired value STOP, 4. Adapt weights For all weights

l wji(t) = >

: 8 > <

l Sji (t) l S t?1)?Sji (t) l Sji (t) l l Sji (t?1)?Sji (t) l ji (

l l l wji(t ? 1) if (Sji (t):Sji (t ? 1) < 0) l l l l wji(t ? 1) + "Sji (t) if (Sji (t):Sji (t ? 1) 0) l wji(t ? 1))

l l Wji(t) = min( wji(t);

5. Next training set presentation Go back to step 2.

CHAPTER 2. NON-CONSTRUCTIVE SUPERVISED LEARNING

29

2.2.5 The Bold Driver Method

The Bold Driver Method Vogl et al.] is a modi cation on the Backpropagation Learning Algorithm. In order to overcome the problem of the Backpropagation Learning Algorithm, the Bold Driver Method makes two changes to the way weights are adjusted: 1. the weights are updated after the presentation of all input patterns (as in Quickprop), 2. the Learning rate is varied with each weight update, and The new update rule becomes:

l wji (t) =

(t)

X

p

l l ( pj (t)Opi(t)) +

l wji (t ? 1)

After each update, the total error E(t) is compare with the previous value E(t-1). If the update reduced the error, then the learning rate is increased by a factor > 1. Otherwise all weight changes are rejected and the learning rate is decreased by multiplying it by a factor < 1 and is set to zero. When a successful step is taken, is reset to its original value.

Continuous Constructive Supervised Learning

3

The previous chapter have presented algorithms that belong to the class of non-constructive supervised learning algorithms. What characterizes the non-constructive algorithms is the need to determine the network topology, i.e. the amount of layers and the number of processing units (neurons) per layer, before the learning algorithm is applied. The learning accuracy and the generalization accuracy are, then, a function of the chosen network topology. Choosing the optimal network topology given a problem remains an unsolved problem. In this chapter, a di erent class of learning algorithms, referred to as constructive or dynamic learning algorithms, is presented. What characterizes constructive supervised learning algorithms is that the determination of the network topology is part of the algorithm. Usually constructive algorithms starts with a minimal network consisting of only input and output units interconnected, their number dictated by the input and output space of the problem given. Then the constructive learning algorithm gradually adds units between the input and output units, where these new units can form one or more hidden layers. The criteria when to add new units, how many at a time and where the new units will be placed in the network depends on the algorithm chosen. The growth process is nished when the network has learned the training set to an acceptable degree, which is usually a measure of the remaining output error for the training set, as in the non-constructive learning algorithms presented in the previous chapter. The advantage of constructive learning algorithms over non-constructive ones comes from not having to determine a priori the topology of the network, which in the non-constructive case is usually done based on empirical criteria and may require a trial and error process to determine the best topology for a given problem and a given training set. 30

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

31

Constructive learning algorithms may also give a faster speed of convergence than non-constructive ones since the training is done on an increasingly growing network, so the weights adjusted at each step are less than in the case of training the full network at each step. A di culty that remains with constructive learning algorithms is the fact that the nal topology achieved by the algorithm is not necessarily the optimal one for the given problem. The nal topology is a function of the construction method of the algorithm, i.e. the way units are added to the network, and a construction method that will achieve the optimal network topology, in terms of generalization accuracy, number of layers and number of units, for any given problem, is yet to be found. One approach, which the results shown later in this chapter suggest, is for a given problem to try di erent construction methods, and select the one that achieves best results (where the criterion for comparison should usually be the generalization accuracy). This chapter deals with constructive supervised learning algorithm for continuous networks, where by continuous networks it is understood networks with continuous input and output spaces (as opposed to discrete networks with discrete input and output spaces). The algorithm rst discussed in section 3.1 is the Cascade Correlation Learning Algorithm Fahlman & Lebiere, 1990]. Variations based on the Cascade Correlation Algorithm are then presented. In the variations Simon, Corporaal & Kerckho s, 1992], Simon, Kerckho s & Corporaal, 1992] and Sjogaard, 1991], a di erent construction method is employed. Neural Networks are a powerful computing paradigm that can be used in a wide range of elds. In section 3.2, three types of applications where Neural Networks can be used are presented. They are the inverse kinematic transformations of two and three degrees of freedom robot arm controllers, a pattern recognition problem and the Hamming Code generation problem. The inverse kinematic transformations are suitable to compare di erent learning algorithms, since they constitute a real world application and they require multiple outputs. These characteristics distinguish them from commonly used benchmarks, as the XOR/Parity problem and the two spiral problem, where one output is required, and the Encoder/Decoder problem, where there are multiple outputs, but for all output patterns, only one output is on at a time. The pattern recognition problem and the Hamming Code generation problem, which require multiple outputs, are also used for comparison among the learning algorithms. In this way, the comparison among the algorithms is based upon three problems of di erent nature, belonging to three di erent areas. In section 3.3, the performances of the algorithms in this and the previous chapter, in terms of generalization accuracy, learning speed and size of network, are presented and are used to compare the di erent algorithms. Section 3.4 discusses the results obtained in the previous section and outlines future work.

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

32

OUTPUTS

O1

O2

I N P U T S bias

Figure 3.1: Initial Cascade-Correlation Architecture.

3.1 The Algorithms

This section presents and discusses the Cascade Correlation Learning Algorithm and variations based on it. The Cascade Correlation Learning Algorithm is a constructive supervised learning algorithm for continuous networks. The variations presented here consists of modi cations in the construction method of the original algorithm.

3.1.1 The Cascade-Correlation Learning Algorithm

Algorithm Description

The initial architecture of the Cascade Correlation Learning Algorithm is illustrated in Fig. 3.1. It begins as a one layer perceptron, with each input unit connected to each output unit. The "X " shown in the gure represent the connection weights. The learning algorithm is divided in two parts: training of output units and new unit creation. In the rst part the direct input-output connections are trained as well as possible over the entire training set using the Quickprop learning algorithm Fahlman, 1988]. Here no back propagation is needed since only one layer is trained. When no signi cant error reduction has occurred after a certain number of training cycles, the

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

33

OUTPUTS

O1

O2

H1

I N P U T S bias

Figure 3.2: Cascade-Correlation Architecture: after 1 hidden unit is added. Boxed connections are frozen, x connections are trained repeatedly.

OUTPUTS

O1 H2 H1

I N P U T S bias

O2

Figure 3.3: Cascade-Correlation Architecture: after 2 hidden units are added.

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

34

entire training set is run through the network one last time to measure the error. If the error is less than the speci ed tolerance the algorithm stops. If it is necessary to reduce this error, then the new unit creation part of the algorithm begins. In the unit creation part of the algorithm, a new hidden unit, receiving trainable input connections from all of the network's external inputs and from all existing hidden units, is trained as well as possible over the entire training set using the Quickprop learning algorithm. Here the training is done to maximize the correlation of the output of this new hidden unit to the sum of the remaining errors at the outputs. When no signi cant error reduction has occurred after a certain number of training cycles, the output of this new unit is connected to all output units of the network. Fig. 3.2 shows the network after one hidden unit has been added. Fig. 3.3 shows the network after two hidden units have been added. The way a new hidden unit is trained by the Quickprop learning algorithm, to maximize the correlation of its output to the sum of the remaining errors at the outputs, is explained below. Note that the new hidden unit is trained before it is added to the network. The goal of the training is to maximize S , the sum over all output units o of the magnitude of the correlation between oh , the new hidden unit's output value, and Eo, the residual output error observed at output unit o. S is de ned as

S=

X X

o

p

(oh ? oh )(Ep;o ? Eo) p

where Ep;o is the error at output o for input pattern p, oh is the output of the new hidden unit for p input patter p, and the quantities oh and Eo are the values of oh and Eo averaged over all patterns. To maximize S , the partial derivative of S with respect to each of the new unit's incoming weights, wih , is computed. This gives

@S=@wih =

X

p;o

o (Ep;o ? Eo )fp Ii;p

0

where o is the sign of the correlation between the new unit's output value, oh , and the error of p 0 output o, Ep;o. fp is the derivative for pattern p of the new unit's activation function with respect to the sum of its inputs. Ii;p is the input the new unit receives from unit i for pattern p, where unit i is either an input unit or the output of a previously added hidden unit. Now, a gradient ascent is performed to maximize S using again the Quickprop learning algorithm. Here the only weight that are adjusted by the Quickprop algorithm are the new hidden unit's incoming weights, wih . This means the Quickprop algorithm is used for single layer training. The incoming weights of previously added hidden units remain frozen. In this way no backpropagation is required. When S stops improving, the new unit's output is connected to each of the network's output units and its input weights remain frozen for the rest of the training. Now the training of the output units starts again, this time with the extra weights coming from the added hidden unit.

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

35

To speed up the training of a new unit, a pool of candidate units are trained in parallel, each starting with a di erent set of random weights. The candidate unit that scores the highest S is chosen as the new unit to be added to the network. Since the weights of the hidden units don't change, hidden units' output values can be cached. By correlating the value of the hidden unit to the remaining error at the output units, the output units are able to reduce the error in the next training round since for the input patterns that cause high errors in the outputs, the hidden unit will produce a high value.

Discussion of Cascade-Correlation

The Cascade-Correlation learning algorithm avoids the need to propagate the error backwards though the network, as in the Backpropagation algorithm. This is achieved though training only one layer at a time to decrease the error function, while other weights in the network remain frozen. By keeping part of the weight space frozen, the moving target problem discussed in chapter 2 for the Backpropagation algorithm, should be reduced. The step-size problem found discussed in chapter 2 for the Backpropagation algorithm is avoided here by dynamically adding new hidden units to the network. The algorithm won't get stucked at a local minimum because the residual error will cause the algorithm to add a new unit, changing the weight space. This avoids also the the problem of oscillating around the global minimum. The Cascade-Correlation algorithm still lacks a bound on the number of units it produces. The size of the network is not know in advance and the lack of a bound on the maximum number of units can be a problem for the implementation of the algorithm.

Second Order Cascade-Correlation

A variation made on Cascade-Correlation Sjogaard, 1991] consists of allowing the network to grow horizontally instead of vertically. In other words, every time a hidden unit is added, it is added to the one and only hidden layer. This is achieved by letting the new hidden units receive connections only from the input and bias units and not from previously added hidden units as in the original version. This is shown in Fig 3,4. In this case the network build is a two-layer perceptron. Two-layer perceptrons can be used in a way range of classi cation problems Sjogaard, 1991].

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

36

OUTPUTS

O1 H2 H1

I N P U T S bias

O2

Figure 3.4: Second-Order Cascade-Correlation Architecture

3.1.2 Modi ed Cascade-Correlation

Algorithm Description

The new version of the Cascade-Correlation algorithm Simon, Corporaal & Kerckho s, 1992] changes the way new units are added to the network. Here a number of new units equal to the number of outputs is added every time the training of output units cannot decrease the error. This should accelerated the convergence of the network, since each hidden unit helps correct the remaining error at one output and not the sum at all outputs, as in the original algorithm. New units are denoted Hl:o . The index l denotes the layer to which the hidden unit belongs to. The index o denotes to which output the hidden unit belongs or correlates to (o = 1; 2; :::; No, where No equals the number of output units). Each new unit is independent of the other new units in the same layer so they can be trained in parallel. Each new unit receives trainable input connections from all of the network's external inputs and from previously added units with the same o index (see Fig. 3,6). Before the outputs of the new units are connected, a number of passes are run over the training set, adjusting each new unit's input weights after each pass. The goal of this adjustment is to maximize each So , the sum of the magnitude of the correlation between ol:o , the output value of the new hidden unit Hl;o, and Eo , the residual output error observed at output unit o. So is de ned as

So =

X

p

(ol:o ? ol:o )(Ep;o ? Eo) p

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

37

OUTPUTS

OUTPUTS

O1

O2

O1

O2

H1.1

I N P U T S bias I N P U T S bias

H1.1

H1.2

Mode 1

H1.2

Mode 2

Figure 3.5: New Architecture after 1 groups of 2 hidden units have been added: (a) Mode 1, (b) Mode 2.

OUTPUTS

OUTPUTS

O1 H2.1 H1.1

I N P U T S bias

O2 H2.1 H1.1

I N P U T S bias

O1

O2

H1.2 H2.2

Mode 1

H1.2 H2.2

Mode 1

Figure 3.6: New Architecture after 2 groups of 2 hidden units have been added: (a) Mode 1, (b) Mode 2.

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

38

where Ep;o is the error at output o for input pattern p, ol:o is the output of the new hidden unit p Hl:o for input patter p, and the quantities ol:o and Eo are the values of ol:o and Eo averaged over all input patterns. To maximize So , the partial derivative of So with respect to each incoming weight, wil:o , to new hidden unit Hl:o, is computed. This gives

@So =@wil:o =

X

p

o (Ep;o ? Eo)fp Ii;p

0

where o is the sign of the correlation between the new unit's output value, ol:o , and the error of p 0 output o, Ep;o. fp is the derivative for pattern p of the new unit's activation function with respect to the sum of its inputs. Ii;p is the input the new unit receives from unit i for pattern p, where unit i is either an input unit or the output of a previously added hidden unit belonging to the same l output, op0 :o for l0 < l. Now, a gradient ascent is performed to maximize each So , using the Quickprop learning algorithm with only the incoming weights to input layer l, wil:o , being adjusted. As in the original algorithm, a pool of candidate units, here for each new unit, are trained in parallel, each starting with a di erent set of random weights. Two modes exist for the way the hidden units are connected to the output units. In the rst mode, when all So stop improving, each new unit's output is connected only to the respective output unit (Fig. 3,5(a) and Fig. 3.6(a)). In the second mode, when all So stop improving, each new unit's output is connected to all output units (Fig. 3,5(b) and Fig. 3.6(b)). After the hidden units have been connected, either in mode 1 or 2, their input weights remain frozen and the algorithm starts again with the training of the output units.

Discussion of Modi ed Cascade-Correlation

The Modi ed Cascade-Correlation changes the way the network is constructed. Now the number of hidden units added at a time is a function of the number of output units. The fact that each added hidden unit is trained to correlate to the error of only one output unit, should lead to a faster correction of the errors at the output units. The fact that the training of the hidden units can occur in parallel should make this algorithm faster for problems with multiple outputs. Notice that in case the network has only one output unit, the original algorithm and the new version are exactly the same.

Second Order Modi ed Cascade-Correlation

Here, as in the original Cascade-Correlation, the second-order version is achieved by connecting the hidden units only to the output and bias units as shown in Fig. 3.7.

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

OUTPUTS OUTPUTS

39

O1 H2.1 H1.1

I N P U T S bias

O2 H2.1 H1.1

I N P U T S bias

O1

O2

H1.2 H2.2

Mode 1

H1.2 H2.2

Mode 1

Figure 3.7: Second-Order New Architecture

3.2 The Applications

Neural Networks are a powerful computing paradigm that can be used in a wide range of elds. In this chapter three types of applications where Neural Networks can be used are presented. They are the the inverse kinematic transformations of two and three degrees of freedom robot arm controllers, a pattern recognition problem and the Hamming Code generation problem.

3.2.1 Inverse Kinematic Transformations for Robot Control

Control Engineering is one area for which neural networks are attractive because of their ability to learn, to approximate functions, to classify patterns and because of their potential for massive parallel hardware implementation Antsaklis, 1992].

Robot Controller

One application in this area is the control of a robot arm. Fig. 3.8 shows shows the con guration of such a robot control. Consider a jointed two degrees of freedom robot arm that operates in a plane, as shown in Fig. 3.9(a). To make the robot arm move, the coordinates of the desired end point (x,y) are fed to a robot controller so that it generates the required joint angles ( 1 ; 2). The solution to the inverse

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

(x,y,z) d controller A

-1

40

(x,y,z) a

(θ ,θ ,θ ) 1 2 3

robot A

Figure 3.8: Robot Controller Con guration

Z L2

3 2 2

Y L1

L3

L2 L1

1 1

Y

X

(a)

X

(b)

Figure 3.9: Robot Arm: (a) two degrees of freedom, (b) three degrees of freedom. kinematic transformation that maps the end position to the joint angles is given by 2 2 ? L2 2 cos 2 = x + y2L L 1 ? L2 1 2 y ) ? arctan( L2 sin 2 ) 1 = arctan( x L + L cos

1 2 2

Fig. 3.9(b) presents the three-dimensional case where the robot arm consists of three segments. In this case the solution is given by

1

cos

3 2

= ?arctan( x ) y 2 + y 2 + (z ? L1 )2 ? L2 ? L2 2 3 = ?x 2L2L3 z 3 = arctan( p ? L1 2 ) ? arctan( L L3sincos ) x2 + y 2 + L3 3

A conventional robot controller relays on the mathematical model of the coordinate transformations required to direct the robot arm to the desired position. In real world situations the wearing down

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

41

x,y

Robot Controller

θ ,θ

1

2

Σ

θ ,θ

1

2

Neural Network

?θ 1 , ?θ 2

Figure 3.10: Controller with Corrective Neural Network. of the robot and external disturbances cause the the model to be inaccurate so the controller does not achieve the right coordinate transformations. A neural network can be used as robot controller by making it learn the inverse coordinate transformations of the robot arm Rabelo & Avula, 1992]- Bassi & Bekey, 1989]. The advantage of using neural networks is that no mathematical model of the robot, the wear out and external disturbances is required. The neural network can learn the coordinate transformation of the robot from a set of transformation examples from real robot experiments. Here, the mathematical model of the coordinate transformations is used to generate the training sets and test sets. Ideal conditions with no disturbances are considered.

Corrective network

A related problem presented in Josin, Charney & White, 1988] is the use of a neural network as a corrective network for the controller of a damaged robot arm. The con guration of such a network is shown in Fig. 3.10. Here the damage is modeled by a bend at a certain point along the robot arm, as shown if Fig. 3.11. In this case the neural network learns the di erence in angles ( 1 ; 2) needed to correct the error between the theoretically calculated joint angles of the kinematic transformation and the actual angles required to place the robot arm at the desired position according to this model. In this case the input to the neural network consists of the desired end coordinates and the uncorrected joint angles calculated by the robot controller. The outputs are the joint angle corrections. In general, the error function of a controller is not precisely known. Therefore it is not possible to directly determine the corresponding correcting output signals of the correcting network (necessary to generate the training set)for a given set of inputs. This problem can be solved in the following way. A training set for the correcting network is composed of a set of input points (spatial locations and controller joint angle response to those locations) and targets (di erence between the controller output and the input of the robot that is required to reach the given spatial positions). For each

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

42

L A β L’ α

L’ = A + (L - A) -2A(L - A) Cos ( π ?α )

2

2

Sin β = L-A Sin α L’

Figure 3.11: Model of a Bent Arm

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

43

spatial position, two sets of joint angles are required to calculate the di erence that is used as target. One set contains the controller generated angles, the other set contains the actual required angles to reach the spatial position. The problem encountered is that the inverse transformation for a given spatial position is , due to unknown changes in the robot, not know. This is solved by using the controller twice. First the controller output is determined for a given position in space. This will move the robot to a position in space that has a slight position deviation from the input position. By determining the position of the robot, a set of joint angles and the corresponding spatial position is formed. Next the actual spatial position is the input to the controller. This will move the robot to a new spatial position using the newly generated joint angles. The measured spatial position is used as input part of the training set. The target part of the training set is calculated by subtracting the controller generated joint angles for the input point from the actual joint angles corresponding to the input point.

3.2.2 Hamming Code Generation

At present, the benchmark that appears most often in the neural network literature is the exclusiveor problem, also called the XOR problem: given two binary inputs, the output will turn on if one of the inputs is on, but not if both inputs are on. This problem can be generalized to N-input parity problem, where the output is on if an odd number of inputs are on Tesauro & Janssens, 1988]. Here, another benchmark is presented, the Hamming Code Generation problem. The Hamming Code Generation problem is related to XOR/Parity problem. It represent a more general problem, requiring multiple outputs, for which the XOR/Parity problem is a special case, i.e. a Hamming Code Generation problem with one output becomes the parity problem. The Hamming Code Generation problem goes one step further than the XOR/Parity problem since it can be de ned with any number of outputs. This makes it suitable to test how learning algorithms scale to multiple outputs. How learning algorithms scales to multiple outputs is an essential point of comparison since most real live applications require multiple outputs.

Problem Description

The Hamming Code Generation problem consists of determining the r check bits of the Hamming Code Hamming, 1950] for a binary input of m bits, where m + r + 1 2r . Each check bit is the parity bit of a di erent set of data bits, where the r sets are not disjoint. The uses of redundant information, i.e. the fact that each data bit belongs to more than one set of data bits from which the parity is computed, makes it possible to localize and so to correct faults in the data. The number of bit positions in which two codewords di er is called the Hamming Distance. It signi cance is that if two codewords are a Hamming Distance d apart, it will require d single bit errors to convert one into the other. Given a list of codewords, the Hamming Distance of the complete code is equal to the Hamming Distance of the two codewords whose Hamming Distance is minimum. To correct d errors, a code with a Hamming Distance of 2d + 1 is needed. This since the original codeword will still be closer than any other codeword, so it can be uniquely determined. To design a code with m message bits and r check bits that will allow all single errors to be

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

Character ASCII code Hamming code H a c d m n 1001000 1100001 1100011 1100100 1101101 1101110 00110010000 10111001001 11111000011 11111001100 11101010101 01101010110

44

Table 3.1: Hamming code of ASCII characters corrected, consider that each of the 2m legal messages has n = m + r illegal codewords at a distance 1 from it (this are formed by systematically inverting each of the n bits in the n-bit codeword formed from the m message bits). Thus each of the 2m legal messages requires n + 1 bit patterns dedicated to it. Since the total number of bit patterns is 2n , it is required that (n + 1)2m 2n . Using n = m + r this requirement becomes (m + r + 1) 2n . Given m, this puts a lower limit on the number of check bits needed to correct single errors. This lower limit can be achieved with the following method Hamming, 1950]: the bits of the codeword are numbered consecutively, starting with bit 1 at the left end. The bits that are powers of 2 (1, 2, 4, 8, 16, etc.) are check bits. The rest (3, 5, 6, 7, 9, etc.) are lled up with the m data bits. Each check bit forces the parity of some collection of bits, including itself, to be even. A bit may be included in several parity computations. To see which check bits the data bit in position k contributes to, rewrite k as a sum of powers of 2. For example, 11 = 1 + 2 + 8 and 29 = 1 + 4 + 8 + 16. A bit is checked by just those check bits occurring in its expansion (e.g. bit 11 is checked by bits 1, 2, and 8). When a codeword arrives, the receiver initializes a counter to zero. It then examines each check bit k (k=1, 2, 4, 8, ... ) to see if it has the correct parity. If not, it adds k to the counter. If the counter is zero after all the check bits have been examined (i.e., if they were all correct, the codeword is accepted as valid. If the counter is nonzero, it contains the number of the incorrect bit. For example, if check bits 1, 2, and 8 are in error, the inverted bit is 11, because it is the only one checked by bits 1, 2, and 8. Table 3.1 shows some 7-bit ASCII characters encoded as 11-bit codewords using a Hamming code. The data are found in bit positions 3, 5, 6, 7, 9, 10, 11.

3.2.3 Pattern Recognition

The following application is a pattern recognition problem, where it is required to identify one of eight di erent patterns, even if a pattern is rotated 90, 180 or 270 degrees. The patterns are presented in a 3x3 grid and the are are shown in Fig. 3.12. The binary code shown under each pattern uniquely identify that pattern and should then be the output of the network

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

45

000

001

010

011

100

101

110

111

Figure 3.12: Patterns with corresponding binary codes when that pattern is presented as input, invariant of rotation.

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

46

3.3 The Simulation Results

The results presented below give a comparison between the Cascade-Correlation algorithm and the Modi ed Cascade-Correlation algorithm. For both algorithms the high order version was used.

3.3.1 Inverse Kinematic Transformations for Robot Control

We compare the learning speed of the algorithms as a function of training set size and as a function of the output tolerance, given in radians. The training set consists of uniformly spread points in the rst quadrant of the X-Y plane for the robot with two degrees of freedom and for the corrective network, p for the three degrees of freedom robot in the positive X-Y-Z space. All robot links, and 1 Li, are 2 2 length units. The training of the output units uses a linear function and the training of the hidden units the symmetric Sigmoid function. This combination gave the best results. The same training parameters are used in all simulations. The number of candidate units is eight. All candidate units of all new units are trained in parallel. All simulations are run on a Sun4 workstation.

Robot Arm Controller

Table 3.2 compares the average in training epochs, de ned as the number of passes through the training set, and number of hidden units for di erent sizes of the robot controller's training set. The tolerance for the output angles was 0.025 rad. The notation C ? C indicates Cascade-Correlation, Mode 1 and Mode 2 indicate the two modes of the Modi ed Cascade-Correlation and the success rate p=q indicates that p out of q trials were successful, where each trial starts from the initial state with di erent random weights. The speedup is given for the average epochs. The results show that the new version learns faster and scales better when the training set increases. The speedup in average epochs increases with the training set size. In the two degrees of freedom case, when the training set size is more than 81, Mode 2 is not only faster but requires the same number or less hidden units than the original algorithm. Notice that for the Cascade-Correlation algorithm, the average number of hidden units is equal to the average number of hidden layers, since each layer has one hidden unit. For the Modi ed Cascade-Correlation algorithm, the number of average hidden layers is equal to half the number of average hidden units in the two dimensional case, and a third of the number of average hidden units in the three dimensional case. This since for two-output problems, two hidden units are added to each hidden layer and for threeoutput problems, three hidden units are added to each hidden layer. This means that the Modi ed Cascade-Correlation algorithm results also in less deep networks. Mode 2 architecture, where each output receives connections from all hidden units , performs better than Mode 1. This could be explained by the existence of a correlation between the outputs, so hidden units that correlate to one output help other outputs. We tested the network with 100 previously unseen points forming a circle. Table 3.3 gives the average, maximum and standard deviation of the error for di erent sizes of the training set. The error is measured as the Euclidean distance between the actual end point of the robot arm and the desired end points in length units. The tolerance during the learning was 0,025 rad. The results are given for the original Cascade-Correlation and for the new algorithm with Mode 2. The original

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

Training Set

Algorithm C-C Avg. Units 1 2 2 5 8 6 12 23 16 30 33 27 Avg. Epochs 91 106 103 504 529 402 1753 1419 1123 4368 2042 1833 Success 10/10 10/10 10/10 10/10 10/10 10/10 10/10 10/10 10/10 9/10 10/10 10/10 10/10 10/10 10/10

SpeedUp

47

1 0.86 0.88 1 0.95 1.25 1 1.24

4

9

Mode 1 Mode 2 C-C Mode 1 Mode 2 C-C

Training Set

Algorithm C-C

Avg. Units 1 2 2 3 8 6 17 37 24 36 65 49

Avg. Epochs 143 163 134 364 318 249 2386 1584 1332 5545 2806 2688

Success 10/10 10/10 10/10 10/10 10/10 10/10 10/10 10/10 10/10 8/10 9/10 10/10

SpeedUp

1 0.88 1.07 1 1.14 1.46 1 1.51 1.79 1 1.98 2.06

4 8 27 64

Mode 1 Mode 2 C-C Mode 1 Mode 2 C-C

25

81

121

Mode 1 Mode 2 C-C Mode 1 Mode 2 C-C Mode 1 Mode 2

1.56 1 2.14 2.38 1 2.45 2.87

Mode 1 Mode 2 C-C Mode 1 Mode 2

31 40 31

(a)

6200 2534 2159

(b)

Table 3.2: Average training epochs: (a) two degrees of freedom, (b) three degrees of freedom algorithm could not converge with a training set size of 289 patterns. The results show a decrease in average error and standard deviation as the set size increases. Both algorithms exhibit comparable generalization capabilities. For the two degrees of freedom case, we can compare the accuracy of generalization of the new algorithm (Mode 2) with that obtained by Josin Josin, Charney & White, 1988], who used backpropagation. In Josin, Charney & White, 1988], the architecture used consisted of one hidden layer of 32 neurons, 2 input and 2 output units. Josin reports that the test set, 100 points on a circle of radius 0.2 length units, was within an Euclidean distance of 0.025 length units from the desired end points when the training set, a rectangular grid covering the test circle, was of 8 to 25 points. Less training points could not achieve that error tolerance. Using the same training and test sets, we obtained the same error tolerance of 0.025 length units with a training set of 8 points and the network built by the algorithm had 4 hidden neurons. With a test set of 9 points an error tolerance of 0.015 length units was achieved and the network required 9 hidden units. Finally with a test set of 25 points, an error tolerance of 0.012 length units was achieved with 9 hidden units. Table 3.4 presents the average in training epochs and number of hidden units for di erent tolerances in the corrective network. The results show that the new version also scales better as low tolerance is required. For a tolerance of 5x10?3 or less the new algorithm requires less hidden units and converges faster than the original.

Comparison with Back-propagation

Corrective Network

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

Train Set Avg. Algorithm Error C-C 9 Mode 2 C-C 25 Mode 2 C-C 81 Mode 2 C-C 121 Mode 2 C-C 289 Mode 2 0.0228 0.1557 0.0300 0.0215 0.0590 0.0253 0.0189 0.0539 0.0234 ---125 Max. Error Std. Dev. Train Set Avg. Algorithm Error C-C 8 Mode 2 C-C 27 Mode 2 C-C 64 Mode 2 C-C Mode 2 0.0734 0.2250 0.0887 0.0447 0.2261 0.0637 0.0364 0.1235 0.0262 Max. Error Std. Dev.

48

0.0386 0.0869 0.0425 0.0503 0.1736 0.0620 0.0298 0.0903 0.0351 0.0283 0.0928 0.0331 0.0163 0.0501 0.0197

0.2172 0.2848 0.2195 0.2713 0.3689 0.2794 0.0837 0.7346 0.1479 0.0667 0.1245 0.0712 0.1228 0.4078 0.1560

0.0161 0.0801 0.0137 (a) (b)

Table 3.3: Position Errors: (a) two degrees of freedom, (b) three degrees of freedom

Tolerance Algorithm C-C Avg. Units 0 0 0 4 3 2 15 10 8 60* 17 10 Avg. Epochs 6 6 6 1000 286 238 3439 1117 984 13270 1892 1201 Success 10/10 10/10 10/10 10/10 10/10 10/10 7/10 10/10 10/10 0/10 10/10 10/10 -5 -4 -3 Tolerance Algorithm C-C Mode 1 Mode 2 C-C Mode 1 Mode 2 C-C Mode 1 Mode 2 C-C Mode 1 Mode 2 Avg. Units 0 0 0 13 2 3 31 17 15 60* 37 22 Avg. Epochs 8 8 8 2951 327 359 6982 2036 1930 13466 4488 2831 Success 10/10 10/10 10/10 10/10 10/10 10/10 4/10 10/10 10/10 0/10 10/10 10/10

5x10 5x10 5x10 5x10

-2 Mode 1 Mode 2 C-C -3 Mode 1 Mode 2 C-C -4 Mode 1 Mode 2 C-C -5 Mode 1 Mode 2

5x10 5x10 5x10 5x10

-2

* Max. no. Units

(a)

(b)

Table 3.4: Average training epochs, (a) 9 train cases. (b) 16 train cases.

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

Tolerance

5x10

-2

49

Std. Dev.

Avg. Error

Max. Error

Std. Dev.

Tolerance

Avg. Error

Max. Error

0.0251 0.0320 0.0256 0

5x10 -2 0.0089 0.0130 0.0095 5x10 -3 0.0018 0.0043 0.0021 5x10 -4 0.0047 0.0246 0.0067 5x10 -5 0.0047 0.0107 0.0054 (b)

5x10 -3 0.0024 0.0049 0.0027 5x10 -4 0.0058 0.0122 0.0067 5x10 -5 0.0068 0.0153 0.0083 (a)

Table 3.5: Position Errors: (a) 9 train cases, (b) 16 train cases m r Algorithm C-C 3 3 Mode 1 Mode 2 C-C 4 3 Mode 1 Mode 2 C-C 5 4 Mode 1 Mode 2 C-C 6 4 Mode 1 Mode 2 Average Average Units Epochs Speedup 3 170 1 3 55 3.09 3 55 3.09 3.5 319 1 3 86 3.71 3 90 3.54 4,9 455 1 8 197 2.31 8 203 2.24 6.8 590 1 8.4 195 3.03 8,4 208 2.84

Table 3.6: Hamming Code: average training epochs for m data bits and r check bits For a tolerance of 5x10?5 the original algorithm is not able to converge after adding 60 hidden units. Again Mode 2 architecture performs better than Mode 1. Table 3.5 gives the average, maximum and standard deviation of the error for di erent tolerances for the test circle. The results are again for the new algorithm with Mode 2. The results show the well known over-training problem. When the training set is too accurately learned, the network generalizes less accurately.

3.3.2 Hamming Code Generation

Table 3.6 presents the average in training epochs and number of hidden units for di erent sizes of m and r. The results were obtained with Sigmoid output units and hidden units whose outputs are a Gaussian function of weighted inputs. This combination gave the fastest learning. The number

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

Average Average Algorithm Units Epochs Speedup C-C 2.3 325 1 Mode 1 4.2 233 1.40 Mode 2 3 177 1.84 Table 3.7: Pattern Recognition: average training epochs of candidate units is eight. The tolerance is 0.4. The combination m = 4 r = 3 is optimal: m + r + 1 = 2r .

50

3.3.3 Pattern Recognition

Table 3.7 presents the average in training epochs and number of hidden units for this problem. The results were again obtained with Sigmoid output units and Gaussian hidden units. The number of candidate units is eight. The tolerance is 0.4.

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

51

3.4 Discussion and Future Work

The Cascade-Correlation Learning Algorithm and modi ed versions presented in this chapter has the advantages that the algorithms determine the size and topology of the network and also automatically increases its size when lower tolerance is required or more training cases are used. This exibility makes the proposed algorithms attractive for real-world applications and solves the problem of determining the architecture of the network. The algorithm and architecture allow incremental learning.

3.4.1 Inverse Kinematic Transformations for Robot Control

For the robot control problems, Modi ed Cascade-Correlation was shown to learn quicker, scale better to more training cases and to lower tolerance, and to have a higher success rate than the original Cascade-Correlation. The robot arm control with two degrees of freedom allowed a comparison with previous results using Backpropagation. In this case the accuracy of generalization achieved by the Modi ed Cascade-Correlation algorithm is equal or better and with a much smaller network than achieved by Josin using back-propagation. Also no trial and error stage is needed to determine the amount and size of the hidden layers. The results show that only an increase in the training set size improves the accuracy of the controller. Learning the training set with very high accuracy produces the overtraining problem, where for points not in the training set the accuracy decreases. With the Modi ed Cascade-Correlation algorithm, connections from hidden units to all outputs make the convergence faster (Mode 2). This could be explained by the existence of a correlation between the outputs, so hidden units that correlate to one output help other outputs. The results obtained show that a method is required to determine the tolerance in training that will maintain the best generalization capability of the neural network. Further research should focus on this problem.

3.4.2 Hamming Code Generation

For the Hamming Code Generation problem, Modi ed Cascade-Correlation was shown to learn faster than the original Cascade-Correlation, For this problem the original Cascade-Correlation algorithm achieved a smaller number of hidden units than the Modi ed Cascade-Correlation algorithm. On the other hand, since the original Cascade-Correlation algorithm has one unit per hidden layer, the number of layers is less in the Modi ed Cascade-Correlation algorithm. Another important point in the results achieved is that for the Hamming Code Generation problem, the Modi ed Cascade-Correlation algorithm with Mode 1 architecture is faster in learning than Mode 2. This is opposite to the result obtained for the Inverse Kinematic Transformations for Robot Control. This could be explained by the lack of correlation between the check bits.

CHAPTER 3. CONTINUOUS CONSTRUCTIVE SUPERVISED LEARNING

52

3.4.3 Pattern Recognition

For the Pattern Recognition problem, Modi ed Cascade-Correlation was shown to learn faster than the original Cascade-Correlation, For this problem the original Cascade-Correlation algorithm achieved a smaller number of hidden units than the Modi ed Cascade-Correlation algorithm. On the other hand, there are less hidden layers in the Modi ed Cascade-Correlation algorithm. In this problem, so as with the Inverse Kinematic Transformations for Robot Control, the Modi ed Cascade-Correlation algorithm with Mode 2 architecture is faster in learning than with Mode 1 architecture. This again could be explained by the existence of a correlation between the outputs.

3.4.4 Future Work

The results presented here show that the answer to the question which algorithm is the best or which is the best architecture is dependent on the type of problem. Future work should test the algorithms presented in this chapter in other types of real-word applications.

Discrete Constructive Supervised Learning

4

In the previous chapter, the class of constructive learning was presented. There constructive supervised learning algorithm for continuous networks were discussed. This chapter deals with constructive supervised learning algorithm for discrete networks, where by discrete networks it is understood networks with discrete input and output spaces. Five learning algorithms are presented and discussed in sections 4.1. The rst four algorithms are the Pocket Algorithm Gallant, 1990]. the Tilling Algorithm Mezard & Nadal, 1989], the Tower Algorithm Gallant, 1990] and the Upstart Algorithm Frean, 1990]. The fth, the BINCOR Algorithm, is a new learning algorithm, introduced here. In section 4.2 two types of applications are presented. They are the Monk's problems and random boolean functions. Finally, in section 4.3, the performances of the algorithms in terms of generalization accuracy, learning speed and size of network are presented.

53

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

54

4.1 The Algorithms

4.1.1 The Pocket Algorithm

The Pocket algorithm Gallant, 1990] is a variation on the Perceptron algorithm. So as the Perceptron algorithm, the Pocket algorithm is used for single unit training. Although the Pocket algorithm is not a constructive algorithm, it is presented here since it is used by the Tilling algorithm, the Tower algorithm and the BINCOR algorithm, which are constructive algorithms discussed in this chapter. The background for the Pocket algorithm is the following: since a single unit cannot correctly classify all training patterns of non-linearly separable problems, the best that can be hoped for is a set of weights that correctly classi es as large a fraction of the training patterns as possible. Such a set of weights is called optimal. The Perceptron algorithm is not well behaved for non-linearly separable problems. While it will eventually visit an optimal set of weights, it will not converge to any set of weights. The algorithm can even go from an optimal set of weights to a worst-possible set in one iteration, regardless of how many interactions have been taken previously. The Pocket algorithm makes perceptron learning well behaved by adding positive feedback in order to stabilize the algorithm. The basic idea of the Pocket algorithm is to run the Perceptron algorithm, while keeping an extra set of weights in a "pocket". Here the Perceptron algorithm is run with a random presentation of the training patterns. Whenever the perceptron weights have a longest run of consecutive correct classi cations of the randomly selected training patterns, these perceptron weights replace the pocket weights. The pocket weights are the output of the algorithm. The Pocket algorithm is presented below.

The Pocket Algorithm

1. Initialize weights and variables De ne wi (t), (0 < i < n), to be the weight from input i at time t, and to the threshold value in the output node. Set w0 to be ? , and x0 to be always 1. Set wi (0) to zero, thus initializing all the weights and the threshold. Set runW = runW = num okW = num okW = 0, where runW = number of consecutive correct classi cations using perceptron weights W. runW = number of consecutive correct classi cations using pocket weights W . num okW = total number of training examples that W correctly classi es.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

num okW = total number of training examples that W correctly classi es. 2. Present input pattern and desired output

55

Present randomly chosen input pattern Xp = xp;1; xp;2; xp;3; :::; xp;n and corresponding desired output Dp . 3. Calculate actual output y(t) = fh ( n=0 wi(t)xp;i (t)) i 4. Adapt weights

P

If W correctly classi es Xp, i.e. y(t) is equal to the desired output of pattern Xp , Then If runW > runW Then Compute num okW by checking every training pattern If num okW > num okW Then

Set W = W Set runW = runW Set num okW = num okW If all training patterns are correctly classi ed Then Stop runW = runW +1

Else

Adapt weights W(t + 1) = W(t) + Dp Xp Set runW =0

4. Next iteration If the speci ed number of iterations has not been taken then go to 2.

The output of the algorithm is then the pocket weights W . As an example of the Pocket algorithm, consider the XOR problem. The training set, with bipolar input and output spaces, is given below (including the bias): X1 = (+1 -1 -1) D1 = -1 X2 = (+1 -1 +1) D2 = +1 X3 = (+1 +1 -1) D3 = +1 X4 = (+1 +1 +1) D4 = -1 Fig. 4.1 gives a sequence of nine iterations of the algorithm. At iteration eight both pocket and perceptron weights are optimal, classifying three out of four patterns correctly. At iteration nine the initial perceptron weights (0 0 0) have been reached. These misclassify every training pattern, whereas the pocket weights, (1 -1 -1), still get three out of four patterns correct.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

56

Iter. 1 2 3 4 5 6 7 8 9

(0 0 0) (-1 -1 -1) (-1 -1 -1) (0 -2 0) (1 -1 -1) (1 -1 -1) (1 -1 -1) (1 -1 -1) (0 0 0)

W

runW

0 0 1 0 0 1 2 3 0

(0 0 0) (0 0 0) (-1 -1 -1) (-1 -1 -1) (-1 -1 -1) (-1 -1 -1) (-1 -1 -1) (1 -1 -1) (1 -1 -1)

W

runW

0 0 1 1 1 1 2 3 3

Choice OK? Action X4 no W = W - X4 runW = 0

X4 X2 X3 X4 X2 X3 X1

...

yes no no yes yes

runW = runW + 1 W =W runW = runW W = W + X2 runW = 0 W = W + X3 runW = 0 runW = runW + 1

runW = runW + 1 W =W runW = runW yes runW = runW + 1 runW = runW

no

W = W - X1 runW = 0

Figure 4.1: Pocket Algorithm iterations.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

57

4.1.2 The Tilling Algorithm

The Tilling algorithm Mezard & Nadal, 1989] builds a network layer by layer in order to learn the mapping of any Boolean function of Ninp inputs. The input and output spaces of the mapping are bipolar, i.e. -1 and 1. The network is constructed from threshold units. The algorithm is described for mappings requiring one output unit. The training set consists of Np bipolar patterns, where Np 2Ninp . The algorithm is based on the idea of adding new layers in such a way that input patterns having a di erent output get a di erent internal representation in the newly added layer. Suppose that the Lth layer has been built, and it is made of NL units. To each input pattern Xp there correspond a set of values of the neurons in the Lth layer. This set of values is referred to as the internal representation of input pattern Xp in the layer L (the L=0 layer is the input layer,where the representation is the input patterns themselves). In the Tilling algorithm the concept of classes is introduced. Two input patterns belong to the same class for layer L, if they have the same internal representation, which is called the prototype L of the class. If there are Np di erent patterns of NL + 1 elements (1 extra for the bias unit), then L di erent classes in layer L. Each pattern being the internal representation of at least there are Np one input pattern. It holds then

Np NpL Np0 = Np

The algorithm introduces also the concept of faithful classes. Faithful Class: A class in layer L is said to be faithful if and only if all input patterns belonging to that class have the same desired output. The Tilling algorithm then makes sure all classes in an added layer are faithful. The algorithm then maps the prototype of each class onto its desired output, which is the desired output of all input patterns belonging to that class. If this mapping is still not linearly separable then only part the classes is mapped and a new layer is required. At each new layer the previously mapped classes are not spoiled and a new internal representation is found for the unmapped classes of the previous layer in such a way that at least a part of them will be correctly mapped. The way this is done is explained below. In each layer L, the algorithm distinguishes two types of units. The rst unit is called the master unit and plays a special role. The rest of the units are called ancillary units. There is a certain number of input patterns, NeL , for which the output of the master is not the desired output, and the remaining Np ? NeL , for which the master unit gives the desired output. From layer L ? 1 the algorithm will generate the master unit of layer L trying to minimize the number of errors NeL , and, in particular, it will always nd a solution such that

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

NeL NeL?1 ? 1

58

This means that the algorithm will converge after a nite number of layers. In the last layer L , NeL = 0, and the master unit is the desired output unit. All other units in layer L, the ancillary units, are used to create internal representations in the case the master unit has not become the desired output unit. The addition of ancillary units is done in order to maintain the faithfulness condition. The following theorem explains how convergence is achieved. Convergence Theorem: Suppose that all the classes in layer L ? 1 are faithful, and that the number of errors of the master unit, NeL?1 , is non-zero. Then there exists at least one set of weights w connecting the L ? 1 layer to the master unit of layer L such that NeL NeL?1 ? 1. Furthermore, one can construct explicitly one such set of weights. The proof is this theorem is a way to construct the weight vector. L L L L? Proof: Let XpL?1 = (xp;?1; xp;?1; xp;?1; :::; xp;N1L?1), where 1 p NpL?1 , be the prototypes in 0 1 2 L L layer L ? 1. Each prototype Xp ?1 is the internal representation of a certain number Rp ?1 of input patterns

L Np ?1 X

p=1

L Rp ?1 = Np

L and all of them have the same output op ?1 . The master unit is connected to the NL?1 + 1 units L L L L L of the L ? 1 layer by a set of weights W1L = (w0;1 ; w1;1; w2;1; :::; wNL?1;1). Note that w0;1 is the L weight connecting the bias unit to the master unit of layer L, and w1;1 is the weight connecting the L master unit of layer L ? 1 to the master unit of layer L. Let Xp0?1 be one prototype in layer L ? 1 L0 L for which xp0?11 (its master unit value) is not the desired output op0 of the RP ?1 input patterns ; L having Xp0?1 as internal representation in layer L ? 1 (the faithfulness condition implies that all these input patterns have the same desired output). Let W1L be de ned by L w1;1 = 1 L L wj;1 = L?1 op0 xp0?1 ; for j 6= 1 ;j L For a prototype Xp ?1 the master unit output will then be

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

NL?1 X j =0

59

oL;p = sgn( 1

L L? wj;1 xp;j 1 ) NL?1 X j 6=1;j =0 L L? xp0?1 xp;j 1 ) ;j

L = sgn(xp;?1 + L?1 op0 1

If

L?1 >

NL?1

1

L then the prototype XP 0?1 (and all input patterns belonging to that prototype) will have its desired output as output of the master unit of layer L. This is shown below NL?1 X

L oL;p0 = sgn(xp0?11 + L?1 op0 1 ;

= sgn(?op0 + = o p0

j 6=1;j =0 L?1 o 0 NL?1 ) p

L L xp0?1 xp0?1 ) ;j ;j

L What now remains to shown is that for all other prototypes Xp ?1, where p 6= p0 , the output of the master unit of layer L is equal to the output of the master unit of layer L ? 1 (this means a prototype has been corrected without spoiling other prototypes that could be already correct). P L L L? Consider then the quantity j N6=1?;j1=0 xp0?1 xp;j 1 ) j. It holds then that j ;j

j

NL?1 X j 6=1;j =0

L L? xp0?1 xp;j 1 ) j (NL?1 + 2); where p 6= p0 ;j

since each prototype di ers in at least one bipolar element. If

L?1 =

NL?1 ? 1

1

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

then

60

j

L?1 o 0 p

NL?1 X j 6=1;j =0

L L? xp0?1 xp;j 1 ) j < 1; for p 6= p0; and ;j L oL;p = xp;?1 ; for p 6= p0 1 1

The result is then that the master unit of layer L will give

L0 NeL NeL?1 ? RP?1 NeL?1 ? 1

The Tilling algorithm is then based upon the theorem of convergence presented above. The algorithm consists then of 1. a strategy for generating the master unit of a new layer, in such a way that at each layer NeL is at worst equal to the value given by the weight vector of the previous theorem, and 2. a strategy for adding ancillary units in a layer, in such a way as to end up with faithful representations. In this way the convergence of the algorithm is guaranteed.

Generating the Master Unit

The generation of the master unit of a new layer employs a variant of the Pocket algorithm. L L The algorithm tries to learn the mapping Xp ?1 ! op for all prototypes 1 p Np ?1 of layer L ? 1. This is done with a perceptron algorithm. At each weight update (after presenting the set of prototypes), the algorithm keeps in a "pocket" the weight the produced the smallest number of errors with respect to the input patterns. This is done by weighting each prototype by the L number of input patterns Rp ?1 that belong to that prototype. If the particular weight vector of the convergence theorem is taken as initial pocket in the algorithm, then it is ensured that L0 NeL NeL?1 ? RP?1 NeL?1 ? 1. The point of the Pocket algorithm is to speed up the convergence (i.e. generate less layers). If the Pocket algorithm converges, i.e. NeL = 0 this means that the internal representations of the input patterns in layer L ? 1 are linearly separable. Then the master unit in layer L becomes the output unit and the Tilling algorithm stops.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

61

Generating the Ancillary Units

If the master unit of layer L is not identical to the desired output unit, then the Tilling algorithm adds units to the layer L in order to obtain faithful classes. This is explained below. L Layer L ? 1 produces Np ?1 prototypes of NL?1 + 1 elements (NL?1 units plus the bias unit). Each prototype represents a faithful class of layer L ? 1 and is the internal representation of a set of input patterns with the same desired output. In layer L the zeroth unit is the bias unit and is always one. The rst unit is the master unit. At this point the prototypes of layer L ? 1 belong to one of the two classes in layer L whose prototypes, formed by the bias and the master unit, are xL;0 = 1; xL;1 = 1 and xL;0 = 1; xL;1 = ?1. The fact 1 1 2 2 that the master unit is not equal to the desired output unit means that at least one of the two classes is unfaithful. The next unit is then obtained by picking one of the unfaithful classes, say L class 1 (xL;0 = 1; xL;1 = 1). and trying to learn by a perceptron algorithm the mapping Xp ?1 ! op 1 1 for prototypes of layer L ? 1 belonging to that class, i.e. prototypes for which the master unit of layer L outputs 1. If the mapping is learned, class 1 is then broken into two faithful classes. If not (the mapping is not linearly separable), the perceptron learning algorithm can learn a well chosen part of the class in such a way that it will be broken into two classes, one being faithful and the other remaining unfaithful. Then a new unit is added to separate the the remaining unfaithful class, and so on until the classes are separated and are then faithful. With this procedure, for each L new ancillary unit one class is broken into two classes, so that with most Np ?1 units all classes will be faithful. Fig. 4.2 shows the network architecture achieved by the Tilling algorithm.

An Example: The XOR Problem

The XOR problem will serve as an example to illustrate the working of the Tilling algorithm. Fig. 4.3(a) shows the four input patterns with their desired outputs for the XOR problem. The 1 tilling algorithm will start by creating the master unit of layer 1 (M1 in the gure). The training of the master unit with the Pocket algorithm cannot map the four patterns to their desired outputs, since the XOR problem is not linearly separable (as shown in Fig. 4.3(a)). After being trained, the master unit will produce the outputs shown in Fig. 4.3(b), leaving one pattern misclassi ed (Ne1 = 1). Layer 1 have now two classes, class one with prototype (+1,-1) and class two with prototype (+1,+1). Class one, which contains only one input pattern (-1,-1) is faithful. Class two, which contains three input patterns, is an unfaithful class since one input pattern belonging to it has ?1 as desired output and the other two patterns belonging to the class have +1 as desired output. An ancillary unit is then added (A1 in the gure). This unit is trained by a perceptron 2 algorithm to learn the mapping A B C -1 +1 +1 +1 -1 +1 +1 +1 -1

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

bias +1 bias +1 M1

1

bias +1

bias +1

i n p u t u n i t s . . .

M1

2

M1

L-1

M1

L

Figure 4.2: The Tilling Architecture.

A2

1

A2 2 . . . . . . .

A2 . . . AN

L-1

. . .

L-1 L-1

2 AN

2

AN

1 1

62

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

63

(where only the three patterns belonging to class two are used in the training set). The above mapping is linearly separable (as shown in Fig. 4.3(b)). Now the master unit and the ancillary unit separate the input patterns into three faithful classes whose prototypes are shown in Fig. 4.3(c). Finally a master unit is added in a new layer (layer 2). This master unit has to learn the mapping of the three prototypes of layer 1 to their desired outputs (since the classes of layer 1 are faithful, all input patterns belonging to the same class have the same desired output). This mapping is shown in Fig. 4.3(d) and since, as shown also in the gure, the mapping is linearly separable, the Pocket algorithm will be able to learn the mapping. This means that the master unit of layer two is now the output unit since it produces the desired outputs for all input patterns.

Multiple Outputs

The Tilling algorithm has been described for problems requiring one output. To generalize this algorithm to multiple output units, a master unit per output would be needed in each layer. Then ancillary units would be needed to build up faithful classes in relation to each output. The problem how to generate the ancillary units remains to be researched. One approach is to consider each output as a separate problem so that a separate group of ancillary units is constructed for each master unit. Another approach that should reduce the number of ancillary units is to use common ancillary units to all master units. This approach requires further research. A third approach can be the use of master units belonging to an output as ancillary units for the other master units. This approach requires also further research.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

A XOR A B -1 -1 -1 +1 +1 -1 +1 +1 C -1 +1 +1 -1 Np = 4 X1 = (+1,-1,-1) X2 = (+1,-1,+1) X3 = (+1,+1,-1) X4 = (+1,+1,+1) N1 = 1 e prototypes layer 1 X 1 = (+1,-1)

-1 +1 -1 1

64

Ninp = 2

+I A B

w1 0

1 w1

M1 M1

1 -1

1 +1

+1 -1

B

w1 2 (a) A

1 2

A

+1

A B -1 -1 -1 +1 +1 -1 +1 +1

1 M1

-1 +1 +1 +1

B

X 1 = (+1,+1) 2

A

1 2

(b)

M1

1

A

A B -1 -1 -1 +1 +1 -1 +1 +1

1 2

N1 = 1 e prototypes layer 1 X 1 = (+1,-1,+1) X 1 = (+1,+1,+1) 2 X 1 = (+1,+1,-1) 3

1

+I A B (c)

+I

1 2

-1 +1 +1 +1

+1 +1 +1 -1

M1

M1

A

1 2

M1 A

1

1 2

M1

2

2 Ne = 0

M1

2

M1

1

+1

-1 +1 +1 +1 +1 -1

-1 +1 -1

-1 -1

+1

A

1 2

(d)

Figure 4.3: The XOR Problem

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

65

4.1.3 The Tower Algorithm

The Tower algorithm Gallant, 1990] is an algorithm similar to the Tilling algorithm presented above. The di erence being that the Tower algorithm employs the input units in order to achieve faithful classes, instead of adding ancillary units at each layer. The architecture of the Tower algorithm is shown in Fig. 4.4. By connecting the master unit of layer L to the master unit of layer L ? 1 and to all the units of the input layer, then the internal representation of an input pattern in layer L ? 1 is the union of the input pattern itself and of the associated state of the master unit of layer L ? 1. If there are Np input patterns then there are Np distinct internal representations, representing Np faithful classes of one element each. The master unit is generated with the Pocket algorithm as in the Tilling algorithm.

Multiple Outputs

In Gallant, 1990] the solution given for problems requiring multiple outputs is the use of a separate network for each output. This is shown in Fig. 4.5(a) for a problem requiring two outputs, where there is a set of master units per output. Another approach is the connection of all master units of of previous layer to all all master units of the next layer, as shown in Fig. 4.5(b), This could make the classes more easily separable, which may lead to faster convergence.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

66

bias

+1

bias +1

bias +1

bias +1

M1

1

M1

2

. . . .

M1

L

i n p u t u n i t s . . .

Figure 4.4: The Tower Architecture

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

bias +1 bias +1 bias +1 bias +1

67

M1

1

M1

2

. . . .

M11

L

i n p u t u n i t s

. . .

M1

1

M1 (a)

2

. . . .

M12

L

bias

+1

bias +1

bias +1

bias +1

M1

1

M1

2

. . . .

M11

L

i n p u t u n i t s . . .

M1

1

M1 (b)

2

. . . .

M12

L

Figure 4.5: The Tower Architecture, (a) separate network, (b) joint network

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

68

4.1.4 The Upstart Algorithm

The Upstart algorithm Frean, 1990] builds a network layer by layer in order to learn the mapping of any Boolean function of Ninp inputs. The input and output spaces of the mapping are binary, i.e. 0 and 1. The network is constructed from threshold units. The Upstart Algorithm uses a di erent strategy for building a network than the Tilling and the Tower algorithms presented above. Instead of building layers from the input outwards until convergence, new units are added between the input layer and the output. The role of these units is to correct mistakes made by the output unit. This strategy is similar to the one used in the Cascade-Correlation algorithm, but here applied to binary input and output spaces, in a network consisting of threshold units. Also the network produced by the Upstart algorithm as a binary tree of threshold units. This makes the algorithm di erent, as it will be seen below. As the Tilling algorithm, the Upstart algorithm uses the Pocket algorithm for one layer training. The units used in the network are threshold units with output given by

op =

p

(

=

X

N

1 if p 0 0 otherwise

(4.1) (4.2)

i=0

wi xp;i

where wi are the connection weights and xp;i is the value of the ith component of the input pattern p. Here again xp;0 is the bias and is equal to 1, w0 = - , where is the threshold. The algorithm is considered for mappings requiring one output unit. Given a training set of patterns Xp = xp;1; xp;2; xp;3; :::; xp;n and the corresponding desired outputs Dp , there are four possibility for the output of a threshold unit: 1. Correctly On: the output of the threshold unit is 1 and the desired output Dp of the input pattern Xp is 1 (op = Dp = 1). 2. Correctly O : the output of the threshold unit is 0 and the desired output Dp of the input pattern Xp is 0 (op = Dp = 0). 3. Wrongly On: the output of the threshold unit is 1 and the desired output Dp of the input pattern Xp is 0 (op = 1, Dp = 0). 4. Wrongly O : the output of the threshold unit is 0 and the desired output Dp of the input pattern Xp is 1 (op = 0, Dp = 1). Note that this applies to discrete networks with threshold units. In continuous networks with Sigmoid units the output of the unit and the desired output have continuous values and this doesn't hold.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

output

69

o

1

2

+1 inputs

Figure 4.6: Correcting a Parent Unit. These four possible situations for the output of a threshold unit open the door for the Upstart algorithm and for the BINCOR algorithm to be presented in the next section. The idea of the algorithms is then to add two daughter units, 1 and 2, between the output unit (the parent) and the input units. Daughter unit 1 will try to correct input patterns that produce "Wrongly On" errors at the parent unit (the output unit), and daughter unit 2 will try to correct input patterns that produce "Wrongly O " errors at the parent unit (the output unit), This is shown in Fig. 4.6. Lets now de ne the size of an error. De nition: The size of an error, for a given input pattern Xp0 , that causes a "Wrongly O " or "Wrongly On" error at the output, is the absolute value of p0 .

Size(p) =

(

j p0 j if op0 6= dp0 0 if op0 = dp0

(4.3)

where op0 and p0 are de ned in Eq. (4.1) and (4.2). o o o Lets now divide the input patterns into four separate sets: Pcorrect on , Pcorrect off , Pwrong on and o Pwrong off , where

o 1. Pcorrect on is the set of all input patterns that produce a "Correctly On" at the output,

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

2. 3. 4.

o Pcorrect off is the set of all input patterns that produce a "Correctly O at the output, o Pwrongly on is the set of all input patterns that produce a "Wrongly On at the output, o Pwrongly off is the set of all input patterns that produce a "Wrongly O at the output,

70

o Consider the set (Pwrongly on ). The output for these patterns can be corrected if unit 1 is active (equals 1) only for these patterns

o1 (p)

(

=

o 1 if p 2 Pwrongly on o o o 0 if p 2 (Pwrongly off Pcorrect on Pcorrect off )

(4.4)

where o1 (p) is the output of unit 1 for input pattern p. What is needed is to connect unit 1 to the output unit by a negative weight whose absolute value o is larger than the largest error size of the input patterns in Pwrongly on .

o w1 = ?(MAX (Size(Pwrongly on )) + )

(4.5)

where > 0. o o If the output of unit 1 is inactive (equals 0) for patterns belonging to (Pwrongly off Pcorrect off o Pcorrect on ), the output at the output unit for these patterns will remain the same. This is shown below = =

X

p

N

i=0 p + w 1 o1 p

wi xp;i + w1 o1 p

where represents the network after daughter unit 1 has been added, w1 is the weight connecting unit 1 to the output unit and its value is given in Eq. (4.5), and o1 is the output of unit 1 for input p pattern p. o For every p0 that belongs to Pwrongly on

p0

< 0

= = =

1 1 p0 + w op0 1 p0 + w 1 o p0 ? MAX (Pwrongly on ) ?

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

o o so op0 = 0, for all p0 that belong to Pwrongly on . The class Pwrongly on has then being corrected. o o o Now for every p" that belongs to (Pwrongly off Pcorrect off Pcorrect on ) p" p"

71

= = =

p" + w1 o1" p 10 p" + w p"

so the output is not modi ed for these classes. The same idea can be applied to the input patterns for which the output is "Wrongly O " o (Pwrongly off )

o2 (p)

(

=

o 1 if p 2 Pwrongly off o o o 0 if p 2 (Pwrongly on Pcorrect on Pcorrect off )

(4.6)

where o2 (p) is the output of unit 2 for input pattern p. The output for these patterns can be corrected if unit 2 is active (equals 1) only for these patterns. Here a positive connecting weight to the output unit is needed, whose absolute value is larger than o the largest error size of the input patterns in Pwrongly off .

o w2 = MAX (Size(Pwrongly off )) +

(4.7)

Note that the strategy for correcting the output unit presented above only works for threshold units whose outputs are binary, or more precisely, for threshold units where one of the two output is zero. If the outputs of the threshold units were bipolar (+1 and -1), this strategy would not work, since while trying to correct "Wrongly On" ("Wrongly O ") errors, new "Wrongly O " ("Wrongly On") errors would be added. o Unit 1 and unit 2 don't play a roll at the same time since for input patterns belonging to Pwrongly on o only unit 1 is active, for input patterns belonging to Pwrongly off only unit 2 is active, and for input o o patterns belonging to Pcorrect on or Pcorrect off none of them are active. Unit 1 and unit 2 can be trained to produce their respective mappings, given in Eq. (4.4) and (4.6). In the Upstart algorithm, the training is done with the Pocket algorithm. Another point to consider is the fact that if unit 1 is active not only for patterns belonging to o o Pwrongly on but for patterns belonging to Pcorrect off , the e ect is only to reinforce the output o o response to Pcorrect off , without introducing any error. This means that the patterns of Pcorrect off can be eliminated from the training set of unit 1. Whether unit 1 is active or not for patterns of o Pcorrect off , is of no importance since no correctly classi ed patterns will be misclassi ed. The same

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

72

o holds for unit 2 and Pcorrect on . This elimination of patterns during training makes the problem easier and faster to solve. The training sets for unit 1 and 2 become now

o1 (p)

(

=

(

o 1 if p 2 Pwrongly on o o 0 if p 2 (Pwrongly on Pcorrect on ) o 1 if p 2 Pwrongly off o o 0 if p 2 (Pwrongly on Pcorrect off )

(4.8)

o2 (p) =

(4.9)

Since the mapping n Eq. (4.8) are (4.9) are not necessarily linearly separable, then units 1 and 2 cannot always fully learn their respective mappings with the Pocket algorithm. This means that extra units have to be added. The Upstart algorithm uses then a recursive strategy in introducing new units, where if unit 1 (unit 2) has not fully learned its mapping then two new units are recursively added to correct the mapping of unit 1 (unit 2), one that corrects the "Wrongly On" errors and the other that corrects the "Wrongly O " ones. This results in a binary tree network, shown in Fig. 4.7. How this guarantees convergence will now be explained. Convergence Theorem: Daughter units can always make fewer errors than their parent unit, and connecting a daughter unit to a parent unit with the appropriate weight will always reduce the errors made by the parent. Proof: The output errors are

e(o) = e(o)on + e(o)off

where

o e(o)on = j Pwrongly on j o e(o)off = j Pwrongly off j

Unit 1 can always have less errors than the output unit. For example, unit 1 can be active for at o least one input pattern of Pwrongly on and inactive for all other patterns (the needed weight vector of unit 1 for this solution is the one given for the the Tilling algorithm, and this weight vector can be the initial pocket of the Pocket algorithm). Therefore

e(1) < e(o)on e(o)

where e(1) is the number of errors in unit 1 (for the mapping in Eq. (4.8)).

(4.10)

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

73

output

. . . +1 i n p u t s

. . .

. . .

. . .

. . .

. . .

. . .

. . .

Figure 4.7: The Upstart Architecture.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

A similar argument applies to unit 2. It also follows that the output errors are reduced by connecting unit 1 to the output unit, since

74

e(o with 1)

e(1)on + e(1)off + e(o)off e(1) + e(o)off < e(o)

(4.11) (4.12) (4.13)

and similarly for unit 2 on its own. o To understand inequation Eq. (4.11), consider that for each pattern belonging to Pwrongly on , unit 1 should be active. If unit 1 is well active then the pattern is corrected in the output. If not then the pattern remains misclassi ed in the output, and this pattern is one of the e(1)off pattern of o o unit 1 (it is "Wrongly O " in unit 1). For patterns belonging to (Pwrongly off Pcorrect on ), unit 1 o o should be inactive in order not to misclassify Pcorrect on or make Pwrongly off more "o ". If unit 1 is well active for some of these patterns, these patterns are "Wrongly On" in unit 1, giving e(1)on errors in unit 1, and they could introduce errors in the output unit, but not necessarily. When considering the joint action of units 1 and 2, the same result holds:

e(o with 1; 2) < e(o) ? 1

(4.14)

Using this result the convergence of the Upstart algorithm can be proved: the Upstart algorithm adds then recursively two units between unit 1 and the input units, one that corrects e(1)on and the other e(1)off . The same is done for unit 2. The inequations in (4.11) to (4.13) show that every time a daughter unit is added, the error at the parent unit is diminished at least by one. So if two units daughter units are added to unit 1, then

e(1 with daughters) < e(1) ? 1 e(0).

(4.15)

and when going up the tree, e(1) in inequation in (4.12) is reduced, which leads to a reduction in

Daughter units are only generated by the algorithm if the parent makes errors, and the number of errors decreases at each branching. It follows that eventually none of the terminal daughters makes any mistake, so neither do their parents, and so on. Therefore every unit in the whole tree produces its target output, including the output unit. Hence the classi cation is learned. 2 The Upstart algorithm is now presented:

The Upstart Algorithm

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

1. Initial training of output unit Train output unit with the Pocket algorithm. After the training, the connection weights between the inputs and the output unit remain frozen. The following two steps are then applied recursively, generating a binary tree of units. 2. Train daughter units

75

If parent unit makes "Wrongly On" mistakes, train daughter unit 1 with the Pocket algorithm, having the input units as inputs, to learn the mapping in Eq.. (4.8) and go to 3.a. If parent unit makes "Wrongly O " mistakes, train daughter unit 2 with the Pocket algorithm, having the input units as inputs, to learn the mapping in Eq.. (4.9) and go to 3.b. (The training of daughter units 1 and 2 can be done in parallel). If parent unit makes no "Wrongly On" and no "Wrongly O " mistakes, stop. 3.a Add daughter unit 1 Adds daughter unit 1 to the parent unit, the connection weight as de ned in equations (4.5). Go to 2 with unit 1 as parent unit. 3.b Add daughter unit 2 Adds daughter unit 2 to the parent unit, the connection weight as de ned in equations (4.7). Go to 2 with unit 2 as parent unit.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

Output Output H1+

+1

76

+1

I1 I2 I3

I1 I2 I3

IN

IN H1-

(a)

(b)

Figure 4.8: BINCOR Architecture: (a) initial, (b) after 2 hidden units are added. Boxed connections are frozen, x connections are trained repeatedly.

4.1.5 The BINCOR Algorithm

The BINCOR algorithm is a new learning algorithm, introduced here. The BINCOR algorithm builds a network in order to learn the mapping of any Boolean function of Ninp inputs. The input and output spaces of the mapping are binary, i.e. 0 and 1. The network is constructed from threshold units. As the Upstart algorithm, the BINCOR algorithm is based on the idea of adds two units at a time between the output unit and the input units. where one unit will try to correct input patterns that produce "Wrongly On" errors at the output unit, and the other unit will try to correct input patterns that produce "Wrongly O " errors at the output unit, as shown in Fig. 4.6. The BINCOR algorithm follows a di erent strategy for adding pairs of units to the network than the Upstart algorithm. The BINCOR algorithm knows two possibilities building the network: 1. Horizontal growth: where pairs of units are added to the same hidden layer till the output errors are corrected. 2. Vertical growth: where pairs of units are added, each pair forming a new hidden layer containing only the pair, till the output errors are corrected. Notice that possibility 2 is di erent from the binary tree architecture of the Upstart algorithm, where the number of units in each layer grows exponentially with factor 2. The way these two network possibilities are built, and how convergence is achieved is now explained. Fig. 4.8(a) shows the initial network. The BINCOR algorithm starts as the Upstart algorithm. It trains the output unit with the Pocket algorithm. If misclassi ed patterns remain, unit 1 and 2 are

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

Output H2+ H1+

+1

77

Output

H2+ H1+

+1

I1 I2 I3

I1 I2 I3

IN H1H2(a)

IN H1H2(b)

Figure 4.9: Network after adding two pairs of hidden units. (a) Horizontal Growth Network. (b) Vertical Growth Network. trained with the Pocket algorithm to learn the mappings of Eq. (4.8) and (4.9). These units are then connected to the output unit with the weights w1 and w1 as given in Eq. (4.5) and (4.7). As shown in Eq. (4.11) to (4.13), this leads to a reduction in the number of errors at the output. Fig. 4.8(b) shows the network after unit 1 and 2 are added. Now the BINCOR algorithm continues in a di erent way than the Upstart algorithm. The output unit is further trained with Pocket algorithm, this time with the extra two weights from unit 1 and 2. This training will lead to either a further reduction of the number of errors at the output or at worse maintain the previously achieved reduction with the introduction of unit 1 and 2. If there remain errors at the output unit, two new units, 1' and 2', are added. where unit 1' corrects the input patterns that now produce "Wrongly On" errors at the output unit, and unit 2' the ones that now produce "Wrongly O " errors at the output unit, This is done again by training these two units with the Pocket algorithm to learn the mappings of Eq. (4.8) and (4.9), this time with o o o o the new sets Pcorrect on , Pcorrect off , Pwrong on and Pwrong off of input patterns. Units 1' and 2' are then connected to the output unit with the weights w1 and w1 as given in Eq. (4.5) and (4.7) (with the new sets of input patterns). The steps of training the output unit and then adding two new units, correcting the new errors at the output, is repeated till no errors remain. If the horizontal growth variant is chosen, then each pairs of units added have the input units as inputs. If the vertical growth variant is chosen, then each pairs of units added have the input units plus all previously added pairs of units as inputs.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

78

Fig. 4.9 shows the network after unit 1' and 2' are added. Fig. 4.9(a) shows the horizontal growth case and Fig. 4.9(b) the vertical growth case. Notice that here new pairs of units try to correct the errors at the output unit, and not the errors at the parent unit, as in the Upstart algorithm. Convergence is guaranteed by the Convergence Theorem given for the Upstart algorithm, where it was proven that every pair of units added to the output unit reduces the number of errors at the output.

Candidate Units

In order to speed up the learning in Upstart algorithm, the strategy presented below is taken. Instead of training two units at each iteration, the one correcting the remaining input patterns that produce "Wrong On" errors at the output unit, and the other one correcting the "Wrong O " ones at the output unit, two sets of "candidate" units are trained. The idea is that the Pocket algorithm starts the training with di erent random weights for each candidate in a set. From each set of candidates, the candidate that has the best result, i.e. the smallest number of errors, will be connected to the output unit.

Algorithm Description

The BINCOR algorithm is now presented:

The BINCOR Algorithm with Horizontal Growth

1. Initial training of output unit Train output unit with the Pocket algorithm. 2. Train hidden units If the output unit makes "Wrongly On" mistakes, train a new unit with the Pocket algorithm, having the input units as inputs, to learn the mapping in Eq.. (4.8). Then add this unit to the output unit with the connection weight as de ned in equations (4.5). If the output unit makes "Wrongly O " mistakes, train a second new unit with the Pocket algorithm, having the input units as inputs, to learn the mapping in Eq.. (4.9). Then add this unit to the output unit with the connection weight as de ned in equations (4.7). If output unit makes no "Wrongly On" and no "Wrongly O " mistakes, stop. 3. Train output unit Train output unit with the Pocket algorithm (this time with the new weights from the added hidden units of step 2).

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

If output unit makes no mistakes, stop. Otherwise go to step 2.

79

The BINCOR Algorithm with Vertical Growth

1. Initial training of output unit Train output unit with the Pocket algorithm. 2. Train hidden units If the output unit makes "Wrongly On" mistakes, train a new unit with the Pocket algorithm, having the input units and the previously added hidden units as inputs, to learn the mapping in Eq.. (4.8). Then add this unit to the output unit with the connection weight as de ned in equations (4.5). If the output unit makes "Wrongly O " mistakes, train a second new unit with the Pocket algorithm, having the input units and the previously added hidden units as inputs, to learn the mapping in Eq.. (4.9). Then add this unit to the output unit with the connection weight as de ned in equations (4.7). If output unit makes no "Wrongly On" and no "Wrongly O " mistakes, stop. 3. Train output unit Train output unit with the Pocket algorithm (this time with the new weights from the added hidden units of step 2). If output unit makes no mistakes, stop. Otherwise go to step 2.

Multiple Outputs

One way of extending the algorithm for problems with multiple outputs will be to build a separate network for each output. This is shown in Fig. 4.10(a) for a two output problem. Another approach, that should build smaller networks when some correlation exists between the outputs, is to connect the pairs of units that have being trained to correct the error at one output, to all other outputs. Fig. 4.10(b) shown the network built for a two output problem after two pairs of units have been added, a pair per output unit. If pairs of units are added, and the vertical growth option is used, then each new pair of hidden units (one per output) will receive connection from all inputs, the bias and their corresponding previously added pairs of hidden units, i.e. the one which correct the same output error.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

O1 H1+ H1+

+1 +1 1 2

80

O1 O2

O2 H1+ H1+

1 2

I1 I2 I3

I1 I2 I3

IN H12 H11

IN H1H1(b) 2 1

(a)

Figure 4.10: Topology for a two output problem. (a) Separate Network. (b) Combined Network.

4.2 The Applications

4.2.1 Random Binary Functions

A random Boolean function is achieved by assigning each of the 2N input patterns a desired output 0 or 1 with 50% probability. This is a di cult problem for a learning algorithm since there is no correlations and structure in the input space. This problem is then used to compare the performances of the learning algorithms presented.

4.2.2 The Monk's Problems

The Monk's problems belong to the arti cial robot domain. Robots are described by six di erent attributes Wnek et al., 1990]:

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

x1 : headshape 2 (round,square,octagon) x2 : bodys hape 2 (round,square,octagon) x3 : issmiling x4 : holding x6 : hast ie

81

2 2 2

(yes,no) (sword,balloon, ag) (yes,no)

x5 : jacketc olor 2 (red,yellow,green,blue)

These means that there are 432 possible robots (3x3x2x3x4x2=432). The learning task is a binary classi cation task. Each problem is given by a logical description of a class and then the task is to determine whether a robot belongs to that class or not. In order to test how well a learning algorithm generalizes the training set is constructed with a subset of all possible 432 robots, then the network is tested with all 432 robots.

Problem M1

(heads hape = bodys hape) or (jacketc olor = red) For this problem, 124 robots were randomly selected for the training set. The were no misclassi cations.

Problem M2

exactly two of the six attributes have their rst value For this problem, 169 robots were randomly selected for the training set. The were no misclassi cations. ((jacketcolor = green) and (holding = sword)) or ((jacketcolor 6= blue) and (bodys hape 6= octagon)) For this problem, 122 robots were randomly selected for the training set. The were 5% misclassi cations. Problem M1 is in standard disjunctive normal form (DNF). Problem M2 is similar to parity problems. It combines di erent attributes in a way which makes it complicated to describe in DNF of CNF using the given attributes only. Problem M3 is again in DNF and serves to evaluate the under the presence of noise. The resulting diagrams of training and testing sets are found in Fig. 4.11 to Fig. 4.13. The Monk's problems with the given training and test sets were used at the 2nd European Summer School on Machine Learning, held in Belgium during the summer of 1991 to compare di erent

Problem M3

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

82

Figure 4.11: Problem M1 : Training Set: 124 examples, no noise, and Test Set

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

83

Figure 4.12: Problem M2 : Training Set: 169 examples, no noise, and Test Set

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

84

Figure 4.13: Problem M3: Training Set: 122 examples, 6 misclassi cations, and Test Set

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

learning techniques Thrun et al., 1991].

85

4.3 The Simulation Results

4.3.1 Random Binary Functions

In Frean, 1990], a fast and well behaved version of perceptron learning was used to train the weights, in preference to the Pocket algorithm. In this version the weights changes given by the usual Perceptron Learning Rule were multiplied by the factor

T ?j pj T0 exp( T )

Then weights changes then become

(4.16)

T W (t + 1) = W (t) + Dp Xp T exp( ? jT p j )

0

(4.17)

The new factor in Eq. (xx) decreases with j j, which measures how "serious" the error is. The rationale behind this approach is that an error where j j is large is di cult to correct without causing other errors and should therefore be weighted as less signi cant than those where j j is small. The temperature T controls how strongly this weighting is biased to small j j. In the results presented in Frean, 1990] here T was reduced linearly from T0 = 1 to zero, over the training period of 1000 epoch, i.e. 1000 passes over the whole training set. At high T the Perceptron Rule is recovered, but as T decreases the weights are frozen. For the BINCOR algorithm, the variant presented above gave also better results than the Pocket algorithm. In Fig. 4.14 a comparison in terms of number of units generated for a random Boolean function is given between the Tilling algorithm, the Upstart algorithm, the BINCOR algorithm and also the Continuous Constructive Supervised Learning algorithm Cascade-Correlation, presented in chapter 3. Results are given for random Boolean function with 2,4,6,8,9 and 10 inputs. In Fig 4.15 a comparison is given between the results obtained with the BINCOR algorithm when the Pocket algorithm was used and when the above variant was used. Results are given for N = 2,4,6,8 and 9.

4.3.2 The Monk's Problems

In Table 4.1, the results obtained with the BINCOR algorithm for the three Monk's problems are given. No results were available with the other algorithms presented in this chapter.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

86

BINCOR 250 Upstart Tilling

Number of units generated

200

150

100

50

0 0 250 500 750 1000

Number of training patterns

Figure 4.14: Random Binary Functions: Comparison of Results

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

87

with variation 100 with Pocket alg.

Number of units generated

80

60

40

20

0 0 100 200 300 400 500

Number of training patterns

Figure 4.15: Random Binary Functions: BINCOR algorithm

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

Problem Network No. Hidden Test Set Order Units Error Monk 1 Second-Order 4 97.7 % Monk 2 High-Order 34 77.8 % Monk 3 High-Order 4 96.5 % Table 4.1: Monk Results of BINCOR

88

For the rst Monk problem, the best results were obtained with a second-order network (horizontal growth). For the others, a high-order network (vertical growth) gave the best results. In Table 4.2 the results presented in Thrun et al., 1991] for the Monk's problems are given. Although the Cascade-Correlation and the Backpropagation algorithms achieve better generalization, the BINCOR algorithm scores better than Decision Tree-Based Learning Algorithms, the mFOIL algorithm and Inductive Learning programs. The results for the Modi ed Cascade-Correlation algorithm are not reported since there is no di erence between the Modi ed Cascade-Correlation algorithm and the Cascade-Correlation algorithm for single output problems and the Monk's problems require a single output.

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

89

Table 4.2: Generalization results for the Monk's problems

CHAPTER 4. DISCRETE CONSTRUCTIVE SUPERVISED LEARNING

90

4.4 Discussion and Future Work

The results achieved with the random binary functions by the constructive learning algorithms presented in this chapter point to the dependence of these algorithms on the one-layer training algorithm used by them. Further improvements in one layer training algorithms, so as the Perceptron algorithm, should lead to faster learning, smaller networks and better generalization in the constructive algorithms presented. The comparison the Tilling algorithm, the Upstart algorithm and the BINCOR algorithm for the random binary functions shows that the BINCOR achieves the smallest amount of hidden units. The results obtained with the Tilling algorithm are achieved with the Pocket algorithm, so this results could improve if the Tilling algorithm were to employ the variant used in the Upstart and the BINCOR algorithms. The Tilling algorithm produces a variable number of units in each hidden layer. The Upstart algorithm produces a binary tree network, so the number grows exponentially with factor two with each subsequent layer. The BINCOR algorithm has two units in each hidden layer. Future work could be to try the various constructive algorithms with new improved versions of one-layer training algorithms. Also, for a more complete comparison of the algorithms, they should be applied to real-world applications. Finally, ways of extending these algorithms to multiple output cases in a way that will reduce as much as possible the number of hidden units by sharing hidden units between outputs should be studied.

Conclusions and Future Work

5

As opposed to non-constructive learning algorithms, where the topology of the network is selected before the algorithm starts and remains xed, Constructive Supervised Learning Algorithms start from an initial network topology and they add new units in order to learn the set of examples. The nal topology and the size of the network are dynamically determined by the algorithm and are a function of the set of examples chosen and of the learning parameters. The power and applicability of Constructive Supervised Learning Algorithms has been shown by using them in six applications of di erent nature. The applications are: 1. 2. 3. 4. 5. 6. Solution to the Inverse Kinematic Transformations for Robot Control. Corrective networks for damage Robot Controllers. Hamming Codes Generators and Parity Generators. A Pattern Recognition problem. Mapping of Random Boolean Functions. The Monk's problems.

The above applications have allowed an analysis and comparison of the di erent constructive learning algorithms presented. The comparison have been done in terms of learning speed, size of the network and generalization accuracy. 91

CHAPTER 5. CONCLUSIONS AND FUTURE WORK

92

5.1 Continuous Constructive Supervised Learning Algorithms

For Continuous Constructive Supervised Learning Algorithms, the Cascade-Correlation Learning Algorithm and a modi ed version of it have been presented. The Cascade-Correlation Algorithm builds a network by adding a hidden unit at a time between the previously added hidden units and the output units, until the training set is learned. Hidden units are trained to correlate to the sum of the remaining outputs' errors before they are connected. The Modi ed Cascade-Correlation Algorithm has been introduced for problems were multiple outputs are required. In this algorithm, a new unit is added per output and is trained to correlate to the remaining error at the respective output before it is connected. Then the hidden unit is connected to all output units. The Inverse Kinematic Transformations for a two and a three degrees of freedom Robot Controller has allowed for a comparison between the Modi ed and the original Cascade-Correlation algorithms, For this problem, the Modi ed Cascade-Correlation was shown to learn quicker, scale better to more training cases and to lower tolerance, and to have a higher success rate than the original Cascade-Correlation. The Inverse Kinematic Transformation for a two degrees of freedom Robot Controller allowed also a comparison of the Modi ed Cascade-Correlation algorithm with previous results obtained with the non-constructive Backpropagation learning algorithm. In this case the accuracy of generalization achieved by the Modi ed Cascade-Correlation algorithm was equal or better and with a much smaller network than achieved by using the Backpropagation algorithm. The other advantage given by the use of the Modi ed Cascade-Correlation algorithm over the Backpropagation algorithm is that no trial and error stage is needed to determine the amount and size of the hidden layers in the network, what is the case with the non-constructive Backpropagation algorithms. The determination of an a-priori network topology when applying non-constructive learning algorithms, so as the Backpropagation algorithm, remains a di cult problem in the area of Robot Control. The results achieved for the Inverse Kinematic Transformation and for a Corrective Networks for a damage Robot Controllers, point to the applicability of the Modi ed Cascade-Correlation learning algorithm for this area of application. For the Pattern Recognition problem and for the Hamming Code Generation problem, the Modi ed Cascade-Correlation Learning Algorithm was also shown to learn faster than the original Cascade-Correlation Learning Algorithm, When tested for generalization accuracy with the Inverse Kinematic Transformations and the Corrective networks for a damage Robot Controllers, the Modi ed Cascade-Correlation algorithm also scored better.

CHAPTER 5. CONCLUSIONS AND FUTURE WORK

93

5.2 Discrete Constructive Supervised Learning Algorithms

For Discrete Constructive Supervised Learning Algorithms, the Tilling Learning Algorithm, the Tower Algorithm, the Upstart Algorithm and a new algorithm introduced in this report, the BINCOR Algorithm, have been presented. The Tilling Learning Algorithm and the Tower Learning Algorithm build a network by adding a layer at a time, where the internal representation of the input patterns in each layer remain di erent for input patterns with di erent outputs. The process goes on until the rst unit added at a new layer produces the correct mapping for all input patterns. The convergence of these algorithms for any Boolean mapping is guaranteed. The Tilling Algorithm adds a variable number of units in each layer. The Tower Algorithm adds only one unit in each layer and uses "shortcut" connections, i.e. it not only connects the unit of a new layer to the unit of the previous layer, so as in the Tilling Algorithm, but also to the input units. The Upstart Algorithm and the BINCOR Algorithm use a di erent strategy than the previous two algorithms for building a network. In the Tilling and the BINCOR algorithms, units are added between the output unit and the inputs in order to correct the remaining misclassi ed patterns at the output unit. The convergence of these algorithms for any Boolean mapping is guaranteed. In the Upstart Algorithm a recursive strategy is used, where each unit added receives connections from the input units and tries to correct the misclassi ed input patterns at the unit previously added, leading to a binary tree network structure. In the BINCOR Algorithm, two units are added at a time between the input units and the output unit, where this units try to correct the remaining misclassi ed input patterns at the output unit, leading to one of two optional network structures: a single hidden layer or a high-order network of multiple layer, where the units in a hidden layer receive connections from the input units and all previously added units. The mapping of random Boolean functions has allowed for a comparison between the Tilling, Upstart and BINCOR algorithms in terms of size of the network built. The introduced BINCOR algorithm produced the network with less units of the three algorithms. The Monk's problems has been used to compare the generalization accuracy of the BINCOR algorithm with a variety of machine learning algorithms. The generalization with the BINCOR algorithm was less accurate than with the Cascade-Correlation Learning Algorithm and the Backpropagation Learning Algorithm, but more accurate than with Decision Tree-Based Learning Algorithms, the mFOIL algorithm and Inductive Learning programs.

CHAPTER 5. CONCLUSIONS AND FUTURE WORK

94

5.3 General Conclusions and Future Directions

A next step the research of learning algorithms for Arti cial Neural Networks should take is the extension of learning algorithms to multiple output networks. Extension to multiple outputs is of special relevance since many real-world applications require multiple outputs. This also holds for the Discrete Constructive Supervised Learning Algorithms presented in this report, for which extensions to multiple output networks have to be studied. There a need to develop good multiple output benchmarks, in order give a proper comparison between the various learning algorithms. Many of the comparison of learning algorithms found in the literature are still done based on single output problems so as the XOR/Parity problem, the two-spirals problem and the mapping of a random Boolean function, or on problems with multiple output but where the outputs are orthogonal, i.e. only one output is "on" at a time, so as the Encoder/Decoder problem. In this report the Hamming Code Generation problem is suggested as a benchmark requiring multiple outputs. The Hamming Code Generation problem is a more general problem which becomes the Parity problem when there is only one output. The performances of the algorithms discussed in this report for the various applications presented. show that the best network topology, in terms of speed of learning, size of the network, and generalization accuracy, can vary from for di erent applications. This is the case for example for the Hamming Code Generation problem and the mapping Inverse Kinematic Transformations, where in the rst one, separate networks for each output achieved faster learning and less units, and in the second, a combined network for all outputs achieved the best results. The above points to the advantage of constructive learning algorithms with optional network topologies, over constructive learning algorithms which lead to a single type of network topology. In this way, various topology options can be tried in order to chose the best one for a given application. The Modi ed Cascade-Correlation algorithm and the BINCOR algorithm presented are an example of constructive algorithms with optional topologies.

Bibliography

Corsi, 1992] Corsi P., "The Esprit Program", in Proc. Neuro-Nimes '92, Nimes, France, 1992, pp. 671-682. Angeniol et al, 1992] Angeniol, B. et al., "The Galatea Project", in Proc. Neuro-Nimes '92, Nimes, France, 1992, pp. 683-695. Wenskay, 1992] Wenskay, D. L., "Intellectual Property Protection for Neural Networks",in Proc. Neuro-Nimes '92, Nimes, France, 1992, pp. 699-710. Hecht-Nielsen, 1990] Hecht-Nielsen, R., Neurocomputing, Addison-Wesly Publishing Company, London, UK, 1990. Fahlman, 1988] Fahlman, S. E., "Faster-Learning Variations on Back-Propagation: An Empirical Study", in Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann. Fahlman & Lebiere, 1990] Fahlman, S.E., and Lebiere, C., "The Cascade-Correlation Learning Architecture", Advances in Neural Information Processing Systems (Vol. 2), D.S. Touretzky (Ed.), San Mateo, CA, Morgan Kaufmann, 1990. Aleksander & Morton] Aleksander, I., and Morton, H., An Introduction to Neural Computing, Chapman & Hall, Reading, Massachusetts, 1991. Barhen et al., 1992] Barhen, J., Toomarian, N., et al., "New Direc tions in Massively Parallel Neurocomputing", in Proc. Neuro-Nimes '92, (Nimes, France), EC2 publ., pp. 543-554, November, 1992. Daud et al., 1992] Daud, T. et al., "VLSI Implemented Building Block Neural Network Chips for Classi cation/Detection Applications",in Proc. Neuro-Nimes '92, Nimes, France, 1992, pp. 565-576. 95

CHAPTER 5. CONCLUSIONS AND FUTURE WORK

96

Hurdle et al., 1992] Hurdle, J. F., L. Josephson, E. L. Brunvand and, Gopalakrishnan G., "Asynchronous Models for Large Scale Neurocomputing Applications",in Proc. Neuro-Nimes '92, Nimes, France, 1992, pp. 577-588. Jutten et al., 1990] Jutten, C., A. Guerin, et al., "Simulation Machine and Integrated Implementation of Neural Networks", in Neural Networks, L.B. Almedia and C.J. Wellekens (eds.), Berlin: Springer-Verlag, pp. 244-266, 1990. Rosenblatt, 1962] Rosenblatt, F, Principles of Neurodynamics , Spartan Books, New York, 1962. Rumelhart, Hinton & Williams, 1986] Rumelhart, D.E., Hinton, G.E., and Williams, R.J., "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Rumelhart, D.E., McClelland, J.L., and the PDP Research Group, Editors, vol. 1. Cambridge,MA. MIT Press, 1986. Vogl et al.] Vogl et al.,"Accelerating the Convergence of the Backpropagation Method", Biological Cybernetics, vol 59. Beale & Jackson, 1990] Beale, R., and Jackson, T., Neural Computing: An Introduction, IOP Publishing Ltd, Bristol, UK, 1990. Widrow, 1962] Widrow, B., "Generalization and information storage in networks of ADALINE neurons," in Yovitts, G.T., Editor, Self-Organizing Systems, Spartan Books, Washington DC, 1962. Antsaklis, 1992] Antsaklis P.J., Guest Ed., Special Issue on Neural Networks in Control Systems, IEEE Control Syst. Mag., Apr. 1992, pp. 8-10. Sjogaard, 1991] Sjogaard, S.,"A Conceptual Approach to Generalization in Dynamic Neural Networks",, Masters Thesis, Computer Science Department, Aarhus University, October 1991. Simon, Corporaal & Kerckho s, 1992] Simon, N., Corporaal, H., and Kerckho s, E., "Variations on the Cascade-Correlation Learning Architecture for Fast Convergence in Robot Control", in Proc. Neuro-Nimes '92, (Nimes, France), EC2 publ., pp. 455-564, November, 1992. Simon, Kerckho s & Corporaal, 1992] Simon, N., E. Kerckho s and H. Corporaal,"Variations on the Cascade-Correlation Learning Architecture for Fast Convergence", Neural Network World, 1992, vol 2 (5), pp. 497-510, VSP International Science Publishers. Rabelo & Avula, 1992] Rabelo L.C. and Avula J.R., "Hierarchical Neurocontroller Architecture for Robotic Manipulation", IEEE Control Syst. Mag., Apr. 1992, pp. 37-41. Sobajic, Lu & Pao, 1988] Sobajic D., Lu J. and Pao Y., "Intelligent Control of the INTELLEDEX 6057 Robot Manipulator", in Proc. IJCNN, San Diego, CA, 1988, pp.633-640. Josin, Charney & White, 1988] Josin G., Charney D. and White D., "Robot Control Using Neural Networks", in Proc. IJCNN, San Diego, CA, 1988, Vol. II, pp.625-631. Guez & Ahmad, 1988] Guez A. and Ahmad Z., "Solution to the Inverse Kinematic Problem in Robotics by Neural Networks", in Proc. IjcNN, San Diego, CA, 1988, PP.617-624.

CHAPTER 5. CONCLUSIONS AND FUTURE WORK

97

Guez & Ahmad, 1989] Guez A. and Ahmad Z., "Accelerated convergence in the Inverse Kinematics via Multi-Layered Feedforward Networks", in Proc. IJCNN, Washington, DC, 1989, pp.341-347. Liu, Iberall & Bekey, 1989] Liu H., Iberall T. and Bekey G., "Neural Network Architecture for Robot Arm Control", IEEE Control Syst. Mag., Apr. 1989, pp. 38-42. Bassi & Bekey, 1989] Bassi D. and Bekey G., "High Precision Position Control by Cartesian Trajectory Feedback and Connectionist Inverse Dynamics Feedforward", in Proc. IJCNN, Washington, DC, 1989, pp. 325-332. Parker, 1985] Parker, D. B., "Learning Logic", Tech. Rep. TR-47, Center for Computational Research in Economics and Management Science, MIT, Apr. 1985. Tesauro & Janssens, 1988] Tesauro, G., and Janssens, B., "Scaling Relationships in BackPropagation Learning", Complex Systems vol. 2, pp. 39-44, 1988. Hamming, 1950] Hamming, R.W., "Error Detecting and Error Correcting Codes", Bell System Tech. Journal, vol. 29, pp. 147-160, 1950. Gallant, 1990] Gallant, S.I., "Perceptron-Based Learning Algorithm s", in IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179-191, June 1990. Mezard & Nadal, 1989] Mezard, M., and Nadal, J., "Learning in Feedforward Layered Networks: the Tilling Algorithm", J. Physics A, vol. 22, no. 12, pp. 2191-2203, 1989. Nadal, 1989] Nadal, J., "Study of a Growth Algorithm for Neural Networks", International J. of Neural Systems, vol. 1, no. 1, pp. 55-59, 1989. Frean, 1990] Frean, M., "The Upstart Algorithm: A Method for Constructing and Training FeedForward Neural Network", Neural Computation, vol. 2, pp. 198-209, 1990. Thrun et al., 1991] Thrun, S.B. et al., The MONK's Problems A Performance Comparison of Different Learning Algorithms. Rep. no. CMU-CS-01-197, Carnegie Mellon University, December 1991. Wnek et al., 1990] Wnek, J., Sarma. A., Wahab, A., and Michalski, R., "Comparison Learning Paradigms via Diagrammatic Visualization: A Casa Study in Sigle Concept Learning Using Symbolic, Neural Net and genetic Algorithm Methods", Technical Report, George Mason University, Computer Science Department, 1990.

- 129 Artificial Neural Networks
- Supervisor prof. dr. ir. A. J. Berkhout Mentors dr. ir. D. de Vries
- Artificial Neural Networks
- editors prof.dr. P.M.E. De Bra prof.dr.ir. J.J. van Wijk
- Promotors Prof. dr. ir. P. van der Veer
- Abstract Artificial Neural Networks The next intelligence
- geboren te Hanoi, Vietnam. Dit proefschrift is goedgekeurd door de promotor Prof.dr.ir. L.J
- Upscaling of Nonlinear Reactive Transport from Pore to Core Promotoren prof. dr. ir. S. E.
- Supervisors Dr. ir. Dick de Ridder, Prof. Dr. ir. Marcel J.T. Reinder Contents

更多相关文章：
**
陈远筑博士简历.doc
**

Using Graph Domination -Thesis*supervisor*: *Dr*. Arthur...*Networks*, in Applied *Artificial* Intelligence, 22(4...Lbestman. A Zonal Algorithm *for* Clustering Ad ... 更多相关标签：

Using Graph Domination -Thesis