9512.net

甜梦文库

甜梦文库

当前位置：首页 >> >> # The use of background knowledge in inductive logic programming

Programming Research Group

THE USE OF BACKGROUND KNOWLEDGE IN INDUCTIVE LOGIC PROGRAMMING Rui Camacho March 1994

Oxford University Computing Laboratory Wolfson Building, Parks Road, Oxford OX1 3QD

The use of Background Knowledge in Inductive Logic Programming

Rui Camacho March 1994

Abstract

This report describes experiments in learning models for basic ight manoeuvres from behavioural traces of a human pilot when using a ight simulator. A rst set of experiments using decision trees is presented. The auto-pilot built with the generated decision trees ies more smoothly than the human pilot. However the results show also that propositional logic-level representations, like decision trees, are inadequate to fully solve the problem. A learning system using a rst-order representation is required. However, current Inductive Logic Programming systems have severe limitations when dealing with such complex domains due to ine ciencies of searching large hypothesis spaces. An important issue to make the hypothesis space search tractable and efcient is the use of background knowledge. Some rst results are reported based on a system under development that already shows some uses of background knowledge at a \local" level of learning a single predicate. Identi cation of more possible uses of background knowledge at both a local and global level are required. Problems of relevance of knowledge will also be addressed in the rest of the thesis.

Acknowledgements

It has been a pleasure to study under Dr Stephen Muggleton's supervision. I wish to thank him for introducing and motivating me to ILP. Thanks are also due to Dr Ashwin Srinivasan for the excellent discussions we have about Progol and Indlog and for giving me excellent advice. I would like to thank Professor Donald Michie for suggesting the domain of the ight simulator and for insightful advice and support. I thank JNICT (Junta Nacional de Investigac~o Cient ca e Tecnologica) a from Portugal for the scholarship.

Contents

1 Introduction

1.1 Structure of the report : : : : : : : : : : : : : : : : : :

4

6 7 12 13 14 19

2 Experiments with Decision Trees

2.1 Decision Trees : : : : : : : : : : : 2.1.1 The selection criterion : : 2.1.2 Windowing : : : : : : : : 2.2 Experiments with decision trees 2.3 Limitations : : : : : : : : : : : : 3.1 3.2 3.3 3.4 3.5 3.6

: : : : :

: : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

7

3 ILP systems

Concepts, de nitions and framework : A generalisation ordering : : : : : : : Golem : : : : : : : : : : : : : : : : : MIS : : : : : : : : : : : : : : : : : : FOIL : : : : : : : : : : : : : : : : : : Summary : : : : : : : : : : : : : : :

22

22 25 27 30 32 36

4 The Indlog approach

4.1 Constructing the bottom clause : : : : : : : : : : : : : 4.2 Indlog : : : : : : : : : : : : : : : : : : : : : : : : : : : 4.2.1 The meta-language : : : : : : : : : : : : : : : : 4.2.2 Further reductions of ?H 's size : : : : : : : : : 4.2.3 The algorithm : : : : : : : : : : : : : : : : : : : 4.2.4 Memory requirements and computation complexity for Reduction : : : : : : : : : : : : : : : : : 4.3 Admissible pruning strategies : : : : : : : : : : : : : : 4.4 Problems in the ying domain : : : : : : : : : : : : : : 4.4.1 Learning constants in the Herbrand Universe of E 2

38

39 40 40 42 44

+

50 52 53 53

4.4.2 Dealing with large amounts of examples : : : : 4.5 Results : : : : : : : : : : : : : : : : : : : : : : : : : : :

56 59

5 Conclusions A Decision Trees results

A.1 Human pilot ight data : : : : : : : : : : : : : : : : : : A.2 Auto-pilot learning data and results : : : : : : : : : : :

62 65

65 65

B Logic Programming De nitions

B.1 The language : : : : : : : : : : : : : : : : : : : : : : : B.2 Models and substitutions : : : : : : : : : : : : : : : : : B.3 Resolution and deduction : : : : : : : : : : : : : : : : :

71

71 73 74

3

Chapter 1 Introduction

Inductive Logic Programming (ILP) Mug91] is rapidly becoming an important area of study within Machine Learning (ML). Much of this interest is owed not just to its formal approach and sound theoretical basis, but also because of its success in dealing with complex, real world tasks. ILP systems like Golem MF90] and Progol SSMS94] have been successfully applied in domains such as drug design KLS92], nite element mesh design DM92], prediction of protein secondary structure KS90, Kin91] and model identi cation in naive qualitative physics BMV91]. This report describes the application of an ILP system called Indlog to the problem of learning models of basic ying manoeuvres from behavioural traces of human pilots.

1

Symbolic learning applied to automatic building of control models has some advantages over a Control Theory approach. It is sometimes di cult to construct controllers for complex systems using classical Control Theory methods. AM91] state that present-day autolanders are not designed to handle large gusts of wind when close to landing. A very complicated situation to model is one in which a helicopter pilot must manoeuvre an aircraft in high winds while there is a load slung beneath the helicopter. Learning control rules by induction aims at providing a new way of building complex control systems quickly and easily. Another advantage, referred to in SZ93], is that the methodology

1

Indlog stands for Induction in Logic.

4

could also be useful in training new pilots. Major aircraft manufacturers spend large amounts of money in training new pilots either in ight simulators or real airplanes. A model of an expert pilot could be built using the reported methodology. During training of new pilots the model could be used to give advice whenever the trainee makes a mistake or to detect more subtle mistakes. By comparing the trainee models at di erent stages of the training process with the model of the expert one can decide if it is worthwhile continuing training or if the trainee is not a promising candidate. The report rst describes experiments following a methodology already applied successfully by Bain et al. MBHM90] in the pole-balancing setting. Later on Sammut et al. SHKM92] found that the technique scaled-up to learning an auto-pilot for a Cessna ight simulator. Our results MC94] (using a simulation of a F-16 combat aircraft) show that the synthesised auto-pilots y much more smoothly than the original (human) pilot. This \averaging" e ect was rst noticed by Bain in the pole-balancing setting and termed \clean-up e ect" MBHM90]. Both Bain and Sammut used decision trees as the induction tool. Some of the basic manoeuvres, like level turn, may be quite complex as several goals and constraints have to be considered in real-time and information from di erent instants of the ight have to be considered. Some of these features make the task very di cult to be accomplished fully by a learning system using a propositional logic-level representation such as decision trees. The application of ILP to the learning of the basic ying manoeuvres requires some extensions to the existing ILP systems and techniques. One important aspect to consider is the use of background knowledge to make the learning task feasible and e cient. A study of di erent uses of background knowledge is needed and some preliminary results are described in this report. As will be shown in Chapter 3 current ILP systems have still some limitations when dealing with complex and real-world problems. Improvements in the Indlog system (under development) could be signicant towards the applicability of ILP systems to complex domains. Numerical data is not handled elegantly by existing ILP systems as Mug93a] states. Since the data from the ying domain is mainly nu5

merical the problem will be addressed in the Indlog system. Some preliminary results are already shown in Chapter 4.

1.1 Structure of the report

The structure of this report is as follows. Chapter 1 gives some de nitions about decision trees, describes the experiments carried out using a decision tree building package and reports the ndings and problems encountered with this approach. Chapter 2 presents some de nitions and the general setting of ILP systems. Some representative ILP systems are reviewed and main limitations are collected at the end of the chapter. These are used to justify the need for the development of a new system to cope with the stated limitations. Chapter 3 describes the current version of the Indlog system (under implementation and improvement). Some results already achieved with the new system are also presented in this chapter. Conclusions A collection of open problems and lines of research to follow are described in the concluding chapter.

6

Chapter 2 Experiments with Decision Trees

2.1 Decision Trees

Consider a partition of a universe U of objects into disjoint subsets. Each subset is called a class and has a label called class name. Let C be the set of class names. The objects in U are characterised by a xed set of features called attributes. Each element's attribute Ai can take one out a set AVi of possible values. De nition 2.1 An attribute Ai is a function from objects of the universe into the set of possible values AVi :

Ai : U ! AVi:

Attributes may be divided into two categories according to the nature of the corresponding AVi. They may be continuous or discrete. An attribute Ai is continuous if AVi is the set of real numbers. Ai is discrete if AVi is a nite set of discrete values. Let n be the xed number of attributes for describing the objects then:

De nition 2.2 A classi cation C is a mapping from U to C: C : U ! C:

7

De nition 2.3 A decision tree T is a total function from the ndimensional space of attributes into C:

are the corresponding values for the A1 ; : : :,An attributes and the last argument is an element of C (the class to which the element belongs).

T : (A ; A ; : : : ; An) ! C: De nition 2.4 An instance is a (n+1)-tuple whose n rst arguments

1 2

De nition 2.5 A training set is a set of instances from which a decision tree is constructed. De nition 2.6 A test set is a set of instances disjoint from the training set.

De nition 2.7 An error in classi cation or a misclassi cation is made whenever the result of applying a decision tree to an object di ers from the class name of that object. De nition 2.8 Let S be a set of N instances and M be the number of misclassi ed instances. The percentage of instances correctly classi ed by a decision tree is de ned as the accuracy of the decision tree on S. De nition 2.9 An estimate of the accuracy of a decision tree on unseen objects, is called the predictive accuracy. De nition 2.10 A training set is adequate if no two instances, that

have the same values for all corresponding attributes, belong to di erent classes. Otherwise it is inadequate.

accuracy = N ? M 100% N

We may take each non-leaf node of a decision tree as a test on one attribute with a branch for each of the possible outcomes and each leaf as labeled with the name of a class. The tests involve only one of the attributes, each test being

A < Th or A >= Th

8

for a continuous attribute A, where Th is some threshold, or

A = Vi or A in S

for a discrete attribute A, where Vi 2 AV and S AV. In order to classify an object, we start at the root of the tree, evaluate the test, and take the branch appropriate to the outcome. The process continues until a leaf is encountered, at which time the object is asserted to belong to the class named in the leaf. As an example consider the task of setting the value for the throttle at di erent stages of a ight. Let the throttle's values depend solely on the altitude and on the state of ight. Let the attribute altitude be continuous and the attribute state be discrete taking values in the set ftake o , climbing, straight and level ight, turn, align, descend, landg. Let the possible class names be in fo , minimum, medium, maximumg. Instances using these attributes are shown in Table 2.1. The decision tree for classifying those instances is shown in Figure 2.1. To determine the value of throttle when ying at 200ft in the landing state we might apply the decision tree using the following procedure. We start at the root of the tree. The root node tests the state of ight. Since the value of the state of ight is land we take the corresponding branch that leads to a test on the altitude. The value of altitude is 200ft and therefore we take the branch > 100 and we end-up in a leaf. Once we reached a leaf the value of the throttle is the label of that leaf that is minimum. The speci cation for the induction task can be informally described as follows. The input is a pre-classi ed set of objects. Each object is described by a xed (predetermined) set of attributes. The output is a decision tree that correctly establishes the class of the objects in the training set as a function of their attribute values. The tree is constructed in a top-down fashion as described in algorithm 2.1 Qui86]. So long as two or more Tri's are nonempty each Tri is smaller than Tr. Thus, provided that a test can always be found that gives a nontrivial partition of any set of instances and that the training set is

1

These procedures are often called TDIDT methods. TDIDT stands for TopDown Induction of Decision Trees.

1

9

altitude

class 0.0 take o maximum 500.0 climbing maximum 1500.0 straight and level ight medium 1500.0 turn medium 1500.0 align medium 1000.0 descend medium 500.0 land minimum 100.0 land o Table 2.1: Instances for setting throttle

state

throttle=off

?@@ >= 100 < 100 ? ? @ @ ?

altitude

?? lnd ?? ?

state

PPP

tko, clb

throttle=maximum

PPPPstlv,turn,align,dsc PPP PPP P

throttle=medium

throttle=minimum

Figure 2.1: Decision tree for setting the throttle. In the arcs of the tree: tko stands for take-o ; clb stands for climbing; stlv stands for straight and level ight; dsc stands for descending and; lnd stands for landing.

10

Algorithm 2.1

input: output:

training set Tr. decision tree Tree.

Set Tr to the input training set. IF Tr is empty or just contains objects of a single class THEN Tree is a leaf labelled with the class name.

ELSE DO

Choose a test T on an attribute. Compute the partition that T makes on Tr. = There will be a subset Tri for each possible outcome O1 ; : : : ; Ow of T. = FOR i = 1 TO w DO Call recursively the algorithm with Tri . Tree has T as root and a branch for each tree result of applying the algorithm to Tri (i2 1; w]).

nite, this algorithm will always produce a decision tree that correctly classi es each instance in Tr. The choice of the test is crucial if the tree is to be simple. A good choice should assist the development of single-class subsets which do not require further division. The choice of what attribute to test on is called the selection criterion and is addressed in Section 2.1.1. If the attributes are adequate, it is always possible to build a decision tree that correctly classi es each object in the training set, and usually there are many such correct decision trees. In the experiments reported here 10 trial decision trees were built with each training set. The criterion for selecting the \best" out of the ten generated makes use of the test set as described in Section 2.2. The selection of the set of attributes may also be a hard problem and need a domain expert. Quinlan Qui82] reports that about two man-months were required to nd an adequate set of attributes (49 attributes) for the 3-ply king-rook king-knight chess end-game. 11

Experiments in this report were conducted with C4.5 Qui87]. This uses an information-based selection criterion. The justi cation is the following. When a decision tree is used to classify an object, it returns a class. Consider that there are only two possible classes C=fP, Ng. Class P has p objects and class N has n objects. A decision tree can thus be regarded as a source of a message \P" or \N", with the expected information needed to generate this message given by p p n n I (p; n) = ? p + n log p + n ? p + n log p + n If attribute A with values AV=fA , A ; : : :, Av g is used for the root of the decision tree, it will partition the training set Tr into fTr , Tr ; : : :, Trv g where Tri contains those objects in Tr that have value Ai of A. Let Tri contain pi objects of class P and ni objects of class N. The expected information required for the subtree for Tri is I(pi, ni). The expected information required for the tree with A as the root is then obtained as the weighted average v X +n E (A) = ppi + ni I (pi; ni) i where the weight for the ith branch is the proportion of the objects in Tr that belong to Tri. The information gained by branching on A is therefore gain(A) = I (p; n) ? E (A): This selection criterion is called the gain criterion Qui86]. Kon84] pointed out that the gain criterion as previously de ned tends to favour attributes with many values. Analysis supports this nding. Let A be an attribute and AV=fA , A ; : : :, Av g and let A0 be an attribute formed from A by splitting one of the values into two. If the values of A are su ciently ne for the induction task at hand, we would not expect this re nement to increase the usefulness of A. Rather, it might be anticipated that excessive neness would tend to obscure the structure in the training set so that A0 was in fact less useful than A. However, gain(A0) is greater than or equal to gain(A), being equal to it only when the proportions of the classes are the same for both subdivisions of the original value. In general, then, gain(A0)

2 2 1 2 1 2 =1 1 2

2.1.1 The selection criterion

12

will exceed gain(A) with the result that the evaluation function would prefer A0 to A. By analogy, attributes with more values will tend to be preferred to attributes with fewer. As another way of looking at the problem, let A be an attribute with random values and suppose that the set of possible values of A is large enough to make it unlikely that two objects in the training set have the same value for A. Such an attribute would have maximum information gain, so the gain criterion would select it as the root of the decision tree. This would be a singularly poor choice since the value of A, being random, contains no information pertinent to the class of objects in the training set. To overcome the problem of bias towards attributes with larger numbers of possible values a correction is introduced Qui86]. Inquiring about the value of attribute A itself gives rise to information, which can be expressed as v X +n +n IV (A) = ? ppi + ni log ( ppi + ni ): i

2 =1

Quinlan Qui86] argues that a good choice of an attribute is the one whose ratio gain(A)=IV (A) is as large as possible. This ratio, however, may not always be de ned or it may tend to favour attributes for which IV(A) is very small. The following criterion is used: from among those attributes with an averageor-better gain, select the attribute that maximises the above ratio. Qui79a] proposes the use of windowing as a method for accelerating and reducing the amount of machine storage required in the induction process. The system starts by randomly selecting a number of instances W (the window) from the training set and generates a decision tree with them. The tree is then used to classify the objects of the training set that are not in W . Those objects that are misclassi ed by the current tree are called exceptions. If there are no exceptions the process terminates and the decision tree is the result of the induction task. Otherwise a 13

2.1.2 Windowing

new window is generated using the old window and a percentage of the exceptions. The number of exceptions added into the new window is called the exceptions limit. The previous decision tree is discarded and a new tree is then generated using the new window. The process is repeated until there are no exceptions. Only one tree is built at each iteration and the nal tree classi es correctly the entire training set. The process could, of course, be halted as soon as a number of exceptions fall below some threshold. Quinlan presents two possibilities for generating the window for the next iteration. In one method a randomly selected set of exceptions is simply added to the current window. The number of added exceptions is limited by the exceptions limit parameter. The second method forms a new window by blending instances from the old window with exceptions in such a way that the window size remains constant (W ). An attempt is made to ensure that at least one instance corresponding to each subcomponent of the old rule is included in the new window so that each section of the old rule is not forgotten. The exceptions limit is used not to stop the window growing too rapidly, but rather to prevent the old window being swamped, in each iteration. The methods are heuristic in nature and neither can guarantee convergence on a correct tree. The rst will stop when the window size reaches the bounds of the storage available for it, so at least it will always terminate. The second, though, may search forever attempting to construct a window of xed size from which a correct rule for all objects in the training set can be induced. If there is no such subset of the training set | for instance if the number of subcomponents of any correct rule exceeds the number of instances allowed in the window | then the search is doomed to failure. O'K83] has noted that the iterative process cannot be guaranteed to converge on a nal tree unless the window can grow to include entire training set. One interesting empirical result reported in Qui79b] is that if the initial size and exceptions limit are set very low, a correct rule is (eventually) obtained with a very compact nal window.

2.2 Experiments with decision trees

The experiments were carried out using the ACM ight simulator of an F-16 aircraft. The ight simulator was downloaded into an HP900/735 14

workstation running the Unix operating system. ACM runs under X windows version 11 release 5. C4.5 Qui87] was used as the decision tree induction system. The methodology consisted of the following steps: 1. State and control variables were de ned to be sampled and recorded as representatives of the situation of aircraft status and decisions made by the human pilot (the behavioural traces). 2. A ight plan was de ned. 3. The author trained himself by repeatedly piloting the simulated F-16 through the successive stages of the de ned ight plan. 4. Forty ights were recorded and taken as the data for the learning task. 5. The recorded data from the forty ights was pre-processed before being input to the learning package. Some pre-processing included rounding up the values of variables and making the training set adequate. 6. A decision tree was built for each stage of ight and for each decision variable. 7. The trees were converted into C code and linked to the ight simulator. 8. Finally the ight simulator was then run in an \auto-pilot" mode where all commands were determined by the interpretation of the decision-trees for the corresponding stage. The values of control and state variables were sampled and recorded each 100 milli-seconds (ms). The variables and their types were:

State variables

real: airspeed magnitude, airspeed y coordinate (knots) (Geoparallel system) real: x, y, altitude (ft) (position in Geoparallel system) 15

real: climb rate (ft/h) real: z coordinate of g-force (in aircraft system) real: roll rate, pitch rate, yaw rate (rad/sec) real: heading, pitch, roll (rad) (Euler angles for aircraft system) real: angle of attack, angle of sideslip (rad) real: elevators setting (radl) real: ailerons setting (rad) real: rudder setting (rad) real: elevator trim setting (NOT used) real: aps setting (rad) real: speedBrake setting (rad) integer: throttle boolean: gear handle, brakes, afterBurner

Control variables:

elevators, rollers, rudder, aps, speed brake angle, throttle, gear handle (boolean), brakes (boolean), after burner (boolean). The de ned ight plan consisted of the following manoeuvres: 1. Take o . 2. Climb to 1000 feet. 3. Reduce climbing angle and thrust attaining level ight at 1500 feet. 4. Fly parallel to the runway's long axis for a distance of 220 kilo-feet. 5. Turn left 270 6. Turn right to line up with the runway. 7. As soon as distance to runway is less than 70 kilo-feet, start descent to runway keeping in line. 8. Land on the runway. The 40 ights were rst split into 30 ights for the training set and 10 ights for the test set. Each ight was broken down into the states of the ight plan and the data was processed state by state. A decision tree was generated for each control variable and for each stage of ight as follows. The current control variable is taken as the class and all other variables are taken as the attributes. Consider an \event" as being signalled by the occurrence of a control action (a change in the value of any of the control variables). The pre-processing consists 16

of searching for events of the current control variable. Whenever an event is found, the value of the class (the current control variable) is associated with previous values of the attributes (other variables). This delay between the attribute's sample time and the class sample time accounts for the human delay between assessing a situation and reacting to it by performing some action (changing the control variable value). In the reported experiment the test set was used to assess the delays involved in each state and control variable. For each state and control variable di erent delays were tried (from 1 second to 3 seconds). For each delay value 10 trail decision trees were built and one test set was used. From the 10 decision trees the one with best accuracy on the test set was selected. Among those best trees for each delay we select the one, again, with best accuracy on the corresponding test set. That was the delay chosen for that control variable and state. As stated above, for each of the ight plan's 8 stages 9 separate decision trees were synthesised. On average a mission took around 12 minutes. Each 100 ms the state-vector was sampled, therefore about 7,200 state-vectors were record in each mission. From the training set of typically 30 missions yielding about 10,000 recorded events, a complete autopilot is synthesised of the form: ight plan plus 72 decision trees. A list of events for the elevators an rollers at each state can be found in Table A.2 in Appendix A. All decision trees were converted into C code and linked to the ight simulator. The program can then be run in an \auto-pilot" mode where the values for the control variables are obtained at each moment by interpreting the decision trees. The delays that account for the human recognise-act cycle are introduced before each value is actually used. The clean-up e ect became evident when autopilots synthesised according to the formulation below were substituted for human control. The magnitude of this gain in steadiness of control was signi cant. The gain in steadiness may be con rmed either by Figure 2.2 or numerically by Table 2.2. There are major advantages in using a symbolic learning system over a non-symbolic one like a neural network. The model obtained using the rst method is, in general, \intelligible" to a human expert. Some small decision trees like the ones depicted in Figure 2.3 were found. The decision trees presented in Figure 2.3 are understandable for an expert pilot and may provide insight into the task being modeled. 17

Figure 2.2: The `clean-up' e ect. Plotted lines show distances traveled in the horizontal plane from take-o by human pilot (light lines) and the autopilot (heavy line)The y axis represents deviations in the horizontal plane from straight ight.

18

human pilot auto-pilot best ight worst ight take o y 0.0 0.0 0.0 sy 0.0 0.1 0.0 climbing y 0.0 3.2 -0.5 sy 0.0 4.1 0.8 levelling y 0.0 19.9 -6.6 sy 0.0 2.4 2.0 stlv y 0.2 31.7 -12.7 sy 22.7 50.7 v 9.7 uP u (y ?y )2 t =1 n Py sy = y = i (n?1)

n i i

state

i=1

Table 2.2: Errors in the horizontal plane for auto-pilot and human pilot. The value of the y coordinate in the ight plan is zero (between take o and straight and level ight). NOTE: stlv stands for straight and leveled ight. This is a very important result for pilot training feedback. A list of statistics of the generated decision trees can be found in Appendix A. A summary of positive results can be stated. The methodology for skill-grafting from behavioural traces works. The methodology scales-up from the pole-balancing experiment to the ying experiment. The clean-up e ect still operates in this experimental context and plays a very large, almost dominating, r^le. o

2.3 Limitations

Although the reported experiments show some positive results, there are some limitations and problems with the approach taken. 19

if (altitude -97) apsSetting = 0; else if (altitude > -97) apsSetting = 0.3489; (a) if (velocity.y 0.2) f if (y 31) rollers = 1e-05; else if (y > 31) rollers = -2e-05;

g

else if (velocity.y > 0.2) f if (y -32) rollers = 0; else if (y > -32) rollers = -1e-05;

g

(b) Figure 2.3: Decision trees for: (a) actuating the aps during climbing; (b) actuating the rollers during straight and leveled ight. The auto-pilot ies exactly the same ight plan that the human pilot ew during the training sessions. If there is an unexpected event (like a strong wind) and the plane is put out of the ight path, there is no chance of recovery. The problem with decision trees (a kind of propositional logic level language) is that there are no variables in the language therefore whenever a tree is generated for a mission the auto-pilot will carry out only the learned ight plan. If, however, the ight plan is changed new trees must be built for the new auto-pilot. It would be desirable to learn a set of basic manoeuvres (basic operators) valid in all circumstances. For that purposes the language has to have variables that can be instantiated when the operator is being used in each particular circumstance. The experiments show that for some more complex manoeuvres like turning or aligning with the landstrip, the generated trees are quite complex. For those manoeuvres the \inteligibility" property is lost. Either the use of structured attributes or a more powerful representation is required to model those manoeuvres. 20

We do not have a process to determine the several delay(s) that account for the elapsed time the human pilot requires between assessing a situation and carrying out an action. The delays may vary signi cantly with the circumstances. Various delays must be recognised according to the state of ight and the control variable to be learned. All the attributes must be present during the learning phase. There is no possible invention of new attributes in the decision tree framework. It is not obvious which attributes the human pilot uses. The pilot may build structured attributes out of simple ones or may extract them from the landscape image. Di erent coordinate systems (polar, spherical cylindrical etc.) may be used, according to circumstances. A \proper" choice of attributes may simplify considerably the complexity of a decision tree. When building the decision trees there is no direct way of using causal relations among the attributes. If insu cient data is provided and enough pruning is done, the most \interesting" attribute from a discrimination point of view may disregard any notion of causal dependency that may exist between the attribute and the class. It is not surprising that a tree generated for elevators during take-o used altitude as the main attribute. The tree stated that elevators=0 if altitude=0 and elevators = 1e-3 if altitude=10ft. Of course if you do not pull the elevators the altitude does not change and with such a tree the aircraft does not get o the ground. The impossibility of using background knowledge is a severe limitation of decision trees. This limitation could be solved by just using an ILP system. Mug91] states that a propositional logic-level system cannot be used in areas requiring essentially relational knowledge representations.

21

Chapter 3 ILP systems

ILP Mug91] is a special case of a broader area within Machine Learning (ML) called supervised learning. The learning task may be stated in a set-theoretic perspective. It aims at learning an intensional description of a certain set in a universe of elements (U). The intensional description is called the concept. In an ILP setting the concept is usually referred to as the hypothesis. Elements in the target set are called instances of the concept or positive examples. Let Q denote the target set. Elements in Q? = f:x j x 2 U n Q g are called negative examples or counter-examples and are elements of the universe that are not instances of the target concept. Q and Q? may be in nite. Learning systems usually consider nite subsets of Q and Q?. These nite subsets will be denoted by E ( Q ) and E? ( Q?). If not stated otherwise E will be referred as the positive examples and E? will be referred as the negative examples.

+ + + + + + +

In this and the following chapters, concepts and de nitions from Logic Programming are extensively used. A list of the Logic Programming concepts used can be found in Appendix B.

3.1 Concepts, de nitions and framework

ILP is concerned with the generation and justi cation of hypotheses from a set of examples making use of prior knowledge. The representation used is Horn clauses, a subset of First-Order Predicate Calculus. The induced hypotheses are represented as a nite set or conjunction 22

of clauses denoted by H. H is of the form h ^ : : : ^hl where each hi is a nonredundant clause (see De nition 3.4). The prior knowledge, also called background knowledge, will be denoted by B. B is a nite set or conjunction of clauses, B = C ^ : : : ^Cm , typically de nite clauses. In the ILP setting a positive example is typically represented by a positive unit ground clause. E is a conjunction of ground atoms. E = e ^e ^ : : : ^en where ei is a individual positive example. Negative examples are typically negative unit ground clauses. E? represents the conjunction of negated ground atoms. If we denote a negative example by fi then E? = f ^ f ^ : : : ^ fr . E ^ E ? is the training set. Some ILP systems can use non-ground examples and not all of them need negative examples to arrive at a concept description.

1 1 + + + 1 + 2 + + 1 2 +

The induction process, in the ILP framework, meets the following conditions. The background B and the training set are assumed to be consistent. We call these conditions consistency conditions:

B 6j= 2 E ^ E ? 6j= 2 The background knowledge should be consistent with the negative examples. In other words, B should not logically imply any of the negative examples. This condition is called prior satis ability Mug93a]:

+

B ^ E ? 6j= 2:

To justify the need for the induction process it is necessary to satisfy the so called prior necessity condition Mug93a]:

B 6j= E :

+

The induced hypotheses should satisfy the posterior satis ability condition Mug93a]:

B ^ H ^ E ? 6j= 2:

That means that the hypotheses found should be consistent with the negative examples. The set of hypotheses should not be vacuous and 23

explain the positive examples. Condition stated by the posterior sufciency condition Mug93a]: B ^ H j= E The last condition states that each hypothesis hi 2 H should not be vacuous. This condition is called posterior necessity condition: B ^ hi j= e _ e _ : : : _ en (8hi; hi 2 H ) As an example imagine that the system is learning the concept of a virtuoso player. Consider that the information given to the system is the following.

+ + 1 + 2 +

B

8 > > > > > > > < > > > > > > > > :

plays instrument(glenn gould; piano) plays instrument(david oistrach; violin) plays instrument(peter; piano) plays instrument(john; violin) performance(glenn gould; piano; superb) performance(david oistrach; violin; superb) performance(fisher; piano; lousy) performance(fisher; chess; superb) performance(john; violin; lousy)

virtuoso(glenn gould) virtuoso(david oistrach) ( virtuoso(peter) E? virtuoso(john) The previously stated prior conditions are met. B is trivially consistent. E ^E? are also trivially consistent. B does not logically entail the negative examples or any of the positive examples since the predicate symbol virtuoso does not appear in B. A possible hypothesis generated by an ILP system could be the following single clause: virtuoso( Player ) plays instrument( Player, Instrument ) ^ performance( Player, Instrument, superb).

E

(

+

+

24

The hypothesis found satis es the posterior conditions. B ^ H does not logically entail E? since neither sher nor john are capable of a superb performance when playing an instrument.

3.2 A generalisation ordering

The number of hypotheses satisfying the previously stated conditions is, in general, very large and even in nite in most cases. However, it is possible to organise the hypothesis space by imposing an ordering on the set of clauses. The learning algorithm may then take advantage of the existence of that ordering. The search space may be systematically searched and some parts of the space may be justi ably ignored during the search. One such ordering over the set of clauses was mentioned already in Rob65] and is called subsumption: De nition 3.1 If C and D are two distinct nonempty clauses, then C subsumes D and we write C D i there is a substitution such that C D. De nition 3.2 A clause C is subsumption equivalent to a clause D and we write C s D i C D and D C. A clause is reduced if it is not subsumption equivalent to any proper subset of itself. Subsumption is the most common ordering over the set of clauses used in ILP systems. -subsumption ( Plo69]) is usually the name used in ILP to refer to the concept of subsumption. The ordering over the set of clauses is sometimes called a generalisation model Bun88]. If not stated otherwise, the generalisation ordering assumed in the de nitions of the rest of the report is subsumption. The concept of redundancy follows naturally from the idea of an ordering over the set of clauses and the concept of equivalence between clauses or sets of clauses. There are two kinds of redundancy. A literal may be redundant within a clause and a clause may be redundant within a set of clauses. De nition 3.3 A literal l is redundant in clause C _ l relative to background theory B i B ^ (C _ l) B ^ C: 25

De nition 3.4 A clause C is redundant in the theory B ^ C i B ^ C B:

The subsumption ordering imposes a lattice over the set of clauses.

De nition 3.5 A lattice is a partially ordered set in which every pair of elements a,b has a greatest lower bound (glb) (represented by a u b) and least upper bound (lub) (represented by a t b). De nition 3.6 A generalisation ordering (or generalisation model)

is a partial order1 over the set of clauses. The lattice imposed by a generalisation ordering is called a generalisation lattice.

The top element of the subsumption lattice is 2, the empty clause. The glb of two clauses C and D is called the most general instance (mgi) and is the union of the two clauses mgi(C,D) = C D. The lub of two clauses C and D is called the least general generalisation (lgg) Plo69] of C and D. Under subsumption the glb and lub of clauses is unique up to a renaming of variables.

De nition 3.7 Two literals are compatible if they have the same

predicate symbol and sign.

De nition 3.8 Let l1 = p(t11,: : :,t1n) and l2 = p(t21, : : : ;t2n) be two compatible literals. The lgg( l1 , l2 ) = p(lgg(t11,t21),: : :, lgg(t1n , t2n )), where the lgg(t1i, t2i) is computed as follows.

if t1i = t2i then lgg(t1i , t2i) = t1i = t2i . if t1i = f(t011; : : :,t01k ) and t2i = f(t021; : : :,t02k ) then lgg(t1i , t2i ) = f(lgg(t011, t021),: : :,lgg(t01k , t02k )). otherwise, lgg(t1i , t2i) is a new variable V and V is used everywhere as the lgg of these terms. The lgg of two compatible literals is unique up to an equivalence of literals.

1

A partial order is a re exive, anti-symmetric and transitive binary relation.

26

Example: The lgg( p(f(a, g(U)), V, g(U)), p(h(a, g(W)), W, g(W))) is p( Y, X, g(Z)).

De nition 3.9 Let C and D be two clauses (sets of literals) then lgg(C; D) = flgg(lc; ld) j lc 2 C and ld 2 D and lc and ld are compatibleg

As was already pointed out by Mitchell Mit82] the task of concept learning can me mapped into a search through a space of hypothesis. A generalisation ordering is a crucial concept in ILP for it is the basis of an organised search of the hypothesis space. The search for an hypothesis is mapped into the traversal of the generalisation lattice. The traversal of the generalisation lattice is, in general, the computationally expensive stage of the learning task. As will be seen later, background knowledge can play a central r^le in either reducing the search space or increasing o the e ciency of the traversal algorithm. These are some of the ILP problems addressed in this report and proposed as the subject of the D.Phil. thesis.

3.3 Golem

Golem MF90] is based on the concept of relative least general generalisation (rlgg) (see De nition 3.10) with ground theories and functional and variable connection restrictions to limit the size of the hypotheses. Golem accepts as input a set of positive and negative examples and ground model of background knowledge. Positive and negative examples are represented as ground unit clauses. The model of background knowledge is also a set of ground unit clauses representing the heasy model (see De nition 3.11) of B. The main control cycle of Golem is shown in Algorithm 3.1 MF90]. Golem incrementally constructs clauses which cover as many positive examples as possible without covering any negative until all positive examples are covered. The rlgg clause is often too speci c and therefore is reduced by removing literals from the body of the clause until no more removals can be performed without covering a negative example. Contrasting with systems like MIS, Progol and Indlog, Golem uses a bottom-up (speci c to general) reduction algorithm. 27

E+ /* Positive examples */ ? E /* Negative examples */ Mh(B) /* h-easy model of background knowledge B */ sample /* sample limit */ Pairs(s) /* random sample of pairs from E+ */ Lggs = fC:fe,e0g2Pairs and C = rlgg(fe, e0 g) w.r.t. Mh (B) and C consistent w.r.t. E? g S = the pair fe, e0 g whose rlgg has the greatest cover in Lggs

Algorithm 3.1

do

E+ (s) = random sample of size s fromS + E Lggs = fC: e0 2E+(s) and C = rlgg(S fe0g) consistent w.r.t E? g nd e0 S which produces the greatest cover in Lggs S = S fe0 g E+ = E+ - cover( rlgg(S)) while increasing cover c1 ; : : :;cnc and D = Dh d1 ; : : :;dnd be two clauses and b1; : : :;bm be the h-easy model of the background knowledge B. Let C' = Ch c1 ; : : :;cnc ;b1 ; : : :;bm and D' = Dh d1 ; : : :;dnd ; b1 ; : : :;bm be the extension of C and D with the atoms from the h-easy model of B. The relative least general generalisation (rlgg) of C and D relative to B is rlgg( C, D) = lgg( C', D'). Golem uses and introduces several important concepts to restrict either the size of the computed rlggs and the size of the hypothesis space. The ground background knowledge is generated using the concept of h-easy model suggested initially by Blum and Blum Blu75] and adapted by Shapiro Sha81] De nition 3.11 Given a logic program B an atom a is h-easy with respect to B i there exists a derivation of a from B involving at most h binary resolutions. The Herbrand h-easy model of B, Mh(B) is the set of all Herbrand instantiations of h-easy atoms of B. To guarantee a nite h-easy model of B, Golem restricts the language of B to be the Horn clause language with a further constraint

De nition 3.10 Let C = Ch

28

called syntactic generative and de ned as follows.

The logic program B is syntactically generative i every clause in B is syntactically generative.

De nition 3.12 The clause h b ; : : : ;bn is syntactically generative whenever the variables in h are a subset of the variables of b ; : : :;bn .

1 1

Golem uses the concept of ij-determinacy that is a constraint on the hypothesis language. ij-determinacy is a wide spread concept in other ILP systems like FOIL Progol or Indlog. ij-determinacy is de ned after the de nition of determinate term:

De nition 3.13 Let B be a logic program and the examples E be a set of ground atoms. Let h b ; : : : ;bm,bm ; : : :;bn be an ordered Horn clause and t be a term found in bm . t is a determinate term with respect to bm i for every substitution such that h 2 E and fb ; : : : ;bmg M(B) there is a unique atom bm in M(B), i.e. for

2 1 +1 +1 +1

any such there is a unique valid ground substitution whose domain is the variables in t.

+1

1

+1

Note that the valid substitutions for a determinate term t in any literal bm are a function of the substitutions applied to a certain number of other terms t ; : : :;td within bm . In this case we say that literal bm has degree d with respect to t. The limit of the maximum degree of any literal with respect to any term is denoted by j.

1 +1 +1

De nition 3.14 Let B be a logic program and the examples E be a

set of ground atoms. Every unit clause is 0j-determinate. An ordered clause h b1; : : :;bm ,bm+1 ; : : : ;bn is ij-determinate i a) h b1 ; : : :;bm is (i-1)j-determinate, b) every literal bk in bm+1 ; : : :;bn contains only determinate terms and has degree at most j.

Golem may be told which predicates from the background knowledge determine the target concept. Statements that convey that information are called determination declarations. Functional constraints that are used in the Indlog and Progol systems existed already in Golem and MIS. The functional constraints

2

there exists a total ordering over the literals in the body of the clause.

29

state which arguments of each predicate are input arguments and which are output arguments. Generating all the background facts can be a time-consuming and tedious process, and it is not always easy to know what to generate unless the training examples are known in advance. Although Golem has been successfully applied to some real world problems KS90], it is unable to learn non-determinate clauses. The reduction algorithm of Golem does not guarantee any optimality of the reduced clause. By doing a systematic search during reduction some optimal results may be achieved in systems like Progol and Indlog using a top-down reduction algorithm.

3.4 MIS

The Model Inference System (MIS) Sha83] is a program that corrects and synthesises logic programs. Used as a debugging tool it can detect errors and replace the program de nitions that are in error. When used for synthesis it will query the user about the behaviour of an intended program and learn such a program from the example instances classi ed by the user. MIS uses the concept of re nement operator to search the hypothesis space. Let P be a set of program clauses. The following de nitions are from Sha83]. Re nement imposes an order over clauses drawn from the hypothesis language LH .

De nition 3.15 Let size be a measure of the complexity of a clause and a function into the naturals. For de nite clauses C and D, and a language LH : D is a re nement of C if C, D 2 LH and C D and size(D) > size(C). Size is some structural complexity measure which is a function from some sentences of LH to natural numbers, with the property that for every n > 0 the set of sentences of size n is nite.

D re nes C only if D is more speci c than C.

De nition 3.16 A re nement operator is a mapping from clauses

to nite subsets of of their re nements. It is a function:

: clause ! 2clause 30

of nodes and R a relation such that R N N. Then a rooted graph G 0 = (N, R, N0) is a directed graph such that, N0 2N and for all nodes N0 2 N there exists a nite path from N0 to N0

De nition 3.17 Let G = (N, R) be a directed graph, where N is a set

De nition 3.18 A re nement graph is a rooted graph (LH , , C ),

where C0 is some initial clause.

0

De nition 3.19 A re nement operator is complete with respect to a set of sentences if all sentences can be generated from the empty clause by successive applications of the re nement operator. LH = (2), the transitive closure starting with 2.

Under the subsumption ordering a re nement operator maps a clause c to a set of its re nements (c) by means of two syntactic operations on clauses: substituting a variable with a term, and adding a literal to the body of a clause. If the set of re nement operators is complete under subsumption the re nement operators will generate all clauses of the subsumption lattice. MIS searches the re nement graph top-down, in a breadth- rst fashion. MIS algorithm is illustrated in Algorithm 3.2. MIS uses a pruning mechanism based on the property that if a clause does not cover an example, none of its specialisations will.

31

Algorithm 3.2 Set H to f2g repeat read the next fact repeat while the conjecture H is too strong do

forever

(i.e, it covers a negative example) apply the contradiction backtracking algorithm, and remove from H the refuted hypothesis. while the conjecture H is too weak do (i.e, it does not cover a positive example) add to H re nements of previously refuted hypotheses. until the conjecture H is neither too strong or too weak (w.r.t. the facts read so far).

The backtracking algorithm starts from a derivation tree of a refutation proof and interactively tests the terms resolved upon by constructing ground atoms for each term and asking the user to validate them. The algorithm is only applied when a logic program succeeds with a wrong answer. If the logic program fails, MIS can only add new clauses. The search used in MIS su ers from some ine ciencies. The set of re nement operators is redundant. It generates duplicate nodes in the re nement graph. For complex problems the space complexity of a breadth- rst search is prohibitive.

3.5 FOIL

FOIL's hypotheses language is function free Horn clause Logic extended with negative literals in the right hand-side of a clause. Qui90] considers that FOIL is a natural extension of some ideas from the attribute-value learning paradigm to a rst order logic setting. Consider a target relation based on a k-ary predicate. FOIL's input is a bindings set containing ground k-tuples, each tuple representing a value assignment to the variables of the target predicate. Each of these 32

tuples are labelled with a or to indicate whether or not it is in the target relation. Tuples labelled with can either be given explicitly by the user or generated using the closed-world assumption. If a tuple is not explicitly given as a component of the relation then it is assumed that it is not in the relation. For a particular relation, FOIL nds clauses one at a time using the cover Algorithm 3.3. At each cycle FOIL generates a clause, removes the tuples covered by that clause and cycles again until no positive tuples are left. The set of clauses that cover the entire set of positive tuples is then output.

Algorithm 3.3 (Cover algorithm) input: T. /* the initial set and tuples */

Tcur = T. H = ;. /* initial set of clauses */

repeat

. /* h is the target predicate with all arguments non instantiated */ Best = specialiseClause( C, Tcur ). Best = S reduce( Best ). H = H fBestg. Tcur = Tcur - covered( Best, Tcur ). until Tcur = ; or encoding constraints violated.

C=h

output: H.

Post-processing of clauses is done by removing irrelevant literals. A literal li is considered irrelevant in a clause C _ li if the clause C has the same accuracy as C _ li. The encoding constraints are violated whenever the number of bits for encoding H exceeds the number of bits needed to encode the covered tuples. FOIL specialises a clause by adding literals to it. In the specialisation algorithm, FOIL uses an information gain criterion to decide which is the next best literal to add to the current clause. If the literal bi to be added to the clause introduces n new variables and the current set Ti is made up of k-tuples then the next set of bindings Ti will

+1

33

contain (k+n)-tuples that satisfy the new clause Ci .

+1

The gain criterion used to choose the next literal is focused on the information provided by signaling that a tuple is one of the kind. If the current Ti contains Ti tuples and Ti tuples, the information required for this signal from Ti is given by I (Ti) = ?log ( T Ti T ? ) i + i If the selection of a particular literal bi would give rise to a new set Ti , the information given by the same signal is

+ + + 2 + +1

i I (Ti ) = ?log ( T T+ T ? ) i i

+1 2 + +1 + +1 +1

Let Ti be the tuples in Ti that are represented by one or more tuples in Ti . The total information gained regarding the tuples in Ti is given by the number of them that satisfy bi multiplied by the information gain regarding each of them, i.e.

++ +1

gain(bi) = Ti

++

(I (Ti) ? I (Ti )):

+1

In FOIL the gain criterion is the primary measure of the utility of a literal bi. Unnegated literals may, however, receive \a small amount of credit" Qui90] if they introduce new variables even if the literal does not give a higher concentration of tuples. The motivation for this procedure is that if no literal produces any apparent gain, it may be helpful to introduce a new variable and try again with an enlarged collection of possible literals. Literals in FOIL may be of the six kinds: Xj = Xk or; Xj 6=Xk or; Xj = a or; Xj 6=a or; Q(V , V ; : : : ;Vr) or; :Q(V , V ; : : : ; Vr). Where Xis are existing variables and Vis are existing or new variables, a is a constant and Q is some relation. FOIL considers the entire search space of such literals with the following constraints: the literal must contain at least one existing variable. If Q is the same as the relation P on the head of the clause, possible arguments are restricted to prevent some problematic recursion.

1 2 1 2

34

The gain criterion provides information for FOIL's pruning technique. FOIL uses a minimum description length principle for handling noise. The addition of a literal to a clause is ruled out if the number of bits required to encode the new clause exceeds the number of bits needed to indicate the covered tuples. If no literals can be added to a clause but it is still reasonably accurate (taken here to mean at least 85%), the partial clause is retained as a nal clause. Similarly, if no further clause can be found within this encoding restriction, the current set of clauses is taken as the de nition of the target relation. FOIL uses type and mode information to restrict the search for sensible literals. To moderate the consequences of a greedy search FOIL creates choice points and use backtracking when there are several literals with the same heuristic value. To overcome some myopic problems of FOIL's search, recent versions borrow the idea of determinate literals from Golem MF90]. Determinate literals Qui91] are literals that introduce new variables for which each tuple has exactly one extension and each tuple has at most one extension in the new training set. Unless a literal is found with a gain close to the maximum, all determinate literals are added to the body of the clause under development. FOIL uses determinate literals to learn quicksort. As stated by Quinlan in Qui90], FOIL is usually unable to learn from a small presentation of examples. It was referred earlier that FOIL uses the closed world assumption to construct negative examples if these were not provided by the user. It is basically a wrong assumption since the training set represents a subset of the extensional de nition of the target concept. The complement of the training set is made up of some tuples that belong to the extensional de nition and some others than do not belong to the de nition. FOIL does not use a theorem prover therefore all the ground atoms used in a proof must be present in its background data base. Although it may be e cient by using a matching algorithm that uses the data base as a look-up table the amount of data may be quite large for complex problems. The nal set H of clauses is not logically reduced as FOIL has no concept of clause redundancy. In the nal set H may even contain clauses that cover zero positive examples and zero negative examples. Quinlan states in Qui91] that if a literal represents a relation between 35

two variables X and Y in which each instantiation of X corresponds to exactly one value Y (a one-to-one mapping from values of X to values of Y) then adding this literal give rise to a gain of zero since the number of positive tuples remain the same. The solution adopted is to give them some small gain to any literal that introduces a new variable. In the case of determinate literals FOIL decides to add them all to the partial clause. In both cases the gain criterion is no longer useful and no pruning facility may be used. The search over such spaces is made in almost an exhaustive way. Since FOIL does not handle functions of arity greater than zero it has to overcome this problem by introducing extra predicates. Every predicate introduced makes the search space exponetialy larger and therefore the search gets more complicated due to this limitation of FOIL. The user has to provide all the possible constants in the training set. FOIL then tries all of them in all possible arguments of the predicates. In the domain of the ight simulator the number of possible constants is quite large making the domain inappropriate for a system like FOIL.

3.6 Summary

A summary of current ILP systems limitations can be found in Mug94] and are reproduced here in Table 3.6. The applications abbreviated in the third column of Table 3.6 are: Chess. Learning endgame strategies Mor92] Drugs. Structure-activity prediction KLS92] List & number theory. Quick-sort, multiply, etc MF90] Meshes. Rules for Finite Element Mesh design DM92] Natural language. Grammar acquisition Wir88]. Proteins. Secondary structure prediction KS90, Kin91] Qualitative. Model identi cation in naive qualitative physics BMV91] Satellites. Temporal fault diagnosis Fen91]. 36

Restriction Ground background knowledge Non-numerical data Determinacy Search myopia E ciency of learning

Systems FOIL, Golem ITOU, FOIL, Golem, SIERES, CLINT, RDT Golem, FOIL, LINUS FOIL, FOCL ITOU, CLINT

Problematic Domains Qualitative, chess, natural language Meshes, drugs Qualitative, chess meshes List & number theory Proteins, chess, satellites

Table 3.1: Restrictions that have led to problems in real-world applications.

37

Chapter 4 The Indlog approach

This chapter describes an ILP system under development called Indlog. The system aims at solving some of the shortcomings of current ILP systems, namely the ones described in the previous chapter. In the line of MIS Sha83], FOIL Qui90] and Progol Mug93c], Indlog traverses the generalisation lattice in a top-down fashion. However, Indlog improves on either MIS and FOIL by using available or user supplied knowledge to traverse the generalisation lattice e ciently. A distinctive characteristic of Indlog is that it generates explicitly the bottom of the generalisation lattice. This technique of building an initial clause to reduce the search space exists already in systems like Golem MF90], ITOU Rou90], CLINT DB92] and Progol Mug93c]. The use of the bottom clause of the lattice together with further uses of knowledge either provided by the user or deduced from the available data is a major source for e ciency improvement. The search is much more informed than the MIS search. Both Progol and Indlog systems represent an improvement over existing ILP systems in that it has some useful features that have been lacking in previous ILP systems. Unlike FOIL, for example, the result of the learning task in Progol and Indlog is a non-redundant theory. Indlog can handle non-ground background knowledge, can use nondeterminate predicates, uses a strong typed language and makes use of explicit bias declarations such as mode, type and determination declarations. As in the Progol system Indlog learns with one example at a time and carries out a saturation and attening before the reduction step. 38

The main di erences between Progol and Indlog are in the facilities in the meta-language and the strategy of the reduction algorithm. Progol uses an A* search strategy where compression is used as the heuristic function h whereas Indlog carries out an iterative deepening search. Indlog provides a connected theory ag declaration that allows considerable size reductions of the ?H for this kind of restricted language.

4.1 Constructing the bottom clause

As stated before Indlog starts generating the bottom of the generalisation lattice to be searched. This is a rst and crucial step to remove useless clauses from consideration during the lattice traversal. Search is done over a nite lattice with unique > and ? elements. This is unlike MIS and FOIL search over an in nite descending lattice. The generation of the bottom clause is justi ed by the use of the deduction theorem applied to the posterior su ciency condition in two steps Mug94]. The original condition is

1

B ^ H j= ei

+

Applying the Deduction Theorem left to right we get

B j= H _ ei :

+

Applying the Deduction Theorem right to left we get

B ^ ei j= H:

+

H may be derived using any sound and complete derivation method. The search space is then restricted to all the hypotheses that subsume H. H may be, in general, in nite. Indlog uses some set of constraints C on the language having as a consequence a nite H. The case in which H is a single clause will be the only one considered. As stated before the system considers only one example at a time therefore the derived H will be denoted by ?H (B; ei ; C ) Mug93b]. ?H (B; ei ; C ) = fliteral j literal is derivable from B ^ei g. From now on we will

+ + +

1

F1 ^ F2 j= F3 i F1 j= F2 _F3

39

refer to ?H (B; ei ; C ) simply by ?H bearing in mind that there will be always an underlying example ei and a set of constraints C. ?H is constructed during saturation and will be described in Section 4.2.3. Saturation procedures were rst introduced by Rouveirol Rou90].

+ +

4.2 Indlog

As will be demonstrated in Section 4.2.4, every atom that can be justi ably removed from the ?H clause has an enormous saving e ect during the reduction stage of the algorithm. Information necessary to remove irrelevant atoms may be either provided by the user or be extracted from the available data. This section considers the information the user may supply and that is coded in a meta-language level. Information derived by the system will be considered in Section 4.2.2. The use of knowledge conveyed in the meta-statements is to reduce the size of the bottom clause and improve the e ciency of the reduction algorithm. The meta-statements available to the user are now described and explained. The user may specify which predicate symbols can appear in the de nition of the target concept. The determination declarations Mug93a] specify, for each predicate symbol, which other predicates can appear in its de nition. The predicates are therefore organised in a graph of dependencies and the system needs to have a de nition of the predicates in the leaves of this graph before being able to learn the predicates up in the graph. A mode declaration Mug93a] supply the information concerning the arguments of each predicate. They specify the type of the argument and if it is intended to be either an input or output argument. If a constant is to appear in a predicate its place may be signaled in a mode declaration for that predicate. There may be more than one mode declaration for each predicate symbol. The system may deal with non-determinate predicates. The number of possible outputs, for each combination of input arguments, is 40

4.2.1 The meta-language

bound by the recall number Mug93a]. The system uses the Theorem Prover backtracking mechanism to generate the alternative outputs for each combination of input arguments until a \recall number" of them has been generated or there are no more alternatives. During saturation a ground literal may be added into the ?H clause only if all its input terms are available and its negation is logically implied by B ^ei . A term is available if it is an input argument of ei or is the output of some literal in a previous iteration step of the saturation procedure. There is therefore a chain of dependency among the literals appearing in ?H . A limit on the chain size is de ned as maximum i-depth. The concept of i-depth is borrowed from Golem MF92]. The i-depth of a literal is the maximum i-depth of its input arguments. A term in the head of the predicate is said to have i-depth of zero. A term appearing for the rst time in a clause has an i-depth one greater than the bigger i-depth of the input terms of that predicate. The user may specify the maximum i-depth as a meta-statement.

+ +

Whenever there are functions of arity greater than zero there is a possibility of using some of their arguments as input terms to the predicates in the next iteration of saturation. If some of the arguments of a function are still a function the process of using its arguments may continue. By analogy with type selectors the user may de ne a maximum selector number intended to be the maximum level of recursion of a procedure that extracts the arguments of a function and calls itself recursively on the extracted arguments. If the user does not specify a selector value a default value of one is used. We may \compose" new terms by taking a function symbol and using the corresponding number of available terms as its arguments. The process may recurse and we may use the just constructed function and all the other available terms in the process of \constructing" a new term. The user may specify a maximum constructor number to limit the recursive calls of the constructor procedure. The default value for the constructor operator is zero. Both selector and constructor procedure respect the type information of the available terms. As referred in Section 4.2.1 the use of a maximum selector number and maximum constructor number also contribute to keep the number 41

of literals in ?H at a minimum. They achieve the desired e ect by restricting the number of available terms at each i-layer.

De nition 4.1 A clause is connected if every variable occurs at least

in two literals and at least once as an input argument. This de nition is similar but stronger than Rouveirol's de nition of connex clause Rou92]. A connected clause is connex but the other way around may not be true.

De nition 4.2 A theory is connected if all its clauses are connected.

The user can specify if the target theory is to be connected. The system assumes that the target theory is connected if not stated otherwise. This is not a severe restriction on the hypothesis language since every non connected theory can be converted into a connected one by adding extra monadic predicates about types, for example. As in the Progol system, the user may specify if a predicate have a commutative property. Considerable savings can be achieved by that piece of information as will be shown later in Section 4.5. The complexity results (see Section 4.2.4) show that the search time spent in the reduction algorithm grows exponentially with the number of literals in ?H . It is therefore desirable that only the absolutely necessary literals be present in ?H . The type and mode declarations, the recall number and the i-depth value already give a signi cant contribution to the reduction of the size of ?H . However there may be a signi cant reduction in the number of literals in ?H if the theory to be learned is connected and some of the literals have mode declarations with output arguments. The improvement may occur at two di erent levels. A coarse level where the type information play a central r^le and a ner level of actuo ally considering each individual literal. The coarse level procedure associates two numbers with each mode declaration of the predicates. The rst number is \minimum i-depth" and the second is \number of levels to completion". Having computed 42

4.2.2 Further reductions of ?H 's size

these two numbers the mode declaration of a predicate symbol is considered for saturation only if the current i-depth is greater or equal to the \minimum i-depth" and less than or equal to the \number of levels to completion". If a mode declaration for a predicate (P) speci es that a certain input argument of P is of type T, then the \minimum i-depth" of that argument is zero if the type is one of the input types of the clause's head, or N + 1 if there is an output type of a predicate with \minimum i-depth" N. The \minimum i-depth" of a predicate is the maximum of the \minimum i-depth" of all its input arguments. In a similar fashion, if the mode declaration for a predicate (P) speci es that a certain output argument is of type T and one of the output types of the clause's head is of type T then the \number of levels to completion" of that argument is zero. Otherwise it will be 1 plus the bigger \number of levels to completion" of the predicate set that uses P's outputs to produce all the outputs of the head. As far as the \number of levels to completion" is concerned it is the same thing producing the outputs of the head or producing inputs to a predicate that has no outputs. An advantage of the described use of type information is that it is computed just once when the training data is read in to the program and used in the saturation and reduction of all examples. The nal level takes advantage from the fact that all predicate's output must be used in some literal's input or the clause head's output. Literals that produce an output term that is not used elsewhere in the clause may be removed from ?H . A recursive procedure may start at the last i-layer and remove all the literals in that layer that do not produce a term used as input elsewhere in the body of the clause or as an output of the clause's head. Then the procedure applies the same operation to the literals of the previous i-layer and so on until it reaches the rst i-layer of literals in the clause. The current version of the saturation step is done layer by layer in a breadth- rst fashion. The literal's dependency graph is built and updated as the literals are generated. If the theory is connected some literals are going to be removed at the end of the saturation. The removal of the useless literals requires therefore that they have been stored and the dependency graph updated when they were generated and when 43

they are removed. A signi cant improvement could be achieved if the saturation is done in a depth- rst fashion. The literal's dependency graph would just be updated only if the literal produces outputs used elsewhere in the clause. The system only needs to keep track of the available terms at each i-layer.

The system accepts as input a training set made up of positive and negative examples as was described in Section 3.1. However, the positive examples may be non-ground. B is a set of non redundant de nite clauses. The system learns with one example ei at a time. The output is a non redundant theory H. The main control cycle of the learning system is speci ed in Algorithm 4.1. Each iteration of the cycle produces one clause of the nal H set. The positive examples covered by the clause hi produced in the current cycle are removed and the next iteration receives the remaining positive examples. An example e is covered by the current hypothesis clause hi if B ^hi j=ei . Each hi gets incorporated into B at the end of each cycle. The cycle terminates when there are no positive examples left. Since H is reduced at the end of each cycle, the result of the algorithm is an irreducible set of clauses that covers all the positive examples and is consistent with the negative examples. The clause produced in each iteration of the main control cycle is generated as follows. Each positive example of the current set of remaining positive examples is taken as a seed. An hypothesis is generated for each seed as described in Algorithm 4.2. The hypothesis generation is done in two steps. In the rst step an example is saturated and attened using B and C the set of constraints that include type, mode and determination declarations. The resulting saturated clause is then reduced in the second step using the dependency graph of literals built during saturation.

+ + +

4.2.3 The algorithm

Saturation

The purpose of the saturation is to collect all the \relevant" ground atoms. A \relevant" ground atom is one that may be derived from B 44

Algorithm 4.1 (General control cycle)

input:

E+ and E? B C

/* the training set */ /* background knowledge theory */ /* set of constraints */

H=; /* initial set of hypotheses */ E+ = E+ /* initial set of positive examples */ cur + while Ecur 6= ; do Best = NULL /* no initial best clause */ + for all ei 2Ecur do Clause = induceClause(ei, B, C) if Clause better than Best then Best = Clause Ecur = E+ - covered( Best ) cur S H = reduce( H Best ) S Best if notMadeRedundant( Best ) then B = B

endwhile output: endfor +

H

/* reduced set of hypotheses */

45

Algorithm 4.2 (Hypothesis generation)

input:

e+ i B C /* saturation */

/* a positive examples */ /* background knowledge theory */ /* set of constraints */

/* target predicate determinations */ Determinations = f x j x determines the target predicate in Cg ?H = atten( e+ ) /* initial partial clause */ i Terms = inputArgs( e+ ) /* Initial available terms */ i Graph = ; /* literal's dependency graph */ for i = 1 to maximum i-depth do UsableTerms = Terms for all predicate symbols pi 2 Determinations do for all mode declarations mi of pi do Calls = buildPredicateCalls( mi , UsableTerms ) NewLiterals = evaluate( Calls ) S atten( NewLiterals ) ?H = ?H Graph = updateGraph( Graph, NewLiterals ) S Terms = Terms newTerms( NewLiterals )

endfor endfor

/* reduction */ MaxDepth = 1 Best = NULL

repeat

endfor

Root = root( Graph ) /* root of the dependency graph */ Siblings = ; Best = search( 2, Root, Siblings, Graph, Best, MaxDepth ) MaxDepth = MaxDepth + 1 until fail or (Best 6= NULL and did not reach the depth limit) Best /* best clause seeded with ei */

output:

46

virtuoso(A) A = glenn gould ^ plays instrument(A,B) ^ B = piano ^ performance( A, B, superb). Figure 4.1: ?H (virtuoso(glenn gould),B,C) generated with the following mode and type declarations: virtuoso(+person);plays instrument(+person, ?instrument);performance(+person,+instrument,#quali cation). + means input argument. ? means output argument and # signals the place of a constant. The saturation process is done layer by layer starting with the current example as the head of the saturated clause. Apart from being derivable from B, the relevant ground atoms must be di erent from the head (avoiding tautologies in recursive predicates) and all its input terms must have been produced by already saturated atoms from previous i-layers. During the saturation process a dependency graph of the used literals is built. A literal is dependent on the literals that \produce" the terms it uses as the input arguments.

^ ei and satisfy the constraints C.

+

Flattening

As the relevant ground atoms are being collected, they are attened. Flattening a clause is a logical equivalence preserving operation that just introduces equalities. Flattening transforms a clause with terms of various depth of complexity into a logically equivalent one whose terms are at most of complexity depth 1. During attening each new constant is replaced by a new variable and an equality is introduced preserving the "connection" of the new variable with the constant. Each new function is replaced by a new variable and, again, an equality is introduced preserving the connection between the function and the new variable. The process is recursively applied to the arguments of the function. The attening by introducing equalities is di erent from the attening by introducing new predicates with predicate symbol identical 47

to the functions being attened. Flattening by introducing predicates is done in the ITOU system Rou92]. Figure 4.1 shows as an example the ?H for the example of Chapter 3 Section 3.1.

Reduction

+

The result of the saturation step is the most speci c clause, ?H , that subsumes ei relative to the background knowledge. Usually the ?H clause is too speci c to be of any interest. It often just subsumes ei . ?H must then be generalised by the process of reduction. Unlike a bottom-up reduction algorithm like the one used in Golem MF92], the reduction step implements a top-down approach. The reduction's top-down approach traverses the generalisation lattice starting with the most general clause that have the same predicate symbol of the target concept. The search then proceeds by specialising the current working clause by adding literals to the body of it. While doing this the system computes the coverage of the partial hypothesis (each partial hypothesis is a path from the root). The goal of the search is to nd the shortest hypothesis that covers the maximum number of positive examples as possible and is consistent with the negative examples. Unlike the reduction algorithm of Golem, the algorithm used is provably complete. It will nd the shortest clause (measured in number of literals) that have the best positive cover and are consistent with the negative examples. The search strategy is based on an iterative deepening procedure as shown in Algorithm 4.3. The iterative deepening search is a repeated depth- rst search in which there is a depth bound that is increased at each cycle until a solution is found.

+ 2

The rst maximum depth is set to zero, so the search just considers the head of the partial clause. Then the maximum depth is increased and another depth- rst search is done with the new depth bound. The search stops when boundaries of the search have not been reached indicating that all paths have been pruned earlier than the depth limit. Usually ?H is a long clause (with many literals). Bottom-up reduction methods start with this most speci c clause and drop \unnecesCover is the number of positive (positive cover) and negative examples (negative cover) derivable by an hypothesis ^ B.

2

48

Algorithm 4.3 (Iterative deepening search) input: Clause = h b : : :bn?

bn Siblings Graph Best MaxDepth

1 1

/* Partial clause */ /* next literal to be tried out */ /* Dependency graph */ /* Best clause so far */ /* search depth limit */

if search depth limit reached then return else if complete( h b ^ : : : ^bn? ^bn ) then if h b ^ : : : ^bn better than Best then Best = h b ^ : : : ^bn else do

1 1 1 1

enddo else do

NextLiteral = best( Siblings ) Siblings = Siblings - NextLiteral Best = search( Clause, NextLiteral, Siblings, Graph, Best, MaxDepth )

enddo

Suns = expand( bn ) Sun = best( Suns ) S NewSiblings = (Suns - Sun) Siblings NextClause = h b1 : : :bn?1 ^bn Best = search( NextClause, Sun, NewSiblings, Graph, Best, MaxDepth ) /* best clause so far */

output: Best

49

sary" literals . Therefore bottom-up methods are much more memory consuming because the clauses they consider in the early stages of the search are, in general, long. On the other hand top-down methods start with a positive unit clause and carry on adding literals to it. Since usually the nal hypothesis has not many literals the clauses considered by top-down methods are much more smaller than the ones considered by bottom-up methods. Usually the solution hypothesis is near the top of the generalisation lattice (clauses with few literals) so the top-down search is less memory consuming than bottom-up methods, in general. Indlog's reduction algorithm guarantees to nd the shortest clause with the biggest cover each time an example is saturated. It does not, however, guarantee that the nal set of clauses will have a minimum number of literals. The nal set of clauses H obtained by Indlog is order dependent. Finding the best set will require computing all possible sets and then choosing the one with the minimum number of literals. This is usually feasible since the number of clauses in H is usually fairly small. A better criterion is used in Progol and is based on compression. Compression presents a method of choosing among consistent theories that have di erent positive cover and have di erent sizes. The measure currently used by Progol gives a poor estimate of the \real" compression of a theory. Future research will look at incorporating better estimates into Indlog.

3

Consider the case in which each literal has at most b dependent literals on the literal's dependency graph. Consider also that each node of the search space needs a unit of memory storage. Let d be the depth limit of the search. The iterative deepening algorithm will at each node generate its descendants, store them and pass them as siblings to the next node to be expanded. Since at each iteration a depth- rst search is carried out the next node to be expanded will necessary be a direct descendant of the current node. The root will need one storage unit. The root generates b descendants, therefore the needs for the rst level are b units of storage. The node to be expanded generates b new

A literal is "unnecessary" if removing it does not reduce the coverage of positives or does not increase the coverage of negatives

3

4.2.4 Memory requirements and computation complexity for Reduction

50

nodes and passes on b - 1 siblings. The requirements for the second level are b + (b - 1). The expanded node at the second level will again generate b descendants and will pass onto the next level 2 (b - 1). By induction the amount of storage needed at level i is b + (i-1)(b-1). The total amount of storage needed to search until level d is the sum of the amounts from the root to that level (1) + (b) + (b + (b ? 1)) + : : : + (b + (i ? 1) (b ? 1)) = = 1 + 1 d (b + (b + (d ? 1) (b ? 1)) = 2 = 1 + 1 (b + 1) d + 1 (b ? 1) d : 2 2 The amount of storage is proportional to d and is therefore polynomial on the depth limit. Let p be the number of predicate symbols and a their maximum arity. If the maximum number of \available" terms is t them a pessimistic estimate of b is b = p ta. In the ying domain the amount of memory storage is quite important. Since there is a large number of constant values the branching factor b is large. The depth limit is determined by the number of literals in the target clause. As an example consider that the value for the elevators is determined by the value of altitude , climb rate and roll value . If we code the intervals for their values as two lessThan literals a depth limit of six is required.

2 2

Let Li be the number of nodes visited at level i. If the depth limit is d the search will visit d + 1 times the root, d times the rst level, : : :, and 1 time the d level. The number of nodes visited is therefore

d X i=0

(d + 1 ? i)Li:

0

The number of nodes visited at the root level L is 1. The nodes visited at the subsequent levels is L = b: 1 3 b L = 2 L (b + (b + L ? 1)) = 2 b ? 2

1 2 1 1 2

51

9 6 b L = 1 L (b + (b + L ? 1)) = 8 b + 8 b ? 9 b + 4 : 2 8 The number of nodes at each level is proportional to bd2 . The total number of nodes visited is proportional to d bd2 .

3 2 2 4 3 2

4.3 Admissible pruning strategies

The main pruning facility is based on a property of coverage that follows trivially from the subsumption generalisation ordering: adding a new literal to the body a clause cannot increase the coverage of the clause. Other strategies used in Indlog are now described to prune the search space speci ed by ?H . The pruning consists in ruling out those partial clauses that have a positive cover less than the best positive cover found so far. In case of equal positive cover the clause may be ruled out if it has a length greater than the best clause found so far. The literal's dependency graph provide information useful to avoid considering the descendants of a literal that has been pruned. Those descendant literals will only be considered if there are alternative combinations of other literals that may produce their input arguments. During reduction we may exploit as well the \number of levels to completion". During the iterative deepening search if the \number of levels to completion" of the predicate we are considering joining the partial clause is bigger than the di erence between the current path length (dependency graph level) and the current level boundary it is useless to join that literal since the clause will never be able to produce the outputs of the head. Consider the following situation where clause's head has input type T and output type T and the possible predicates for the clause body have the following mode declarations: p1(+T , -T ). p2(+T , -T ). p3(+T , -T ). When doing an iterative deepening search is useless to consider all possible maximum depth boundaries (0, 1, 2, : : :). In this case the increase in the maximum depth boundary should be 3 and not one. The

1 4 1 2 2 3 3 4

52

shortest hypothesis can not be found before level 3. If there is no interesting hypothesis at level 3 then the next level boundary should be set to 6 and so forth. Type information may be also used to build up schemata of clauses and further reduce search. A similar technique called rule model is used in the system RDT KW92]. A problem with a systematic search strategy like depth- rst search is that it may visit lots of nodes before reaching a \good" one. Once it commits with a descendant of a node it visits all its descendants before considering the node's siblings. To reduce the e ect of this \early commitment" the best path is recorded in each depth- rst search. The next depth boundary starts searching the best path of the previous depth bound search. Although this procedure is an heuristic ordering of the next path to start with, it does not a ect the completeness of the search and may improve its e ciency.

4.4 Problems in the ying domain

4.4.1 Learning constants in the Herbrand Universe of E+

Introducing constants in literals in the body of a clause may be very important. In the case of the ying domain when learning basic manoeuvres, it is essential to learn the applicability conditions for those operators. Those applicability conditions make use of constants. For example, when learning to control the elevators a rule like elevators( ElevatorsValue, Time ) altitude( Time, Altitude ), less than( Altitude, 2010 ), less than( 1990, Altitude ), DeltaAltitude is Altitude - 2000, ElevatorsValue is k1 * DeltaAltitude.

is very meaningful. 53

How can the constants 2010, 1990, 2000, k1 be learned ? The current solution adopted by both Indlog and Progol consists in special constant declarations (the user speci es the place where the constants will appear) and in using the recall number to produce potential relevant constants (the user has to provide a way of generating potential useful constants). There is a problem if the data is spread along a wide range of values like in the FS domain altitude may vary between 0 ft and 33000ft. With this approach a big number of useless constants have to be produce even if the values are 1,2,1000 and 10001, where the interval to learn would be 1,1001]. The main di culty in learning constants whose values are \available" in H(B ^ E ) (Herbrand Universe of B ^ E ) is that a system like Progol only have access to a small subset i.e (B ^ ei ) during saturation. The entire set is only available at reduction time. All \relevant" atoms and terms are collected during saturation using just B and the data of only one example. An algorithm that collects information from all the examples before inducing the hypotheses would not have that myopic problem. One way of learning those constants requires an extension of the original framework. n In the usual framework E is taken to be V ei. The posterior i su ciency condition is B ^ H j= e ^ : : :en that is to say that the set of hypotheses as to \explain" all the positive examples. The learning algorithms learn a clause at a time that \explain" part of the positive example set. The explained examples are removed from the training set and a new clause is induced. The process iterates until all positives are explained or no further clause can be induced. Using the posterior necessity condition the hypothesis clause to be induced have to explain at least on example. That is to say B ^ hi j= e _ : : : _ en Since e ^ : : : _ en j= e _ : : : en

+ + + + =1 1 1 1 1

54

using transitivity of j= we have

B ^ hi j= e _ : : : _ en We may now derive the most speci c clause that explains all positives using the formula B ^ E j= hi where E is taken as the disjunction of the positive examples. ?H (B, E , C) is the negation of the conjunction of all ground positive atoms derivable from B ^E using the constraint parameter set C. This non-Horn clause is the bottom element of a non-Horn clause subsumption lattice and is the most speci c clause that subsumes all positive examples. All lattices derived using just one example at a time (in a Progol fashion) are sub-lattices of this one. Further, all bottom elements of those lattices (?H (B,ei,C)) subsume the bottom element of the latter lattice (?H (B,E ,C)). Since we are just interested in learning an Horn clause, we just need to traverse the sub-lattice "corresponding" to each example. The advantage is that some literals that belong to the intersection of di erent sub-lattices (corresponding to di erent examples) did not appear in the lattice built in a Progol fashion. For those literals in the \frontier", if there is an input argument that does not belong to the sub-lattice corresponding of the current example, that argument is taken as a constant. Let E be p(1), p(2), p(3) and lessEqThan(+,+) be the only predicate in B. (?H (B,E ,C)) is

1 + + + + + + +

p(1) _ p(2) _ p(3) lessEqThan(1,1) ^ lessEqThan(1, 2) ^ lessEqThan(1, 3) ^ lessEqThan(2, 2) ^ lessEqThan(2, 3) ^ lessEqThan(3, 3) If we start the reduction process starting at the node corresponding to p(1) we may see, for example, that lessEqThan(1,3) belongs to the "frontier" of the sub-lattices of the rst and third example. The term \3" would not have been produced when saturating p(1) in the Progol 55

state(elevators, Rollers, Altitude, ClimbRate) lt(rol1, Rollers), lt(Rollers, rol2), lt(alt1, Altitude), lt(Altitude, alt2), lt(clb1, ClimbRate), lt(ClimbRate, clb2). Figure 4.2: Prototype of a clause in a theory for controlling the elevators. saturation procedure. Since lessEqThan(1,3) has a term (\1") that is a "natural" descendant of the terms of p(1), the term \3" is taken as a constant and lessEqThan(1,3) is included (taking \3" as a constant) in the sub-lattice of p(1). Some advantages of this approach are that saturation is done only once and then reduction is done for each of existing example. In this approach the user does not need to specify the places where the constants are bound to appear. There is no need to use recall number (like in lessEqThan) to produce potential constant values by means of backtracking. There will be no major problem if the range of values to look at is wide since no unnecessary, non available in the data, constants will be generated. The user just needs to provide the \natural" mode declaration of lessEqThan(+,+). The main problem is that the size of ?H increases exponentially (in general) with the number of examples. Memory store size could be a problem. Om the other hand, the recall number allows the system to generate constants not present in H(B, E ).

+

4.4.2 Dealing with large amounts of examples

The Indlog system is now being applied to learn from the ying data. The rst set of experiments consist in learning the elevators and rollers controls for the straight and leveled ight. The target theory is a set of clauses of the form of the prototype shown in Figure 4.2. This kind of theory is very similar in form to the rules extracted from a decision tree. The aim in this early stage is to achieve with an ILP system at least the same results as the ones reported in Chapter 2. Further stages of the project are reported in the concluding chapter. 56

Give the training set: positives: 2 3 4 5 6 7 10 12 15 16 17 18 19 25 26 27 28 29 30 negatives: 0 1 8 21 31 the reported system may already learn the following theory: p(A) ltEq(10, A), ltEq(A, 19). p(A) ltEq(2, A), ltEq(A, 7). p(A) ltEq(25, A), ltEq(A, 30). Figure 4.3: Training set and Indlog learned theory for learning a set of intervals. ltEq means less than or equal. Although very simple in aspect that kind of theory is hard to learn with existing ILP systems like Golem, FOIL or MIS. It requires the use of non-determinate predicates that Golem, for example, does not handle. It makes use of constants chosen from a very large set of available constants in the data set. Systems like MIS would require the user to specify all possible constants and will search through all possible combinations without using any pruning possibility of clever restriction to a small set of relevant constants. As far as the learning task is concerned, learning a clause of this theory or learning a clause like the prototype shown in Figure 4.2 has the same di culties. A major problem, however, is the amount of data from which the theory is to be learned. The number of positive examples extracted for 30 training ights are nearly two thousand and the number of negative examples nearly four thousand. Under these circumstances the amount of time taken by the theorem prover and the amount of memory storage required by the algorithm are key points in the overall performance of the system. A technique of windowing BS94] similar to the one reported in Section 2.1.2 of Chapter 2 or a technique of layered learning as described 57

234567 cycle 1: 2, 6] 7] cycle 2: 2, 7]

10 12 15 16 17 18 19 10, 17] 18, 19] 10, 19]

25 26 27 28 29 30 25, 29] 30] 25, 30]

Figure 4.4: Sequence of theories to learn the set of intervals of the training set previously presented in this section. in Mug93c] are not applicable if completeness of the results should be kept. When using a subset of the training set the nal theory although covering the complete training set may not be guaranteed to be the best on the complete training set. The clauses of such theory are maximum cover clauses only for the examples of the \window". An algorithm capable of using a small amount of resources when learning from the learning data would perform iteratively improving on previously learned theories. In the particular problem of learning a wide range interval the system may start by learning a large set of small intervals and progress towards a smaller set of larger intervals. A set of axioms about transitivity of the less than relation must be provided as background knowledge. The algorithm will use a small recall number to constrain the number of literals in ?H and therefore the search space. With a small recall number the previously learned intervals will be increased by small values but the resources spend are tightly bounded. The general control cycle has to be modi ed to accomplish the iterative improvement of theories and the system has to be able to learn from non-ground examples. With the training set of Figure 4.3 and with a recall number of 3 the learned theories at each cycle would be the ones depicted in Figure 4.4. To improve e ciency, the constants used in the less than relation may be collected from the Herbrand Universe of E (suggestion of Srinivasan ,1994, in a private communication).

+

Some systems like Progol rely on heuristics to guide the search during reduction. Progol, for example uses an A* algorithm in which the heuristic function is computed using the number of positive and negative examples covered by each partial clause. The exact number of 58

examples covered are therefore important in that kind of search. It requires to use the theorem prover to evaluate every example. In the ying domain the number of examples is large and the time spent on the coverage tests is the dominant overall time. An algorithm like the iterative deepening does not rely on heuristics. If a partial clause covers more than the allowed number of negative examples it must be specialised otherwise it is complete and no further search rooted on it should be done. In this circumstances the algorithm may do a \lazy" evaluation of the negative examples. For each evaluation the algorithm stops as soon as there is no more negative examples to be tested or it has covered already one more than the allowed number. The \lazy" evaluation of the negatives may also be done in an heuristic-based algorithm on the literals of the last level since they are not going to be expanded any more. Muggleton (1994 | private communication) suggested that the literals in ?H could be ordered by generality if proper transitivity axioms exist in the background knowledge. If for example l ! l , l would succeed l in that order. If, during reduction, a partial clause had already a literal l (= X < 3, for example) and was considering the possibility of l (= X < 5) the ordering information would be valuable. Every literal up in the ordering may be ruled out of consideration and therefore l will not be joined to the current partial clause. An advantage of the referred ordering of the literals in ?H is that it is done only once after saturation. An extra advantage could be taken also when the number of examples is large and theorem proving is the bottleneck step. If two siblings are ordered by the transitivity axioms like the previous l and l then the set of examples covered by l is a subset of the one covered by l . When evaluating l after having evaluated l only the examples in the di erence set need to be evaluated since the result on the remaining ones is known.

1 2 2 1 1 2 2 1 2 1 2 2 1

4.5 Results

This section presents some results of improved e ciency due to the techniques described in the previous sections. Results were carried out in a sun sparcstation 10. 59

example mult(5,11,55)

original

j?H j

664

potentially potentially removed by the removed by jsimpli ed ?H j connected commutative property property 531 232 81

Table 4.1: Example of simpli cation of the number of literals in ?H when both the information that some functions are commutative and that the theory is connected. mult stands for multiplication. Table 4.2: Statistics concerning the search through the generalisation graph during reduction. The columns compare the result of using axioms of transitivity in the problem of learning an integer interval between 10 and 20. The example used was p(12).

Statistics without axioms with axioms maximum heap (KB) 27356 26108 number of nodes visited 69 50 number of nodes expanded 13 13 average number of direct 2 2 descendants of each node average number of siblings inherited 3.308 1.846 number of complete clauses produced 71 52

Table 4.1 shows same savings when taking advantage of the connected property of the target theory.

60

Statistics not connected theory Connected theory Maximum Heap used (KB) 116296 104832 number of nodes visited 426 373 number of nodes expanded 31 31 average number of direct 7.871 6.161 descendants of each node average number of siblings inherited 6.839 6.839 number of complete clauses produced 28 20 cpu time 5.933 2.933

Table 4.3: E ect on the reduction search of the connected property of a theory. The saturated example is mult( 5, 3, 15).

61

Chapter 5 Conclusions

The results from using decision tree induction show how the inability to use background knowledge together with the use of a propositionallogic level representation may limit or prohibit the solution of complex problems. Although some positive results were obtained the ndings are still too limited for the solution of the full task of learning models for the basic manoeuvres. The results show that for complex manoeuvres the decision trees are unintelligible and that the auto-pilot is strongly dependent on the training instances. Any change to the ight plan requires the generation of a new auto-pilot. Further, an auto-pilot trained in some given circumstances (no existing wind) may no longer be adequate if the testing circumstances are di erent ( ying with existing wind). All these suggest that a more powerful knowledge representations is required, and prompt the use of a rst-order logic learning system. The thesis is that ILP systems capable of learning Horn clause theories from examples are adequate for the ying domain. However, the results in Chapter 3 show that current ILP system lack essential properties to make them immediately applicable to such a complex task. The strongest limitation of the ILP systems described in Chapter 3 is that they have di culties in search large hypothesis spaces. The systems are either ine cient (MIS) or do not justi ably restrict the search space size by using information given by the user or extracted from the available data. Further, systems like FOIL or Golem do not provide any guarantees on the optimality of the generated set of hypotheses. In the case of FOIL the nal theory is not even reduced and may contain 62

clauses that cover no positive examples. The Indlog prototype overcomes some of these limitations by restricting the search space and conducting an e cient traversal of that space. Indlog restricts the search space by rst building the clause at the bottom of the generalisation lattice and then removing from that clause irrelevant atoms. This is similar to the Progol system. The e ciency during the graph traversal is achieved by means of pruning techniques that preserve the completeness of the search. The choice for the reduction algorithm, however, trades speed for memory storage. Indlog is, in general, more slow than Progol, but its reduction algorithm requires less memory storage than the A* search used by Progol. Experiments will be carried out to further investigate this point. It is therefore more adequate to domains characterised by large amounts of data like the ying domain. All the pruning techniques rely on the careful use of information provided by the user or extracted automatically from the training data. Clearly, lack of information does not a ect the nal result but may signi cantly alter the e ciency of the search. The report suggests three important ways in which background knowledge may be useful. The rst use of background knowledge concerns the restriction of the search space by building a small but still su cient ?H clause. The reduction of ?H can be done either by imposing more constraints on the hypothesis language or by taking advantage of some properties of the predicates being used in ?H . The second use of background knowledge concerns new pruning techniques during reduction. The third use of background knowledge concerns reducing the number of nodes of the generalisation lattice. A stronger generalisation ordering may collapse several nodes of the subsumption lattice into a single new equivalent class in the new ordering. A new set of complete operators for traversing the new lattice would then be required. It would be desirable that the set of operators would be powerful enough to do a non-redundant search. The results of Chapter 4 are aimed at demonstrating the rst two points raised. The issue of generalisation orderings has not been addressed in this report. The research to be done in the rest of the thesis will be mainly concerned with using background knowledge to collapse nodes in the generalisation lattice. 63

Learning models of basic manoeuvres in a ight simulator will remain the main test-bed for validating all future developments to the Indlog system.

64

Appendix A Decision Trees results

A.1 Human pilot ight data

This section presents the tables with the statistics concerning the 30 ights used in the training set for c4.5.

A.2 Auto-pilot learning data and results

This section presents information concerning the auto-pilot. Information about the number of events used for generating the trees. Tree's statistics are also presented.

65

mission 1 3 4 5 7 8 9 11 12 13 15 16 17 19 20 21 23 24 25 27 28 29 31 32 33 35 36 37 39 40

y 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

sy heading sheading 0.0 358.47 1.54 0.0 358.45 1.56 0.0 358.49 1.52 0.0 358.48 1.52 0.0 358.49 1.51 0.0 358.46 1.54 0.0 358.48 1.52 0.0 358.51 1.49 0.0 358.51 1.49 0.0 358.49 1.52 0.1 358.50 1.52 0.0 358.50 1.50 0.0 358.50 1.50 0.0 358.49 1.52 0.1 358.49 1.53 0.0 358.52 1.48 0.0 358.48 1.52 0.0 358.48 1.52 0.1 358.48 1.53 0.0 358.49 1.52 0.0 358.47 1.53 0.1 358.46 1.55 0.0 358.50 1.50 0.0 358.53 1.47 0.0 358.49 1.52 0.0 358.49 1.52 0.0 358.52 1.48 0.0 358.52 1.48 0.1 358.49 1.52 0.0 358.51 1.49

Table A.1: Statistics for the take o state. In the ying plan y = 0 ft and heading = 360 .

66

mission 1 3 4 5 7 8 9 11 12 13 15 16 17 19 20 21 23 24 25 27 28 29 31 32 33 35 36 37 39 40

y 0.6 1.7 -0.6 0.0 1.7 3.2 0.2 0.0 0.0 0.8 -0.6 0.1 0.0 0.1 -1.0 0.0 0.0 0.0 -1.0 0.1 0.0 -1.4 0.0 0.0 0.4 0.4 0.4 0.8 -1.0 0.0

sy heading sheading climbrate sclimbrate roll sroll 0.8 360.00 0.00 2686 880.5 -0.0003 0.0006 2.2 360.00 0.00 2602 748.5 0.0004 0.0007 0.5 357.07 2.32 2599 853.3 -0.0008 0.0007 0.0 360.00 0.00 3007 834.7 -0.0003 0.0005 2.1 360.00 0.00 2616 820.1 0.0002 0.0009 4.1 360.00 0.00 2792 737.2 0.0021 0.0017 0.4 360.00 0.00 2660 749.0 0.0000 0.0000 0.0 360.00 0.00 2817 777.3 0.0000 0.0000 0.0 360.00 0.00 2858 965.4 -0.0001 0.0004 1.1 360.00 0.00 2859 849.1 -0.0001 0.0010 0.5 357.61 2.39 2726 795.0 -0.0004 0.0005 0.3 360.00 0.00 2932 763.2 -0.0003 0.0006 0.0 359.91 0.65 2820 774.8 -0.0003 0.0005 0.2 360.60 0.00 3170 900.2 0.0002 0.0004 0.0 355.23 0.00 2901 830.6 -0.0008 0.0004 0.0 359.00 1.95 2904 853.3 -0.0004 0.0005 0.0 360.00 0.00 3010 876.7 0.0000 0.0000 0.0 360.00 0.00 2777 847.2 -0.0001 0.0003 0.0 355.23 0.00 3166 906.0 -0.0010 0.0000 0.3 360.00 0.00 2708 903.8 0.0002 0.0004 0.0 360.00 0.00 2748 778.3 0.0000 0.0000 0.7 355.23 0.00 2701 824.1 -0.0010 0.0009 0.2 360.00 0.00 2643 816.5 0.0000 0.0000 0.0 360.00 0.00 2941 849.0 -0.0004 0.0005 0.6 360.00 0.00 2783 709.6 -0.0002 0.0005 0.8 360.00 0.00 2935 750.1 0.0002 0.0004 0.7 360.00 0.00 2855 881.6 -0.0001 0.0003 1.2 360.00 0.00 2611 754.5 0.0002 0.0006 0.0 355.23 0.00 2912 755.6 -0.0010 0.0000 0.0 359.98 0.33 2883 831.9 0.0000 0.0000

Table A.2: Statistics for the climbing state. In the ying plan y = 0 ft, heading = 360 , roll = 0.0 rad, climb rate = 3000ft/h.

67

mission 1 3 4 5 7 8 9 11 12 13 15 16 17 19 20 21 23 24 25 27 28 29 31 32 33 35 36 37 39 40

y 2.5 9.8 -3.6 -0.8 8.3 19.9 2.4 -0.3 -1.1 1.8 0.3 -0.2 -0.4 3.1 -1.4 -2.0 -0.1 0.0 -2.5 4.8 0.5 4.0 0.9 0.0 -0.8 4.3 2.5 6.1 -1.0 0.0

sy heading sheading roll sroll 0.7 356.69 2.21 -0.0029 0.0003 1.1 358.85 2.05 -0.0044 0.0013 1.3 355.23 0.00 -0.0010 0.0007 0.9 355.51 1.11 -0.0017 0.0005 0.7 358.16 2.33 -0.0041 0.0010 2.4 358.43 2.25 -0.0128 0.0056 2.5 357.49 2.39 -0.0036 0.0022 1.6 356.81 2.25 -0.0018 0.0016 1.7 356.56 2.14 -0.0023 0.0012 2.6 356.20 1.93 -0.0062 0.0013 0.5 360.00 0.00 0.0000 0.0000 1.2 355.58 1.25 -0.0049 0.0017 0.5 355.23 0.00 -0.0015 0.0005 1.1 358.04 2.35 -0.0019 0.0011 0.5 355.43 0.95 -0.0005 0.0007 1.1 355.23 0.00 -0.0004 0.0008 0.7 357.66 2.39 -0.0027 0.0023 0.0 357.23 2.36 -0.0016 0.0005 0.6 356.06 1.81 0.0004 0.0010 2.0 360.00 0.00 -0.0010 0.0008 0.9 357.15 2.34 -0.0014 0.0008 6.9 359.04 1.91 -0.0004 0.0067 0.3 357.62 2.39 -0.0016 0.0006 0.0 357.88 2.28 -0.0003 0.0005 3.3 356.03 1.78 -0.0046 0.0017 0.8 359.52 1.44 -0.0023 0.0011 0.6 357.30 2.37 -0.0033 0.0010 1.0 360.00 0.00 -0.0020 0.0007 0.0 355.23 0.00 -0.0006 0.0005 0.0 360.00 0.00 -0.0007 0.0005

Table A.3: Statistics for the levelling state. In the ying plan y = 0 ft, heading = 360 , roll = 0.0 rad.

68

mission 1 3 4 5 7 8 9 11 12 13 15 16 17 19 20 21 23 24 25 27 28 29 31 32 33 35 36 37 39 40

y 13.7 8.6 6.5 23.4 4.0 15.5 15.8 8.0 9.7 19.4 1.3 15.4 0.6 7.1 8.5 -5.9 15.3 4.5 11.9 0.2 7.6 31.7 17.8 12.0 -4.5 3.2 2.5 14.9 14.3 1.6

sy heading sheading 26.2 357.44 2.38 13.3 357.36 2.37 34.4 357.76 2.38 46.5 357.97 2.36 23.1 357.36 2.37 34.2 357.68 2.38 33.0 357.26 2.36 20.4 357.17 2.34 54.9 357.72 2.38 38.3 357.50 2.38 19.6 357.32 2.37 38.6 357.20 2.35 25.8 357.26 2.36 24.7 357.85 2.37 20.6 357.55 2.38 36.7 357.58 2.38 34.4 357.87 2.37 39.4 357.37 2.37 27.9 357.13 2.33 22.7 357.26 2.36 26.5 357.34 2.37 50.7 357.47 2.38 17.8 357.60 2.38 27.0 357.69 2.38 27.8 357.75 2.38 23.4 357.44 2.38 27.6 357.56 2.38 41.5 357.55 2.38 23.4 357.62 2.38 17.3 357.65 2.38

roll -0.0001 -0.0018 -0.0005 -0.0014 -0.0007 -0.0003 0.0001 -0.0003 -0.0003 -0.0015 -0.0002 -0.0017 0.0002 -0.0005 -0.0002 -0.0003 -0.0013 -0.0004 -0.0016 -0.0001 0.0005 0.0006 -0.0008 -0.0007 -0.0003 0.0008 0.0007 -0.0008 -0.0015 0.0002

sroll 0.0112 0.0108 0.0128 0.0117 0.0086 0.0152 0.0160 0.0066 0.0171 0.0138 0.0061 0.0167 0.0087 0.0079 0.0103 0.0112 0.0127 0.0100 0.0099 0.0133 0.0109 0.0163 0.0081 0.0125 0.0166 0.0159 0.0196 0.0251 0.0196 0.0093

clbr -0.3 21.0 -5.3 -13.4 4.6 11.5 -22.3 -41.3 5.0 5.4 2.2 -13.9 1.1 2.5 -15.3 1.8 9.0 9.1 3.8 -10.7 15.6 -5.6 5.6 0.7 4.7 8.5 19.7 -10.0 6.3 15.9

sclbr 540.8 563.4 314.9 220.7 310.7 494.5 327.9 425.1 289.4 318.9 328.7 538.1 418.9 409.3 328.8 318.3 285.6 277.2 400.4 277.4 441.3 461.8 203.7 409.5 266.1 419.5 318.4 483.6 328.1 267.6

alt 1531 1493 1508 1491 1507 1509 1510 1509 1497 1496 1500 1504 1511 1513 1505 1520 1505 1520 1508 1508 1519 1505 1516 1495 1525 1503 1512 1505 1509 1529

salt 46 40 32 19 19 40 23 38 32 33 41 44 27 35 24 28 30 25 37 17 29 32 22 27 24 30 26 34 27 27

Table A.4: Statistics for the straight and levelled ight state. In the ying plan y = 0 ft, heading = 360 , roll = 0.0 rad. clbr stands for climb rate, and alt stands for altitude.

69

elevators taking o 72 climbing 718 levelling 244 stlv 2140 turning 2193 aligning 3996 descending 527

rollers 57 110 81 1536 3163 6585 278

Table A.5: Number of events in each state for elevators and rollers.

control and state nodes height elevators at turning 2465 36 rollers at turning 2417 34 rudder at turning 165 20 elevators at aligning 818 40 rollers at aligning 175 15 rudder at aligning 665 27 Table A.6: Some tree statistics for the complex manoeuvres such as levelled turn and align.

70

Appendix B Logic Programming De nitions

This appendix lists Logic Programming de nitions used in this report. The de nitions were collected in Hog90].

B.1 The language

Let alphanumeric be the set f0,: : :,9,a,: : :,z,A,: : :,Zg (of digits and upper and lower case letters). A variable symbol is a upper case letter followed by zero or more alphanumeric symbols. A function symbol is a lower case letter followed by zero or more alphanumeric symbols. A predicate symbol is a lower case letter followed by zero or more alphanumeric symbols.

De nition B.1 A term is recursively de ned as follows: a variable is a term.

a function symbol followed by a bracketed n-tuple of terms is a function. A function is a term. If n is zero then the function is called a constant and we may omit the brackets. n is called the arity of the term.

Examples of terms: Human is a variable. peter is a constant. 71

of terms. If n is zero then we may omit the brackets. n is called the arity of the predicate.

age( peter ) is a function of arity 1. A predicate is a predicate symbol followed by a bracketed n-tuple

De nition B.2 A w is de ned inductively as follows:

a predicate is a formula (called atomic formula or,more simply an atom). If F and G are formulas, then so are (:F), (F^G), (F_G), (F!G) and (F$). If F is a formula and x is a variable, then (8xF) and (9xF) are formulas.

De nition B.3 A literal is an atom or the negation of an atom. In

the rst case is called a positive literal and in the second case is called a negative literal.

De nition B.4 A clause is a nite (possibly empty) set of literals. The empty clause is denoted by 2. A clause represents the disjunction of literals. It can, therefore, be represented as L _ L _ : : : _ : Li _ : Li _ : : : _ Ln or, equivalently as L _ L : : : L i ^ Li ^ : : : ^ Ln . A clause with no variables is called a ground clause. De nition B.5 A de nite clause is a clause which has exactly one

1 2 +1 1 2 +1

positive literal. A de nite clause with no negative literals is called a de nite unit clause. The positive literal is called the head of the clause. The set of all negative literals is called the body of the clause.

De nition B.6 A negative clause is a clause which has zero or more De nition B.7 A Horn clause is a clause with at most one positive

72

negative literals and no positive literals. The empty clause is a negative clause. In the Logic Programming context a negative clause is called a goal.

literal.

De nition B.8 A program clause is of the form head

b1 ^ : : : ^ bn where head is an atom and each bi is of the form A or not A and A is an atom. De nition B.9 A set of clauses is called a clausal theory. The empty clausal theory is denoted by .

De nition B.10 A set of ground atoms is called a model. De nition B.11 The Herbrand Universe of a Language L, HL is the

B.2 Models and substitutions

set of ground terms formed from the function symbols of L. If L contains no constants, we additionally use a new symbol f0 when constructing 0 HL . De nition B.12 The Herbrand Base of a language L, B(L) is the set of ground atoms formed from the predicate symbols of L and from the terms of HL. De nition B.13 A theory T is consistent if it has a model, otherwise it is inconsistent. De nition B.14 Let F1 and F2 be two w s. F1 logically entails F2, or F1 j= F2 i every model of F1 is a model of F2 . De nition B.15 Let F1 and F2 be two w s. F1 syntactically entails F2 , or F1 `I F2 i F2 can be derived from F1 using the set of inference rules I. De nition B.16 The set of inference rules I is said to be deductively sound and complete i F1 `I F2 whenever F1 j= F2. In this case the subscript may be dropped (F1 ` F2). De nition B.17 A w F is said to be satis able if there is a model for F and unsatis able otherwise. F is unsatis able i F j= 2. De nition B.18 Let = v1=t1; : : :; vn=tn. is said to be a substitution when each vi is a variable and each ti is a term, and vi = vj ! i = j. Let E be a w or a term and = v1=t1 ; : : :; vn =tn be a substitution. The instantiation of E by , written E , is formed by replacing every occurrence of vi in E by ti .

73

B.3 Resolution and deduction

=s .

De nition B.19 A substitution is uni er of two terms t and s if t De nition B.20 A uni er of two formulae t and s is a most general

uni er if all uni ers 0 can be decomposed into a composition ( 00) = 0 . mgu(t,s) denote the most general uni er of t and s. De nition B.21 If C = fAgS C0 and D = f:A0g S D0 S clauses, are and = mgu(A, A0) then the resolvent of C and D is (C0 D0 ) .

74

Bibliography

C. W. Anderson and W. T. Miller. A set of challenging control problems. MIT Press, 1991. Eds. Miller, Sutton and Werbos. Blu75] Blum, L. and Blum, M. Towards a mathematical theory of inductive inference. Information and Control, 28:125{155, 1975. BMV91] I. Bratko, S. Muggleton, and A. Varsek. Learning qualitative models of dynamic systems. In Proceedings of the Eighth International Machine Learning Workshop, San Mateo, Ca, 1991. Morgan-Kaufmann. BS94] M. Bain and A. Srinivasan. Incremental construction of ILP theories with large-scale unstructured data. In K. Furukawa, D. Michie, and S. Muggleton, editors, Machine Intelligence 14. Oxford University Press, Oxford, 1994. To appear. Bun88] Wray Buntine. Generalised subsumption and its applications to induction and redundancy. Arti cial Intelligence journal, 36(2):149{176, 1988. revised version of the paper that won the A.I. Best Paper Award at ECAI-86. DB92] L. De Raedt and M. Bruynooghe. An overview of the interactive concept-learner and theory revisor CLINT. In S.H. Muggleton, editor, Inductive Logic Programming, pages 163{191, London, 1992. Academic Press. DM92] B. Dolsak and S.H. Muggleton. The application of inductive logic programming to nite element mesh design. 75 AM91]

In S.H. Muggleton, editor, Inductive Logic Programming, pages 453{472, London, 1992. Academic Press. Fen91] C. Feng. Inducing temporal fault diagnostic rules from a qualitative model, pages 403{406. Morgan Kaufmann, San Mateo, CA, 1991. Hog90] Christopher John Hogger. Essentials of Logic Programming. Clarendon Press, Oxford, 1990. Kin91] R. King. PROMIS: Experiments in machine learning and protein folding. In D. Michie, editor, Machine Intelligence 12. Oxford University Press, 1991. KLS92] R. King, S. Muggleton R. Lewis, and M. Sternberg. Drug design by machine learning: The use of inductive logic programming to model the structure-activity relationships of trimethoprimanalogues binding to dihydrofolate reductase. Proceedings of the National Academy of Sciences, 89(23), 1992. Kon84] Kononenko, I., Bratko, I. and Roskar, E. Experiments in automatic learning of medical diagnosis rules. Tr, Jozef Stefan Institute, Ljubliana, Yugoslavia, 1984. KS90] R. King and M.J.E. Sternberg. A machine learning approach for the prediction of protein secondary structure. Journal of Molecular Biology, 216:441{457, 1990. KW92] J.U. Kietz and S. Wrobel. Controlling the complexity of learning in logic through syntactic and task-oriented models. In S.H. Muggleton, editor, Inductive Logic Programming, pages 335{359, London, 1992. Academic Press. MBHM90] D. Michie, M. Bain, and J. Hayes-Michie. Cognitive models from subcognitive skills. In M.J. Grimble J. McGhee and P. Mowforth, editors, Knowledge-Based Systems for Industrial Control, pages 71{99. Peter Peregrinus for IEE, London, UK, 1990. 76

Donald Michie and Rui Camacho. Building symbolic representations of intuitive real-time skills from performance data. In Machine Intelligence 13, pages 385{418. Oxford University Press, Oxford, United Kingdom, 1994. eds. K. Furukawa, D. Michie and S. Muggleton. MF90] Stephen Muggleton and Cao Feng. E cient induction of logic programs. In Proceedings of the First International Workshop on Algorithmic Learning Theory|ALT90, pages 368{381, 1990. MF92] Stephen Muggleton and Cao Feng. E cient induction of logic programs, pages 281{298. Academic Press, London, 1992. ed. Muggleton, Stephen. Mit82] T. M. Mitchell. Generalization as search. Arti cial intelligence, 18(2):203{226, 1982. Mor92] Eduardo Morales. Learning chess patterns, pages 517{ 537. Academic Press, London, U.K., 1992. ed. Muggleton, Stephen. Mug91] Stephen Muggleton. Inductive logic programming. New Generation Computing, 8(4):295{318, 1991. Mug93a] S. Muggleton. Inductive logic programming:derivations, successes and shortcomings. In Proceedings of the European Conference on Machine Learning: ECML-93., pages 21{ 37, Vienna, Austria, April 1993. Mug93b] S. Muggleton. Logic programming part ii: Inductive logic programming. In course notes of the Logic Programming subject. M.Sc. in Computation, Programming Research Group, Oxford, pages 1{106. Oxford, 1993. Mug93c] S. Muggleton. Optimal layered learning: a pac approach to incremental sampling. In Proceedings of the Fourth International Workshop on Algorithmic Learning Theory., 1993. Mug94] S. Muggleton. Inductive logic programming:derivations, successes and shortcomings. SIGART Bulletin, 5(1):5{11, January 1994. 77

MC94]

O'K83]

Plo69] Qui79a]

Qui79b] Qui82]

Qui86] Qui87]

Qui90] Qui91]

Rob65]

R. A. O'Keefe. Concept formation from very large training sets. In IJCAI 83: Proceedings of the Eighth International Joint Conference on Arti cial Intelligence, Karlsrhue, West Germany, 1983. eds. Morgan Kaufmann. G. D. Plotkin. A note on inductive generalisation, pages 153{163. Edinburgh University Press, Edinburgh, 1969. eds. Meltzer, B. and Michie, D. J. R. Quinlan. Discovering rules from large collections of examples: a case study. In D. Michie, editor, Expert Systems in the Micro-electronic Age, pages 168{201. Edinburgh University Press, Edinburgh, 1979. J. R. Quinlan. Induction over large data bases. Technical Report memo HPP-79-14, Standford University, 1979. J.R. Quinlan. Semi-autonomous acquisition of patternbased knowledge. In D. Michie J. E. Michie and Y.-H. Pao, editors, MachineIntelligenc vol 10, pages 159{172. Ellis Horwood, Chichester/Halsted, New York, 1982. J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81{106, 1986. Have a copy and read it. J.R. Quinlan. Generating production rules from decision trees. In Proceedings of the Tenth International Conference on Arti cial Intelligence, pages 304{307, San Mateo, CA:, 1987. Morgan-Kaufmann. J. Ross Quinlan. Learning logical de nitions from relations. Machine Learning, 5(3):239{266, August 1990. J.R. Quinlan. Determinate literals in inductive logic programming. In IJCAI-91: Proceedings of the Twelfth International Joint Conference on Arti cial Intelligence, pages 746{750, San Mateo, CA:, 1991. Morgan-Kaufmann. J. A. Robinson. A machine-oriented logic based on the resolution principle. Journal of the ACM, 12(1):23{41, Jan 1965. 78

Rou90]

C. Rouveirol. Saturation: Postponing choices when inverting resolution. In Proc. Ninth European Conference on Arti cial Intelligence, pages 557{562, London, 1990. Pitman. Rou92] C. Rouveirol. Extensions of inversion of resolution applied to theory completion. In S.H. Muggleton, editor, Inductive Logic Programming, pages 63{92, London, 1992. Academic Press. Sha81] E.Y. Shapiro. Inductive inference of theories from facts. Technical Report 192, Dept. Comp. Sci., Yale University, Connecticut, 1981. Sha83] E.Y. Shapiro. Algorithmic program debugging. MIT Press, 1983. SHKM92] Claude Sammut, S. Hurst, D. Kedzier, and Donald Michie. Learning to y. In Proceedings of the Ninth International Workshop of Machine Learning 92, pages 385{393, Aberdeen, U.K., July 1992. SSMS94] A. Srinivasan, R. King S. Muggleton, and M. Sternberg. Exhaustive search in ilp: a case study. In Proceedings of the Eleventh International Conference on Machine Learning (ML94), 1994. SZ93] Claude Sammut and Tatjiana Zrimec. Building pilot training aids by induction. In Workshop on Real-world applications of Machine Learning in the European Conference on Machine Learning, Vienna, Austria, april 1993. Wir88] R. Wirth. Learning by failure to prove. In Proc. Third European Working Session on Learning, pages 237{251, London, 1988. Pitman.

79

赞助商链接

- Knowledge Discovery in Databases - An Inductive Logic Programming Approach
- TECHNIQUES FOR IMPROVING THE EFFICIENCY OF INDUCTIVE LOGIC PROGRAMMING IN THE CONTEXT OF DA
- information in Inductive Logic Programming
- Inductive Logic Programming
- Lecture Notes in Computer Science 1 A Perspective on Inductive Logic Programming
- TECHNIQUES FOR IMPROVING THE EFFICIENCY OF INDUCTIVE LOGIC PROGRAMMING IN THE CONTEXT OF DA
- information in Inductive Logic Programming
- The logic of learning a brief introduction to Inductive Logic Programming
- 1 An Empirical Study of the Use of Relevance Information in Inductive Logic Programming
- Inductive Logic Programming
- An Empirical Evaluation of Bagging in Inductive Logic Programming
- Improving the eciency of inductive logic programming through the use of query packs
- The field of Inductive Logic Programming
- Compact Representation of Knowledge Bases in Inductive Logic Programming
- Editorial Inductive Logic Programming is Coming of Age

更多相关文章：
更多相关标签：