9512.net

甜梦文库

甜梦文库

当前位置：首页 >> >> # Prof.dr. M. van Lambalgen

Decision-Theoretic Robotic Surveillance

ILLC Dissertation Series 2002-01

For further information about ILLC-publications, please contact Institute for Logic, Language and Computation Universiteit van Amsterdam Plantage Muidergracht 24 1018 TV Amsterdam phone: +31-20-525 6051 fax: +31-20-525 5206 e-mail: illc@wins.uva.nl homepage: http://www.illc.uva.nl/

Decision-Theoretic Robotic Surveillance

Academisch Proefschrift ter verkrijging van de graad van doctor aan de Universiteit van Amsterdam op gezag van de Rector Magni?cus prof.mr. P.F. van der Heijden ten overstaan van een door het college voor promoties ingestelde commissie, in het openbaar te verdedigen in de Aula der Universiteit op vrijdag 25 januari 2002, te 10.00 uur door

Nikolaos Alexiou Massios

geboren te Athene, Griekenland.

Promotores: Co-promotores:

Prof.dr.ir. F.C.A. Groen Prof.dr. M. van Lambalgen Dr.ir. L. Dorst Dr. F.P.G.M. Voorbraak

Faculteit der Natuurwetenschappen, Wiskunde en Informatica Universiteit van Amsterdam Plantage Muidergracht 24 1018 TV Amsterdam

Copyright c 2001 by Nikos Massios Cover design by the author. A Typeset in L TEX. Printed and bound by Print Partners Ipskamp. ISBN: 90–5776–078–9

? & " ?? % $ ? ? # ?! ? ?§ ¨¨?¤?? ? ?? ? ? §???

? ? ? ?

v ?

Contents

Acknowledgments 1 Introduction 1.1 Surveillance as a planning task . . . . . . . . . . . . . . . . . . . . 1.2 Approach in this thesis . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Thesis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Issues in Surveillance 2.1 Robotic surveillance projects . . . . . . . . . . . 2.1.1 List of surveillance projects . . . . . . . 2.1.2 Classi?cation of surveillance approaches 2.2 Surveillance in this thesis . . . . . . . . . . . . . 2.3 Key issues in surveillance planning . . . . . . . 2.3.1 Comparing surveillance strategies . . . . 2.3.2 A priori information . . . . . . . . . . . 2.3.3 Level of detail . . . . . . . . . . . . . . . 2.3.4 Intelligent and non-intelligent opponents 2.3.5 Con?dence . . . . . . . . . . . . . . . . . 2.3.6 Problem complexity . . . . . . . . . . . . 2.4 Other work on surveillance planning . . . . . . . 2.5 Summary . . . . . . . . . . . . . . . . . . . . . 3 Problem Formalisation and Complexity 3.1 Problem assumptions . . . . . . . . . . . 3.2 Decision-theoretic setting . . . . . . . . . 3.2.1 Formal environment model . . . . 3.2.2 Surveillance strategies . . . . . . 3.2.3 Examples and ?rst results . . . . vii

xi 1 2 2 3 5 5 5 10 12 13 13 14 15 16 16 17 17 18 19 20 21 21 23 26

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

3.3

3.4

3.5

3.6

(PO)MDP settings . . . . . . . . . . . . . . . . . . . 3.3.1 Introduction to POMDPs . . . . . . . . . . . 3.3.2 POMDP setting . . . . . . . . . . . . . . . . . 3.3.3 Direct MDP settings . . . . . . . . . . . . . . 3.3.4 Equivalence of di?erent settings . . . . . . . . Surveillance state space size . . . . . . . . . . . . . . 3.4.1 State space size derivation . . . . . . . . . . . 3.4.2 The value of T and the exponential expression Standard (PO)MDP solving approaches . . . . . . . . 3.5.1 Value functions; value and policy iteration . . 3.5.2 Piecewise linear discretisations . . . . . . . . . 3.5.3 Planning with factored representations . . . . 3.5.4 Reinforcement learning using neural networks 3.5.5 State-action pair sampling . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

35 36 39 41 43 46 48 49 50 51 53 54 56 57 59 63 63 64 64 66 67 67 68 71 74 75 77 77 79 81 82 83 84 85 86 90 91 93

4 Hierarchy and Strategies 4.1 An o?ce-like environment . . . . . . . . . . . . . 4.1.1 Environment graph construction . . . . . . 4.1.2 The cost barrier problem . . . . . . . . . . 4.1.3 Simulation experiments . . . . . . . . . . . 4.2 The hierarchical minimum expected cost strategy 4.2.1 Abstraction tree construction . . . . . . . 4.2.2 Decision procedure . . . . . . . . . . . . . 4.2.3 Revised decision procedure . . . . . . . . . 4.2.4 Discussion . . . . . . . . . . . . . . . . . . 4.3 Summary . . . . . . . . . . . . . . . . . . . . . . 5 Path-based Clustering 5.1 Clustering desiderata . . . . . . . . . . . . . . 5.2 O?ce-like environments . . . . . . . . . . . . 5.3 Route abstraction simpli?cation . . . . . . . . 5.4 Expected cost assignment . . . . . . . . . . . 5.4.1 Approximate probability computation . 5.4.2 Cost computation . . . . . . . . . . . . 5.4.3 Room expected cost . . . . . . . . . . 5.4.4 Block expected cost . . . . . . . . . . . 5.4.5 Higher-level cluster expected cost . . . 5.5 Properties of expected cost equations . . . . . 5.6 Summary . . . . . . . . . . . . . . . . . . . . viii

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

6 Path-based Decisions 6.1 New clustering of our o?ce building . . 6.2 The ?xed cluster route strategy . . . . 6.2.1 Overview . . . . . . . . . . . . 6.2.2 Candidate trajectory generation 6.2.3 The decision level . . . . . . . . 6.2.4 Trajectory selection . . . . . . . 6.3 Simulation results . . . . . . . . . . . . 6.3.1 Cost-based comparison . . . . . 6.3.2 Sensitivity to clustering . . . . 6.4 Summary . . . . . . . . . . . . . . . . 7 Conclusions 7.1 Overview . . . 7.2 Claims . . . . 7.3 Further work 7.4 Final word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

95 95 96 96 96 98 99 101 101 106 107 111 111 112 113 113

A Example POMDP application. 115 A.1 Setting example 3.2.4 as a POMDP . . . . . . . . . . . . . . . . . 115 A.2 One step look-ahead POMDP application . . . . . . . . . . . . . . 117 A.3 Remarks related to the example . . . . . . . . . . . . . . . . . . . 119 Bibliography Index Samenvatting Abstract 121 127 129 131

ix

Acknowledgments

First and foremost I would like to express my gratitude to Leo Dorst and Frans Voorbraak. This thesis is mainly the result of the weekly interaction I had with them and the expert and wise guidance they o?ered to me over the last 4 years. I am greatly indebted to them. Many thanks should also go to Michiel van Lambalgen and Frans Groen for their valuable help. They were always eager to listen to what I had to say and their questions often guided me through my research. Moreover, their ?nal reading and comments greatly improved the readability of this thesis. My o?cemates Stephan ten Hagen and Joris Portegies-Zwart helped me with some latex related problems and with the translation of the Dutch summary. Marco Vervoort supplied the excellent ILLC dissertation series style-?le for latex and Cesar Garita gave me a CorelDRAW template which helped in making the cover of this thesis. I thank them, therefore, for their kind contributions. Finally, there are a lot more people that should be thanked for moral support of varying degrees and in various periods throughout my work. I ?nd it hard to thank everyone at the right amount and still be fair but I hope that my every day attitude should make my feelings clear. An exception will be made for my parents and sister who always stood by me, Amsterdam December 2001. Nikos Massios

? ? ?

xi

?

?

¨ ??

? §?

? ? ? ¤

? ??

§

?

Chapter 1

Introduction

Surveillance can be informally de?ned as “a close watch kept over something or someone with the purpose of detecting the occurrence of some relevant events”. The person or thing to be watched should be understood to include humans, animals, areas, places, parts of aerospace, et cetera. What events are considered relevant depends on the type of surveillance and the reason why we do it. Most surveillance tasks are very mundane and automating them is highly desirable. Automated surveillance typically concerns itself with detection and recognition of relevant events. A system of CCTV cameras, microphones or smoke detectors can be used to detect interesting events or to analyse patterns of behaviour. Most of the existing work on robotic surveillance is on recognition of relevant events. However, event recognition is only one aspect of surveillance. Another aspect, and the concern of this thesis, is that of selecting the right places to focus the surveillance attention on. A solution to this is essential in situations where using ?xed surveillance sensors is either not possible, or not practical, or where mobile platforms are more socially acceptable since people then know they are being observed. Further, a solution to the problem of selecting the optimal places to focus attention on can be important not only to robots but also to human security guards. Commonly, the reason for focusing attention on one area is that relevant events are more likely to be detected there either because they are intrinsically more likely to occur, or because the area has not been observed for a while - increasing the chance that something is there. Because the robot is not omnipresent, this becomes a problem of limited resource allocation. With enough knowledge of the structure and the parameters a?ecting the decisions, this allocation problem can be automated. 1

2

Chapter 1. Introduction

1.1

Surveillance as a planning task

By reviewing surveillance in section 2.1, three main capacities in which an arti?cial or human agent can be involved in a surveillance task are identi?ed. They are those of: coordinator An agent that determines or constrains which resources can be used for the surveillance task. The coordinator of a surveillance task may decide on the available budget, on the types of sensors to be used, on its relative importance compared with other tasks, et cetera. planner An agent that decides how to use the available resources (sensors and actuators) to perform the surveillance task. Typically, a planner has a clear goal like exploration or cost minimisation. It should be seen as something much more task-speci?c than a coordinator. sensor An agent that gathers information and transmits it to the planner. A security camera is an example of sensor. One way a robot can be used in surveillance is in the capacity of a (?exible) sensor. In that case, the planning of the surveillance task would be typically done by a human operator. An example is a security guard teleoperating a robot that inspects a hazardous environment. However, it is the next level that concerns us here. The subject of this thesis is the investigation of autonomous surveillance planning for an o?ce-like environment. The main concern is to develop strategies or algorithms that move the robot in a way that the expected cost of relevant events remaining undetected is minimised. To make the discussion simpler we will focus on one type of relevant events, namely that of ?res. A detailed justi?cation of these choices is given in the next chapter.

1.2

Approach in this thesis

Humans can perform surveillance tasks quite well most of the time. However, the process that makes this possible is not immediately clear, which makes it hard to replicate it and hard to assess its optimality. In the case where the probabilities and costs are known, we would like to get a structured understanding based on an analysis of surveillance as a probabilistic decision process. We are eventually interested in an algorithmic implementation of such a decision process. The detailed exposition of the surveillance problem in the rest of the thesis will make it clear that instances of it for simple environments and simple events are still computationally very hard. So it is not clear how a robot should compute solutions to such instances if results are to be found in a reasonable time.

1.3. Thesis overview

3

Approximation is a possible solution to the computational issues. However, approximations, in turn, have the problem of departing from optimal. A di?cult balancing act exists between the need to approximate and the computational need to get an approximation that is within some reasonable bound to the optimal. The approximations proposed in this thesis are based on abstracted representations of the environment. Eventually, decisions are taken among di?erent routes of the abstract environment nodes. A mix of theoretic and simulation experiments are presented to support our discussion.

1.3

Thesis overview

This thesis is divided into the following chapters: Chapter 2 - Issues in Robotic Surveillance. An overview of current and past research in robotic surveillance is given to provide some context. It should become clear from the discussion that most existing work on robotic surveillance addresses event recognition and that not much is on planning. On the basis of this review the decision was taken to concentrate on surveillance planning. In the rest of this chapter some surveillance planning-related issues which a?ect our work are mentioned. Chapter 3 - Problem Formalisation and Complexity. The speci?c problem of ?re surveillance is set using several formalisms like (PO)MDPs or decision theory. These formalisms are shown to be equivalent and conveying the exponential nature of surveillance planning viewed as an optimal search problem with the aim of minimising the expected cost of ?res. Because of its exponential nature, simple n-step strategies cannot solve this problem optimally, and consequently approximation methods for the surveillance planning problem become a priority. Chapter 4 - Hierarchy and Strategies. A “cost barrier” problem de?ned on a speci?c o?ce-like building is presented as an example of what a simple n-step look-ahead strategy cannot solve. Then a hierarchical abstraction of this environment is given along with a ?rst ad-hoc expected cost approximation, which does not solve the problem in all circumstances, but demonstrates the promise of abstraction. Chapter 5 - Path-based Clustering. This chapter deals with abstracting in a more principled manner. A much better assignment of the abstracted expected costs can be produced if the geometry of the environment is considered in more detail. After discussing some general desiderata for clustering an o?ce building, we concentrate on the speci?c case of the corridor-based o?ce building containing a “cost barrier”. A better method for assigning

4

Chapter 1. Introduction expected costs is produced which di?erentiates between di?erent types of routes that visit abstract nodes.

Chapter 6 - Path-based Decisions. Given the abstraction, a decision procedure for choosing between clusters is necessary. Several decisions concerning the look-ahead and the type of decisions needed are discussed. Then the ?xed cluster route strategy which works on the route-generated abstraction of the environment is presented. It is shown to work well in most cases and to have some robustness with reference to abstraction choices. Chapter 7 - Conclusions. This chapter brie?y draws general conclusions from the speci?cs of the other chapters.

Chapter 2

Issues in Surveillance

Several robotic surveillance projects exist worldwide and a multitude of di?erent aspects have been and are currently being researched. Despite some overlap in the scope of some projects, there are features that are quite unique in others. In this chapter, we give an overview of current and past research in robotic surveillance to provide the context for our work. It should become clear from the discussion that only few projects have been concerned with surveillance planning and that most existing work addresses event recognition. After mentioning the research and assumptions of others, we proceed to make explicit what kind of surveillance we are considering. Then some key issues in surveillance planning, which have been identi?ed from the literature review, are listed together with the design choices made in relation to each of these issues.

2.1

Robotic surveillance projects

Some robotic surveillance research projects are ?rst listed and then classi?ed. The aim here is to give an idea of the current status of research, rather than completely describe everything that has been done. The classi?cation of the listed projects at the end of the section (see table 2.1) aims to help the reader assess their merits and weaknesses.

2.1.1

List of surveillance projects

AUVs. The goal there is to produce a ?eet of small, inexpensive, autonomous underwater vehicles (AUVs) that will roam the earth’s oceans [Fri94]. Oceanographers often want simultaneous measurements in many places. These measurements are necessary in building and testing models of the dynamics of the earth’s oceans. Currently, such measurements have to be taken on-board ships and it is normally prohibitively expensive to use 5

6

Chapter 2. Issues in Surveillance enough ships to collect data at the required density. A great number of inexpensive AUVs can reduce the cost of collecting such data [Bel97]. However, certain planning problems arise from the fact that these robots have to be capable of performing autonomous navigation inside the sea. For instance, the robot has to be able to avoid underwater obstacles, yet collect the required measurements. An easy way of avoiding underwater obstacles is to get the AUV to rise to the surface. However, rough seas can also be a threat and some sort of compromise has to be made. Furthermore, since the energy of the robot is limited, an optimal survey resolution and time constraint parameters have to be found. This must be done so that the survey takes the minimum amount of energy possible [BW96, WYBM96]. Moreover, there are plans to use “distributed intelligence” by making the robots exchange information about their position and status [TT98]. To do this, underwater communications and robot control mechanisms have to be used [KML+ 96]. Apart from collecting measurements, other applications suggested include surveillance of ?sheries, detection of toxic/nuclear dumping and coast guard support.

(a) Inside view

(b) Relative size

Figure 2.1: An AUV (Autonomous Underwater Vehicle). NRaD, MSSMP. This is a project of the US Navy Research and Development agency (NRaD). The MSSMP [MBB+ 97] acronym stands for Multipurpose Security and Surveillance Mission Platform. It involves using Cypher, a VTOL (Vertical Take-O? and Landing) robotic vehicle developed by Sikorsky, to perform airborne surveillance. Essentially, Cypher is used as a ?ying tele-operated robot. It carries a visible and an infrared light camera, a laser range ?nder and some audio sensors. The robot is controlled by an operator, who can send the robot inside a war zone and use it for locating

2.1. Robotic surveillance projects

7

enemy positions or convoys. Moreover, it can be used to control the perimeter of bases and to inspect di?cult terrains (e.g. villages and rooftops) for enemy troops. The US Army calls this system autonomous because there is no tether connecting the operator with the robot. However, we believe that a truly autonomous system is one that does not need an operator. There is some on-board processing performed on the robot to reduce the amount of information needed to be sent over the radio link between the robot and the operator. This on-board processing is limited, however, to the detection of events of interest. It should also be mentioned here that there is a number of related army projects which aim at producing UAVs (Unmanned Airborne Vehicles). Those vehicles also do not require a tether and their navigation relies on tele-operation and GPS waypoints. Their range is much larger, up to a little over 900 km for some of them. These vehicles communicate with a control centre using satellite links and they are capable of providing their operator with real-time video links. UAVs were used in Bosnia, Iraq and Afghanistan for reconnaissance missions. They have also been used as airborne sensors for ?re detection and even surveillance. Predator, Dark Star, and Pioneer are the names of some of the more well-known models. The di?erences amongst them are mainly related to the type of sensor payload they carry and the altitude they ?y.

(a) Cypher

(b) Predator

Figure 2.2: Two UAVs (Unmanned Airborne Vehicles). Cyberguard. Cyberguard is a mobile security robot marketed by a company called Cybermotion. The aim of Cyberguard is to patrol indoor building areas for intruders, ?res, and chemical threats. The system consists of an autonomous robot, an autocharger docking station and an array of software that allows the user to control the robot over a secure digital radio link

8

Chapter 2. Issues in Surveillance using a PC. The robot is equipped with a great variety of sensors that allow it to navigate and detect events of interest. The robot navigation relies partly on sonar sensors and partly on odometry. The robot localises itself using walls, halls and other existing building features and so no arti?cial landmarks are needed. The path to be followed is prede?ned by its operator using a CAD-based package and a building ?oor plan. The path representation consists of path lines and speci?c locations where the robot has to perform speci?c actions. There is a separate pathchecking program that checks all paths and interacts with the programmer to make sure that all programs are acceptable. Fuzzy logic is used to assess the degree of certainty of detected hazards and to assure appropriate response. The robot can operate continuously for over 12 hours between 3-hour battery charges. Cyberguards have been sold to academic, industry, and military customers. NRaD used a Cyberguard in a joint army-navy e?ort to provide automated intrusion detection and inventory assessment capabilities for US DoD warehouses and storage sites. For the inventory assessment task NRaD uses RF transponder tags that are attached to items of interest and hence used to track the location/presence of those items [RLGIG99].

Figure 2.3: Cyberguard patrolling a warehouse. Visual Surveillance. While a lot of work has been done in visual surveillance covering di?erent aspects and tasks, the topic is still open since none of the systems proposed is completely reliable. The problem can be divided into recognition of objects, tracking of objects/persons and analysis of behaviours. We mention a particularly representative project on visual surveillance. VSAM (Video Surveillance And Monitoring) is a consortium project sponsored by DARPA. The VSAM consortium consisted of about 15 cooperating US universities and its e?orts focus on fusion of information from a large number of surveillance devices to provide surveillance in areas where using humans is too dangerous, costly or impractical.

2.1. Robotic surveillance projects

9

In demonstrations, cameras distributed over a campus were used to track cars or humans [CLK+ 00]. The fusion problem was quite hard because many cameras of di?erent types (e.g colour and black and white) and possibly also moving (e.g. with a pan and tilt mechanism) were used. Further, fusion of information coming from UAVs together with information collected on the ground has been discussed [KCL+ 97, CLK+ 00]. One of the proposals of this project is that the cameras should rearrange their orientation so that a maximal area of the environment in which they operate is covered. This re-orientation issue seems to be related to the Art Gallery theorems and algorithms [O’R87]. In the Art Gallery problem the position and number of cameras in a gallery is to be decided to ensure that no area of the physical space is left uncovered. In VSAM, movable cameras or many static ones can be used to survey large areas of a campus. Several vision-related issues have been discussed within this framework like: how calibration of the data coming from di?erent sources should be done [LRS00], how objects should be tracked in the calibrated data [CD00, HHD00], and how sensors on moving robots could be considered [BK01]. Apart from the simple lower-level tracking, higher-level skills like identi?cation of activities of humans or of cars were an issue. The target behaviours were things like “person traversing parking lot”, “person meeting other person and walking away”, “car traversing parking lot” etc. For the identi?cation of this kind of activities, more commonly, static cameras (often a single one) were used, instead of many moving ones, but the assumption was made that in the future the low-level tracking would be su?ciently well solved to make this irrelevant. The models and techniques used to describe the types of possible human behaviours varied; [ORP00] used Coupled Hidden Markov Models (CHMMs), [BK00] used Hidden Markov Models (HMMs) and Entropy Maximisation, while [IB00] used probabilistic grammar-based descriptions of possible behaviours. In fact, more work has been done in other projects on this kind of behaviour identi?cation and in [Tes97] Petrinets were used while in [BG95] Bayesian Networks were the model of choice. Highways. Tra?c surveillance has also been considered. A type of visual tra?c surveillance is that of analysing behaviours of cars, like “car overtaking other car” or “car exiting roundabout”, just by static cameras being placed around a roundabout [BG95] Another type of tra?c surveillance is that of trying to determine the origin/destination counts of cars travelling on the highway system [HR98, PROR99]. For this kind of work cameras are placed in several places of the network (mainly entrances and exits). Various features of the cars, like colour parameters, approximate size, velocity of travel, lane occupied etc.,

10 Project name AUVs MSSMP Cyberguard Tracking Behaviour Tra?c Rescue

Chapter 2. Issues in Surveillance Event Agent Sensor Sensor type type range simple multi multi variable interm multi multi variable simple single multi variable interm multi multi variable hard single single static interm single multi static simple multi multi variable Structured Surveillance environ. plan level no no sensor no no sensor yes no sensor no some sensor yes no sensor yes no sensor yes/no yes coord.

Table 2.1: Classi?cation of surveillance approaches

can be identi?ed by the cameras. Then a probabilistic system can be used to make the association between vehicles currently observed and vehicles previously observed by other cameras. A solution to this problem can lead to better highway design and to automatic identi?cation of accidents and breakdowns in parts of the network not covered by the cameras. Search and Rescue. RoboCup-Rescue deals with the search and rescue operations after an earthquake [TKT+ 00]. A RoboCup competition was initiated in order to start an international cooperation between universities in applying robotics technology to the search and rescue problem. The idea is how to provide means for up to 1000 rescue agents to cooperate with each other. Within this framework many heterogeneous types of agents were used: ?re ?ghting agents, police agents, road clearing agents, ambulance agents etc. All these agents are given very incomplete information on what the current state of a?airs is and so cooperation and communication are of main concern. The behaviour planning problem is extremely complex and has widely time-varying objectives. The goal is not to ?nd optimal but rather semi-optimal strategies. Time for decisions is limited and apart from trying to rescue people, prevention of a secondary disaster is a goal. The eventual goal is to be able to use the system to develop and to coordinate real rescue agents and to simulate scenarios of potential disasters. So the planning involved is more of coordinator level surveillance as it was identi?ed in section 1.1. Perhaps it should be mentioned that this project is in a relatively early stage.

2.1.2

Classi?cation of surveillance approaches

Here, the projects mentioned before are classi?ed (Table 2.1) in terms of some of their characteristic properties. In the table presented, each project corresponds to a di?erent row while each property corresponds to a di?erent column. The

2.1. Robotic surveillance projects

11

property values assigned to each project are discrete. It should be mentioned that there are some borderline cases where it is not clear what such a discrete property value assignment should be. In such cases we try to make the most accurate assignment possible based on the information presented in the project descriptions available. In our table, the “event type” column states the di?culty involved in detecting the basic surveillance events considered by each project. Events whose detection is based directly on sensor data are considered to be simple events, e.g. discrete changes of thermometer readings. Intermediate di?culty events are events detected after some analysis of the sensor data is performed, e.g. moving object detection after computation of optic ?ow. The triggerings of cliche scenarios, e.g. “car overtaking other car” or “car parking”, are considered to be hard events since many simpler events have to be detected before a cliche can be triggered. The “surveillance plan” column is related to the type of planning the agent performs. In order for a project to qualify for our use of the term “surveillance planning”, it has to be able to autonomously select the areas of interest within its environment and move its focus of attention to those areas. The “agent” column refers to the number of agents used. A project is multiagent if many agents are used and the information obtained by them is fused to provide a single view of the world. The “sensor type” column is related to the data fusion problem. A project is “multi-sensor” if each of the agents used is equipped with multiple types of sensors. For example, a project that uses agents equipped with a sonar and a camera is considered to be a multi-sensor project. The “sensor range” column essentially refers to whether the sensor range is controllable or not. The sensor range is considered to be “variable” if the sensor can automatically move or adjust itself so that a di?erent view of the environment is obtained. An “environment” is considered to be structured if it is not a “naturally” occurring environment. For example, the AUV robots work inside the sea, so they work in an unstructured environment. On the other hand, Cyberguard operates in a structured environment since its environment is man-made. Finally, by level of surveillance we mean a classi?cation into “sensor”, “planner”, and “coordinator” along the lines of our de?nition in section 1.1. From this table, it can be observed that there has been a lot of work on the fusion of information derived from many agents and/or sensors. A system deployed in the real world has ?rst and foremost to solve its sensor interpretation problem. This is perhaps the reason why so much work exists on this aspect of surveillance. Although sensor interpretation and sensor fusion are hard problems, a lot of progress has been made on these, especially in the last decade. Also, the great variation in most properties indicates that di?erent types of di?culties have been encountered in each project. Further analysis of research in surveillance suggests that although a lot of work has been done at sensor level surveillance, little progress has been made in the planner and coordinator levels which were identi?ed in section 1.1. This

12

Chapter 2. Issues in Surveillance

indicates that there is potential for research in the area of surveillance planning.

2.2

Surveillance in this thesis

The main concern of this thesis is to develop strategies which use the available sensors in such a way that interesting events are detected. A mechanism for generating such strategies is an example of planner level surveillance. Therefore, this project ?lls in a gap left open by previous research into robotic surveillance. More precisely, this thesis deals with the problem of robot ?re surveillance in an o?ce-like building. The ?res are assumed to start with a certain probability and to have a certain cost when present. The robot is supplied with a simple virtual sensor that can detect a ?re within its sensor range with complete reliability. However, the robot has no way of knowing if a ?re is present at a location outside its sensor range. So our problem then becomes one of detecting ?res as soon as possible with preference given to expensive ?res. The ?res we are considering will be modelled to be independent of each other, i.e. we assume that a ?re starting at a speci?c location cannot increase the probability of a ?re starting at another location. Further, our ?res are modelled to produce the same cost per time unit. In contrast, real ?res usually cause more damage at their beginning before they eventually burn everything and extinguish themselves. These real ?res do not produce the same cost per time unit as our ?res do. The geometrical and topological structure of the environment is assumed to be perfectly known. The robot has perfect localisation and a perfect map of the environment to begin with. Also the actions for moving from one location to the next never fail and always take the same amount of time. Although we are discussing the problem of surveillance for ?res, the main aim is not to make a system that looks for ?res. The aim is to understand how costs, probabilities, and the topology of an o?ce building interact in order to specify how decisions should be taken. We are interested in how a decision process that can make decisions about probabilities and costs in such a situation can be constructed. Even though the speci?cs of the problem are not very realistic, this does not mean that it is not relevant to robotics since several similar problems exist. Leaky pipes in a nuclear power plant, or dirty rooms that need cleaning could replace ?re as an event to be detected. It should be seen as a general detection problem. In the next chapter the problem will be formalised to a great level of detail. Further, many of the simpli?cations made in this thesis can be removed in a straight forward way in future work. A lot of the results presented will be based on simulations of our situation. There is no real robot moving in a real o?ce building that can be on ?re. There are several reasons for this decision. One is that working with a real robot is

2.3. Key issues in surveillance planning

13

expensive, especially if the building can catch ?re. Another is that working with a real robot is time-consuming since a lot of mechanical problems can delay work. Further, and perhaps more importantly, although some assumptions about the sensors and the localisation are made, it does not mean that they are 100% realistic in the current state of robotics. The aim is to develop decision strategies that could command a robot once these capabilities are developed to a su?ciently high level. The most important decisions can be summarised as: Sensor interpretation Perfect sensing is assumed in this thesis. This is rather realistic in that we focus on simple events namely ?res. Sensor fusion The decision was taken to assume that the robot uses a single sensor. Normally robots have multiple sensors; however, we overcome this using a virtual ?re sensor. Indoor structured environment The robot works in a simple indoor environment. Further it has a map of that environment.

2.3

Key issues in surveillance planning

This section discusses some key issues that we have considered in order to achieve the goal of our project, and we indicate on which of them we concentrated our e?orts. Several of these issues refer to dimensions along which the di?culty of the surveillance planning task can vary, and we will indicate our preferences concerning choices that can make the surveillance problem manageable. Some simplifying choices have to be made, since one cannot expect to solve the complicated problem of automating surveillance planning in its full generality, especially since this problem had not been extensively studied before. The aim is to clarify what type of surveillance planning we are considering in this thesis.

2.3.1

Comparing surveillance strategies

Since the virtual ?re sensor of the robot cannot span the entire area of interest, the robot must somehow decide where to use it ?rst. In all but the most trivial situations, the robot will not be absolutely certain of the sensor location that is going to yield the best chances of the relevant events being detected and classi?ed. A probabilistic surveillance planner is one that explicitly reasons about the uncertain outcomes of sensor-location choices and attempts to make the best possible decisions. Of course, this type of decision problems can be masked by forcing the robot to explore its environment either randomly or systematically. It is also possible to use heuristic-based methods that make decisions on actions

14

Chapter 2. Issues in Surveillance

which are not necessarily based on probabilities. So the important observation here is that there is a number of possible strategies for surveillance. This multitude of possible surveillance strategies raises the question of which strategy is best for minimising the expected cost. A comparison between di?erent high-level sensing strategies is, therefore, important. There are many ways in which such a comparison can be made. The ?rst way is to start comparing di?erent strategies at a theoretical level. For instance, it might be possible to prove that in a speci?c scenario one strategy will be better than another. Such proofs can be extremely useful because their results are more general than the ones obtained through experimentation. However, being able to prove something largely depends on the assumptions and the scenario under consideration. For most realistic situations theoretical proofs are not possible. In such cases the comparison should be an experimental one. One could simulate an environment and try to decide which strategy performs best in simulation. For this type of experimental comparison, there is still the question of what the criteria are that should be used to compare two strategies. Our decision on how to compare the strategies was to use a mix of theoretical proofs and experimental simulation-based techniques. For the simulation-based comparison the metric used in comparing is that of the actual strategy cost as it is produced by simulating the strategy.

2.3.2

A priori information

The availability and reliability of pre-supplied (a priori ) information may vary. For instance, the robot can be supplied with a map of the environment which describes the locations of walls, doors, and other possible landmarks that the robot can use to determine its location in the environment. Or it can be supplied with (approximate) probabilities of interesting events occurring at various locations. One can also imagine a robot learning this kind of information. This raises the question of how much prior information is necessary and how much information can or should be learned. Prior information is usually available only in the form derived from a human expert. Converting the knowledge of a human to a format suitable for automated manipulation is usually very hard. Researchers in the area of expert systems have tried in the past to convert human knowledge into facts and control rules that can be processed by computer. There is experience within Knowledge Based Engineering on how such an undertaking can be handled. However, as workers in this area report, converting this knowledge to facts and rules is non-trivial. At ?rst sight, having a robot supplied with a map of the environment seems simpler than having a robot which ?rst has to learn such a map. However, as soon as both the prior and sensory information stop being completely reliable, it is not clear what should be done when they disagree. This di?cult problem is

2.3. Key issues in surveillance planning

15

avoided in the case of a learning robot. The main advantage of a learning robot is clear: It is capable of operating even if pre-supplied information is unavailable. However, for some applications, it may be impractical to include an extensive learning stage before the robot can be used reliably. Also, some of the necessary information may not be learnable at all within a reasonable amount of time. For example, it is a serious problem to learn the probabilities of relatively exceptional events, such as ?re, at di?erent locations from past experiences alone. We have not dealt with the problem of when to choose learning robots. Only the case where a great deal of reliable prior information is available to the robot was considered, and the aim was to develop strategies for using this information together with sensory information to optimise behaviour. Another case not considered is that where the available prior information is not su?cient or is too unreliable to exactly determine optimal behaviour. In such situations, one would be looking for behaviour that is robust, in the sense that small changes in the available information would result in small changes in the behaviour.

2.3.3

Level of detail

The supplied environment map may have various levels of detail. Some authors [Fab96] use a very coarse cell-like representation. Others, use ?ner-grained ones such as occupancy grids [Kon96]. A well-built surveillance system should probably dynamically adjust the detail of its world representation to suit the task at hand. For instance, on the one hand, route planning inside a building might require only a very coarse representation in terms of rooms and connections between rooms. A path that connects two rooms in that representation can easily be found using classical graph searching algorithms. On the other hand, trajectory planning and localisation within a room probably needs a much more detailed description so that individual obstacles can be cleared with only a few centimetres to spare. Moreover, even experts use di?erent detail levels when describing their ?eld. So the right environment description simpli?es planning and also makes the planning approaches more natural. Maintaining a correct representation of the environment at various resolution levels might be too hard. So an issue that arises within surveillance and, in fact, within almost any AI application is how much of the world needs to be represented and how this should be done. There has been a lot of work on this by both general AI researchers and roboticists. However, there is still a lot of room for improvement within this area. We have not attempted to solve this problem in its full generality. However, abstraction-based techniques were developed for handling the interaction between representations at di?erent levels of detail. This was unavoidable and we believe there is little hope of applying a probabilistic surveillance planner to realistic surveillance tasks without this kind of interaction between di?erent levels of detail.

16

Chapter 2. Issues in Surveillance

2.3.4

Intelligent and non-intelligent opponents

Another issue in surveillance is that of the potential di?erences between the methods needed to treat intelligent and non-intelligent opponents. Non-intelligent opponents can be thought to be natural phenomena or accidental events. For example, accidental ?res, leaky pipes and slippery ?oors can be classi?ed as nonintelligent opponents. Intelligent opponents are people and events caused by them. For instance, a ?re can be manipulated by an arsonist to make it especially dangerous and deceiving. This type of preplanned ?res would be classi?ed in the intelligent opponent category. One can imagine that the detection methods and surveillance strategies should be a?ected by the sort of opponents the system is supposed to be dealing with. The system should protect itself against attempts by opponents to modify its characteristics. So opponents should not be capable of changing detection rates, false alarm rates or the outcome of actions to alarms by some sort of intelligent manipulation. Therefore, the mode of operation of an autonomous robot should not be completely transparent to make the system resilient to this type of manipulation attacks by intelligent opponents. If the robot succeeds in hiding its strategy from the intelligent opponent, it is more or less justi?ed to treat the events caused by this intelligent opponent similarly to the events caused by nature. We have not considered the case of intelligent opponents. Instead, we concentrated our e?orts on surveillance of natural phenomena. To be more speci?c, in the rest of this thesis we are considering stochastic events of known probability that can only occur at speci?c locations in our environment. We took “?re” as an example of such an event. Although this makes our problem quite speci?c, di?erent types of natural events can replace that of “?re”. In fact, the “?res” we are considering are not totally realistic since, for example, they do not spread. In section, 3.1 we list the assumptions we are making about the problem, and some of them concern the types of events considered. We believe that the rest of this thesis is su?ciently general to deal with several types of non-intelligent opponents, provided that the opponent events have similar characteristics to the ones listed in section 3.1. For example, one could use the techniques developed here to check for leaky pipes in a nuclear reactor.

2.3.5

Con?dence

Finally, an important open problem is how to deal with information about the environment that is acquired at di?erent times. The fact that there was nothing interesting in a room at the point in time when sensor measurements in that room were taken does not exclude the fact that something interesting might happen there after some time. Temporal aspects like the decrease in con?dence in the relevance of sensor

2.4. Other work on surveillance planning

17

measurements, given the dynamic nature of the world, are very interesting. For example, an agent should deduce that there is no need to re-examine a recently examined room but wait instead until some time has passed. For realistic applications it is not obvious what these rates should be, and they may even vary over time. For example, an aerial robot should be capable of deducing that con?dence in the non-presence of bush ?res during the summer should decrease more rapidly than it does during the winter. This is because during the summer the woods are dry and, as a result, ?res can start and spread more easily. The decision was to use probabilities to model the probabilistic belief that a ?re is present some time after a sensor measurement is taken. This decision will be justi?ed further in section 3.2.

2.3.6

Problem complexity

The issues of this section are mentioned because they can determine the style and complexity of surveillance planning approaches. Therefore, the comparison of these approaches depends on the choices made in these issues. The simplest possible scenario is that of a system with a lot of a priori information and with a single scale world representation which ignores con?dence aspects and deals with non-intelligent opponents. The hardest scenario is probably that of a system that does not make any assumptions about the a priori information available, has a true multi-scale map, and at the same time, takes into account con?dence aspects in a sophisticated manner and deals with intelligent opponents.

2.4

Other work on surveillance planning

As we have mentioned there has been little work on surveillance planning. There is only one notable exception we are aware of. It is Fabiani’s PhD thesis which deals with several issues of uncertain decision making but surveillance is taken as an application area for these methods. Fabiani’s application is the selection of areas to be inspected for possible bush-?res, using an autonomous ?ying robot [Fab96]. This project takes some temporal reasoning aspects into consideration. His environment consists of a ?nite set of cells and it is assumed that at each moment the sensor range coincides with a single cell. All locations are thought to be immediately accessible from each other. The dynamics at each cell is modelled by a Markov process. The probabilities of ?res starting are assumed to be independent of the state of the neighbouring cells. The main theoretical improvement is that the con?dence with which something is believed in his system decreases over time. Such beliefs originate from sensor measurements. The decrease in con?dence of belief can be explained by taking into account the fact that changes occur in the environment while it is not

18

Chapter 2. Issues in Surveillance

observed. A strategy for surveillance based on this notion of con?dence, as well as several others were proposed in his thesis and in the next chapter we will be discussing some of them. Fabiani has also worked on tracking a partially predictable target using a moving, tracking robot [FGBLL01]. The target has some movement capabilities that set some constraints on how far it can move. Target visibility and position uncertainty are concurrently considered. Both the uncertainty in the target and tracker position are modelled. The tracker can reduce the uncertainty in its position by observing predetermined landmarks in the environment. A gametheoretic framework is applied to decide what actions the tracker should take to guarantee that it keeps the target within its ?eld of view, while managing its position uncertainty. This is a di?erent kind of surveillance problem because the target robot is moving and is evading its tracker. The events considered in this thesis are simpler than those in the target-tracker problem.

2.5

Summary

Although robots seem to be excellently suited for automating the process of moving surveillance sensors, most of the existing projects on robotic surveillance use robots as a kind of ?exible sensor and let human operators control the high-level decision-making on where to go to next. So far, relatively little attention has been paid to the problem of automating the planning of available resources to optimally use the sensors for the surveillance task. As a result of this observation, we decided to develop and evaluate surveillance strategies that can be used in an autonomous surveillance robot. The goal is a strategy producing surveillance plans of approximately minimum cost. The type of problem solved in this thesis would be classi?ed according to the scheme of table 2.1 as: simple event, single agent, single sensor, variable sensor range, structured environment, planning-based and aiming at the planning level. Further, the emphasis is on stochastic independent events, and we shall take ‘?re’ as an example of such an event.

Chapter 3

Problem Formalisation and Complexity

As we have indicated in the previous chapter, there exist various robot surveillance problems of several di?erent di?culties. Our thesis is focused on the speci?c problem of a robot surveying an o?ce-like building so that the expected cost of ?res is minimised. The aims of this chapter are: (a) to formalise this problem to a degree of speci?city that allows for it to be tackled, (b) to demonstrate that the various ways in which it is formalised in this chapter are equivalent, and (c) to show that approximations are necessary and that existing approximate methods cannot be used to solve it. Problems involving expected costs and probabilities are commonly set in a decision-theoretic way. In such a setting the various costs and probabilities, as well as the goal of the surveillance, have to be speci?ed. In robotics decision-theoretic problems are often described in a Markov Decision Process (MDP) setting or in a Partially Observable Markov Decision Process (POMDP) setting. (PO)MDPs are a general framework in which decision-theoretic problems can be speci?ed. Their use has become popular in robotics and other areas because of the existence of several standard solution methods for problems set in this way. We set our surveillance problem both in the general decision-theoretic way and in several (PO)MDP-based ways. Setting the problem in several ways helps us understand it better by examining its various representations. We shall demonstrate that all these settings are equivalent. The goal in all alternative settings of our problem is that of minimising the expected cost of ?re, but in general, a combination of costs (such as ?re damage to the building and ?re damage to robot) could be minimised. We believe that the decision-theoretic view on surveillance (including the (PO)MDP settings) is su?ciently general to incorporate, at least theoretically, many of the possible re?nements of the problem presented in this chapter. Although standard methods exist for solving problems set as (PO)MDPs, 19

20

Chapter 3. Problem Formalisation and Complexity

we shall demonstrate that our problem is too hard for conventional (PO)MDPsolving methods. In fact, any method attempting to solve our problem exactly would fail because its state space grows exponentially in the number of rooms present in our environment. The (PO)MDPs settings will not be pursued in subsequent chapters. Yet, this thesis would not be complete without discussing the possibility of using (PO)MDP-solving methods, and the reason why we think no extra bene?t can be gained in this representation.

3.1

Problem assumptions

As in most problems, some assumptions are necessarily made in order to formalise robot ?re-surveillance. They describe what the capabilities of our robot are, how time evolves, what the nature of the ?res we are trying to extinguish is, etc. Sections 3.2 and 3.3 elaborate on our assumptions by setting the problem and discussing some of the issues related to them. The fact that our state space grows exponentially (as it will be shown in section 3.4) indicates that our problem is not oversimpli?ed by the assumptions made here. ? Environment structure. The environment is discretised by assuming that there exists a ?nite partition of it into rooms. The robot is assumed to know the structure of this partitioning. ? Deterministic localisation. The robot always knows where it is. ? Deterministic movement. The robot always succeeds in going to a new location. ? Deterministic sensing. The robot always knows if its current location is on ?re or not. ? Sensor range. The sensor range is limited to the current robot location. ? Sensor type. It is assumed that only one virtual sensor is present that can be the result of data fusion. ? Time and actions. We assume time to be discrete, and we assume that at any time, all the robot’s possible actions considered are those that may change the sensor range. ? Time to move. The time to move between any two locations is the same and corresponds to one time-step. ? Stopping of ?res. Fires do not stop spontaneously (burn out). ? Independence of locations. Fires do not spread, which implies that the probability of ?re at a speci?c location is independent of other locations.

3.2. Decision-theoretic setting

21

? Probabilities. The robot is assumed to know the exact ?re starting probabilities. ? Costs. The robot is to know the exact cost a ?re causes per time-step. It is assumed that this cost per time-step is the same for the duration of the ?re. ? Start. For simplicity, we assume that the surveillance starts in a ?re-free environment. The robot starts from a speci?c known location.

3.2

Decision-theoretic setting

In this section we ?rst set the problem in a decision-theoretic way. We then discuss some simple, abstract examples of surveillance problems and show that in these examples the minimum expected cost criterion leads to reasonable surveillance strategies.

3.2.1

Formal environment model

The formal model of the environment that is used for describing the surveillance task is: 3.2.1. Definition. An environment E is a tuple X, A0 , A, F, C, P0 , P , where: ? X is a set {Xi : 1 ≤ i ≤ n} of mutually disjoint spatial areas, or locations, ? A0 ∈ X is the start location of the robot, ? A ? X ×X represents the relation of immediate accessibility (for the robot) between locations, ? F = {fi : 1 ≤ i ≤ n} is a set of Boolean variables indicating if a ?re is present at each location Xi . If a ?re is present, we write fi = 1 or fi , ? C is a function assigning to each location Xi ∈ X the cost C(fi ) associated with not detecting a ?re present at Xi , ? P0 is a function assigning to each location Xi ∈ X the probability P0 (fi ) that at time 0 an event occurs at Xi , ? and P is a set of transition probabilities. That is, for every Xi ∈ X and time t ≥ 0, P contains P (fi → 1) denoting the (prior) probability that a ?re starts at Xi during a time-step, and P (fi → 0) denoting the (prior) probability that the ?re at Xi stops during a time-step.

22

Chapter 3. Problem Formalisation and Complexity

In section 3.1 we stated that the robot starts working in a ?re-free environment. This means that P0 (fi ) = 0 for every Xi ∈ X. As a consequence, we can simplify the description of an environment X, A0 , A, F , C, P0 , P to X, A0 , A, F , C, P . Further, we have assumed that P (fi → 1) and P (fi → 0) do not depend on the time t. We also assume that the environment is connected, in the sense that for every Xi , Xj ∈ X there is a path Xi = Y1 , . . . , Ym = Xj such that Yk , Yk+1 ∈ A. As mentioned in section 3.1, we assume that the sensor range is one environment location and we write rt to denote the sensor range of the robot at time t. We assume that the sensor range at time t = 0 coincides with the robot’s start location, so r0 = A0 . The set of reachable states is a?ected by the immediate accessibility relation so for every t ≥ 1, rt ∈ {Xi : rt?1 , Xi ∈ A}. If rt = Xi , then we say that Xi is visited at time t. The decision strategies should decide which immediately accessible location to visit next. For the moment, we do not take recognition uncertainty into account and assume a ?re to be detected whenever it is in the sensor range of the robot. However, this type of uncertainty can be easily introduced. The above de?nition provides an abstract model of the decision problem. For realistic applications, we have to take into account that a robot can have several sensors, each with its own sensor range, that the sensor range is not necessarily an element of X, that the actions of the robot may include changing its location, its orientation, and possibly manipulating aspects of the environment, such as opening a door. We also have to take into account the exact state and dynamics of the environment, the exact position of the robot in the environment, the uncertainty in the recognition of ?res, et cetera. We could have captured more realistic surveillance problems by dropping some of our assumptions, but we preferred a simple model so that we could concentrate on the surveillance strategies. In spite of the many simplifying assumptions, the notions formalised above are su?ciently general to capture the abstract environment used in [Fab96] in order to experimentally compare di?erent surveillance strategies. Of course, not all applications of surveillance are meant to trigger intervening responses to the observed relevant event. For example, observations made in the context of a scienti?c study are primarily aimed at information gathering, not at intervening. However, when interventions do play a role, their e?ects should be incorporated in the model of the surveillance problem. Since the particular actions triggered by a detection are not themselves, strictly speaking, part of the surveillance behaviour of the agent, we will leave them out of our considerations. For simplicity, we have assumed that ?res do not stop spontaneously, but immediately after being detected by the surveillance agent. Formally, P (fi → 0) = 0, and Pt+1 (fi ) = P (fi → 1), if rt = Xi . It would, of course, have been possible to introduce some time delay for the countermeasures to take e?ect, but this would have raised the problem of deciding how important it is to monitor areas where ?res are known to be present or have been observed to occur. It would

3.2. Decision-theoretic setting

23

also have been possible to allow P (fi → 0) > 0 and to model the e?ect of the actions triggered by observing a ?re as an increase in P (fi → 0). Our simplifying assumption can be viewed as an extreme instance of this possibility. As in [Fab96], we assume that the cells are independent, and that the probability of fi → 1 is constant over time. It is then possible to express Pt (fi ) in terms of P (fi → 1) and the amount of time that has passed since the last visit to Xi . 3.2.2. Proposition. Let E = X, A0 , A, F, C, P be an environment where P (fi → 0) = 0, and rt = Xi implies that Pt+1 (fi ) = P (fi → 1). Then Pt (fi ) = 1 ? (1 ? P (fi → 1))t?t , where t is the largest time point ≤ t such that rt = Xi . Proof. Prove for t = t + 1. For t = t + 1 using the formula in proposition 3.2.2, we get Pt +1 (fi ) = 1 ? (1 ? P (fi → 1))t +1?t = P (fi → 1). This is correct since P (fi → 1) is the probability of a ?re starting in one time-step. Assume for t = t + k. Assume Pt +k (fi ) = 1 ? (1 ? P (fi → 1))k . Prove for t = t + k + 1. We ?rst compute Pt +k+1 (fi = 0) as the probability that no ?re existed at t = t + k times the probability that a ?re did not start during t = t + k + 1. We have Pt +k+1 (fi = 0) = (1 ? Pt +k (fi = 1))(1 ? P (fi → 1)) = (1 ? P (fi → 1))k (1 ? P (fi → 1)) = (1 ? P (fi → 1))k+1 . We know Pt +k+1 (fi = 1) = 1 ? Pt +k+1 (fi = 0). So, our formula also holds for t = t + k + 1 because Pt +k+1 (fi = 1) = 1 ? (1 ? P (fi → 1))k+1 .

3.2.2

Surveillance strategies

In [Fab96] a surveillance strategy is proposed based on the newly introduced notion of con?dence, which can be viewed as a second-order uncertainty measure. Whenever sensory information about the state of a location becomes available, the probability of an event occurring at that location at that time is updated, and one is assumed to be very con?dent about this assessment of the state of the location. This con?dence then drops gradually over time during a period in which no fresh sensory information concerning this particular location is obtained. The rate by which the con?dence decreases depends on the transition probabilities: the more likely the changes, the higher the decrease rate. Speci?cally, the factor λp is used as con?dence decrease rate, where p is the transition probability leaving from the observed state and λ is some unspeci?ed parameter. The actually used computation of con?dence is slightly more complicated, due to the fact that some time after the observation it is no longer clear which transition probability (P (fi → 1) or P (fi → 0)) should be used in the computation of the decrease rate. In our model, the situation is simpler, since we assumed that when visiting a location Xi at time t, it either observes that no ?re is present at Xi or it

24

Chapter 3. Problem Formalisation and Complexity

immediately extinguishes the ?re. In both cases, the robot can be con?dent that no ?re is present at Xi after t. This con?dence can decrease over time due to the possibility that a ?re starts after t. The rate of this decrease depends, of course, on P (fi → 1). The transition probability P (fi → 0) does not play any role. Since the factor λP (fi →1) is meant to be a decrease rate, one can infer that 0 < λ < 1. Every time a location Xi is not visited, the degree of con?dence of the robot that no ?re is present at Xi is multiplied by λP (fi →1) . 3.2.1. Observation. Let 0 < λ < 1. Then (λx )n > (λy )m i? nx < my. When visiting Xi , the robot can either see that no ?re is present, or the robot will immediately extinguish it. In either case, the con?dence of the robot that no ?re is present at Xi is maximal, say 1. It thus follows from the above observation that the location with the lowest con?dence at time t is the location Xi such that P (fi → 1)(t ? t ) is maximal, where t is the time of the last visit to Xi . The strategy proposed in [Fab96] can be described as follows. maximum con?dence Choose the action that changes the sensor range to the neighbouring location which has the lowest degree of con?dence attached to it. In [Fab96], this strategy is experimentally compared to the following strategies. random exploration Randomly choose a location as the next sensor range. methodical exploration Choose all the locations, one after the other, and always in the same order, as the sensor range at the next moment. maximum likelihood Choose the action that changes the sensor range to the neighbouring location with maximal uncertainty, where the uncertainty at location Xi is measured by min(P (fi ), 1 ? P (fi )) Notice that both random and methodical exploration, as described above, allow choosing non-neighbouring locations. Actually, in the experiments of [Fab96] it is assumed that all locations are directly accessible from each other (A = X × X). This is only realistic in the case when changing attention to a far removed location involves no or only a negligible amount of cost or time and this is not the case in robotic surveillance. Of course, it is not di?cult to restrict random exploration to choosing randomly between neighbouring locations only, but it is not clear how to put a similar restriction on methodical exploration. One possible strategy that can be considered to be a local variant of methodical exploration is the following. minimax interval Minimise the maximum time interval between visits of locations by choosing the action that changes the sensor range to the neighbouring location which has not been visited for the longest time.

3.2. Decision-theoretic setting

25

We propose to use this minimax interval strategy as a kind of reference strategy. Since this strategy does not use information about the uncertainties, it can be used to clarify how much other strategies which do use uncertainty information gain in e?ciency. It should be mentioned that in the case of the maximum likelihood strategy many uncertainty measures, including, for example, entropy (which is de?ned as Xi ∈X P (fi ) log(P (fi ))), give rise to the same preferences as min(P (fi ), 1?P (fi )). We use this de?nition of maximum likelihood rather than entropy, because we want to be able to compare our strategies with those of Fabiani [Fab96] and that is the de?nition used there. The uncertainty is maximal whenever the probability of ?re is closest to 0.5. This is because min(0.5, 1 ? 0.5) = 0.5 while for example min(0.7, 1 ? 0.7) = 0.3. As we will see in section 3.2.3, the maximum likelihood strategy seems more appropriate for symmetrical surveillance understood as maintaining a maximally correct model of the state of the environment, with respect to both the presence and the absence of relevant events, than for asymmetrical surveillance aimed at detecting relevant events. In [Fab96], no explicit choice is made between such a symmetrical view on surveillance and the asymmetrical view we take. Several criteria are used to evaluate the performance of the strategies in the experiments, including the symmetrical criterion of the percentage of erroneous estimations of the state of each location and the asymmetrical criterion of the percentage of undetected relevant events. We propose a surveillance strategy based on decision-theoretic considerations. By decision-theoretic surveillance we understand the kind of behaviour guided by the following decision strategy. minimum expected cost Choose the action that minimises the expected cost. This decision strategy can be interpreted both globally and locally. Under the global interpretation, the action that has to be chosen corresponds to the behaviour of the surveillance agent from the start to the end of the surveillance task. There is not an inherent end to a surveillance task, but in practice each particular task has a limited duration (say, until the next morning when the employees return to the o?ce building, or until the batteries of the robot have to be recharged). The (global) expected cost ECT until time T can be computed by the following formula. Pt (fi )C(fi ). ECT =

1≤t≤T,Xi =rt

Notice that a choice to visit location Xi at t not only removes the term Pt (fi )C(fi ) from the above sum, but also has some indirect bene?ts, due to the fact that it reduces Pt (fi ) for t > t and it makes some neighbouring locations of Xi available to be visited at time t + 1.

26

Chapter 3. Problem Formalisation and Complexity

The behaviour of the surveillance agent from the start to the end of the surveillance task can also be viewed as consisting of a sequence of simpler actions. One can apply the above decision strategy locally to choose at each time between the possible simple actions by comparing the consequences of these simple actions, or perhaps by comparing the (expected) consequences of small sequences of simple actions. Let us say that an n-step strategy compares the (expected) consequences of sequences of n (simple) actions. Of course, the strategy is more easily implemented for small n, whereas, in general, it better approximates the global strategy for large n. Notice that none of the strategies considered in [Fab96] takes into account a notion of cost which, for example, allows one to express the opinion that for some areas it is more important to observe relevant events than it is for other areas. Another use of the notion of cost is to express the opinion that early detection is important as it decreases the cost over time after the start of an event.

3.2.3

Examples and ?rst results

We have de?ned six decision strategies for surveillance, three of which make use of some kind of uncertainty information, namely maximum con?dence, maximum likelihood, and minimum expected cost. The following proposition shows that these strategies essentially agree if there is no (relevant) uncertainty information to be used. 3.2.3. Proposition. Let E = X, A0 , A, F, C, P , and assume that for all locations Xi , Xj ∈ X, P (fi → 1) = P (fj → 1) and C(fi ) = C(fj ). Then the maximum con?dence strategy and the 1-step minimum expected cost strategy both reduce to the minimax interval strategy. Also, for su?ciently small transition probabilities P (fi → 1), the maximum likelihood strategy will agree with the minimax interval strategy. Proof. Examining observation 3.2.1 in conjunction with the assumption that P (fi → 1) = P (fj → 1) for locations Xi , Xj ∈ X, we can conclude that an agent choosing the location with the smallest degree of con?dence essentially chooses the location that has not been visited for the longest period of time in part of the environment that is reachable in one time-step. Similarly, under the same assumption, a 1-step minimum expected cost strategy chooses the room with the largest expected cost but since the ?re costs are also uniform among rooms, this always corresponds to the room that has not been visited for more time-steps. In the case of the maximum likelihood strategy for su?ciently small transition probabilities P (fi → 1) and relatively small environment, we have min(P (fi ), 1 ? P (fi )) = P (fi ). If that is the case, choosing the action with maximal uncertainty corresponds to using the 1-step minimum expected cost strategy and we have

3.2. Decision-theoretic setting

27

1 C(f1 ) = 1 P (f1 → 1) = 0.5

2 C(f2 ) = 1 P (f2 → 1) = 0.8

Figure 3.1: The environment of example 3.2.4. already shown that, under our assumption, this corresponds to the minimax interval strategy. This result supports the choice of using minimax interval as a reference strategy. It follows that we can expect a di?erence between the strategies mentioned only in the case of varying probabilities or costs. Our ?rst example illustrates the di?erence between the various surveillance strategies introduced so far. 3.2.4. Example. Consider an environment E = X, A0 , A, F, C, P , where X is a set {X1 , X2 } consisting of two rooms, A = X × X (?g. 3.1). Assume C(f1 ) = C(f2 ) = 1, P (f1 → 1) = 0.5 and P (f2 → 1) = 0.8 (Remember that we also assume that P (fi → 0) = 0, and that Pt+1 (fi ) = P (fi → 1), if Xi ? rt ). In other words, we have two equally important rooms, each accessible from the other, but the probability of a ?re starting at room X2 is higher than the corresponding probability at room X1 . The strategy based on maximum likelihood will always look at room X1 (where the uncertainty is maximal), and will never take a look at room X2 . It is maximal for 0.5 because min(0.5, 1 ? 0.5) = 0.5 while for example min(0.7, 1 ? 0.7) = 0.3. The strategies based on methodical exploration and minimax interval go back and forth between both rooms, just as the maximum con?dence strategy does, at least if one assumes that the con?dence in an observation made at the immediately preceding moment is higher than the con?dence in an observation made before that time. This seems to follow from the way the decrease rate of con?dence is computed in [Fab96]. The strategy based on a 1-step minimisation of expected cost is slightly more complicated. At time 1, room X2 is chosen because P (f2 ) = P (f2 → 1) = 0.8 > 0.5 = P (f1 → 1) = P (f1 ). At time 2, P (f2 ) is again 0.8, since this room was visited at the immediately preceding time-step. However, P (f1 ) has only increased to 0.75, which is not enough to get chosen. Only at time 3, P (f1 ) = 0.875 has increased above the 0.8 probability that a ?re occurs in room 2. We thus obtain a sequence where room X1 is only chosen every third time-step. See table 3.1, where for the ?rst six time-steps the expected costs of not visiting a room are displayed.

28

Chapter 3. Problem Formalisation and Complexity time 1 2 3 4 5 6 EC(f1 ) EC(f2 ) 0.5 0.8 0.75 0.8 0.875 0.8 0.5 0.96 0.75 0.8 0.875 0.8

Table 3.1: The room expected costs of the ?rst six time-steps of example 3.2.4. The minimum expected cost strategy chooses the room with the maximal expected cost (printed in boldface). In this example, the maximum likelihood strategy does not result in an exhaustive exploration of the environment. Both maximum con?dence and minimum expected cost behave better in this respect if we assume that exhaustive exploration is desirable. The problem with the maximum likelihood criterion is that P (fi ) > P (fj ) is not guaranteed to result in a preference to visit Xi rather than Xj whenever P (fi ) > 0.5. In fact, if P (fi ) > P (fj ) ≥ 0.5, then the criterion prefers Xj . Notice that not only in the arti?cial example above, but also in practical applications, with su?ciently many locations and su?ciently high transition probabilities, it can happen that P (fi ) > 0.5. However, it should be mentioned that typically the ?re starting probabilities are very small and that the environment is not large enough for the probability to grow to 0.5. This means that in those far more normal cases maximum likelihood does correspond to minimising the probability of ?re. Still we believe that there is a theoretic point to be made. Since the maximum likelihood criterion does not prefer locations where the chance of detecting a ?re is high, but is more interested in locations where the occurrence of a ?re is highly unknown, we conclude that the criterion is more appropriate for symmetrical than for asymmetrical surveillance. In example 3.2.4, the 1-step minimum expected cost strategy results in a behaviour which seems intuitively appealing, since it clearly re?ects the fact that P (f1 → 1) is substantially lower than P (f2 → 1), whereas maximum con?dence, just as methodical exploration, treats both rooms the same. However, we will see below that this intuitive appeal may be somewhat misleading. The maximum con?dence strategy does also take the probabilities P (f1 → 1) and P (f2 → 1) into account, since the rate of con?dence decrease is a function of these probabilities. However, the decrease rate proposed in [Fab96] does not result in a di?erent treatment of both rooms in the example. Thus one can view the example as an indication that in the minimum expected cost strategy the probabilistic information is used more directly and taken more seriously than in

3.2. Decision-theoretic setting

29

the maximum con?dence strategy. The maximum con?dence strategy does not consider at all the possibility that for some areas it may be relatively more important to detect events. This is easily implemented in the minimum expected cost strategy by letting the cost C(fi ) of not detecting a ?re depend on the area Xi . Such varying costs may cause a problem, since they may prevent the 1-step minimum expected cost strategy from achieving an exhaustive exploration of the environment. It can be shown that if the cost of not detecting a ?re is constant over the different areas Xi , then, in the long run, the 1-step minimum expected cost strategy will result in an exhaustive exploration of the environment. More precisely, one can show the following: 3.2.5. Proposition. Let E = X, A0 , A, F, C, P , and assume that for all locations Xi , Xj ∈ X, C(fi ) = C(fj ). Then, in the long run, every Xi ∈ X is visited when applying the 1-step minimum expected cost strategy, and there is a ?nite upper bound Ni on the length of the time interval between visits of Xi . Proof. We prove the slightly more general proposition that says that for an environment E = X, A0 , A, F, C, P , with C(fi ) = C(fj ), P0 (fi ) < 1, P (fi → 1) < 1 ?Xi , Xj ∈ X. Then ?Xi ∈ X there exists a maximal time Ni between visits of Xi . The proof is by induction to the size of X = {Xi : 1 ≤ i ≤ n}: Prove for size 1. Trivial. The robot is always present in that room. Assume for size < n. Prove for size n. Let Xk be a location such that the in?mum of time between visits is at least as large as for the other locations. Suppose the assumption step is not true for size n, then ?Xm such that its in?mum time between visits of Xm is ∞. Therefore, for Xk the time between visits is ∞. Note also that by the induction hypothesis there is a ?nite maximum time Nm that the robot can move in X ? {Xk }, without exhaustively exploring X ? {Xk }. By the assumption step (for n-1 rooms) we have that there is a bound Ni on the time between visits for all rooms in X ? {Xk } or in other words: ?Xi ∈ X ? {Xk }, ?Ni > time between visits of Xi . This implies that there is a bound pi on the probability that there is a ?re in any of the rooms in X ? {Xk } or in other words: ?Xi ∈ X ? {Xk }, ?pi < 1 such that ?t, Pt (fi ) < pi . So the bound on the time between visits Ni of the inductive hypothesis sets a bound pi on how much the probability of ?re can rise. If the time between visits in room Xk is ∞, then ?N, ?X ∈ X ? {Xk }, pi < P (fk ), where P (fk ) is the probability of ?re at Xk after not visiting it for N times. Let Xj ∈ X ? {Xk } be a neighbouring location of Xk , then between N and N + Nm steps Xj will be visited. Since for that time ?X ∈ X ? {Xk }, pi < P (fk ), Xk will be visited at the next time-step. So room Xk is visited after the right amount of time and this completes our induction.

30

Chapter 3. Problem Formalisation and Complexity

Since a bound in the time between visits at every location exists, one can conclude that eventually every location will be visited. If ?res do not stop spontaneously before they are detected, exhaustive exploration implies that all ?res will eventually be detected. But even among strategies that are 100% successful, with respect to eventually detecting ?res, there may be a di?erence in performance if, for example, early detection is considered to be important. Before we can continue with the discussion on exhaustive exploration, we need to know the answer to another interesting question namely, that of if and how the indirect bene?ts of visiting a location can be taken into account. For example, visiting a location at time t decreases the probability of a ?re being present at that location after t, but this e?ect is not considered by simply comparing the expected cost over n time-steps. 3.2.6. Definition. Let t be the time of the last visit to Xi before t, and T be the time of the next visit after t. Then the indirect bene?ts of a visit to X i at t are equal to the following.

T

n=t+1

(1 ? P (fi → 1))n?t ? (1 ? P (fi → 1))n?t .

This expression computes for every time-step between t and T two probabilities: (a) the probability of a ?re given that room Xi was last visited at time t, and (b) the probability of ?re given that room Xi was not visited at time t (so its last time of visit was t ). The sum of the di?erences between these two probabilities is de?ned as the indirect bene?ts of a visit at time t. If T = t + 1, then the above expression provides a lower bound of the indirect bene?ts of a visit to Xi at t instead of later. Incorporating this amount of the indirect bene?t into the 1-step minimum expected cost strategy is similar to employing a 2-step minimum expected cost strategy, and it results in the back and forth behaviour in the environment of example 3.2.4. 3.2.7. Proposition. Let t be the time of the last visit to Xi before t. Then the indirect bene?ts of a visit to Xi at t have the following upper bound.

T T →∞

lim

n=t+1

(1 ? P (fi → 1))n?t ? (1 ? P (fi → 1))n?t

t?t

=

n=1

(1 ? P (fi → 1))n .

3.2. Decision-theoretic setting Proof. If we expand the series, we get the following sequence:

T T →∞

31

lim

n=t+1

(1 ? P (fi → 1))n?t ? (1 ? P (fi → 1))n?t = ? (1 ? P (fi ? (1 ? P (fi ? (1 ? P (fi . . . . . . ? (1 ? P (fi ? (1 ? P (fi ? (1 ? P (fi . . . . . . → 1))t+1?t → 1))t+2?t → 1))t+3?t → 1))t+1?2t → 1))t+2?2t → 1))t+2?3t (1) (2) (3) (4) (5) (6)

= (1 ? P (fi → 1)) +(1 ? P (fi → 1))2 +(1 ? P (fi → 1))3 . . . +(1 ? P (fi → 1))t+1?t +(1 ? P (fi → 1))t+2?t +(1 ? P (fi → 1))t+3?t . . .

The left part of line (4) cancels out the right part of line (1), the left part of line (5) cancels out the right part of line (2) and so forth. As T → ∞ only the left t?t column up to t ? t remains and therefore the limit is n=1 (1 ? P (fi → 1))n . We now return to the discussion about exhaustive exploration. The next example is a simple modi?cation of example 3.2.4 and in conjunction with this upper bound can be used to show that, in general, proposition 3.2.5 is no longer valid if the costs are allowed to vary. Consequently, the 1-step minimum expected cost strategy is no longer guaranteed to result in an exhaustive exploration of the environment.

1 C(f1 ) = 1 P (f1 → 1) = 0.5

2 C(f2 ) = 3 P (f2 → 1) = 0.8

Figure 3.2: The environment of example 3.2.8.

3.2.8. Example. Consider the situation of example 3.2.4, but now assume that C(f1 ) = 1 and C(f2 ) = 3 (?g. 3.2). Then the expected cost of not visiting room X1 has an upper bound of 1, whereas that of not visiting room X2 is 2.4, even though it has just been visited. Therefore, by the 1-step minimum expected cost strategy, room X2 will always be chosen. See table 3.2, where for the ?rst four time-steps the expected costs (of not visiting a room) are displayed.

32

Chapter 3. Problem Formalisation and Complexity time 1 2 3 4 EC(f1 ) EC(f2 ) 0.5 2.4 0.75 2.4 0.875 2.4 0.9375 2.4

Table 3.2: The room expected costs of the ?rst four time-steps of example 3.2.8. The minimum expected cost strategy chooses the room with the maximal expected cost (printed in boldface). This e?ect of ignoring room X1 can be avoided by allowing the cost of not detecting an event to grow as a function of the time passed since the event started. However, if the mentioned (expected) costs are correct, then this ignoring of room X1 is defensible. Intuitively, one should not require a surveillance agent to explore irrelevant or unimportant areas of the environment. This is substantiated by the following. 3.2.9. Proposition. In the environment of example 3.2.8 the 1-step minimum expected cost strategy minimises the global expected cost. Proof. To prove this proposition we will show that a visit to room 1 is never justi?ed, so our 1-step minimum expected cost strategy, which only visits room 2, is optimal. The direct bene?t of visiting room 1 is limt→∞ EC(f1 ) = 1. By proposition 3.2.7 we know that the upper bound of the indirect bene?t of a potential visit to t?0 room 1 at time t is n=1 (1 ? 0.5)n . The limit of this upper bound, as t → ∞, is also 1. Therefore, a potential visit to room 1 as t → ∞ can bring us a maximal potential bene?t of 2 (One util for the direct bene?t plus one for the indirect bene?t). But this is always less than the 2.4 direct bene?t of visiting room 2. So a visit to room 1 is never justi?ed. Perhaps a more important problem than possibly preventing an exhaustive exploration of the environment is that the varying cost can form an obstacle to obtaining optimal behaviour using the local (1-step) minimum expected cost strategy. 3.2.10. Example. Consider E = X, A0 , A, F , C, P , where X is a set {X1 , X2 , X3 } consisting of three rooms (?g. 3.3). The accessibility relation A is given by Xi , Xj ∈ A i? i and j di?er by at most 1, C(f1 ) = 10. C(f2 ) = 1, C(f3 ) = 3, P (f1 → 1) = 0.1, P (f2 → 1) = 0.5 and P (f3 → 1) = 0.8. In other words, we now have three rooms in a row, and after discarding room X0 , one obtains the environment of example 3.2.8.

3.2. Decision-theoretic setting

33

1 C(f1 ) = 10 P (f1 → 1) = 0.1

2 C(f2 ) = 1 P (f2 → 1) = 0.5

3 C(f3 ) = 3 P (f3 → 1) = 0.8

Figure 3.3: The environment of example 3.2.10. As in example 3.2.8, if A0 = X3 , the 1-step minimum expected cost strategy will always choose room X3 . But now the (possibly justi?able) ignoring of room X2 will make it impossible to visit room X1 , and, in the long run, the expected cost of not visiting room X1 will be very high. Obviously, such problems can be solved theoretically by looking ahead more than one step. However, looking many steps ahead is computationally expensive, since the required computation time depends exponentially in the number of steps considered. In [MV98], it is argued, based on some experiments, that in typical environments it is not feasible for real-time behaviour to use much more than a look-ahead of ?ve steps. Even for constant costs, the 1-step minimum expected cost strategy is not guaranteed to globally minimise the expected cost. 3.2.11. Proposition. In the environment of example 3.2.4 the 1-step minimum expected cost strategy does not minimise the global expected cost. Proof. In table 3.1 a stable repetitive behaviour can be observed after time-step 4 with a cycle size of 3 (see steps 4-6). The total environment cost for three time-steps of this behaviour is (0.5 + 0.96) + (0.75 + 0.8) + (0.875 + 0.8) = 4.685. A robot moving back and forth between the rooms in the same environment would enter a behaviour consisting of the states reached in time-steps 4 and 5 (a cycle of 2). The cost for two time-steps of this behaviour is (0.5 + 0.96) + (0.75 + 0.8) = 3.01. To make a comparison a cycle of 6 time-steps is examined. The 1-step minimum expected cost strategy gives us a 4.685 × 2 = 9.37 utils expected cost while the simple back and forth behaviour gives us a 3.01 × 3 = 9.03 utils expected cost. So the 1-step minimum expected cost strategy does not minimise the global expected cost. Actually, in example 3.2.4, the global expected cost of the 1-step minimum expected cost strategy is higher than that of the back and forth behaviour resulting from the methodical exploration, the maximum likelihood and the minimax

34 C(f1 ) = 1 P (f1 → 1) = 0.5 C(f2 ) = 1 P (f2 → 1) = 0.5

Chapter 3. Problem Formalisation and Complexity C(f4 ) = 1 P (f4 → 1) = 0.5 C(f3 ) = 1 P (f3 → 1) = 0.5

1

4

2

3

Figure 3.4: The environment of example 3.2.12. interval strategies. The 2-step minimum expected cost strategy already results in the same back and forth behaviour. It should be observed that looking ahead more steps does not always result in better performance. The following example shows that sometimes the 1-step minimum expected cost strategy behaves better than the 2-step strategy. 3.2.12. Example. Consider E = X, A0 , A, F , C, P , where X is the set {X1 , X2 , X3 , X4 }, the accessibility relation A is given by Xi , Xj ∈ A i? i and j di?er at most 1, and for all locations Xi , C(fi ) = 1 and P (fi → 1) = 0.5. This environment can model a corridor, with X1 , X2 , X3 , X4 as sections of the corridor, but it can also model two rooms, represented by X1 , and X4 , connected by a corridor with two sections represented by X2 and X3 . (See ?g. 3.4.) Therefore, this kind of environment is not unusual for an o?ce-like building. If the agent starts at X2 or X3 , then the 2-step minimum expected cost strategy results in going back and forth between X2 and X3 (without visiting X1 or X4 ), whereas the 1-step strategy results in going back and forth between X1 and X4 (and visiting X2 and X3 in between). It is easy to see that the latter behaviour is better. Notice that the above example shows that the 2-step minimum expected cost strategy is not guaranteed to result in an exhaustive exploration of the environment, even if the assumption of constant cost of proposition 3.2.5 is satis?ed. The problem is caused by the accessibility relation, since if one additionally assumes universal accessibility (A = X × X), then proposition 3.2.5 generalises to the n-step minimum expected cost strategy. The subtle role of the accessibility relation is not present in several problems which seem closely related to our robotic surveillance problem. The multi-armed bandit problems are examples of this. There a casino visitor has to select a lever to pull among many bandit machines (fruit machines). There are many variants of this problem but typically the longer a lever has not been pulled the more likely it is to give a big reward. An intelligent visitor should pull a lever that has been used a lot but has not given up a reward for some time. Unlike our problem, the agent may select any bandit and the accessibility relation does

3.3. (PO)MDP settings

35

play a role. Other similar problems are the problem of maintaining a system with independent component failures, and that of a surveillance camera selecting which part of a scene to focus on. In all of these it seems natural to assume a universal accessibility relation. The problem with the 2-step strategy in example 3.2.12 can be solved by implementing a commitment to complete the selected path of length 2, and make a new choice only every other time-step. In the concrete, pseudo-realistic example discussed in [MV99b], the 5-step minimum expected cost strategy performs better with commitment than without. The next example illustrates that a mobile platform may not always be very useful if the relevant events have an insigni?cant duration. 3.2.13. Example. Consider a robot with a digital photo camera on top of a hill. Assume that it has to take as many pictures of lightning as possible. Further, assume that the camera has a viewing angle of 90 degrees, and that at any time it can be directed in one of four directions: North, East, South, West. Finally, assume that after taking a picture, the robot has su?cient time to change the direction of the camera in any of these four directions before the camera is ready to take the next picture, and that the probability of lightning at a particular direction does not change over time. In this case, {N, E, S, W } can be viewed both as the partition of the environment and as the set of possible sensor ranges at each time. It is clear that the optimal strategy is to direct the camera towards the area where the probability of lightning is maximal, and to keep taking pictures in that particular direction, without making use of the possibility of changing the direction of the camera. An essential feature of this example is that the probability of an event e at time t always equals the probability of the event starting: Pt (e) = P (e → 1). Moreover, it is clear that, in this case, the decision problem at time t is independent of the actions that are chosen at other times. Therefore, the global decision strategy and the 1-step strategy are equivalent. From the description of the goal ‘to take as many pictures of lightning as possible’ it can be inferred that the cost of not detecting a particular lightning ?ash does not depend on the area where this ?ash occurs. In that case, minimising expected cost at time t reduces to minimising the probability of not detecting a particular ?ash at t, which is equivalent to the strategy mentioned in the example.

3.3

(PO)MDP settings

In the previous section the surveillance problem was set and analysed in a decisiontheoretic way. Here it will be set as a (PO)MDP. In fact, due to the generality of the (PO)MDP representation, it is possible to set it in several ways. To make clear that we are dealing with the same problem, we will show that the di?erent

36

Chapter 3. Problem Formalisation and Complexity

(PO)MDP settings are equivalent to each other and to the original decisiontheoretic setting. The (PO)MDP representation will not be used in the later chapters. However it is still used in the next section where we derive the statespace size.

3.3.1

Introduction to POMDPs

A Partially Observable Markov Decision Process (POMDP) is a tuple S, A, T , R, ?, O . We describe in turn each element of the tuple. States S. S is the ?nite set of states the world can be in. An element from this set is a possible way the world could exist. Actions A. A is the set of possible actions. The set of possible actions can be the same for all s ∈ S and in this case we write A. However, in some situations di?erent actions are available for each state s ∈ S and then we write As . Transitions T : S × A → Π(S). T is the state transition function. Given a state s ∈ S and an action a ∈ As , it gives a probability distribution over resulting states. We often write T (s, a, s ) to represent the probability that, given action a was taken at state s, s is the resulting state. Obviously, although the e?ects of actions can be probabilistic, they do not necessarily have to. In that case only one resulting state is possible. Immediate Rewards R : S × A → R. R is the immediate reward function. It gives the expected immediate reward obtained by the agent for taking each action in each state. We write R(s, a) to represent the expected immediate reward for taking action a in state s. Observations ?. ? is the ?nite set of observations the agent can experience in the world. An agent, instead of directly observing after each action the current environment state, receives an observation. This observation provides a hint about the current state of the environment. Observation function O : S × A → Π(?). O is the observation function1 . It gives for each action and resulting state the probability distribution over the possible observations. We write O(s a, o) to represent the probability that observation o was obtained when action a resulted in state s .

1 Note that in Puterman’s book [Put94] the transition and observation functions are referred to as the transition and observation models.

3.3. (PO)MDP settings Further assumptions

37

The POMDP model makes the assumption that the next state for a given action only depends on the current state. Looking back at the state history should not change our probabilities. So we are making the Markov assumption. An MDP is a POMDP without imperfect observations and is sometimes called FOMDP (Fully Observable MDP). That is, the observations ? and the observation function O are not present in MDPs. Instead, the agent always correctly determines the state S it ends up in, after taking action A at state S. Note that the outcome of the actions is still probabilistic, the di?erence being that the agent can just determine directly the state it is in. Optimality criteria The goal of an agent using a POMDP is to maximise the future reward for an in?nite time horizon. However, the methods of optimising over an in?nite time horizon for general problems require an in?nite amount of computational time. Since this is not possible, two alternative optimality criteria are proposed in the POMDP community. The ?rst one is the ?nite horizon model. The other is the in?nite-horizon discounted model. In the ?nite horizon model we are trying to maximise the reward EUT over the next T steps:

T

EUT =

t=0

Rt

where Rt is the reward obtained at time-step t. It can be claimed that this criterion is inconvenient since the value of T has to be known in advance and it is hard to know T exactly. For our problem it is not too hard to pick a value of T based on the duration in real time units (e.g. seconds) of a time-step and the length of a shift of the robot (e.g. a night). This optimality criterion is almost identical to our expected cost criterion. In the in?nite-horizon discounted model, the discounted sum of the rewards over an in?nite horizon EDU is maximised. The discounting is done using a discount factor 0 < γ < 1 in order to guarantee that the sum is ?nite.

∞

EDU =

t=0

γ t Rt

where Rt is as before. Policies and belief states A policy describes how the agent should behave in some speci?c environment. For the MDP case (POMDP without the partial observability) a policy π : S → A

38

Chapter 3. Problem Formalisation and Complexity

speci?es what action should be taken in every environment state. Solving an MDP can be seen as equivalent to ?nding a policy for the given MDP. In the case of POMDPs the situation gets slightly more complicated. This is due to the fact that the agent does not know exactly what the current state of the environment is. For POMDPs the policy is speci?ed on belief states rather than on environment states. A belief state is probability distribution b over the possible actual world states S. This distribution encodes the agent’s subjective probabilities about the state of the environment and provides the basis for decision making. We let b(s) denote the belief (subjective probability) of the agent that the current state is s. A state estimator has to compute the new belief state b given the old belief state b, action a, and observation o. 3.3.1. Proposition. The new belief in some state s is b (s ) and can be written as: O(s , a, o)Σs∈S T (s, a, s )b(s) b (s ) = P (s |o, a, b) = P (o|a, b) The following assumptions are made: Assumption 1: The probability of an observation o depends only on the state s and action a: P (o|s , a, b) = P (o|s , a) = O(s , a, o). Assumption 2: The probability of moving into a new state s , given a previous state s and an action a, does not depend on the old belief state b: P (s |a, b, s) = P (s |a, s) = T (s, a, s ). Assumption 3: The probability of being at state s, given a belief state b and an action a, is independent of the a and is just the belief b of being at state s: P (s|a, b) = b(s) Proof. The proof is by substitution. b (s ) = P (s |o, a, b) P (o|s , a, b)P (s |a, b) = P (o|a, b) O(s , a, o)P (s |a, b) = P (o|a, b) O(s , a, o)Σs∈S P (s |a, b, s)P (s|a, b) = P (o|a, b) O(s , a, o)Σs∈S T (s, a, s )b(s) = P (o|a, b) de?nition of b Bayes’ rule assumption 1 total probability rule assumptions 2,3

The denominator in the equation of proposition 3.3.1 can be seen as a renormalisation factor that causes b to sum up to 1. It can be computed as: P (o|a, b) = Σs ∈S O(s , a, o)Σs∈S T (s, a, s )b(s)

3.3. (PO)MDP settings

39

The state estimator SE(b, a, o) gives as an output the new belief state b by repeating the calculation of b (s ) for all s ∈ S. Now that the concept of a belief state has been introduced, we can talk about belief MDPs. A belief MDP is de?ned as follows: ? B is the set of belief states. ? A is the set of actions, de?ned as for the POMDP. ? τ (b, a, b ) is the new state transition function and can be computed as follows: ,b,a,o) τ (b, a, b ) = P (b |b, a) = Σo∈? P (b (b,a) P = Σo∈? P (b |b, o, a)P (o|b, a) where: P (b |b, o, a) = 1 if SE(b, a, o) = b 0 otherwise

? ρ(b, a) is the reward function on the belief states and is de?ned as: ρ(b, a) = Σs∈S b(s)R(s, a)

3.3.2

POMDP setting

In this section the problem of ?nding ?res described in section 3.2.1 will be set as a POMDP. We describe here all the parts of the resulting POMDP plus the initial belief state. States S. f1 , . . . , fn , Xl , fi ∈ {0, 1}, Xl ∈ X, where fi is a Boolean variable being 1 if a ?re is present in room Xi , and Xl is the current robot location. Actions As . The actions are state dependent. For a state S = f1 , . . . , fn , Xl the available actions are of the form GO(Xg ) with an action for each Xg such that (Xl , Xg ) ∈ A. So the actions available at each state are dependent on the robot’s immediate accessibility relation (A ? X × X). Transitions T : S × A → Π(S) The transitions are speci?ed as follows: P (s = f1 , . . . , fn , Xj ) = 0 if j = g P (s = f1 , . . . , fk , . . . , fn , Xg ) = 0 if fk < fk &k = l P (s = f1 , . . . , fn , Xg ) =

fa >fa

For s = f1 , . . . , fk , . . . , fn , Xl and a = GO(Xg ) the probabilities are:

P (fa → 1)

fb =0 (1

? P (fb → 1))

The ?rst if-statement states that the robot cannot end up in the wrong location. This is a consequence of the perfect navigation-localisation assumption. The second if-statement denotes that ?res do not stop on their

40

Chapter 3. Problem Formalisation and Complexity own. This is a consequence of the assumption that only the robot can put out ?res. The third if-statement calculates the probability of all remaining states (those not covered by the ?rst two if-statements) as the product of the probability of ?res starting and the probability of ?res not starting.

Immediate Rewards R : S × A → R. If in state s action a is taken, then a lot of resulting states s are possible with di?erent probabilities (speci?ed in the transitions section). The utility of each possible resulting state s = f1 , . . . , fn , Xl is U (s ) = C(fl ) if fl = 1 and U (s ) = 0 otherwise. This makes the expected reward for the state/action combination: R(s, a) =

s

T (s, a, s ) ? U (s )

Observations ?. Observations ? have the form: fl where fl is a Boolean variable signifying whether a ?re is present at the current robot location l. Observation function O : S × A → Π(?). The observation function for state s = f1 , ..., fn , Xl and action a = GO(Xl ) is: if fk = fl else P (o = fk ) = 0 P (o = fl ) = 1

You can notice that although the POMDP model requires us to specify the action on top of the resulting state, this action is not necessary in our case for specifying the observation function. Initial State Initially, only state A0 is considered possible. This is represented by having a belief state b such that b(A0 ) = 1 and b(s) = 0 for all other s ∈ S. Then as time passes by, the belief is distributed over more states. Belief MDP resulting from our POMDP Since now we know what the POMDP for our problem is, the question of what the resulting belief MDP looks like can be answered. To do that we examine the part of section 3.3.1 on belief states for POMDPs and specify the corresponding belief MDP components step-by-step: States B. A single belief state is a distribution over all possible states S of the POMDP. For example, b( 1, 1, . . . , 1, Xl = 2 ) = 0 can be our belief that everything is on ?re and that we are currently in location Xl = 2. This belief is part of one belief state, and B is the set of all possible belief states. One important characteristic of the belief states in our problem is that only the world states with the correct robot location are possible. This is an artifact of the perfect robot localisation assumption. So b( . . . , Xk ) = 0 if Xk = Xl where Xl is the robot’s location.

3.3. (PO)MDP settings

41

Actions Ab . The actions are as before. They are belief state dependent. For a belief state b the available actions are of the form GO(Xg ) with an action for each Xg such that (Xl , Xg ) ∈ A. Here note that Xl can be extracted from the belief state b by looking at just one state s such that b(s) > 0 Transitions τ : B × A → Π(B). These transitions have to be computed as previously described. Immediate Rewards ρ : B × A → R. The reward function of the belief states B is de?ned as: ρ(b, a) = Σs∈S b(s)R(s, a) = Σs∈S b(s)Σs inS T (s, a, s ) ? U (s ). So R(s, a) is substituted with the value it has for the POMDP.

3.3.3

Direct MDP settings

Although it is possible to obtain a belief MDP from the POMDP setting of the problem, another alternative is to de?ne an MDP directly based on the model of the environment introduced at the beginning of this document. The resulting MDPs are “belief MDPs” in the sense that they specify how our belief should be updated. A di?erence from the belief MDP resulting from the POMDP is that here our state is related to how much we believe a ?re to be present in each environment location. In the POMDP case our belief state is how much we believe an entire state of the environment is possible. First setting as MDP States S. t1 , . . . , tn , Xl , ti ∈ [0, ∞), Xl ∈ X, where t1 to tn are the times since a room was last visited and Xl is the current robot location. Actions As . The actions are state dependent. For a state s = t1 , . . . , tn , Xl the available actions are of the form GO(Xg ) with an action for each Xg such that (Xl , Xg ) ∈ A. Transitions T : S × A → Π(S). The transitions are completely deterministic and so each (s, a) pair corresponds to a single state s . So T : S ×A → Π(S) has now the form T : S × A → S T (s, a) = T ( t1 , . . . , tl , . . . , tn , Xl , GO(Xg )) → t1 + 1, . . . , tl = 1, . . . , tn + 1, Xg ) In the setting of section 3.2.1, once a room is visited, the ?re is immediately extinguished. Here we e?ectively do the same but the e?ects of the extinguishing action are delayed until the next time-step. This is done so that

42

Chapter 3. Problem Formalisation and Complexity the reward of visiting a state where the room is on ?re can be speci?ed. If the ?re was extinguished immediately, then reaching a state where the room is on ?re would be no di?erent than reaching a state where the room is not on ?re. This would make the reward speci?cation very hard.

Immediate Rewards R : S × A → R. Since the actions are deterministic, the state action pair (s, a) always results in the same state s . Then the reward is: R(s, a) = C(fl ) ? P (fl ) where P (fl ) = 1 ? (1 ? P (fl → 1))tl and Xl is the robot location in s . So given the state s , the a priori probabilities of a ?re starting, and the cost for each room, the reward for each (s, a) can be computed. Second setting as MDP States S. p1 , . . . , pn , Xl , pi ∈ [0, 1]2 , Xl ∈ X, where pi is the probability of a ?re being present in a room Xi , and Xl is the current robot location. Actions As . The actions are state dependent. For a state s = p1 , . . . , pn , Xl the available actions are of the form GO(Xg ) with an action for each Xg such that (Xl , Xg ) ∈ A. Transitions T : S × A → Π(S). The transitions are completely deterministic and so each (s, a) pair corresponds to a single state s . So T : S ×A → Π(S) has now the form T : S × A → S T (s, a) = T ( p1 , . . . , pl , . . . , pn , Xl , GO(Xg )) → p1 = 1 ? (1 ? p1 )(1 ? P (f1 → 1)), . . . , pl = P (fl → 1), . . . pn = 1 ? (1 ? pn )(1 ? P (fn → 1)), Xg So at every time-step the ?re starting probabilities are used to update the current state of the environment. The e?ect of the ?re extinguishing action is delayed until the next time-step for the same reason as in the transition model of the ?rst MDP setting. Immediate Rewards R : S × A → R. Since the transitions are completely deterministic each (s, a) pair corresponds to a single state s . So, given the state s and the cost for each room, the reward for each (s, a) can be computed as R(s, a) = C(fl ) ? pl , where Xl is the robot location in s .

2

note pi is an alternative notation for P (fl )

3.3. (PO)MDP settings

POMDP model 5(*) 4(e) Original Model Belief MDP model 2(e) MDP model 2 1(*) MDP model 1

43

3(e)

Figure 3.5: Equivalence demonstration schema. Equivalence proof is supplied for (*)

and equivalence claim is made plausible by an example for (e).

3.3.4

Equivalence of di?erent settings

In this section we show that the various surveillance models presented so far are representationally equivalent. That is, we show the underlying problem to be the same, no matter which representation is used. The model equivalence is demonstrated by means of a collection of pairwise equivalences. There are some proofs involved and the equivalence demonstration schema can be seen in ?gure 3.5. Some of the equivalences are made plausible by means of working out by hand example 3.2.4 for both the belief POMDP and the second MDP settings. The choice to use this example is slightly arbitrary. The main reason for choosing it is that it is simple and this makes it possible to work it out by hand. In fact, for the case of belief MDPs this simulation on example 3.2.4 is still too long and it is exhibited in appendix A to save space in the text of this chapter. Although, this example is simple, it has been used in the section on strategies to demonstrate some di?erences between them. We believe that it is hard to ?nd an example where these equivalences would not hold but, unfortunately, we have not been able to ?nd proofs. Two MDP models (equivalence 1) We begin our equivalence proofs by demonstrating that the following proposition holds (label 1 of ?g. 3.5). 3.3.2. Proposition. The two direct MDP settings are equivalent. Proof. To prove this we show that there is a mapping between the states of the two MDPs and that it does not matter which path of ?gure 3.6 we follow during state transitions. We begin by showing that there is mapping between states. Based on proposition 3.2.2 and the de?nition of states in the second MDP model, we can say that the probability pi of a room being on ?re in a state s2 of the second MDP

44

Chapter 3. Problem Formalisation and Complexity

model is pi = 1 ? (1 ? P (fi → 1))ti , where ti is the time since the last visit in the corresponding state s1 of the ?rst MDP model. This formula gives us an easy way of converting states of the ?rst MDP model into states of the second. This formula can be inverted. Beginning from: pi = 1 ? (1 ? P (fi → 1))ti we can write: (1 ? P (fi → 1))ti = 1 ? pi ? ti log(1 ? P (fi → 1)) = log(1 ? pi ) giving ?nally: ti = log(1 ? pi ) log(1 ? P (fi → 1))

This last formula gives us a way of converting states of the second MDP model into states of the ?rst. Actually, the above equation is unde?ned if P (f i → 1) = 0 or P (fi → 1) = 1. This means that in those two cases we cannot really tell how long ago a room was visited from its probability of being on ?re. Now that the mapping between states has been demonstrated, we continue by showing the equivalence of state transitions. Suppose that the state in the ?rst MDP is s1 = t1 , . . . , tl , . . . , tn , Xl . Using the information in s1 the corresponding state s2 of the second MDP can be found: s2 = p1 = 1 ? (1 ? P (f1 → 1))t1 , . . . , pl = 1 ? (1 ? P (fl → 1))tl , . . . , pn = 1 ? (1 ? P (fn → 1))tn , Xl Then suppose that action GO(Xg ) is taken while at state s2 , the resulting state s2 is: s2 = p1 = 1 ? (1 ? p1 )(1 ? P (f1 → 1)), . . . , pl = P (fl → 1), . . . , pn = 1 ? (1 ? pn )(1 ? P (fn → 1)), Xg State s2 can be rewritten after substitution of values from s2 : s2 = p1 = 1 ? (1 ? P (f1 → 1))t1 +1 , . . . , pl = P (fl → 1), . . . pn = 1 ? (1 ? P (fn → 1))tn +1 , Xg Note now that s2 is equivalent (using the inverse mapping mentioned above) to: s1 = t1 + 1, . . . , tl = 1, . . . , tn + 1, Xg But s1 is the same as s1 that can be directly computed from state s1 when transition GO(Xg ) is followed. We have thus shown that the transitions in the two MDPs produce equivalent results.

3.3. (PO)MDP settings

MDP 1 t1 , . . . , tl , . . . , tn , Xl

6 ?

45

GO(Xg )t1 + 1, . . . , tl = 1, . . . , tn + 1, Xg

6 ?

MDP 2 p1 , . . . , pl , . . . , pn , Xl

GO(Xg )-

Figure 3.6: Equivalence sketch for the two MDPs. Original and Belief MDP model (equivalence 4)

p1 = 1 ? (1 ? p1 )(1 ? P (f1 → 1)), . . . , pl = P (fl → 1), . . . , pn = 1 ? (1 ? pn )(1 ? P (fn → 1)), Xg

We now make plausible that the following claim holds (label 4 of ?g. 3.5). 3.3.1. Claim. The belief MDP and the original models are equivalent. The example given in appendix A can be seen as an informal demonstration that the original setting and the belief MDP are equivalent. The justi?cation of this statement is that although the settings of the given problem are di?erent, the results computed by iterating the two models are the same, both in terms of robot actions taken and in terms of expected action bene?t. Belief MDP and direct MDP model (equivalence 2) We now make plausible that the following claim holds (label 2 of ?g. 3.5). 3.3.2. Claim. The belief MDP and the second direct MDP models are equivalent. To make this equivalence plausible, we will give a setting of example 3.2.4 within the second MDP model, then iterate the resulting MDP, and show that the same results are computed. States S. p1 , p2 , Xl , pi ∈ [0, 1], Xl ∈ {X1 , X2 }. Actions As . Here the actions are not state dependent because the state is too small. Two actions are available: GO(X1 ) and GO(X2 ). Transitions T : S × A → Π(S). The transitions are:

T (s, a) = T ( p1 , p2 , X1 , GO(Xg )) → p1 = 0.5, p2 = 1 ? (1 ? p2 )(1 ? 0.8), Xg T (s, a) = T ( p1 , p2 , X2 , GO(Xg )) → p1 = 1 ? (1 ? p1 )(1 ? 0.5), p2 = 0.8, Xg

Immediate Rewards R : S × A → R. The rewards are given as previously described.

46

Chapter 3. Problem Formalisation and Complexity

Now given this concrete setting, the MDP can be iterated with a one-step look-ahead. The complete iteration can be found in ?gure 3.7. The starting state is the state where the probability of both rooms being on ?re is zero. Then both actions are checked to determine their rewards R(s, a) and the action with the largest reward is taken. This procedure is iterated a few times. It can be seen that the actions taken during this iteration are the same as those in appendix A. Furthermore, the rewards on which the decisions are based are also the same as those found in appendix A. Original and direct MDP model (equivalence 3) This example can be seen to make the following claim (label 3 of ?g. 3.5) plausible. 3.3.3. Claim. The original and the second direct MDP models are equivalent. POMDP and Belief MDP model (equivalence 5) 3.3.3. Proposition. The POMDP and belief MDP models are equivalent. This proposition corresponds to label 5 of ?gure 3.5. This is true according to the de?nition of belief MDPs. A proof that the belief state is a su?cient statistic for past history of observations of the POMDP can be found in [SS73]. Their proof shows that it is not necessary to remember the entire history of the POMDP to determine its next state. That is, knowing the old belief state, the action taken and the observation received are su?cient to determine the new belief state. Furthermore, the state estimator SE(b, a, o) is derived as part of their proof.

3.4

Surveillance state space size

Having described several problem settings and having shown that these are equivalent, we proceed to derive the state space size for the surveillance problem. We begin by giving a simple argument about why it is exponential and give a possible objection to that argument. We then go on by proving that the state space size is indeed exponential. In the case of surveillance the state space is described as a tuple of times since last visit containing one such time per room. A single state can be described as: t1 , . . . , tn , Xl , ti ∈ [0, ∞), Xl ∈ X The times since last visit are in theory allowed to go to in?nity, which makes the state space in?nite in size. However, in normal situations not all times are used because the behaviour of the robot exhibits some periodicity. Consequently, a limit T is imposed on how long a time since last visit can be. The state space

3.4. Surveillance state space size

47

s1 = 0, 0, X1 a1 s2 = 0.5, 0.8, X1 R(s1 , a1 ) = 0.5 a1 s4 = 0.75, 0.8, X1 R(s3 , a1 ) = 0.75 a1 s6 = 0.875, 0.8, X1 R(s5 , a1 ) = 0.875 a1 s8 = 0.5, 0.96, X1 R(s6 , a1 ) = 0.5 a1 s4 = 0.75, 0.8, X1 R(s9 , a1 ) = 0.75 a2 s3 = 0.5, 0.8, X2 R(s1 , a2 ) = 0.8 a2 s5 = 0.75, 0.8, X2 R(s3 , a2 ) = 0.8 a2 s7 = 0.875, 0.8, X2 R(s5 , a2 ) = 0.8 a2 s9 = 0.5, 0.96, X2 R(s6 , a2 ) = 0.96 a2 s5 = 0.75, 0.8, X2 R(s9 , a2 ) = 0.8

Figure 3.7: Iterating the second MDP setting.

48

Chapter 3. Problem Formalisation and Complexity

permissible time

T?n+3 T?n+2 n+2 n+1 n n?1 n?2 3 2 1 room number

1

2

3

4

n?1 n

Figure 3.8: Permissible times vs room number. size is exponential in the number of rooms included in a state (it is T n , where n is the number of rooms). One can object to the previous argument and say that the state space size is not exponential in the number of rooms. A speci?c time since last visit for a single room sets constraints on the times since last visit for the other rooms. Those constraints are imposed by (a) the fact that the robot can be in only one room at a given time and (b) by the connectivity between the rooms. The constraints seem to imply that the state space size does not increase exponentially in the number of rooms.

3.4.1

State space size derivation

Here it is argued that although the argument about the presence of constraints and about the state space size not being T n is correct, the state space size is, in fact, still exponential. To show this we ?rst limit the types of states we are examining by: (a) concentrating on a corridor environment, and (b) concentrating on the type of states where the robot is on the leftmost room of the corridor. A diagram representing the admissible states of this type can be generated (?g. 3.8). You can see in the boxes at the bottom of the diagram that the robot is in the leftmost room and this room can only have a time since last visit of 0. The rest of the rooms can have times at some interval between 0 and T (shaded region), but not all values are allowed because of the constraints mentioned.

¤ ?? ¤??¤ ???¤? ????

?????????? ? ? ? ? ? ? ? ? ? ????????????????????????? ??????????????????? ??????????????????? ???????????????????? ???????????? ??????????????????? ????????????????????????? ??????????????????? ??????????????????? ???????????? ???????????? ???? ? ? ? ? ? ? ? ????????????????????????? ??????????????????? ??????????????????? ???????????????? ???????????? ???????????????????? ???? ? ? ? ? ? ? ? ????????????????? ?????????????????? ???????????????????? ??????????????????? ???????????????? ????????????????????? ??????????????????? ??????????????????? ?????????????????? ????????????

T T?1

Legend: T n

horizon number of rooms example path path going back on itself permissible region

3.4. Surveillance state space size

49

Assuming that every room was visited at some point, the lowest time value allowed for any room is its distance from the leftmost room. This is 1 for room 2, 2 for room 3 and so on. An interesting observation is that there are n ? 1 places where our constraints apply and that the width of the possible options at each of those places is at least T ? (n ? 1). A possible MDP state corresponds to a path in this diagram. Note that although the times are drawn as straight lines, they are discrete, and so should be jagged. An example of a state represented as a path is the state 0, 2, 3, . . . , n ? 1, n + 1 which is depicted as a dashed line in ?g. 3.8. These paths have the further constraint of having to be monotonically increasing with a slope of at least 1. Paths that partially go back on themselves are possible but still the time since last visit would be strictly decreasing with a slope of -1. An example of such a path is given in ?gure 3.8 and the times since last visits generated by this path are represented by its lower line segment as: 0, 3, 4, . . . , n + 1, n + 2 . Perhaps the important observation is that it is the most recent visit that matters in determining the time since last visit and hence the state of the MDP. The number of such paths is, at least, like putting T distinguishable balls into n ? 1 indistinguishable boxes and then sorting those indistinguishable boxes based on the size of the ball in ascending order. The number of such possibilities is: T! T (3.1) = n?1 (n ? 1)!(T ? (n ? 1))! This formula gives us for a speci?c T and n the number paths satisfying the constraints in our diagram of ?g. 3.8. So we have a formula to compute how many states of this speci?c type (corridor, leftmost room) exist. In reality, there are a few more possibilities because the ?gure is wider on one end, and because paths that go back on themselves are allowed.

3.4.2

The value of T and the exponential expression

The size of T is related to the number of states that can be represented. If T = (n?1)! n ? 1, we get only one state (n?1)!0! = 1 , so the rest of the states encountered by the robot in a circular path cannot be represented. T should be set to a value that is large enough to accommodate all possible interesting paths and all the states that the robot would encounter along those paths. In the case of the corridor the least possible T to pick is 2(n ? 1) because we need so many time-steps to walk along the corridor and return to the original position. So a path just doing a round trip in the corridor needs at least T = 2(n ? 1) to be representable in our MDP. The example of the path going back on itself should be seen as an indication of why that is so. We can convert equation 3.1 into one not involving factorials by using their

50

Chapter 3. Problem Formalisation and Complexity

Stirling approximation. The Stirling approximation of factorial is: √ n n n! ≈ 2nπ e Applying this to the formula for combinations we get: √ T 2T π T T e ≈ √ m m 2mπ m 2(T ? m)π T ?m e e √ T 2T π T = 2π m(T ? m)mm (T ? m)T ?m

T ?m

The last equation is the general equation for combinations for any T and m. Now in our case m = n ? 1, and for a corridor case we can set T = am = a(n ? 1). Now we get: am m √ 2π 2amπ(am)am

≈ =

m(am ? m)mm (am ? m)am?m √ am am aa m

2πm(a ? 1)mm (m(a ? 1))m(a?1) √ a aa (a ? 1)(a?1)

m

giving: am m ≈

Equation 3.2 is the general exponential expression for any a. Taking a = 2, the number of states can be approximated as: 2(n ? 1) n?1 ≈ 1 π(n ? 1) 4n?1

2πm(a ? 1)

(3.2)

So for T = 2(n ? 1) we get an exponential growth. Therefore, even for the simple path in a corridor, the number of states available grows exponentially. Given that the probabilities and the costs can also vary, we are likely to have more complicated paths with the robot staying in some rooms. Those paths require T > 2(n ? 1) even for the corridor case, not to mention that there are other types of environment like stars (chapter 5) that might need an even larger T.

3.5

Standard (PO)MDP solving approaches

In this section we discuss several standard (PO)MDP solving approaches. The focus on (PO)MDP based methods is due to the fact that they are what is most

3.5. Standard (PO)MDP solving approaches

51

commonly used to solve other decision-theoretic problems in robotics. We hope it will become clear that none of these methods is really suited to solve interesting instances of the robot surveillance problem.

3.5.1

Value functions; value and policy iteration

In section 3.3.1 we have already de?ned a policy as a mapping π : S → A from states S to actions A that speci?es what action should be taken in every environment state. Given the policy of an MDP, and assuming that we are using the discounted in?nite horizon optimality model, we can de?ne for each state its in?nite horizon value function as: Vπ (s) = R(s, π(s)) + γΣs ∈S T (s, π(s), s )Vπ (s ) The value function is the expected cost resulting from starting in state s and the following policy π. Here the discounted optimality criterion of section 3.3.1 is used in which the value of states away from state s are discounted by γ. The discounted in?nite horizon solution to the MDP is a policy that maximises its expected value Vπ . Howard [How60] showed that there is an optimal policy π ? that is guaranteed to maximise Vπ , no matter what the starting state of the MDP is. The value function for this policy Vπ? , also written V ? , can be de?ned as: V ? (s) = max[R(s, a) + γΣs ∈S T (s, a, s )V ? (s )]

a

This has a unique solution and if the optimal value function V ? is known, then the optimal policy π ? is de?ned as: π ? (s) = argmaxa [R(s, a) + γΣs ∈S T (s, a, s )V ? (s )] There are two very standard ways of ?nding the optimal policy, (a) policyiteration and (b) value iteration. The policy iteration method (see algorithm 3.9) starts from a randomly chosen policy. The algorithm proceeds by repeatedly trying to ?nd alternative actions for each state that improve the current state value function. The new actions found replace the old policy actions. The iterative improvement of policies stops when no policy-improving actions can be found. In value iteration (see algorithm 3.10) optimal policies are produced for successively longer ?nite horizons until they converge with some error less than . Assuming that for a look-ahead of 0, V0 (s) = 0, the algorithm computes value functions Vt based on the policy found using the value function Vt?1 . The algorithm terminates when the maximum change in the value function is below a threshold. Policy iteration and value iteration can ?nd optimal policies in time polynomial in the number of states in the MDP. However, as we have shown, the number

52

Chapter 3. Problem Formalisation and Complexity

of MDP states is exponential in the number of locations used to describe the state space. This state space is de?ned as the cross product of those locations. So these methods cannot be used to solve the Direct MDP formalisations of section 3.3.3. Furthermore, in the case of POMDPs, the belief MDP corresponding to the POMDP has a continuous state space and this complicates matters further. In fact, in [PT87] it is shown that ?nding policies for POMDPs is a PSPACEcomplete problem and this makes exact solutions in polynomial time less likely than for NP-complete problems. Policy iteration and value iteration can, therefore, ?nd optimal policies but for the smallest POMDPs. In the rest of this section we are going to describe techniques for ?nding POMDP policies for larger problems. for each s ∈ S do π(s) ← RandomElement(A) end; repeat Compute Vπ (s) for all s ∈ S; for each s ∈ S do Find some a such that [R(s, a) + γΣs ∈S T (s, a, s )Vπ (s )] > Vπ (s); if such an a exists then π (s) ← a; else π (s) ← π(s); end end until π (s) = π(s) for all s ∈ S; return π; Figure 3.9: The policy iteration algorithm

for each s ∈ S do V0 (s) ← 0 end; t ← 0; repeat t ← t + 1; for each s ∈ S do for each a ∈ A do Qt (s, a) ← R(s, a) + γΣs ∈S T (s, a, s )Vt?1 (s ) end πt (s) ← argmaxa Qt (s, a); Vt (s) ← Qt (s, πt (s)) end until maxs |Vt (s) ? Vt?1 (s)| < ; return πt ; Figure 3.10: The value iteration algorithm

3.5. Standard (PO)MDP solving approaches

53

3.5.2

Piecewise linear discretisations

As previously mentioned, one of the problems in ?nding policies for POMDPs is that the state space of the belief MDP is continuous. Several methods have been proposed for solving POMDPs [Lov91]. All of them have to deal with the problem of continuous state spaces. One naive suggestion is that of discretising the continuous state space using a ?xed discretisation. However, this is not an e?cient way of dealing with the discretisation problem. In [SS73], it was suggested that for the ?nite horizon optimality model the optimal POMDP value function is piecewise linear and convex. For a given horizon, the continuous belief space can be split into regions (pieces) and within those regions the optimal value function can be described using a linear function of the belief space. The region boundaries arise naturally as a side e?ect of the ?xed ?nite horizon. The piecewise value function is convex in that the surface de?ned by the union of the hyper surfaces within each region is convex. For a proof of those two statements look at [SS73]. These two properties can be used to derive a POMDP version of value iteration that discretises the environment accurately at the borders of these regions in each iteration. In [Son78] this method has been extended for the in?nite discounted horizon POMDP optimality model. The function remains piecewise linear because the value iteration in the in?nite horizon case stops iterating when the di?erence in the value functions between iterations is under a limit. Therefore, the value function computed in the in?nite-horizon version of value iteration is still computed in a ?nite number of iterations. Hence, the piecewise and convex properties are still present. If the value iteration was to continue for an in?nite number of iterations, the resulting function would still be piecewise linear, but would possibly have in?nite piecewise segments. The discounting is necessary to guarantee that the optimal value function will be ?nite and this, in turn, is necessary to guarantee that value iteration will eventually converge with this optimal value function. The value iteration methods used for POMDPs using the piecewise linearity property can solve problems where the underlying MDP has around 2-5 states [LCK95b] So these methods are rather restrictive. In [KLC96] an improved version of value iteration for POMDPs called the “witness algorithm” is proposed. In [LCK95a] it is mentioned that the witness algorithm can deal with up to around 16 states in the underlying MDP, but this is still a rather restrictive result. In [LCK95b] an attempt is presented to ?nd approximate solutions that can provide satisfactory solutions for problems with around 100 states. An even newer approximation method [MKKC99] has been used to solve problems with up to 1000 states. However, in our problem the underlying MDP has at least 2|X| states where X is the set of all possible environment locations (as described in section 3.2.1). For |X| = 32 the number of states is close to 4 billion. The conclusion is that the methods based on the piecewise linearity of the value function performing either exact or approximate

54

Chapter 3. Problem Formalisation and Complexity

solution computation cannot be used for our problem.

3.5.3

Planning with factored representations

As previously mentioned, the state space of an MDP is exponential in size in the number of literals used to describe it. Consequently, the MDP state transition and state reward tables are also exponential in size in the number of literals used to describe the state space. In [BDH96] the distinction is made between ?at and factored state spaces. A ?at state space is a state space where a single literal is used to describe each state - the literal taking on as many values as the number of states in the state space. A factored state space is one where each state is composed of many literals (factors / ?uents) - each literal taking on fewer values than the size of the state space. Note that we have already been implicitly considering factored state spaces in our discussion of the surveillance problem. The main observation in [BDH96] and in later articles [BDH99, HSAHB99] is that in some problems independences can be exploited to reduce the size of the problem description. At the action level it may happen that the post action value of a literal depends on the pre-action values of only a few literals. Similarly, the reward of an action and the value of a state might be structurable on the value of state space literals. It can possibly be so that some literals will not in?uence at all the value of the state. The suggestion in [BDH96, BDH99] is to use temporal Bayesian networks to represent the state transition probabilities. The claim is that Bayesian networks can reduce the space needed for transitions, from exponential in the number of literals (using transition tables) to polynomial (using Bayesian networks). In ?gure 3.11 such a Bayesian network is used to represent the action GO(X3 ) of our surveillance application. From the network and the conditional probability tables (CPTs) associated with it, one can see that the robot location Xlt+1 , after action GO(X3 ) is taken, does not depend on what the location Xlt was in the pre-action state. Similarly, the presence of a room ?re at t + 1 only depends on whether a room ?re already existed at t and on what the location Xlt was. One such Bayesian network has to be introduced for each action in our environment. Actually, the Bayesian network only requires less space than the state transition table when independence structure is present, that is, when each time t + 1 literal does not have all time t literals as parents. If no independence structure is present, it does not make sense using Bayesian networks in the representation. However, typically such structure exists. The other suggestion in [BDH96, BDH99] is using in?uence diagrams to compactly represent the value/reward of a state based on the value of the literals. This makes sense if the reward is only dependent on a few literals. In our case the value depends on the current location Xl and the presence of a ?re fl at the current location Xl . This means that all literals f1 . . . fn have to be used in the in?uence diagram and so the in?uence diagram is not signi?cantly smaller than a

3.5. Standard (PO)MDP solving approaches

t+1 t Xlt f3 f3 X1 T 1.0 X1. F 0.2 .

55

f1 f2 f3

. . .

f1 f2 f3

. . .

Xl T 0.2 Xl . F 0.2 . Xn T 1.0 Xn F 0.2 Xlt Xlt+1 X1 X3 . .

. . . . . . . .

. .

. .

fn X1

Time t

fn X1

Time t+1

Xn X3

Figure 3.11: Action network for the GO(X3 ) action with two CPTs shown.

Xl f 1 X1 T X1 T X1 F X1 F X2 T X2 T X2 F X2 F f2 T F T F T F T F Reward Xl 1.0 X2 X1 1.0 f1 f1 0.0 T F T F 0.0 f2 f2 1.0 1 0 T F T F 0.0 1.0 1 0 1 0 0.0

f1 f2 Xl

Reward

Figure 3.12: The reward in?uence diagram and the decision tree for the two-room example. reward table. The decision tree corresponding to the in?uence diagram, however, is smaller than the reward table. Both of these suggestions are used in [BDH96, BDH99] to produce an algorithm called SPI. This algorithm is based on Sondik’s value iteration, but instead of computing piecewise linear value functions, it is computing new in?uence diagrams to represent the state value function. The algorithm performs better than classical piecewise linear value function algorithms because only literals that are relevant to the outcome under a particular action are considered. Knowing the state value function trivially gives us the policy by picking the action that takes us to the best state. In [HSAHB99] algebraic decision diagrams are used for representing rewards and value functions instead of in?uence diagrams. Algebraic decision diagrams are generalisations of the binary decision diagrams often used in computer circuit

56

Chapter 3. Problem Formalisation and Complexity

design. Algebraic decision diagrams allow assigning non-Boolean values to leaves and so can be used to represent rewards (which can be real valued). Various operations are de?ned upon algebraic decision diagrams, such as adding two diagrams or multiplying them. The claim in [HSAHB99] is that the algorithms based on algebraic diagrams can bene?t from the fast tree merging implementations already existing for Boolean decision diagrams. Furthermore, an approximate method is suggested whereby parts of the algebraic decision diagrams that have little e?ect on the ?nal value of a state are pruned to reduce the computation.

3.5.4

Reinforcement learning using neural networks

In this section we discuss reinforcement learning using neural networks as a method for approximately computing the value function. We begin by discussing the problem described in [CB96] of optimal elevator dispatching policies for a building with 4 elevators and 10 ?oors. This problem has several characteristics similar to those of our problem. The elevators are responding to call buttons being pressed. In our problem, if uniform room costs are assumed, the surveillance robot will visit rooms likely to be on ?re. In a sense pressed buttons can be thought to be corresponding to ?res. Furthermore, a lift takes into account, while parking, where it is more likely to be called up next (in our problem, the robot looks at the probability of a ?re starting). However, there are also some di?erences. Firstly, the call buttons can be thought of as perfect sensors distributed over all ?oors (in our case, each room would have a perfectly reliable ?re sensor). Secondly, queues of people waiting for a lift can be formed (in our case, multiple ?res would be present in a room). Thirdly, once the person is in the lift, the lift still has to transport the person to the right ?oor so the task is not equivalent to just picking up the person (as it is in our case). Finally, in the elevator problem we are dealing with a one-dimensional motion problem, while in the surveillance problem, connections between rooms play an important role. So the two problems are not immediately equivalent. However, here too, the state space is huge since all possible combinations of call and elevator buttons plus all the possible lift positions and directions have to be represented. The solution proposed in [CB96] combines the concept of reinforcement learning with a neural network based approach to it. The neural network and reinforcement learning combination performs better in this problem than the standard known solutions to the elevator dispatching problem. However, we have several objections to this solution. Our ?rst objection concerns the number of inputs and the input signi?cance, as part of the state representation is ignored in the inputs, and some inputs correspond to heuristically determined properties. Our second objection has to do with the number of hidden units and various learning parameters, such as the discount factor and learning rate, which were experimentally set . Finally, no attempt of justi?cation for those choices was made. The authors

3.5. Standard (PO)MDP solving approaches

57

themselves admit that a signi?cant amount of experimentation was necessary for determining the appropriate network shape parameters. Another frequently cited example of good performance exhibited by reinforcement learning methods is that of TD-Gammon [Tes95, Tes94, Tes92]. TDGammon is a backgammon-playing program that uses a version of reinforcement learning called temporal di?erence learning (TD-learning) [Sut88] to learn the playing policy. The problem in TD-Gammon is very di?erent from ours since a policy that can be followed against intelligent opponents has to be found. This is the domain of game theory instead of decision theory. In TD-Gammon the agent learns the best playing policy by starting without any a priori knowledge about the game of backgammon, apart from the rules. That is, there is no knowledge derived from experts about how a player should play backgammon in order to win. TD-Gammon plays games of backgammon against a copy of itself and learns from the mistakes it makes. At the end of each game a reinforcement (reward) for the entire game is computed (using the backgammon scoring scheme). The problem then becomes one of assigning some of this game reinforcement to each of the actions performed by the program during the game. A similarity between the solution in [CB96] and the elevator problem is that in both cases the value function for each state is represented in a multi-layer perceptron and that the correct value function is learned using back propagation. The results obtained with TD-Gammon are impressive. A policy, good enough to follow against master level humans, is found. If some a priori knowledge is incorporated in the input to the multi-layer perceptron in the form of interesting board con?guration features, then TD-Gammon can reach grand master level play. However, as in the elevator scheduling problem, some parameters have to be con?gured experimentally. Furthermore, and perhaps more importantly, no de?nite explanation of its good performance can be given. In fact, it is always possible that TD-Gammon will ?nd a locally optimal solution although, according to the authors, that seldom happens. From our analysis of TD-Gammon and of the elevator scheduling problem solution we conclude that although after some experimentation good reinforcement learning solutions using neural networks can be found for those problems, they are not necessarily open to introspection and cannot be guaranteed to be optimal or near optimal.

3.5.5

State-action pair sampling

As we have previously mentioned, standard exact methods, like value and policy iteration in the case of MDPs, have runtimes polynomial in the number of states in the underlying MDP and exponential in the number of problem variables. In [KMN99] a method is proposed whose runtime is not dependent on the number of states in the MDP. This method relies on sampling the look-ahead tree and producing a sparse look-ahead tree that is dense enough to guarantee some

58

Chapter 3. Problem Formalisation and Complexity

optimality conditions. The algorithm uses a generative model G that, given an action a and a state s, randomly outputs a new state s according to the transition probabilities T (s, a, s ). The algorithm takes an on-line view; given a state, a decision about the best action has to be taken. No ?xed policy is computed as in value or policy iteration. Going back to the optimality conditions the algorithm guarantees that, given the generative model G, an input state s ∈ S and a tolerance value > 0, the action output satis?es the following conditions: ? The runtime is O(kC)H , where k is the number of actions available at each state, C is the number of samples taken for each state-action pair, and H is the look-ahead depth of the tree. C and H are determined by the algorithm but are independent of the state space size (see ?g. 3.14). ? The value function of the approximate policy V A (s) is such that its di?erence from the optimal policy V ? is below : |V A (s) ? V ? (s)| ≤ The complete algorithm A can be found in ?gure 3.14. The top level function calculates the right values for C and H given the required tolerance and discount ? factor λ. Subsequently, it proceeds to compute an estimate Q? of the optimal H ? ? state-action value function QH and select the action with the best Q? value. H The Q functions should be seen to be a special kind of value functions V with two arguments, one for the states s we begin from and another for the action a taken at that state. The computational advantage of the algorithm derives ? from the approximate computation of the state-action value function Q? . In the H ? ? (s, a) (line 1 in the algorithm), only C sample resulting states calculation of Qh are considered instead of the full state space S. If the size of the state space |S| = N ≥ C, then a computational advantage can be gained because the tree searched is smaller than the full search tree. A proof that the algorithm really satis?es the above-mentioned optimality criteria can be found in [KMN99]. The state-action sampling approach is well suited for situations where an action can take the system to every state s ∈ S. However, as was seen in the example of appendix A, our belief MDPs have rather structured transition probability tables and this structure can probably be better exploited. Furthermore, in [MV98], we have shown that our search for a decision-theoretic strategy using the original problem setting could provide us with a solution of O(k h ) time complexity, where k is the number of actions available at each location and h is the number of steps to look ahead. There, as well as in the MDPs proposed in section 3.3.3, the actions are completely deterministic and only a single resulting state is possible for each state-action pair. So, in fact, the decision-theoretic search can do better than the state-action sampling approach for our speci?c problem.

3.6. Summary

s0

59

a1

a2

a1

a2

a1

a2

a1

a2

a1

a2

a1

a2

a1

a2

Depth H

a1

a2

Figure 3.13: The look-ahead tree for the state-action sampling algorithm for C = 3

and k = 2.

Furthermore, the interpretation of tolerance here is not immediately intuitive. Tolerance is not de?ned as a proportional di?erence from the optimal value function. One needs to have an idea of what the value of the optimal value function is before a reasonable tolerance can be speci?ed. One seeming advantage of the discussion in [KMN99] over our discussion in [MV98] is that given the tolerance and the discount factor, the depth H that has to be used in the search in order to achieve the desired tolerance can be computed. In [MV98], no optimality guarantees are given and the maximal depth used is only determined based on computational constraints (the robot has to act in real time). Although the approach of [KMN99] is more formal, if some reasonable values ( = 0.2, γ = 0.9, Rmax = 1) are substituted in the equations of algorithm 3.14, large values are computed for C and H (C = 3003, H = 39). For such large values for C and H the runtime will most probably be very large, making the application of this algorithm unrealistic for the desired C and H. In any case, we hope that from this section it is clear that our problem has transition tables with a rather speci?c structure and that the state-action pair sampling algorithm is designed for problems with a di?erent type of transition table structure.

3.6

Summary

There are several equivalent ways in which the surveillance problem can be set. The state space size is exponential in the number of locations used to describe the environment in all these settings. The exponential state space size in conjunction with the results in [PT87] implies that standard exact MDP solving methods

60

Chapter 3. Problem Formalisation and Complexity

Function: Estimate Q(h, C, γ, G, s) if h = 0 then return (0, . . . , 0); for each a ∈ A do generate C samples using G; let Sa be the set containing these C samples; end for each a ∈ A do let our estimate of Q? (s, a) be:

1 ? Q? (s, a) ← R(s, a) + γ C Σs ∈Sa Estimate V(h ? 1, C, γ, G, s ) h

end ? ? ? return (Q? (s, a1 ), Q? (s, a2 ), . . . , Q? (s, ak )) ; h h h Function: Estimate V(h, C, γ, G, s) ? ? ? (Q? (s, a1 ), Q? (s, a2 ), . . . , Q? (s, ak )) ← Estimate Q(h, C, γ, G, s); h h h ? ? return maxa∈{a1 ,...,al } {Qh (s, a)} ; Function: Algorithm A( , γ, Rmax , G, s0 ) 2 λ ← (1?γ ) ; 4 Vmax ← Rmax ; 1?γ λ H ← logγ Vmax ; λ δ ← Rmax ;

2

max δ C ← Vmax (2H log + log 1 ) ; λ2 λ2 δ ? ? ? (Q? (s, a1 ), Q? (s, a2 ), . . . , Q? (s, ak )) ← Estimate Q(H, C, γ, G, s0 ); H H H ? return argmaxa∈{a1 ,...,al } {Q? (s, a)}; H

kHV 2

log

1

Figure 3.14: Sparse MDP sampling

3.6. Summary

61

cannot solve our problem, and that approximate methods need to be tried. We have demonstrated that all the di?erent settings are equivalent, and so our choice of representation should not a?ect results. We felt that it would be clearer and more convenient if just one of these settings was used in the rest of this thesis. We are probably right in saying that POMDPs are harder to compute using paper and pencil (e.g. in appendix A). So we decided to use the decisiontheoretic setting of section 3.2.1 in the rest of this thesis. A further observation is that at least some of the POMDP solving methods discussed attempt to use the structure of the speci?c problem attacked. Stateaction pair sampling computes solutions faster, due to the fact that in their problem setting some resulting states are more likely than others for a given state-action pair. Factored representations make use of the fact that in some problems the sets of possible actions and resulting states depend only on some of the parameters of the current state. This seems to indicate that the structure of the problem is important in e?ciently computing the solution. In fact, in [WM95], it is shown that the best way to solve a speci?c search problem is to make a customised search method that looks carefully into what the “salient features” of the problem are. However, none of the standard approximate methods proposed can be applied to our problem because they take advantage of di?erent types of structure from that present in our problem. We decided to concentrate on approximate methods built speci?cally for our problem and for the scenarios that a robot is likely to encounter in an o?ce-like environment. Our problem has a lot of geometric structure in it. For example, the possible state transitions are greatly constrained by the structure of the physical environment. None of the methods we have seen so far explicitly tries to take into account the geometrical structure present in the motion actions. Our claim is that using it to guide our approximations is the best way to produce fast and accurate solution algorithms for the surveillance problem.

Chapter 4

Hierarchy and Strategies

In this chapter, we show that a hierarchical approach to surveillance can improve the simple minimum expected cost (MEC) strategy proposed in section 3.2.2. The minimum expected cost strategy can only be evaluated with a limited lookahead and this causes problems related to local minima. One such local minima problem is that of “expected cost obstacles” in example 3.2.10. The hierarchical decision theoretic strategy proposed here attempts to overcome these problems by simultaneously using descriptions of the environment at di?erent levels of abstraction, where decisions made at more abstract levels guide the decision making at less abstract levels. Although it also uses a limited look-ahead, it deals with much larger time horizons than the original minimum expected cost strategy, because the higher levels of the hierarchy can shift the focus of attention to areas of the environment far away from the current robot position. While the hierarchical strategies of this chapter are heuristic and the approximation of the expected cost is rather ad hoc, they improve performance in situations with local minima. They are presented to justify the use of a hierarchical approach. In later chapters, we will devise an improved hierarchical strategy which better exploits the geometrical structure in our environment.

4.1

An o?ce-like environment

The strategies of this chapter are compared in a realistic o?ce-like environment that is based on an existing o?ce building (?g. 4.1:a). This environment is used to guide our thought on what kind of methods are applicable in a real life scenario. Describing our method in terms of a real environment serves to make our ideas more concrete. This environment will also be used in later chapters to compare surveillance strategies. Although it is just an instance of an o?ce-like building it has features that are common in many such buildings. 63

64

Chapter 4. Hierarchy and Strategies

It is assumed that an a priori map of the building is available. This map is an accurately labeled occupancy grid that is composed of walls, doors, and free space. Locations Xi are rooms and segments of the corridor (?g. 4.1:b) and ?res have given starting probabilities P (fi → 1) and known costs C(fi ).

4.1.1

Environment graph construction

Before applying the decision strategies to the o?ce environment, each room is identi?ed and marked. The corridor (room 0) is segmented into many smaller rooms. The places where to segment the corridor are determined by the position of the doors connecting other rooms to the corridor. By segmenting the corridor, a mapping of all potential decision points of the robot to individual locations is achieved (?g. 4.1:b). Such a segmentation is also considered natural, for an o?ce environment in [TRS01]. Finally, the marked room environment representation is converted into a graph where rooms correspond to graph nodes and doors to graph links (?g. 4.1:c). This representation is called the environment graph. The n-step minimum expected cost strategy can be applied to the environment graph. An n-step minimum expected cost strategy is essentially doing a depth?rst search of a tree of possible decisions. The time needed to take a single decision using such a strategy is exponential with an O(bn ) time complexity (see e.g. [Lug02] page 106), where b is the branch factor of the decision tree being explored and n is the number of steps to look ahead. In our environment, b is the average connectivity of the environment graph being explored and has a value of about 3. In connected environments, where there is a link to a neighbouring room and one to the room itself, the lowest value b can take is 2. This sets a lower bound of 2n on the time complexity.

4.1.2

The cost barrier problem

It is obvious that under certain circumstances, n-step strategies can face a di?cult situation in our environment. 4.1.1. Example. Consider the case in ?g.4.2 where C(fi ) = 0 for i ∈ {5, 6, 7, 8, ? 9, 10, 32, 33, 34, 35, 36}, C(fj ) = c for j ∈ {1, 2, 3, 4, 28, 29, 30, 31}, C(fk ) = 1 for all k = i ∧ k = j, P (fl → 1) = 0.001 for all l and A0 = 50. The variable c can ? take any positive value and values from 1 to 100 are used in our simulations. In this setting an “expected cost barrier” is formed by the Xi s that an n-step minimum expected cost strategy (section 3.2.2) with n ≤ 5 cannot cross. All the barrier locations Xi have an expected cost of 0. A robot starting at room 50 can get up to room 37 but then its immediate neighbours will all have an expected cost of 0 and a visit there will not be justi?ed. The rooms Xj will not be visited either because they can only be considered once the robot is in one of the Xi s. Therefore, the rooms Xj will never be visited although, after a su?ciently large

4.1. An o?ce-like environment

65

14

26

24

23

22

20

19

17

15 0

10

7

6

5

3

1

27

25

21

18

16

13

12

11

9 8

4

2

(a)

14

26

24

23 47

22

20

19 43

17 42

15 41

40 39 38

37 36

10

7

6 33

5 32 31

3 30 29

1 28

50 49 48

46 45 44

35 34

27

25

21

18

16

13

12

11

9 8

4

2

(b)

26 50 27 49 25

24 48

23 47

22 46 45 21

20 44

19 43 18

17 42

15 41 16

14 40 39 13 (c) 38 12 37 11 36 9

10 35

7 34

6 33 8

5 32

3 31 30 4 29 2

1 28

Figure 4.1: (a) Initial map (b) Segmented map (c) Environment graph.

26 50 27 49 25

24 48

23 47

22 46 45 21

20 44

19 43 18

17 42

15 41 16

14 40 39 13 38 12 37 11 36 9

10 35

7 34

6 33 8

5 32

3 31 30 4 29 2

1 28

Figure 4.2: Fire costs in example 4.1.1. Rooms for which C(f ) = 0 (barrier rooms) are grey, rooms for which C(f ) = c are black and rooms for which C(f ) = 1 are white. ?

66

Chapter 4. Hierarchy and Strategies

period, they will almost certainly be on ?re and the cost of ?re there is relatively high (if c > 1). This is an example of an expected cost barrier. ? Of course, a 6-step strategy can overcome this barrier, but then a larger one can always be constructed. Furthermore, the decision to use 6 rather than 5 steps has to be based on knowledge about the presence of a barrier. Such knowledge must either be given in advance or has to be derived based on an analysis of the environment. It is not obvious how such an analysis might be performed since e?ective cost barriers can exist in more subtle situations. Finally, as we mentioned in the previous section, there is a real-time, performance-imposed upper limit on the value of n that can be used since the complexity of the search is exponential.

4.1.3

Simulation experiments

The environment graph of this example was used in the simulations. Furthermore, virtual sensors that can detect if a ?re is present within a certain graph node were implemented. Initial ?re locations, probabilities and costs are speci?ed at the level of environment graph nodes. The hierarchical as well as the minimum expected cost and the minimax interval strategies were simulated in the environment of example 4.1.1. Several instances of that environment with di?erent c values for the barrier rooms were ? used in the simulations. Further, a “uniform” environment was used where all rooms had costs of 1 and so no barrier was present. Both the evolution of ?res in the environment and the decisions taken by the robot were simulated. Fires could start at all locations and their costs were accumulated over many simulated time-steps. The accumulative costs represent the ground truth on how well the strategy did in that particular run. The smaller the resulting cost, the better the strategy minimised the cost. The robot had knowledge about the time since last visit for all locations as well as for the P (fi → 1) and C(fi ) for all Xi ∈ X. These values were used to compute the estimate of the expected cost for each location and guide the strategies accordingly. The simulated robot attempted to take the actions that minimised the expected cost. In the simulations, 10000 iterated time-steps were used for each run of the simulator. Taken together with a probability of ?re starting of P (f i → 1) = 0.001 means that a ?re almost de?nitely starts in each of the environment’s 50 locations during the simulation run. Twenty runs of 10000 were used for each strategy to give some estimate of the e?ects of the randomness and to compute the standard deviation of our measurements. Since the size of our time-steps in real time units would be determined by the time a real robot would take to move to an adjacent room, one can claim that a P (fi → 1) = 0.001 is too high. Although this is true in most cases, a lower ?re starting probability would imply that more iterations would be necessary in our simulations to ensure that statistically correct results

4.2. The hierarchical minimum expected cost strategy

67

are gathered. Although, P (fi → 1) = 0.001 is quite high, we believe that, if this probability is lowered, but at the same time the environment becomes larger (incorporating many ?oors or buildings), the same qualitative results will be obtained. We have plotted the number of times each room is visited by the 5-step minimum expected cost strategy (?g. 4.6:i). Since this strategy cannot overcome the expected cost barrier, the groups of nodes a, b, c and d in the plot are not visited. So neither the nodes beyond the barrier nor the nodes in the barrier are visited.

4.2

The hierarchical minimum expected cost strategy

The proposal is to overcome the barrier problems by means of a hierarchical minimum expected cost strategy. In order to implement a hierarchical minimum expected cost strategy two subtasks are performed. ? The environment graph is abstracted. The aim is to generate a hierarchical model of the environment. ? A reasonable strategy is found. This strategy has to apply minimum expected cost to the hierarchical model of the environment, in a way that corresponds in a well-de?ned approximate manner to the actual MEC strategy at the lowest level. The hierarchical strategy is proposed here to demonstrate the bene?t of using a hierarchy in our approximations.

4.2.1

Abstraction tree construction

To construct the abstraction tree, the environment graph was repeatedly clustered and abstracted. Only example 4.1.1 was considered and so the clustering process was not automated as the e?ort was not justi?ed. However, the following criteria leading to the possibility of its automation were taken into account. 1. A path within the cluster should exist between any two nodes in a cluster. 2. No path between any two nodes within a cluster should be greater than the look-ahead size n (we took n = 5). 3. The resulting tree should be balanced. This means that at each level all clusters should have approximately the same number of children. The clustering was performed recursively until the environment graph could not be further abstracted. In the resulting abstraction tree, the abstract nodes

68

61

Chapter 4. Hierarchy and Strategies

level 3

60

59

level 2

58

57

56

55

54

53

52

51

level 1

26 50 27 49 25

24 48

23 47

22 46 45 21

20 44

19 43 18

17 42

15 41 16

14 40 39 13 38 12 37 11 36 9

10 35

7 34

6 33 8

5 32

3 31 30 4 29 2

1 28 level 0

Figure 4.3: Abstraction tree. at each level are interconnected with links that map to doors (?g. 4.3 (levels 1 and 2)). 4.2.1. Definition. The expected cost EC of an abstract node is de?ned recursively as: EC(Xi ) = Pt (fi )C(fi ) if level(Xi ) = 0. EC(Xi ) = j∈ch(Xi ) EC(Xj ) if level(Xi ) > 0. where ch(Xi ) is the set of children of node Xi , Pt (fi ) is the probability of ?re in room Xi at t time-steps since its last visit (de?nition 3.2.2), and C(fi ) is the cost of ?re per time-step at room Xi . So the expected cost of an abstract node is the sum of the expected costs of all its level 0 descendants. 4.2.2. Definition. The expected cost ECp of a path p = [Xf |r] is de?ned recursively as: ECp ([Xf |r]) = EC(Xf ) + ECp ([r]) where Xf is the ?rst node in the path, r is the rest of the path, and EC is the expected cost of an abstract node as computed by de?nition 4.2.1.

4.2.2

Decision procedure

The hierarchical minimum expected cost strategy (?g. 4.4) begins from the top of the abstraction tree and works its way down through its di?erent levels. The original minimum expected cost strategy is used to select the best node to visit at each level. If at the current abstraction level the best node does not correspond to the robot location at that level, a path is planned to a child of the selected best node, and so the abstract cluster is changed. If the best node corresponds

4.2. The hierarchical minimum expected cost strategy

present room ← ROOM ; % where ROOM is the room the robot is currently in abstraction level ← M AX ; % where MAX is the most abstract level with more than 1 node change abstract node ← f alse; %if change abstract node is set to true then the robot directly plans a %path to another abstract node

69

while abstraction level > 0 do present node ← map present room to abstraction level node; select best next node using 5-step MEC at current abstraction level; if best next node = present node then abstraction level ← abstraction level ? 1; else change abstract node ← true; target ← map best next node to closest 0 level node to present room; best path ← plan shortest level 0 path from present room to target; abstraction level ← 0; end end if change abstract node=false then best path ← plan using 5-step committed MEC at level 0; end follow best path until its end is reached; repeat the decision procedure to select new best path;

Figure 4.4: Algorithm for the hierarchical minimum expected cost strategy. to the robot location, the same decision procedure is repeated for the next lower abstraction level. If no decision is made to change abstract nodes and level 0 is reached, the robot has to plan a 5-step “committed minimum expected path” within its present level 1 node. “Committed minimum expected cost” means that once a path is selected by the strategy, the robot sticks to it until its end. Ordinary 5-step strategies are reevaluated after each step is taken. So once a path is selected by an ordinary strategy, the ?rst location in the path is visited and then a new 5-step reevaluation takes place. By using “committed minimum expected cost” a path staying in the cluster visits ?ve rooms and hence its length is closer to that of a path moving to a di?erent part of the environment. Further, using commitment solves the problem with the n-step strategies described in example 3.2.12. To clarify the hierarchical strategy we give a couple of concrete examples. In the ?rst example the abstract node is changed (?g. 4.5:a).

70

61

Chapter 4. Hierarchy and Strategies

level 3

60

59

level 2

54

53

52

51

level 1

14 40 39 13 38 12 37 11 36 9

10 35

7 34

6 33 8

5 32

3 31 30 4 29 2

1 28 level 0

(a)

61

level 3

60

59

level 2

54

53

52

51

level 1

14 40 39 13 38 12 37 11 36 9

10 35

7 34

6 33 8

5 32

3 31 30 4 29 2

1 28 level 0

(b)

Figure 4.5: Example 4.2.3 (a) and 4.2.4 (b). 4.2.3. Example. The robot starts in room 3 at level 0 of the abstraction tree. The strategy has ?rst to decide which node at level 2 should be visited. Suppose its outcome is that the robot should stay within the abstract node 59. Then another decision has to be made at level 1. If the decision is that it should visit node 52, then the robot has to move to another node. Consequently, a direct path is planned and the robot moves from 3 to 31 and then 32 because this is the shortest path from 2 to some descendant of node 52. In the second example the robot never changes abstract node (?g. 4.5:b). 4.2.4. Example. The robot again begins at room 3 and the same line of reasoning is followed until abstraction level 1. Then suppose a decision is taken at level 1 to visit node 51. The main loop should be exited and the robot has to use a committed minimum expected cost strategy starting from node 3. Consequently,

4.2. The hierarchical minimum expected cost strategy

71

a longer visit of node 51 is planned rather than a move to a di?erent abstract node. The proposed hierarchical strategy was applied in the environment of example 4.1.1 for c = 50. The number of visits per location was again plotted ? (?g. 4.6:ii). The location groupings a and c are visited indicating that the robot went beyond the expected cost barrier. This is a positive result. Furthermore, the barrier locations in the corridor (group d) are visited when the robot crosses to the other side of the barrier. The barrier locations that are not in the corridor (group b) correctly are not visited since no bene?t can be gained by doing so.

4.2.3

Revised decision procedure

One problem we encountered is that the robot frequently swaps between abstract nodes 59 and 60. This problem cannot be seen directly from ?g. 4.6:ii, but its e?ect is that we have rather a lot of visits along the corridor of the barrier (around rooms 32-36) and along the rest of the corridor of node 59 (along rooms 36-40). It occurs because, although the rooms past the barrier can have a rather high c, they are not su?ciently many to keep the robot busy for a long period. This ? means that at the top level of abstract nodes 59 and 60 not visiting node 60 for many steps is also comparatively expensive. However, a brief visit to one of these abstract nodes is sometimes enough to tip the scales in favour of the other and then a lot of direct cluster changing paths are planned as a net result. To make things worse, a visit to cluster 59 most of the time results in a visit to cluster 51, which is more expensive, and that is why so many visits along the corridor of cluster 59 take place. To ?x this problem we revised hierarchical MEC to prefer paths that start at the current node Xl . This might seem strange since we just visited those rooms, but at the abstract levels it makes sense since it tends to make the robot explore the local neighbourhood properly before moving elsewhere.

? 4.2.5. Definition. The revised expected cost ECp of a path p = [Xf |r] is de?ned as: ? ECp ([Xf |r]) = κECp ([Xf |r]), κ > 1 if Xl = Xf . ? ECp ([Xf |r]) = ECp ([Xf |r]) if Xl = Xf .

where Xf is the ?rst node in the path, r is the rest of the path, κ is a delay constant, Xf , Xl are nodes at the same level of the abstraction tree and ECp is computed using de?nition 4.2.2. The purpose of κ is to delay moving from the current node. It can perhaps be thought of as a “hysteresis” parameter. The bigger the value of κ the longer the delay in deciding when to move into a new cluster. It might appear counterintuitive to want to stay longer at a node that was just examined. However, this

72 c = 100 ? c = 50 ? 407 ± 28 278 ± 17 350 ± 32 230 ± 14 279±20 162 ± 11 329 ± 6.6 145±11 325 ± 9.2 302 ± 6.0 328 ± 8.3 301 ± 5.7 327 ± 8.7 304 ± 7.0 326 ± 11 304 ± 6.6 331 ± 9.9 305 ± 7.4 358 ± 28 316 ± 13 7, 460 ± 320 3, 830 ± 130

Chapter 4. Hierarchy and Strategies c = 10 ? 130 ± 7.0 79.5 ± 5.5 61.9±4.1 64.4 ± 3.6 92.0 ± 5.7 123 ± 6.9 143 ± 7.5 158 ± 7.4 183 ± 9.1 202 ± 9.0 975 ± 22 c=1 ? 100 ± 5.5 27.6 ± 1.7 20.4 ± .93 18.9 ± 1.2 18.0 ± 1.3 17.3±1.2 19.3 ± 1.0 22.0 ± 1.4 35.0 ± 2.3 690 ± 3.4 315 ± 7.7 uniform 321 ± 6.4 39.3 ± 2.6 29.1 ± 2.2 25.8 ± 1.1 26.5 ± 1.7 22.0±1.6 28.7 ± 1.4 34.4 ± 1.5 47.7 ± 2.4 92.9 ± 4.9 415 ± 6.8

κ = 1.0 κ = 1.1 κ = 1.2 κ = 1.3 κ = 1.4 κ = 1.5 κ = 1.6 κ = 1.7 κ = 1.8 κ = 1.9 κ = 2.0

Table 4.1: The search for a κ value. Total costs and standard deviation after 10000 iterations in units of 103 . Best costs for each environment in bold.

is reasonable if the current node is important and the expected cost computed for the other nodes is an overestimation of what will be seen once there. At the abstract levels of the hierarchy the expected cost of a node is an overestimation of what is seen once a robot visits that node. This is demonstrated in the following example. 4.2.6. Example. The expected cost computed for node 60 is that of visiting all its subnodes. These subnodes correspond to almost half the environment of example 4.1.1. However, it is possible for the hierarchical expected cost to just spend a 5-step path within node 60 and then move back to node 59. If that is the case, the decision to move to node 60 would be based on the expected cost of half the environment, and consequently on an overestimation of what was seen by moving to node 60. We have tried various values of κ for di?erent c and the results can be sum? marised in table 4.1. We found that the values of κ that worked best were between 1.2 and 1.5. In the table the best costs for each environment are printed in bold. We decided to set κ to 1.3. This value is probably rather speci?c for the environment we consider. It is a rather arbitrary ad hoc choice and it is the weakest part of this approach. We then replotted the number of visits per room for the revised hierarchical strategy for κ = 1.3 (?g. 4.6:iii). Although now there are some irregular peaks in the plot, the previous problem of swapping between nodes 59 and 60 is solved. Furthermore, the nodes in groups a and c are visited more often and this also constitutes an improvement.

4.2. The hierarchical minimum expected cost strategy

73

Number of visits vs. Room number 1800

1800

Number of visits vs. Room number

1600

1600

1400

1400

1200

Number of visits Number of visits

1200

1000

1000

800

800

600

600

400

400

200

200

0

1

10

20

30 Room number

40

50

0

1

10

20

30 Room number

40

50

a

b

c

d

a

b

c

d

(i)

(ii)

Number of visits vs. Room number 1800 1800

Number of visits vs. Room number

1600

1600

1400

1400

1200

Number of visits Number of visits

1200

1000

1000

800

800

600

600

400

400

200

200

0

1

10

20

30 Room number

40

50

0

1

10

20

30 Room number

40

50

a

b

c

d

a

b

c

d

(iii) Legend: a = 1,2,3,4 = rooms that are beyond the barrier. b = 5,6,7,8,9,10 = rooms that form part of the barrier. c = 28,29,30,31 = corridor segments that are beyond the barrier. d = 32,33,34,35,36 = corridor segments that form part of the barrier

(iv)

Figure 4.6: Number of visits vs. room number for example 4.1.1 (i) 5-step MEC (ii)

hierarchical MEC (iii) revised hierarchical MEC (iv) minimax interval.

74

Chapter 4. Hierarchy and Strategies minimax 5-step interval MEC uniform 19.6±1.2 22.6 ± 1.4 c=1 ? 15.3±1.1 81.4 ± 2.8 c = 10 ? 47.0±5.3 734 ± 26 c = 50 ? 196 ± 24 3, 640 ± 130 c = 100 ? 371 ± 53 7, 250 ± 210 hierarchical MEC original revised 321 ± 6.1 25.8 ± 1.1 102 ± 4.4 18.9 ± 1.2 127 ± 9.7 64.4 ± 3.6 281 ± 14 145±11 410 ± 31 329±6.6

Table 4.2: Total costs after 10000 iterations in units of 103 .

4.2.4

Discussion

The hierarchical strategy succeeds in crossing the barrier. More importantly, however, the hierarchical strategy also reduces the total cost in comparison to MEC. In table 4.2 the total costs (in units of 103 ) and their standard deviations are listed for minimax interval, 5-step MEC, the original hierarchical MEC and the revised hierarchical MEC. Minimax interval and 5-step minimum expected cost were ?rst de?ned in section 3.2.2. The minimax interval strategy, given a room, selects its neighbour with the largest time since last visit. It completely ignores costs and probabilities and, in the long run, it explores all rooms in our environment. The 5-step MEC strategy minimises the expected cost over 5 timesteps. Instances of our example with di?erent c values were simulated for 10000 ? iterations to collect the data in the table. The original hierarchical MEC strategy does not always reduce the total cost in comparison with MEC. However, the revised MEC wins over MEC in all cases but the one of uniform costs. This is an indication that we have improved on the original MEC strategy in the cases where costs matter. The improvement in total cost arising from making “not moving” at higher levels more attractive, is quite dramatic. The delay κ is necessary because the expected cost assigned to the abstract nodes is always an overestimation of the actual cost. The expected cost estimate assumes that once the strategy visits an abstract node, it stays there long enough for all sub-nodes to be visited. However, it is often the case that the strategy stays in a node only long enough for its expected cost to decrease and for the other nodes to attract the attention of the strategy. When that happens, the actual cost is only a portion of the estimated expected cost. We believe that by making “not moving” more attractive, the revised hierarchical strategy stays longer in a node and so our abstracted cost is a better estimate of the real cost (see example 4.2.6). A possibly more interesting observation that can be made from table 4.2 is that minimax interval, which does not use the probabilities and costs, performs better than MEC and hierarchical MEC in the “uniform” cost case and in the sit-

4.3. Summary

75

uations where c is 1 or 10. These are situations where the cost di?erences are not ? very big. Minimax interval moves in the environment in a way that can almost be described as “methodical”. Some environment locations are visited more frequently than others only due to the connectivity of the environment (?g. 4.6:iv). In the situations where c = 1 and c = 10 this “methodical exploration” of the ? ? environment produces better results than our ad hoc revised MEC. However, we believe that a good hierarchical MEC strategy should be able to perform better even in those cases and, therefore, we consider this an indication that the realisation in this chapter of this approach is not the best possible. The fact that MEC performs so badly should not be surprising since MEC never passes to the other side of the barrier. The original hierarchical MEC also appears to be problematic despite perhaps being an improvement on MEC. It is clear that both our original and revised hierarchical MEC strategies are not optimal.

4.3

Summary

In chapter 3 we have shown that a minimum expected cost strategy is a promising method for planning in robotic surveillance. In practice, however, minimum expected cost typically cannot be evaluated globally. Only n-step minimum expected cost strategies are feasible. In order to overcome problems imposed by local minima and barriers, we propose using hierarchical versions of the minimum expected cost strategy. The hierarchical minimum expected cost strategies in this chapter are ad hoc. Three problems were identi?ed: ? The abstracted expected cost is an overestimation of the actual one (example 4.2.6) ? The introduction of κ is not an exact solution to the overestimation problem ? The selection of a value for κ was experimental, will have to be repeated for a di?erent environment and the selected value 1.3 does not have an immediate qualitative explanation. Despite these problems, the hierarchical minimum expected cost strategy shows the potential for improvement over the unabstracted strategies. Given that our approach in this chapter was cursory, there are several areas that can be improved. The computation of the expected cost for abstract clusters seems to be the main problem of this approach. In the next two chapters we shall start again working on a strategy that incorporates a hierarchy of the environment but uses its geometrical structure in the clustering and in the computation of the cluster expected costs.

Chapter 5

Path-based Clustering

In chapter 4, it was shown that making a hierarchical description of the environment helped produce better surveillance strategies. However, the method presented there was rather ad hoc and did not always improve performance. One of the problems of the hierarchical strategies presented so far is that the expected cost assigned to the abstract nodes does not closely correspond to the actual cost of visiting one. To be more exact, the expected cost of an abstract node is the sum of the expected costs of its children. However, a 5-step path within an abstract node does not always visit all its children and, further, some children are perhaps visited twice. The parameter κ in the revised hierarchical strategy was introduced to deal with some of the side-e?ects of this inexact expected cost assignment. Despite being an improvement in some cases, this did not produce better results in all test environments. A much better assignment of the expected costs can be produced if the geometry of the environment is considered. After discussing some general desiderata for clustering an o?ce building, we will concentrate on our speci?c case of a corridor-based o?ce building. A better method for assigning expected costs to paths visiting abstract nodes is produced with that instance of an o?ce building in mind.

5.1

Clustering desiderata

A clustering is de?ned to be a graph of connected environment locations ful?lling certain criteria. In general, there are many ways in which one would want to cluster an environment. In our case, the main reason for being interested in clustering is the potential for computational time savings. An abstracted nstep lookahead is prohibitively slow for values of n, large enough to examine every room in our environment before acting. By clustering the environment and approximating the expected cost at higher level nodes, we produce an abstract 77

78

Chapter 5. Path-based Clustering

description of the problem. Then a decision procedure for planning at the simpler higher levels can be produced. Our clustering is driven by the need to produce such a faster decision procedure that computes approximate solutions. To generate a clustering, a clustering criterion needs to be selected. There are three main possibilities on what could be used to cluster the environment. It is possible to use the structure present in the type of the indoor environment considered to guide our clustering. A natural clustering of an o?ce building would be to form clusters consisting of separate ?oors. Within a ?oor di?erent corridors can form separate clusters and so on. Another possibility is to let the costs guide the clustering. Some environments might have clumps of rooms that are similar, therefore have the same costs in ?re presence. A last option is to let the type of permissible paths guide the clustering. For example, it might be reasonable to opt for clusters that produce equal path lengths when visited. In the case of the surveillance application, the types of paths likely to be useful for a robot performing a surveillance are limited by the properties of the surveillance task. For the purposes of a surveillance task a lot of the possible paths can be ignored because they are not e?cient in reducing the expected cost. For example, under mild assumptions, a robot should not stay for many steps in one room before moving to another. Typical paths for a robot performing surveillance are: explore a cluster (visit each room in it at least once), transit through it on its way to something interesting (visit only the rooms needed to get through it), ignore it altogether (do not visit anything) or visit a speci?c location where, for example, the cost is comparatively high (visit just the rooms on the way to that location and out). These task-dependent path preferences can help us both decide what type of clusters to consider and, related to this, decide to only consider speci?c paths when assigning expected cost to clusters. Further, there is an interaction between the shape of the selected clusters and the properties of the considered paths. One such property is that of path length. The path length can be de?ned to be the total number of room visits along the path. The worst and best case scenarios for exploration path length in a cluster are those where the cluster has a star and a linear topology respectively (?g. 5.1). For the case of a star cluster with n rooms the exploration length, if we enter and leave at the centre of the cluster, is 2n. For the case of a linear cluster the exploration length is n, if we enter from the sides; for loops this is n, no matter where we enter. This can give us an idea on the bounds of how much time is necessary to explore each type of cluster. For the case of transit paths the situation is rather di?erent. To transit a star can be as short as a single room visit if the entrypoints are in the center of the cluster. To transit a loop can be a lot more complicated with up to n/2 nodes visited. In fact, for transiting paths the worst case scenario is when the rooms are arranged as a corridor where the path length is n. From the discussion on path lengths it is obvious that when discussing a path within a cluster the entrypoint at which the cluster is entered is relevant.

5.2. O?ce-like environments

79

Linear cluster

Circular cluster

Star cluster

Intersecting linear

Figure 5.1: Di?erent types of cluster shapes. The probabilities and costs at the lowest level also play a role in the clustering. An exploration path of ?xed allotted time within a cluster assumes that the clusters subnodes are equally important and that therefore staying longer somewhere does not give any bene?t. This is de?nitely the case in o?ce environments where most rooms are equally important o?ces. However, even within the o?ce environment, one can imagine areas that are more important than others. A ?re in the part housing o?ce rooms can be less important than in the area where the computer data storage is housed. So the boundaries of a cluster could depend on many parameters and in our discussion here we have given a stronger emphasis on paths and costs, because those are more important in our surveillance problem. In the rest of the chapter we work out, in full detail, the o?ce-building example which was given in chapter 4.

5.2

O?ce-like environments

As mentioned in the last section, o?ce-like environments contain a lot of structure. O?ces are normally organised along corridors. These corridors are in turn connected with each other. If they are at di?erent ?oors they are connected with staircases or elevators. If they are parallel they are connected at some point along their length. If halls are ignored, it is clear that almost any o?ce-like building can be described using corridor shaped structures. The building blocks of a corridor are, in turn, star-shaped clusters. By connecting the centres of many star-shaped clusters, a long corridor can be formed. The leaves of those clusters correspond to the o?ces along the corridor. Further, a hall-shaped structure with many rooms connected to it, can be described as a star-shaped cluster with usually a higher number of leaves than a typical corridor star-shaped cluster. In general, the four di?erent types of cluster depicted in ?gure 5.1 are relevant within an o?ce building. Linear clusters correspond to corridors, circular clusters correspond to parallel corridors connected at their ends, stars correspond to the

80

Chapter 5. Path-based Clustering

Figure 5.2: Star-shaped blocks of various sizes. local structure within corridors and crosses correspond to intersecting corridors. The exact geometrical shape of the rooms and corridors can vary from building to building and largely depends on architectural considerations. However, we believe that the topological structure, although di?erent among buildings, has a common basis. A clustering process can be developed for o?ce-like buildings that treats level 1 clusters as a special kind of cluster. We call level 1 clusters blocks. The blocks considered have the shape of a star (?g. 5.2). Of course, several other shapes could be considered, but block clusters of this shape are sensible building blocks of corridor-shaped o?ce buildings. The expected cost of star-shaped blocks can be computed directly without examining individual rooms and this simpli?es the computation of the expected cost later in this chapter. A clustering process can be created for o?ce-like environments. This can be based on repeated clustering until the environment cannot be further abstracted. We propose a clustering process with the following properties: 1. (block-based) star-shaped blocks ?rst have to be found to form level 1 of the abstraction. 2. (connected) a path within the cluster should exist between any two nodes in a cluster. 3. (balanced) the resulting tree should be balanced; this means that at each level the exploration paths within the clusters should have more or less equal length. 4. (uniformity) the ?re costs and ?re starting probabilities have to be uniform within a block. The last two criteria deserve further discussion. On the issue of tree balancing several options were available. The tree could be balanced on the number of rooms in each cluster (as in section 4.2.1). It could also be balanced on cost so that all clusters of a certain level have equal expected costs. In the next chapter decisions will be taken between di?erent cluster paths. Choosing to balance the tree on path length makes these comparisons fairer. It is the time others are ignored, instead of the particular cost of a cluster itself, that appears more important in our situation.

5.3. Route abstraction simpli?cation

40 39 36 34 38 33 32 37 31

81

level 3 level 2 level 1

35

28 26 30

27

23 21

22

18 16

17

13 11

12

8 6

7

3 1

2 level 0 4

29

25

24

20

19

15

14

10

9

5

Figure 5.3: A clustered corridor based environment. The decision for uniformity in the probabilities and costs within star blocks was taken to simplify the computation of block expected costs. The equations for computing the expected cost of a path in a star rely on the assumption that the ?re costs and ?re starting probabilities are uniform within a b

赞助商链接

- Prof. Dr. Dr. h. c. Hans Meyer
- Prof. Dr. Beat Schmid
- Copromotor Prof Dr. G. van der Veer
- for Artificial Neural Networks Supervisor Prof.dr.ir. A.J. van de Goor
- prof. dr. S.C de Vries
- Prof. Dr. Helmut Schauer, Prof. Dr. Beat Schmid,
- Prof.dr.ir. W. van Etten Assistent-promotor
- Graduation committee Prof. dr. F.M.G. de Jong, promotor
- Promotors Prof. dr. ir. P. van der Veer
- Landau parameters for isospin asymmetric nuclear matter based on a relativistic model of co
- STRANGE QUESTIONS ABOUT BRAIDS
- Performance evaluation of a temporal database management system
- The mass and energy budget of Cassiopeia A
- H.D. Communication challenges raised by open co-operative teams in robocup. In Encontro Cie

更多相关文章：
更多相关标签：