PARALLEL SEEDED REGION GROWING ALGORITHM
D.E. Singh, D.B. Heras and F.F. Rivera.
Dept. Electronics and Computer Science. Univ. Santiago de Compostela Campus Sur. 15706 Santiag
o de Compostela. Spain. e-mail: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
In this paper a new algorithm to segment images based on a seeded region growing strategy is presented. This algorithm is suitable for processing on multiprocessors, achieving low gures in terms of runtime. Our proposal is based on the distribution of seeds in processes, in such a way that their growing can be implemented in parallel. In order to control its behaviour, several parameters have been introduced. The execution time and the quality of the segmentation can be xed by these parameters, and their in uence is studied in great detail in this paper. Several tests have been carried out on the Fujitsu AP3000 parallel system. We used di erent images as benchmark. Keywords: seeded region algorithm, parallel algorithms, distributed memory systems.
One of the early tasks in image analysis is to segment an image into constituent parts. Actually, image segmentation is a critical problem in arti cial vision and it has been intensively studied. This process is usually very costly and therefore parallel algorithms are an attractive approach. Nowadays, multiprocessors workstations are widely used, and they can speed up these tasks. The Seeded Region Growing (SRG) algorithm, introduced by Adams and Bischof 1], segments images of intensity starting from a set of seeds (selected pixels to initialise the process on each region). It consists of the expansion of these seeds adding neighbouring pixels based on an iterative process to obtain di erent regions on the image. Our proposal is based on the particular SRG implementation introduced by Mehnert and Jackway 2]. All these strategies are inherently sequential because the pixels have to be processed in a mandatory order. Therefore the parallel implementation of them in not possible. The proposed algorithm relaxes this constraint by allowing independent region growth. The regions are partitioned into sets that are assigned to di erent processes. Then, a
This work was supported in part by the CICYT under grant TIC96-1125-C03-02 and Xunta de Galicia under grant XUGA20605B96. The authors are grateful to the CESGA for the use of their systems.
i < k , i<k PQ
FQ 1 2 3 i
To be processed
variant of the SRG algorithm is executed on the regions that belong to the same set. Some parameters are introduced to control the region growth according to the processing on other sets. When this algorithm is e ectively executed on a multiprocessor, two main points a ect the performance: the load balance and the communication overhead. The former is handled by using e cient methods to partition the region sets. The second one is minimised by establishing good distributions of sets and through parallel programming techniques. The system on which we implemented the parallel version was the Fujitsu AP3000, a distributed memory multiprocessor which consists of 12 SuperSparc processors connected by a mesh network 3]. We adopt a SPMD implementation 4] using C language with message-passing routines of the MPI library 5]. We validate the algorithm using a large group of images, and as an example, results using two standard images are presented. This presentation is organised in four more sections as follows: in section 2 a brief description of the SRG algorithm is presented. Our proposal is introduced in great detail in section 3 and section 4 summarises some relevant results. We end by showing the main conclusions.
2 The SRG algorithm
The SRG algorithm is an iterative method that starts from some selected seeds in the image. The pixels in their neighbourhood are candidates to be included in the corresponding region. Some of them are selected for inclusion because of their similarity in terms of intensity to the neighbouring region. This method is applied pixel by pixel in a sequential way. In each iteration of this algorithm, the pixels in the border of regions that are considered candidates for inclusion are stored in a queue called neighbours holding queue NHQ. The similarity of each pixel of the NHQ is evaluated. is de ned as the di erence between the intensity of the neighbouring pixel and the average of intensities of the pixels already belonging to the region 2]. The SRG algorithm organises the information in a sophisticated structure. The candidates are stored in LIFO (Last Input First Output) queues that store pixels with values of in the interval i = imin ; imax]. The queues are ordered according to the value of
Figure 1: Queues used in the SRG algorithm.
, forming a list called ascending priority queue PQ. In each iteration, the pixels to be processed will be those with the greatest similarity, which correspond to those in the LIFO queue of the lowest , called rst queue FQ. Figure 1 illustrates the organisation of these queues. In order to get a balanced growth, the average of the intensity of all the pixels already assigned to the regions is not updated at this point, but is stored temporarily and updated at the next stage. Once the pixels of the FQ have been analysed, each is assigned to the corresponding region. This process is repeated until every pixel has been processed.
3 The Parallel Seeded Region Algorithm
The basic idea of our proposal, called PSRG, is to distribute the processing of an independent set of regions among processes. In this way, di erent regions can grow simultaneously. In a parallel execution, each process is assigned to a di erent processor. At a preliminary stage, the seeds are distributed among the processes in order to get an optimum load balance and to minimise subsequent communication overheads. Then, the SRG algorithm is applied to each process with additional checks to control the growth which depends on the behaviour of the growth in the rest of processes. The independent growth could produce an overlap area that consists of the so called boundary pixels, that is, those pixels that are assigned to di erent regions in di erent processes. In most cases the boundary pixels are few and can be assigned, if necessary, at a nal stage in a straightforward way. Below we will describe this algorithm in greater detail. This preprocessing consists of e ciently distributing the regions among the processes. The main objective at this stage is to minimise the number of boundary pixels in the following stages. The key point is that regions assigned to the same process will not generate overlap because they are treated as in the sequential algorithm. The idea is to assign to the same process those regions which foreseeably will be neighbours, in the sense that they share some boundary pixels. As a criterion for that, we will use the Euclidean distance between seeds, as the smaller this is, the greater will be the probability of being neighbours. Thus we will use a clustering algorithm based on this distance. On the other hand, to obtain a good load balance, the same number of seeds is assigned to each process. Evaluating the optimal clustering is a NP-complete problem, and can be handled using several heuristics 6]. Due to its speed and e ciency Prim's algorithm 7] was chosen to establish this partition.
3.1 Initial distribution of regions
3.2 Parallel algorithm for growth of regions
The algorithm proposed by Mehnert and Jackway is applied to these sets of regions simultaneously. However, their growth can not be carried out independently in di erent processes because it would produce large boundary regions. Two kinds of synchronisations are introduced to control this e ect, one for detecting collisions between regions and the other for controlling the speed of growth. The rst synchronisation arises from the necessity of limiting the zones in which a region can grow to avoid a large number of boundary pixels between regions. The second one is needed to avoid abnormal growth of regions. For abnormal growth we mean the situation in which a region occupies pixels
which do not correspond to it in the SRG algorithm. This area can be occupied by other regions assigned to other process with smaller work load at that moment, and which can spend more time on its growth. Figure 2 illustrates the results of applying this algorithm using three processes. In particular gures 2(a), 2(b) and 2(c) show the segmentation obtained in each process and in gure 2(d) the nal result, where black pixels correspond to boundary pixels. This gure shows the fact that the Prim's algorithm produces high locality in the regions assigned to each process and a good balance in terms of the number of pixels processed.
(a) Process 1
Figure 2: Regions and boundary pixels using 3 processes.
(b) Process 2
(c) Process 2
Detection of collision between regions
The objective of this operation is to limit the zones of growth for the regions by detecting their overlap. When a pixel is identi ed as being assigned to more than one region it will be labelled as a boundary and thereby the growth of these regions is blocked in this pixel. This means that between neighbouring regions, assigned to di erent processes, a boundary zone exists, which will not be allowed to process it in the future. The way of carrying out this synchronisations is by means of a reduction operation 4] as follows: initially each process sets the label for each pixels assigned to a particular region. Then, the summation of all these labels is obtained by the reduction operation which implies communications among processes. If the result for some pixel is greater than one, it will be labelled as boundary. It is necessary to perform this operation on every pixel. This can produce long messages. To reduce this overhead, two improvements have been used: 1. Codi cation of labels using just two bits per pixel according to table 1. Table 1: Codi cation 00 01 10 Meaning not belonging to shared by more assigned just one region than one region 2. Use of windows. It is not necessary to obtain an exact outline of the regions, so the images are partitioned into square windows of pixels. Instead of using labels for each pixel, they are de ned for each window. In this way, the detection of overlap presents some kind of imprecision, at the expense of reducing signi cantly the size of the messages.
Where Nregions is the number of regions, Nprocesses is the number of processes and Dmin is the minimum Euclidean distance among seeds in di erent processes. In this equation, it is taken into account both the work load of each process, given by the number of regions it has to process, and the minimum distance between regions. High values of T allow di erent regions to grow in the same zone of the image, thereby producing large overlaps. On the other hand if T is too small, a large communication overhead will be produced. For this reason, the value of T is modi ed dynamically based on the increase in the number of boundary pixels according to:
1 1 1
The reduction operation is carried out every T iterations. The initial value of T is an estimation of the number of processed pixels before the rst overlap between two regions occurs, which was taken as: N T Init = regions Dmin (1) N
1 1 1( )
Where depends on the cost of the communication routines in each particular system and represents an estimation of a pro table number of boundary pixels between successive communications. In expression 2, 4K is the number of new boundary pixels generated since the last synchronisation. We found that = 400 and = 3000 are adequate for the AP3000 multiprocessor. This means that if the speed of growth in the regions remains constant, every 400 iterations, 3000 new boundary pixels will be added. In addition, in order to avoid extremal values of T that decrease performance or increase the number of boundary pixels, T is constrained to the interval 400 6 T 6 T Init .
1 1 1 1( )
Each process works on its local PQ and FQ. However, it is important to check the position of each local FQ in the global PQ, because if it presents a globally low similarity, it can give rise to problems of abnormal growth. The strategy adopted to solve this e ect consists of carrying out a reduction operation in order to compute the maximum ( max ) and minimum ( min) global values of . We de ne a control parameter that speci es the size of an interval of length L as: (3) If the priority of a pixel in the local PQ in a process is less that min + L it is processed, otherwise, it will not be considered. This operation is regulated by a given parameter T , which indicates the number of iterations between successive synchronisations. This operation does not have a high cost because it involves just two values: max and min.
Limitation of the speed of growth
L = ( max ? min )
We have tested the e ciency of parallel algorithm with di erent images. Moreover, we have analysed the in uence in the program's e ciency of each parameter that was introduced in section 3. All these tests were realised on the Fujitsu AP3000 using di erent
Execution time (seconds)
% boundary pixels
2 4 6 8 10 12
20 20 10
Number of processors
Number of processors
Figure 3: Execution time and percentage of boundary pixels. numbers of processors. The execution time and the percentage of boundary pixels were used to evaluate the e ciency and quality of the PSRG. Two images called boats and peppers with a size of 702*575 and 512*512 pixels respectively were used to illustrate the results. Figure 3 shows the execution time and the percentage of boundary pixels for di erent number of processors.Those measures correspond to = 6 and = 0:24 for peppers and = 1:00, = 6 for boats. The SRG algorithm is a particular case of the PSRG because when executed in one processor is exactly the same as the SRG. Note that the execution time is reduced signi cantly as the number of processors increases. However, this implies an increase in the number of boundary pixels due to the overlap of regions assigned to di erent processors. These pixels are allocated in the borders of the regions, so although this number could be high, the regions can be located with an acceptable precision. This e ect can be appreciated in gure 4 which shows, for
Figure 4: Parallel SRG algorithm with 4 (central) and 9 (right) processors.
T dinamic 1 T =2T init 1 1 T =200
T dinamic 1 T =2T init 1 1 T =200
Execution time (seconds)
140 25 120
2 4 6 8 10 12
15 80 10 60 5 40 0
Number of processors
Figure 5: Results of T parameter for boats image.
a di erent number of processors, the results of the segmentation process. The parameters that in uence the e ciency of our algorithm are T , T , and . T establishes the frequency of the communications for controlling the speed of growth. Because the size of these messages is small, this operation is fast and will have a minimal impact on the performance of the program. The AP3000 communication speed is 200 MB/sec, so a transmission of two words represents around 80ns, a small percentage of the execution time of the program. Similar results can be obtained for other multiprocessor systems. In this way, with a constant value of 300 the growth can be controlled without a ecting the performance. The messages controlled by T have high sizes, so the situation is clearly di erent. Figure 5 illustrates the execution times for di erent values of T and the percentage of boundary pixels that are detected in successive communications. When the PSRG algorithm starts the regions grow producing no boundary pixels. At some point, there will begin to be overlap and the number of boundary pixels increase. During the last iterations the regions are completely surrounded by neighbours which prevent growth, and so, the number of new boundary pixels decreases. Small values of this parameter ensure a precise result with high execution times due to the increase in the number of messages (in gure, T = 200). On the other hand, when T is high (T = 2 Tinit ) the number of communications dramatically decreases, and as a consequence, the overlap is high and the growth stops later (in each step). With an adaptative value of T , the synchronisation frequency increases when the number of boundary pixels increases, thus
1 2 2 1 1 1 1 1 1
Figure 6: Results of parameter for peppers image.
avoiding high overlaps. And, when the number of boundary pixels is small during the initial and nal iterations, cost of communications is reduced. Assuming that this is the most critical parameter, an adaptative value for T was used, that attains a balance between the quality of the segmentation and the execution time. The in uence of is also high, gure 6 shows the results for four di erent values . When this parameter decreases the number of boundary pixels is lower and the execution time is higher. However, for adequate values of , the improvement in the quality compensates for the loss of speed. The parameter regulates the size of the windows. Sizes of 5 or 6 will produce good results without great increase in the number of boundary pixels. Lower sizes increase the sizes of communications making it more costly, and higher sizes a ect the quality of the image because the size of the window is too large.
A generalisation of the SRG algorithm is presented in this work. It is suitable for being processed in parallel because the data dependencies were relaxed. Several parameters were introduced to control the behaviour of the algorithm although just one must be set by the user depending on the features of the target image. With parameter, the user can control the quality of segmentation at the cost of losing some speed in the process. Other parameters more critical in the system, such as T , are established automatically, so the user doesn't have to worry about its control. An example of the results of our algorithm is that the execution time is reduced from 85 seconds to 37 seconds using 4 processors and to 13 seconds using 9 processors.
1] Rolf Adams and Leanne Bischof. Seeded Region Growinng . IEEE transactions of pattern analysis and machine intelligence (1994) 6, vol. XVI, pp 641-647. 2] Andrew Mehnert and Oauk Jackway. An improved seeded region growing algorithm . Pattern Recognition Letters (1997) 18, pp. 1065-1071. 3] Hiroaki Ishihata, Masanori Takahashi and Hiroyuki Sato. Hardware of AP3000 Scalar Parallel Server . FUJITSU Sci. Tech (1997), pp 24-30. 4] Vipin Kumar, Ananth Grama, Anshul Gupta and George Karypis. Introduction to Parallel Computing. Design and Analysis of algorithms . The Benjamin/Cummings Publishing Company, Inc (1994). 5] William Gropp, Ewing Lusk and Anthony Skjellum. USING MPI Portable Parallel Programming with the Message-Passing Interface . The MIT Press (1994). 6] Manshat Mansour and Geo rey C. Fox. Allocating data to multicomputer nodes by physical optimization algorithms for loosely synchronous computations . Concurrency: Practice and experience (1992) 4, vol. VII, pp 557-574. 7] Alan Gibbons. Algorithmic graph theory . Cambridge University Press (1984), January. Department of Computer Sciences, University of Warwick.