当前位置:首页 >> >>

Centralized versus distributed schedulers for multiple bag-of-task applications

Centralized versus distributed schedulers for multiple bag-of-task applications
Olivier Beaumont1 , Larry Carter2 , Jeanne Ferrante2 , Arnaud Legrand3 , Loris Marchal4 and Yves Robert4
2 Laboratoire LaBRI, Dept. of Computer Science and Engineering, CNRS-INRIA Bordeaux, France University of California, San Diego, USA Olivier.Beaumont@labri.fr {carter,ferrante}@cs.ucsd.edu 3 4 Laboratoire ID-IMAG Laboratoire LIP ? CNRS-INRIA,Grenoble, France CNRS-INRIA,Ecole Normale Sup? erieure de Lyon, France Arnaud.Legrand@imag.fr {Loris.Marchal,Yves.Robert}@ens-lyon.fr 1

Multiple applications that execute concurrently on heterogeneous platforms compete for CPU and network resources. In this paper we consider the problem of scheduling applications to ensure fair and e?cient execution on a distributed network of processors. We limit our study to the case where communication is restricted to a tree embedded in the network, and the applications consist of a large number of independent tasks that originate at the tree’s root. The tasks of a given application all have the same computation and communication requirements, but these requirements can vary for di?erent applications. Each application is given a weight that quanti?es its relative value. The goal of scheduling is to maximize throughput while executing tasks from each application in the same ratio as their weights. We can ?nd the optimal asymptotic rates by solving a linear program that expresses all necessary problem constraints, and we show how to construct a periodic schedule. For single-level trees, the solution is characterized by processing tasks with larger communicationto-computation ratios at children with larger bandwidths. For multi-level trees, this approach requires global knowledge of all application and platform parameters. For large-scale platforms, such global coordination by a centralized scheduler may be unrealistic. Thus, we also investigate decentralized schedulers that use only local information at each participating resource. We assess their performance via simulation, and compare to a centralized solution obtained via linear programming. The best of our decentralized heuristics achieves the same performance on about two-thirds of our test cases, but is far worse in a few cases. While

our results are based on simplistic assumptions and do not explore all parameters (such as bu?er size), they provide insight into the important question of fairly and optimally co-scheduling heterogeneous applications on heterogeneous grids.

1. Introduction
In this paper, we consider the problem of scheduling multiple applications that are executed concurrently, hence that compete for CPU and network resources. The target computing platform is a heterogeneous network of computers structured either as a star network (a one-level rooted tree) or a multi-level rooted tree. In both cases we assume full heterogeneity of the resources, both for CPU speeds and link bandwidths. Each application consists of a large collection of independent equal-sized tasks, and all tasks originate at the tree’s root. The applications can be very different in nature, e.g. ?les to be processed, images to be analyzed or matrices to be manipulated. Consequently, we assume each application has an associated communication-to-computation ratio for all of its tasks. This ratio proves to be an important parameter in the scheduling process. This scenario is somewhat similar to that addressed by existing systems. For instance BOINC [11] is a centralized scheduler that distributes tasks for participating applications, such as SETI@home, ClimatePrediction.NET, and Einstein@Home. The scheduling problem is to maintain a balanced execution of all applications while using the computational and communication resources of the system e?ectively to maximize throughput. For each applica-

tion, the root node must decide which workers (i.e. which subtree) the tasks are sent to. For multi-level trees, each non-leaf worker must make similar decisions: which tasks to compute locally, and which to forward to workers further down in the tree. The scheduler must also ensure a fair management of the resources. If all tasks are equally important, the scheduler should aim to process the same number of tasks for each application. We generalize this by allowing each application Ak to be assigned a priority weight w(k) that quanti?es its relative value. For instance, if w(1) = 3 and w(2) = 1, the scheduler should try to ensure that three tasks of A1 are executed for each task of A2 . For each application Ak , let ν (k) (t) be the number of tasks of Ak completed by time t. At any given time t, we can de?ne the throughput α(k) of application k to be ν (k) (t)/t. To balance the tasks according to the speci?ed priority weights, we use the objective funcα(k) tion Maximize mink w(k) . This function, called fair throughput in the following, corresponds to the wellknown MAX-MIN fairness strategy [8, 18] among the di?erent applications, with coe?cients 1/w(k) . We will consider both centralized and decentralized schedulers. For smaller platforms it may be realistic to assume a centralized scheduler, which makes its decisions based upon complete and reliable knowledge of all application and platform parameters. With such knowledge at our disposal, we are able to determine an optimal schedule, i.e. a schedule that maximizes the fair throughput asymptotically. This is done by formulating all constraints into a linear programming problem, and using the solution to construct a periodic schedule. Except during the (?xed length) start-up and clean-up periods no schedule can have higher throughput. For single-level rooted trees, we provide an interesting characterization of the optimal solution: applications with larger communication-to-computation ratio should be processed by the workers with larger bandwidths, independent of the communication-tocomputation ratios of the workers. For large-scale platforms, particularly ones in which resource availability changes over time, a centralized scheduler may be undesirable. Only local information, such as the current capacity (CPU speed and link bandwidth) of a processor’s neighbors, is likely to be available. One major goal of this paper is to investigate whether decentralized scheduling algorithms can reach optimal throughput, or at least achieve a signi?cant fraction of it. We provide several decentralized heuristics that rely exclusively on local information to make scheduling decisions. The key underlying principles of these heuristics come from our characterization of the optimal solution for star networks: give priority to high-bandwidth children, and assign them

tasks of larger communication-to-computation ratios. We evaluate the decentralized heuristics through extensive simulations using SimGrid [17], and use a centralized algorithm (guided by the linear program solution) as a reference basis. The rest of the paper is organized as follows. In Section 2, we state precisely the scheduling problem under consideration, with all application and platform parameters, and the objective function used. Section 3 explains how to analytically compute the best solution, using a linear programming approach, and characterizes the solution for single-level trees. Then Section 4 deals with the design of several decentralized scheduling heuristics, while Section 5 provides an experimental comparison of these heuristics. Finally, we state some concluding remarks in Section 6.1

2. Platform and Application Model
In this paper, we make a number of overly simplistic assumptions; nevertheless, we believe that both by the theory and the experiments provide insight into the important question of how to optimally and fairly coschedule heterogeneous applications on heterogeneous grids.

2.1. Platform Model
The target computing platform is either a singlelevel tree (also called a star network ) or an arbitrary tree. The processor at the root of the tree is denoted P0 . There are p additional “worker nodes”, P1 , P2 , . . . , Pp ; each worker Pu has a single parent Pp(u) , and the link between Pu and its parent has bandwidth bu . We assume a linear-cost communication model, hence it takes X/bu time units to send a message of size X from Pp(u) to Pu . We ignore processor-task a?nities; instead, we assume one can express the computational requirements of tasks as a number of ?oating-point operations, and that processor Pu can execute cu ?oatingpoint operations per second (independent of which application it is executing). There are several scenarios for the operation of the processors, which are discussed in Section A3 of the Appendix. In this paper, we concentrate on the full overlap, single-port model [9, 10]. In this model, a processor node can simultaneously receive data from one of its neighbors, perform some (independent) computation, and send data to one of its neighbors. At any given time, there are at most two communications involving a given processor, one sent and the other received.
1 Due to lack of space, we do not provide related work in this article but a thorough survey can be found in the extended version of this article [5].

2.2. Application Model
We consider K applications, Ak , 1 k K . The root node P0 initially holds all the input data necessary for each application Ak . Each application has a priority weight w(k) as described earlier. Each application is composed of a set of independent, same-size tasks. We can think of each Ak as bag of tasks, and the tasks are ?les that require some processing. A task of application Ak is called a task of type k . We let c(k) be the amount of computation (in ?oating point operations) required to process a task of type k . Similarly, b(k) is the size (in bytes) of (the ?le associated to) a task of type k . We assume that the only communication required is outwards from the root, i.e. that the amount of data returned by the worker is negligible. Our results are equally applicable to the scenario in which the input to each task is negligible but the output is large. The communication-to-computation ratio of tasks of type k is de?ned as b(k) /c(k) . Note that our notations use indices for platform resources (bandwidth bu , CPU speed cu ) and exponents for application parameters (bytes b(k) , ?oating-point operations c(k) , weight w(k) ). .

makespan optimization problem outweighs the disadvantage of not knowing the exact length of the start-up and clean-up phases of a ?nite schedule.

3. Computing the Optimal Solution
In this section, we show how to compute the optimal throughput, using a linear programming formulation. For star networks we give a nice characterization of the solution, which will guide the design of some heuristics in Section 4.

3.1. Linear Programming Solution
A summary of our notation follows: - P0 is the root processor and Pp(u) is the parent of node Pu for u = 0. - Γ(u) is the set of indices of the children of node Pu . - Node Pu can compute cu ?oating-point operations per time unit, and, if u = 0, can receive bu bytes from its parent Pp(u) . - Application k has weight w(k) , and each task of type k involves b(k) bytes and c(k) ?oating-point operations. We use linear programming to solve for the variables: - αu , the number of tasks of type k executed by Pu each time unit. - α(k) , the total number of tasks of type k executed per time unit. - sentu→v , the number of tasks of type k received by Pv from Pp(v) per time unit. Any feasible schedule must be a solution to the linear programming problem: Maximize mink w(k) under the constraints ? (k ) (k ) ? ?k, (de?nition of α(k) ) ? u αu = α ? (k ) (k ) (k ) ? ? ?k, ?u = 0, sentp(u)→u = αu + v∈Γ(u) sentu→v ? ? ? ? ? (data movement conservation) ? ? ? (k ) (k ) ? cu ? ?u, k αu · c (computation limit at node Pu ) P ? (k) (k) ? ? k sentu→v ·b ? ?u, 1 ? v ∈ Γ( u ) bv ? ? ? (communication limit out of Pu ) ? ? ? (k ) (k ) ? ? ? k, u α 0 and sent 0 u u→v ? ? (non-negativity) (3) We assume that all the parameters to the linear programming problem are rational numbers, and hence the solution will be rational also.
α(k) (k ) (k )

2.3. Objective Function
If each application had an unlimited supply of tasks, our objective function would be Maximize lim min
t→∞ k

ν (k) (t) w (k ) · t


where ν (k) (t) is the number of tasks of application Ak completed by time t. However, we can do better than studying asymptotic behavior. Following standard practice, we optimize the “steady-state throughput”, i.e. Maximize min

α w (k )

(k )



where α(k) is the average number of tasks of Ak executed per time unit. There are two features of this approach. First, if we can derive an upper bound on the steady-state throughput for arbitrarily long periods, then this is an upper bound on the limit of formula (1). Second, if we construct a periodic schedule – one that begins and ends in exactly the same state – then the periodic schedule’s throughput will be a lower bound on the limit of formula (1). Thus, this approach allows us to derive optimal results. When the number of tasks per application is large, we believe the advantage of avoiding the NP-completeness of the

3.2. Reconstructing a Periodic Schedule
Suppose we have solved linear program (3). The conditions in the linear program deal with steady state behavior, but it may not be immediately obvious that there exists a valid schedule, where precedence constraints are satis?ed (i.e. a task is processed on a processor only when the corresponding input ?le has been routed to this processor), that achieves the desired throughput. Nevertheless, suppose we have determined (k ) (k ) all the values αu , and sentu→v . De?ne the time period Tperiod to be the least common multiple of the denominators of these rational numbers. Thus, in one time period, there will be an integral number of tasks sent over each link and executed by each node. We give each node su?cient bu?er space to hold twice the number of tasks it receives per time period. Each task it receives in period i will, in period i + 1, either be computed locally or sent to a child. Since each node receives tasks from only one other node (its parent), there is no concern with scheduling the receives to avoid con?icts. Further, each node is free to schedule its sends arbitrarily within a time period. Thus, this schedule is substantially easier than if the processors were connected as an arbitrary graph (c.f. [3]). A node at depth d doesn’t receive any tasks during the ?rst d ? 1 time periods, so will only enter “steady state mode” in time period d. Similarly, the root will eventually run out of tasks to send, so the ?nal time periods will also be di?erent. It is often possible to improve the schedule in the start-up and clean-up time periods, which is the concern of the NP-complete makespan minimization problem. However, the periodic schedule described above is asymptotically optimal. More precisely, let z be the number of tasks executed by the periodic schedule in steady state during d time periods, where d is the maximum depth of any node that executes a positive number of tasks. Then our schedule will execute at most z fewer tasks than any possible (not necessarily periodic) schedule. One ?nal comment is that the time period Tperiod , and the amount of bu?er space used, can be extraordinarily large, making this schedule impractical. We will revisit this issue later.

ecuted by the next slice of processors, and so on. There is a possible overlap between the slices. For instance Pa1 , the processor at the boundary of the ?rst two slices, may execute tasks for both applications A1 and A2 . To simplify notations in the following proposition, we consider the root P0 as a worker with in?nite bandwidth (b0 = +∞): Proposition 1. Sort the nodes by bandwidth so that b0 b1 . . . bp , and sort the applications by communication-to-computation ratio so b(1) b(2) b(K ) that c ... . Then there exist in(1) c(2) c(K ) dices a0 a1 . . . aK such that only processors Pu , u ∈ [ak?1 , ak ], execute tasks of type k in the optimal solution. Proof. The essential idea is to show that if a node Pi is assigned a task with a lower communication-tocomputation ratio than a task assigned to Pi+1 , then these two nodes could swap an equal amount of computational work. This would reduce the communication time required by the schedule without changing any throughputs. Thus, by a sequence of such swaps, any schedule can be transformed to one of the desired structure, without changing the fair throughput. See the Appendix A1 for a detailed proof. We did not succeed in deriving a counterpart of Proposition 1 for tree-shaped platforms. Intuitively, the problem is that a high-bandwidth child of node Pi can itself have a low-bandwidth, high-compute-rate child, so there is no a priori reason to give Pi only communication-intensive tasks. Still, we use the intuition provided by Proposition 1 and its proof to design the heuristic of Section 4.5.

4. Demand-driven and Decentralized Heuristics
As shown in Section 3.1, given a tree-shaped platform and the set of all application parameters, we are able to compute an optimal periodic schedule. This approach su?ers from several serious drawbacks. The ?rst is that the period of the schedule is the least common multiple of the denominators of the solution of linear program (3). This period may be huge, requiring the nodes to have unreasonably large bu?ers to ensure uninterrupted steady-state behavior. The problem of bu?er size has already been pointed out in [12, 7], where it is shown that no ?nite amount of bu?er space is su?cient for every tree. It is also known that ?nding the optimal throughput when bu?er sizes are bounded is a strongly NP-hard problem even in very simple situations [7]. Since unlimited bu?er space is unrealistic, we will only consider demand-driven algorithms. Each

3.3. The Optimal Solution for Star Networks
When the computer platform is a star network, we can prove the optimal solution has a very particular structure: Informally, each application is executed by a slice of consecutive nodes. The application with the highest communication-tocomputation ratio is executed by a ?rst slice of processors, those with largest bandwidths. Then the next most communication-intensive application is ex-

node has a worker thread and a scheduler thread. The worker thread is an in?nite loop that requests a task from the same node’s scheduler thread and then, upon receiving a task, executes it. Figure 1 shows the pseudo-code for the scheduler thread. Note that line 2 says when there’s room, the scheduler requests work from its parent. Because a request for work doesn’t specify the type of the application, there must be enough room for any type task even after all previous outstanding requests are satis?ed with tasks of the largest type. The “select” choices in line 5 depend on the particular heuristic used, and can be based on, for instance, the history of requests and task types it has received and the communication times it has observed for its children.
1: 2: 3: 4: 5:

we can solve the linear programming problem (3.1) and tell each node the number of tasks of each type it should assign to each of its children each time period. Thereafter, no further global communication is required. Each scheduler thread uses a 1D load-balancing mechanism [4] to select a requesting thread and an application type. The 1D load-balancing mechanism works as follows: if choice i should be made with frequency f (i), and has already been made g (i) times, then the next task to be )+1 g (k)+1 sent will be of type , where g( f ( ) = mink f (k) . We might hope the LP heuristic would always converge to the optimal throughput, but we will see in Section 5.2.1 that it may not, primarily because of insu?cient bu?er space.

4.2. First Come First Served (FCFS )
Loop If there will be room in your bu?er, request work from parent. Get incoming requests from your local worker and children, if any. Move incoming tasks from your parent, if any, into your bu?er. Select which thread (your worker or a child’s scheduler) to assign work to and the type of application that will be assigned. If you have a task and a request that match your choice Then Send the task to the chosen thread (when the send port is free) Else Wait for a request or a task The FCFS heuristic is a very simple and common decentralized heuristic. Each scheduler thread simply ful?lls its requests on a First Come First Served basis, using the tasks it receives in order from its parent. The root ensures fairness by selecting task types using the 1D load-balancing mechanism with frequencies given by the applications’ priority weights w(k) . This simple heuristic has the disadvantage, not shared by the other methods we consider, that an extremely slow communication link cannot be avoided. Thus, optimal performance should not be expected.

6: 7: 8: 9:

4.3. Coarse-Grain (CGBC )


Figure 1. Demand-driven scheduler thread, run in each node

A second problem that some schedulers (including the one of Section 3.2) encounter is that centralized coordination may become an issue when the size of the platform grows beyond a certain point. It may be hard to collect up-to-date information and disseminate it to all nodes in the system. Consequently, a decentralized scheduling algorithm, where all choices are based exclusively on locally available information, is desirable. In the following we consider one demand-driven algorithm that is based on global information (the solution to the linear programming problem), and four that are decentralized.

4.1. Centralized LP-based (LP )
If we know the computation power and communication speeds of all nodes in the distributed system,

This heuristic (CGBC ) builds upon our previous work for scheduling a single application on a tree shaped platform [6, 3]. In bandwidth-centric scheduling, each node only needs to know the bandwidth to each of its children. The node’s own worker thread is considered to be a child with in?nite bandwidth. The scheduler thread prioritizes its children in order of bandwidth, so the greatest bandwidth has highest priority. The scheduler always assigns tasks to the the highest-priority requester. Bandwidth-centric scheduling has been shown to have optimal steadystate throughput, both theoretically and, when the bu?ers are su?ciently large, in extensive simulations. The idea of the coarse-grain heuristic is to assemble several tasks into a large one. More precisely, we build a macro-task out of w(k) tasks of type k , for each k . These are the units that are scheduled using the bandwidth-centric method. Thus, fairness is assured. Unfortunately, even though bandwidth-centric scheduling can give optimal throughput of macrotasks, the CGBC heuristic does not reach the optimal fair throughput. Indeed, Proposition 1 asserts that nodes with faster incoming links should process only

tasks with larger communication-to-computation ratios. But since a macro-task includes tasks of all types, CGBC will send communication-intensive tasks to some low-bandwidth nodes.

(resp. lowest) weighted throughput relative to the target. Those operations attempt to increase the number of tasks of type B that are assigned, sometimes by reducing the number of A’s.2 Communication Trading Suppose A has a higher communication-to-computation ratio than B , which is the common case since we start with only tasks of type 1. Then if a child reports that it is not fully utilized (either because its CPU is idle or because it can’t keep up with the requests it receives from the grandchildren) then the parent can substitute some tasks of type A for type B (i.e. send some tasks of type B instead of tasks of type B to his under-utilized child). It should make the substitution in a way that keeps the communication time the same (i.e. trading them in the ratio of b(B ) ’s A’s for b(A) B ’s), and limited by the number that would make the weighted throughputs equal. Gap ?lling Suppose that some bandwidth is not used and that a remote processor Pu could receive more tasks of an unfavored application. Let εB denote the number of tasks of type B that this processor could handle. If we de- +εB p ( i) note by CPU the CPU occupation of processor Pu , we have: k) (k) α( u .c i CPU = , and the folk cu lowing condition on εB has to hold (B ) 1. We also true: CPU + εB ccu need to verify that there is enough u free bandwidth along the path from the root node to Pu . Therefore for any node i along this path, we need the following condition on εB to hold true: sentp(i)→j .b(k)
k j (k )

4.4. Parallel Bandwidth-Centric (PBC )
The parallel bandwidth-centric heuristic (PBC ) superposes bandwidth-centric trees for each type of task, running all of them in parallel. More precisely, each node has K scheduler and K worker threads that run concurrently, corresponding to the K application types. Threads only communicate with other threads of their own type. Fairness is ensured by the root, which allocates tasks to its separate schedulers in the desired ratios. In all our simulations, we enforce the one-port constraint for each scheduler thread. But for this PBC heuristic, we have not enforced the constraint globally across the schedulers. Thus, it is possible that a node will send as many as K tasks concurrently, one of each type. In this case, we do model the contention on the port, so the aggregate bandwidth doesn’t exceed the port’s limit. (Similarly, the node’s processor can multitask between multiple tasks.) This gives the PBC strategy an unfair advantage over the other heuristics. In fact, it has been shown [12] that allowing interruptible communication (which is similar to concurrent communication) dramatically reduces the amount of bu?er space needed to achieve optimal throughput.

4.5. Data-Centric CENTRIC )



This heuristic is our best attempt to design a decentralized demand-driven algorithm that converges to the a solution of the linear program (3). The idea is to start from the bandwidth-centric solution for the most communication-intensive application and to progressively replace some of these tasks for more computation-intensive ones. Doing so, we come up with (k ) better values for the expected αu and the expected (k ) sentu→v , which can in turn be used in the demanddriven algorithm of the Figure 1. These frequencies are continuously recomputed so as to cope with potential availability variations. The rest of this subsection is devoted to details of the trading operations. We sort the task types by non-increasing communication-to-computation ratios. We start the algorithm using the pure bandwidth-centric approach for tasks of type 1, but as the computation proceeds, a node will ?nd itself receiving a mix of di?erent types of tasks. To reduce the imbalance, the root applies the four operations described below, in the listed order of precedence. In the following, A (resp. B ) denotes the application that currently has the highest



b( B ) bi


bus occupation (p(i))

Lastly, to avoid over-reducing the imbalance between α(A) and α(B ) , we add the following constraint: α(A) α(B ) ? εB . Therefore, we have: ? cu (1 ? CPU ) (B ) εB = min ? ,α ? α ( A) , c(B ) ? 1 ? bus occupation (p(i)) min .bi ? i ∈ path from b( B )
the root to Pu


In the following, we suppose without loss of generality that application characteristics have been scaled so that they all have the same priority weight.

Bus de-saturation The bus may have been saturated by tasks with a high communication-tocomputation ratio. We may then still be using only workers with high communication capacity. In such a situation, the tree has to be widened (i.e. use additional subtrees) and the only way to do that is to reduce the amount of tasks of type A that are pro( A) ( A) cessed by the subtrees. The αi and senti→j values of any node of the branch with the smallest bandwidth that process some tasks of type A are then scaled down by a factor of 0.9. This operation allows us to decrease the communication resource utilization and precedes “Gap ?lling” operations. Task trading on the root At some point (when application A is processed only on the root node) we may have no choice but to trade εA tasks of type A for εB tasks of type B . Then we will have the follow( A) ing constraints: εA α0 , α(A) ? εA α(B ) + εB and (B ) (A) εB . cc0 = εA . cc0 . Therefore, we have εA = min αroot ,
( A)

steady-state behavior, and measuring throughput becomes even trickier when the schedule is not periodic. We took a pragmatic, heuristic approach for our experiments. Let T denote the earliest time that all tasks of some application were completed. Let Nk (t) denote the number of tasks of type k that were ?nished in time period [0, t]. We can then de?ne the achieved throughput ρk for application k by: ρk = Nk ((1 ? ε)T ) ? Nk (εT ) (1 ? 2ε)T , where 0 ≤ ε <, 0.5.

The ε is an artifact that lets us ignore the initial and ?nal instabilities (in practice, we set ε to be equal to 0.1). In the following, we will refer to ρk as the experimental throughput of application k as opposed to the expected throughput that can be computed solving linear program (3). Likewise, the minimum of the weighted experimental throughputs is called the experimental fair throughput. 5.1.2. Platform generation The platforms used in our experiments are random trees described by two parameters: the number of nodes n and the maximum degree degreemax . To generate the interconnection network topology, we use a breadth-?rst algorithm (see Appendix A2 for more details) in order to have wide trees rather than ?liform (deep and narrow) ones. In our experiments, we generated trees of 5, 10, 20, 50 and 100 nodes. The maximum degree was 2, 5, or 15, and 10 trees of each con?guration were generated. Thus, our test set comprised 150 trees in total. Then we assign typical capacity, latency and CPU power values on edges and nodes at random. Those values come from real measurements performed, using tools like pathchar, on machines spread across the Internet. CPU power ranged from 22.151 M?ops (an old Pentium Pro 200MHz) to 171.667 M?ops (an Athlon 1800). Bandwidth ranged from 110 kb/s to 7 Mb/s and latency from 6 ms to 10 s. Note that in the simulator that we are using (see Section 5.1.4), latency is a limiting factor as well as the link capacity for determining the e?ective bandwidth of a connection. 5.1.3. Application generation An application is mainly characterized by its communication-tocomputation ratio (CCR ). We decide that the smallest reasonable CCR was CCR min = 0.001, which corresponds to the computation-intensive task of multiplying two 3500 × 3500 matrices. We also decided on an upper bound for CCR of 4.6, corresponding to the addition of two such matrices. In choosing application types, we chose CCR max between 0.002 and 4.6, and then chose the applications’ CCR ’s to be evenly spaced in the range [CCR min , CCR max ]. For simplicity, we made all priority weights be 1.

α ( A) ? α ( B ) 1+
b(A) b(B )

and εB =

c(A) εA c(B )

The above operations are continuously performed (with the listed order of precedence) until we reach a satisfying balance, such as maxk
α(k) w(k)

? mink
α(k) w(k)

α(k) w(k)

< 0.05.


The above operations may appear as needing a global knowledge. For example, it may seem at ?rst sight that when performing a “Gap ?lling ” operation, the master needs to know all informations on the the path connecting him to its remote descendant Pu . However, this operation in fact simply amount to compute a minimum along this path which can easily (and ef?ciently) be done by using a distributed propagation mechanism along this path, thus making the need of the master to know Pu irrelevant. The same kind of technique can be used for all other operations as they only imply descendants in a single subtree.

5. Simulation Results
5.1. Evaluation methodology
5.1.1. Throughput evaluation It is not at all obvious how to determine that a computation has entered

5.1.4. Heuristic implementation The experiments were performed using the SimGrid simulator [17]. The simulator’s performance is much more complex than the simplistic bandwidth and computational speed model used to design our heuristics. Therefore, the values of ci and bi values were measured from within the simulator and used to make the decisions in the algorithms of Section 4. As explained in section 4.5, the demand-driven algorithms send requests (involving a few bytes) from children to parents. Our simulations included the request mechanism, and we ensured that no deadlock occurred in our thousands of experiments, even when some load-variations occurred. Except where otherwise noted, throughput evaluations were performed using 200 tasks per application. Note, that we carefully checked using a larger number of tasks that this was always su?cient to reach the steady-state.
0.12 0.1 Frequency 0.08 0.06 0.04 0.02 0 0 0.2 0.4 0.6 0.8 1 Deviation from theoretical throughput

heuristics CGBC , LP and DATA-CENTRIC . All three heuristics exhibited a similar distribution, so they were combined in this ?gure. The average deviation is equal to 9.426%. However, when we increased the bu?er size by a factor ten (and increased the number of tasks per application to 2000), the mean average deviation dropped to 0.334%. Even though the larger bu?er size led to much better throughput, we considered it unrealistic, and used size 10 in all other experiments. 5.2.2. Performance of Heuristics Let us ?rst compare the relative performances of our ?ve heuristics (FCFS , PBC , CGBC , LP and DATA-CENTRIC ). More precisely, for each experimental setting (i.e. a given platform and a given CCR interval), we compute the (neperian) logarithm of the ratio of the experimental fair throughput of LP with the experimental fair throughput of a given heuristic (applying a logarithm enables us to have a symmetrical value). Therefore, a positive value means that LP performed better than the other heuristic. Figure 3 depicts the histogram plots of these values. First of all, we can see that most values are positive, which illustrates the superiority of LP . Next, we can see on Figure 3(a) that DATA-CENTRIC is very close to LP most of the time, despite the distributed computation of the weights. However, the geometric average3 of these ratios is equal to 1.164, which is slightly larger than the geometric average for CGBC (1.156). The reason is that even though in most settings DATACENTRIC ends up with a very good solution, in a few instances its performance was very bad (up to 16 times worse than LP ). In contrast, CGBC (see Figure 3(d)) is much more stable since its worst performance is only two times worse than LP . Note that those failures happen on any type of tree (small or large, ?liform or wide) and that the geometric average of these two heuristics are always very close to each other. We also have checked that these failures are not due to an artifact of the decentralized control of the scheduling by ensuring that the theoretical throughput has the same behavior (i.e. the bad behavior actually comes from the compu(k ) (k ) tation of the expected αu and sentu→v ). We are still investigating the reasons why DATA-CENTRIC fails on some instances and suspect that it is due to the use of the sometimes misleading intuition of Proposition 1. Indeed, in this heuristic, applications with a low communication-to-computation ratio are mainly performed on the rightmost part of the tree while applica3 It is a well-known fact [16] that arithmetic average of ratios can lead to contradictory conclusions when changing the reference point. Therefore, we use a geometric average of ratios which is known to be closer to the general idea of average ratio.

Figure 2. Deviation of experimental fair throughput from expected theoretical throughput

5.2. Case study
5.2.1. Theoretical versus observed throughput For the heuristics LP , DATA-CENTRIC and CGBC , we can easily compute an expected theoretical fair throughput. This allowed us to explore how implementation issues result in the experimental fair throughput di?ering from the expected theoretical fair throughput. There are many reasons that the decentralized scheduling might have a smaller fair throughput than the corresponding theoretical one, such as the overhead of the request mechanism or a startup periods longer than the 10% we allowed for. It turned out that the major cause of ine?ciency was the limit on the bu?er size. In general, our experiments assumed enough bu?er space to hold 10 tasks of any type. In this case, Figure 2 depicts the experimental fair throughput deviation from the expected theoretical throughput for

0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 -2 0 2 0.05 0.04 0.03 0.02 0.01 0 0 0.5 Frequency


0.7 0.6 0.5 0.4 0.3 0.2 0.05 0.04 0.03 0.02 0.01 0 0 -2 0 2 0.5 1 4 1.5 Frequency


1 4


2 6


3 8

0.1 0

2 6


3 8

Log(deviation from LP heuristic)

Log(deviation from LP heuristic)

(a) Performances of DATA-CENTRIC
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 -2 0 2 4 6 8 Log(deviation from LP heuristic) 0.05 0.04 0.03 0.02 0.01 0 0 0.5 1 1.5 2 2.5 3 Frequency Frequency PBC 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 -2

(b) Performances of FCFS

0.05 0.04 0.03 0.02 0.01 0 0 0 2 0.5 1 4 1.5 2 6 2.5 3 8

Log(deviation from LP heuristic)

(c) Performances of PBC

(d) Performances of CGBC

Figure 3. Logarithm of the deviation from LP performances. tions with a high communication-to-computation are mainly performed on the leftmost part, which is de?nitely not optimal on particular instances. Unsurprisingly, PBC leads to very bad results. In many situations (more than 35%), an application has been particularly unfavored and the fair experimental throughput was close to 0. The logarithm of the deviation for these situations has been normalized to 8. These poor results advocate the need for fairness guarantees in distributed computing environments like the ones we consider. Lastly, the geometrical average of FCFS is 1.564 and in the worst case, its performance is more than 8 times worse than LP . On the average, FCFS is therefore much worse than LP . On small platforms, the performances for FCFS and CGBC have the same order of magnitude. However, on larger ones (size 50 and 100), CGBC performs much better (geometrical average equal to 1.243) than FCFS (geometrical average equal to 2.0399). platform made of heterogeneous processing and communication resources. Our contributions to this problem are the following: ? We ?rst presented a centralized algorithm which, given the performances of all resources, computes an optimal schedule with respect to throughput maximization. We also have characterized a simple way of computing the optimal solution on single-level trees. ? However, on general platforms the centralized algorithm requires gathering information about the platform at a single location, which may be unrealistic for large-scale distributed systems, particularly when these parameters (bandwidths, processing power) may be constantly changing. Furthermore, the optimal schedule may require to have an arbitrary large number of bu?ers and may induce very large latencies. We have therefore concentrated on distributed algorithms and designed several decentralized heuristics using only a limited number of bu?ers. ? We have evaluated the e?cacy of these heuristics using a wide range of realistic simulation scenarios. The results obtained by the most sophis-

6. Conclusion
In this paper, we present several heuristics for scheduling multiple applications on a tree-connected

ticated heuristics are quite reasonable compared to the optimal centralized algorithm. Thus far, the best solutions rely on an explicit-rate calculation (using either a global centralized linearbased approach or a fully-distributed approach like in DATA-CENTRIC ). It is a well-known fact in the network community [18] that max-min fairness is generally achieved by explicit-rate calculation (e.g. in ATM networks) and rather hard to achieve in a fullydecentralized fashion. Yet, fully distributed algorithms are known to realize other kind of fairness (e.g. proportional fairness for some variants of TCP). Adapting such algorithms to our framework is however really challenging as both communications and computations are involved. A promising approach would be to adapt the decentralized multi-commodity ?ow of Awerbuch and Leighton [1, 2] to our framework. Last, as we have seen with the PBC heuristic, noncooperative approaches where each application optimizes its own throughput lead to a particularly unfair Nash equilibrium [19, 13]. An other approach could be a cooperative approach where several decision makers (each of them being responsible for a given application) cooperate in making the decisions such that each of them will operate at its optimum. This situation can be modeled as a cooperative game like in [15, 14]. However in our situation, hierarchical resource sharing is rather hard to model, which renders such an approach quite challenging.


[8] [9]


[11] [12]


[14] [1] B. Awerbuch and T. Leighton. A simple local-control approximation algorithm for multicommodity ?ow. In FOCS ’93: Proceedings of the 24th Conference on Foundations of Computer Science, pages 459–468. IEEE Computer Society Press, 1993. [2] B. Awerbuch and T. Leighton. Improved approximation algorithms for the multi-commodity ?ow problem and local competitive routing in dynamic networks. In STOC ’94: Proceedings of the 26h ACM symposium on Theory of Computing, pages 487–496. ACM Press, 1994. [3] C. Banino, O. Beaumont, L. Carter, J. Ferrante, A. Legrand, and Y. Robert. Scheduling strategies for master-slave tasking on heterogeneous processor platforms. IEEE Trans. Parallel Distributed Systems, 15(4):319–330, 2004. [4] O. Beaumont, V. Boudet, A. Petitet, F. Rastello, and Y. Robert. A proposal for a heterogeneous cluster ScaLAPACK (dense linear solvers). IEEE Trans. Computers, 50(10):1052–1070, 2001. [5] O. Beaumont, L. Carter, J. Ferrante, A. Legrand, L. Marchal, and Y. Robert. Scheduling multiple bags of tasks on heterogeneous master-worker platforms: centralized versus distributed solutions. Research Report 2005-45, LIP, ENS Lyon, France, Sept.






2005. Available at graal.ens-lyon.fr/~yrobert/ rr2005-45.ps. O. Beaumont, L. Carter, J. Ferrante, A. Legrand, and Y. Robert. Bandwidth-centric allocation of independent tasks on heterogeneous platforms. In International Parallel and Distributed Processing Symposium (IPDPS’2002). IEEE Computer Society Press, 2002. O. Beaumont, A. Legrand, L. Marchal, and Y. Robert. Independent and divisible tasks scheduling on heterogeneous star-schaped platforms with limited memory. In PDP’2005, 13th Euromicro Workshop on Parallel, Distributed and Network-based Processing, pages 179– 186. IEEE Computer Society Press, 2005. D. Bertsekas and R. Gallager. Data Networks. Prentice Hall, 1987. P. Bhat, C. Raghavendra, and V. Prasanna. E?cient collective communication in distributed heterogeneous systems. In ICDCS’99 19th International Conference on Distributed Computing Systems, pages 15–24. IEEE Computer Society Press, 1999. P. Bhat, C. Raghavendra, and V. Prasanna. E?cient collective communication in distributed heterogeneous systems. Journal of Parallel and Distributed Computing, 63:251–263, 2003. Berkeley Open Infrastructure for Network Computing. http://boinc.berkeley.edu. L. Carter, H. Casanova, J. Ferrante, and B. Kreaseck. Autonomous protocols for bandwidth-centric scheduling of independent-task applications. In International Parallel and Distributed Processing Symposium IPDPS’2003. IEEE Computer Society Press, 2003. F. Forg? o, Jen¨ o, and F. Szdarovsky. Introduction to the Theory of Games: Concepts, Methods, Applications. Kluwer Academic Publishers, 2 edition, 1999. D. Grosu and T. E. Carroll. A strategyproof mechanism for scheduling divisible loads in distributed systems. In I. C. S. Press, editor, Proc. of the 4th International Symposium on Parallel and Distributed Computing (ISPDC 2005), 2005. D. Grosu, A. T. Chronopoulos, and M. Y. Leung. Load balancing in distributed systems: An approach using cooperative games. In I. C. S. Press, editor, Proceedings of the 16th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2002), pages 501–510, 2002. R. Jay. The Art of Computer Systems Performance Analysis : Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley and Sons, Inc., 1991. A. Legrand, L. Marchal, and H. Casanova. Scheduling Distributed Applications: The SimGrid Simulation Framework. In Proceedings of the Third IEEE International Symposium on Cluster Computing and the Grid (CCGrid’03), May 2003. L. Massouli? e and J. Roberts. Bandwidth sharing: Objectives and algorithms. Transactions on Networking, 10(3):320–328, june 2002. J. F. Nash. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences USA, 36:48–49, 1950.



All rights reserved Powered by 甜梦文库 9512.net

copyright ©right 2010-2021。