IEEE TRANSACTIONS ON COMPUTERS, VOL. C-32, NO. 1, JANUARY 1983
Cost-Performance Bounds for Multimicrocomputer Networks
DANIEL A. REED, STUDENT MEMBER, IEEE, AND HERBERT D. SCHWETMAN

Abstract--Several interconnection structures for a distributed multimicrocomputer message-passing system are compared on the basis of cost and performance. Among the structures analyzed are buses, double rings, D-dimensional toroids, trees, cube-connected cycles, and chordal rings. Network cost is defined in terms of the number of network nodes and the unit cost of communication links and their associated connections. Simple asymptotic performance bounds are derived based on the bottleneck analysis of a queueing network. In contrast to the usual assumption of uniform message routing, the technique permits the introduction of a reference locality notion to the message routing behavior of network nodes. Finally, the cost, performance, and performance/cost functions are examined as the number of network nodes becomes very large.

Index Terms--Interconnection networks, large multicomputer systems, message routing, message traffic, network computers, performance analysis.
INTRODUCTION

In recent years, many researchers have sought ways to exploit the rapid development of LSI/VLSI technology in the construction of powerful computer systems. Proposals for multiple processor systems containing up to 10^6 VLSI chips have been made [12], [13]. At first appearance, networks of thousands of processors may not seem justifiable. There are, however, at least two primary motivations for developing such systems. The most obvious is the need to overcome the fundamental physical limits on computation speed imposed by sequential processing. The need for performance increases of factors of 100 or even 1000 is painfully obvious to workers in such fields as speech analysis, weather modeling, and nuclear fusion research. Only by injecting parallelism into the solution of such problems can one realistically expect to obtain truly large performance increases. Second, it has been suggested that large multiple processor systems will provide appropriate architectural support for new language proposals. In particular, the functional programming languages proposed by Backus [2] and the communicating sequential processes of Hoare [6] seem ideally suited to multiple processor systems whose computational tasks communicate via message passing.

Many ways to interconnect multiple processors have been proposed, but no real consensus on a best proposal has yet emerged. Not only is there a paucity of knowledge concerning the effect of various interconnection structures on performance, there is also no widely accepted method for modeling such structures. This, coupled with the large number of design parameters for parallel systems, has made comparison difficult.

Manuscript received December 18, 1981; revised July 23, 1982. The authors are with the Department of Computer Sciences, Purdue University, West Lafayette, IN 47907.
OVERVIEW

The context of this discussion is Wittie's network computer [14], an MIMD (multiple instruction, multiple data stream) system whose active computing nodes communicate by passing messages to one another over passive communication links. Nodes do not share any memory; all communication is performed by message passing. Each network node is assumed to consist of a processing element with some local memory, a communication processor capable of routing messages without delaying the processing element, and some (small) number of connections to communication links connecting the node to other nodes (see Fig. 1).

On such a network computer, a parallel computation may require multiple processing elements that exchange messages while executing cooperating tasks. There is no global synchronization among processing elements. Instead, computation at each processing element proceeds independently of all others, except when the processing element passes a message to or receives a message from its communication processor.

The interconnection networks over which messages are passed can be broadly classified as reconfigurable multistage switching networks and passive-link interconnections. There is a considerable body of literature comparing reconfigurable multistage switching networks such as banyans [5] and the shuffle-exchange [8]. In our analysis, passive-link structures whose nodes are embedded in the interconnection network are studied (see Fig. 2). The single bus, double ring, D-dimensional toroid, spanning bus hypercube, cube-connected cycles, chordal ring, and tree, among others, are compared on the basis of cost and performance.

The cost of each structure is defined as a function of the number of network nodes and the unit cost of communication links and their associated connections. Cost is significant only because it permits comparison of performance/cost ratios for various interconnection networks.
Many definitions of network performance have been proposed (e.g., average message delay, message density, and bus load). These notions are usually based on the assumption that the message routing distribution is uniform (i.e., the probability that node i sends messages to node j is the same for all i and j, i ≠ j) and that nodes generate messages at some fixed rate. We present an alternative definition of network performance based on the asymptotic or bottleneck behavior of a single class queueing network that relaxes this assumption. In mapping a distributed computation onto an interconnection structure, one would hope that those tasks communicating with high frequency are placed physically close to one another in the interconnection network. This clearly results in a message routing distribution that is significantly different from the usual assumption of uniform routing. To reflect this nonuniformity, a notion of reference locality is introduced to the message routing distribution. Furthermore, spontaneous generation of messages at nodes is not permitted. Rather, the rate of message departures at each node must equal the rate of message arrivals from elsewhere in the network.

0018-9340/83/0100-0083$01.00 © 1983 IEEE

Since Wittie [14] recently analyzed a subset of the structures considered here under the uniform routing assumption and provided order of magnitude values for the density of messages on links and the average number of links traversed by a message, our results can be viewed as both a refinement and an extension of his.

To simplify the presentation, we first discuss the methods used to derive cost and performance functions, and then apply these methods to several proposed networks. The notation employed throughout the remainder of the paper is summarized in Table I.

Fig. 1. Network node. Processing element accesses local memory and operates independently of communication processor.

COST FUNCTION

As noted earlier, each node of the system is assumed to consist of a processing element (PE), communication processor (CP), and some fixed number of link connections (LC) joining the node to bidirectional communication links (CL). We define the following simple cost function:

    Cost(Nettype, Netsize, CPE, CCL, CLC)
        = CPE * Netsize
        + CLC * Netsize * (number of connections per node)
        + CCL * (number of links)                                    (1)

where the following definitions apply:

    Nettype  type of interconnection structure
    Netsize  number of nodes in the structure
    CPE      unit cost of a PE-CP pair
    CLC      unit cost of a link connection
    CCL      unit cost of a communication link.

A word of caution is in order about the unit cost of communication links. Links can be of two types, dedicated links between two nodes or buses shared by two or more nodes. In the first case, CCL is simply the cost of each link. In the second case, CCL is assumed to be the cost of the bus divided by the number of connections to it. Cost function parameters for the interconnections discussed in the remainder of the paper can be found in Table II. These are based on the structural properties of each interconnection structure.

ASYMPTOTIC PERFORMANCE FUNCTION

Our performance analysis is based on asymptotic or bottleneck analysis of a single class, closed queueing network with load independent servers. Although its essentials are briefly reviewed here, the reader should consult Denning and Buzen [3] for complete details.

The only requisite assumption for application of bottleneck analysis to a closed queueing network is job flow balance. That is, the rate of arrivals must equal the rate of departures at every system device (i.e., the system must be in stochastic equilibrium). Each time a node sends a message to another node, the message must cross some number of communication links and pass through the communication processors of some intermediate nodes before reaching its destination processing element. At the destination, it causes some computation to take place. Each of these link crossings and destination processing element computations constitutes a visit to that link or processing element. If all possible source-destination pairs and the probability that they exchange messages are considered, the number of visits to each communication link and processing element made by an average message can be calculated.

Now consider such an average message and an arbitrary device i (either a node or a link). This average message will visit device i a certain number of times. This mean number of visits is called the visit ratio of device i and is denoted by Vi. Similarly, let Si denote the mean time required for device i to service a message (exclusive of queueing delays), Xi denote the mean rate of message completions at device i (Xi <= 1/Si), and Ui denote the utilization of device i (probability of being busy). The following laws are then known to hold [3]:

    Ui = Xi Si        (utilization law)
    Xi = Vi X0        (forced flow law)

where X0 is the message completion rate of the entire system. Simple algebra yields

    X0 = Ui / (Vi Si).

As the number of messages in the system becomes large, the utilization of the device with the largest Vi Si product must
Fig. 2. Interconnection networks for multimicrocomputer networks. Each node contains a processing element with local memory and a communication processor. Links are communication lines between nodes. (a) Single global bus (K = 5). (b) Complete connection (K = 5). (c) Double ring (K = 5). (d) Spanning bus hypercube (D = 3, w = 2). (e) Toroid (D = 3, w = 2). (f) Cube-connected cycles (D = 3). (g) Chordal ring (K = 6, c = 3). (h) Snowflake (b = 3, n = 2). (i) Dense snowflake (b = 3, n = 2). (j) Star (b = 3, n = 2). (k) Tree (b = 2, n = 3).
approach one (1). Hence, the maximum value of the system message completion rate is

    X0 <= 1 / (Vb Sb)

where

    Vb Sb = max_i Vi Si.

In their general definition, the visit ratios are only unique up to a normalizing constant. To ensure their uniqueness in our analysis, we normalize the Vi for the nodes such that their sum is one (1). The Vi Si product can then be interpreted as the total service requirement of a message at device i. Summing the Vi Si over all i gives the total service requirement of an average message in the system.

To simplify analysis, we assume that all processing elements have the same mean service time SPE and all links have the same mean service time SCL. We also assume that each node has the same message routing distribution. By this, we mean that each node i has the same probability of sending a message to a node reachable by traversing l links, for all i. Messages follow the path requiring the smallest number of link traversals to reach their destination. If there are multiple shortest paths, we assume that they are visited with equal probability unless otherwise specified. Message delays due to internal routing at
TABLE I
NOTATION

    b                       Branching factor for asymmetric structures
    c                       Chord length
    D                       Dimension of mesh or hypercube
    K                       Number of network nodes
    L                       Maximum distance to a node in the locality
    lmax                    Maximum source-destination distance
    n                       Number of levels in an asymmetric structure
    w                       Lattice width of mesh or hypercube
    φ                       Probability of visiting locality
    PE                      Processing element
    LC                      Communication link connection
    CL                      Communication link
    SPE                     Mean processing element service time
    SCL                     Mean communication link service time
    VPE                     Processing element visit ratio
    VCL                     Communication link visit ratio
    X0                      System message completion rate
    LocSize(L, Nettype)     Size of locality
    LV_symmetric^uniform    Average number of links traversed in a symmetric structure with uniform routing
    LV_symmetric^local      Average number of links traversed in a symmetric structure with locality
    LV_asymmetric^uniform   Average number of links traversed in an asymmetric structure with uniform routing
    NumLinks(K, Nettype)    Number of communication links in a network of size K
    Reach(l, Nettype)       Number of nodes reachable by traversing l links
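The bottleneck relations reviewed above (utilization law, forced flow law, and the bound X0 <= 1/max_i Vi Si) can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper; the device names and numeric values are hypothetical.

```python
# Asymptotic (bottleneck) bound for a closed queueing network:
# X0 <= 1 / max_i (Vi * Si).  The device with the largest Vi*Si
# product saturates first and limits system throughput.

def bottleneck_bound(devices):
    """devices: list of (name, visit_ratio, mean_service_time).
    Returns (upper bound on X0, name of the bottleneck device)."""
    name, v, s = max(devices, key=lambda d: d[1] * d[2])
    return 1.0 / (v * s), name

# Hypothetical example: one processing element class, one link class.
devices = [
    ("PE", 0.10, 2.0),   # VPE * SPE = 0.20
    ("CL", 0.25, 1.0),   # VCL * SCL = 0.25 -> bottleneck
]
bound, bottleneck = bottleneck_bound(devices)
```

With these numbers the communication link saturates first, so the system message completion rate can never exceed 1/0.25 = 4 messages per unit time, no matter how many messages circulate.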
TABLE II
SYSTEM SIZE

    System                    Nodes                     Connections               Links
    Single Global Bus         K                         K                         1
    Complete Connection       K                         K(K-1)                    K(K-1)/2
    Double Ring               K                         4K                        2K
    Spanning Bus Hypercube    w^D                       Dw^D                      Dw^(D-1)
    D-Dimensional Toroid      w^D                       2Dw^D                     Dw^D
    Cube-Connected Cycles     D2^D                      3D2^D                     3D2^(D-1)
    Chordal Ring              K                         3K                        3K/2
    Snowflake                 b^n                       2b^n                      (b^n - 1)/(b - 1)
    Dense Snowflake           b^n                       2b^n                      b^n - 1
    Finkel-Solomon Star       b((b-1)^n - 1)/(b - 2)    2b((b-1)^n - 1)/(b - 2)   b((b-1)^n - 1)/(b - 2)
    Tree                      (b^n - 1)/(b - 1)         (b+1)(b^n - 1)/(b - 1)    (b^n - b)/(b - 1)

Cost(Nettype, Netsize, CPE, CLC, CCL) = CPE * nodes + CLC * connections + CCL * links, where Nettype is the type of interconnection structure, CPE is the unit cost of a node, CLC the unit cost of a link connection, and CCL the unit cost of a communication link.
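Equation (1) combined with the node, connection, and link counts of Table II can be sketched directly. The topology formulas below are transcribed from the table for three of the structures; the unit costs are hypothetical values chosen only for illustration.

```python
# Sketch of the cost function (1):
#   Cost = CPE*nodes + CLC*connections + CCL*links

def cost(c_pe, c_lc, c_cl, nodes, connections, links):
    return c_pe * nodes + c_lc * connections + c_cl * links

# (nodes, connections, links) counts from Table II.
def single_bus(K):
    return (K, K, 1)

def double_ring(K):
    return (K, 4 * K, 2 * K)

def sbh(D, w):
    # spanning bus hypercube: w^D nodes, D connections per node,
    # D*w^(D-1) buses (w nodes share each bus)
    return (w ** D, D * w ** D, D * w ** (D - 1))

# Hypothetical unit costs.
CPE, CLC, CCL = 100.0, 5.0, 20.0
bus_cost = cost(CPE, CLC, CCL, *single_bus(16))   # 16-node global bus
sbh_cost = cost(CPE, CLC, CCL, *sbh(2, 4))        # 2-dimensional, 4-wide SBH
```

Both example systems have 16 nodes, so the difference in total cost is due entirely to the richer connection and bus structure of the hypercube.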
the communication processors of intermediate nodes are ignored. We model only the queueing delays and service times at the communication links and the destination processing element.

The remainder of our analysis is devoted to derivation of the maximum system message completion rate X0 for various interconnection networks. This performance function X0 differs in several significant ways from earlier performance metrics for distributed systems. When designing a system, one would like to determine the
maximum system message completion rate attainable given a set of device speeds and capacities. Hence, rather than fixing the message completion rate at the nodes and then determining the minimum message density that must be supported by the links to attain this rate, it seems more natural to determine the message completion rate given the visit ratios and the mean service times for the processing elements and communication links. The technique outlined below permits precisely this approach. As we shall see, one can also systematically determine the effect of varying the number of network nodes and device mean service times.

UNIFORM MESSAGE ROUTING--SYMMETRIC STRUCTURES

Messages sent by each node of a symmetric interconnection structure can reach the same number of nodes by traversing l communication links, for all l. A bidirectional ring is a simple example of a symmetric interconnection since each message can always reach two nodes by crossing l links. Under uniform message routing, the probability of node i sending a message to node j is the same for all i and j, i ≠ j. We assume that nodes do not send messages to themselves; hence, i ≠ j. This assumption is not required, and can easily be eliminated if desired.

Consider such a symmetric structure with K nodes obeying the uniform routing assumption. Since each processing element is visited with equal probability by an average message, the visit ratio for the processing elements is just

    VPE = 1/K.                                                   (2)

The nature of the interconnection determines whether all communication links are visited with equal probability. If there are different types of communication links in the network, each type will most likely be visited with different probability. For example, the chord links of the chordal ring in Fig. 2 will be visited with different probability than the ring links. The visit ratios of each link type must be determined using the technique outlined below.

Suppose one examines an arbitrary network node and the K - 1 possible destinations for messages sent from that node. Define Reach(l, Nettype) as the number of nodes reachable from an arbitrary node by using shortest paths composed of exactly l links in a network of type Nettype. The average number of links traversed by a message is LV_symmetric^uniform (uniform routing, symmetric structure) and is given by

    LV_symmetric^uniform = [ sum_{l=1}^{lmax} l Reach(l, Nettype) ] / (K - 1)          (3)

where lmax is the maximum number of links that must be crossed to reach any node. Now define NumLinks(K, Nettype) as the number of communication links in a network of size K and type Nettype. The link visit ratio is then simply

    VCL = LV_symmetric^uniform / NumLinks(K, Nettype).                                 (4)

We immediately have

    X0 <= min { 1/(VPE SPE), 1/(VCL SCL) }.                                            (5)

LOCAL MESSAGE ROUTING--SYMMETRIC STRUCTURES

Now suppose that the assumption of a uniform message routing distribution is relaxed. Each node of the structure is allowed to have a symmetric locality surrounding it that is visited with some high probability φ, while the nodes outside the locality are visited with probability 1 - φ. Let LocSize(L, Nettype) be defined as

    LocSize(L, Nettype) = sum_{l=1}^{L} Reach(l, Nettype).                             (6)

Then the LocSize(L, Nettype) nodes reachable in L or fewer links from a node constitute its locality and are visited with probability φ, while the K - LocSize(L, Nettype) - 1 other nodes are visited with probability 1 - φ. Since the interconnection network is symmetric, each node is contained in the localities of LocSize(L, Nettype) other nodes and is outside the localities of K - LocSize(L, Nettype) - 1 nodes. Thus, each node is still visited with equal probability, and the processing element visit ratio is just

    VPE = 1/K.                                                                         (7)

To obtain link visit ratios, consider again an arbitrary source node and all K - 1 possible message destinations. The mean number of communication links traversed by a message, LV_symmetric^local, is

    LV_symmetric^local = φ [ sum_{l=1}^{L} l Reach(l, Nettype) ] / LocSize(L, Nettype)
        + (1 - φ) [ LV_symmetric^uniform (K - 1) - sum_{l=1}^{L} l Reach(l, Nettype) ] / (K - LocSize(L, Nettype) - 1).    (8)
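Equations (3), (6), and (8) can be made concrete for a bidirectional ring with an odd number of nodes K, where Reach(l, ring) = 2 for every l from 1 to (K - 1)/2. The code below is an illustrative sketch, not from the paper, and assumes L < (K - 1)/2 so that some nodes lie outside the locality.

```python
# Equations (3), (6), and (8) evaluated for a bidirectional ring
# with an odd number of nodes K.

def reach_ring(l, K):
    # two nodes (one clockwise, one counterclockwise) at each distance
    return 2 if 1 <= l <= (K - 1) // 2 else 0

def lv_uniform(K):                                                  # eq (3)
    lmax = (K - 1) // 2
    return sum(l * reach_ring(l, K) for l in range(1, lmax + 1)) / (K - 1)

def loc_size(L, K):                                                 # eq (6)
    return sum(reach_ring(l, K) for l in range(1, L + 1))

def lv_local(K, L, phi):                                            # eq (8)
    lmax = (K - 1) // 2
    inside = sum(l * reach_ring(l, K) for l in range(1, L + 1))
    outside = sum(l * reach_ring(l, K) for l in range(L + 1, lmax + 1))
    return (phi * inside / loc_size(L, K)
            + (1 - phi) * outside / (K - loc_size(L, K) - 1))

mean_uniform = lv_uniform(9)          # mean distance on a 9-node ring
mean_local = lv_local(9, 1, 1.0)      # all traffic to the 2 neighbors
```

As a consistency check, choosing φ equal to the uniform probability of landing in the locality, φ = LocSize/(K - 1), makes (8) collapse back to (3).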
The first term is simply the product of the average number of links traversed while visiting a node in the locality and the probability of visiting the locality φ. The second term has a similar interpretation for nodes outside the locality. The link visit ratio is then

    VCL = LV_symmetric^local / NumLinks(K, Nettype)                                    (9)

and the system message completion rate is bounded by

    X0 <= min { 1/(VPE SPE), 1/(VCL SCL) }.                                            (10)

This technique can easily be generalized to include locality definitions that are continuous functions. One might, for example, make the probability of sending a message to a node l links away inversely proportional to l. To do this, one need only include a function φ(l) in the summation above. In most cases, this will make closed form solutions of the summation impossible. For this reason, it is not considered further here.

UNIFORM MESSAGE ROUTING--ASYMMETRIC STRUCTURES

In an asymmetric interconnection structure, the number of nodes reachable in L links from a given node depends on the location of the source node in the network. Primary examples are b-ary trees and snowflakes [4].

Under uniform message routing, each node is visited with equal probability, so the processing element visit ratio is

    VPE = 1/K.                                                                         (11)

To derive link visit ratios for uniform message routing, consider some interval during which each node sends K - 1 messages (each node receives K - 1 messages) and the total number of messages sent is K(K - 1). For each communication link j, calculate the number of messages that cross that link; call this number Msg(j, Nettype). The visit ratio for link j is

    VCLj = Msg(j, Nettype) / (K(K - 1)).                                               (12)

The maximum link visit ratio is

    VCL^max = max_j VCLj                                                               (13)

and the system message completion rate is bounded by

    X0 <= min { 1/(VPE SPE), 1/(VCL^max SCL) }.                                        (14)

INTERCONNECTION STRUCTURES

The techniques described above have been applied to 11 often-cited interconnection structures: seven symmetric ones and four asymmetric ones. An example of each structure is shown in Fig. 2. Space, unfortunately, does not permit detailed derivations of the results for each interconnection; for a complete exposition, see [11]. To provide some insight into the technique's application, the spanning bus hypercube, a symmetric structure, and the snowflake, an asymmetric structure, are analyzed in detail. For the other structures, only a simple description of salient points is provided. The results of the cost and performance analyses are summarized in Tables III-V and will be referred to frequently in the remaining discussion.

SYMMETRIC STRUCTURES

Spanning Bus Hypercubes (SBH)

The spanning bus hypercube [14] is a D-dimensional structure connecting each node to D buses in D orthogonal dimensions; w nodes share a bus in each dimension. This structure is identical to a D-dimensional w-wide toroid, except that the w connections in each dimension are replaced with a single bus.

Wittie [14] gives a simple distributed routing algorithm for spanning bus hypercubes. Consider the routing of a message from an arbitrary node A to some other node B. The node addresses of A and B can be expressed as D base-w coordinates in a w^D lattice. Compare the ith coordinates of A and B. If they differ, route the message along the ith dimension bus to the node whose ith coordinate is equal to that of B. Repeat this process until all D coordinate positions agree. Since each move brings the message closer to its destination in one dimension, the order in which the D coordinates are checked does not matter.

Since each of the K = w^D nodes has D connections, there are Dw^D total connections. Each bus is shared by w nodes, so there are Dw^(D-1) buses. Recalling that the cost of a bus is proportional to the number of connections to it, the cost function is

    Cost(SBH, CPE, CLC, CCL) = w^D (CPE + D(CLC + CCL)).                               (15)

To derive the link visit ratios, consider again the base-w representation of an arbitrary source-destination pair. Any two of the D coordinate positions differ with probability (w - 1)/w. Since each of these D coordinate positions is independent, the average number of buses traversed by a message is

    LV_SBH^uniform = [ D(w - 1)/w ] [ w^D / (w^D - 1) ].                               (16)

The correction factor w^D/(w^D - 1) accounts for the fact that the source and destination must differ. Using (2)-(4), the Vi Si products are then

    VPE SPE = SPE / w^D

and

    VCL SCL = SCL D w^(D-1) (w - 1) / (D w^(D-1) (w^D - 1)) = SCL (w - 1) / (w^D - 1)  (17)

and

    X0 <= min { w^D / SPE, (w^D - 1) / (SCL (w - 1)) }.                                (18)

Because of fanout limitations, D must be fixed at a small constant and the system size increased by increasing w. If D is fixed and w increases, the buses become the performance bottleneck, and performance increases at approximately the rate w^(D-1)/SCL.

To see the effect of locality on performance, consider the number of ways source and destination addresses can differ
in exactly l positions. Since there are w - 1 ways each position can differ and each position is independent, this number is (w - 1)^l. There are C(D, l) ways to select l positions, so there are

    Reach(l, SBH) = C(D, l) (w - 1)^l                                                  (19)

nodes reachable using exactly l buses. The size of the reference locality is

    LocSize(L, SBH) = sum_{l=1}^{L} Reach(l, SBH).                                     (20)

(Recall that L is the maximum distance to any node in the reference locality.) Using (8), the mean number of link visits by a message is

    LV_SBH^local = φ [ sum_{l=1}^{L} l C(D, l)(w - 1)^l ] / LocSize(L, SBH)
        + (1 - φ) [ D(w - 1)w^(D-1) - sum_{l=1}^{L} l C(D, l)(w - 1)^l ] / (w^D - LocSize(L, SBH) - 1)    (21)

where φ is the probability of visiting a node in the locality. The Vi Si products are

    VPE SPE = SPE / w^D   and   VCL SCL = SCL LV_SBH^local / (D w^(D-1))               (22)

and the bound on the system message processing rate is

    X0 <= min { w^D / SPE, D w^(D-1) / (SCL LV_SBH^local) }.                           (23)

As w increases, the bound for the system message completion rate X0 increases at the rate w^(D-1)/(SCL(1 - φ)). If one compares this to the uniform routing case, it becomes clear that this definition of locality does not change the order of the performance bound, only the constant of proportionality. Of course, if φ = 1, the message routing distribution is completely local, and the first term of (21) is the value of LV_SBH^local in (23).

Single Global Bus

The simplest possible interconnection drops all K nodes of a system from a single global bus. One communication link traversal is required to route any message from source to destination. Because of this, no notion of a message routing distribution is relevant. Unfortunately, the single bus rapidly becomes the system bottleneck as the number of nodes increases and bounds system performance by the reciprocal of its mean service time.

Complete Connection

The most expensive and best performing interconnection provides direct bidirectional links between all pairs of the K system nodes. The prohibitive O(K^2) interconnection cost makes this approach unsuitable for large systems, but it provides a useful point of reference. Since one link traversal suffices to reach any destination, no notion of message routing distribution is relevant here either.

Double Ring

Several proposals for cyclic or ring interconnections have been made [7], [9]. Typically, messages can pass in only one direction around the ring. Performance improves if each node is connected to two counterrotating rings. A node sending a message places it on the ring requiring the smallest number of link traversals to reach its destination. After traversing a link, a message queues for service on the next link in the direction of its travel until its destination is reached. Hence, no message ever needs to traverse more than floor(K/2) links in a K node system.

Since messages can travel varying distances along the circumference of a ring, it is possible to define a node's reference locality. In this case, a node's locality is just all nodes lying on an arc of length 2L centered at the node (i.e., the nearest 2L nodes).

D-Dimensional Toroid

The D-dimensional toroid (D-dimensional w-wide lattice) connects each of its w^D nodes to a ring of size w in each of the D orthogonal dimensions. Because of this, no message need traverse more than floor(w/2) links in any dimension.

Message routing in the D-dimensional toroid is very similar to that in spanning bus hypercubes. Instead of a single bus visit in each dimension in which source and destination addresses differ, several moves along the ring in each dimension are required. As with the spanning bus hypercube, the order in which the coordinate differences are resolved does not matter.

Deriving a formula for the size of a node's reference locality requires a closer examination of the nature of the interconnection. For the special case w = 2, Sullivan's CHoPP machine [12], the analysis is similar to that of spanning bus hypercubes. To reduce the analysis' complexity, consider the case of odd w (w > 2). Then, without loss of generality, any node can be assumed to be at the center of the toroid. That is, the node is at the center of a (D-1)-dimensional hyperplane, and floor(w/2) hyperplanes of dimension D - 1 are above it and below it. A message going up or down l links can then traverse at most L - l links in the (D-1)-dimensional hyperplane it has reached. This leads to a fairly simple recurrence relation for the size of the reference locality. The results of its solution for the cases D = 2, 3 are shown in Table IV.

Cube-Connected Cycles (CCC)

The cube-connected cycles (CCC) interconnection was recently proposed by Preparata and Vuillemin [10] as an efficient topology for several types of parallel algorithms. A CCC with D dimensions contains D2^D nodes arranged as cycles of D nodes around each of the 2^D vertices of a binary (w = 2) hypercube of D dimensions (see Fig. 2). The ith node of a cycle is connected to the ith dimension link incident upon the vertex. Each node is connected to exactly three other nodes no matter what the dimensionality of the system. Hence, fixed fanout nodes can be used to expand the system.

Our analysis is based on the simple, nonoptimal, distributed message routing algorithm given by Wittie [14]. The address of any node can be expressed as a cycle position c followed by the binary coordinates of the cycle in D-space:

    c d_{D-1} ... d_0,   0 <= c <= D - 1,   0 <= d_i <= 1.

To route a message toward its destination, traverse cycle links
in the clockwise direction until a d_i in the destination address is found that differs from the current address. Traverse that cross link to another vertex. Repeat this process until the correct position in D-space has been reached. Then find the shortest distance, clockwise or counterclockwise, to the correct cycle position of the destination.

This routing algorithm is far from optimal, and it would seem that performance could be increased significantly by improving it. The average number of cross link traversals cannot be reduced except by altering the message routing distribution, so any improvement must come from reducing the number of cycle link traversals. It can be shown that, asymptotically, the cycle link visit ratios are only 1.25 times those of the cross links, but for all dimensions of practical interest (say, D < 15), the performance increase obtainable from a better routing algorithm could be significant.

Since a cross link traversal moves a message to the node with the same cycle position at another vertex, finding the shortest path from any source to any destination in the cube-connected cycles interconnection is equivalent to solving the following optimization problem.
1) Consider a ring of K nodes.
2) Distinguish a start node and an end node (possibly the same) and k intermediate nodes (0 <= k <= K - 1).
3) Find the shortest path from the start node to the end node that passes through all the intermediate nodes.
While it is also possible to derive formulas for the cube-connected cycles under local message routing, the formulas are quite unwieldy. Details of this derivation can be found in [11].
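The nonoptimal CCC routing just described can be sketched in Python. This is our reading of the description, not Wittie's published code; the clockwise convention (cycle position c advancing to c + 1 mod D) and the address layout are assumptions made for illustration.

```python
# Sketch of nonoptimal routing on cube-connected cycles.  A node
# address is (cycle_position, vertex): the node at cycle position c
# attaches to the dimension-c cross link of its vertex.

def ccc_route_length(D, src, dst):
    """src, dst: (cycle_position, vertex) with 0 <= cycle_position < D
    and vertex a D-bit integer.  Returns the number of link traversals."""
    c, v = src
    dc, dv = dst
    hops = 0
    while v != dv:
        if (v >> c) & 1 != (dv >> c) & 1:
            v ^= 1 << c          # cross link: flip coordinate bit c
        else:
            c = (c + 1) % D      # cycle link: move clockwise
        hops += 1
    # finally, take the shorter arc around the cycle to the
    # destination's cycle position
    diff = (dc - c) % D
    return hops + min(diff, D - diff)
```

Each loop iteration either fixes a differing coordinate bit or advances around the D-node cycle, so the loop terminates after at most D cycle moves per corrected bit; only the final cycle traversal uses the shorter of the two directions.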
of nodes approaches infinity, the upper bound approaches a constant independent of the number of nodes. From another perspective, the message completion rate of the individual nodes decreases linearly as the number of nodes increases. This would seem to indicate the fundamental unsuitability of asymmetric interconnections for very large parallel asynchronous computations unless communication is constrained to have very high locality.
Snowflake Finkel and Solomon [4] describe a class of asymmetric structures they call snowflakes (see Fig. 2). A snowflake of n levels is recursively constructed as follows. 1) A level one snowflake is composed of b nodes connected to a bus. Each of these nodes is called a corner of the snowflake. 2) A level two snowflake connects one corner of b level one snowflakes to a new bus. Another corner of each level one snowflake is designated a corner of the level two snowflake. 3) In general, a level n snowflake connects the corners of b level n  1 snowflakes to a new bus. There are bn nodes, (bn  1 )/(b  1) buses, and 2bn connections if one assumes that all nodes are standard modules with a fixed number of connections. Since there is a unique path from every source to every destination, the message routing algorithm is straightforward and is detailed in [4]. To derive the link visit ratios for uniform message routing, consider the bus at level j:
Chordal Rings
Arden and Lee [ 1 ] proposed a variation of the simple bidirectional ring called a chordal ring. Each node of a ring is augmented with an additional connection to a link joining two ring nodes via a chord. To be precise, number the nodes O * K  1 where K is even, and select an odd chord length c (1 < c < K/2). Then each oddnumbered node i is connected to node (i + c) mod K and each evennumbered node j is connected to node (c) mod K in addition to the normal ring connections. The distributed routing algorithm presented by Arden and Lee finds a minimum length path from any source to any destination using both cycle links and chord links. It does not employ all shortest paths with equal probability, but tries to evenly distribute link traversals between the two types of links. An analysis of this routing algorithm is given in the Appendix. Unlike the simple ring, which has a constant bound on the system throughput for anything other than complete locality, the performance bound for the chordal ring can be increased by increasing the chord length as the number of nodes increases. ASYMMETRIC STRUCTURES All of the asymmetric structures discussed below have constant performance bounds. If one fixes all parameters of the system except the number of nodes, and examines the upper bound on the system message completion rate as the number
Each level j - 1 snowflake contains b^(j-1) nodes, and the rest of the structure contains b^n - (b - 1)b^(j-1) nodes. Of the bus's b connections, b - 1 lead only to level j - 1 snowflakes; the remaining one leads to both the bth level j - 1 snowflake and the rest of the structure. Now consider some interval during which each node sends a message to each of the other b^n - 1 nodes, and look at those messages that must cross the level j bus. The source and destination can be in one of two places.

1) Two different level j - 1 snowflakes. There are 2b^(2(j-1)) such messages for each pair. Since there are C(b-1, 2) = (b-1)(b-2)/2 ways to choose a pair of level j - 1 snowflakes,

    2b^(2(j-1)) (b-1)(b-2)/2 = b^(2(j-1)) (b-1)(b-2)

messages cross the level j bus due to messages between level j - 1 snowflakes.

2) A level j - 1 snowflake and the group of b^n - (b - 1)b^(j-1) remaining nodes. By an argument similar to the one above, there are

    2b^(j-1) (b-1) (b^n - (b-1)b^(j-1))

messages contributed by these combinations. Using (12), the V S product for the level j bus is

    V_CL^j S_CL = S_CL b^(j-1) (b-1) (2b^n - b^j) / (b^n (b^n - 1)).   (24)

This clearly attains its maximum when j = n. By (13) and (14), the system message completion rate is bounded by
    λ0 ≤ min { b^n / S_PE, (b^n - 1) / (S_CL b^(n-1) (b - 1)) }.   (25)
As the number of levels becomes large, the system throughput bound approaches b/(S_CL(b - 1)). By way of comparison, the performance asymptote for a single bus system is 1/S_CL. Notice that b = 2 maximizes the performance bound. In other words, a snowflake with many levels and a small branching factor b is preferable to one with a smaller number of levels and a larger branching factor.
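As a numerical check, the sketch below (ours, not from the paper; the function name is an assumption) evaluates the reconstructed level-j bus demand of (24) for a small snowflake, confirming that the level-n bus is the bottleneck and that the resulting bound approaches b/(S_CL(b - 1)):

```python
from fractions import Fraction

def snowflake_bus_demand(b, n, j, S_CL=1):
    # Reconstructed form of (24): service demand V * S on a level-j bus
    # under uniform routing in a level-n snowflake with b^n nodes.
    K = b**n
    return Fraction(S_CL * b**(j - 1) * (b - 1) * (2 * K - b**j), K * (K - 1))

b, n = 3, 4
demands = {j: snowflake_bus_demand(b, n, j) for j in range(1, n + 1)}
bottleneck = max(demands, key=demands.get)
print(bottleneck)                 # the level-n bus carries the most traffic
print(float(1 / demands[n]))      # link bound; tends to b/(b-1) = 1.5 here
```

The exact bound 1/(V_CL S_CL) at j = n is (b^n - 1)/(b^(n-1)(b - 1)), which for b = 3, n = 4 is already within 2 percent of the asymptote.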
Dense Snowflake

The dense snowflake attempts to alleviate the communication bottleneck of the snowflake by replacing the single bus at each level with b - 1 buses. As with the snowflake, a simple distributed routing algorithm is presented by Finkel and Solomon [4]. As shown in Table III, the additional message paths result in a significant performance improvement over the snowflake. Interestingly, the performance of a dense snowflake is maximized by having a larger branching factor and a smaller number of levels, the opposite of the snowflake.
Finkel-Solomon Star

Instead of connecting the sublevels of a snowflake by their corners, they can be connected by their centers to form a star as follows. 1) A level one substar has b - 1 nodes connected to a single bus. 2) A level two substar introduces an additional bus with b - 1 nodes attached. Each of these nodes is attached to the empty slot on the bus of a different level one substar. 3) In general, a level j substar introduces a new bus with b - 1 nodes. Each of these is connected to a slot on the central bus of a different level j - 1 substar. 4) Finally, a new bus with b nodes is used to connect b level n - 1 substars to form a level n star. Finkel and Solomon [4] also present a distributed message routing algorithm for this structure. As can be seen in Table III, the star has no better asymptotic performance than the snowflake.

Trees

The best known asymmetric interconnection is undoubtedly the n-level b-ary tree. Message routing is simple, since there is a unique path from any source to any destination. Unfortunately, the b communication links below the root rapidly become the performance bottleneck. Like the dense snowflake, trees with a larger branching factor and a smaller number of levels give better performance than trees with a small branching factor and more levels. Of course, this is not asymptotically true, since it would result in a star structure. One must be cognizant of the assumptions inherent in the analysis, namely, that switching delays internal to nodes are ignored.

APPLICATIONS

There is no single "best" system; depending on the intended application, one system may be preferred over another. By specifying a subset of the system parameters (e.g., cost, number of nodes, or performance), one can determine optimal values of the remaining parameters. The following are but a few of the many possibilities. 1) Given a desired performance level, determine the minimum number of nodes and type of interconnection necessary to attain it. 2) Given a system cost, determine the maximum performance attainable using any of the systems we have discussed. 3) Given two different systems with the same number of nodes, determine the ratio of S_PE to S_CL needed to equalize performance.

As an extended example of the power of this technique, consider the spanning bus hypercube discussed earlier. Under uniform routing, we have

    V_PE S_PE = S_PE / w^D,   V_CL S_CL = S_CL (w - 1) / (w^D - 1).   (26)

Recall that

    λ0 ≤ min { 1 / (V_PE S_PE), 1 / (V_CL S_CL) }.

Suppose we equate V_PE S_PE and V_CL S_CL and solve for the ratio of processing element to link service times

    S_PE / S_CL = w^D (w - 1) / (w^D - 1).   (27)

At this critical ratio, the communication links and the processing elements are equally the performance bottlenecks. If the ratio falls below this value, the communication links determine the upper bound on the system performance. Now suppose that the number of nodes is increased by increasing w, the width of the spanning bus hypercube. For the bound on the system message completion rate to increase linearly with the number of nodes, the ratio of processing element to communication link service times must increase linearly with w. In other words, as the number of nodes in the system becomes larger and larger, nodes must exchange messages less frequently if performance is to increase linearly with the number of nodes.

Under locality, we have

    V_PE S_PE = S_PE / w^D   (28)

    V_CL S_CL = S_CL L̄_SBH / (D w^(D-1))   (29)

and, equating the two,

    S_PE / S_CL = w L̄_SBH / D   (30)

where L̄_SBH, the mean number of link visits by a message, was defined by (21). If the size of the locality and the probability of visiting it remain constant, then the nodes must exchange messages with less frequency as the number of nodes becomes larger if the performance bound is to increase linearly with the number of nodes. Conversely, if the node and link service times remain constant, the probability of a message visiting a node in the locality must increase as the number of system nodes increases if the performance bound is to increase linearly.
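The uniform-routing figures for the spanning bus hypercube can be verified by brute force. The sketch below (our illustration; the function name is an assumption, not the paper's) enumerates every ordered source-destination pair of a small hypercube, counting one bus traversal per coordinate in which the two nodes differ:

```python
from itertools import product
from fractions import Fraction

def sbh_link_demand(w, D, S_CL=1):
    # Brute-force per-bus visit ratio for a w-wide, D-dimensional
    # spanning bus hypercube under uniform routing: a message crosses
    # one bus for each coordinate in which source and destination differ.
    nodes = list(product(range(w), repeat=D))
    K = len(nodes)                     # K = w**D nodes
    buses = D * w**(D - 1)             # one bus per line of w nodes
    total = sum(sum(a != b for a, b in zip(s, t))
                for s in nodes for t in nodes if s != t)
    return S_CL * Fraction(total, K * (K - 1) * buses)

w, D = 3, 2
assert sbh_link_demand(w, D) == Fraction(w - 1, w**D - 1)   # matches (26)
ratio = Fraction(w**D * (w - 1), w**D - 1)                  # critical ratio (27)
print(ratio)
```

For w = 3, D = 2 the enumeration reproduces (w - 1)/(w^D - 1) = 1/4 exactly, and the critical ratio 9/4 grows roughly linearly as w is increased, as the text argues.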
TABLE III
PERFORMANCE BOUNDS — UNIFORM MESSAGE ROUTING

System | V_PE S_PE | V_CL S_CL | λ0 asymptote
Single Global Bus | S_PE/K | S_CL | 1/S_CL
Complete Connection | S_PE/K | 2S_CL/(K(K-1)) | K/S_PE
Double Ring | S_PE/K | S_CL K/(8(K-1)) (K even); S_CL(K+1)/(8K) (K odd) | 8/S_CL
Spanning Bus Hypercube | S_PE/w^D | S_CL(w-1)/(w^D-1) | w^(D-1)/S_CL
D-Dimensional Toroid | S_PE/w^D | S_CL w/(4(w^D-1)) (w even); S_CL(w²-1)/(4w(w^D-1)) (w odd) | 4w^(D-1)/S_CL
Cube-Connected Cycles | S_PE/(D·2^D) | S_CL[2^D(5D²-8)+8]/(4D·2^D(D·2^D-1)) (D even); S_CL[2^D(5D²-8D)+8D]/(4D·2^D(D·2^D-1)) (D odd) | 2^(D+2)/(5S_CL)
Chordal Ring | S_PE/K | see Appendix | 2(c+1)/S_CL
Snowflake | S_PE/b^n | S_CL b^(n-1)(b-1)/(b^n-1) | b/(S_CL(b-1))
Dense Snowflake | S_PE/b^n | S_CL b^(n-1)/(b^n-1) | b/S_CL
Finkel-Solomon Star | S_PE(b-2)/(b((b-1)^n-1)) | S_CL(b-1)((b-1)^n-1)/(b((b-1)^n-1)-b+2) | b/(S_CL(b-1))
Tree | S_PE(b-1)/(b^n-1) | 2S_CL b^(n-2)(b-1)/(b^n-1) | b²/(2S_CL(b-1))

Note: λ0 ≤ min{1/(V_PE S_PE), 1/(V_CL S_CL)}. The λ0 asymptote is the limit on performance as the number of nodes becomes very large. For the single global bus, double ring, chordal ring, snowflake, dense snowflake, star, and tree, it is the absolute upper bound on system performance as the number of nodes becomes infinite. For the other systems, it is the dominant term of the performance bound.
This phenomenon is not unique to the spanning bus hypercube. In general, as the number of nodes increases, the ratio of computation time to communication time must increase, or the locality of communication must increase, if the performance bound is to increase linearly with the number of network nodes. The technique we have discussed permits us to quantify these relationships (i.e., determine the amount of locality needed or the minimum computation time-communication time ratio).

COMPARISONS

A look at Table III shows the following. 1) Performance of the D-dimensional toroid is four times that of the spanning bus hypercube with the same number of nodes. The smaller average number of link traversals required by a message in the spanning bus hypercube is more than offset by the additional number of links in the toroid. 2) Neglecting the complete connection, only the spanning bus hypercube, the D-dimensional toroid, and the cube-connected cycle have nonconstant performance bounds if all parameters are fixed and the number of nodes is made very large. 3) The cube-connected cycle has, asymptotically, the best performance of any interconnection. In fact, its performance bound differs from that of the binary hypercube with D dimensions by only the factor D. Unfortunately, lower order terms in the performance bound prevent the cube-connected cycle's performance from exceeding that of the 3-D toroid until the number of nodes exceeds 500 000 (if the processing element and link service times are equal).
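Observation 1) can be reproduced numerically. The sketch below (ours, not from the paper) enumerates both topologies for the same w and D; the ratio of the two link-imposed bounds is 4(w - 1)/w, which approaches 4 as w grows:

```python
from itertools import product

def mean_link_demand(w, D, hops_per_dim, links):
    # Mean per-link visit ratio under uniform routing, by enumerating
    # every ordered source/destination pair of a w^D-node network.
    nodes = list(product(range(w), repeat=D))
    K = len(nodes)
    total = sum(hops_per_dim(a, b)
                for s in nodes for t in nodes if s != t
                for a, b in zip(s, t))
    return total / (K * (K - 1)) / links

w, D = 4, 2
# Hypercube: one bus crossing per differing coordinate, D*w^(D-1) buses.
# Toroid: min(d, w-d) point-to-point hops per dimension, D*w^D links.
hyper = mean_link_demand(w, D, lambda a, b: a != b, D * w**(D - 1))
torus = mean_link_demand(w, D, lambda a, b: min((a - b) % w, (b - a) % w), D * w**D)
ratio = (1 / torus) / (1 / hyper)
print(ratio)    # 4(w-1)/w = 3.0 for w = 4; approaches 4 as w grows
```

With unit service times the toroid's link bound is 15 messages per unit time against the hypercube's 5 for this 16-node example, in line with the table.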
TABLE IV
SELECTED PERFORMANCE BOUNDS — LOCAL MESSAGE ROUTING

For the double ring, with φ the probability that a message is addressed to one of the 2L nodes of the locality,

    V_CL S_CL = S_CL [ φ(L+1)/(4K) + (1-φ)(K² - 4L(L+1)) / (8K(K - 2L - 1)) ]   (K even)

    V_CL S_CL = S_CL [ φ(L+1)/(4K) + (1-φ)(K² - 1 - 4L(L+1)) / (8K(K - 2L - 1)) ]   (K odd).

System | λ0 asymptote
Double Ring | 8/(S_CL(1-φ))
Spanning Bus Hypercube | w^(D-1)/(S_CL(1-φ))
Two-Dimensional Toroid (w odd, L ≤ (w-1)/2) | 4w/(S_CL(1-φ))
Three-Dimensional Toroid (w odd, L ≤ (w-1)/2) | 4w²/(S_CL(1-φ))

Note: The values of V_PE S_PE are the same as for the uniform message routing case.

Fig. 3. System performance asymptotes for symmetric interconnections with uniform message routing and unit service times.
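The double ring entry can be checked by direct enumeration. The following sketch (ours; it assumes the locality model described in the text, in which a message goes to one of the 2L nearest nodes with probability phi and to a uniformly chosen distant node otherwise) computes the link demand seen from one node of a symmetric double ring and the implied throughput bound:

```python
def double_ring_demand(K, L, phi, S_CL=1.0):
    # Enumerate all K-1 destinations from one node of a K-node double
    # ring (K odd): shortest ring distance is min(i, K-i); the locality
    # holds the 2L nearest nodes.  There are 2K directed ring links.
    dists = [min(i, K - i) for i in range(1, K)]
    local = [d for d in dists if d <= L]
    distant = [d for d in dists if d > L]
    mean = phi * sum(local) / len(local) + (1 - phi) * sum(distant) / len(distant)
    return S_CL * mean / (2 * K)

K, L, phi = 1001, 5, 0.9
bound = 1 / double_ring_demand(K, L, phi)
print(bound)    # below, and approaching, 8/((1-phi)*S_CL) = 80 as K grows
```

For K = 1001 the enumerated bound is 71.5, already most of the way to the asymptote 8/(S_CL(1 - φ)) = 80, illustrating that the distant messages dominate the limit.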
4) Of the asymmetric structures, the dense snowflake gives the best performance.

Table IV shows that, asymptotically, our definition of locality changes only the constant of proportionality, not the order, of the system performance bound. As long as there exists any nonzero probability of a message traversing a distance proportional to the size of the structure, this must, in the limit, bound the system performance. Different locality functions will, of course, yield different bounds.

It is possible to compare various interconnection networks under a wide variety of conditions and assumptions. One can examine throughput bounds, cost, and throughput/cost bounds for various node and link service times, component costs, and message routing distributions. As an example, Fig. 3 shows a plot of the system throughput bound versus the number of nodes in the system under the assumptions of uniform message routing and unit service times for both nodes and links (i.e., the quanta of computation and communication are equal) for the symmetric interconnections analyzed. Fig. 4 shows the asymmetric interconnections under the same set of assumptions. Finally, Figs. 5 and 6 show cost figures and performance/cost ratios under the assumption of unit cost for nodes, links, and link connections and the performance assumptions specified earlier.

Fig. 4. System performance asymptotes for asymmetric interconnections with uniform message routing and unit service times.

Fig. 5. Performance/cost for symmetric interconnections with unit component costs.

Fig. 6. Performance/cost bounds for symmetric interconnections with unit component costs, uniform routing, and unit service times.

The choices one makes for device speeds, component costs, message routing distributions, and service times obviously affect the utility of the interconnection networks. If one intends always to run tree-structured computations, then a tree interconnection is obviously optimal. However, it is rarely, if ever, the case that only one kind of message routing distribution will occur. With this fact in mind, the spanning bus hypercube or toroid seems to provide the best compromise for a wide variety of communication patterns.
SUMMARY

We have described a method for determining cost and performance bounds for a distributed message passing system. We introduced the notion of a message routing distribution, and we showed how it could be used to derive performance bounds under more realistic assumptions than uniform message routing. Finally, we applied the technique to several proposed interconnection structures.

Several interesting areas remain to be investigated. The most obvious is the extension of the locality results to asymmetric structures. This is likely to be more difficult, since locality in asymmetric structures invalidates the assumption that all nodes are visited with equal probability. Second, the locality result for symmetric structures can easily be extended to include a nonconstant φ. One extended locality definition might make the probability of sending a message to a node l links away inversely proportional to l. Recently, Zahorjan et al. [15] proposed a technique for obtaining closed form bounds on the throughput rates of systems with finite message populations. With the requisite assumptions, one could use the VS products derived here to obtain such bounds for these networks. Finally, performance and cost are not the only figures of merit for distributed systems. A weighted function of such things as cost, performance, reliability, broadcast delay, and expansion increments should provide a more precise method of selection.
APPENDIX

CHORDAL RING PERFORMANCE BOUNDS
Uniform Routing
Since the chordal ring is symmetric, one can, without loss of generality, assume that a message's source node is node 0 and the destination is some node i (1 ≤ i ≤ K - 1). Arden and Lee [1] give formulas for the number of chord links C(i) and ring links required to reach node i from node 0. Analysis of these formulas shows that, for a fixed chord length c and increasing K, the number of ring link traversals required to reach all possible destination nodes is less than twice the number of chord link traversals needed. Because there are twice as many ring links as chord links, for large enough K the chord links become the bottleneck. From the formulas given by Arden and Lee, it is apparent that
    min( ⌊i/(c+1)⌋, ⌊(K-i)/(c+1)⌋ + 1 ) ≤ C(i) ≤ min( ⌈i/(c+1)⌉, ⌈(K-i)/(c+1)⌉ + 1 ).

Furthermore, the ceiling case occurs much more frequently than the floor case. An upper bound on the mean number of chord traversals is then
    Upper Bound = (2/K) Σ_{i=1}^{K/2} ( ⌈i/(c+1)⌉ + 1 ).

Since there are K/2 chord links, we have

    V_CL S_CL < 2 S_CL (Upper Bound)/K.

Similarly, a lower bound on the mean number of chord link traversals is

    Lower Bound = (1/(K-1)) Σ_{i=1}^{K-1} ⌊ min(i, K-i)/(c+1) ⌋ ≈ K²/(4(c+1)(K-1))

and

    V_CL S_CL > 2 S_CL (Lower Bound)/K.

Both the lower and upper bounds are asymptotically exact and converge to a performance bound of 2(c+1)/S_CL as K becomes large. Using these upper and lower bounds, one can trade accuracy against computational cost on an almost continuous spectrum by calculating the exact visit ratios until the difference between them and the estimated visit ratios falls below some desired error tolerance; thereafter, the approximation may be employed.
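The structural claim behind these bounds — that the chords shorten paths, so the achievable bound improves with c — can be illustrated with a short breadth-first search (our sketch, not from the paper; the node numbering follows the Arden-Lee connection rule given earlier):

```python
from collections import deque

def mean_distance(K, c=None):
    # Mean shortest-path length from node 0 of a K-node ring (K even).
    # If c is given, add the Arden-Lee chords: each odd node i links to
    # (i + c) mod K and each even node j links to (j - c) mod K.
    adj = {i: [(i + 1) % K, (i - 1) % K] for i in range(K)}
    if c is not None:
        for i in range(K):
            adj[i].append((i + c) % K if i % 2 else (i - c) % K)
    dist = {0: 0}
    q = deque([0])
    while q:                          # standard breadth-first search
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return sum(dist.values()) / (K - 1)

K = 100
plain, chordal = mean_distance(K), mean_distance(K, c=7)
print(plain, chordal)   # adding chords strictly shortens the mean path
```

Because every destination is enumerated once, this is also the O(K) computation of exact visit ratios mentioned under Locality below.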
Locality
Unfortunately, we know of no closed form for the link visit ratios under locality. By exhaustively enumerating the K message destinations from node 0, they can be calculated in O(K) time.

Daniel A. Reed (S'81) received the B.S. degree (summa cum laude) in computer science from the University of Missouri, Rolla, in 1978, and the M.S. degree, also in computer science, from Purdue University, West Lafayette, IN, in 1980. Currently he is pursuing the Ph.D. degree in computer science at Purdue University. His research interests include computer system performance evaluation, languages for parallel computing, interconnection topologies for large computer networks, and distributed operating system design.

Mr. Reed is a member of the Association for Computing Machinery and Phi Kappa Phi.
ACKNOWLEDGMENT
We would like to thank the referees for their comments that improved the exposition of this paper.
REFERENCES

[1] B. W. Arden and H. Lee, "Analysis of a chordal ring network," in Proc. Workshop on Interconnection Networks for Parallel and Distributed Processing, H. J. Siegel, Ed., Purdue Univ., W. Lafayette, IN, Apr. 1980.
[2] J. Backus, "Can programming be liberated from the von Neumann style? A functional style and its algebra of programs," Commun. Ass. Comput. Mach., vol. 21, pp. 613-641, Aug. 1978.
[3] P. J. Denning and J. P. Buzen, "The operational analysis of queueing network models," Comput. Surveys, vol. 10, pp. 225-261, Sept. 1978.
[4] R. A. Finkel and M. H. Solomon, "Processor interconnection strategies," IEEE Trans. Comput., vol. C-29, pp. 360-371, May 1980.
[5] L. R. Goke and G. J. Lipovski, "Banyan networks for partitioning multiprocessor systems," in Proc. 1st Annu. Symp. Comput. Architecture, Dec. 1973, pp. 21-28.
[6] C. A. R. Hoare, "Communicating sequential processes," Commun. Ass. Comput. Mach., vol. 21, pp. 666-677, Aug. 1978.
[7] H. J. Jafari, J. Spragins, and T. Lewis, "A new modular loop architecture for distributed computer systems," in Trends and Applications 78: Distributed Processing, 1978, pp. 72-77.
[8] T. Lang and H. S. Stone, "A shuffle-exchange network with simplified control," IEEE Trans. Comput., vol. C-25, pp. 55-66, Jan. 1976.
[9] M. T. Liu, "Distributed loop computer networks," in Advances in Computers, vol. 17. New York: Academic, 1978, pp. 163-221.
[10] F. P. Preparata and J. Vuillemin, "The cube-connected cycles: A versatile network for parallel computation," Commun. Ass. Comput. Mach., vol. 24, pp. 300-309, May 1981.
[11] D. A. Reed, "Performance models of processor interconnection structures," Ph.D. dissertation, Dep. Comput. Sci., Purdue Univ., W. Lafayette, IN, in preparation.
[12] H. Sullivan and T. R. Bashkow, "A large scale, homogeneous, fully distributed parallel machine," in Proc. 4th Annu. Symp. Comput. Architecture, 1977, pp. 105-124.
[13] L. D. Wittie, "Efficient message routing in mega-micro-computer networks," in Proc. 3rd Annu. Symp. Comput. Architecture, Jan. 1976, pp. 136-140.
[14] L. D. Wittie, "Communication structures for large multimicrocomputer systems," IEEE Trans. Comput., vol. C-30, pp. 264-273, Apr. 1981.
[15] J. Zahorjan, K. C. Sevcik, D. L. Eager, and B. Galler, "Balanced job bound analysis of queueing networks," Commun. Ass. Comput. Mach., vol. 25, pp. 134-141, Feb. 1982.
Herbert D. Schwetman received the B.S. degree from Baylor University, Waco, TX, in 1961, the Sc.M. degree from Brown University, Providence, RI, in 1965, both in mathematics, and the Ph.D. degree in computer science from the University of Texas, Austin, in 1970. He has worked at IBM, the University of Texas Computing Center, and Boole and Babbage, Inc. He has been at Purdue University since 1972. Currently, he is Associate Professor in the Department of Computer Sciences. In 1979 he was a Fulbright-Hays Lecturer at the University of Helsinki, Finland.

Dr. Schwetman is a member of the Association for Computing Machinery and the IEEE Computer Society.