9512.net

甜梦文库

甜梦文库

当前位置：首页 >> >> # Performing BMMC permutations efficiently on distributed-memory multiprocessors with MPI

Dartmouth College Computer Science Technical Report PCS-TR97-317 Performing BMMC Permutations E ciently on Distributed-Memory Multiprocessors with MPI

Thomas H. Cormen Dartmouth College Department of Computer Science thc@cs.dartmouth.edu

This paper presents an architecture-independent method for performing BMMC permutations on multiprocessors with distributed memory. All interprocessor communication uses the MPI function MPI_Sendrecv_replace(). The number of elements and number of processors must be powers of 2, with at least one element per processor, and there is no inherent upper bound on the ratio of elements per processor. Our method transmits only data without transmitting any source or target indices, which conserves network bandwidth. When data is transmitted, the source and target processors implicitly agree on each other's identity and the indices of the elements being transmitted. A C-callable implementation of our method is available from Netlib. The implementation allows preprocessing (which incurs a modest cost) to be factored out for multiple runs of the same permutation, even if on di erent data. Data may be laid out in any one of several ways: processor-major, processor-minor, or anything in between.

Abstract

1 Introduction

Suppose you were writing a multiprocessor application using MPI GLS94, SOHL+ 96], and you needed to transpose a fairly large matrix distributed over several processors. You could write a specialized matrix-transpose function. You might parameterize it enough to work on non-square matrices. If you planned the data movement carefully, you could arrange the MPI communication calls so that each processor knew implicitly exactly which entries every other processor was sending it, in order to avoid sending row and column numbers along with the data. Now suppose you were implementing a Fast Fourier Transform (FFT) using MPI, and you needed to perform the bit-reversal permutation used in some FFT methods. You could write a specialized bit-reversal permutation function and, like for matrix transpose, you could plan the data movement carefully in order to avoid sending indices along with the data.

Supported in part by the National Science Foundation under grant CCR-9625894.

1

Suppose further that you had implemented one or both of the above functions under the assumption that data was stored in processor-major order (the most signi cant bits of an element's index denote the number of the processor that it resides on) but that you decided to store the data instead in processor-minor order (processor numbers are in the least signi cant index bits). You would have to rewrite your functions, for the data-movement patterns would change. All the above cases, and many more, are speci c instances of a more general operation that this paper shows how to perform. In particular, matrix transpose (when all dimensions are powers of 2) and the bit-reversal permutation are BMMC (bit-matrix-multiply/complement) permutations. In this paper, we show how to perform BMMC permutations on distributed-memory multiprocessors using only the MPI function MPI_Sendrecv_replace() for communication, subject to certain technical conditions. Our method assumes that there is at least one element per processor, and there is no inherent upper bound on the ratio of elements per processor. One useful property of BMMC permutations makes it easy to adjust the actual permutation performed when the data layout is not processor-major. Our BMMC-permutation method is as fast as one could hope for in an MPI environment. It transmits only data without transmitting any source or target indices, thus conserving network bandwidth. We avoid transmitting indices by ensuring that during communication, source and target processors implicitly agree on each other's identity and the indices of the elements being transmitted. Whenever a processor has data to send to another processor, it sends it in one message, which reduces message overhead. The basic idea of our method is as follows. We decompose the BMMC permutation into two BMMC permutations that, when performed in sequence, give the original one. The rst permutation rearranges data within each processor's memory to get elements destined for the same target processor into contiguous memory locations. The second permutation transmits data among processors and places it into its correct location in the target processor's memory. By default, the method assumes that data is laid out in processor-major order, but we shall see that we can compensate for other orders by performing a modi ed BMMC permutation. A C-callable implementation of our method is available from Netlib Cor97]. The implementation allows the computation of the decomposition (which incurs a modest cost) to be factored out for multiple runs of the same permutation, even if on di erent data. To our knowledge, this paper represents the rst BMMC-permutation algorithm that is independent of the network architecture. Other authors have shown how to perform BMMC permutations (which are also known as a ne transformations) on speci c networks such as hypercubes BR90, EHJ94], meshes Sib92], Omega networks1 KS88], and expanded delta networks WCS96]. Many of the techniques used in the present paper are adapted from earlier work in performing BMMC permutations on parallel disk systems CSW94]. The remainder of this paper is organized as follows. Section 2 de nes the class of BMMC permutations, shows how we represent them, and gives the technical conditions required to use our method. Section 3 previews the matrix forms used in Section 4 to decompose a BMMC permutation as described above. Section 5 shows how to actually perform the two permutations produced by the decomposition method. Section 6 presents how to compensate for non-processor-major data layouts. Section 7 contains some concluding remarks, focusing on the implementation in Netlib. There are three appendices containing background material. Appendix A shows how to compute a

The algorithm in KS88] is for BPC (bit-permute/complement) permutations, which are a large subclass of BMMC permutations.

1

2

column basis for a matrix, which we need to do in Section 4. Appendix B presents the Gray-code technique of calculating source and target indices, which we use in Section 5. Appendix C lists a few examples of BMMC permutations.

2 BMMC permutations

In this section, we de ne the class of BMMC permutations and give some examples of commonly used ones. We then show how to represent them and interpret data indices in a multiprocessor context. A permutation on N elements is de ned by a one-to-one mapping of source indices to target indices. The class of permutations we consider in this paper is BMMC, or bit-matrix-multiply/complement , permutations. BMMC permutations are de ned only when the number N of elements is an integer power of 2, so that n = lg N is an integer. A BMMC permutation is speci ed by an n n characteristic matrix A = (aij ) whose entries are drawn from f0; 1g and is nonsingular (i.e., invertible) over GF(2).2 The speci cation also includes a complement vector c = (cn?1 ; cn?2; : : : ; c0 ) of length n. Treating each source index x = (xn?1 ; xn?2 ; : : : ; x0 ) as an n-bit vector, we perform matrix-vector multiplication over GF(2) and then form the corresponding n-bit target index y = (yn?1; yn?2; : : : ; y0) by complementing some subset of the resulting bits: y = A x c, or

2 6 6 6 6 6 6 4

De nition of BMMC permutations

y0 y1 y2

. . .

3

2

Note that we place the least signi cant bit at the top and left. Because we require the characteristic matrix A to be nonsingular, the mapping of source addresses to target addresses is one-to-one. The following lemma gives two useful properties of BMMC permutations. Lemma 1 The class of BMMC permutations is closed under composition and inverse. Proof: We rst show that BMMC permutations are closed under composition. Consider a BMMC permutation with characteristic matrix A and complement vector c, and another BMMC permutation with characteristic matrix A0 and complement vector c0 . We will show that their composition is a BMMC permutation with some characteristic matrix A00 and complement vector c00. The composition of the two BMMC permutations rst maps a source index x to x0 = A x c and then maps x0 to y = A0x0 c0 = A0(A x c) c0 = A0 A x A0 c c0 = (A0 A)x (A0 c c0) ;

2 Matrix multiplication over GF(2) is like standard matrix multiplication over the reals but with all arithmetic performed modulo 2. Equivalently, multiplication is replaced by logical-and, and addition is replaced by exclusive-or.

yn?1

7 6 7 6 7 6 7=6 7 6 7 6 5 4

a00 a10 a20

. . .

a01 a11 a21

. . .

a02 a12 a22

. . .

an?1;0 an?1;1 an?1;2

...

a0;n?1 a1;n?1 a2;n?1

. . .

32 76 76 76 76 76 76 54

x0 x1 x2

. . .

3 7 7 7 7 7 7 5

2 6 6 6 6 6 6 4

c0 c1 c2

. . .

3 7 7 7 7 7 7 5

:

an?1;n?1

xn?1

cn?1

3

and so A00 = A0 A and c00 = A0 c c0. Now we show that BMMC permutations are closed under inverse. If y = A x c, then we add c to both sides and multiply by A?1 , giving

x = A?1(y c) = A?1 y (A?1 c) ;

and so the inverse permutation has characteristic matrix A?1 and complement vector A?1 c.

Representation

Under the reasonable assumption that the word size of the underlying machine is at least n bits, we represent the characteristic matrix as an array of n columns, where each column is packed into a word with the top bit of the column as the least signi cant bit of the word. We denote the jth column of matrix A by Aj . Source and target indices are stored in the obvious way, and the complement vector is stored in the same way. With this representation, it takes only n +1 = lg N +1 words to represent a BMMC permutation. Moreover, we can compute y = A x c in O(n) word operations: 1 y c 2 for j 0 to n ? 1 3 do if xj = 1 4 then y y Aj

Data layouts

In the remainder of this paper, we let P denote the number of processors, and we assume that P is a power of 2. Let p = lg P . We refer to the number of elements per processor N=P as the virtual processor ratio, or VPR. Let us examine how to interpret the n bits of an index. A particular set of p bits determines which processor the element resides on, and the remaining n ? p = lg(N=P ) bits determine the o set of that element within its processor. Although it is possible to have a layout in which the p processor bits are not consecutive within an index, we do not consider such layouts in this paper. We let the parameter f denote the position of the least signi cant processor bit, so that 0 f n ? p. Figure 1 shows how layouts di er depending on the value of f . In processor-major layout, the most signi cant bits contain the processor number, so that f = n ? p. In processor-minor layout, the least signi cant bits contain the processor number, so that f = 0. In general, we can view a layout as consisting of \bands" striped across the processors. Each band consists of 2f P elements, and there are N=(2f P ) bands altogether. Until Section 6, we assume that the data layout is processor-major. In Section 6, we shall see how to express the conversion between layouts with f < n ? p and f = n ? p (processor-major) as a BMMC permutation. We will adjust the permutation actually performed according to this conversion permutation.

4

o o o o o o o o

set 0 set 1 set 2 set 3 set 4 set 5 set 6 set 7

processor-major f =3 P0 P1 P2 P3 0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27 4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31

f =2

P0 0 1 2 3 16 17 18 19 P1 4 5 6 7 20 21 22 23 P2 8 9 10 11 24 25 26 27 P3 12 13 14 15 28 29 30 31 P0 0 1 8 9 16 17 24 25

f =1

P1 2 3 10 11 18 19 26 27 P2 4 5 12 13 20 21 28 29 P3 6 7 14 15 22 23 30 31

processor-minor f =0 P0 P1 P2 P3 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

layout shows which processor and o set each element index maps to. In this example, N = 32, P = 4, and the ith processor is Pi . Each layout can be viewed as consisting of bands of 2f P elements.

Figure 1: How data layouts di er depending on where the p processor bits start (the parameter f ). Each

3 Matrix forms

When factoring a characteristic matrix in Section 4, we will want certain submatrices to have particular forms. This section reviews a technique used in CSW94] to produce such desired forms.

Column-addition matrices

A column-addition matrix is a matrix M such that the product A0 = A M is a modi ed form of A in which speci ed columns of A have been added (elementwise) into others. We de ne the matrix M = (mij ) by 8 > 1 if i = j ; < mij = > 1 if column Ai is added into column Aj ; : 0 otherwise : For example,

2

CSW94]. In the remainder of this section, we present three matrix forms that we will use for factoring in Section 4. The matrices that we will use are of block forms. In particular, we partition a matrix according to the leftmost n ? p columns (the left section ), the rightmost p columns (the right section ), the top n ? p rows (the top section ), and the bottom p rows (the bottom section ). Certain 5

A0 Column-addition matrices must also obey a dependency restriction that if column i is added into column j , then column j must not be added into any column. In other words, if mij = 1, then mjk = 0 for all k 6= j. With the dependency restriction, any column-addition matrix is nonsingular A M

1 60 6 6 41 0

0 1 1 1

1 1 0 0

1 07 7 5 07 1

3 2

1 60 6 6 40 0

1 1 0 1

1 0 1 0

0 07 = 7 5 07 1

3

2 6 4

3

A0 A0 A1 A3 A0 A2 A3

7 5

1 60 = 61 6 4 0

2

0 1 0 0

0 1 1 0

1 07 : 7 5 07 1

3

submatrices will be either all 0 or an identity submatrix. We label such submatrices accordingly and place an asterisk (*) in all other submatrices. In Section 4, we shall transform a nonsingular matrix into one that has a nonsingular trailing p p submatrix. With the resulting matrix, each processor will be able to determine implicitly which processors it communicates with. To perform this transformation, we add columns from the left section into columns in the right section. A trailer matrix T is a column-addition matrix that does so. It has the following form: 0 I p Observe that because the bottom section of a trailer matrix is the bottom p rows of an identity matrix and because we assume processor-major layout, if a trailer matrix is taken as a characteristic matrix, the BMMC permutation it induces leaves every element in the processor it started in. That is, only o sets within processors change, rather than processor numbers. We call such a permutation an intraprocessor permutation.

Trailer matrix form

T =

" n?p

I

p

#

n?p

:

Reducer matrix form

We will be putting matrices into \reduced form," which we will precisely de ne in Section 4. Reduced form will make it so that we can order communication into rounds. We will put a matrix into reduced form by adding columns from the left section into other columns from the left section. Therefore, a reducer matrix R is a column-addition matrix with the following form: 0 I p Since R obeys the dependency restriction, it is nonsingular, even though we do not know the exact form of the leading (n ? p) (n ? p) submatrix. Like a trailer matrix, the BMMC permutation induced by a reducer matrix is intraprocessor.

R=

" n?p

0

p

#

n?p

:

Swapper matrix form

The nal matrix form we consider swaps pairs of columns from among the left section. This operation will help gather elements destined for the same target processor into consecutive memory locations. A swapper matrix has the form 0 I p where the leading (n ? p) (n ? p) submatrix is a permutation matrix, i.e., it contains exactly one 1 in each row and in each column. (A swapper matrix is not a column-addition matrix.) Once again, the BMMC permutation induced is intraprocessor. 6 =

" n?p

0

p

#

n?p

;

In fact, we will compose the BMMC permutations induced by matrices of the three forms above. The composition will have a characteristic matrix of the form

Composition

TR =

" n?p

p

#

0

I

n?p p

:

Observe that both this composition and its inverse are intraprocessor permutations.

4 Factoring a BMMC permutation

With the matrix forms in hand, we are now ready to factor a BMMC permutation. We will factor the characteristic matrix A into two factors:

A=V W :

The factor W will be the inverse of the composition described in Section 3, and so the BMMC permutation it characterizes is an intraprocessor permutation. This permutation gathers elements destined for the same target processor into contiguous memory locations so that they can be sent in one message. The BMMC permutation induced by the factor V , on the other hand, is an interprocessor permutation, i.e., elements may move among processors. Given these factors, Section 5 will show how to rst perform the permutation with the mapping

x0 = W x

followed by the permutation with the mapping

(1) (2)

y = V x0 c = A x c :

In this section, we rst present the transformations that produce the factoring. Then we examine desirable properties of the resulting factoring. We start by transforming the characteristic matrix A into a nonsingular matrix A0 that has a nonsingular trailing p p submatrix. One useful property is given by the following well-known theorem from linear algebra. The column (row) rank of a matrix is the size of a maximal set of linearly independent columns (rows), with linear independence here being over GF(2).

Creating a nonsingular trailing submatrix

Theorem 2 For any matrix, the row rank equals the column rank.

Thus, we can refer to either the column rank or the row rank as simply the rank, and we denote the rank of a matrix A by rank A. We call a maximal set of linearly independent columns (rows) a column (row) basis. 7

If we represent the matrix A as

A=

" n?p

p

#

n?p p

;

then we will add columns in into those in . Consider as a set of p columns and as a set of n ? p columns. Because A is nonsingular, all rows of the bottom section of A, consisting of and , are linearly independent. Hence, the bottom section has row rank p, and by Theorem 2, there exists a column basis of p columns within the bottom section. One such basis contains a set Y of rank columns of and a set Z of p ? rank columns of . (Appendix A shows how to nd this column basis.) Let Y be the set of p ? rank columns of not in Y . Having de ned these column sets, we create a nonsingular trailing submatrix by pairing up columns of Z with columns of Y and adding each column in Z into its counterpart in Y . Because jY j = rank , Y is a column basis for , and so the columns of Y depend only on columns of Y and not on columns of Z . Adding a column of Z into a column of Y must produce a column that is linearly independent of those in Y . Because each column of Y has a di erent column of Z added in, the resulting columns are linearly independent of each other, too. Let us express this operation as a column-addition matrix. Although we focused on adding columns of to columns of , we are in fact adding entire columns, and so we are also adding some columns of into columns of . Because we are adding columns of the left section into columns of the right section, we use a trailer matrix T of the form described in Section 3, giving us the matrix product

A0 =

AT =

" n?p

p

b b

#

n?p p

;

where b is nonsingular. Since the matrices A and T are nonsingular, the matrix A0 is nonsingular. The next step is to transform the submatrix into reduced form. For our purposes, a matrix is in reduced form when all non-basis columns are 0. To perform this transformation, we start by nding a column basis S for . For each column j not in S , we nd a set of columns of that add up to j . (Appendix A shows how to compute these column dependencies.) Because we are working over GF(2), adding these columns into j zeroes it out. Again, we are actually working with entire columns. We add basis columns from the left section into non-basis columns in the left section, and so as a column-addition matrix, the operation respects the dependency restriction. The column-addition matrix is a reducer matrix R, and we now have the product

Transforming the matrix into reduced form

A00 = A0 R =

" n?p b b

p

b b

#

n?p p

;

where b is nonsingular and b is in reduced form. Because the basis columns of do not change, rank b = rank . Because A0 and R are nonsingular, so is A00 . 8

Gathering basis columns

Our nal transformation is to gather the columns of the basis S into consecutive positions on the right end of b . That is, we permute columns in the left section so that the leftmost n ? p ? rank columns of b are 0 and the rightmost rank columns of b form the column basis S . The matrix that gathers the basis columns exchanges pairs of columns in the leftmost section, and so it is a swapper matrix. The resulting product will be the factor V , and it has the form

V = A00 =

" n ? p ? rank b0

0

rank b 00 b00

p

b b

#

n?p p

;

(3)

where b 00 consists of the basis columns of b (and hence of as well). Both and V are nonsingular. Since we want the factorization A = V W , we expand equation (3) to produce

Factorization

V = A00 = A0 R = AT R :

If we let W = (T R )?1, we have

A = V (T R )?1 = VW;

giving the desired factorization to use in performing the permutations given by equations (1) and (2). Why did we go to so much trouble to factor A in the above manner? The answer lies in the properties of the factors V and W . As we noted in Section 3, the factor W characterizes an intraprocessor permutation. We shall see in Section 5 that intraprocessor BMMC permutations are easy and fast. The key to the factorization lies in the properties of the interprocessor permutation characterized by the factor V . More speci cally, they lie in how elements map between processors. Consider a given processor k, where 0 k P ? 1. How many di erent processors do the elements that start on k map to when performing the permutation characterized by the original matrix A? The answer is given by the following lemma, whose proof appears in CSW94].3

Properties of the factors

Lemma 3 Let be the lower left p (n ? p) submatrix of A, and consider any processor k. There

3 The lemma in CSW94] is couched in terms of blocks on a parallel disk system, but it translates easily to the context of the present paper.

are exactly 2rank target processors that the elements of k map to, and for each such target processor, exactly N=(2rank P ) elements from processor k map to it.

9

Since W characterizes an intraprocessor permutation, any movement among processors must occur during the permutation characterized by V . In other words, the 2rank processors that processor k sends elements to when performing the permutation characterized by A are the same as the target processors for the permutation characterized by V . To determine exactly which 2rank processors k sends to, we write the matrix equation given by the bottom section of equation (3) in block form, where the notation \: :" indicates a range of bit indices. We also include the complement vector:

yn?p::n?1 = b00 x(n?p?rank

)::(n

?p?1)

b xn?p::n?1

cn?p::n?1 :

(4)

Here, xn?p::n?1 is the binary representation of k. As we vary the bits of x(n?p?rank )::(n?p?1) (we call these the basis bits ) among all 2rank combinations, yn?p::n?1 takes on all 2rank target processor numbers. Including the reducer matrix as a factor made it so that we only need to vary 2rank basis bits, and including the swapper matrix made it so that the basis bits are in predetermined locations. Another e ect of the swapper matrix is that all elements going from processor k to a given target processor reside in consecutive locations in processor k's memory. Consider a target processor l that processor k sends elements to. By Lemma 3, processor k sends exactly N=(2rank P ) elements to processor l. Observe that in the permutation characterized by V , if elements with source indices x and x0 both move from processor k to processor l, then x(n?p?rank )::(n?1) = x0(n?p?rank )::(n?1) , i.e., they must have the same bits in positions n ? p ? rank to n ? 1 of their source indices. Because these are the most signi cant bit positions, these elements reside in N=(2rank P ) consecutive locations in processor k's memory prior to being sent. Finally we come to the e ect of the trailer matrix, which makes it so that we can easily determine which processor is sending to processor k . That is, we can determine which processor is the source processor when processor k is the target. Let us rewrite equation (4) to compute the source processor xn?p::n?1 corresponding to a given target processor yn?p::n?1 and the basis bits x(n?p?rank )::(n?p?1):

xn?p::n?1 = ( b)?1 (yn?p::n?1

b 00 x(n?p?rank

)::(n

?p?1)

cn?p::n?1 ) :

(5)

Substituting k for yn?p::n?1 , equation (5) tells us which processor is sending to processor k for a given set of basis bits.

5 Performing the factor permutations

In this section, we see how to perform the BMMC permutations characterized by equations (1) and (2). Neither of these permutations is performed in-place. That is, both copy the data from one bu er into another. When we are done, however, each processor's original data is overwritten with the permuted data. We rst see how to perform the intraprocessor permutation given by equation (1), and then how to perform the interprocessor permutation given by equation (2). Recall that elements are stored in processor-major order, so that the most signi cant p bits of an index give an element's processor number and the least signi cant n ? p bits give the o set within the processor.

10

To perform the intraprocessor permutation given by equation (1), we assume that each processor starts with N=P elements stored in an N=P -element array data . The N=P elements are copied, in a permuted order, into the N=P -element array temp . We could perform the intraprocessor permutation in a straightforward manner: 1 for each processor k, in parallel 2 do xn?p::n?1 k 3 for x0::n?p?1 0 to N=P ? 1 4 do x0 W x 5 temp x00::n?p?1 ] data x0::n?p?1] Note that we are careful to index o sets using only the least signi cant n ? p bits. There is a more e cient way to perform this permutation, however. Observe that computing x0 in line 4 takes O(n) time using the matrix-vector multiplication method in Section 2. Over all N=P elements in each processor, we use O((N=P ) lg N ) word operations. We can reduce this count to O((N=P ) lg(N=P )) by realizing that the most signi cant p bits of x are xed at k: 1 for each processor k, in parallel 2 do xn?p::n?1 k 3 let W 0 be the left section (leftmost n ? p columns) of W 4 let W 00 be the right section (rightmost p columns) of W 5 z W 00 xn?p::n?1 6 for x0::n?p?1 0 to N=P ? 1 7 do x0 W 0 x0::n?p?1 z 8 temp x00::n?p?1 ] data x0::n?p?1] Now each matrix-vector multiplication in line 7 uses only O(n ? p) word operations. We can do even better: (N=P ) word operations over all N=P elements. The idea is to choose elements to move not in linear order but in Gray-code order. After all, we can choose any order we want for moving the elements, as long as we move them all. In a Gray code, each index di ers from the index before it in only one bit position. We use this property to compute each target index x0 in O(1) word operations on average. Appendix B presents the details of the Gray-code method.

Performing the intraprocessor permutation

Performing the interprocessor permutation

To perform the interprocessor permutation given by equation (2), we assume that each processor starts with its N=P elements placed into the array temp by the intraprocessor permutation. When the permutation concludes, the elements are placed into the appropriate positions of the array data on the correct processors. The computation proceeds in 2rank rounds, where each round uses a di erent value b for the rank basis bits. Data moves in each round according to some permutation of the processors; that is, in each round, each processor sends to a unique target processor and therefore receives from a unique source processor. Using the form of the matrix V given in equation (3), the pseudocode is as follows: 11

1 for each processor k, in parallel 2 do for b 0 to 2rank ? 1 3 do target proc b00 b b k cn?p::n?1 4 source proc (b)?1 (k b 00 b cn?p::n?1) 5 simultanenously send the N=(2rank P ) elements starting at temp b N=(2rank P )] to processor target proc and receive N=(2rank P ) elements from processor source proc , overwriting the bu er starting at temp b N=(2rank P )] with the received elements 6 for j 0 to N=(2rank P ) ? 1 7 do t b0 j b00 b b source proc c0::n?p?1 8 data t] temp j ] This code works as follows. The value of b used in line 2 determines which round the processor is on; because all processors choose values of b in the same order, they all agree on the round. Line 3 computes which processor will receive data from processor k in the current round, according to equation (4). Similarly, line 4 computes which processor will be sending data to processor k in the current round, according to equation (5). As we have seen, for a given value of the basis bits, each processor sends data to a unique target processor, so that the simultaneous send/receive of line 5 is well de ned. Moreover, once a processor sends data, it no longer needs it, so that the data it receives can overwrite the send bu er. Our implementation calls the MPI_Sendrecv_replace() function, which is a perfect t for this style of communication. The for-loop of lines 6{8 iterates through the N=(2rank P ) non-basis bits within the processor, copying the elements into their nal locations. The computation of the target o set t in line 7 is based on equation (3). It depends on the value j of the non-basis bits, the value b of the basis bits, the value source proc of the processor that just sent to processor k, and the least signi cant n ? p bits of the complement vector. Just as our implementation of the intraprocessor permutation copies data in Gray-code order, so does our implementation of the for-loop of lines 6{8. Each processor uses a total of (N=P ) word operations in copying received data from temp to data . Although one could process the basis bits in the outer for-loop in Gray-code order, our implementation does not. We expect rank usually to be much smaller than lg(N=P ) so that the savings from computing target proc and source proc in lines 3 and 4 with the Gray-code technique would be negligible. Our implementation does factor out common subexpressions where possible. We point out once more that the interprocessor communication in line 5 transmits only data. Because the receiving processor implicitly knows all the bits of each element's source index, it has enough information to compute the corresponding target index.

6 Adjusting for non-processor-major data layouts

The method for performing BMMC permutations given in Sections 4 and 5 assumes that data is laid out in processor-major order. In this section, we see how to adjust for other data layouts. We adjust by altering the characteristic matrix and complement vector of the given permutation. The methods for factoring and performing the permutations do not change at all once the characteristic matrix and complement vector have been altered. The only assumption we need to make is that the p bits indicating which processor an element resides on are the consecutive bits in positions 12

The key idea is that we can convert between the data layout actually used (which we will call the actual layout ) and processor-major layout via a BMMC permutation. To convert from processor-major layout to the actual layout, we use the characteristic matrix

2

f; f + 1; : : : ; f + p ? 1, where 0 f n ? p.

Q=

6 4

I

f

0 0

0 0

p

n?p?f

0

3

I

I

0

7 5 n?p?f

p

f

and a complement vector of 0. The inverse permutation, which converts from the actual layout to processor-major, uses the characteristic matrix

2

Q?1 =

6 4

I

f

n?p?f

0 0

I

0 0

I

0 0

p

3

7 5 p

f n?p?f

and again a complement vector of 0. Note that Q = Q?1 = I if f = n ? p, i.e., if the actual layout is processor-major. Given the characteristic matrices Q and Q?1 and the original characteristic matrix A and complement vector c, we can express the BMMC permutation as the composition of three BMMC permutations. Because our method assumes processor-major layout, we rst convert the actual layout to processor-major by performing the permutation x0 = Q?1 x. We next perform the original BMMC permutation, but on the processor-major layout: x00 = A x0 c = A (Q?1 x) c. Finally, we convert back to the actual layout by performing y = Q x00 = Q (A (Q?1 x) c) = (Q A Q?1 ) x (Q c). Applying Lemma 1, instead of performing these three permutations consecutively, we perform their composition. That is, we perform the BMMC permutation with characteristic matrix Q A Q?1 and complement vector Q c.

7 Conclusion

This paper has detailed the algorithm behind the software package libbmmc_mpi.a Cor97], which is available from Netlib. Although the factoring techniques in Section 4 are borrowed from the outof-core BMMC permutation algorithm CSW94], the application of these techniques in Section 5 is new. There are several ways to invoke the C-callable implementation in libbmmc_mpi.a. The basic call is of int BMMC_MPI(bit_matrix A, matrix_column c, int n, int p, int f, int rank, MPI_Comm comm, int size, void *data, void *temp), where a header le provides the types bit_matrix and matrix_column, the parameter rank is the calling processor's number (between 0 and P ? 1), comm is an MPI communicator, and size is the size in bytes of each element to permute. The return value is an error code. When performing the same BMMC permutation multiple times, the factoring procedure of Section 4 can be \factored out" with the results placed into an opaque type BMMC_MPI_factor_info, which is de ned in the header le. The factoring is performed 13

by calling

int factor_BMMC_MPI(bit_matrix A, matrix_column c, int n, int p, int f, BMMC_MPI_factor_info *info). Given the factoring, one may pass the opaque type into calls to int perform_BMMC_MPI(BMMC_MPI_factor_info *info, int n, int p, int rank, MPI_Comm comm, int size, void *data, void *temp). Again, the return values are error codes. factor_BMMC_MPI() may be used for rank, comm, size, data, and temp.

Because the factoring does not depend on the data or even the element sizes, a single call of several calls to perform_BMMC_MPI() with di ering values of

There are also modi ed versions of BMMC_MPI() and factor_BMMC_MPI() that are specialized for processor-major and processor-minor data layouts. These functions are simply wrappers that call BMMC_MPI() and factor_BMMC_MPI() with the parameter f set to n-p for processor-major and to 0 for processor-minor. This paper does not include experimental results. MPI is an interface standard. There are many conforming implementations. Performance depends on both the MPI implementation and the underlying hardware and software. The purpose of this paper has been to present the technique in the context of a portable implementation on top of MPI. Although our implementation performs all communication via the MPI_Sendrecv_replace() function, a slight modi cation allows the use of either MPI_Sendrecv() or MPI_Alltoallv(). Limited experiments using two networks of workstations and the MPICH4 implementation of MPI have shown no signi cant di erence among implementations using MPI_Sendrecv_replace(), MPI_Sendrecv(), or MPI_Alltoallv(). On a di erent platform with a di erent implementation of MPI, it may be the case that the all-to-all communication of MPI_Alltoallv() proves superior.

Acknowledgments

Thanks to Len Wisniewski for many helpful discussions on the factoring method. James Clippinger and Anna Poplawski provided valuable comments on an earlier draft of this paper, and James also checked over the software. Steve Heller pointed out the Gray-code technique for target-index calculation. Lars Paul Huse gave useful feedback on the software in Netlib.

References

BR90] CB95] Rajendra Boppana and C. S. Raghavendra. Optimal self routing of linear-complement permutations in hypercubes. In Proceedings of the 5th Distributed Memory Conference, pages 800{808, April 1990. Thomas H. Cormen and Kristin Bruhl. Don't be too clever: Routing BMMC permutations on the MasPar MP-2. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 288{297, July 1995. Revised version to appear in Theory of Computing Systems. Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts, 1990.

CLR90]

4

Available at http://www.mcs.anl.gov/mpi/mpich/.

14

Thomas H. Cormen. libbmmc_mpi.a: a library to perform fast BMMC permutations for any multiprocessor system that supports MPI. Available from Netlib at http:// www.netlib.org/mpi/contrib/bmmc.tar.gz, 1997. CSW94] Thomas H. Cormen, Thomas Sundquist, and Leonard F. Wisniewski. Asymptotically tight bounds for performing BMMC permutations on parallel disk systems. Technical Report PCS-TR94-223, Dartmouth College Department of Computer Science, July 1994. Preliminary version appeared in Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures. Revised version to appear in SIAM Journal on Computing. EHJ94] Alan Edelman, Steve Heller, and S. Lennart Johnsson. Index transformation algorithms in a linear algebra framework. IEEE Transactions on Parallel and Distributed Systems, 5(12):1302{1309, December 1994. GLS94] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press, 1994. KS88] John Keohane and Richard E. Stearns. Routing linear permutations though the Omega network in two passes. In Proceedings of the 2nd Symposium on the Frontiers of Massively Parallel Computation, pages 479{482, October 1988. Sib92] Jop Frederik Sibeyn. Algorithms for Routing on Meshes. PhD thesis, Utrecht University, December 1992. SOHL+ 96] Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MPI: The Complete Reference. The MIT Press, 1996. WCS96] Leonard F. Wisniewski, Thomas H. Cormen, and Thomas Sundquist. Performing BMMC permutations in two passes through the expanded delta network and MasPar MP-2. In Proceedings of the Sixth Symposium on The Frontiers of Massively Parallel Computation, pages 282{289, October 1996.

Cor97]

A Finding basis columns and dependencies

Although one can look up in a linear algebra text how to nd a column basis for a matrix and determine the column dependencies, we include a method here for completeness. The input is an n-column matrix A = (aij ), stored as an array of columns so that each column Aj is packed into a word. There are two outputs. The set S f0; 1; : : : ; n ? 1g, where jS j n, indexes the columns of A in the column basis. The matrix D = (dij ) gives the column dependencies: dij = 0 if i 6= j and column Aj depends on column Ai . The method given here destroys the matrix A in the process; if we need to keep the matrix A intact, we assume that we are working with a copy of it. Here is pseudocode to produce S and D given A:

15

1 S ; 2 for j 0 to n ? 1 3 do Dj 0 4 for j n ? 1 downto 0 5 do if Aj 6= 0 6 then S S fjg 7 let i be the minimum row number such that aij = 1 8 for k 0 to j ? 1 9 do if aik = 1 10 then Ak Ak Aj 11 Dk Dk D j 12 djk djk 13 Dj 0 This code works as follows. Lines 1{3 initialize the basis S to be empty and the dependency matrix D to 0. The outermost for-loop of lines 4{13 goes in decreasing order so that we add columns to the basis from right to left. This order helps when we are trying to nd the column basis needed in Section 4 to make the trailing p p submatrix be nonsingular. All non-basis columns are eventually zeroed out. If we nd a column Aj in line 5 that has not been zeroed out, we add j to the basis in line 6. We nd the topmost 1 in column Aj in line 7, letting it be in row i. Then we check all columns Ak to the left of Aj in lines 8{12. Any such column with a 1 in row i depends on Aj , and so we add Aj into it in line 10. Because column Ak depends on any columns that Aj depends on, we add Aj 's dependencies into Ak 's dependencies in line 11. We know that Aj does not depend on itself, however, and so we complement djk in line 12 to record Ak 's dependence on Aj . Once we are done adding in Aj 's dependencies, we zero them out in line 13. With each column packed into a word, this code uses O(n2) word operations.

B Calculating indices in Gray-code order

This appendix shows how to calculate source and target indices in Gray-code order. As noted in Section 5, the advantage of doing so is that each index calculation takes only O(1) word operations on average. This technique was previously reported in CB95]. To simplify the discussion, let us consider the general case in which we are performing the permutation given by y = A x c. Let A have lg N columns, and assume that we wish to generate source indices varying from 0 to N ? 1 in Gray-code order along with the corresponding target index for each source index. The idea will translate to any speci c permutation in a straightforward manner. We start with source index x = 0, with the corresponding target index y = c. Now suppose that we have already computed y = A x c, and we wish to compute y 0 = A x0 c, where x and x0 di er only in the ith bit. Then x0i = xi 1 and x0j = xj for all j 6= i. We have

y 0 = x00 A0 = x 0 A0 = x 0 A0

x0iAi x0n?1 An?1 c (xi 1)Ai xn?1An?1 c x i A i Ai xn?1 An?1 c

16

= A x c Ai = y Ai : Once we know the bit position i in which x and x0 di er, we can compute y 0 = y Ai in only (1) word operations. How do we determine the bit position i? For the standard binary re ected Gray code, the j th value and the (j + 1)st value di er in the position of the rightmost 1 in the binary representation of j + 1. This value is easy to nd by simply examining bits from right to left. We can show that on average, we examine fewer than 2 bits to nd the rightmost 1. Of the N integers from 1 to N , exactly N=2i have the rightmost 1 in the ith bit examined. The total number of bits examined, therefore, is at most

lg N

? X1

i=0

1 X N i Ni < i 2i 2 i=0 = 2N

using the identity 1 iai = a=(1 ? a)2 CLR90] with a = 1=2. Thus, the total number of word i=0 operations over all N elements is (N ). In the application of this technique in Section 5, there are N=P index pairs to calculate, and so it takes (N=P ) word operations in total. In practice, one can optimize the process of nding the bit position i to reduce the constant factors. In our implementation, for example, we maintain a static array ips with entries indexed 1 to 15, where ips j ] holds the position of the rightmost 1 in the binary representation of j . We nd the position i of the rightmost 1 in j + 1 with the following pseudocode: 1 2 3 4 5 6 7 8

nibble

P

q j +1 i 0

while nibble = 0 do q q=16

i

nibble

least signi cant 4 bits of q

i

i+4

(shift q right by 4 bits) least signi cant 4 bits of q

i + ips nibble ]

Here we examine 4 bits (a nibble) at a time rather than just one bit. Only one time in 16 does the while-loop body execute.

C Examples of BMMC permutations

Although not all permutations on N elements are BMMC, several commonly performed permutations are. We list some of them in this appendix. It is easy to see that only a small fraction of all permutations are BMMC. There are N ! permutations on N elements. Because a characteristic matrix has lg2 N entries and a complement vector has lg N entries, there are (2lg2 N )(2lg N ) = N lg N +1 N ! possible combinations of characteristic matrix and complement vector. Moreover, not all of the possible characteristic matrices are nonsingular. 17

Even though relatively few permutations are BMMC, they occur frequently. Here are some common ones: Matrix transpose when all dimensions are powers of 2. If a matrix is q r, where q and r are powers of 2, the transpose permutation maps the (i; j ) entry to the (j; i) entry. Assuming that the matrix is stored in row-major order, the transpose permutation interchanges the most signi cant lg q bits (initially containing the row number) and the least signi cant lg r bits (initially containing the column number). The characteristic matrix has the form

" lg q

I

0

lg r #

I

0

lg r lg q

;

where I denotes identity submatrices. The complement vector is 0. Note that there are two matrices here. The data to be transposed forms a q r matrix, where qr = N , and the type of each entry is unspeci ed. The characteristic matrix is lg N lg N , and each entry is 0 or 1. Shu e and unshu e permutations. These are matrix transpose permutations where the matrix to transpose is N=2 2 or 2 N=2. Bit-reversal permutations. Here, yi = xlg N ?i?1 for i = 0; 1; : : : ; lg N ? 1. The corresponding characteristic matrix has 1s on the antidiagonal and 0s elsewhere, and the complement vector is 0. Vector-reversal permutations. Here, the ith input element maps to the (N ? i ? 1)st output element. This mapping corresponds to simply complementing all bits of the index: yi = xi for i = 0; 1; : : : ; lg N ? 1. Here, the characteristic matrix is the identity matrix and the complement vector is all 1s. Gray-code permutations. In the standard binary re ected Gray code, we have yi = xi xi+1 for i = 0; 1; : : : ; lg N ? 1, letting xlg N = 0 in the boundary case. The characteristic matrix for N = 64 is 2 1 1 0 0 0 03 6 0 1 1 0 0 0 7 7 6 6 0 0 1 1 0 0 7 7 6 (6) 7 6 6 0 0 0 1 1 0 7 7 6 4 0 0 0 0 1 1 5 0 0 0 0 0 1 and the complement vector is 0. By Lemma 1, all compositions and inverses of the above BMMC permutations are also BMMC.

18

赞助商链接

- Parallel K- Means Algorithm on Distributed Memory Multiprocessors Abstract
- Load Balancing for Extrapolation Methods on Distributed Memory Multiprocessors
- A data-clustering algorithm on distributed memory multiprocessors
- MPI Performance Comparison on Distributed and Shared Memory Machines
- Scheduling User-Level Threads on Distributed Shared Memory Multiprocessors
- Implementing Distributed Shared Memory on top of MPI the DSMPI Library
- Conjugate-gradients algorithms An MPI-OpenMP implementation on distributed shared memory sy
- Parallel K- Means Algorithm on Distributed Memory Multiprocessors Abstract
- Analysis of Multithreaded Multiprocessors with Distributed Shared Memory
- Compiler techniques for effective communication on distributed-memory multiprocessors
- Exact solution of linear equations on distributed-memory multiprocessors
- Shaman A distributed simulator for shared memory multiprocessors
- Performance Characterization of Shared- and Distributed-Memory Multiprocessors on a Tree Se
- A data-clustering algorithm on distributed memory multiprocessors
- PROGRAMME Irregular Loop Patterns Compilation on Distributed Shared Memory Multiprocessors

更多相关文章：
更多相关标签：