Array Processor with Multiple Broadcasting

In this paper, we consider a generalized broadcasting feature for Mesh Connected Computers. In the 2-dimensional case there are N = N^{1/2} × N^{1/2} processors, with a broadcasting feature in each row and each column. This multiple broadcast allows parallel data transfers within rows and columns of processors. The proposed architecture is well suited to the solution of many problems in Linear Algebra, Image Processing, Computational Geometry and Numerical Computations. We develop parallel algorithms for many problems in these areas: for example, we can find the maximum of N values in O(N^{1/6}), the median in O(N^{1/6} (log N)^{2/3}), the extreme points of a convex polygon in O(N^{1/6}), and nearest neighbors in O(N^{1/6}), while these problems need Ω(N^{1/3}) on a √N × √N 2-MCC with single broadcast. We also derive bounds on the speedups obtainable with multiple broadcast.

I INTRODUCTION

Several multi-processor architectures have been proposed for Parallel Processing [FLYN 72, ENSL 74] (for a bibliography see [SATY 80]). Of these, the Mesh Connected Computers (MCC) have been widely used [KAUT 68, HAMA 71, KUNG 77]; their regular structure and near-neighbor connections are particularly suitable for VLSI implementation. They seem to be a natural structure for solving many problems in Matrix computations and Image processing.

In parallel and distributed computations the solution times to problems are constrained by information flow rather than by processing times within PE's [GENT 78]. Moreover, even if the problem is not constrained by a large flow of information, the solution time can be constrained by the time required to move a single piece of data over a long distance. For example, in a 2-dimensional MCC with N PE's in which the PE's are placed at the grid points in a plane, moving a data item from one PE to another may take time proportional to √N in the worst case.

* This research was supported by the NSF grant No. ECS-8307077 and DARPA/ARO Contract No. DAAG29-84K-0066.
Given that a Mesh Connected Computer is a natural and realistic parallel architecture for the efficient solution of many problems, but that solution times are constrained by long data movements, an obvious extension is to augment the network with a faster mechanism for moving data over long distances. Such a technique, called broadcasting, has been considered in [GENT 78, BOKH 81, STOU 82, STOU 83]. In broadcasting, a single PE can send data which is received by all the PE's simultaneously. Even though it is unrealistic to assume that broadcasting takes constant time independent of the size of the network, we may still be able to realize such a network in a practical situation [STOU 83]. Several problems have been considered in [BOKH 81, STOU 83], with substantial improvements in computation time compared to an MCC without broadcasting. For example, we can find the Max or Min of N numbers on a 2-MCC with broadcast in O(N^{1/3}) time, while Ω(N^{1/2}) time is required without broadcasting (we use O for "order no greater than" and Ω for "order at least"). Parallel algorithms for many problems, including finding the closest pair and other geometric problems, have been designed in [STOU 83].

Broadcasting cannot solve all data transfer problems. For example, sorting is essentially a problem constrained by information flow: Ω(√N) time is required, with or without broadcasting, on a 2-MCC. Also, broadcasting introduces some sequentiality into parallel algorithms: since broadcasting is done over a shared global bus, only one item can be communicated over the bus at a time. Thus, trying to cover many "long" distances using broadcasts will increase the solution time. In fact, solutions to many problems on the MCC with a single global broadcast strike a balance between local and global communication [BOKH 84, STOU 83]. To overcome this problem and use broadcasting effectively, we propose to augment the MCC with a multiple broadcasting feature.
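The balance between local and global communication on a single-bus MCC can be illustrated with a small cost model. The Python sketch below is only an illustration under stated assumptions, not the algorithm of [BOKH 84] or [STOU 83] as published: it assumes the √N × √N array is cut into blocks of side s, that each block finds its own maximum with nearest-neighbor moves charged at 2(s - 1) steps, and that the block maxima are then sent one at a time over the single global bus. The function names and the step accounting are hypothetical.

```python
def max_with_single_broadcast(grid, s):
    """Return (maximum, charged_steps) for an n x n grid using blocks of side s.

    Assumed cost model (illustrative only):
      - local phase: 2 * (s - 1) nearest-neighbor steps per block, all blocks in parallel
      - global phase: one step on the shared bus per block maximum
    """
    n = len(grid)
    local_steps = 2 * (s - 1)

    # Local phase: every block reduces to a single value.
    block_maxima = [
        max(grid[r][c]
            for r in range(br, min(br + s, n))
            for c in range(bc, min(bc + s, n)))
        for br in range(0, n, s)
        for bc in range(0, n, s)
    ]

    # Global phase: the bus carries one item per step, so the block
    # maxima are combined sequentially.
    best = block_maxima[0]
    for value in block_maxima[1:]:
        best = max(best, value)
    broadcast_steps = len(block_maxima)

    return best, local_steps + broadcast_steps


def best_block_side(grid):
    """Sweep all block sides and return the one with the fewest charged steps."""
    n = len(grid)
    return min(range(1, n + 1),
               key=lambda s: max_with_single_broadcast(grid, s)[1])
```

Under this model the charged cost is roughly 2s + N/s^2, which is smallest when s is close to N^{1/3}. Very small blocks overload the bus, and very large blocks spend too long on local moves, which is the sequentiality that multiple broadcast buses are meant to relieve.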
In a 2-MCC with multiple broadcasting, each PE can communicate locally with one of its 4 neighbors or broadcast along its row or column. With row and column broadcasting, data in any single PE can clearly be broadcast to all PE's in 2 steps; thus all the previous algorithms employing a single global broadcast bus can be adapted to the proposed architecture with at most a factor of 2 loss in running time. However, we should expect to do much better, since the broadcast buses can be used to cover many "long" distances simultaneously. In fact, all our algorithms exploit multiple broadcasting to significantly improve the solution times.

0149-7111/85/0000/0002$01.00 © 1985 IEEE

In this paper we consider problems in Numerical linear algebra, semigroup computations, Geometric problems, and Image Processing, and derive parallel algorithms on our architecture. All our algorithms are substantially faster than those running on an MCC with a single broadcast feature and are optimal on this architecture. Finally, we use information flow arguments to derive lower bounds on the solution times to problems on our architecture. This approach also settles an open problem posed in [STOU 83].

The rest of this paper is organized as follows. In the next section we briefly discuss multiple broadcasting in MCC. In section 3 we consider several problems in semigroup computations, Image Processing, and Geometric problems and derive fast parallel algorithms on our model. In section 4, we discuss bounds on speedups obtainable with broadcast.

II MCC AND BROADCASTING

For the sake of simplicity we discuss only the 2-dimensional MCC with multiple broadcasting capability and present all our algorithms to run on this architecture. These ideas can be extended to higher dimensions. Figure 1 shows the proposed 2-dimensional array architecture with broadcasting capability in each row and each column.
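The two-step broadcast noted earlier (one row broadcast followed by simultaneous column broadcasts) can be sketched in Python as follows. This is a minimal simulation under assumptions, not code from the paper: the helper names `row_broadcast`, `column_broadcast` and `broadcast_to_all` are hypothetical, and a dictionary of senders is used to model the rule that at most one PE drives each row or column bus in a step.

```python
def row_broadcast(values, senders):
    """One row-broadcast step.

    `senders` maps a row index to the column of the single PE driving
    that row's bus; every PE in the row receives that PE's value.
    """
    n = len(values)
    received = [[None] * n for _ in range(n)]
    for row, col in senders.items():
        for j in range(n):
            received[row][j] = values[row][col]
    return received


def column_broadcast(values, senders):
    """One column-broadcast step.

    `senders` maps a column index to the row of the single PE driving
    that column's bus; every PE in the column receives that PE's value.
    """
    n = len(values)
    received = [[None] * n for _ in range(n)]
    for col, row in senders.items():
        for i in range(n):
            received[i][col] = values[row][col]
    return received


def broadcast_to_all(n, src_row, src_col, item):
    """Spread `item` from PE (src_row, src_col) to all n*n PE's in 2 steps."""
    values = [[None] * n for _ in range(n)]
    values[src_row][src_col] = item
    # Step 1: the source PE drives its own row bus.
    after_row = row_broadcast(values, {src_row: src_col})
    # Step 2: each PE in the source row drives its own column bus,
    # so all n column buses are in use at the same time.
    return column_broadcast(after_row, {col: src_row for col in range(n)})
```

For example, `broadcast_to_all(4, 1, 2, "x")` leaves every entry of the 4 × 4 array equal to `"x"`. The second step uses all column buses at once, which is the kind of parallel long-distance transfer a single global bus cannot provide.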
The proposed architecture is an SIMD machine consisting of N processing elements arranged in a square array with 4 nearest-neighbor connections for local data transfers. Each PE has a constant number of registers for local storage and is capable of executing simple arithmetic and logic operations. There are two types of data transfer instructions that can be executed by the PE's: route data to one of the four nearest neighbors; and broadcast data to a row of PE's or a column of PE's. At any time only one type of data routing instruction can be executed by the PE's. Further, if a broadcast instruction is executed, only one PE per row (or column) can send a value to all PE's in its row (or column). Note that in the case of a 2-MCC with a single broadcast bus [STOU 83, BOKH 84] there is a global bus over which a PE can send a data item which is received by all the PE's simultaneously. Other than the multiple broadcast feature, the control organization is the same as that of the well-known MCC's in the literature [STOU 83, KUNG 77]. Notice that sequencing of instructions is controlled by a central clock.

Several comments on the proposed architecture are in order. Introducing row and column broadcasting does increase the complexity of the communication hardware at each PE, but it is not substantially higher than in the case when a single broadcast bus is present. The area of a VLSI layout of such an architecture would increase, again by a small constant factor. However, the drivers needed for broadcasting need not be as powerful, since broadcasting reaches only √N PE's as opposed to N PE's.

Using the standard information flow argument [STOU 83],

Proposition I: Suppose N data items are distributed such that each PE has 1 data item. Then, any nontrivial semigroup computation (finding Max, Min or Sum)

Figure 1: 2-MCC with Row and Column Broadcasting

[1] Quentin F. Stout et al. Mesh-Connected Computers with Broadcasting. IEEE Transactions on Computers, 1983.

[2] C. Thomborson. Area-Time Complexity for VLSI. STOC, 1979.