Collective operation

Revision as of 17:40, 9 March 2024 by imported>Afrozenator (add cost for broadcast, non-pipelined)

Collective operations are building blocks for interaction patterns that are often used in SPMD algorithms in parallel programming. Hence, there is an interest in efficient realizations of these operations.

A realization of the collective operations is provided by the Message Passing Interface[1] (MPI).

Definitions

In all asymptotic runtime functions, we denote the latency (or startup time per message, independent of message size) by α, the communication cost per word by β, the number of processing units by p, and the input size per node by n. In cases where we have initial messages on more than one node, we assume that all local messages are of the same size. To address individual processing units we use $p_i \in \{p_0, p_1, \dots, p_{p-1}\}$.

If we do not have an equal distribution, i.e. node $p_i$ has a message of size $n_i$, we get an upper bound for the runtime by setting $n = \max(n_0, n_1, \dots, n_{p-1})$.

A distributed memory model is assumed. The concepts are similar for the shared memory model. However, shared memory systems can provide hardware support for some operations, for example broadcast (see Broadcast), which allows convenient concurrent reads.[2] Thus, new algorithmic possibilities can become available.

Broadcast

There are three squares vertically aligned on the left and three squares vertically aligned on the right. A dotted line connects the high left and high right square. Two solid lines connect the high left square and the middle and low right square. The letter a is written in the high left square and in all right squares.
Information flow of Broadcast operation performed on three nodes.

The broadcast pattern[3] is used to distribute data from one processing unit to all processing units, which is often needed in SPMD parallel programs to dispense input or global values. Broadcast can be interpreted as an inverse version of the reduce pattern (see Reduce). Initially only the root r with id 0 stores message m. During broadcast, m is sent to the remaining processing units, so that eventually m is available to all processing units.

Since an implementation by means of a sequential for-loop with $p-1$ iterations becomes a bottleneck, divide-and-conquer approaches are common. One possibility is to utilize a binomial tree structure with the requirement that p has to be a power of two. When a processing unit is responsible for sending m to processing units $i..j$, it sends m to processing unit $\lceil (i+j)/2 \rceil$ and delegates responsibility for the processing units $\lceil (i+j)/2 \rceil..j$ to it, while its own responsibility is cut down to $i..\lceil (i+j)/2 \rceil - 1$.
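The range-halving rule above can be sketched as a small sequential simulation. This is only an illustration, not an MPI implementation: the function name `binomial_broadcast_schedule` and the representation of a round as a list of (sender, receiver) pairs are assumptions made for this example. The midpoint is rounded up, so the sender always keeps a non-empty range that contains itself.

```python
def binomial_broadcast_schedule(p):
    """Simulate the range-halving broadcast from root 0 on p units.

    Returns a list of rounds; each round is a list of (sender, receiver)
    pairs that can communicate in parallel.
    """
    rounds = []
    # Each entry (i, j) means: unit i holds m and is responsible for i..j.
    ranges = [(0, p - 1)]
    while any(i < j for i, j in ranges):
        sends, next_ranges = [], []
        for i, j in ranges:
            if i == j:                      # nothing left to delegate
                next_ranges.append((i, j))
                continue
            mid = (i + j + 1) // 2          # ceil((i + j) / 2)
            sends.append((i, mid))          # i sends m to mid
            next_ranges.append((i, mid - 1))
            next_ranges.append((mid, j))    # mid takes over mid..j
        rounds.append(sends)
        ranges = next_ranges
    return rounds


if __name__ == "__main__":
    for number, sends in enumerate(binomial_broadcast_schedule(8), start=1):
        print(f"round {number}: {sends}")
```

For p = 8 the simulation needs three rounds, and the number of units holding m doubles in every round.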

Binomial trees have a problem with long messages m. A unit that receives m can only propagate the message to other units after it has received the whole message. In the meantime, the communication network is not utilized. Therefore pipelining on binary trees is used, where m is split into an array of k packets of size $\lceil n/k \rceil$. The packets are then broadcast one after another, so that data is distributed quickly in the communication network.

Pipelined broadcast on a balanced binary tree is possible in $\mathcal{O}(\alpha \log p + \beta n)$, whereas the non-pipelined case costs $\mathcal{O}((\alpha + \beta n) \log p)$.
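To get a feeling for the difference between the two bounds, the following sketch evaluates a deliberately simplified cost model. It assumes that a tree of depth d = ⌈log p⌉ needs d + k − 1 steps to push k packets through, each step costing α + β⌈n/k⌉; constant factors of a real pipelined binary tree are ignored. The function names and the heuristic for choosing k are assumptions of this example, not part of the cited analysis.

```python
import math


def binomial_cost(p, n, alpha, beta):
    """Non-pipelined broadcast: log p rounds, each sending the whole message."""
    depth = math.ceil(math.log2(p))
    return depth * (alpha + beta * n)


def pipelined_cost(p, n, alpha, beta, k):
    """Simplified pipeline model (assumption): depth + k - 1 packet steps."""
    depth = math.ceil(math.log2(p))
    packet = math.ceil(n / k)
    return (depth + k - 1) * (alpha + beta * packet)


def good_packet_count(p, n, alpha, beta):
    """Heuristic k that balances the startup term against the transfer term."""
    depth = math.ceil(math.log2(p))
    k = round(math.sqrt(beta * n * max(depth - 1, 1) / alpha))
    return min(max(k, 1), n)


if __name__ == "__main__":
    p, n, alpha, beta = 1024, 10**6, 100.0, 1.0
    k = good_packet_count(p, n, alpha, beta)
    print("packets:", k)
    print("binomial :", binomial_cost(p, n, alpha, beta))
    print("pipelined:", pipelined_cost(p, n, alpha, beta, k))
```

With k = 1 the model degenerates to the non-pipelined cost. For long messages and a suitable k, the βn term is no longer multiplied by log p.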

Reduce

There are three squares vertically aligned on the left and three squares vertically aligned on the right. A circle with the letter f inside is placed between the two columns. Three solid lines connect the circle with the left three squares. One solid line connects the circle and the high right square. The letters a, b and c are written in the left squares from high to low. The letter alpha is written in the top right square.
Information flow of Reduce operation performed on three nodes. f is the associative operator and α is the result of the reduction.

The reduce pattern[4] is used to collect data or partial results from different processing units and to combine them into a global result by a chosen operator. Given p processing units, message $m_i$ is on processing unit $p_i$ initially. All $m_i$ are aggregated by an operator $\otimes$ and the result is eventually stored on $p_0$. The reduction operator $\otimes$ must be at least associative. Some algorithms require a commutative operator with a neutral element. Operators like sum, min and max are common.

Implementation considerations are similar to broadcast (see Broadcast). For pipelining on binary trees, the message must be representable as a vector of smaller objects for component-wise reduction.

Pipelined reduce on a balanced binary tree is possible in $\mathcal{O}(\alpha \log p + \beta n)$.
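A non-pipelined tree reduction can be sketched as a sequential simulation. The function name `tree_reduce` and the list-based representation of the processing units are assumptions of this example. Operands are always combined in index order, so the sketch only relies on associativity, not on commutativity.

```python
def tree_reduce(values, op):
    """Simulate a binomial-tree reduction onto unit 0.

    values[i] is the message of unit i (at least one unit is expected).
    Returns (result on unit 0, number of communication rounds).
    """
    vals = list(values)
    p = len(vals)
    rounds = 0
    stride = 1
    while stride < p:
        # Unit i + stride sends its partial result to unit i.
        for i in range(0, p - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
        rounds += 1
    return vals[0], rounds


if __name__ == "__main__":
    # String concatenation is associative but not commutative.
    print(tree_reduce(list("abcde"), lambda x, y: x + y))
    print(tree_reduce(range(8), lambda x, y: x + y))
```

The number of rounds is ⌈log p⌉, which matches the α log p term. The βn term of the pipelined variant is not modeled here.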

All-Reduce

There are three squares vertically aligned on the left and three squares vertically aligned on the right. A circle with the letter f inside is placed between the two columns. Three solid lines connect the circle with the left three squares. One solid line connects the circle and the high right square. The letters a, b and c are written in the left squares from high to low. The letter alpha is written in the top right square.
Information flow of All-Reduce operation performed on three nodes. f is the associative operator and α is the result of the reduction.

The all-reduce pattern[5] (also called allreduce) is used if the result of a reduce operation (see Reduce) must be distributed to all processing units. Given p processing units, message $m_i$ is on processing unit $p_i$ initially. All $m_i$ are aggregated by an operator $\otimes$ and the result is eventually stored on all $p_i$. Analogous to the reduce operation, the operator $\otimes$ must be at least associative.

All-reduce can be interpreted as a reduce operation with a subsequent broadcast (see Broadcast). For long messages, an implementation along these lines is suitable, whereas for short messages the latency can be reduced by using a hypercube topology, if p is a power of two. All-reduce can also be implemented with a butterfly algorithm and achieve optimal latency and bandwidth.[6]

All-reduce is possible in $\mathcal{O}(\alpha \log p + \beta n)$, since reduce and broadcast are possible in $\mathcal{O}(\alpha \log p + \beta n)$ with pipelining on balanced binary trees. All-reduce implemented with a butterfly algorithm achieves the same asymptotic runtime.
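The hypercube exchange for short messages can be sketched as follows. This is a sequential simulation under the assumption that p is a power of two. The name `hypercube_allreduce` is made up for this example. Each round exchanges the full operand, so the sketch costs log p rounds of α + βn each and is not the bandwidth-optimal butterfly variant.

```python
def hypercube_allreduce(values, op):
    """Simulate all-reduce on a hypercube; len(values) must be a power of two."""
    p = len(values)
    if p & (p - 1):
        raise ValueError("p must be a power of two")
    vals = list(values)
    step = 1
    while step < p:
        nxt = []
        for i in range(p):
            partner = i ^ step              # neighbour along this dimension
            lo, hi = min(i, partner), max(i, partner)
            # Combine in index order so non-commutative operators work too.
            nxt.append(op(vals[lo], vals[hi]))
        vals = nxt
        step *= 2
    return vals


if __name__ == "__main__":
    print(hypercube_allreduce([3, 1, 4, 1], max))
    print(hypercube_allreduce(list("abcd"), lambda x, y: x + y))
```

After log p rounds every unit holds the same reduced value.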

Prefix-Sum/Scan

There are three squares vertically aligned on the left and three rectangles vertically aligned on the right. A circle with the word scan inside is placed between the two columns. Three solid lines connect the circle with the left three squares. Three solid lines connect the circle with the three right rectangles. The letters a, b and c are written in the left squares from high to low. In the high right rectangle the letter a is written. In the mid right rectangle the term a plus b is written. In the low right rectangle the term a plus b plus c is written.
Information flow of Prefix-Sum/Scan operation performed on three nodes. The operator + can be any associative operator.

The prefix-sum or scan operation[7] is used to collect data or partial results from different processing units and to compute intermediate results by an operator, which are stored on those processing units. It can be seen as a generalization of the reduce operation (see Reduce). Given p processing units, message $m_i$ is on processing unit $p_i$. The operator $\otimes$ must be at least associative, whereas some algorithms also require a commutative operator and a neutral element. Common operators are sum, min and max. Eventually processing unit $p_i$ stores the prefix sum $\bigotimes_{i' \le i} m_{i'}$. In the case of the so-called exclusive prefix sum, processing unit $p_i$ stores the prefix sum $\bigotimes_{i' < i} m_{i'}$. Some algorithms require the overall sum to be stored at each processing unit in addition to the prefix sums.

For short messages, this can be achieved with a hypercube topology if p is a power of two. For long messages, the hypercube topology is not suitable, since all processing units are active in every step and therefore pipelining cannot be used. A binary tree topology is better suited for arbitrary p and long messages.
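A hypercube prefix sum for short messages can be sketched as a sequential simulation, assuming p is a power of two. The function name `hypercube_scan` is an assumption of this example. Each unit keeps two values: its inclusive prefix sum and the sum over the subcube it has seen so far, so that the overall sum is also available on every unit at the end.

```python
def hypercube_scan(values, op):
    """Simulate an inclusive prefix sum on a hypercube.

    Returns (prefix, total): prefix[i] is the inclusive prefix sum of unit i,
    total[i] the overall sum as known by unit i.
    """
    p = len(values)
    if p & (p - 1):
        raise ValueError("p must be a power of two")
    prefix = list(values)
    total = list(values)
    step = 1
    while step < p:
        # Every unit receives the subcube sum of its partner i XOR step.
        received = [total[i ^ step] for i in range(p)]
        for i in range(p):
            if i & step:
                # Partner has the smaller index: its data precedes ours.
                prefix[i] = op(received[i], prefix[i])
                total[i] = op(received[i], total[i])
            else:
                total[i] = op(total[i], received[i])
        step *= 2
    return prefix, total


if __name__ == "__main__":
    print(hypercube_scan([1, 2, 3, 4, 5, 6, 7, 8], lambda x, y: x + y))
```

All units are busy in each of the log p rounds, which is why this scheme does not lend itself to pipelining.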

Prefix-sum on a binary tree can be implemented with an upward and a downward phase. In the upward phase, reduction is performed, while the downward phase is similar to broadcast, where the prefix sums are computed by sending different data to the left and right children. With this approach pipelining is possible, because the operations are equal to reduction (see Reduce) and broadcast (see Broadcast).

Pipelined prefix sum on a binary tree is possible in $\mathcal{O}(\alpha \log p + \beta n)$.
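The two phases on a binary tree can be sketched without pipelining as follows. The sketch assumes an in-order layout, in which the unit in the middle of an index range is the root of the subtree for that range. The name `tree_scan` and the use of `None` for "nothing to the left" are choices made for this example (messages themselves must therefore not be `None`).

```python
def tree_scan(values, op):
    """Simulate an inclusive prefix sum with an upward and a downward phase."""
    p = len(values)
    subtree = {}                      # (lo, hi) -> reduction of values[lo:hi]

    def combine(a, b):
        # None stands for "no operand", i.e. an empty subtree or no offset.
        if a is None:
            return b
        if b is None:
            return a
        return op(a, b)

    def up(lo, hi):
        """Upward phase: reduce each subtree towards its root."""
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        acc = combine(combine(up(lo, mid), values[mid]), up(mid + 1, hi))
        subtree[(lo, hi)] = acc
        return acc

    prefix = [None] * p

    def down(lo, hi, offset):
        """Downward phase: offset is the sum of everything left of lo."""
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        down(lo, mid, offset)                       # left child gets offset
        left = subtree.get((lo, mid))
        prefix[mid] = combine(combine(offset, left), values[mid])
        down(mid + 1, hi, prefix[mid])              # right child gets more

    up(0, p)
    down(0, p, None)
    return prefix


if __name__ == "__main__":
    print(tree_scan(list("abcde"), lambda x, y: x + y))
```

The left child receives the offset unchanged, whereas the right child receives the offset combined with the left subtree sum and the root's own message. This is the "different data" sent to the two children.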

Barrier

The barrier[8] as a collective operation is a generalization of the concept of a barrier that can be used in distributed computing. When a processing unit calls barrier, it waits until all other processing units have called barrier as well. Barrier is thus used to achieve global synchronization in distributed computing.

One way to implement barrier is to call all-reduce (see All-Reduce) with an empty or dummy operand. The runtime of all-reduce is $\mathcal{O}(\alpha \log p + \beta n)$. Using a dummy operand reduces the size n to a constant and leads to a runtime of $\mathcal{O}(\alpha \log p)$.

Gather

There are three squares vertically aligned on the left and three rectangles vertically aligned on the right. A dotted line connects the high left square with the high right rectangle. Two solid lines connect the mid and low left squares with the high right rectangle. The letters a, b and c are written in the left squares from high to low. The letters a, b and c are written in the high right rectangle in a row.
Information flow of Gather operation performed on three nodes.

The gather communication pattern[9] is used to store data from all processing units on a single processing unit. Given p processing units, message $m_i$ is initially stored on processing unit $p_i$. For a fixed processing unit $p_j$, we want to store the message $m_1 \cdot m_2 \cdot \ldots \cdot m_p$ on $p_j$. Gather can be thought of as a reduce operation (see Reduce) that uses the concatenation operator. This works because concatenation is associative. By using the same binomial tree reduction algorithm we get a runtime of $\mathcal{O}(\alpha \log p + \beta p n)$. The asymptotic runtime is similar to that of reduce, $\mathcal{O}(\alpha \log p + \beta n)$, but with an additional factor p in the term βn. This additional factor is due to the message size increasing in each step as messages get concatenated. Compare this to reduce, where the message size is constant for operators like min.
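The growth of the message size can be made visible with a small simulation of gather on a binomial tree. The function name `binomial_gather`, the choice of unit 0 as the target and the measurement of message size as `len()` of each message are assumptions of this example.

```python
def binomial_gather(messages):
    """Simulate gather onto unit 0 with concatenation as the operator.

    Returns (gathered list on unit 0, largest transfer size per round).
    """
    p = len(messages)
    bufs = [[m] for m in messages]    # bufs[i]: messages currently held by i
    words_per_round = []
    stride = 1
    while stride < p:
        largest = 0
        for i in range(0, p - stride, 2 * stride):
            incoming = bufs[i + stride]
            largest = max(largest, sum(len(m) for m in incoming))
            bufs[i] = bufs[i] + incoming          # concatenate in index order
            bufs[i + stride] = []
        words_per_round.append(largest)
        stride *= 2
    return bufs[0], words_per_round


if __name__ == "__main__":
    msgs = ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh"]   # n = 2 each
    gathered, sizes = binomial_gather(msgs)
    print(gathered)
    print(sizes, "total:", sum(sizes))
```

For p = 8 and n = 2 the per-round sizes are n, 2n and 4n, which sum to (p − 1)n. This is the source of the βpn term.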

All-Gather

There are three squares vertically aligned on the left and three rectangles vertically aligned on the right. Three dotted lines connect the high left square with the high right rectangle, the mid left square with the mid right rectangle and the low left square with the low right rectangle. Two solid lines connect the mid and low left squares with the high right rectangle. Two solid lines connect the high and low left squares with the mid right rectangle. Two solid lines connect the high and mid left squares with the low right rectangle. The letters a, b and c are written in the left squares from high to low. The letters a, b and c are written in all right rectangles in a row.
Information flow of All-Gather operation performed on three nodes.

The all-gather communication pattern[9] is used to collect data from all processing units and to store the collected data on all processing units. Given p processing units $p_i$ with message $m_i$ initially stored on $p_i$, we want to store the message $m_1 \cdot m_2 \cdot \ldots \cdot m_p$ on each $p_j$.

It can be thought of in multiple ways. The first is as an all-reduce operation (see All-Reduce) with concatenation as the operator, in the same way that gather can be represented by reduce. The second is as a gather operation followed by a broadcast of the new message of size pn. With this we see that all-gather is possible in $\mathcal{O}(\alpha \log p + \beta p n)$.
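The second view can be written out as a sum of costs. This is a sketch under the assumption that the gathered message of size pn is distributed with the pipelined broadcast. The names $T_{\text{gather}}$ and $T_{\text{broadcast}}$ are introduced here only for illustration.

```latex
\begin{align*}
T_{\text{all-gather}}(p, n)
  &= T_{\text{gather}}(p, n) + T_{\text{broadcast}}(p, pn) \\
  &= \mathcal{O}(\alpha \log p + \beta p n) + \mathcal{O}(\alpha \log p + \beta p n) \\
  &= \mathcal{O}(\alpha \log p + \beta p n)
\end{align*}
```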

Scatter

There are three rectangles vertically aligned on the left and three squares vertically aligned on the right. A dotted line connects the high left rectangle with the high right square. Two solid lines connect the high left rectangle with the mid and low right squares. The letters c, b and a are written in the high left rectangle in a row. The letters a, b and c are written in the right squares from high to low.
Information flow of Scatter operation performed on three nodes.

The scatter communication pattern[10] is used to distribute data from one processing unit to all the processing units. It differs from broadcast in that it does not send the same message to all processing units. Instead, it splits the message and delivers one part of it to each processing unit.

Given p processing units $p_i$, a fixed processing unit $p_j$ holds the message $m = m_1 \cdot m_2 \cdot \ldots \cdot m_p$. We want to transport the message $m_i$ onto $p_i$. The same implementation concerns as for gather (see Gather) apply. This leads to an optimal runtime in $\mathcal{O}(\alpha \log p + \beta p n)$.
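Scatter on a binomial tree can be sketched as gather run in reverse: in every round, each unit that still holds several parts passes the upper half of them on. The function name `binomial_scatter`, the choice of unit 0 as the root and the measurement of size via `len()` are assumptions of this example.

```python
def binomial_scatter(parts):
    """Simulate scatter from unit 0; parts[i] is destined for unit i.

    Returns (list of delivered parts per unit, largest transfer per round).
    """
    held = {0: list(parts)}           # unit -> parts for unit, unit + 1, ...
    words_per_round = []
    while any(len(chunk) > 1 for chunk in held.values()):
        nxt = {}
        largest = 0
        for unit, chunk in held.items():
            # Keep the lower half (always including the unit's own part).
            keep = len(chunk) // 2 if len(chunk) > 1 else 1
            nxt[unit] = chunk[:keep]
            if len(chunk) > keep:
                sent = chunk[keep:]
                nxt[unit + keep] = sent           # delegate the upper half
                largest = max(largest, sum(len(part) for part in sent))
        words_per_round.append(largest)
        held = nxt
    return [held[i][0] for i in range(len(parts))], words_per_round


if __name__ == "__main__":
    parts = ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh"]
    print(binomial_scatter(parts))
```

The transfer sizes halve from round to round, mirroring the doubling observed for gather.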

All-to-all

All-to-all[11] is the most general communication pattern. For $0 \le i, j < p$, message $m_{i,j}$ is the message that is initially stored on node i and has to be delivered to node j. We can express all communication primitives that do not use operators through all-to-all. For example, broadcast of message m from node $p_k$ is emulated by setting $m_{i,j} = m$ for $i = k$ and setting $m_{l,j}$ empty for $l \ne k$.

Assuming we have a fully connected network, the best possible runtime for all-to-all is in $\mathcal{O}(p(\alpha + \beta n))$. This is achieved through p rounds of direct message exchange. For p a power of 2, in communication round k, node $p_i$ exchanges messages with node $p_j$, $j = i \oplus k$, where $\oplus$ denotes bitwise exclusive or.
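The direct exchange schedule for p a power of two can be sketched as a simulation in which node i is paired with node i XOR k in round k. The name `xor_all_to_all` and the matrix-style `outbox`/`inbox` representation are assumptions of this example. Round 0 pairs every node with itself and therefore only performs a local copy.

```python
def xor_all_to_all(outbox):
    """Simulate direct all-to-all; outbox[i][j] goes from node i to node j.

    Returns (inbox, rounds) where inbox[j][i] is the message j received
    from i, and rounds[k] lists the node pairs exchanging in round k.
    """
    p = len(outbox)
    if p & (p - 1):
        raise ValueError("p must be a power of two")
    inbox = [[None] * p for _ in range(p)]
    rounds = []
    for k in range(p):
        pairs = []
        for i in range(p):
            j = i ^ k                         # partner of i in round k
            inbox[j][i] = outbox[i][j]
            if i < j:
                pairs.append((i, j))
        rounds.append(pairs)
    return inbox, rounds


if __name__ == "__main__":
    p = 4
    outbox = [[f"{i}->{j}" for j in range(p)] for i in range(p)]
    inbox, rounds = xor_all_to_all(outbox)
    for k, pairs in enumerate(rounds):
        print(f"round {k}: {pairs}")
```

Because XOR with a fixed k is its own inverse, every round with k > 0 pairs up all nodes without conflicts, and each pair exchanges one message of size n in each direction.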

If the message size is small and latency dominates the communication, a hypercube algorithm can be used to distribute the messages in time $\mathcal{O}(\log p \, (\alpha + \beta p n))$.

There are three rectangles vertically aligned on the left and three rectangles vertically aligned on the right. The rectangles are three times as high as they are wide. The terms a1, a2 and a3 are written in the high left rectangle one below the other. The terms b1, b2 and b3 are written in the mid left rectangle one below the other. The terms c1, c2 and c3 are written in the low left rectangle one below the other. The terms a1, b1 and c1 are written in the high right rectangle one below the other. The terms a2, b2 and c2 are written in the mid right rectangle one below the other. The terms a3, b3 and c3 are written in the low right rectangle one below the other. A dotted line connects a1 from the high left rectangle and a1 from the high right rectangle. A dotted line connects b2 from the mid left rectangle and b2 from the mid right rectangle. A dotted line connects c3 from the low left rectangle and c3 from the low right rectangle. Solid lines connect the other corresponding terms between the left and right rectangles.
Information flow of All-to-All operation performed on three nodes. Letters indicate nodes and numbers indicate information items.

Runtime Overview

This table[12] gives an overview of the best known asymptotic runtimes, assuming we have free choice of network topology.

Example topologies suited to optimal runtime are the binary tree, the binomial tree and the hypercube.

In practice, we have to adjust to the available physical topologies, e.g. dragonfly, fat tree or grid network, among others.

More information can be found under Network topology.

For each operation, the optimal algorithm can depend on the input size n. For example, broadcast for short messages is best implemented using a binomial tree, whereas for long messages pipelined communication on a balanced binary tree is optimal.

The complexities stated in the table depend on the latency α and the communication cost per word β in addition to the number of processing units p and the input message size per node n. The # senders and # receivers columns represent the number of senders and receivers that are involved in the operation respectively. The # messages column lists the number of input messages and the Computations? column indicates if any computations are done on the messages or if the messages are just delivered without processing. Complexity gives the asymptotic runtime complexity of an optimal implementation under free choice of topology.

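| Name       | # senders | # receivers | # messages | Computations? | Complexity |
|------------|-----------|-------------|------------|---------------|------------|
| Broadcast  | 1         | p           | 1          | no            | $\mathcal{O}(\alpha \log p + \beta n)$ |
| Reduce     | p         | 1           | p          | yes           | $\mathcal{O}(\alpha \log p + \beta n)$ |
| All-reduce | p         | p           | p          | yes           | $\mathcal{O}(\alpha \log p + \beta n)$ |
| Prefix sum | p         | p           | p          | yes           | $\mathcal{O}(\alpha \log p + \beta n)$ |
| Barrier    | p         | p           | 0          | no            | $\mathcal{O}(\alpha \log p)$ |
| Gather     | p         | 1           | p          | no            | $\mathcal{O}(\alpha \log p + \beta p n)$ |
| All-Gather | p         | p           | p          | no            | $\mathcal{O}(\alpha \log p + \beta p n)$ |
| Scatter    | 1         | p           | p          | no            | $\mathcal{O}(\alpha \log p + \beta p n)$ |
| All-To-All | p         | p           | $p^2$      | no            | $\mathcal{O}(\log p \, (\alpha + \beta p n))$ or $\mathcal{O}(p(\alpha + \beta n))$ |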

References

  1. Intercommunicator Collective Operations. The Message Passing Interface (MPI) standard, chapter 7.3.1. Mathematics and Computer Science Division, Argonne National Laboratory.
  2. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 395
  3. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 396-401
  4. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 402-403
  5. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 403-404
  6. Template:Cite journal
  7. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 404-406
  8. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 408
  9. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 412-413
  10. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 413
  11. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 413-418
  12. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 394