Two-tree broadcast

The two-tree broadcast (abbreviated 2tree-broadcast or 23-broadcast) is an algorithm that implements a broadcast communication pattern on a distributed system using message passing. A broadcast is a commonly used collective operation that sends data from one processor to all other processors. The two-tree broadcast communicates concurrently over two binary trees that span all processors. This achieves full usage of the bandwidth in the full-duplex communication model while having a startup latency logarithmic in the number of participating processors.[1] The algorithm can also be adapted to perform a reduction or prefix sum.

Algorithm

A broadcast sends a message from a specified root processor to all other processors. Binary tree broadcasting uses a binary tree to model the communication between the processors. Each processor corresponds to one node in the tree, and the root processor is the root of the tree. To broadcast a message M, the root sends M to its two children (child nodes). Each processor waits until it receives M and then sends M to its children. The broadcasting process can be pipelined by splitting the message into k blocks, which are then broadcast consecutively. In such a binary tree, the leaves of the tree only receive data, but never send any data themselves. If the communication is bidirectional (full-duplex), meaning each processor can send a message and receive a message at the same time, the leaves use only one half of the available bandwidth.
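The pattern above can be sketched in a few lines of Python. The heap-style node numbering (children of node i are 2i + 1 and 2i + 2) is an illustrative assumption, and for simplicity a node serves both children in the same round instead of sending one message per step:

```python
# Sketch of a (non-pipelined) binary tree broadcast.  Node 0 is the root;
# in this hypothetical numbering, the children of node i are 2i+1 and 2i+2.

def tree_broadcast(num_nodes, message):
    received = {0: message}        # the root already has the message
    rounds = 0
    frontier = [0]                 # nodes that received M in the last round
    while frontier:
        next_frontier = []
        for node in frontier:
            for child in (2 * node + 1, 2 * node + 2):
                if child < num_nodes:
                    received[child] = message   # child receives M from parent
                    next_frontier.append(child)
        if next_frontier:
            rounds += 1            # one communication round per tree level
        frontier = next_frontier
    return received, rounds

received, rounds = tree_broadcast(7, "M")   # 7 processors, 3 tree levels
```

With seven processors the message reaches everyone after two rounds, and the four leaves never append any children to the frontier: they only receive, which is the wasted half of the full-duplex bandwidth described above.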

Two-tree broadcast with seven processors, including the edge coloring. T1 in red, T2 in blue. The last processor is the root.

The idea of the two-tree broadcast is to use two binary trees T1 and T2 and communicate on both concurrently.[1] The trees are constructed so that the interior nodes of one tree correspond to leaf nodes of the other tree. The data that has to be broadcast is split into blocks of equal size. In each step of the algorithm, each processor receives one block and sends the previous block to one of its children in the tree in which it is an interior node. A schedule is needed so that no processor has to send or receive two messages in the same step. To create such a schedule, the edges of both trees are colored with 0 and 1 such that the edges from a processor to its parents in T1 and T2 get different colors, as do the two edges from an interior node to its children.

Edges with color 0 are used in even steps, edges with color 1 are used in odd steps. This schedule allows each processor to send one message and receive one message in each step, fully utilizing the available bandwidth.[1]
Assume that processor r wants to broadcast a message. The two trees are constructed for the remaining processors. Processor r sends blocks alternately to the roots of the two trees, so each tree broadcasts one half of the message.

Analysis

Let p be the number of processing elements (PEs), numbered from 0 to p − 1.

Construction of the trees

Two-trees of size 6, 12, 7, 9 using mirroring (top) and shifting (bottom). T1 in red, T2 in blue.

Let h = ⌈log(p + 1)⌉. T1 and T2 can be constructed as trees of height h − 1, such that both trees form an in-order numbering of the processors, with the following method:[1]
T1: If p = 2^h − 1, T1 is a complete binary tree of height h − 1 except that the rightmost leaf is missing. Otherwise, T1 consists of a complete binary tree of height h − 2 covering PEs 0 to 2^(h−1) − 2, a recursively constructed tree covering PEs 2^(h−1) to p − 2, and a root at PE 2^(h−1) − 1 whose children are the roots of the left and the right subtree.
T2: There are two ways to construct T2. With shifting, T2 is first constructed like T1, except that it contains an additional processor. Then T2 is shifted by one position to the left and the leftmost leaf is removed. With mirroring, T2 is the mirror image of T1 (with the mirror axis between processors p/2 − 1 and p/2). Mirroring only works for even p.
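The mirroring construction can be illustrated on a small hand-built example. The tree below is a balanced in-order tree over six processors, an illustrative stand-in for the exact recursive construction, not that construction itself:

```python
# T1 as a hand-built in-order tree over processors 0..5, stored as
# {interior node: [children]}.  Mirroring relabels processor i as p-1-i.

p = 6
t1 = {2: [0, 4], 0: [1], 4: [3, 5]}   # root is 2; in-order: 0,1,2,3,4,5

def mirror(tree, p):
    """Mirror image of the tree: processor i becomes p - 1 - i."""
    return {p - 1 - u: [p - 1 - v for v in kids] for u, kids in tree.items()}

t2 = mirror(t1, p)                    # {3: [5, 1], 5: [4], 1: [2, 0]}

# Interior nodes of one tree are leaves of the other, as required:
interior_t1 = set(t1)                 # {0, 2, 4}
interior_t2 = set(t2)                 # {1, 3, 5}
assert interior_t1.isdisjoint(interior_t2)
assert interior_t1 | interior_t2 == set(range(p))
```

In this example every processor is interior in exactly one of the two trees, which is the property the schedule relies on.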

It can be proven that a coloring with the desired properties exists for all p.[1] When mirroring is used to construct T2, each processor can independently compute the color of its incident edges in O(log p) time.[1]

Communication time

For this analysis, the following communication model is used: a message of size m has a communication time of α + βm, independent of which processors communicate. α represents the startup overhead to send the message, and β represents the transmission time per data element.[2]

Suppose the message of size m is split into k blocks. Each communication step then takes time α + βm/k. Let h be the height of the communication structure, with the root at processor r and the two trees below it; h is in O(log p). After h steps, the first data block has reached every node in both trees. Afterwards, each processor receives one block in every step until it has received all blocks. The total number of steps is h + k, resulting in a total communication time of (h + k)(α + βm/k). Choosing the optimal number of blocks k = √(βmh/α) gives a total communication time of βm + αh + 2√(αβmh).
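Pipelining trades startup cost against per-block overhead. The sketch below assumes the linear cost α + βm per message and a pipeline of depth h, so that broadcasting k blocks costs (h + k)(α + βm/k); all numeric values are made up for illustration:

```python
import math

# Pipelined broadcast time under the assumed alpha-beta model:
#   T(k) = (h + k) * (alpha + beta * m / k)
# alpha: per-message startup, beta: time per byte, h: pipeline depth.
alpha, beta = 1e-6, 1e-9          # made-up machine parameters (seconds)
m, h = 1_000_000, 40              # message size in bytes, pipeline depth

def total_time(k):
    return (h + k) * (alpha + beta * m / k)

k_analytic = math.sqrt(beta * m * h / alpha)    # balances alpha*k vs beta*m*h/k
k_best = min(range(1, 10_000), key=total_time)  # brute-force integer optimum

# With these parameters the analytic optimum is k = 200, and the
# brute-force search over integers lands on the same value.
```

Too few blocks waste time filling the pipeline; too many blocks pay the startup cost α too often. The optimum balances the two.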

Comparison to similar algorithms

In a linear pipeline broadcast, the message is split into k blocks. In each step, each processor i receives one block from processor i − 1 and sends one block to processor i + 1. The linear pipeline has optimal throughput, but a startup time in O(p).[3] Because both algorithms have optimal throughput, while the startup latency of the two-tree broadcast is only in O(log p), the two-tree broadcast is faster for large numbers of processors.

A binomial tree broadcast communicates along a binomial tree. Each process receives the message that is broadcast (the root already has the message) and then sends the message to its children. A binomial tree broadcast has only half the startup time of the two-tree broadcast, but a factor of log p more communication.[4] The binomial tree broadcast is therefore faster than the two-tree broadcast for small messages, but slower for large messages.
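The crossover between the two regimes can be made concrete with the cost model from the analysis section. The formulas below are simplified stand-ins (binomial tree: log p rounds, each sending the whole message; two-tree: full-bandwidth term βm plus a logarithmic startup term, ignoring lower-order terms), with made-up machine parameters:

```python
import math

# Simplified, illustrative cost formulas (not exact):
#   binomial tree:  log2(p) * (alpha + beta * m)
#   two-tree:       2 * log2(p) * alpha + beta * m
alpha, beta, p = 1e-6, 1e-9, 1024
log_p = math.log2(p)                      # 10 for p = 1024

def binomial_time(m):
    return log_p * (alpha + beta * m)     # whole message in every round

def two_tree_time(m):
    return 2 * log_p * alpha + beta * m   # startup term + bandwidth term

small, large = 100, 10_000_000            # message sizes in bytes
# For the small message the binomial tree is cheaper; for the large
# message the two-tree broadcast wins because it sends m only once.
```

The factor log p in the binomial tree's bandwidth term dominates for large m, while its smaller startup term dominates for small m.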

Fibonacci trees of height one to five

A pipelined binary tree broadcast splits the message into k blocks and broadcasts the blocks consecutively over a binary tree. By using a Fibonacci tree instead of a simple balanced binary tree, the startup latency can be reduced.[5] A Fibonacci tree of height h consists of a root that has a Fibonacci tree of height h − 1 as its left child and a Fibonacci tree of height h − 2 as its right child. The pipelined Fibonacci tree broadcast has half the startup latency of the two-tree broadcast, but also only half of its throughput. It is faster for small messages, while the two-tree broadcast is faster for large messages.

Usage for other communication primitives

Reduction

Two-tree reduction with seven processors. The last processor is the root. T1 in red, T2 in blue.

A reduction (MPI_Reduce in the MPI standard) computes M_0 ⊕ M_1 ⊕ … ⊕ M_{p−1}, where M_i is a vector of length m originally available at processor i, and ⊕ is a binary operation that is associative, but not necessarily commutative. The result is stored at a specified root processor r.

Assume that r = 0 or r = p − 1. In this case the communication is identical to the broadcast, except that the communication direction is reversed.[1] Each processor receives two blocks from its children, reduces them with its own block, and sends the result to its parent. The root takes turns receiving blocks from the roots of T1 and T2 and reduces them with its own data. The communication time is the same as for the broadcast, and the amount of data reduced per processor is in O(m).
If the reduce operation is commutative, the result can be achieved for any root by renumbering the processors.
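That an associative but non-commutative operation suffices follows from the in-order numbering of the trees: each processor combines the partial result of its left subtree, its own block, and the partial result of its right subtree, which preserves processor order. A minimal sketch, with string concatenation standing in for ⊕ and a balanced in-order tree standing in for the actual construction:

```python
# Reduction over an in-order binary tree with a non-commutative but
# associative operation (string concatenation).  Because the children of
# a node hold contiguous ranges left and right of it, combining
# left + own + right preserves the processor order.

def reduce_inorder(data, lo, hi):
    if lo > hi:
        return None
    mid = (lo + hi) // 2                       # this node's processor
    left = reduce_inorder(data, lo, mid - 1)   # partial result, left child
    right = reduce_inorder(data, mid + 1, hi)  # partial result, right child
    result = data[mid]
    if left is not None:
        result = left + result
    if right is not None:
        result = result + right
    return result

blocks = ["a", "b", "c", "d", "e", "f", "g"]
assert reduce_inorder(blocks, 0, len(blocks) - 1) == "abcdefg"
```

The same combination order is what each processor performs locally in the reversed two-tree communication.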

Two-tree reduction with 13 processors. The sixth processor (dark grey) is the root. T1 in red, T2 in blue.

If the operation is not commutative and the root is not processor 0 or p − 1, the communication time has a strictly higher lower bound.[1] In this case, the remaining processors are split into two subgroups. The processors 0 to r − 1 perform a reduction to the root r − 1, and the processors r + 1 to p − 1 perform a reduction to the root r + 1. Processor r receives blocks alternately from the two roots of the subgroups.
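The subgroup construction for an arbitrary root can be sketched directly. String concatenation again stands in for the operator, and each subgroup's reduction is shown as a simple fold rather than a full two-tree reduction:

```python
from functools import reduce as fold

# Non-commutative reduction to an arbitrary root r: processors 0..r-1
# and r+1..p-1 reduce separately; the root then combines
# (left result) + (own block) + (right result), preserving order.

def subgroup_reduce(blocks):
    return fold(lambda a, b: a + b, blocks) if blocks else None

def reduce_to_root(data, r):
    left = subgroup_reduce(data[:r])        # subgroup 0 .. r-1
    right = subgroup_reduce(data[r + 1:])   # subgroup r+1 .. p-1
    result = data[r]
    if left is not None:
        result = left + result
    if right is not None:
        result = result + right
    return result

blocks = list("abcdefgh")
assert reduce_to_root(blocks, 5) == "abcdefgh"
```

Because both subgroups hold contiguous processor ranges bordering r, the final combination at the root yields the same order-preserving result as a reduction to processor 0 or p − 1 would.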

Prefix sum

A prefix sum (MPI_Scan in the MPI standard) computes M_0 ⊕ M_1 ⊕ … ⊕ M_j for each processor j, where M_i is a vector of length m originally available at processor i, and ⊕ is a binary associative operation. Using an in-order binary tree, a prefix sum can be computed by first performing an up-phase in which each interior node computes the partial sum M_l ⊕ … ⊕ M_r for the leftmost and rightmost leaves l and r of its subtree, followed by a down-phase in which prefixes of the form M_0 ⊕ … ⊕ M_{l−1} are sent down the tree and allow each processor to finish computing its prefix sum.[1] The communication in the up-phase is equivalent to a reduction to the tree's root processor, and the communication in the down-phase is equivalent to a broadcast from that processor. The total communication time is about twice the communication time of the two-tree broadcast.[1]
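The two phases can be sketched on a balanced in-order tree (again an illustrative stand-in for the trees in the text), with string concatenation standing in for the associative operator:

```python
# Prefix sum via an up-phase (subtree sums) and a down-phase (prefixes),
# on a balanced in-order tree over processors lo..hi.

def combine(a, b):
    """Apply the operator to two optional partial results (None = empty)."""
    if a is None:
        return b
    if b is None:
        return a
    return a + b

def up_phase(data, lo, hi, subtree):
    """Bottom-up: each node records the sum of its whole subtree."""
    if lo > hi:
        return None
    mid = (lo + hi) // 2
    left = up_phase(data, lo, mid - 1, subtree)
    right = up_phase(data, mid + 1, hi, subtree)
    subtree[(lo, hi)] = combine(combine(left, data[mid]), right)
    return subtree[(lo, hi)]

def down_phase(data, lo, hi, prefix, subtree, out):
    """Top-down: 'prefix' is the sum over processors 0..lo-1."""
    if lo > hi:
        return
    mid = (lo + hi) // 2
    down_phase(data, lo, mid - 1, prefix, subtree, out)
    own_prefix = combine(prefix, subtree.get((lo, mid - 1)))
    out[mid] = combine(own_prefix, data[mid])
    down_phase(data, mid + 1, hi, out[mid], subtree, out)

data = list("abcdef")
subtree, out = {}, {}
up_phase(data, 0, len(data) - 1, subtree)
down_phase(data, 0, len(data) - 1, None, subtree, out)
prefixes = [out[j] for j in range(len(data))]   # "a", "ab", ..., "abcdef"
```

The up-phase mirrors the reduction communication and the down-phase mirrors the broadcast communication, which is why the total cost is about twice that of a single broadcast.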

ESBT broadcast

If p is a power of two, there is an optimal broadcasting algorithm based on edge-disjoint spanning binomial trees (ESBTs) in a hypercube.[6] The hypercube, excluding the root 0, is split into log p ESBTs. The algorithm uses pipelining by splitting the broadcast data into k blocks. Processor 0 cyclically distributes blocks to the roots of the ESBTs, and each ESBT performs a pipelined binomial tree broadcast. In step i, each processor sends and receives one message along dimension i mod log p. The bandwidth term of the ESBT broadcast is equally optimal, while its startup latency is only one half of the startup latency of the two-tree broadcast.[6]
The drawback of the ESBT broadcast is that it does not work for other values of p and that it cannot be adapted for (non-commutative) reduction or prefix sum.

References

  1. Sanders, P.; Speck, J.; Träff, J. L. (2009). "Two-tree algorithms for full bandwidth broadcast, reduction and scan". Parallel Computing. 35 (12): 581–594.
  2. Hockney, R. W. (1994). "The communication challenge for MPP: Intel Paragon and Meiko CS-2". Parallel Computing. 20 (3): 389–398.
  3. Pješivac-Grbović, J.; Angskun, T.; Bosilca, G.; Fagg, G. E.; Gabriel, E.; Dongarra, J. J. (2007). "Performance analysis of MPI collective operations". Cluster Computing. 10 (2): 127–143.
  4. Chan, E.; Heimlich, M.; Purkayastha, A.; van de Geijn, R. (2007). "Collective communication: theory, practice, and experience". Concurrency and Computation: Practice and Experience. 19 (13): 1749–1783.
  5. Bruck, J.; Cypher, R.; Ho, C.-T. (1992). "Multiple message broadcasting with generalized Fibonacci trees". Proceedings of the 4th IEEE Symposium on Parallel and Distributed Processing.
  6. Johnsson, S. L.; Ho, C.-T. (1989). "Optimum broadcasting and personalized communication in hypercubes". IEEE Transactions on Computers. 38 (9): 1249–1268.