[section:collectives Collective operations]

[link mpi.tutorial.point_to_point Point-to-point operations] are the core message passing primitives in Boost.MPI. However, many message-passing applications also require higher-level communication algorithms that combine or summarize the data stored on many different processes. These algorithms support many common tasks such as "broadcast this value to all processes", "compute the sum of the values on all processors" or "find the global minimum."

[section:broadcast Broadcast]

The [funcref boost::mpi::broadcast `broadcast`] algorithm is by far the simplest collective operation. It broadcasts a value from a single process to all other processes within a [classref boost::mpi::communicator communicator]. For instance, the following program broadcasts "Hello, World!" from process 0 to every other process. (`hello_world_broadcast.cpp`)

  #include <boost/mpi.hpp>
  #include <iostream>
  #include <string>
  #include <boost/serialization/string.hpp>
  namespace mpi = boost::mpi;

  int main()
  {
    mpi::environment env;
    mpi::communicator world;

    std::string value;
    if (world.rank() == 0) {
      value = "Hello, World!";
    }

    broadcast(world, value, 0);

    std::cout << "Process #" << world.rank() << " says " << value
              << std::endl;
    return 0;
  }

Running this program with seven processes will produce a result such as:

[pre
Process #0 says Hello, World!
Process #2 says Hello, World!
Process #1 says Hello, World!
Process #4 says Hello, World!
Process #3 says Hello, World!
Process #5 says Hello, World!
Process #6 says Hello, World!
]

[endsect:broadcast]

[section:gather Gather]

The [funcref boost::mpi::gather `gather`] collective gathers the values produced by every process in a communicator into a vector of values on the "root" process (specified by an argument to `gather`). The /i/th element in the vector will correspond to the value gathered from the /i/th process. For instance, in the following program each process computes its own random number. All of these random numbers are gathered at process 0 (the "root" in this case), which prints out the values that correspond to each processor. (`random_gather.cpp`)

  #include <boost/mpi.hpp>
  #include <iostream>
  #include <vector>
  #include <cstdlib>
  namespace mpi = boost::mpi;

  int main()
  {
    mpi::environment env;
    mpi::communicator world;

    std::srand(time(0) + world.rank());
    int my_number = std::rand();
    if (world.rank() == 0) {
      std::vector<int> all_numbers;
      gather(world, my_number, all_numbers, 0);
      for (int proc = 0; proc < world.size(); ++proc)
        std::cout << "Process #" << proc << " thought of "
                  << all_numbers[proc] << std::endl;
    } else {
      gather(world, my_number, 0);
    }

    return 0;
  }

Executing this program with seven processes will result in output such as the following. Although the random values will change from one run to the next, the order of the processes in the output will remain the same because only process 0 writes to `std::cout`.

[pre
Process #0 thought of 332199874
Process #1 thought of 20145617
Process #2 thought of 1862420122
Process #3 thought of 480422940
Process #4 thought of 1253380219
Process #5 thought of 949458815
Process #6 thought of 650073868
]

The `gather` operation collects values from every process into a vector at one process. If instead the values from every process need to be collected into identical vectors on every process, use the [funcref boost::mpi::all_gather `all_gather`] algorithm, which is semantically equivalent to calling `gather` followed by a `broadcast` of the resulting vector, as sketched below.
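As a minimal sketch (this is not one of the tutorial's example programs), `random_gather.cpp` could be rewritten with `all_gather` so that every process, not just the root, ends up with the complete vector of random numbers:

  #include <boost/mpi.hpp>
  #include <iostream>
  #include <vector>
  #include <cstdlib>
  namespace mpi = boost::mpi;

  int main()
  {
    mpi::environment env;
    mpi::communicator world;

    std::srand(time(0) + world.rank());
    int my_number = std::rand();

    // Unlike gather, every process supplies an output vector and no
    // root is named, because every process receives all of the values.
    std::vector<int> all_numbers;
    all_gather(world, my_number, all_numbers);

    // Every process now holds the same vector; here the last rank,
    // rather than rank 0, prints it to demonstrate that.
    if (world.rank() == world.size() - 1)
      for (int proc = 0; proc < world.size(); ++proc)
        std::cout << "Process #" << proc << " thought of "
                  << all_numbers[proc] << std::endl;

    return 0;
  }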
[endsect:gather]

[section:scatter Scatter]

The [funcref boost::mpi::scatter `scatter`] collective scatters the values from a vector held by the "root" process across all the processes of the communicator. The /i/th element in the vector will correspond to the value received by the /i/th process. For instance, in the following program, the root process produces a vector of random numbers and sends one value to each process, which then prints it. (`random_scatter.cpp`)

  #include <boost/mpi.hpp>
  #include <boost/mpi/collectives.hpp>
  #include <iostream>
  #include <vector>
  #include <cstdlib>
  #include <algorithm>

  namespace mpi = boost::mpi;

  int main(int argc, char* argv[])
  {
    mpi::environment env(argc, argv);
    mpi::communicator world;

    std::srand(time(0) + world.rank());
    std::vector<int> all;
    int mine = -1;
    if (world.rank() == 0) {
      all.resize(world.size());
      std::generate(all.begin(), all.end(), std::rand);
    }
    mpi::scatter(world, all, mine, 0);
    for (int r = 0; r < world.size(); ++r) {
      world.barrier();
      if (r == world.rank()) {
        std::cout << "Rank " << r << " got " << mine << '\n';
      }
    }
    return 0;
  }

Executing this program with seven processes will result in output such as the following. Although the random values will change from one run to the next, the order of the processes in the output will remain the same because of the barrier.

[pre
Rank 0 got 1409381269
Rank 1 got 17045268
Rank 2 got 440120016
Rank 3 got 936998224
Rank 4 got 1827129182
Rank 5 got 1951746047
Rank 6 got 2117359639
]

[endsect:scatter]

[section:reduce Reduce]

The [funcref boost::mpi::reduce `reduce`] collective summarizes the values from each process into a single value at the user-specified "root" process. The Boost.MPI `reduce` operation is similar in spirit to the STL _accumulate_ operation, because it takes a sequence of values (one per process) and combines them via a function object. For instance, we can randomly generate values in each process and then compute the minimum value over all processes via a call to [funcref boost::mpi::reduce `reduce`] (`random_min.cpp`):

  #include <boost/mpi.hpp>
  #include <iostream>
  #include <cstdlib>
  namespace mpi = boost::mpi;

  int main()
  {
    mpi::environment env;
    mpi::communicator world;

    std::srand(time(0) + world.rank());
    int my_number = std::rand();

    if (world.rank() == 0) {
      int minimum;
      reduce(world, my_number, minimum, mpi::minimum<int>(), 0);
      std::cout << "The minimum value is " << minimum << std::endl;
    } else {
      reduce(world, my_number, mpi::minimum<int>(), 0);
    }

    return 0;
  }

The use of `mpi::minimum<int>` indicates that the minimum value should be computed. `mpi::minimum<int>` is a binary function object that compares its two parameters via `<` and returns the smaller value. Any associative binary function or function object will work provided it's stateless. For instance, to concatenate strings with `reduce` one could use the function object `std::plus<std::string>` (`string_cat.cpp`):

  #include <boost/mpi.hpp>
  #include <iostream>
  #include <string>
  #include <functional>
  #include <boost/serialization/string.hpp>
  namespace mpi = boost::mpi;

  int main()
  {
    mpi::environment env;
    mpi::communicator world;

    std::string names[10] = { "zero ", "one ", "two ", "three ",
                              "four ", "five ", "six ", "seven ",
                              "eight ", "nine " };

    std::string result;
    reduce(world,
           world.rank() < 10? names[world.rank()]
                            : std::string("many "),
           result, std::plus<std::string>(), 0);

    if (world.rank() == 0)
      std::cout << "The result is " << result << std::endl;

    return 0;
  }

In this example, we compute a string for each process and then perform a reduction that concatenates all of the strings together into one long string.
Executing this program with seven processors yields the following output:

[pre
The result is zero one two three four five six
]

[h4 Binary operations for reduce]

Any kind of binary function object can be used with `reduce`. There are many such function objects in the C++ standard `<functional>` header and the Boost.MPI header `<boost/mpi/operations.hpp>`, or you can create your own function object (a sketch of the latter appears at the end of this subsection). Function objects used with `reduce` must be associative, i.e. `f(x, f(y, z))` must be equivalent to `f(f(x, y), z)`. If they are also commutative (i.e., `f(x, y) == f(y, x)`), Boost.MPI can use a more efficient implementation of `reduce`. To state that a function object is commutative, you will need to specialize the class [classref boost::mpi::is_commutative `is_commutative`]. For instance, we could modify the previous example by telling Boost.MPI that string concatenation is commutative:

  namespace boost { namespace mpi {

    template<>
    struct is_commutative<std::plus<std::string>, std::string>
      : mpl::true_ { };

  } } // end namespace boost::mpi

By adding this code prior to `main()`, Boost.MPI will assume that string concatenation is commutative and employ a different parallel algorithm for the `reduce` operation. Using this algorithm, the program outputs the following when run with seven processes:

[pre
The result is zero one four five six two three
]

Note how the numbers in the resulting string are in a different order: this is a direct result of Boost.MPI reordering operations. The result in this case differs from the non-commutative result because string concatenation is not commutative: `f("x ", "y ")` is not the same as `f("y ", "x ")`, because argument order matters. For truly commutative operations (e.g., integer addition), the more efficient commutative algorithm will produce the same result as the non-commutative algorithm.

Boost.MPI also performs direct mappings from function objects in `<functional>` to `MPI_Op` values predefined by MPI (e.g., `MPI_SUM`, `MPI_MAX`); if you have your own function objects that can take advantage of this mapping, see the class template [classref boost::mpi::is_mpi_op `is_mpi_op`].

[warning Due to the underlying MPI limitations, it is important to note that the operation must be stateless.]
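As promised above, here is a sketch of a user-defined operation (this is not one of the tutorial's example programs, and the functor name `gcd_op` is purely illustrative). It reduces the greatest common divisor of one integer per process; the functor is stateless and associative, and the optional `is_commutative` specialization declares it commutative so that Boost.MPI may use the more efficient algorithm:

  #include <boost/mpi.hpp>
  #include <iostream>
  #include <cstdlib>
  #include <ctime>
  namespace mpi = boost::mpi;

  // Hypothetical user-defined binary function object: greatest common
  // divisor of two non-negative integers. Stateless and associative.
  struct gcd_op {
    int operator()(int a, int b) const
    {
      while (b != 0) { int t = a % b; a = b; b = t; }
      return a;
    }
  };

  // Optional: declare the operation commutative so that Boost.MPI may
  // choose the more efficient reduction schedule.
  namespace boost { namespace mpi {
    template<> struct is_commutative<gcd_op, int> : mpl::true_ { };
  } } // end namespace boost::mpi

  int main()
  {
    mpi::environment env;
    mpi::communicator world;

    std::srand(time(0) + world.rank());
    int my_number = 6 * (std::rand() % 1000 + 1); // every value shares a factor of 6

    if (world.rank() == 0) {
      int result;
      reduce(world, my_number, result, gcd_op(), 0);
      std::cout << "The GCD of all values is " << result << std::endl;
    } else {
      reduce(world, my_number, gcd_op(), 0);
    }

    return 0;
  }

Omitting the `is_commutative` specialization is always correct; it merely forgoes the potential optimization.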
[h4 All process variant]

Like [link mpi.tutorial.collectives.gather `gather`], `reduce` has an "all" variant called [funcref boost::mpi::all_reduce `all_reduce`] that performs the reduction operation and broadcasts the result to all processes. This variant is useful, for instance, in establishing global minimum or maximum values. The following code (`global_min.cpp`) shows a broadcasting version of the `random_min.cpp` example:

  #include <boost/mpi.hpp>
  #include <iostream>
  #include <cstdlib>
  namespace mpi = boost::mpi;

  int main(int argc, char* argv[])
  {
    mpi::environment env(argc, argv);
    mpi::communicator world;

    std::srand(world.rank());
    int my_number = std::rand();
    int minimum;

    mpi::all_reduce(world, my_number, minimum, mpi::minimum<int>());

    if (world.rank() == 0) {
      std::cout << "The minimum value is " << minimum << std::endl;
    }

    return 0;
  }

In that example we provide both input and output values, requiring twice as much space, which can be a problem depending on the size of the transmitted data. If there is no need to preserve the input value, the output value can be omitted. In that case the input value is overwritten with the output value, and Boost.MPI is able, in some situations, to implement the operation with a more space-efficient solution (using the `MPI_IN_PLACE` flag of the MPI C mapping), as in the following example (`in_place_global_min.cpp`):

  #include <boost/mpi.hpp>
  #include <iostream>
  #include <cstdlib>
  namespace mpi = boost::mpi;

  int main(int argc, char* argv[])
  {
    mpi::environment env(argc, argv);
    mpi::communicator world;

    std::srand(world.rank());
    int my_number = std::rand();

    mpi::all_reduce(world, my_number, mpi::minimum<int>());

    if (world.rank() == 0) {
      std::cout << "The minimum value is " << my_number << std::endl;
    }

    return 0;
  }

[endsect:reduce]

[endsect:collectives]