
Optimize State Copy #116

Open
tkoskela opened this issue Jul 24, 2020 · 3 comments
Labels: enhancement (New feature or request)

tkoskela commented Jul 24, 2020

The point-to-point communications in State Copy can cause a significant slowdown when going off-node; see the example below. The algorithm should be optimized to avoid copying duplicate particles. At present it works roughly as follows:

1. Work out which particles rank i has
2. Work out which particles rank i needs after resampling
3. Work out which rank has each particle rank i needs
4. Work out which ranks need the particles rank i has
5. Send each particle rank i has to every process that needs it
6. Receive each particle rank i needs from the process that has it

https://github.com/Team-RADDISH/TDAC.jl/blob/90a318dbbbd4f80e88e96d2a80522150745705bb/src/TDAC.jl#L452
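The steps above can be sketched as follows (a plain-Python, MPI-free sketch; `plan_copies` and the variable names are illustrative, not TDAC.jl identifiers):

```python
def plan_copies(held_by_rank, needed_by_rank):
    """Plan one point-to-point copy per needed particle.

    held_by_rank[r]   -- particle ids rank r holds (step 1)
    needed_by_rank[r] -- particle ids rank r needs after resampling (step 2)
    Returns (src_rank, dst_rank, particle_id) triples (steps 3-6): the same
    particle is sent once per request, even repeatedly to the same rank.
    """
    owner = {}
    for rank, ids in enumerate(held_by_rank):
        for pid in ids:
            owner.setdefault(pid, rank)          # step 3: who has what
    copies = []
    for dst, ids in enumerate(needed_by_rank):   # steps 4-6
        for pid in ids:
            src = owner[pid]
            if src != dst:                       # local particles need no send
                copies.append((src, dst, pid))
    return copies

# Rank 0 holds particles 0 and 1; rank 1 holds 2 and 3. After resampling,
# the high-weight particle 1 fills every slot on both ranks:
print(plan_copies([[0, 1], [2, 3]], [[1, 1], [1, 1]]))
# -> [(0, 1, 1), (0, 1, 1)]: particle 1 crosses the network twice to rank 1
```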

This is essentially a brute-force solution. In practice it often happens that a single particle has an extremely high weight and is copied many times to many processes; this algorithm then sends the same particle over the network for every copy, even multiple times to the same process. A step should be added that identifies duplicate particles and copies them locally once a single copy has been received.
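A deduplication step along those lines might look like this (a hypothetical sketch, not the TDAC.jl implementation): each rank receives at most one network copy of each distinct particle and fills its remaining slots by copying locally.

```python
def plan_copies_dedup(held_by_rank, needed_by_rank):
    """Send each distinct particle to a given rank at most once;
    further duplicates are created by local in-memory copies."""
    owner = {}
    for rank, ids in enumerate(held_by_rank):
        for pid in ids:
            owner.setdefault(pid, rank)
    sends, local_copies = [], []
    for dst, ids in enumerate(needed_by_rank):
        have = set(held_by_rank[dst])            # already resident: no send
        for pid in ids:
            if pid in have:
                local_copies.append((dst, pid))  # duplicate locally
            else:
                sends.append((owner[pid], dst, pid))
                have.add(pid)                    # subsequent copies are local
    return sends, local_copies

# Same scenario as before: particle 1 fills every slot on both ranks.
sends, local_copies = plan_copies_dedup([[0, 1], [2, 3]], [[1, 1], [1, 1]])
print(sends)  # -> [(0, 1, 1)]: one network send instead of two
```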

[image attachment: timing example of the off-node slowdown]

@tkoskela tkoskela added the enhancement New feature or request label Jul 24, 2020
@tkoskela tkoskela self-assigned this Jul 24, 2020
tkoskela (Member, Author) commented:

The root of the issue is that, by the nature of the algorithm, state_copy creates severe load imbalances that are hard to fix. They only show up as a slowdown at the next collective call, which happens to be the MPI_Reduce in get_mean_and_var. The idea above may help but is unlikely to fix the issue completely.

tkoskela (Member, Author) commented:

https://github.com/Team-RADDISH/TDAC.jl/blob/8c67e8c6c3e492267983628f56406daac0c29dfe/src/TDAC.jl#L458-L465 could be done on the master rank, with only the results sent via MPI. This would probably save some time compared to broadcasting the whole resampling_indices vector.
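A minimal sketch of that idea (illustrative Python; the real code is Julia and the names here are hypothetical): the master splits the global resampled-index vector and would send each rank only its own slice, rather than broadcasting the whole `resampling_indices` vector.

```python
def per_rank_slices(resampling_indices, nprocs):
    """On the master: split the global resampled-index vector so each
    rank is sent only the indices for the particle slots it owns.
    Assumes the particle count divides evenly across ranks."""
    n = len(resampling_indices) // nprocs
    return [resampling_indices[r * n:(r + 1) * n] for r in range(nprocs)]

# 4 particle slots over 2 ranks; particle 1 was resampled three times:
print(per_rank_slices([1, 1, 1, 3], 2))  # -> [[1, 1], [1, 3]]
# The master would then point-to-point send slice r to rank r,
# instead of broadcasting all of resampling_indices to every rank.
```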


matt-graham commented Jan 14, 2025

Copying the relevant part of a Slack discussion with @tkoskela about further ideas for optimizing state copies in the resampling step:

With regard to the slowdown in communication when most of the weight is on one or a few particles: just to check, is the bottleneck effectively that the one or small number of ranks holding these particles have to sequentially copy them to many other ranks? If so, would something like the following help: the one rank sends to one other rank, then both of these ranks each send to another rank, then these four ranks send to another four ranks, and so on. This would reduce the number of sends/receives that need to be performed sequentially from roughly linear in the number of particles to logarithmic (though only if there is negligible overhead to multiple pairs of ranks communicating in parallel). This idea is from https://arxiv.org/pdf/1812.01502.pdf
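The doubling scheme described above (essentially a binomial-tree broadcast of the heavy particle) can be sketched as follows; `doubling_schedule` is a hypothetical helper, not existing code:

```python
def doubling_schedule(src, dst_ranks):
    """Schedule rounds of concurrent sends: in each round, every rank that
    already holds the particle forwards it to one rank that still needs it,
    so the number of holders doubles per round."""
    holders = [src]
    todo = list(dst_ranks)
    rounds = []
    while todo:
        sends = []
        for h in list(holders):       # snapshot: new holders join next round
            if not todo:
                break
            dst = todo.pop(0)
            sends.append((h, dst))
            holders.append(dst)
        rounds.append(sends)
    return rounds

# One heavy particle on rank 0 must reach ranks 1..7:
for i, sends in enumerate(doubling_schedule(0, list(range(1, 8)))):
    print(i, sends)
# 3 rounds of parallel sends instead of 7 sequential ones:
# 0 [(0, 1)]
# 1 [(0, 2), (1, 3)]
# 2 [(0, 4), (1, 5), (2, 6), (3, 7)]
```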

Another idea would be to use something like https://juliaoptimaltransport.github.io/ExactOptimalTransport.jl/stable/ to solve for the resampling plan which minimises, on average, the number of particles that need to be moved between ranks. The cost of solving this exactly is not trivial (in general, solving for the optimal transport plan is roughly cubic in the number of particles, though here, as the transport cost matrix is binary, this can probably be reduced with a more specific solver), so it would probably only be worthwhile when the communication overhead is significant.

Currently I don't think we exploit the fact that we can reorder which particles are assigned to which ranks arbitrarily, to try to minimize the number of point-to-point communications we need to make. There are also methods that explicitly minimize the expected communication cost under the constraint that we resample according to the required distribution (solving an optimal transport problem, https://en.wikipedia.org/wiki/Transportation_theory_(mathematics)), which we could further exploit to reduce communication costs.
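As a cheap approximation to a full optimal-transport solve, the reordering freedom could be exploited greedily, keeping each resampled copy on a rank that already holds the particle whenever possible (a hypothetical sketch; names are illustrative, and a real solver could do strictly better):

```python
from collections import Counter

def greedy_reassign(held_by_rank, required_counts, slots_per_rank):
    """Assign resampled particle copies to ranks so that as many copies as
    possible land on a rank that already holds the particle (no send)."""
    remaining = Counter(required_counts)
    assignment = [[] for _ in held_by_rank]
    # Pass 1: keep particles where they already live.
    for rank, ids in enumerate(held_by_rank):
        for pid in ids:
            if remaining[pid] > 0 and len(assignment[rank]) < slots_per_rank:
                assignment[rank].append(pid)
                remaining[pid] -= 1
    # Pass 2: fill free slots, preferring particles the rank already holds.
    moved = 0
    leftovers = [pid for pid, c in remaining.items() for _ in range(c)]
    for rank, slots in enumerate(assignment):
        held = set(held_by_rank[rank])
        while len(slots) < slots_per_rank and leftovers:
            pid = next((p for p in leftovers if p in held), leftovers[-1])
            leftovers.remove(pid)
            slots.append(pid)
            if pid not in held:
                moved += 1                # this copy needs a network send
    return assignment, moved

# Ranks hold particles [0, 1] and [2, 3]; resampling requires 3 copies of
# particle 1 and 1 copy of particle 3 (2 slots per rank):
assignment, moved = greedy_reassign([[0, 1], [2, 3]], {1: 3, 3: 1}, 2)
print(assignment, moved)  # -> [[1, 1], [3, 1]] 1: only one copy crosses ranks
```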
