
Optimize State Copy #116

Open
tkoskela opened this issue Jul 24, 2020 · 3 comments
Labels: enhancement (New feature or request)

tkoskela commented Jul 24, 2020

The point-to-point communications in State Copy can cause a significant slowdown when going off-node; see the example below. The algorithm should be optimized to avoid copying duplicate particles. At present it works roughly as follows:

1. Work out which particles rank i has
2. Work out which particles rank i needs after resampling
3. Work out which rank has each particle rank i needs
4. Work out which ranks need the particles rank i has
5. Send each particle rank i has to every process that needs it
6. Receive each particle rank i needs from the process that has it

https://github.com/Team-RADDISH/TDAC.jl/blob/90a318dbbbd4f80e88e96d2a80522150745705bb/src/TDAC.jl#L452
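The steps above can be sketched as follows (a plain-Python, MPI-free sketch; `plan_copies` and the variable names are illustrative, not TDAC.jl identifiers):

```python
def plan_copies(held_by_rank, needed_by_rank):
    """Plan one point-to-point copy per needed particle.

    held_by_rank[r]   -- particle ids rank r holds (step 1)
    needed_by_rank[r] -- particle ids rank r needs after resampling (step 2)
    Returns (src_rank, dst_rank, particle_id) triples (steps 3-6): the same
    particle is sent once per request, even repeatedly to the same rank.
    """
    owner = {}
    for rank, ids in enumerate(held_by_rank):
        for pid in ids:
            owner.setdefault(pid, rank)          # step 3: who has what
    copies = []
    for dst, ids in enumerate(needed_by_rank):   # steps 4-6
        for pid in ids:
            src = owner[pid]
            if src != dst:                       # local particles need no send
                copies.append((src, dst, pid))
    return copies

# Rank 0 holds particles 0 and 1; rank 1 holds 2 and 3. After resampling,
# the high-weight particle 1 fills every slot on both ranks:
print(plan_copies([[0, 1], [2, 3]], [[1, 1], [1, 1]]))
# -> [(0, 1, 1), (0, 1, 1)]: particle 1 crosses the network twice to rank 1
```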

This is essentially a brute-force solution. In practice it often happens that a single particle has an extremely high weight and is copied many times to many processes; this algorithm then sends the same particle over the network for every copy, even multiple times to the same process. A step should be added that identifies duplicate particles and copies them locally once a single copy has been received.
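A deduplication step along those lines might look like this (a hypothetical sketch, not the TDAC.jl implementation): each rank receives at most one network copy of each distinct particle and fills its remaining slots by copying locally.

```python
def plan_copies_dedup(held_by_rank, needed_by_rank):
    """Send each distinct particle to a given rank at most once;
    further duplicates are created by local in-memory copies."""
    owner = {}
    for rank, ids in enumerate(held_by_rank):
        for pid in ids:
            owner.setdefault(pid, rank)
    sends, local_copies = [], []
    for dst, ids in enumerate(needed_by_rank):
        have = set(held_by_rank[dst])            # already resident: no send
        for pid in ids:
            if pid in have:
                local_copies.append((dst, pid))  # duplicate locally
            else:
                sends.append((owner[pid], dst, pid))
                have.add(pid)                    # subsequent copies are local
    return sends, local_copies

# Same scenario as before: particle 1 fills every slot on both ranks.
sends, local_copies = plan_copies_dedup([[0, 1], [2, 3]], [[1, 1], [1, 1]])
print(sends)  # -> [(0, 1, 1)]: one network send instead of two
```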

[image attachment: timing example of the off-node slowdown]

@tkoskela tkoskela added the enhancement New feature or request label Jul 24, 2020
@tkoskela tkoskela self-assigned this Jul 24, 2020
tkoskela (Member, Author) commented:

The root of the issue is that, by the nature of the algorithm, state_copy creates severe load imbalances that are hard to fix. They only show up as a slowdown at the next collective call, which happens to be the MPI_Reduce in get_mean_and_var. The idea above may help but is unlikely to fix the issue completely.

tkoskela (Member, Author) commented:

https://github.com/Team-RADDISH/TDAC.jl/blob/8c67e8c6c3e492267983628f56406daac0c29dfe/src/TDAC.jl#L458-L465 could be done on the master rank, with only the results sent via MPI. This would probably save some time compared to broadcasting the whole resampling_indices vector.
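A minimal sketch of that idea (illustrative Python; the real code is Julia and the names here are hypothetical): the master splits the global resampled-index vector and would send each rank only its own slice, rather than broadcasting the whole `resampling_indices` vector.

```python
def per_rank_slices(resampling_indices, nprocs):
    """On the master: split the global resampled-index vector so each
    rank is sent only the indices for the particle slots it owns.
    Assumes the particle count divides evenly across ranks."""
    n = len(resampling_indices) // nprocs
    return [resampling_indices[r * n:(r + 1) * n] for r in range(nprocs)]

# 4 particle slots over 2 ranks; particle 1 was resampled three times:
print(per_rank_slices([1, 1, 1, 3], 2))  # -> [[1, 1], [1, 3]]
# The master would then point-to-point send slice r to rank r,
# instead of broadcasting all of resampling_indices to every rank.
```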


matt-graham commented Jan 14, 2025

Copying the relevant part of a Slack discussion with @tkoskela about further ideas for optimizing state copies in the resampling step:

With regard to the slowdown in communication when most of the weight is on one or a few particles: just to check, is the bottleneck effectively that the one or small number of ranks holding these particles have to sequentially copy them to many other ranks? If so, would something like the following help: the one rank sends to one other rank, then both of these ranks each send to another rank, then these four ranks send to another four ranks, and so on. This would reduce the number of sends/receives that need to be performed sequentially from roughly linear in the number of particles to logarithmic (though only if there is negligible overhead to multiple pairs of ranks communicating in parallel). This idea is from https://arxiv.org/pdf/1812.01502.pdf
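The doubling scheme described above (essentially a binomial-tree broadcast of the heavy particle) can be sketched as follows; `doubling_schedule` is a hypothetical helper, not existing code:

```python
def doubling_schedule(src, dst_ranks):
    """Schedule rounds of concurrent sends: in each round, every rank that
    already holds the particle forwards it to one rank that still needs it,
    so the number of holders doubles per round."""
    holders = [src]
    todo = list(dst_ranks)
    rounds = []
    while todo:
        sends = []
        for h in list(holders):       # snapshot: new holders join next round
            if not todo:
                break
            dst = todo.pop(0)
            sends.append((h, dst))
            holders.append(dst)
        rounds.append(sends)
    return rounds

# One heavy particle on rank 0 must reach ranks 1..7:
for i, sends in enumerate(doubling_schedule(0, list(range(1, 8)))):
    print(i, sends)
# 3 rounds of parallel sends instead of 7 sequential ones:
# 0 [(0, 1)]
# 1 [(0, 2), (1, 3)]
# 2 [(0, 4), (1, 5), (2, 6), (3, 7)]
```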

Another idea would be to use something like https://juliaoptimaltransport.github.io/ExactOptimalTransport.jl/stable/ to solve for the resampling plan which minimises, on average, the number of particles that need to be moved between ranks. The cost of solving this exactly is not trivial (in general, solving for the optimal transport plan is roughly cubic in the number of particles, though here, as the transport cost matrix is binary, this can probably be reduced with a more specific solver), so it would probably only be worthwhile when the communication overhead is significant.

Currently I don't think we exploit the fact that we can reorder which particles are assigned to which ranks arbitrarily, to try to minimize the number of point-to-point communications we need to make. There are also methods that explicitly minimize the expected communication cost under the constraint that we resample according to the required distribution (solving an optimal transport problem, https://en.wikipedia.org/wiki/Transportation_theory_(mathematics)), which we could further exploit to reduce communication costs.
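As a cheap approximation to a full optimal-transport solve, the reordering freedom could be exploited greedily, keeping each resampled copy on a rank that already holds the particle whenever possible (a hypothetical sketch; names are illustrative, and a real solver could do strictly better):

```python
from collections import Counter

def greedy_reassign(held_by_rank, required_counts, slots_per_rank):
    """Assign resampled particle copies to ranks so that as many copies as
    possible land on a rank that already holds the particle (no send)."""
    remaining = Counter(required_counts)
    assignment = [[] for _ in held_by_rank]
    # Pass 1: keep particles where they already live.
    for rank, ids in enumerate(held_by_rank):
        for pid in ids:
            if remaining[pid] > 0 and len(assignment[rank]) < slots_per_rank:
                assignment[rank].append(pid)
                remaining[pid] -= 1
    # Pass 2: fill free slots, preferring particles the rank already holds.
    moved = 0
    leftovers = [pid for pid, c in remaining.items() for _ in range(c)]
    for rank, slots in enumerate(assignment):
        held = set(held_by_rank[rank])
        while len(slots) < slots_per_rank and leftovers:
            pid = next((p for p in leftovers if p in held), leftovers[-1])
            leftovers.remove(pid)
            slots.append(pid)
            if pid not in held:
                moved += 1                # this copy needs a network send
    return assignment, moved

# Ranks hold particles [0, 1] and [2, 3]; resampling requires 3 copies of
# particle 1 and 1 copy of particle 3 (2 slots per rank):
assignment, moved = greedy_reassign([[0, 1], [2, 3]], {1: 3, 3: 1}, 2)
print(assignment, moved)  # -> [[1, 1], [3, 1]] 1: only one copy crosses ranks
```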
