Possible bug when running on one MPI process #321
Comments
On my Ubuntu-22.04 laptop with One MPI process
Two MPI processes
Four MPI processes
The largest relative differences between these are on the order of 1e-9, so the develop branch seems to be OK.
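For reference, a comparison like the one above can be sketched as follows. This is a minimal sketch, not the script actually used: the function name `max_relative_difference`, the run labels, and the numbers are all placeholders (they are not CONQUEST output), and it assumes each run reports the same quantities in the same order.

```python
from itertools import combinations


def max_relative_difference(runs):
    """Return the largest pairwise relative difference across runs.

    `runs` maps a label (hypothetical, e.g. "np1") to a list of floats;
    every run is assumed to list the same quantities in the same order.
    """
    worst = 0.0
    for (_, a), (_, b) in combinations(runs.items(), 2):
        if len(a) != len(b):
            raise ValueError("runs must report the same number of values")
        for x, y in zip(a, b):
            scale = max(abs(x), abs(y))
            if scale == 0.0:
                continue  # both values are exactly zero, nothing to compare
            worst = max(worst, abs(x - y) / scale)
    return worst


# Placeholder values for illustration only, not real CONQUEST results.
runs = {
    "np1": [-10.0, 2.0],
    "np2": [-10.0, 2.0 + 2e-9],
    "np4": [-10.0 - 1e-8, 2.0],
}
print(max_relative_difference(runs))
```

Comparing against the larger magnitude of each pair keeps the measure symmetric, so it does not matter which run is treated as the reference.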
On myriad, with the current develop branch I'm getting a segfault in Also seems related to
Built with
and
TODO: @tkoskela to test if this happens on ARCHER2
Ran benchmarks/matrix_multiply and tests 001 and 002 on Archer2 with 1 and 2 MPI ranks. No segfaults. On Archer2 I build using
I activated the CI for https://github.com/OrderN/CONQUEST-release/actions/runs/8815200505/job/24208939487
The bug arises from the use of FFTW in the calculation of exact exchange. The bug was reproduced with an old version of Conquest (predating the GitHub repository). After switching to FFTE (a Japanese FFT implementation, I think), the problem disappears. I suspect the problem is in init/destroy, but I can't be sure.
There's possibly a bug in the MPI communication that appears when running on one process. Collecting hints in this issue.

In `test_004` `off-exx-opt` we notice a difference on the order of 1e-5 in the Harris-Foulkes energy when running on one MPI process, compared to running on multiple processes. In conversation with @lionelalexandre it came up that he has been aware of this for some time. Other tests in the testsuite have a tolerance of 1e-4, so they might be missing this.

When running the code in the DDT debugger on myriad with one MPI process, we get a segfault in
CONQUEST-release/src/generic_comms.f90
Lines 1780 to 1782 in 6bf8f4a
I haven't yet found an obvious reason why.
`MPI_alltoallv` is complicated. Obviously, on one process it should be doing nothing.
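To make the single-process case concrete, here is a pure-Python model of the data movement that `MPI_Alltoallv` performs. It is only a sketch of the MPI standard's semantics, not CONQUEST code, and it does not use an MPI library; the function `alltoallv` and its argument layout (one list entry per rank, plus a hypothetical `recv_sizes` argument for the receive buffer lengths) are invented for illustration. In the standard, each rank also "sends" a block to itself, so on one process the call reduces to a local copy from the send buffer to the receive buffer. It does nothing only if that self count is zero, and the counts and displacements still have to stay inside both buffers. Whether that matters for the segfault above is not established here.

```python
def alltoallv(sendbufs, sendcounts, sdispls, recvcounts, rdispls, recv_sizes):
    """Model MPI_Alltoallv data movement for all ranks at once.

    Entry [i] of each argument belongs to rank i (illustrative layout):
    sendcounts[i][j] elements starting at sdispls[i][j] go from rank i to
    rank j, landing at rdispls[j][i] in rank j's receive buffer.
    """
    nranks = len(sendbufs)
    recvbufs = [[None] * recv_sizes[r] for r in range(nranks)]
    for src in range(nranks):
        for dst in range(nranks):
            n = sendcounts[src][dst]
            if n != recvcounts[dst][src]:
                raise ValueError(f"count mismatch between ranks {src} and {dst}")
            s0 = sdispls[src][dst]
            r0 = rdispls[dst][src]
            # A real MPI call would read or write out of bounds here.
            if s0 + n > len(sendbufs[src]) or r0 + n > len(recvbufs[dst]):
                raise IndexError(f"block {src}->{dst} exceeds a buffer")
            recvbufs[dst][r0:r0 + n] = sendbufs[src][s0:s0 + n]
    return recvbufs


# One rank: the only block is the self block, which is copied locally.
single = alltoallv(
    sendbufs=[[1, 2, 3, 4]],
    sendcounts=[[3]], sdispls=[[1]],
    recvcounts=[[3]], rdispls=[[0]],
    recv_sizes=[3],
)
print(single)
```

The bounds check is the part worth noting: with a single rank there is no remote partner to mask an inconsistent count or displacement, so any out-of-range value acts directly on local memory.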