Possible bug when running on one MPI process #321

Open
tkoskela opened this issue Feb 7, 2024 · 6 comments

tkoskela commented Feb 7, 2024

There may be a bug in the MPI communication that only appears when running on one process. I'm collecting hints in this issue.

In test_004 of f-exx-opt we see a difference on the order of 1e-5 in the Harris-Foulkes energy when running on one MPI process, compared to running on multiple processes. In conversation with @lionelalexandre it came up that he has been aware of this for some time. Other tests in the test suite have a tolerance of 1e-4, so they could be missing a difference of this size.

When running the code in the DDT debugger on myriad with one MPI process, we get a segfault in

call MPI_alltoallv( sendarray, sndcounts, snddisps, MPI_double_complex, &
recvarray, rcvcounts, rcvdisps, MPI_double_complex, &
MPI_comm_world, ierror )

I haven't yet found an obvious reason why. MPI_alltoallv is complicated, but on one process there is nothing to communicate, so it should amount to at most a local copy from sendarray to recvarray.
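To make the expected single-process behaviour concrete, here is a minimal pure-Python model of the data movement that MPI_Alltoallv describes. It is a sketch, not MPI and not Conquest code: the function name `alltoallv_model` and its argument layout are hypothetical. It assumes displacements are zero-based offsets in units of the datatype, and that the count rank i sends to rank j matches the count rank j expects from rank i. Counts or displacements that run past a buffer are one possible way to get a crash inside the MPI library, but that is a guess here, not a diagnosis.

```python
def alltoallv_model(send_bufs, send_counts, send_displs,
                    recv_counts, recv_displs, recv_sizes):
    """Pure-Python model of MPI_Alltoallv data movement (hypothetical helper).

    Index [i] selects rank i's arguments; [i][j] is rank i's count or
    displacement for the block it exchanges with rank j.
    """
    nproc = len(send_bufs)
    recv_bufs = [[None] * recv_sizes[i] for i in range(nproc)]
    for src in range(nproc):
        for dst in range(nproc):
            n = send_counts[src][dst]
            # The sender's count must match what the receiver expects.
            if n != recv_counts[dst][src]:
                raise ValueError(f"count mismatch for {src} -> {dst}")
            s0 = send_displs[src][dst]
            r0 = recv_displs[dst][src]
            # Reject blocks that would read or write outside a buffer.
            if s0 < 0 or s0 + n > len(send_bufs[src]):
                raise IndexError(f"send block {src} -> {dst} out of bounds")
            if r0 < 0 or r0 + n > len(recv_bufs[dst]):
                raise IndexError(f"recv block {src} -> {dst} out of bounds")
            recv_bufs[dst][r0:r0 + n] = send_bufs[src][s0:s0 + n]
    return recv_bufs


# One process: the only block is rank 0 -> rank 0, so the call is a local copy.
send = [[1 + 2j, 3 + 4j, 5 + 6j]]
result = alltoallv_model(send, [[3]], [[0]], [[3]], [[0]], [3])
print(result[0])
```

In the real code, printing sndcounts, snddisps, rcvcounts and rcvdisps together with the allocated sizes of sendarray and recvarray just before the call would allow the same kind of bounds check.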

tkoskela added the labels "needs: diagnosis" and "type: bug" on Feb 7, 2024
tkoskela self-assigned this on Feb 7, 2024

tkoskela commented Feb 7, 2024

On my Ubuntu 22.04 laptop with gcc 11.4.0, on the develop branch at commit 78325c9, I get:

One MPI process

test_001_bulk_Si_1proc_Diag/Conquest_out:      |* Harris-Foulkes energy   =       -33.679289916138416 Ha
test_002_bulk_Si_1proc_OrderN/Conquest_out:      |* Harris-Foulkes energy   =       -33.569389500697085 Ha
test_003_bulk_BTO_polarisation/Conquest_out:      |* Harris-Foulkes energy   =      -136.657600397396351 Ha

Two MPI processes

test_001_bulk_Si_1proc_Diag/Conquest_out:      |* Harris-Foulkes energy   =       -33.679289916138359 Ha
test_002_bulk_Si_1proc_OrderN/Conquest_out:      |* Harris-Foulkes energy   =       -33.569389501217962 Ha
test_003_bulk_BTO_polarisation/Conquest_out:      |* Harris-Foulkes energy   =      -136.657600376430253 Ha

Four MPI processes

test_001_bulk_Si_1proc_Diag/Conquest_out:      |* Harris-Foulkes energy   =       -33.679289916138323 Ha
test_002_bulk_Si_1proc_OrderN/Conquest_out:      |* Harris-Foulkes energy   =       -33.569389497534459 Ha
test_003_bulk_BTO_polarisation/Conquest_out:      |* Harris-Foulkes energy   =      -136.657601071024885 Ha

The largest relative differences between these are on the order of 1e-9 (in test_003), so the develop branch seems to be OK.
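The relative differences can be checked with a short script. This is a sketch that uses only the energies listed above; the helper name `relative_spread` and the choice of metric (largest minus smallest value, divided by the largest magnitude) are my own assumptions, not part of the Conquest test suite.

```python
# Harris-Foulkes energies (Ha) copied from the runs above, keyed by MPI process count.
energies = {
    "test_001": {1: -33.679289916138416, 2: -33.679289916138359, 4: -33.679289916138323},
    "test_002": {1: -33.569389500697085, 2: -33.569389501217962, 4: -33.569389497534459},
    "test_003": {1: -136.657600397396351, 2: -136.657600376430253, 4: -136.657601071024885},
}


def relative_spread(values):
    """Return (max - min) / max(|v|) for a collection of energies."""
    values = list(values)
    return (max(values) - min(values)) / max(abs(v) for v in values)


spread = {name: relative_spread(runs.values()) for name, runs in energies.items()}
for name, value in spread.items():
    print(f"{name}: relative spread = {value:.2e}")
```

With this metric test_003 has the largest spread, a few times 1e-9, while test_001 agrees to nearly double precision across process counts.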


tkoskela commented Mar 27, 2024

On myriad, with the current develop branch, I'm getting a segfault in test_001 with 1 MPI process. With 2 processes it runs fine.

This also seems to be related to MPI_alltoallv:

$ mpirun -np 1 ../../bin/Conquest
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
Conquest           000000000080EC7A  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B9D1547B630  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17CB2A28  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17A60A8B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17AAD56B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17A8C780  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17B8E36F  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B9D17A61A55  PMPI_Alltoallv        Unknown  Unknown
libmpifort.so.12.  00002B9D17650DDA  pmpi_alltoallv__      Unknown  Unknown
Conquest           00000000004CAEC3  Unknown               Unknown  Unknown
Conquest           00000000004E53C0  Unknown               Unknown  Unknown
Conquest           00000000004E4B80  Unknown               Unknown  Unknown
Conquest           00000000004CFF85  Unknown               Unknown  Unknown
Conquest           00000000004F9A45  Unknown               Unknown  Unknown
Conquest           00000000004F7FC0  Unknown               Unknown  Unknown
Conquest           00000000005F30AE  Unknown               Unknown  Unknown
Conquest           00000000005F8952  Unknown               Unknown  Unknown
Conquest           0000000000411522  Unknown               Unknown  Unknown
Conquest           0000000000411492  Unknown               Unknown  Unknown
libc-2.17.so       00002B9D1969B555  __libc_start_main     Unknown  Unknown
Conquest           00000000004113A9  Unknown               Unknown  Unknown
[cceaosk@node-d97a-005 test_001_bulk_Si_1proc_Diag]$ mpirun -np 2 ../../bin/Conquest
[cceaosk@node-d97a-005 test_001_bulk_Si_1proc_Diag]$ 

Built with

[cceaosk@node-d97a-005 src]$ module list
Currently Loaded Modulefiles:
  1) beta-modules                      5) libxc/6.2.2/intel-2022            9) userscripts/1.4.0                13) python3/3.11
  2) gcc-libs/10.2.0                   6) gerun                            10) openssl/1.1.1u
  3) compilers/intel/2022.2            7) git/2.41.0-lfs-3.3.0             11) python/3.11.4
  4) mpi/intel/2021.6.0/intel          8) emacs/28.1                       12) openblas/0.3.7-serial/gnu-4.9.2

and system.myriad.make:

# Set compilers
FC=mpif90
F77=mpif77

# OpenMP flags
# Set this to "OMPFLAGS= " if compiling without openmp
# Set this to "OMPFLAGS= -fopenmp" if compiling with openmp
OMPFLAGS= -fopenmp

# Set BLAS and LAPACK libraries
# MacOS X
# BLAS= -lvecLibFort
# Intel MKL use the Intel tool
# Generic
#BLAS= -llapack -lblas

# LibXC: choose between LibXC compatibility below or Conquest XC library

# Conquest XC library
#XC_LIBRARY = CQ
#XC_LIB =
#XC_COMPFLAGS =

# LibXC compatibility
# Choose LibXC version: v4 (deprecated) or v5/6 (v5 and v6 have the same interface)
# XC_LIBRARY = LibXC_v4
#XC_LIB = -L/shared/ucl/apps/libxc/4.2.3/intel-2018/lib -lxcf90 -lxc
#XC_COMPFLAGS = -I/shared/ucl/apps/libxc/4.2.3/intel-2018/include
XC_LIBRARY = LibXC_v5
XC_LIB = -lxcf90 -lxc
XC_COMPFLAGS = -I/usr/local/include

# Compilation flags
# NB for gcc10 you need to add -fallow-argument-mismatch
COMPFLAGS= -O3 -g $(OMPFLAGS) $(XC_COMPFLAGS) -I"${MKLROOT}/include"
COMPFLAGS_F77= $(COMPFLAGS)

# Set FFT library
FFT_LIB=-lmkl_rt
FFT_OBJ=fft_fftw3.o

# Full library call; remove scalapack if using dummy diag module
# If using OpenMPI, use -lscalapack-openmpi instead.
#LIBS= $(FFT_LIB) $(XC_LIB) -lscalapack $(BLAS)
LIBS= $(FFT_LIB) $(XC_LIB)

# Linking flags
LINKFLAGS= -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -ldl $(OMPFLAGS) $(XC_LIB)
ARFLAGS=

# Matrix multiplication kernel type
MULT_KERN = ompGemm_m
# Use dummy DiagModule or not
DIAG_DUMMY =
# Use dummy omp_module or not.
# Set this to "OMP_DUMMY = DUMMY" if compiling without openmp
# Set this to "OMP_DUMMY = " if compiling with openmp
OMP_DUMMY = 

@tkoskela

TODO: @tkoskela to test if this happens on ARCHER2

@tkoskela

I ran benchmarks/matrix_multiply and tests 001 and 002 on ARCHER2 with 1 and 2 MPI ranks. No segfaults.

On ARCHER2 I build with cray-mpich, so possibly this is a bug related to Intel MPI?

@tkoskela

I activated the CI for f-exx-opt and in test_004 there is a 1e-5 relative difference in the output when running on 1 MPI process, which is causing the test to fail.

https://github.com/OrderN/CONQUEST-release/actions/runs/8815200505/job/24208939487

@lionelalexandre

The bug arises from the use of FFTW in the calculation of exact exchange. I reproduced it with an old version of Conquest that predates the GitHub repository. After switching to FFTE (a Japanese FFT implementation, I think) the problem goes away. I suspect the problem is in the init/destroy steps, but I can't be sure.
