TOPIC: MPI error with Intel 2021.3

MPI error with Intel 2021.3 3 years 3 months ago #38973

  • nicogodet
  • Expert Boarder
  • Posts: 157
  • Thank you received: 39
Hello,

I'm currently trying to use openTELEMAC with the Intel compilers.
Compilation finishes without errors, but when I run an example in parallel, I get this error:
Abort(604611844) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Irecv: Invalid tag, error stack:
PMPI_Irecv(162): MPI_Irecv(buf=0x7f006aea4080, count=1000, MPI_BYTE, src=1, tag=524288, MPI_COMM_WORLD, request=0x7f01468608e0) failed
PMPI_Irecv(96).: Invalid tag, value is 524288
Abort(269067524) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Irecv: Invalid tag, error stack:
PMPI_Irecv(162): MPI_Irecv(buf=0x7f7ecbf8d0c0, count=1000, MPI_BYTE, src=0, tag=524288, MPI_COMM_WORLD, request=0x7f7fa79378e0) failed
PMPI_Irecv(96).: Invalid tag, value is 524288
Abort(403285252) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Irecv: Invalid tag, error stack:
PMPI_Irecv(162): MPI_Irecv(buf=0x7f9b1b6f5600, count=336, MPI_BYTE, src=3, tag=524288, MPI_COMM_WORLD, request=0x7f9bf70a48e0) failed
PMPI_Irecv(96).: Invalid tag, value is 524288
Abort(632068) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Irecv: Invalid tag, error stack:
PMPI_Irecv(162): MPI_Irecv(buf=0x7f59ffb97040, count=592, MPI_BYTE, src=4, tag=524288, MPI_COMM_WORLD, request=0x7f5adb54b8e0) failed
PMPI_Irecv(96).: Invalid tag, value is 524288
Abort(134849796) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Irecv: Invalid tag, error stack:
PMPI_Irecv(162): MPI_Irecv(buf=0x7fd01d4b2080, count=592, MPI_BYTE, src=3, tag=524288, MPI_COMM_WORLD, request=0x7fd0f8e648e0) failed
PMPI_Irecv(96).: Invalid tag, value is 524288
Abort(671720708) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Irecv: Invalid tag, error stack:
PMPI_Irecv(162): MPI_Irecv(buf=0x7fe712c57a00, count=256, MPI_BYTE, src=6, tag=524288, MPI_COMM_WORLD, request=0x7fe7ee62e8e0) failed
PMPI_Irecv(96).: Invalid tag, value is 524288
Abort(873047300) on node 6 (rank 6 in comm 0): Fatal error in PMPI_Irecv: Invalid tag, error stack:
PMPI_Irecv(162): MPI_Irecv(buf=0x7f9a7a257040, count=464, MPI_BYTE, src=8, tag=524288, MPI_COMM_WORLD, request=0x7f9b55c248e0) failed
PMPI_Irecv(96).: Invalid tag, value is 524288
Abort(873047300) on node 7 (rank 7 in comm 0): Fatal error in PMPI_Irecv: Invalid tag, error stack:
PMPI_Irecv(162): MPI_Irecv(buf=0x7ff29902b040, count=504, MPI_BYTE, src=9, tag=524288, MPI_COMM_WORLD, request=0x7ff3749eb8e0) failed
PMPI_Irecv(96).: Invalid tag, value is 524288
Abort(201958660) on node 8 (rank 8 in comm 0): Fatal error in PMPI_Irecv: Invalid tag, error stack:
PMPI_Irecv(162): MPI_Irecv(buf=0x7f39be3feb00, count=752, MPI_BYTE, src=9, tag=524288, MPI_COMM_WORLD, request=0x7f3a99db88e0) failed
PMPI_Irecv(96).: Invalid tag, value is 524288
Abort(269067524) on node 9 (rank 9 in comm 0): Fatal error in PMPI_Irecv: Invalid tag, error stack:
PMPI_Irecv(162): MPI_Irecv(buf=0x7fa2c96e1600, count=752, MPI_BYTE, src=8, tag=524288, MPI_COMM_WORLD, request=0x7fa3a50958e0) failed
PMPI_Irecv(96).: Invalid tag, value is 524288
Traceback (most recent call last):
File "/home/telemac/telemac-mascaret/v8p2r1/scripts/python3/telemac2d.py", line 7, in <module>
main('telemac2d')
File "/home/telemac/telemac-mascaret/v8p2r1/scripts/python3/runcode.py", line 271, in main
run_study(cas_file, code_name, options)
File "/home/telemac/telemac-mascaret/v8p2r1/scripts/python3/execution/run_cas.py", line 157, in run_study
run_local_cas(my_study, options)
File "/home/telemac/telemac-mascaret/v8p2r1/scripts/python3/execution/run_cas.py", line 65, in run_local_cas
my_study.run(options)
File "/home/telemac/telemac-mascaret/v8p2r1/scripts/python3/execution/study.py", line 612, in run
self.run_local()
File "/home/telemac/telemac-mascaret/v8p2r1/scripts/python3/execution/study.py", line 440, in run_local
run_code(self.run_cmd, self.sortie_file)
File "/home/telemac/telemac-mascaret/v8p2r1/scripts/python3/execution/run.py", line 182, in run_code
raise TelemacException('Fail to run\n'+exe)
utils.exceptions.TelemacException: Fail to run
mpirun -np 10 /home/telemac/telemac-mascaret-benchmark/modeles/malpasset/Q100.cas_2021-08-12-14h25min06s/out_user_fortran

It always happens at the same point, regardless of the ncsize value.

System info:
  • CPU: Dual Intel Xeon Gold 6230R
  • OS: Ubuntu 20.04
  • openTELEMAC version: v8p2r1
  • Systel and pysource: gitlab.nicodet.fr/-/snippets/7 or see attachment

The OpenMPI configuration runs just fine.
Attachments:

MPI error with Intel 2021.3 3 years 3 months ago #39019

  • nicogodet
  • Expert Boarder
  • Posts: 157
  • Thank you received: 39
Any tips?

MPI error with Intel 2021.3 3 years 3 months ago #39020

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
It seems to be an Intel MPI issue.

See https://doku.lrz.de/display/PUBLIC/Intel+MPI, section "MPI Tag's Upper bound exceeded":

Intel MPI commonly changes the value of MPI_TAG_UB, which sometimes results in errors like this:

Fatal error in PMPI_Issend: Invalid tag, error stack:PMPI_Issend(156): MPI_Issend(buf=0x7fff88bd4518, count=1, MPI_INT, dest=0, tag=894323, MPI_COMM_WORLD, request=0x7fff88bd4438) failed
The reason for this error is that the program used more tags than Intel MPI allows by default, a limit which is itself much higher than what the MPI standard guarantees (32k). In the past, Intel kept the maximum number of tags quite high, but newer releases have significantly reduced it: the maximum tag value was reduced from 2G to 1G and finally to 0.5M in Intel MPI 2018.4, 2019.6 and 2019.7, respectively. The real solution is to adapt your program to what the standard guarantees (32k); however, there is a workaround available to change the default values.

It is possible to change (redistribute, to be more precise) the number of bits used for MPI tags, but one must then reduce the maximum number of MPI ranks: a total of 39 bits is shared between MPI tags and MPI ranks.

For example, one can get 2G (2^31) MPI tags, while the number of processes is reduced to 256 (2^8), by exporting the following environment variables:

export MPIR_CVAR_CH4_OFI_TAG_BITS=31

export MPIR_CVAR_CH4_OFI_RANK_BITS=8
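
For a TELEMAC run, here is a minimal sketch of how this can be applied, assuming the variables are exported (in the shell or in the pysource file) before the Python scripts spawn mpirun; the mpi4py one-liner is only an optional check and assumes mpi4py is installed:

# Sketch only: export before launching so the mpirun spawned by the TELEMAC
# scripts inherits the settings (31 tag bits + 8 rank bits = the 39 shared bits,
# i.e. ~2^31 tags but at most 2^8 = 256 MPI processes).
export MPIR_CVAR_CH4_OFI_TAG_BITS=31
export MPIR_CVAR_CH4_OFI_RANK_BITS=8
# Optional: print the effective MPI_TAG_UB (assumes mpi4py is installed).
mpirun -np 1 python3 -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_attr(MPI.TAG_UB))"
# Then rerun the failing case, e.g. the one from this thread:
telemac2d.py Q100.cas --ncsize=10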
Christophe
The following user(s) said Thank You: nicogodet, Kurum

MPI error with Intel 2021.3 3 years 3 months ago #39021

  • nicogodet
  • Expert Boarder
  • Posts: 157
  • Thank you received: 39
I searched for hours for a way to solve this and never found that site...

Exporting those 2 variables solved the issue.

Using the Intel suite seems to improve performance a lot:
from 2 min 18 s with the GNU compilers and OpenMPI down to 1 min 43 s with Intel, on 36 CPUs on malpasset-fine.
The following user(s) said Thank You: Kurum

MPI error with Intel 2021.3 2 years 10 months ago #39657

  • Kurum
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
This discussion helped me as well, thank you for sharing!

Would it be possible for you to share other cfg files (and the pysource files, if possible) for different combinations? I think I can sort things out if I see some examples. The cfg files that come with the source code seem quite old.

I am working in a similar environment:

CPU: Dual Xeon E5-2680
OS: Ubuntu 20.04
openTELEMAC version: v8p2r1

I have the Intel oneAPI 2022.0.1 compilers (Base Kit + HPC Kit).

Basically, I want to compile TELEMAC with:
  • mpiifort
  • OpenMPI built with the Intel compilers
  • GNU compilers with OpenMPI

I will then try to connect 3 workstations with the same specs as above using Mellanox InfiniBand cards, and then compile and run TELEMAC in that environment. If anyone has experience with this kind of setup, I'd like to hear your advice!

Cheers
Onur

MPI error with Intel 2021.3 2 years 10 months ago #39661

  • nicogodet
  • Expert Boarder
  • Posts: 157
  • Thank you received: 39
Hi,

I have linked both the cfg and pysource files.

You should look into SLURM to run computations on multiple nodes.
The EDF configurations in systel.cfg give examples of SLURM setup; a minimal sketch is shown below.
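
For what it's worth, a minimal, hypothetical SLURM batch script as a starting point (node and task counts, time limit and paths are placeholders to adapt to your cluster; telemac2d.py and --ncsize are the standard TELEMAC launcher and option):

#!/bin/bash
#SBATCH --job-name=telemac2d
#SBATCH --nodes=3                 # e.g. the 3 workstations mentioned above
#SBATCH --ntasks-per-node=16      # adjust to the cores available per node
#SBATCH --time=02:00:00
# Sketch only: source the environment (compilers, MPI, HOMETEL, SYSTELCFG, ...)
source /path/to/pysource.intel.sh
# Launch the study on all allocated tasks
telemac2d.py my_case.cas --ncsize=$SLURM_NTASKS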
Attachments:
The following user(s) said Thank You: Kurum

MPI error with Intel 2021.3 2 years 10 months ago #39666

  • Kurum
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
Thanks a lot! Cheers

MPI error with Intel 2021.3 2 years 9 months ago #39697

  • Kurum
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
I noticed something peculiar. When using Intel mpiifort, reading the TPXO data, specifically this part:
INITIALISATION BASED ON TPXO:
 C_ID(           1 ) = m2
 C_ID(           2 ) = s2
 C_ID(           3 ) = n2
 C_ID(           4 ) = k2
 C_ID(           5 ) = k1
 C_ID(           6 ) = o1
 C_ID(           7 ) = p1
 C_ID(           8 ) = q1
 C_ID(           9 ) = mm
 C_ID(          10 ) = mf
 C_ID(          11 ) = m4
 C_ID(          12 ) = mn4
 C_ID(          13 ) = ms4
 C_ID(          14 ) = 2n2
 C_ID(          15 ) = s1
  - ACQUIRING LEVELS
  - INTERPOLATING LEVELS
  - ACQUIRING VELOCITIES
  - INTERPOLATING VELOCITIES
 END OF TPXO INITIALISATION

This part, before the run starts, took almost 2 hours with Intel versus about 10 minutes with OpenMPI/GNU.

Has anyone else noticed such behaviour? I guess this may have something to do with the compiler flags used for Intel. I am using the cfg file attached by nicogodet in this thread.