Hey
I am writing this message on behalf of one of our TELEMAC2D users. He submitted a compute job on 5 (Intel Skylake) nodes, and after about 34 hours it fails with the following error. This has happened more than once.
|runCode: Fail to run
|mpirun --hostfile $PBS_NODEFILE -np 180 /data/Only_HFs99/T2d_testing02.cas_2020-05-30-19h18min52s/out_telemac2d --map-by core
|~~~~~~~~~~~~~~~~~~
|mlx5: r26i27n07: got completion with error:
|00000000 00000000 00000000 00000000
|00000000 00000000 00000000 00000000
|00000006 00000000 00000000 00000000
|00000000 00008813 0800547d 9ad7f8d3
|[[59434,1],67][btl_openib_component.c:3529:handle_wc] from r26i27n07 to: r26i27n08 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 272c500 opcode 1 vendor error 136 qp_idx
r26i27n07 and r26i27n08 are two of the five compute nodes involved in the job. If you need additional information, please just let me know.
I would like to know:
- have you seen this before? (I found an old forum ticket with similar keywords, but I could not access it!)
- what is causing this?
- is it possible to avoid this with a more relaxed MPI communication timeout setting?
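In case it helps the discussion: since the error comes from btl_openib_component.c, the job appears to use Open MPI's openib BTL over InfiniBand. As a diagnostic (not a fix I have verified for this case), one thing we could try is forcing the transport away from openib via MCA parameters, e.g.:

```shell
# Diagnostic sketch: rerun the same job but exclude the openib BTL,
# so Open MPI falls back to TCP. Much slower, but if the job then
# survives past 34 h, it points at the InfiniBand/RDMA layer.
# ($PBS_NODEFILE and the out_telemac2d path are from the original job.)
mpirun --hostfile $PBS_NODEFILE -np 180 \
       --mca btl ^openib \
       /data/Only_HFs99/T2d_testing02.cas_2020-05-30-19h18min52s/out_telemac2d \
       --map-by core
```

Would that kind of test be useful, or do you suspect something else (e.g. memory registration limits on the nodes)?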
Kind regards
Ehsan