Hey
I am writing this message on behalf of one of our TELEMAC2D users. He submitted a compute job on 5 (Intel Skylake) nodes, and after about 34 hours it fails with the following error. This has happened more than once.
|runCode: Fail to run
|mpirun --hostfile $PBS_NODEFILE -np 180 /data/Only_HFs99/T2d_testing02.cas_2020-05-30-19h18min52s/out_telemac2d --map-by core
|~~~~~~~~~~~~~~~~~~
|mlx5: r26i27n07: got completion with error:
|00000000 00000000 00000000 00000000
|00000000 00000000 00000000 00000000
|00000006 00000000 00000000 00000000
|00000000 00008813 0800547d 9ad7f8d3
|[[59434,1],67][btl_openib_component.c:3529:handle_wc] from r26i27n07 to: r26i27n08 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 272c500 opcode 1 vendor error 136 qp_idx
r26i27n07 and r26i27n08 are two of the five compute nodes involved in the job. If you need additional information, please just let me know.
I would like to know:
- have you seen this before? (I found an old forum ticket with similar keywords, but I could not access it!)
- what is causing this?
- is it possible to avoid this with a more relaxed MPI communication timeout setting?
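In case it helps the discussion: since the error comes from btl_openib_component.c, the job appears to use Open MPI's openib BTL over InfiniBand. As a diagnostic (not a fix I have verified for this case), one thing we could try is forcing the transport away from openib via MCA parameters, e.g.:

```shell
# Diagnostic sketch: rerun the same job but exclude the openib BTL,
# so Open MPI falls back to TCP. Much slower, but if the job then
# survives past 34 h, it points at the InfiniBand/RDMA layer.
# ($PBS_NODEFILE and the out_telemac2d path are from the original job.)
mpirun --hostfile $PBS_NODEFILE -np 180 \
       --mca btl ^openib \
       /data/Only_HFs99/T2d_testing02.cas_2020-05-30-19h18min52s/out_telemac2d \
       --map-by core
```

Would that kind of test be useful, or do you suspect something else (e.g. memory registration limits on the nodes)?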
Kind regards
Ehsan