
TOPIC: MPI Error: btl_openib_component.c

MPI Error: btl_openib_component.c 2 years 6 months ago #40293

  • Htun Pyae Sone
Hi,

I am writing to ask for help with the following error:
[[50369,1],98][btl_openib_component.c:3556:handle_wc] from g153 to: g152 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 246c580 opcode 1  vendor error 136 qp_idx 0
We encountered this error while running telemac2d. I have read the discussion of this error in #36052 under the Telemac2d System module, but that discussion did not arrive at a clear solution.
The error:
  • occurs only in some particular cases (not in all simulations).
  • occurs only when running on multiple nodes. We ran the same simulation on a single node and the error did not occur.
  • started after our university's HPC cluster updated Slurm. In the previous discussion, @c.coulet mentioned in #36268 that it might be an indicator of a memory leak. Our old Slurm setup allocated the whole node without any memory limit, whereas the new setup requires us to specify mem-per-cpu. The cluster has enough memory for our simulation; the memory usage is less than 1%.
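To check whether the failure always involves the same pair of nodes, the error lines from several job logs could be tallied. Below is a minimal Python sketch, assuming the message format stays exactly as quoted above. The function names and field names are illustrative labels chosen for this sketch, not part of Open MPI or TELEMAC.

```python
import re
from collections import Counter

# Pattern for the openib completion-queue error line quoted in this post.
# The group names (src, dst, status, ...) are labels chosen for this sketch.
WC_ERROR = re.compile(
    r"\[btl_openib_component\.c:(?P<line>\d+):handle_wc\]\s+"
    r"from (?P<src>\S+) to: (?P<dst>\S+) "
    r"error polling (?P<cq>\w+) CQ with status (?P<status>.+?) "
    r"status number (?P<status_no>\d+) for wr_id (?P<wr_id>[0-9a-fA-F]+) "
    r"opcode (?P<opcode>\d+)\s+vendor error (?P<vendor>\d+) "
    r"qp_idx (?P<qp_idx>\d+)"
)


def parse_wc_error(text):
    """Return the fields of one error line as a dict, or None if no match."""
    match = WC_ERROR.search(text)
    return match.groupdict() if match else None


def count_node_pairs(lines):
    """Count how often each (source node, destination node) pair appears."""
    pairs = Counter()
    for line in lines:
        record = parse_wc_error(line)
        if record:
            pairs[(record["src"], record["dst"])] += 1
    return pairs


# The error line from this post, used as sample input.
SAMPLE = (
    "[[50369,1],98][btl_openib_component.c:3556:handle_wc] from g153 to: g152 "
    "error polling LP CQ with status REMOTE ACCESS ERROR status number 10 "
    "for wr_id 246c580 opcode 1  vendor error 136 qp_idx 0"
)

if __name__ == "__main__":
    print(parse_wc_error(SAMPLE))
    print(count_node_pairs([SAMPLE]))
```

If the same node pair shows up in every failing run, that would point towards a specific host or link rather than the simulation setup. Lines that do not match the pattern are simply skipped.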
We are still using v7p3r1 because this is a follow-up to an older project. I have tried several things and have run out of ideas. The error is also very expensive to reproduce, since it only occurs after at least 20 hours of simulation time. Our build environment is:
  • AlmaLinux 8.5 (Linux 4.18.0-348.7.1.el8_5.x86_64)
  • slurm 21.08.4
  • gcc 7.5.0
  • openmpi 3.0.0
I would be very glad if someone could help me debug or find the cause of the problem. Thank you very much in advance.
:(
Kind regards,
Htun