TOPIC: MPI Error: btl_openib_component.c

MPI Error: btl_openib_component.c 4 years 4 months ago #36263

  • yugi
  • openTELEMAC Guru
  • Posts: 851
  • Thank you received: 244
Did you check the listings for all processors?

I see that you are limiting the RAM; maybe this could be the issue? Could you try increasing it?

By the way, did you know that you can automate the generation of the PBS file? (You can have a look at eole.intel in systel.edf.cfg.)
There are 10 types of people in the world: those who understand binary, and those who don't.

MPI Error: btl_openib_component.c 4 years 4 months ago #36264

  • Ehsan
I see that you are limiting the RAM; maybe this could be the issue? Could you try increasing it?

@yugi: Normally, when a process runs out of memory, we get a relevant error message in stderr. That is not the case here, but it is still worth examining.

@Sardar: You have two options to increase the memory per process, but I do not know how much RAM you expect to need for your specific experiment:
  1. using thin nodes (188 GB per node), you would lower ppn and increase pmem, so that ppn x pmem <= 188 GB. I leave it to you to experiment.
  2. alternatively, you can use the bigmem nodes (760 GB per node), keep ppn=36 and raise pmem to 21gb. If that is still not enough, you can switch to the superdome machine, which offers up to 50 GB per core/process.
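
To make the ppn x pmem constraint concrete, here is a rough Python sketch of the arithmetic. The helper name max_pmem_gb and the trial ppn values are only illustrative assumptions; the node sizes are the ones quoted above, and the scheduler or operating system may reserve part of that memory, so treat the results as upper bounds.

```python
def max_pmem_gb(node_mem_gb: int, ppn: int) -> int:
    """Largest whole-GB pmem such that ppn * pmem <= node_mem_gb.

    Hypothetical helper for illustration only; it is not part of
    TELEMAC or of any scheduler tooling.
    """
    if ppn <= 0:
        raise ValueError("ppn must be a positive integer")
    return node_mem_gb // ppn


# Node sizes quoted in this thread (GB per node).
THIN_NODE_GB = 188
BIGMEM_NODE_GB = 760

# Trial ppn values are arbitrary examples, not recommendations.
for ppn in (36, 18, 9):
    print(f"thin node,   ppn={ppn:2d}: pmem <= {max_pmem_gb(THIN_NODE_GB, ppn)}gb")

print(f"bigmem node, ppn=36: pmem <= {max_pmem_gb(BIGMEM_NODE_GB, 36)}gb")
```

On a thin node, halving ppn roughly doubles the memory available to each process, at the cost of needing more nodes for the same number of processes.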

Have you ever run similar tasks on other machines/clusters successfully to the end? If so, how much memory did you typically need?

MPI Error: btl_openib_component.c 4 years 4 months ago #36267

  • yugi
  • openTELEMAC Guru
  • Posts: 851
  • Thank you received: 244
We usually take the whole node so we do not have to specify the memory.

MPI Error: btl_openib_component.c 4 years 4 months ago #36268

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Nevertheless, the amount of available memory is huge.
So maybe this is an indicator of a memory leak...
Christophe

MPI Error: btl_openib_component.c 4 years 4 months ago #36269

Dear Yugi and Coulet,

Thank you very much for your support! I will rerun the simulations with more memory allocated. Let's see what happens.

@Yugi I also checked the listings for all processors. There is no error message, except that some processors' listings are further ahead than others.

MPI Error: btl_openib_component.c 2 years 7 months ago #40224

  • Htun Pyae Sone
  • Junior Boarder
  • Posts: 38
  • Thank you received: 2
Dear all,

I am sorry to reopen this topic, but I am encountering this problem again and have not been able to solve it. We use "slurm" on our university's HPC. The problem does not occur in every simulation, only in a very few. We updated the slurm version, and the new slurm system requires a memory allocation. Previously, we used to take the whole node's resources. Did you find any solution or workaround for this? Thank you very much in advance.

Best regards,
Htun.