
TOPIC: error polling LP CQ with status REMOTE ACCESS ERROR status number 10

error polling LP CQ with status REMOTE ACCESS ERROR status number 10 1 year 3 months ago #43029

  • tbe
Hello, when running a fairly heavy telemac3d simulation on our Linux cluster, the run crashes at about the same point, after several hours of computation, with the following error message:

"error polling LP CQ with status REMOTE ACCESS ERROR status number 10"

It seems that the number of MPI calls has reached the maximum value of a signed 32-bit integer (int32), which is 2,147,483,647, or approximately 2 billion. We are fairly certain that the problem is related to InfiniBand, and it might be fixed by using certain options when compiling Open MPI, but we currently do not know how to fix it.
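To make the suspected limit concrete, here is a minimal Python sketch of what happens to a signed 32-bit counter when it passes its maximum. It does not touch MPI or InfiniBand at all; the helper name `next_count_int32` is made up for illustration, and whether Open MPI actually keeps such a counter is only our assumption.

```python
import ctypes

# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1  # 2,147,483,647


def next_count_int32(count: int) -> int:
    """Increment a counter the way a signed 32-bit integer would.

    Hypothetical helper for illustration only: ctypes.c_int32 does no
    overflow checking, so the value wraps around past INT32_MAX.
    """
    return ctypes.c_int32(count + 1).value


print(INT32_MAX)                    # the limit itself
print(next_count_int32(INT32_MAX))  # wraps to the most negative int32
```

If something in the communication layer counts operations in an int32, the wrap to a negative value at this point would be one plausible way for a run to fail only after roughly 2 billion calls.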

Has anyone else ever had this issue?

Thanks in advance
Tom

error polling LP CQ with status REMOTE ACCESS ERROR status number 10 1 year 3 months ago #43042

  • tbe
Further to this post, we have run some simple tests in which we call P_SUM in a while loop. The program crashes after about 2 billion iterations, which again points to an int32 limit on the number of MPI calls.
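As a rough cross-check of the int32 hypothesis, the time needed to reach the limit can be estimated from the rate of MPI calls. The sketch below is plain arithmetic; the function name `hours_until_int32_limit` and the rate of 100,000 calls per second are hypothetical placeholders, not values measured in our runs.

```python
# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1


def hours_until_int32_limit(calls_per_second: float) -> float:
    """Hours of run time before a call counter starting at 0 reaches INT32_MAX.

    Hypothetical helper: assumes a constant call rate for the whole run.
    """
    if calls_per_second <= 0:
        raise ValueError("calls_per_second must be positive")
    return INT32_MAX / calls_per_second / 3600.0


# Placeholder rate, NOT a measurement: replace with the rate seen in a real run.
assumed_rate = 100_000
print(f"{hours_until_int32_limit(assumed_rate):.2f} hours")
```

If the estimate obtained with a measured call rate matches the point at which the simulation crashes, that would support the idea that a 32-bit counter is involved.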

Any help appreciated!
