
TOPIC: Method of characteristics failing using multi-node partition on CRAY H

Method of characteristics failing using multi-node partition on CRAY H 2 weeks 1 day ago #46288

  • chrisold
  • Fresh Boarder
  • Posts: 7
  • Thank you received: 1
I have encountered an issue when running Telemac3d on the University of Edinburgh Archer2 HPC (CRAY architecture) across multiple nodes.

My models run without error when using all cores on a single node (128 cores/node), but when I partition across multiple nodes I get an FPE error that traces back to the MPI_ALLTOALLV call in the streamline solver when using the method of characteristics to solve the advection equations.
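For context, this is not the actual TELEMAC Fortran source, just a minimal, self-contained C sketch of the MPI_ALLTOALLV call pattern the streamline solver relies on when characteristic foot points cross partition boundaries. The counts and buffers are made up for illustration, but they show the send/receive counts and displacements that every rank has to agree on before the collective runs:

/* Minimal sketch (not TELEMAC source) of an MPI_Alltoallv exchange of
 * characteristic foot-point data between sub-domains. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pretend each rank owes one "lost" characteristic to every other rank. */
    int *sendcounts = malloc(size * sizeof(int));
    int *recvcounts = malloc(size * sizeof(int));
    int *sdispls    = malloc(size * sizeof(int));
    int *rdispls    = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) sendcounts[i] = 1;

    /* Step 1: agree on how much each pair of ranks will exchange. */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

    /* Step 2: build displacements from the counts; if these are inconsistent
     * across ranks the collective reads or writes out of bounds. */
    int stot = 0, rtot = 0;
    for (int i = 0; i < size; i++) {
        sdispls[i] = stot;  stot += sendcounts[i];
        rdispls[i] = rtot;  rtot += recvcounts[i];
    }

    double *sendbuf = malloc(stot * sizeof(double));
    double *recvbuf = malloc(rtot * sizeof(double));
    for (int i = 0; i < stot; i++) sendbuf[i] = (double)rank;

    /* Step 3: the variable-size exchange that the backtrace points at. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                  recvbuf, recvcounts, rdispls, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0)
        printf("rank 0 received %d foot-point values\n", rtot);

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}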

The same models run successfully on the University of Edinburgh CIRRUS HPC system across multiple nodes.

If I change the advection scheme the issue goes away, so it is related in some way to the streamline solver and how its message passing behaves.
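By "change the advection scheme" I mean switching away from characteristics in the steering file, for example as below (keyword names and option values quoted from memory, so please check them against your release's dictionary):

/ 1 = method of characteristics (streamline solver, MPI_ALLTOALLV exchange in parallel)
/ 5 = explicit MURD PSI scheme, which avoids that exchange
SCHEME FOR ADVECTION OF VELOCITIES = 5
SCHEME FOR ADVECTION OF TRACERS    = 5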

Has anyone else encountered this problem running Telemac3d on a CRAY architecture?

Are there any quirks related to the streamline solver that could lead to an FPE error in the MPI_ALLTOALLV call?

The attached file gives an example of the backtrace generated by the error.

Thanks,
Chris

Method of characteristics failing using multi-node partition on CRAY H 1 week 3 days ago #46341

  • pham
  • Administrator
  • Posts: 1622
  • Thank you received: 619
Hello,

How many triangles are there in your 2D mesh?
How many horizontal planes do you use?

What are the characteristics of the Edinburgh CIRRUS HPC system (number of cores per node), and on how many cores do you run the model?

Can you upload the steering file you use?

Chi-Tuan

Method of characteristics failing using multi-node partition on CRAY H 17 hours 53 minutes ago #46406

  • chrisold
  • Fresh Boarder
  • Posts: 7
  • Thank you received: 1
Hi Chi-Tuan,

Archer support tracked the issue down to a difference in how the MPI operation is handled between nodes compared with between cores on the same node. We needed to enable a separate transport layer to manage the inter-node MPI traffic, and this solved the problem.

Thanks,
Chris