
TOPIC: Partitioning across multiple nodes on an HPC system

Partitioning across multiple nodes on an HPC system 2 months 3 weeks ago #45457

  • chrisold
  • Fresh Boarder
  • Posts: 5
I am trying to run telemac3d on the University of Edinburgh Cirrus HPC system. I have run into an issue with partitioning a model across multiple compute nodes.

Each compute node has 36 cores. I can run a model with 36 partitions distributed across multiple nodes, but when I try to use more than 36 partitions the partitioning fails: it produces partitions with one or no nodes in them, and the memory allocation then fails with a munmap_chunk(): invalid pointer error.

Has anyone come across this issue when running telemac3d on an HPC system?
The administrator has disabled public write access.

Partitioning across multiple nodes on an HPC system 2 months 3 weeks ago #45458

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Hi,
Did you try to run a test case?
Could you give us more detail on the command, and maybe attach some log files?

Regards
Christophe

Partitioning across multiple nodes on an HPC system 2 months 3 weeks ago #45462

  • chrisold
  • Fresh Boarder
  • Posts: 5
Hi Christophe,

I have not run a test case, but my model ran OK on a single node of the cluster using 36 cores.

I also ran the model with 36 partitions across 3 nodes, using 12 cores on each node.

The problem arises when I ask for more than 36 partitions, e.g. 40 partitions on 2 nodes with 20 tasks per node fails to partition correctly.

I am using metis as the partitioner.

I have added the config file used to compile and run telemac3d on Cirrus.

To check what has been happening, I use the command

telemac3d.py -w temp --ncsize 72 --ncnode 2 --nctile 36 obm_lr_20141008.cas --split
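Before digging into the partitioner itself, it can be worth ruling out a simple mismatch between the three core-count options in the command above. The sketch below assumes that --ncsize is the total number of partitions, --ncnode the number of compute nodes and --nctile the number of cores used per node, so that ncsize = ncnode * nctile; the function name check_layout and the cores_per_node argument are hypothetical and only serve the illustration.

```python
def check_layout(ncsize, ncnode, nctile, cores_per_node):
    """Return a list of problems found in a requested core layout.

    Assumes ncsize = ncnode * nctile (total partitions = nodes * cores
    used per node). An empty list means no inconsistency was detected.
    """
    problems = []
    if ncnode * nctile != ncsize:
        problems.append(
            f"ncnode * nctile = {ncnode * nctile}, but ncsize = {ncsize}"
        )
    if nctile > cores_per_node:
        problems.append(
            f"nctile = {nctile} exceeds the {cores_per_node} cores per node"
        )
    return problems


# The two layouts mentioned in this thread, on 36-core nodes.
print(check_layout(72, 2, 36, 36))  # prints []
print(check_layout(40, 2, 20, 36))  # prints []
```

Both layouts in the thread pass this check, which suggests the failure is not caused by inconsistent options.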

This generates the message shown in the file partition_error.txt (I have sanitised the file to remove accounts, usernames, etc.)

Cirrus support have suggested trying parmetis instead.

Let me know if you need more information.

Thanks,
Chris
Attachments:

Partitioning across multiple nodes on an HPC system 2 months 3 weeks ago #45464

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Could we have a look at the partel_T3DGEO.log file?
This is the log file of the partitioning step.

Regards
Christophe

Partitioning across multiple nodes on an HPC system 2 months 3 weeks ago #45465

  • chrisold
  • Fresh Boarder
  • Posts: 5
Log file attached - it has a large number of isolated boundary points. These correspond to the invalid partition files (T3DPAR00071-000**).

Thanks
Attachments:

Partitioning across multiple nodes on an HPC system 2 months 3 weeks ago #45466

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Change the extension from .log to .txt.
Christophe

Partitioning across multiple nodes on an HPC system 2 months 3 weeks ago #45467

  • chrisold
  • Fresh Boarder
  • Posts: 5
Sorry - knew I should do this but forgot...
Attachments:

Partitioning across multiple nodes on an HPC system 2 months 3 weeks ago #45468

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Even though there are some mentions of isolated points, the log shows that the partitioning itself completed correctly.
As some of the writing depends on memory flushes, I would say the problem is elsewhere, in the step that splits / copies the other input files.
This step depends on the various files you have in your case.
Try looking in the temp directory to see which file is not fully partitioned.

Hope this helps
Christophe
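A check along these lines can be scripted rather than done by eye. The following is a minimal sketch, not part of the TELEMAC tooling: it assumes that each partitioned file is named as a prefix followed by ncsize - 1 and the partition rank, both zero-padded to five digits (a pattern inferred from the T3DPAR00071-000** names quoted earlier in this thread for ncsize = 72). The function names and the min_bytes threshold are hypothetical.

```python
from pathlib import Path


def expected_names(prefix, ncsize):
    """Build the assumed partition file names, e.g. T3DPAR00071-00000."""
    return [f"{prefix}{ncsize - 1:05d}-{rank:05d}" for rank in range(ncsize)]


def find_suspect_parts(directory, prefix, ncsize, min_bytes=1):
    """Return (name, reason) pairs for missing or suspiciously small files."""
    suspects = []
    for name in expected_names(prefix, ncsize):
        path = Path(directory) / name
        if not path.is_file():
            suspects.append((name, "missing"))
        elif path.stat().st_size < min_bytes:
            suspects.append((name, "too small"))
    return suspects


# Example call (paths and prefixes depend on the case being run):
# for name, reason in find_suspect_parts("temp", "T3DPAR", 72):
#     print(name, reason)
```

Running it once per input-file prefix in the temp directory should show which file, if any, was not split into all of its parts. Raising min_bytes above the default may help flag parts that exist but hold almost no data.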

Partitioning across multiple nodes on an HPC system 2 months 3 weeks ago #45469

  • chrisold
  • Fresh Boarder
  • Posts: 5
Thanks, I will dig deeper into what is happening at each step of the process.

Partitioning across multiple nodes on an HPC system 2 months 3 weeks ago #45475

  • manojkg
  • Fresh Boarder
  • Posts: 5
  • Thank you received: 2
Piggybacking on the question here: is it possible to do multi-node parallel computation with the precompiled version of TELEMAC for Windows?
Moderators: pham
