
TOPIC: Simulation with source regions crashes in parallel

Simulation with source regions crashes in parallel 3 days 6 hours ago #46184

  • N_Strahl
  • Fresh Boarder
  • Posts: 17
I have a strange case where my simulation with source regions crashes when I increase the parallel processor count. I am observing this in v8p5 but I just tried the most recent version and I'm having the same problem.

Anything below 5 processors runs without problems. Using 6 processors sometimes runs and sometimes crashes. Anything over 6 processors definitely crashes (on my machine, which has 12 cores). This is what I get:
CORFON (TELEMAC2D): NO BOTTOM SMOOTHINGS

 INITIALISATION IREG=           1
 INITIALISATION IREG=           2
 INITIALISATION IREG=           3
 INITIALISATION IREG=           4
 INITIALISATION IREG=           5
 INITIALISATION IREG=           6
 INITIALISATION IREG=           7
 INITIALISATION IREG=           8
 INITIALISATION IREG=           9
 SOURCE REGION:    1 CONTAINS THE NODE:    7796
 SOURCE REGION:    2 CONTAINS THE NODE:     700
 SOURCE REGION:    6 CONTAINS THE NODE:     181

================================================================================
 ITERATION        0    TIME:   0.0000 S
 TELEMAC2D INITIALIZED

                  ********************************
                  *                              *
                  *        HLLC SCHEME           *
                  *     FIRST ORDRE IN SPACE     *
                  *                              *
                  ********************************

job aborted:
[ranks] message

[0-3] terminated

[4] application aborted
aborting MPI_COMM_WORLD (comm=0x44000000), error 2, comm rank 4

[5] terminated

---- error analysis -----

[4] on
test.cas_2025-02-02-16h26min17s\out_telemac2d.exe aborted the job. abort code 2 


What I have also noticed is that when I run on one processor I get a complete listing of the nodes for all source regions:
 INITIALISATION IREG=           1
 INITIALISATION IREG=           2
 INITIALISATION IREG=           3
 INITIALISATION IREG=           4
 INITIALISATION IREG=           5
 INITIALISATION IREG=           6
 INITIALISATION IREG=           7
 INITIALISATION IREG=           8
 INITIALISATION IREG=           9
 SOURCE REGION:    1 CONTAINS THE NODE:    7796
 SOURCE REGION:    2 CONTAINS THE NODE:     700
 SOURCE REGION:    3 CONTAINS THE NODE:     557
 SOURCE REGION:    4 CONTAINS THE NODE:    2190
 SOURCE REGION:    5 CONTAINS THE NODE:    6385
 SOURCE REGION:    6 CONTAINS THE NODE:     181
 SOURCE REGION:    7 CONTAINS THE NODE:    1138
 SOURCE REGION:    8 CONTAINS THE NODE:     402
 SOURCE REGION:    9 CONTAINS THE NODE:    1641

================================================================================

But if I run on 2 Processors I only get:
 INITIALISATION IREG=           1
 INITIALISATION IREG=           2
 INITIALISATION IREG=           3
 INITIALISATION IREG=           4
 INITIALISATION IREG=           5
 INITIALISATION IREG=           6
 INITIALISATION IREG=           7
 INITIALISATION IREG=           8
 INITIALISATION IREG=           9
 SOURCE REGION:    3 CONTAINS THE NODE:     557
 SOURCE REGION:    4 CONTAINS THE NODE:    2190
 SOURCE REGION:    7 CONTAINS THE NODE:    1138
 SOURCE REGION:    8 CONTAINS THE NODE:     402
 SOURCE REGION:    9 CONTAINS THE NODE:    1641

================================================================================

And then the simulation starts. Only about half of the node listing is printed (5 of the 9 regions). I don't know if this might be related to the issue. I have attached the full example.
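To compare listings from different runs without reading them by eye, the two kinds of lines quoted above can be parsed and checked against each other. The following is a minimal Python sketch, not part of TELEMAC: the function names are made up for illustration, and the patterns assume the listing format stays exactly as printed in this thread. A region missing from one listing does not by itself prove it was lost, since it may be reported elsewhere.

```python
import re

# Patterns copied from the listing format quoted in this thread (an assumption:
# other TELEMAC versions may print these lines differently).
IREG_RE = re.compile(r"INITIALISATION IREG=\s*(\d+)")
NODE_RE = re.compile(r"SOURCE REGION:\s*(\d+)\s+CONTAINS THE NODE:\s*(\d+)")


def regions_in_listing(text):
    """Return (declared, found): the region numbers that were initialised,
    and a dict mapping region number -> node number reported for it."""
    declared = sorted({int(m.group(1)) for m in IREG_RE.finditer(text)})
    found = {int(m.group(1)): int(m.group(2)) for m in NODE_RE.finditer(text)}
    return declared, found


def missing_regions(text):
    """Regions that were initialised but have no 'CONTAINS THE NODE' line."""
    declared, found = regions_in_listing(text)
    return [region for region in declared if region not in found]


if __name__ == "__main__":
    # The 2-processor listing quoted above, reduced to the relevant lines.
    sample = "\n".join(
        [f" INITIALISATION IREG= {i:11d}" for i in range(1, 10)]
        + [
            " SOURCE REGION:    3 CONTAINS THE NODE:     557",
            " SOURCE REGION:    4 CONTAINS THE NODE:    2190",
            " SOURCE REGION:    7 CONTAINS THE NODE:    1138",
            " SOURCE REGION:    8 CONTAINS THE NODE:     402",
            " SOURCE REGION:    9 CONTAINS THE NODE:    1641",
        ]
    )
    print(missing_regions(sample))  # prints [1, 2, 5, 6]
```

Running the same check on every listing of a parallel run and taking the union of the regions found would show whether any region is reported by no processor at all.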



Thanks in advance.
The administrator has disabled public write access.

Simulation with source regions crashes in parallel 3 days 6 hours ago #46185

  • N_Strahl
The case was not uploaded properly somehow.

Simulation with source regions crashes in parallel 3 days 6 hours ago #46186

  • N_Strahl

File Attachment:

File Name: test.zip
File Size: 732 KB

Simulation with source regions crashes in parallel 2 days 5 hours ago #46187

  • pham
  • Administrator
  • Posts: 1604
  • Thank you received: 612
Hello,

I have an error when I run your simulation with 1 core:
SOURCE REGION: 1 CONTAINS THE NODE: 7796
SOURCE REGION: 2 CONTAINS THE NODE: 700
SOURCE REGION: 3 CONTAINS THE NODE: 557
SOURCE REGION: 4 CONTAINS THE NODE: 2190
SOURCE REGION: 5 CONTAINS THE NODE: 6385
SOURCE REGION: 6 CONTAINS THE NODE: 181
SOURCE REGION: 7 CONTAINS THE NODE: 1138
SOURCE REGION: 8 CONTAINS THE NODE: 402

THE REGION 9IS OUTSIDE THE DOMAIN
OR IT DOES NOT CONTAIN ANY NODE OF THE MESH
For region #9, the polygon does not seem to include any node; for me, this is an error.
Why do you use sources defined by regions if you only need 1 point source each time? The sources-by-regions feature is useful when you want to manage several sources located on different nodes that are close to each other.

Anyway, it is good to know that in parallel a listing file is written for every subdomain, and it contains only "local" information (except for the balance). This can look strange if you are not aware of it. You can read these listings in the temporary folder; they are named PE***.LOG, with the subdomain number in the name.
E.g. if an error message appears, it is only written in the listing of the subdomain(s) where the error occurs, which helps to find the subdomain concerned. Have a look at the PE*.LOG files, in particular if one of them has a different size than the others.
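The check described above can be scripted. This is a rough Python sketch under stated assumptions: the glob pattern `PE*.LOG` is inferred from the file names mentioned here, the error markers are simply strings that appear in the log excerpts of this thread, and the function names are hypothetical. It lists each subdomain log with its size and flags the ones that contain a marker or whose size differs from the most common size.

```python
from collections import Counter
from pathlib import Path

# Strings seen in the log excerpts of this thread; extend the tuple as needed.
ERROR_MARKERS = ("PLANTE", "MPI ERROR", "OUTSIDE THE DOMAIN")


def scan_subdomain_logs(folder, pattern="PE*.LOG"):
    """Return one dict per log file: its name, size in bytes, and the
    stripped lines that contain one of the error markers."""
    reports = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(errors="replace")
        hits = [
            line.strip()
            for line in text.splitlines()
            if any(marker in line for marker in ERROR_MARKERS)
        ]
        reports.append(
            {"name": path.name, "size": path.stat().st_size, "hits": hits}
        )
    return reports


def suspicious(reports):
    """Logs with an error marker, or with a size unlike the most common one.

    The size test is a crude heuristic: if every log has a different size,
    all but one of them will be flagged."""
    if not reports:
        return []
    common_size = Counter(r["size"] for r in reports).most_common(1)[0][0]
    return [r for r in reports if r["hits"] or r["size"] != common_size]


if __name__ == "__main__":
    import sys

    # Pass the temporary folder of the run as the first argument.
    for report in suspicious(scan_subdomain_logs(sys.argv[1])):
        print(report["name"], report["size"], report["hits"])
```

The script only reads the logs, so it can be run on the temporary folder of a crashed job without disturbing it.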

Hope this helps,

Chi-Tuan

Simulation with source regions crashes in parallel 1 day 11 hours ago #46195

  • N_Strahl
I agree with your point about just defining the source points individually. I have done this and the error no longer occurs. However, for this model I will also have to add some source regions (alongside the points) in the future, which is why I am providing the points as a regions file. Is there a way to define both point sources and source regions? I believe it is only possible to define one or the other.

I checked the region 9 polygon and it does in fact contain a mesh point, so I cannot understand why it is not being captured in parallel.
I showed my listing with 1 processor where it clearly says:
SOURCE REGION: 9 CONTAINS THE NODE: 1641

I tried expanding the rectangle and also changing the orientation, to no avail.
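Whether a polygon contains a mesh node can also be checked outside TELEMAC. Below is a minimal Python sketch of a standard ray-casting point-in-polygon test. It is not TELEMAC's own algorithm, the function names are hypothetical, and the coordinates have to be taken from the regions file and the mesh of the actual case.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if the point (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices given in order; either orientation
    works. Points lying exactly on an edge may be classified either way.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def nodes_in_region(nodes, polygon):
    """Return the node numbers whose coordinates fall inside the polygon.

    nodes maps node number -> (x, y)."""
    return [
        number
        for number, (x, y) in nodes.items()
        if point_in_polygon(x, y, polygon)
    ]
```

Because the orientation of the polygon does not affect the result of this test, it gives a reference answer that is independent of how the rectangle was drawn.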

Simulation with source regions crashes in parallel 1 day 10 hours ago #46196

  • N_Strahl
Just to add a little more context:
 INITIALISATION IREG=           1
 INITIALISATION IREG=           2
 INITIALISATION IREG=           3
 INITIALISATION IREG=           4
 INITIALISATION IREG=           5
 INITIALISATION IREG=           6
 INITIALISATION IREG=           7
 INITIALISATION IREG=           8
 INITIALISATION IREG=           9
 SOURCE REGION:    3 CONTAINS THE NODE:     557

================================================================================
 ITERATION        0    TIME:   0.0000 S
 TELEMAC2D INITIALIZED

                  ********************************
                  *                              *
                  *        HLLC SCHEME           *
                  *     FIRST ORDRE IN SPACE     *
                  *                              *
                  ********************************
 WAIT_PARACO:
 MPI ERROR           17



 PLANTE: PROGRAM STOPPED AFTER AN ERROR

I can see WAIT_PARACO: MPI ERROR 17 in one of the log files, so I believe the problem could lie with MPI.
I am using the current version of MS-MPI on Windows 11, by the way.