
TOPIC: Parallel Inconsistency

Parallel Inconsistency 8 years 1 week ago #24335

  • kingja.x
Hi,

I am running v6p3r2 on my university cluster and have experienced a few strange issues which some of you may be able to shed some light on.

Initially I was running my model using 8 cores. I began to refine my mesh incrementally to track any problems, and after introducing a couple of piers (my domain is coastal) I hit a problem whereby the model would not converge (see the example section from the log file below). After trying the suggestions in post #14821 I still got the same problem.
ITERATION    71736    TIME:    3 D  6 H  3 MN  45.3001 S   (   281025.3001 S)
--------------------------------------------------------------------------------
                          ADVECTION STEP
--------------------------------------------------------------------------------
                    DIFFUSION-PROPAGATION STEP
 GRACJG (BIEF) :        2 ITERATIONS, RELATIVE PRECISION:   0.2467060E-06
--------------------------------------------------------------------------------
                          K-EPSILON MODEL
 GRACJG (BIEF) : EXCEEDING MAXIMUM ITERATIONS:      50 RELATIVE PRECISION:             NaN
 GRACJG (BIEF) : EXCEEDING MAXIMUM ITERATIONS:      50 RELATIVE PRECISION:   0.7667694E+14
--------------------------------------------------------------------------------
                       BALANCE OF WATER VOLUME
     VOLUME IN THE DOMAIN :   0.5383677E+34 M3
     FLUX BOUNDARY    1:   -0.4003055E+42 M3/S  ( >0 : ENTERING  <0 : EXITING )
     RELATIVE ERROR IN VOLUME AT T =       0.2810E+06 S :   -0.6102714E-08
     MAXIMUM COURANT NUMBER:    0.2548289E-04
     TIME-STEP                 :   0.2602325E-11
 GRACJG (BIEF) : EXCEEDING MAXIMUM ITERATIONS:      50 RELATIVE PRECISION:   0.1189718E+38
 GRACJG (BIEF) : EXCEEDING MAXIMUM ITERATIONS:      50 RELATIVE PRECISION:    30573.14
 GRACJG (BIEF) : EXCEEDING MAXIMUM ITERATIONS:      50 RELATIVE PRECISION:             NaN
 GRACJG (BIEF) : EXCEEDING MAXIMUM ITERATIONS:      50 RELATIVE PRECISION:   0.1713843E+18
 GRACJG (BIEF) : EXCEEDING MAXIMUM ITERATIONS:      50 RELATIVE PRECISION:   0.8415477E+39
 GRACJG (BIEF) : EXCEEDING MAXIMUM ITERATIONS:      50 RELATIVE PRECISION:    120754.4

However, when I run the same model using 2, 4 or 10 cores I do not get this problem and the model runs to completion without issue. I suspect this may be a boundary problem related to how the domain is split up. I haven't yet looked into the source code to find the cause, as for now I'm happy to accept that the model doesn't work with 8 cores. Does anyone have a suggestion as to what the cause may be?
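To illustrate the partitioning hypothesis, here is a minimal sketch in Python (not TELEMAC code; TELEMAC's actual decomposition is done by METIS via PARTEL on the 2D mesh). Splitting the same node range into different numbers of subdomains places the interfaces at different nodes, so each core count produces different halo exchanges and a different communication pattern:

```python
def partition(n_nodes, n_procs):
    """Return the (start, end) node range owned by each processor.

    Simple 1-D block partition: the remainder nodes go to the
    first few processors, one extra node each.
    """
    base, extra = divmod(n_nodes, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# Same 142010-node mesh, different core counts: every split puts the
# subdomain interfaces in different places.
for nproc in (2, 4, 8, 10):
    print(nproc, partition(142010, nproc)[:2], "...")
```

A bug that only shows up for one particular arrangement of interface nodes (e.g. an interface cutting through the refined region near the piers) would then fail for one core count and not another, which matches the behaviour described above.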

More interestingly, however, the job submissions using 2, 4 and 10 cores were identical other than the number of processors used, yet they give slightly different results. When loading into Blue Kenue:
10 cores:
This appears to be a byte swapped file.
2D Grid Geometry:
     142010 nodes, 280638 elements
Scanning Variable: VELOCITY UV
   Min: 0, Max: 3.1713
Scanning Variable: WATER DEPTH
   Min: -0.11113, Max: 65
Scanning Variable: FREE SURFACE
   Min: -4.9323, Max: 6.1287
Scanning Variable: BOTTOM
   Min: -62.037, Max: 5.4512

4 cores:
This appears to be a byte swapped file.
2D Grid Geometry:
     142010 nodes, 280638 elements
Scanning Variable: VELOCITY UV
   Min: 0, Max: 3.1615
Scanning Variable: WATER DEPTH
   Min: -0.3174, Max: 65
Scanning Variable: FREE SURFACE
   Min: -4.9336, Max: 6.1279
Scanning Variable: BOTTOM
   Min: -62.037, Max: 5.4512

2 cores:
This appears to be a byte swapped file.
2D Grid Geometry:
     142010 nodes, 280638 elements
Scanning Variable: VELOCITY UV
   Min: 0, Max: 3.1834
Scanning Variable: WATER DEPTH
   Min: -0.063878, Max: 64.999
Scanning Variable: FREE SURFACE
   Min: -4.9327, Max: 6.1274
Scanning Variable: BOTTOM
   Min: -62.037, Max: 5.4512

On the whole there isn't an issue with the results, as the differences are so small, but inspecting my results there is consistently a variation of 1-2 cm in water levels throughout my domain. Does anyone know what may be causing this and how to resolve it?

Many thanks
Jonathan

Parallel Inconsistency 8 years 1 week ago #24344

  • jmhervouet
Hello,

Minor differences between runs with different numbers of processors are possible due to truncation errors, especially in cases of bifurcations, such as the von Karman eddies behind bridge piers. Your case with 8 processors rather looks like a bug in parallelism, so the first question before going further: have you programmed anything special in your Fortran file? In that case we would need to see it. It would also be safer to check with more recent versions.

With best regards,

Jean-Michel Hervouet

Parallel Inconsistency 8 years 5 days ago #24351

  • kingja.x
Hi Jean-Michel,

Okay, thanks for that. As I said, the differences are relatively small, so they don't make a massive difference.

No, I haven't made any modifications to the code at all. Could it be that I am using an older version of the model and it is just a small bug which has since been corrected?

I haven't moved to v7p0 or v7p1 yet as they both seem to have a bug preventing the date/time from displaying correctly in Blue Kenue animations (there is a thread on this somewhere).

Many thanks
Jonathan