TOPIC: Parallel - Telemac fails when more than 5 processors used

Parallel - Telemac fails when more than 5 processors used 12 years 6 months ago #4524

  • SteveHaynes
Hi Everyone

I'm running Telemac2d in parallel on both my 8-core Windows workstation and a multi-node Linux cluster, and I'm getting the same problem with each setup:
With 5 or fewer parallel processors Telemac2d completes the run fine, but with 6 or more processors the job is aborted. Here is part of the screen output for one of the aborted jobs (on Windows):

ITERATION 0 TIME: 0.0000 S
STREAMLINE: USING PARALLEL VERSION OF CHARACTERISTICS 6.1
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)

job aborted:
rank: node: exit code[: error message]
0: localhost: 123
1: localhost: 123
2: localhost: 157: process 2 exited without calling finalize
3: localhost: 123
4: localhost: 123
5: localhost: 123
Duree du calcul : 0 secondes ( 0:0:0 ) (systeme=0 sec)

The processor that fails (or processors, when more than 8 are used) always handles the same general area of the mesh, but I don't understand why a problem with the mesh (e.g. ill-conditioned elements) would only show up with more than 5 processors.

Any help would be much appreciated,
Thanks
Steve

Re: Parallel - Telemac fails when more than 5 processors used 12 years 6 months ago #4531

  • jmhervouet
Hello,

Parallelism has been tested up to more than 10000 processors, so it is probably either a problem of installation or something in your specific programming. The error message is not from Telemac.
You can try:

* to have DEBUGGER : 1 in your parameter file (to see where approximately it crashes)
* to remove any option that uses the method of characteristics (Thompson and advection schemes). A user already reported a problem with one compiler, as we seem to rely on the way Fortran structures are stored in memory.
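For reference, the first suggestion would look something like this in the steering (parameter) file. This is only a sketch: the DEBUGGER line is taken from the list above, the PARALLEL PROCESSORS value is simply the failing case from the first post, and the rest of the file is assumed unchanged.

```
/ Sketch of a steering file fragment; everything else stays as it was
DEBUGGER : 1
PARALLEL PROCESSORS : 6
```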

With best regards,

Jean-Michel Hervouet

Re: Parallel - Telemac fails when more than 5 processors used 12 years 6 months ago #4620

  • SteveHaynes
Hi Jean-Michel, thanks for your reply.

It turns out there were two separate problems limiting our use of parallel processors.

The first limited us to 5 or fewer processors:
You were right about the problem being related to our specific programming.
BORD was trying to extract data from the formatted data file for subdomain segments that did not border any open boundary, and this was causing the crash. We added the line "IF(NPTFR.GT.0.AND.ANY(LIHBOR==5)) THEN" to our code so that the file is only opened when necessary.
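In case it helps anyone hitting the same thing, here is a minimal standalone sketch of that guard. It is not our actual BORD code: NPTFR and LIHBOR are the names from the line above, while NEEDS_FILE, OPEN_CODE and the sample arrays are made up for illustration.

```fortran
! Standalone sketch of the guard, not the real BORD subroutine.
! NEEDS_FILE, OPEN_CODE and the LIH_* arrays are hypothetical.
PROGRAM GUARD_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: OPEN_CODE = 5   ! boundary code tested in the post
  INTEGER :: LIH_A(4), LIH_B(3), LIH_C(0)

  LIH_A = (/ 2, 5, 5, 2 /)   ! subdomain with some points coded 5
  LIH_B = (/ 2, 2, 2 /)      ! boundary points, but none coded 5
  ! LIH_C has zero size: a subdomain with no boundary points at all

  PRINT *, 'A needs file: ', NEEDS_FILE(SIZE(LIH_A), LIH_A)
  PRINT *, 'B needs file: ', NEEDS_FILE(SIZE(LIH_B), LIH_B)
  PRINT *, 'C needs file: ', NEEDS_FILE(SIZE(LIH_C), LIH_C)

CONTAINS

  LOGICAL FUNCTION NEEDS_FILE(NPTFR, LIHBOR)
    INTEGER, INTENT(IN) :: NPTFR
    INTEGER, INTENT(IN) :: LIHBOR(NPTFR)
    ! Same test as in the post: only touch the data file when this
    ! subdomain has boundary points and at least one is coded 5.
    NEEDS_FILE = (NPTFR .GT. 0 .AND. ANY(LIHBOR == OPEN_CODE))
  END FUNCTION NEEDS_FILE

END PROGRAM GUARD_DEMO
```

Only case A should report true. ANY on a zero-size array is false, so a subdomain with no boundary points never reaches the file-reading branch.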

The second problem stopped us from using more than 23 processors and was also related to the formatted data file:
Telemac (presumably partel) was copying the 24th copy of the formatted data file incorrectly, replacing phase values with zeros for some reason. We solved this by hardcoding the tidal data into a Fortran module at the start of the princi file and then referencing it with USE in the BORD subroutine. This makes inputting harmonic tidal data a little bit more tricky, but it solved our problem.
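Roughly, the layout looks like the sketch below. Everything in it is an assumption for illustration: the module name TIDAL_DATA, the array names, the numbers (placeholders, not our real harmonic constants) and the cosine sum used to show the data being read. In the real file the USE statement sits inside BORD rather than in a demo program.

```fortran
! Hypothetical layout; all names and numbers are placeholders.
MODULE TIDAL_DATA
  IMPLICIT NONE
  INTEGER, PARAMETER :: NCONST = 2
  ! Placeholder values only, not real harmonic constants
  DOUBLE PRECISION, PARAMETER :: AMP(NCONST)    = (/ 1.0D0, 0.5D0 /)
  DOUBLE PRECISION, PARAMETER :: PHASE(NCONST)  = (/ 30.0D0, 60.0D0 /)
  DOUBLE PRECISION, PARAMETER :: PERIOD(NCONST) = (/ 44714.0D0, 43200.0D0 /)
END MODULE TIDAL_DATA

PROGRAM TIDE_DEMO
  USE TIDAL_DATA          ! in practice this line goes inside BORD
  IMPLICIT NONE
  PRINT *, 'Level at t = 0 s:    ', TIDE_LEVEL(0.0D0)
  PRINT *, 'Level at t = 3600 s: ', TIDE_LEVEL(3600.0D0)

CONTAINS

  ! Assumed form: sum of AMP*COS(2*PI*T/PERIOD - PHASE), PHASE in degrees
  DOUBLE PRECISION FUNCTION TIDE_LEVEL(T)
    DOUBLE PRECISION, INTENT(IN) :: T
    DOUBLE PRECISION, PARAMETER :: PI = 3.141592653589793D0
    DOUBLE PRECISION :: S
    INTEGER :: I
    S = 0.0D0
    DO I = 1, NCONST
      S = S + AMP(I) * COS(2.0D0*PI*T/PERIOD(I) - PHASE(I)*PI/180.0D0)
    END DO
    TIDE_LEVEL = S
  END FUNCTION TIDE_LEVEL

END PROGRAM TIDE_DEMO
```

Because the constants are compiled into the executable, every processor sees identical values and nothing depends on the per-subdomain copies of the data file.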

These fixes removed the processor ceiling and we're now running parallel jobs with up to 96 processors, success!

Thanks again
Steve