Hi Everyone
I'm running Telemac2d in parallel on both my 8-core Windows workstation and a multi-node Linux cluster, and I'm getting the same problem with each setup:
When using 5 or fewer parallel processors, Telemac2d completes the run fine, but when 6 or more processors are used the job is aborted. Here is part of the screen output for one of the aborted jobs (on Windows, with 6 processors):
ITERATION 0 TIME: 0.0000 S
STREAMLINE: USING PARALLEL VERSION OF CHARACTERISTICS 6.1
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
job aborted:
rank: node: exit code[: error message]
0: localhost: 123
1: localhost: 123
2: localhost: 157: process 2 exited without calling finalize
3: localhost: 123
4: localhost: 123
5: localhost: 123
Duree du calcul : 0 secondes ( 0:0:0 ) (systeme=0 sec)
The processor that fails (or processors, when more than 8 are used) is always the one handling the same general area of the mesh, but I don't understand why a problem with the mesh (e.g. ill-conditioned elements) would only show up when more than 5 processors are used.
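In case it helps anyone compare runs, here is a minimal Python sketch of how the failing rank can be picked out of the "job aborted" table. It assumes the table format shown in the output above (rank, node, exit code, optional message); the function names and the regular expression are my own illustration, not part of Telemac or the MPI launcher.

```python
import re

# Matches rows of the "rank: node: exit code[: error message]" table,
# e.g. "2: localhost: 157: process 2 exited without calling finalize".
# The pattern is an assumption based on the output pasted above.
ROW = re.compile(r"^\s*(\d+):\s*([^\s:]+):\s*(-?\d+)(?::\s*(.*))?$")


def parse_abort_table(log_text):
    """Return a list of (rank, node, exit_code, message) tuples."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            rank, node, code, message = match.groups()
            rows.append((int(rank), node, int(code), message or ""))
    return rows


def ranks_with_message(rows):
    """Ranks whose row carries an error message."""
    return [rank for rank, _node, _code, message in rows if message]


SAMPLE = """job aborted:
rank: node: exit code[: error message]
0: localhost: 123
1: localhost: 123
2: localhost: 157: process 2 exited without calling finalize
3: localhost: 123
4: localhost: 123
5: localhost: 123
"""

if __name__ == "__main__":
    rows = parse_abort_table(SAMPLE)
    print("ranks reporting an error message:", ranks_with_message(rows))
```

In the listing above only rank 2 carries a message, while the other ranks simply report exit code 123, so rank 2 is the one I have been mapping back to the mesh partition.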
Any help would be much appreciated,
Thanks
Steve