Hello all,
I am trying to set up telemac on a cluster of Windows 8 x64 PCs with multicore CPUs, using a Windows/mpich2/gfortran/python setup. I can run parallel jobs in local-machine mode, but when I try to split the computation across the network, it fails. I can run the mpich2 example program cpi.exe without problems across any combination of hosts and number of processes, so communication/authentication between the hosts is not an issue. (I have restricted the ports used by mpich2 and opened the relevant incoming ports on all hosts.)
I have made sure that I run the computation from a network share, and I have used the mpich2 -map option to map a new network drive to the case folder. I suspect that mpich2 is sensitive to how the network share is defined, so I have paid attention to this part of the configuration.
This is the command I use to run the computation:
mpiexec.exe -env MPICH_PORT_RANGE 10000:11000 -map t:\\atlas\Company\DataDisk\Work\opentelemac\201_malpasset -wdir <wdir> -machinefile t:\mpi_hosts.txt -n <ncsize> <exename>
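For anyone reproducing this from a script, here is a minimal sketch of assembling the same command as an argument list, which avoids quoting problems with the backslashes in the UNC path. The function name build_mpiexec_args is hypothetical, the <wdir> and <exename> placeholders are kept exactly as above, and the process count of 2 is only an illustrative value.

```python
def build_mpiexec_args(share_map, wdir, machinefile, ncsize, exename,
                       port_range="10000:11000"):
    """Assemble the mpiexec argument list using only the flags shown above.

    Hypothetical helper for illustration; not part of mpich2 or telemac.
    """
    return [
        "mpiexec.exe",
        "-env", "MPICH_PORT_RANGE", port_range,  # restrict the ports mpich2 uses
        "-map", share_map,                       # drive letter mapped to the share
        "-wdir", wdir,                           # working directory on every host
        "-machinefile", machinefile,             # list of hosts
        "-n", str(ncsize),                       # number of processes
        exename,
    ]


args = build_mpiexec_args(
    share_map=r"t:\\atlas\Company\DataDisk\Work\opentelemac\201_malpasset",
    wdir="<wdir>",            # placeholder, as in the original command
    machinefile=r"t:\mpi_hosts.txt",
    ncsize=2,                 # illustrative value only
    exename="<exename>",      # placeholder, as in the original command
)
print(" ".join(args))
```

With real values in place of the placeholders, the list could be passed directly to subprocess.run(args) without building a single quoted string.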
When I run a computation across the network with the local host included, I get the following error (if all the hosts are remote, I get no specific error message):
job aborted:
rank: node: exit code[: error message]
0: thor: 123
1: hal9000: -1073741515: process 1 exited without calling init while other processes have called init
thor is the local host, hal9000 is a network host.
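The negative exit code reported for hal9000 is easier to look up once it is written as an unsigned 32-bit hex value. Below is a minimal sketch, assuming the exit code is a Windows NTSTATUS value; the helper name to_ntstatus and the small lookup table are illustrative and not part of mpich2 or telemac.

```python
def to_ntstatus(exit_code: int) -> str:
    """Return the unsigned 32-bit hex form of a signed Windows exit code."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# A couple of well-known NTSTATUS values (as defined in Microsoft's ntstatus.h).
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
}

code = -1073741515  # exit code reported for rank 1 on hal9000
name = KNOWN_STATUS.get(code & 0xFFFFFFFF, "unknown")
print(to_ntstatus(code), name)  # prints: 0xC0000135 STATUS_DLL_NOT_FOUND
```

If that reading is right, the process on hal9000 could not load a DLL it depends on, which would fit the "exited without calling init" message. Which DLL is missing (for example a gfortran or mpich2 runtime library) is not visible from this output and would need to be checked on that host.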
The rest of the run output is attached.
Am I missing something? Has anybody else managed to run a parallel job on a Windows cluster? Any help would be appreciated.
Regards,
Costas