TOPIC: Unable to run parallel jobs in a Windows cluster

Unable to run parallel jobs in a Windows cluster 11 years 6 months ago #8722

  • cyamin
  • openTELEMAC Guru
  • Posts: 997
  • Thank you received: 234
Hello all,

I am trying to set up TELEMAC on a cluster of Windows 8 x64 PCs with multicore CPUs. I am using a Windows/mpich2/gfortran/python setup. I can run parallel jobs on the local machine, but when I try to split the computation across the network, it fails. I can run the mpich2 example program cpi.exe without problems across any combination of hosts and number of processes, so communication/authentication between the hosts is not the issue. (I have restricted the ports used by mpich2 and opened the relevant incoming ports on all hosts.)

I have made sure that I run the computation from a network share. I have used the mpich2 -map option to map a new network drive to the case folder. I suspect that mpich2 is sensitive to how the network share is defined, so I have paid close attention to that part of the configuration.

This is the command I use to run the computation:
mpiexec.exe -env MPICH_PORT_RANGE 10000:11000 -map t:\\atlas\Company\DataDisk\Work\opentelemac\201_malpasset -wdir <wdir> -machinefile t:\mpi_hosts.txt -n <ncsize> <exename>
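For readers who want to reproduce the call from a script, here is a minimal Python sketch that assembles the same argument list. The option names and the share path are taken from the command above; the function name and the values used for the working directory, process count and executable are hypothetical stand-ins for the <wdir>, <ncsize> and <exename> placeholders, which the thread does not spell out.

```python
def build_mpiexec_args(port_range, drive_map, wdir, machinefile, ncsize, exename):
    """Assemble the mpiexec.exe arguments in the same order as the command above."""
    return [
        "mpiexec.exe",
        "-env", "MPICH_PORT_RANGE", port_range,
        "-map", drive_map,
        "-wdir", wdir,
        "-machinefile", machinefile,
        "-n", str(ncsize),
        exename,
    ]


args = build_mpiexec_args(
    port_range="10000:11000",
    drive_map=r"t:\\atlas\Company\DataDisk\Work\opentelemac\201_malpasset",
    wdir=r"t:\case_tmp",            # hypothetical value for <wdir>
    machinefile=r"t:\mpi_hosts.txt",
    ncsize=4,                       # hypothetical value for <ncsize>
    exename="out.exe",              # hypothetical value for <exename>
)

# Print the command line so it can be compared with the one quoted above.
print(" ".join(args))
```

Keeping the arguments in a list rather than one long string makes it easier to pass them to a process launcher without worrying about how the backslashes in the share path are quoted.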

When I run a computation across the network with the local host included, I get the following error (if all the hosts are remote, I get no specific error message):
job aborted:
rank: node: exit code[: error message]
0: thor: 123
1: hal9000: -1073741515: process 1 exited without calling init while other processes have called init
thor is the local host, hal9000 is a network host.
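A note on reading the negative exit code reported for hal9000: Windows exit codes of this size are usually easier to interpret as unsigned 32-bit hexadecimal values. The sketch below does that conversion. The value 0xC0000135 is documented by Microsoft as STATUS_DLL_NOT_FOUND; whether a missing runtime DLL on the remote host is really the cause here is an assumption, not something the thread confirms. The function name and the small lookup table are illustrative only.

```python
def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


# Illustrative lookup table with a single well-known NTSTATUS value.
KNOWN_STATUS = {
    0xC0000135: "STATUS_DLL_NOT_FOUND",
}

exit_code = -1073741515          # exit code reported for rank 1 on hal9000
status = to_unsigned32(exit_code)

print(hex(status), KNOWN_STATUS.get(status, "unknown"))
# prints: 0xc0000135 STATUS_DLL_NOT_FOUND
```

If that reading applies, one thing to check would be whether the DLLs the executable depends on can be found on the remote host's PATH or next to the executable on the mapped drive.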

The rest of the run output is attached:

File Attachment: run_output.txt (2 KB)

Am I missing something? Has anybody else managed to run a parallel job in a Windows cluster? Any help would be appreciated.
Regards,
Costas

Unable to run parallel jobs in a Windows cluster 11 years 6 months ago #8726

  • yugi
  • openTELEMAC Guru
  • Posts: 851
  • Thank you received: 244
Hi

Could you post your systel.cfg file as well?
Could you also try adding the following line to your case file and rerunning it?
DEBUGGER : 1

Thanks,
Yoann
There are 10 types of people in the world: those who understand binary, and those who don't.

Unable to run parallel jobs in a Windows cluster 11 years 6 months ago #8727

  • cyamin
I uncommented the DEBUGGER option (it was already in the cas file) but I didn't notice any difference in the output. I attach the new output and the configuration file as well.

File Attachment: systel-gfortran.cfg_2013-05-16.txt (2 KB)
File Attachment: run_output_2.txt (2 KB)


Thank you for your help.
Costas

Unable to run parallel jobs in a Windows cluster 11 years 6 months ago #8753

  • cyamin
Hi,

Has anybody managed to run telemac in a Windows cluster?

Regards,
Costas

Unable to run parallel jobs in a Windows cluster 11 years 5 months ago #8795

  • cyamin
I would like to add the following error message, which I get when running a computation between 2 hosts (the local one and a remote one). Running in parallel in localonly mode works fine.
[0] PMI_Init failed: FAIL - init called when another process has exited without calling init
This message refers to the process assigned to the other host, not the local one.
When I run the computation with the Intel Fortran compiled version, everything works smoothly, which suggests that the problem lies with the gfortran version.
Any idea what this message means?
Costas

Unable to run parallel jobs in a Windows cluster 11 years 2 months ago #10181

  • cyamin
Just an update to the thread:

With version v6p3r1, I can now distribute Telemac2D computations within my Windows cluster.

Since Tomawac is now parallelised, I gave it a try, but it fails the same way Telemac2D did in version v6p2.

Costas

Unable to run parallel jobs in a Windows cluster 11 years 2 months ago #10193

  • yugi
Could you post your systel.cfg for v6p3r1?
Could you post your machine file as well?
Just to check.

Thanks,
Yoann

Unable to run parallel jobs in a Windows cluster 11 years 2 months ago #10194

  • cyamin
Hi Yoann,

Here are the files:

File Attachment: hdf772c4.txt (2 KB)
File Attachment: h05d8b66.txt (0 KB)


Regards,
Costas

Unable to run parallel jobs in a Windows cluster 11 years 2 months ago #10206

  • yugi
How does your cluster work: do you have a job scheduler, or are you using the nodes directly?

Could you try to run the case with a basic mpirun command like:
mpirun -np 10 ./out.exe

It could be worth a shot.

Hope it helps,
Yoann

Unable to run parallel jobs in a Windows cluster 11 years 2 months ago #10207

  • cyamin
Dear Yoann,

I don't really understand what you mean by a 'job scheduler'.

I have installed mpich2 on the workstations I want to run computations on. The workstations are connected with Gigabit Ethernet. I am able to run the program cpi.exe, included in the mpich2 installation folder, in parallel (using mpiexec, not mpirun).

Since Ethernet has high latency, my aim is to remotely assign a separate computation to each workstation rather than to distribute a single computation between them.
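To illustrate what assigning whole computations to workstations could look like, here is a minimal Python sketch that spreads a list of cases over the hosts in round-robin order. The host names are the two mentioned earlier in the thread; the case names and the function name are hypothetical, and the sketch only builds the plan without launching anything.

```python
from itertools import cycle


def assign_cases(cases, hosts):
    """Assign each case to one host in round-robin order."""
    plan = {host: [] for host in hosts}
    for case, host in zip(cases, cycle(hosts)):
        plan[host].append(case)
    return plan


hosts = ["thor", "hal9000"]
cases = ["case_a", "case_b", "case_c"]   # hypothetical case names

plan = assign_cases(cases, hosts)
for host, assigned in plan.items():
    print(host, assigned)
```

Each case then runs entirely on one machine, so the network carries only input and output files rather than MPI messages between hosts.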

Costas