TELEMAC-MASCARET Forum: Unable to run parallel jobs in a Windows cluster (1/2)

Unable to run parallel jobs in a Windows cluster 11 years 6 months ago #8722

cyamin

Hello all,

I am trying to set up telemac in a cluster that consists of Windows 8 x64 PCs with multicore CPUs. I am using a Windows/mpich2/gfortran/python setup. I can run parallel jobs in localmachine mode, but when I try to split the computation across the network, the computation fails. I can run the cpi.exe mpich2 example program without problems across any combination of hosts and number of processes, so communication/authentication between the hosts is not an issue. (I have restricted the ports used by mpich2 and opened the relevant incoming ports on all hosts.)

I have made sure that I run the computation from a network share. I have used the mpich2 -map option to map a new network drive to the case folder. I suspect that mpich2 is more sensitive to how the network share is defined, so I have paid attention to this part of the configuration.

This is the command I use to run the computation:

mpiexec.exe -env MPICH_PORT_RANGE 10000:11000 -map t:\\atlas\Company\DataDisk\Work\opentelemac\201_malpasset -wdir <wdir> -machinefile t:\mpi_hosts.txt -n <ncsize> <exename>
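
For readers who launch the run from a script, the same command line can be assembled as an argument list. This is only a sketch: the helper name `build_mpiexec_command` is hypothetical, the default values are copied from the command above, and `<wdir>`, `<exename>` and the process count are placeholders that must be filled in for a real case.

```python
def build_mpiexec_command(wdir, ncsize, exename,
                          port_range="10000:11000",
                          drive_map=r"t:\\atlas\Company\DataDisk\Work\opentelemac\201_malpasset",
                          machinefile=r"t:\mpi_hosts.txt"):
    """Return the mpiexec argument list used in the post above.

    Only the options that appear in the original command are used here.
    """
    return [
        "mpiexec.exe",
        "-env", "MPICH_PORT_RANGE", port_range,  # restrict the ports mpich2 uses
        "-map", drive_map,                       # map the network share to a drive
        "-wdir", wdir,                           # working directory on every host
        "-machinefile", machinefile,             # list of hosts to run on
        "-n", str(ncsize),                       # number of processes
        exename,
    ]


# Placeholder values; 4 processes is an arbitrary example.
cmd = build_mpiexec_command("<wdir>", 4, "<exename>")
print(" ".join(cmd))
```

Keeping the arguments in a list avoids quoting problems with the backslashes in the share path if the list is later passed to `subprocess.run`.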

When I run a computation across the network, including the local host, I get the following error (if all the hosts are network hosts, I get no specific error message):

job aborted:
rank: node: exit code[: error message]
0: thor: 123
1: hal9000: -1073741515: process 1 exited without calling init while other processes have called init

thor is the local host, hal9000 is a network host.
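
A large negative exit code on Windows is often an NTSTATUS value printed as a signed 32-bit integer. The sketch below (the function name and the small lookup table are illustrative, not part of mpich2 or telemac) reinterprets the code reported for hal9000 as an unsigned hex value. The result, 0xC0000135, is STATUS_DLL_NOT_FOUND, which would be consistent with the executable failing to find a DLL on the remote host. That is a hint worth checking rather than a diagnosis confirmed in this thread.

```python
def to_ntstatus_hex(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    return "0x{:08X}".format(exit_code & 0xFFFFFFFF)


# A few well-known NTSTATUS values; this table is deliberately not exhaustive.
KNOWN_NTSTATUS = {
    "0xC0000135": "STATUS_DLL_NOT_FOUND",
    "0xC0000005": "STATUS_ACCESS_VIOLATION",
}

hex_code = to_ntstatus_hex(-1073741515)
print(hex_code, KNOWN_NTSTATUS.get(hex_code, "unknown"))
# prints: 0xC0000135 STATUS_DLL_NOT_FOUND
```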

The rest of the run output is attached:

File Attachment:

File Name: run_output.txt
File Size: 2 KB

Am I missing something? Has anybody else managed to run a parallel job in a Windows cluster? Any help would be appreciated.
Regards,
Costas

The administrator has disabled public write access.

Unable to run parallel jobs in a Windows cluster 11 years 6 months ago #8726

yugi

Hi,

Could you post your systel.cfg file as well? Could you also try adding the line

DEBUGGER : 1

to your case file and rerun it?

Thanks,
Yoann

There are 10 types of people in the world: those who understand binary, and those who don't.

Unable to run parallel jobs in a Windows cluster 11 years 6 months ago #8727

cyamin

I uncommented the DEBUGGER option (it was already in the cas file), but I didn't notice any difference in the output. I attach the new output and the configuration file as well.

File Attachment:

File Name: systel-gfortran.cfg_2013-05-16.txt
File Size: 2 KB

File Attachment:

File Name: run_output_2.txt
File Size: 2 KB

Thank you for your help.
Costas

Unable to run parallel jobs in a Windows cluster 11 years 6 months ago #8753

cyamin

Hi,

Has anybody managed to run telemac in a Windows cluster?

Regards,
Costas

Unable to run parallel jobs in a Windows cluster 11 years 5 months ago #8795

cyamin

I would like to add the following error message, which I get when trying to run a computation between 2 hosts (the local one and a remote one). Running in parallel in localonly mode works OK.

[0] PMI_Init failed: FAIL - init called when another process has exited without calling init

This message refers to the process assigned to the other host, not the local one.
When I try to run the computation with the Intel Fortran compiled version, everything works smoothly, which means that the problem lies with the gfortran version.
Any idea what this message means?
Costas
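
Since the Intel build runs and the gfortran build does not, one thing worth checking on each host is whether the runtime DLLs that the gfortran executable depends on can be found. This is an assumption about the cause, not something confirmed in this thread. The sketch below only searches the directories listed in PATH (Windows also searches other locations, such as the executable's own folder), and the DLL names are hypothetical examples that depend on the compiler build.

```python
import os


def find_in_dirs(filename, directories):
    """Return the first full path where filename exists, or None."""
    for directory in directories:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


def missing_files(filenames, directories):
    """Return the names that were not found in any of the directories."""
    return [name for name in filenames
            if find_in_dirs(name, directories) is None]


# Hypothetical names; check which DLLs your own gfortran build really needs.
HYPOTHETICAL_RUNTIME_DLLS = ["libgfortran-3.dll", "libgcc_s_dw2-1.dll"]

if __name__ == "__main__":
    path_dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    for name in missing_files(HYPOTHETICAL_RUNTIME_DLLS, path_dirs):
        print("not found on PATH:", name)
```

Running the script on every host in the machine file shows whether the hosts differ in what they can find.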


Unable to run parallel jobs in a Windows cluster 11 years 2 months ago #10181

cyamin

Just an update to the thread:

With version v6p3r1, I can now distribute Telemac2D computations within my Windows cluster. Since Tomawac is now parallelised, I gave it a try, but it fails the same way Telemac2D did in version v6p2.

Costas

Unable to run parallel jobs in a Windows cluster 11 years 2 months ago #10193

yugi

Could you post your systel.cfg for v6p3r1? Could you post your machine file as well? Just to check.

Thanks,
Yoann

Unable to run parallel jobs in a Windows cluster 11 years 2 months ago #10194

cyamin

Hi Yoann,

Here are the files:

File Attachment:

File Name: hdf772c4.txt
File Size: 2 KB

File Attachment:

File Name: h05d8b66.txt
File Size: 0 KB

Regards,
Costas

Unable to run parallel jobs in a Windows cluster 11 years 2 months ago #10206

yugi

How is your cluster set up: do you have a job scheduler, or are you using the nodes directly? Could you try to run the case with a basic mpirun command like:

mpirun -np 10 ./out.exe

It could be worth a shot.

Hope it helps,
Yoann

Unable to run parallel jobs in a Windows cluster 11 years 2 months ago #10207

cyamin

Dear Yoann,

I don't really understand what you mean by a 'job scheduler'.

I have installed mpich2 on the workstations I want to run computations with. The workstations are connected with Gigabit Ethernet. I am able to run the program cpi.exe, included in the mpich2 installation folder, in parallel (using mpiexec, not mpirun).

Since Ethernet has high latency, my aim is to assign separate computations remotely to each workstation rather than distribute a single computation among them.

Costas

