TOPIC: Unable to start a simulation on a Windows7 mini-cluster

Unable to start a simulation on a Windows7 mini-cluster 9 years 9 months ago #15673

  • yanrousseau
Dear Telemac users,

I installed Telemac v6p2 and v7p0r0 on Windows7 workstations. On each of them, I can run a simulation in parallel with mpich2. However, when I tried to create a cluster using two of these workstations, the simulation worked but was much slower than when using the cores of only one of the two computers. I posted a similar question last week for my Linux setup (opentelemac.org/index.php/kunena/12-linu...ng-if-using-16-cores), but the origin of the problem seems to be different under Windows7, since I'm unable to run a single simulation involving more than one node. I suspect that my problem with Windows7 is related to security.

Here are the main facts about my configuration:
- The two workstations that I'm using are identical twins (except for their names). They have a 64-bit architecture, are hex-cores with 24 GB of memory, and run Windows7.
- I compiled Telemac v7.0 using Intel Visual Fortran Compiler v11.1.072 and I'm using Python v2.7.9 (I think this version includes numpy, scipy, and matplotlib) and Metis v5. For the Metis library, I do not know the exact sub-version number, but it was created on September 12th, 2012 and is 1181 KB. I have attached my configuration file.
- The two computers belong to the same network and recognize my user name (which happens to be an administrator of both machines). Therefore, there is a single domain, user and password involved.
- MPICH2 is installed on both machines, at the same location, and the SMPD service is running.
- I created a 'case' directory on the primary machine (the one that holds the Telemac binaries), shared this directory with my user (with full control), and mapped a drive (n:\) to this directory on both machines.
- Since, at this point, my multi-node calculation was not starting, I registered the hosts (using c:\mpich2\bin\mpiexec.exe -register -host) so that each computer recognizes the other participating machine.
- I also disabled the firewall on both workstations.
- Finally, I modified runcode.py to build the host file (i.e. the file MPI_HOSTFILE in the temporary directory) according to the content of the field 'mpi_hosts' in config.cfg.
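The actual change made to runcode.py is not shown in this post. As a minimal sketch of the idea only, assuming that 'mpi_hosts' holds host names separated by whitespace, a helper along these lines could write the host file. The function name write_mpi_hostfile and the host names nodeA and nodeB are hypothetical and are not part of the Telemac scripts.

```python
import os
import tempfile


def write_mpi_hostfile(mpi_hosts, directory, filename="MPI_HOSTFILE"):
    """Write one host name per line and return the path of the file.

    Assumption: mpi_hosts is a whitespace-separated string of host names,
    as it might be read from the 'mpi_hosts' field of the configuration.
    """
    hosts = mpi_hosts.split()
    if not hosts:
        raise ValueError("mpi_hosts is empty")
    path = os.path.join(directory, filename)
    with open(path, "w") as hostfile:
        for host in hosts:
            hostfile.write(host + "\n")
    return path


if __name__ == "__main__":
    # Hypothetical host names, written into a throwaway temporary directory.
    tmp_dir = tempfile.mkdtemp()
    print(write_mpi_hostfile("nodeA nodeB", tmp_dir))
```

Writing one bare host name per line keeps the file independent of any per-host core-count syntax, which would have to be checked against the MPI implementation in use.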

I then ran my simulation using:
cd n:\t2dwork\
python c:\telemac\v7p0\scripts\python27\runcode.py telemac2d -s T2DCAS -c wintelmpi -w tmp

If I declare a single host in config.cfg (the workstation on which the Telemac binaries are located), my parallel simulation works fine with these settings. However, if I declare two workstations (as shown in the file config.cfg that is attached to this post), the last line displayed in the standard output is:
although the requested cores are working at 100% according to Windows Task Manager.

My questions are the following:
1) Is there something wrong with my configuration?
2) Are there other security settings (not related to the firewall) that need to be modified?



Unable to start a simulation on a Windows7 mini-cluster 9 years 9 months ago #15680

  • yugi

You might want to try running a basic MPI hello world first.
Something like this:
program hello_parallel

  implicit none

  ! Include the MPI library definitions:
  include 'mpif.h'

  integer numtasks, rank, ierr, namelen
  ! Error code passed to MPI_ABORT
  integer, parameter :: rc = 1
  character*(MPI_MAX_PROCESSOR_NAME) name

  ! Initialize the MPI library:
  call MPI_INIT(ierr)
  if (ierr .ne. MPI_SUCCESS) then
     print *,'Error starting MPI program. Terminating.'
     call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
  end if

  ! Get the number of processes this job is using:
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

  ! Get the rank of this process.  (Each process has a unique rank.)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! Get the name of the processor this process runs on (usually the hostname)
  call MPI_GET_PROCESSOR_NAME(name, namelen, ierr)
  if (ierr .ne. MPI_SUCCESS) then
     print *,'Error getting processor name. Terminating.'
     call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
  end if

  print "('hello_parallel.f: Number of tasks=',I3,' My rank=',I3,' My name=',A)",&
       numtasks, rank, trim(name)

  ! Tell the MPI library to release all resources it is using:
  call MPI_FINALIZE(ierr)

end program hello_parallel

It will be easier to test your network with something more basic first.

Hope it helps.
There are 10 types of people in the world: those who understand binary, and those who don't.

Unable to start a simulation on a Windows7 mini-cluster 9 years 9 months ago #15738

  • yanrousseau
Hi Yugi,

Thank you for replying and for providing a sample MPI application.

I compiled this application using the same flags that are present in my systel.cfg file. This created the executable file mpihw.exe. I shared on the network the directory in which this file is located and mapped a network drive to that directory on two machines, i.e. the one holding the file mpihw.exe (let's call it A) and a second one (let's call it B). This simple MPI application was able to run and complete successfully using the following command line: c:\mpich2\bin\mpiexec -binding auto -n 8 -machinefile hosts.txt mpihw.exe

I then modified the source code of mpihw.exe (and recompiled it) so that it attempts to read a file located in the same directory as mpihw.exe (on machine A). This is basically what the code in sources\utils\parallel\p_init.f does when it tries to read the file PARAL at the beginning of a simulation. The cores of machine A were able to read the file, but not those of machine B. It seems that machine A can order both machines to execute the application, but that machine B does not have permission to read a file located on machine A. I'm a little surprised by the outcome of this test. The firewall is disabled, I gave full control to the user, and I registered both machines (i.e. using mpiexec -register -host). I will talk to the system admin tomorrow.
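The modified source of mpihw.exe is not shown here. As a rough stand-in for the same test, assuming Python is installed on both machines, a short script like the one below reports which host it runs on and whether that host can open a given file. The function name check_readable and the default file name are illustrative assumptions only.

```python
import socket
import sys


def check_readable(path):
    """Return (ok, message) saying whether this host can open and read path."""
    host = socket.gethostname()
    try:
        with open(path, "rb") as handle:
            handle.read(1)
    except (IOError, OSError) as exc:
        return False, "%s: cannot read %s (%s)" % (host, path, exc)
    return True, "%s: can read %s" % (host, path)


if __name__ == "__main__":
    # Path to test; defaults to a file named PARAL in the working directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "PARAL"
    ok, message = check_readable(target)
    print(message)
```

Launched through mpiexec with the same machine file in place of mpihw.exe, each process would print one line, which makes it easy to see on which machine the read fails and with what system error.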


Unable to start a simulation on a Windows7 mini-cluster 6 years 9 months ago #28663

  • Yunhao Song
Dear Yugi,

Could you tell me what the error below means?
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 1

It popped up during a T3D simulation; everything seemed fine before that. I searched the whole forum and found that this is the only post with information about MPI_Abort.

Best regards & looking forward to your idea,