
TOPIC: Unable to start a simulation on a Windows7 mini-cluster

Unable to start a simulation on a Windows7 mini-cluster 9 years 9 months ago #15673

  • yanrousseau
Dear Telemac users,

I installed Telemac v6p2 and v7p0r0 on Windows7 workstations. On each one of these, I can run a simulation in parallel with mpich2. However, I tried to create a cluster using two of these workstations, and the simulation worked but was much slower than when using the cores of only one of the two computers. I posted a similar question last week for my Linux setup (opentelemac.org/index.php/kunena/12-linu...ng-if-using-16-cores) but the origin of the problem seems to be different under Windows7 since I'm unable to run a single simulation involving more than one node. I suspect that my problem with Windows7 is related to security.

Here are the main facts about my configuration:
- The two workstations that I'm using are identical twins (except for their names). They have a 64-bit architecture, are six-core machines with 24 GB of memory, and run Windows7.
- I compiled Telemac v7.0 using Intel Visual Fortran Compiler v11.1.072 and I'm using Python v2.7.9 (I think this version includes numpy, scipy, and matplotlib) and Metis v5. For the Metis library, I do not know the exact sub-version number, but it was created on September 12th, 2012 and is 1181 KB. I have attached my configuration file.
- The two computers belong to the same network and recognize my user name (which happens to be an administrator of both machines). Therefore, there is a single domain, user and password involved.
- MPICH2 is installed on both machines, at the same location, and the SMPD service is running.
- I created a 'case' directory on the primary machine (the one that holds the Telemac binaries), shared this directory with my user (with full control), and mapped a drive (n:\) to this directory on both machines.
- Since, at this point, my multi-node calculation was not starting, I registered the hosts (using c:\mpich2\bin\mpiexec.exe -register -host) so that each computer recognizes the other participating machine.
- I also disabled the firewall on both workstations.
- Finally, I modified runcode.py to build the host file (i.e. the file MPI_HOSTFILE in the temporary directory) according to the content of the field 'mpi_hosts' in config.cfg.
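To illustrate the last point, here is a minimal sketch of turning a host list into machine-file lines. It is not the actual runcode.py change. The function names, the whitespace-separated format assumed for 'mpi_hosts', and the optional "host:n" form are all assumptions made for this example, so check them against your MPICH2 and Telemac versions.

```python
def build_hostfile_lines(mpi_hosts, cores_per_host=None):
    """Turn a whitespace-separated host string into machine-file lines.

    mpi_hosts      -- e.g. "hostA hostB" (hypothetical host names)
    cores_per_host -- optional int; if given, each line becomes "host:n"
                      (assumed machine-file convention, verify locally)
    """
    hosts = mpi_hosts.split()
    if not hosts:
        raise ValueError("mpi_hosts is empty")
    if cores_per_host is None:
        return hosts
    return ["%s:%d" % (h, cores_per_host) for h in hosts]


def write_hostfile(path, lines):
    """Write one machine-file entry per line, ending with a newline."""
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

For two six-core machines, build_hostfile_lines("hostA hostB", 6) gives one "host:6" entry per workstation, which write_hostfile can then save as MPI_HOSTFILE in the temporary directory.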

I then run my simulation using:
cd n:\t2dwork\
python c:\telemac\v7p0\scripts\python27\runcode.py telemac2d -s T2DCAS -c wintelmpi -w tmp

If I declare a single host in config.cfg (the workstation on which the Telemac binaries are located), my parallel simulation works fine with these settings. However, if I declare two workstations (as shown in the file config.cfg that is attached to this post), the last line displayed in the standard output is:
USING STREAMLINE VERSION 7.0 FOR CHARACTERISTICS
although the requested cores are working at 100% according to Windows Task Manager.

My questions are the following:
1) Is there something wrong with my configuration?
2) Are there other security settings (not related to the firewall) that need to be modified?

Regards,

Yannick

Unable to start a simulation on a Windows7 mini-cluster 9 years 9 months ago #15680

  • yugi
Hi,

You might want to try running a basic MPI hello world first.
Something like this:
program hello_parallel

  implicit none

  ! Include the MPI library definitions:
  include 'mpif.h'

  integer :: numtasks, rank, ierr, rc, namelen
  character(len=MPI_MAX_PROCESSOR_NAME) :: name

  ! Error code passed to MPI_ABORT if something goes wrong:
  rc = 1

  ! Initialize the MPI library:
  call MPI_INIT(ierr)
  if (ierr .ne. MPI_SUCCESS) then
     print *,'Error starting MPI program. Terminating.'
     call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
  end if

  ! Get the number of processes this job is using:
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

  ! Get the rank of this process.  (Each process has a unique rank.)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! Get the name of the processor this process runs on (usually the hostname):
  call MPI_GET_PROCESSOR_NAME(name, namelen, ierr)
  if (ierr .ne. MPI_SUCCESS) then
     print *,'Error getting processor name. Terminating.'
     call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
  end if

  print "('hello_parallel.f: Number of tasks=',I3,' My rank=',I3,' My name=',A)",&
       numtasks, rank, name(1:namelen)

  ! Tell the MPI library to release all resources it is using:
  call MPI_FINALIZE(ierr)

end program hello_parallel

It will be easier to test your network with something more basic first.

Hope it helps.

Unable to start a simulation on a Windows7 mini-cluster 9 years 9 months ago #15738

  • yanrousseau
Hi Yugi,

Thank you for replying and for providing a sample MPI application.

I compiled this application using the same flags as in my systel.cfg file, which produced the executable mpihw.exe. I shared the directory containing this file on the network and mapped a network drive to it on two machines: the one holding mpihw.exe (let's call it A) and a second one (let's call it B). This simple MPI application ran and completed successfully with the following command line: c:\mpich2\bin\mpiexec -binding auto -n 8 -machinefile hosts.txt mpihw.exe

I then modified the source code of mpihw.exe (and recompiled it) so that it tries to read a file located in the same directory as mpihw.exe (on machine A). That is essentially what the code in sources\utils\parallel\p_init.f does when it tries to read the file PARAL at the beginning of a simulation. The cores of machine A were able to read the file, but those of machine B were not. It seems that machine A can order both machines to execute the application, but that machine B does not have permission to read a file located on machine A. I'm a little surprised by the outcome of this test: the firewall is disabled, I gave full control to the user, and I registered both machines (i.e. using mpiexec -register -host). I will talk to the system admin tomorrow.
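The same kind of read test can also be sketched in Python, which may make it quicker to repeat on each machine. This is only an illustration and not the modified Fortran program; the function name check_shared_file is made up for this example.

```python
import socket


def check_shared_file(path):
    """Return (hostname, readable, detail) for a file on a shared drive."""
    host = socket.gethostname()
    try:
        with open(path, "rb") as f:
            f.read(1)  # force an actual read, not just an open
    except (IOError, OSError) as exc:
        return host, False, str(exc)
    return host, True, "ok"
```

Calling check_shared_file on the PARAL file through the mapped drive reports which host could or could not read it. A process started by the MPI launcher may not see the same mapped drives as an interactive session, so the result is most meaningful when the script is launched the same way as the simulation.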

Yannick

Unable to start a simulation on a Windows7 mini-cluster 6 years 9 months ago #28663

  • Yunhao Song
Dear Yugi,

Could you tell me what the error below means?
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 1

It popped up during a T3D simulation; everything seemed fine before that. I searched the whole forum and found that this is the only post with information about MPI_Abort.
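For reference, here is a small sketch that pulls the fields out of such a line. It assumes the message layout is "MPI_Abort(communicator, error code) - process rank", which is how I read it. The names ABORT_RE and parse_mpi_abort are made up for this example.

```python
import re

# Assumed layout: "application called MPI_Abort(<comm>, <code>) - process <rank>"
ABORT_RE = re.compile(
    r"application called MPI_Abort\((?P<comm>\w+),\s*(?P<code>-?\d+)\)"
    r"\s*-\s*process\s+(?P<rank>\d+)"
)


def parse_mpi_abort(line):
    """Return a dict with comm, code and rank, or None if the line does not match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    return {
        "comm": m.group("comm"),
        "code": int(m.group("code")),
        "rank": int(m.group("rank")),
    }
```

Read this way, the line above says that process 1 called MPI_Abort with error code 2. The code is whatever value the program passed to MPI_Abort, so it does not identify the cause on its own; the output of that process just before the abort is probably the place to look.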

Best regards & looking forward to your idea,
Yunhao