
TOPIC: Simulations on a cluster not always working if using > 16 cores

Simulations on a cluster not always working if using > 16 cores 9 years 9 months ago #15638

  • yanrousseau
Dear Telemac users,

I installed Telemac v7p0r0 on a Linux CentOS v6.4 cluster. With a reduced number of cores (<16), my TELEMAC-2D (and TELEMAC-2D coupled with SISYPHE) simulations complete successfully. The simulation domain in my test case contains 13,046 nodes (4781 of which are wet). However, when using more than 16 cores, the simulation becomes very slow and seems to run in scalar mode. The other problem is that the outcome is not consistent: a few simulations launched for the exact same case were successful with up to 32 processors. I launched the same simulation multiple times and saw a range of outcomes, including crashing, running very slowly, and completing successfully. I noticed the same behaviour under v6p2, which is installed on the same cluster. I see two broad potential explanations: 1) the nodes on the cluster are unstable, and/or 2) there is a problem with my Telemac setup. The former is likely, since a simulation has sometimes crashed with a message indicating that certain nodes were not answering. The latter is also possible since, in the majority of my attempts, the simulations are simply very slow, although they eventually complete. The problematic simulations take much longer to run than a normal scalar simulation.

I have attached the configuration file that I used with v6p2 and v7p0. Although the files define both the mpich2 and openmpi configurations, I did most of my tests using openmpi. Since there is a queuing system on the cluster that I'm using (SharcNet.ca), I slightly modified the Python scripts so that the key hpc_cmdexec is interpreted and the simulations run on the nodes assigned by the queuing system. The script also merges the TELEMAC-2D and SISYPHE files at the end, in addition to cleaning the temporary files out of the directory. This works fairly well if fewer than 16 cores are requested.

With v6p2, I used gcc/4.8.2, openmpi/gcc/1.8.3, and python/2.6.6.
With v7p0, I used gcc/4.8.2, openmpi/gcc/1.8.3, and python/2.7.5.
In both cases, I linked against metis/5.1.0, which was compiled on the cluster. Since I came across different posts about the compatibility of Telemac with metis v4/5 (I'm still not sure which one I should use), I also tried compiling Telemac with metis/4.0.3, with no more success. When compiling metis/5.1.0, I specified IDXTYPEWIDTH = 32. I also ran tests using mpich2/1.4p1 (compiled on the cluster), but that did not work any better.

My main question is the following: is this a cluster problem or an installation problem? Also, should the number of cores be divided evenly between nodes, or does it not matter? Any comment/idea/cheer will be greatly appreciated.

Regards,

Yannick
The administrator has disabled public write access.

Simulations on a cluster not always working if using > 16 cores 9 years 9 months ago #15659

  • c.coulet
  • Moderator
Hi Yannick

That sounds strange!
We need more details to understand the problem properly...
Could you describe the cluster structure (nodes, connection ...) in a little more detail?

Another point: as you use a small model, increasing the number of cores increases the communication between cores, and this reduces the speed of computation. In our experience, about 1000 nodes/core gives the "optimal" speed-up.
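
As a rough illustration of that rule of thumb applied to the mesh quoted in this thread (13,046 nodes), here is a minimal Python sketch. The function name `suggested_cores` and its `nodes_per_core` parameter are hypothetical names chosen for this example; they are not TELEMAC keywords, and the 1000 nodes/core figure is only the rule of thumb stated above.

```python
def suggested_cores(mesh_nodes, nodes_per_core=1000):
    """Return a rough core count from a nodes-per-core rule of thumb."""
    if mesh_nodes <= 0 or nodes_per_core <= 0:
        raise ValueError("mesh_nodes and nodes_per_core must be positive")
    # Round to the nearest whole core, but never go below one core.
    return max(1, round(mesh_nodes / nodes_per_core))


if __name__ == "__main__":
    # 13,046 is the mesh size quoted in the first post of this thread.
    print(suggested_cores(13046))
```

For this mesh the estimate comes out at 13 cores, so beyond roughly that count the extra communication may start to outweigh the gain.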

Best regards
Christophe

Simulations on a cluster not always working if using > 16 cores 9 years 9 months ago #15662

  • yanrousseau
Hi Christophe,

Thank you for replying.

The cluster that I'm using has 392 nodes. I'm only using the first 320, since they have the fastest processors. The flag '-f opteron' in the 'sqsub' command line selects the fast nodes. Each of these 320 nodes is an AMD Opteron 2.2 GHz unit with 24 cores (2 sockets x 12 cores per socket) and 32 GB of memory. This means that I could, in theory, run a simulation using 7680 cores. In reality, there are other users on the cluster and I cannot use more than 256 processors simultaneously. The nodes are connected via QDR InfiniBand.

The most recent simulation that I launched (requesting 32 cores in the .cas file) was sent to 19 nodes and was very slow. Other simulations completed successfully with 20 cores (spread across 11 nodes), with 24 cores (on 13 nodes), and with 32 cores (on 15 nodes). Since then, I have not been able to use more than 16 cores.
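
To make placements like "32 cores on 19 nodes" easier to compare between runs, the hosts used by a job can be tallied. The following is only a sketch: it assumes you already have one host name per MPI rank (for example, collected by launching `hostname` through the same MPI launcher as the simulation). The function names and the host names in the example are made up for illustration.

```python
from collections import Counter


def ranks_per_node(hostnames):
    """Count how many MPI ranks were placed on each node."""
    return Counter(hostnames)


def summarize(hostnames):
    """Summarise a placement: rank count, node count, per-node spread."""
    counts = ranks_per_node(hostnames)
    return {
        "ranks": len(hostnames),
        "nodes": len(counts),
        "max_per_node": max(counts.values()) if counts else 0,
        "min_per_node": min(counts.values()) if counts else 0,
    }


if __name__ == "__main__":
    # Hypothetical placement: 4 ranks spread over 3 nodes.
    example = ["node001", "node001", "node002", "node003"]
    print(summarize(example))
```

A large gap between `max_per_node` and `min_per_node`, or many nodes holding a single rank, would show that the scheduler scattered the job widely, which is one thing worth reporting to the cluster admins.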

Would you say that my setup seems fine for a Telemac v7p0 setup on Linux CentOS v6.4? I used gcc/4.8.2, openmpi/gcc/1.8.3, python/2.7.5, and metis/5.1.0.

Regards,
Yannick

Simulations on a cluster not always working if using > 16 cores 9 years 8 months ago #16187

  • jstark
Hey,

I have noticed a similar problem with v7p0 on the HPC cluster at our university.
I've been running Telemac2D-simulations on 1 node with 20 cores (so simulating on 20 parallel processors) using v6p3 and v7p0.

With v6p3 everything runs smoothly. However, when I run exactly the same simulation on v7p0, I get strange results. The simulation itself seems to run fine. The listing file does not show any unexpected messages and ends just like the v6p3 simulation:

END OF TIME LOOP
EXITING MPI
CORRECT END OF RUN
ELAPSE TIME : 3 HOURS
54 MINUTES
18 SECONDS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
... merging separated result files
+> cas_jul2012
recollectioning: T2DRES
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
... handling result files
+> cas_jul2012
moving: RES_SC_EMB1000
My work is done

So far, nothing wrong. But if I open the result file in BlueKenue, it appears to contain corrupted/incorrect results. The first timesteps are still fine, but at a certain moment it seems that the stitching between the parallel result files went wrong, giving random patterns, zeros or extremely high values (infinite numbers) for the water surface elevation (see attachment for an example). This happened for several simulations on our cluster.

BK_example.png
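
One way to find the first output step at which such values appear, without browsing every frame by eye, is sketched below in Python. It assumes the water surface elevations have already been extracted from the result file into plain lists, one list per output timestep; reading the result file itself is left out. The function name `suspicious_steps` and the `limit` threshold are hypothetical, and the sample values are made up purely to exercise the function.

```python
import math


def suspicious_steps(frames, limit=1.0e4):
    """Return indices of time steps containing non-finite or huge values.

    frames: list of per-time-step lists of free-surface values (m).
    limit:  hypothetical plausibility bound in metres; adjust per model.
    """
    bad = []
    for step, values in enumerate(frames):
        if any((not math.isfinite(v)) or abs(v) > limit for v in values):
            bad.append(step)
    return bad


if __name__ == "__main__":
    # Made-up values: step 0 is clean, step 1 has inf, step 2 a huge value.
    frames = [
        [1.2, 1.3, 1.1],
        [1.2, float("inf"), 1.1],
        [1.2, 1.3, 9.9e12],
    ]
    print(suspicious_steps(frames))  # [1, 2]
```

This only catches infinite, NaN, or implausibly large values; patches of zeros or random patterns would need a comparison against a trusted run, such as the v6p3 result.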


The strange thing is that when I run the same simulation with the same TELEMAC-2D-configuration on the same cluster using only 12 processors (instead of the available 20 per node), everything runs fine again.

As I'm no expert, I'm not sure whether this has anything to do with the problems Yannick described above, but I thought it was worth mentioning.

Regards,

Jeroen

Simulations on a cluster not always working if using > 16 cores 9 years 8 months ago #16194

  • yanrousseau
Hi Jeroen,

From your description, I don't think the issue is exactly the same as the one I experienced. My simulations failed to launch because of a network/configuration problem (perhaps a temporary one). I ended up receiving authentication error messages, which means that my request to launch an application was accepted on certain nodes but refused on others. As a result, my simulations failed to reach the first iteration. No result file was produced since the calculations did not even start. I observed the same issue on SharcNet with v6p2 and v7p0, but inconsistently. I could launch a simulation once with 32 processors and it would run as expected. When attempting to run the same simulation a second time, it would sometimes fail with as few as 4 processors. After communicating with the admins of the cluster, I was told that my problem might have been caused by network problems that were noticed on the cluster early this year.

Did you try running the exact same simulation with v7p0 a second time with 20 cores? If not, I would be curious to see if it works. Perhaps there was a connection problem at some point during your simulation and one core was not able to make the required calculations or communicate the result. If you already tried, my guess would be that there is something wrong with one of the machines. Determining which core causes the problem would allow you to contact the admins of your HPC cluster and have the problematic machine restarted. This can be done by checking which additional machine/core is requested when using 20 processors (rather than 19). Alternatively, if you are able to specify the machines/cores on which to launch a simulation (perhaps with a single iteration), you could attempt running a simulation with 20 cores, varying the combination of requested machines/cores. In my case, I noticed that when specific machines were involved, my simulations were paused (and not able to start), no matter the number of processors requested.
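
The idea of varying the combination of machines can be sketched as a leave-one-out loop in Python. This is only an illustration under stated assumptions: `job_succeeds` is a placeholder that you would have to write yourself around your own launcher (it should run a short job, e.g. a single iteration, on the given hosts and return True on success), and the sketch assumes failures are reproducible and caused by a single machine. With intermittent failures like the ones described in this thread, each test would need to be repeated several times.

```python
def suspect_hosts(hosts, job_succeeds):
    """Return hosts whose removal turns a failing test job into a success.

    hosts:        list of host names available to the job.
    job_succeeds: placeholder callable (hypothetical); takes a list of
                  hosts and returns True if a short test run completes.
    """
    if job_succeeds(hosts):
        return []  # the full set works, so there is nothing to isolate
    suspects = []
    for host in hosts:
        # Re-run the test with this one host left out.
        remaining = [h for h in hosts if h != host]
        if remaining and job_succeeds(remaining):
            suspects.append(host)
    return suspects


if __name__ == "__main__":
    # Stand-in for a real launcher: pretend "node003" is the faulty one.
    def fake_job(selected):
        return "node003" not in selected

    print(suspect_hosts(["node001", "node002", "node003"], fake_job))
```

With the stand-in launcher above, only the pretend faulty host is reported. If two or more machines are faulty at once, no single removal fixes the run and the list comes back empty, so an empty result does not prove the machines are healthy.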

I'm not sure if my answer is really helpful.

Yannick

Simulations on a cluster not always working if using > 16 cores 9 years 8 months ago #16198

  • riadh
Hi Jeroen,

This is a really interesting topic!
Can you send me your case? I will try to run it with the same number of procs on our cluster. If the same problem remains, this means that there is a problem in the parallelism of Telemac; otherwise, we need more investigation to understand this machine dependence.

with my best regards
Riadh

Simulations on a cluster not always working if using > 16 cores 9 years 8 months ago #16203

  • jstark
Dear Riadh and Yannick,

Thanks for your replies.

Riadh, I'll send you an email with the input files for the specific simulation (the result file itself is too large to send, around 10 GB).

I just submitted several versions of this simulation to our cluster:
- 1 node 20 proc for v7p0
- 1 node 19 proc for v7p0
- 1 node 16 proc for v7p0
- 1 node 20 proc for v6p3
This is to check whether the number of processors makes any difference, and to double-check that I still get good results with v6p3, which has always worked fine in lots of simulations with various numbers of parallel processors. Hopefully, I will have the results of these simulations tomorrow.

Again, a large part of the result file seems to be fine, and I just heard that a colleague who runs shorter/smaller simulations has had no problems so far with v7p0 using 20 parallel processors. In my case, however, the erroneous results start to appear from a certain timestep in the simulation. From that moment on, the results switch at every output timestep between being correct in some parts of the model, being correct in other parts of the model, or being wrong everywhere.

I'm curious as well about what is causing this.

Regards,

Jeroen
Moderators: borisb
