
TOPIC: Centos + OpenMPI

Centos + OpenMPI 9 years 5 months ago #17296

  • konsonaut
  • openTELEMAC Guru
  • Posts: 413
  • Thank you received: 144
Hello,

We have access to a small cluster running CentOS 6.6, with gfortran 4.4.7 and Open MPI 1.8.1.
According to $ lscpu the machine has 4 sockets with 4 cores per socket, 2 threads per core and 8 NUMA nodes.
I installed Telemac v7p0r0, and when running e.g. the malpasset large case on up to 8 cores everything is fine. However, when using more than 8 cores, e.g. 12 cores, the simulation time increases a lot and something must be wrong. Maybe it actually runs on only 8 cores but with the mesh partitioned into 12 parts?!
Apparently it has something to do with my configuration when addressing more than one socket / node. I have to admit that the socket and NUMA concepts are not very clear to me; I tried mpirun options like --npersocket etc., which didn't help.
Attached you can find my systel file.
As far as I know, no HPC queuing system is installed on the machine.
So my basic question is: what do I need to do to address more than one node / socket?

I would be glad for any hints!
Clemens

Centos + OpenMPI 9 years 5 months ago #17312

  • cyamin
  • openTELEMAC Guru
  • Posts: 997
  • Thank you received: 234
Hello Clemens,

How many nodes does this cluster have? Is it 8 nodes, each with 4 sockets? It is not clear from your post.

Costas

Centos + OpenMPI 9 years 5 months ago #17314

  • konsonaut
  • openTELEMAC Guru
  • Posts: 413
  • Thank you received: 144
Hi Costas,

$ lscpu gives me this:
_________________________________________
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 4
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 21
Model: 2
Stepping: 0
CPU MHz: 3200.000
BogoMIPS: 6399.62
Virtualization: AMD-V
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 6144K
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s): 4-7
NUMA node2 CPU(s): 8-11
NUMA node3 CPU(s): 12-15
NUMA node4 CPU(s): 16-19
NUMA node5 CPU(s): 20-23
NUMA node6 CPU(s): 24-27
NUMA node7 CPU(s): 28-31
_________________________________________

So it has 4 sockets, each with 4 cores, and with hyperthreading enabled each socket has 8 logical processors.
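In case it helps, the NUMA layout (including how much memory is attached to each node) can also be checked with numactl, assuming it is installed, or with hwloc:
_________________________________________
$ numactl --hardware      # lists each NUMA node with its CPUs and its local memory
$ lstopo-no-graphics      # hwloc text view of sockets, NUMA nodes and caches
_________________________________________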

We still couldn't solve the problem, but according to our IT experts it is not a Telemac configuration issue; it is most probably related to the fact that some Ansys CFX simulations are also running on the machine without any explicit socket reservation. So we have to adapt that.

Best regards,
Clemens

Centos + OpenMPI 9 years 5 months ago #17317

  • cyamin
  • openTELEMAC Guru
  • Posts: 997
  • Thank you received: 234
Hi Clemens,

Yes, since each node has 4 sockets, it is most certainly an affinity issue. Any computation that spans multiple nodes will incur some communication penalty, so it is crucial that affinity is defined when applicable. Even more so if another computation is 'stealing' your CPU power!
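For example, with Open MPI 1.8 the placement can be set explicitly on the mpirun line. This is only a sketch: "telemac_exe" is a placeholder for whatever executable Telemac launches, and the options would normally go into the mpirun command defined in your systel file:
_________________________________________
# spread 16 ranks round-robin over the 4 sockets and pin each rank to one core;
# --report-bindings prints the resulting placement so you can verify it
$ mpirun -np 16 --map-by socket --bind-to core --report-bindings ./telemac_exe
_________________________________________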

Regards,
Costas

Centos + OpenMPI 9 years 5 months ago #17334

  • cyamin
  • openTELEMAC Guru
  • Posts: 997
  • Thank you received: 234
Hello Clemens,

I just realised from your $ lscpu listing that we are talking about a single machine with 4 sockets. The operating system sees 32 CPUs (4 sockets x 4 cores x 2 threads = 32 logical CPUs) and groups them into 8 NUMA nodes (two per socket), not physical nodes. Each NUMA node is a group of cores with its own local memory, which those cores can access fastest.

So you effectively have 8 memory channels. Assume that Telemac is a memory-intensive application (which is usually not far from the truth), i.e. memory access speed, not CPU power, is the bottleneck. Then each active core can consume most of its channel's bandwidth, and using more cores would create 'congestion' on the memory bus and potentially a slowdown. That would explain why 8 cores is the optimum in your case.
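If you want to test this theory (again just a sketch with Open MPI 1.8 syntax and a placeholder executable name), you could compare a run with one rank per NUMA node against one that packs the ranks onto neighbouring cores:
_________________________________________
# 8 ranks, one per NUMA node: each rank gets its own memory channel
$ mpirun -np 8 --map-by ppr:1:numa --bind-to core --report-bindings ./telemac_exe
# 8 ranks packed onto consecutive cores: ranks share memory channels
$ mpirun -np 8 --map-by core --bind-to core --report-bindings ./telemac_exe
_________________________________________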

This is only a theory, but a likely one. If you manage to sort it out, I would be interested in your findings, because I am planning to build a multi-socket workstation and I am trying to figure out the optimum ratio of CPU cores (and their speed) per memory channel for Telemac applications.

Best Regards,
Costas