Hi all,
I used MPICH2 to distribute the telemac2d malpasset-large computation across the PCs in my Windows cluster. The cluster contains 3 different CPU architectures:
- Intel Core i7 860 @2.80GHz (4 cores)
- Intel Core2 Duo @3.00GHz (2 cores)
- Intel Xeon E3 1220L V2 @2.30GHz (2 cores)
Unfortunately, I am unable to distribute the computation when using the gfortran compiler, so the benchmark results are based only on the Intel Fortran compiler.
The optimization flags used with the Intel compiler on all the machines are:
/O3 /QaxSSE4.2 /arch:SSE4.1
There are, of course, more aggressive optimizations that can be employed when targeting a single CPU, especially the newer Xeon E3.
The results show little benefit from using more than 2 cores on the Core i7. Unfortunately, the Intel Xeon CPU is installed in a low-power compact machine and has only 2 cores. I would be very keen to see whether quad-core Xeon CPUs, or the latest i7 CPUs, experience the same bottleneck when using more than 2 cores.
Some attention should be paid to the 'Intel Turbo Boost' technology (Core i7 and Xeon E3). It sets a different core clock rate depending on whether 1, 2, or 3 to 4 cores are in use, which adds a small penalty to computation times when using more than 1 core. However, this penalty cannot explain the poor results that the i7 gives, which must be an issue with the CPU architecture.
It would also be interesting to see how AMD multicore CPUs perform in this respect.
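For anyone who wants to separate the Turbo Boost effect from the scaling itself, here is a rough Python sketch. It computes the usual speedup and parallel efficiency from two wall-clock times, then repeats the calculation after rescaling the 1-core time to the clock rate of the n-core run. The function names, the timings and the clock rates below are all placeholders of mine, not values from the benchmark, and the assumption that run time scales inversely with clock rate is a simplification (memory-bound code will scale less).

```python
def parallel_efficiency(t1, tn, n_cores):
    """Return (speedup, efficiency) from 1-core and n-core wall-clock times."""
    speedup = t1 / tn
    return speedup, speedup / n_cores


def clock_adjusted_efficiency(t1, tn, n_cores, f1_ghz, fn_ghz):
    """Return (speedup, efficiency) with the Turbo Boost clock drop removed.

    Assumption: run time is inversely proportional to core clock rate.
    """
    # Time the 1-core run would have taken at the (lower) n-core clock rate.
    t1_at_fn = t1 * f1_ghz / fn_ghz
    return parallel_efficiency(t1_at_fn, tn, n_cores)


if __name__ == "__main__":
    # Placeholder inputs only, NOT measurements from this benchmark.
    t1, t4 = 1000.0, 400.0   # seconds, 1-core and 4-core runs
    f1, f4 = 3.4, 3.0        # GHz, clock with 1 and with 4 active cores

    raw = parallel_efficiency(t1, t4, 4)
    adjusted = clock_adjusted_efficiency(t1, t4, 4, f1, f4)
    print("raw speedup/efficiency:      %.2f / %.2f" % raw)
    print("adjusted speedup/efficiency: %.2f / %.2f" % adjusted)
```

If the adjusted efficiency is still well below 1, the clock drop alone does not account for the poor scaling, which is the point made above.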
Regards,
Costas