Dear Jean-Michel,
I am currently testing Telemac-3D with the Wesel (3D) example on a MIC (a.k.a. Xeon Phi) coprocessor in the so-called native mode, i.e. executing entirely on the (separate) PCIe card. The MIC has 60 x86 cores which are "weak", with a limited instruction set compared to a Xeon and only 512 KB of cache each, but each is equipped with a SIMD vector unit twice as long as that of a Xeon core: 512 bits, i.e. a vector length of 16 in single precision or 8 in double precision. Because each card runs its own operating system, you have the impression of being logged in to a (very) parallel Linux shared-memory machine with 60 (weak) vector processors. You can immediately compile and run any parallel legacy program you already have.
However, Telemac is poorly optimized for serial execution (we know why), so the MIC cores, although numerous, cannot do better than Xeon CPU cores, and the (too few) vectorized parts cannot be sped up much with a vector length of 8... One MIC is usually about 10 times slower than 16 Xeon cores; at best, with all resources and tricks, it is still 6 times slower.
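To illustrate why a longer vector unit cannot rescue a code that is mostly scalar, here is a rough Amdahl-type estimate in Python. The vectorized fractions (10%, 50%, 90%) are invented for illustration and are not measurements from Telemac; the function name is mine, and perfect vector efficiency is assumed.

```python
def vector_speedup(vector_fraction, lanes):
    """Amdahl-type upper bound on the overall speedup.

    Only `vector_fraction` of the run time benefits from SIMD,
    and that part runs at most `lanes` times faster.
    """
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / lanes)


# Illustrative fractions only (assumed, not measured on Telemac).
for fraction in (0.1, 0.5, 0.9):
    for lanes in (4, 8, 16):
        bound = vector_speedup(fraction, lanes)
        print(f"vectorized {fraction:.0%}, {lanes:2d} lanes: at most {bound:.2f}x")
```

If only 10% of the run time sits in vectorized loops, the bound stays below 1.11x whatever the vector length, so it is the scalar speed of the core that decides the outcome.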
What is annoying is that, with all (auto-)vectorization on, the results differ (very much!) between consecutive runs on the MIC. Only switching vectorization off (-no-vec -no-simd) gives reproducible results. I therefore checked all the forced vectorization left over from the mighty Telemac@Cray past, but unfortunately in vain: removing all the IVDEP directives changes nothing.
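One possible, unconfirmed, source of such differences is that a vectorized reduction adds the numbers in a different order than scalar code, and floating-point addition is not associative. The Python sketch below only demonstrates that effect on made-up data; it does not reproduce the MIC behaviour. To explain run-to-run differences, the summation order would also have to change from run to run (for instance with data alignment), which is purely my assumption.

```python
def sum_sequential(values):
    """Plain left-to-right summation, as in a scalar loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_in_lanes(values, lanes):
    """Mimic a SIMD reduction: `lanes` partial sums, combined at the end."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sum_sequential(partial)


# Made-up data mixing very large and very small magnitudes.
values = [1.0e16, 1.0, -1.0e16, 1.0] * 1000

print("sequential:", sum_sequential(values))
print("2 lanes   :", sum_in_lanes(values, 2))
print("4 lanes   :", sum_in_lanes(values, 4))
```

The three calls return three different values for the same input, purely because of the order of the additions.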
We are a bit worried about the new processors: the coming "Knights Landing" will have 72 stronger Atom cores and a SIMD length doubled compared to the MIC, and it can be used as the main node processor, not only as a coprocessor. Unfortunately, Telemac does not like the new many-core architecture.
Best regards,
Jacek