TOPIC: Potential vectorization errors in BIEF routines

Potential vectorization errors in BIEF routines 10 years 5 months ago #13233

  • jaj
Dear all,

This message concerns the BIEF library rather than Telemac-3D, but I did not know where else to post it.

While searching for the causes of "non-reproducible" Telemac-3D results between consecutive runs, I found that in the following BIEF routines:

cflp12.f
cg1112.f
cg1113.f
vc08bb.f

the forced vectorization directives !DIR$ IVDEP placed before some loops are always active. This means that when one uses the Intel Fortran Compiler with optimisation -O2 or higher on an Intel Xeon processor, these loops are forcibly vectorized for the processor's SIMD unit, whatever its vector length. If the mesh has not been sorted appropriately, the results may be unpredictable (does anyone remember that the mesh has to be sorted for a vector processor?). This happens even when a "serial machine", LVMAC=LV=1, is declared.

In principle, clean programming with forced vectorization requires the approach used, for example, in assve1.f:

      IF(LV.EQ.1) THEN
!
!     SCALAR MODE
!
        DO 40 IELEM = 1 , NELEM
          X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
40      CONTINUE
!
      ELSE
!
!     VECTOR MODE
!
        DO 60 IB = 1,(NELEM+LV-1)/LV
!VOCL LOOP,NOVREC
!DIR$ IVDEP
          DO 50 IELEM = 1+(IB-1)*LV , MIN(NELEM,IB*LV)
            X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
50        CONTINUE
60      CONTINUE
!
      ENDIF

Here the forced vectorization is active only when the user explicitly requests it (LV greater than 1), on the understanding that the mesh has been sorted appropriately.
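To make the hazard concrete, here is a minimal sketch in Python (not Telemac code) that models what a forcibly vectorized loop does to the update X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM): it gathers a block of LV values, adds, and scatters the block back. The function names, the 0-based indices and the tiny data set are assumptions made for illustration, and real SIMD hardware may resolve duplicate writes differently from this simplified "last write wins" model.

```python
# Simplified model of the indirect update X(IKLE(I)) = X(IKLE(I)) + W(I).
# Indices are 0-based here, unlike the Fortran original.

def assemble_scalar(ikle, w, npoin):
    """Reference result: one element at a time (the LV = 1 branch)."""
    x = [0.0] * npoin
    for ielem in range(len(ikle)):
        x[ikle[ielem]] += w[ielem]
    return x

def assemble_simd_model(ikle, w, npoin, lv):
    """Model of a forcibly vectorized loop with vector length lv:
    gather lv old values, add, then scatter them back as one block."""
    x = [0.0] * npoin
    for start in range(0, len(ikle), lv):
        block = range(start, min(len(ikle), start + lv))
        gathered = [x[ikle[i]] for i in block]            # gather old values
        updated = [g + w[i] for g, i in zip(gathered, block)]
        for value, i in zip(updated, block):              # scatter back
            x[ikle[i]] = value                            # duplicates overwrite
    return x

# Hypothetical data: elements 0 and 1 both contribute to node 0.
ikle = [0, 0, 1, 2]
w = [1.0, 2.0, 3.0, 4.0]

print(assemble_scalar(ikle, w, 3))         # node 0 receives 1.0 + 2.0
print(assemble_simd_model(ikle, w, 3, 4))  # node 0 keeps only the last write
print(assemble_simd_model(ikle, w, 3, 1))  # lv = 1 matches the scalar loop
```

In this model a contribution is lost as soon as a node index repeats inside one block, which is why the element numbering matters once the dependence check is switched off.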

NOTE: Bugs of this kind can be especially nasty: they behave randomly and can go undetected for years...

Best regards,
Jacek

PS. Vectorization can be switched off with -no-vec -no-simd.
The administrator has disabled public write access.

Potential vectorization errors in BIEF routines 10 years 5 months ago #13239

  • jmhervouet
Hello Jacek,

This is stunning; it was totally overlooked (I'll remove the lines). But it could be interesting: do you think we should revive the good old renumbering process and force vectorisation? Many years ago vectorisation gave a speed-up of about a factor of 10 in Telemac; it was left aside as parallelism grew more powerful.

Regards,

Jean-Michel

P.S. You will be interested to know that reproducibility between scalar and parallel runs of Tomawac will come with the next version, 7.0. Finite element assembly is done with 8-byte integers. For the other modules there remains the problem of the dot product of large vectors, which precludes the use of integers.
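The idea behind integer assembly can be sketched as follows. This is only an illustration in Python, not the scheme used in Tomawac: the scaling factor SCALE, the function names and the sample values are assumptions. Floating-point addition depends on the order of the contributions, whereas integer addition does not, so a sum assembled in a different order (for instance across subdomains) gives the same result.

```python
# Sketch: floating-point assembly depends on summation order,
# integer assembly does not. SCALE is an assumed value for illustration.

SCALE = 2 ** 40                 # hypothetical fixed-point scaling factor
INT64_MAX = 2 ** 63 - 1         # range of an 8-byte signed integer

def float_sum(values):
    """Plain left-to-right accumulation in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total

def integer_sum(values):
    """Convert each contribution to a scaled integer, then accumulate."""
    total = 0
    for v in values:
        total += round(v * SCALE)
        assert abs(total) <= INT64_MAX, "would overflow an 8-byte integer"
    return total / SCALE

contributions = [0.1, 0.2, 0.3]
reordered = list(reversed(contributions))

print(float_sum(contributions) == float_sum(reordered))      # False
print(integer_sum(contributions) == integer_sum(reordered))  # True
```

The overflow check hints at why this approach suits assembly, where few contributions meet at a node, better than a dot product over a very large vector.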

Potential vectorization errors in BIEF routines 10 years 5 months ago #13242

  • jaj
Dear Jean-Michel,

I am currently testing Telemac-3D with the Wesel (3D) example on a MIC (alias Xeon Phi) coprocessor in the so-called native mode, i.e. executing entirely on the (separate) PCIe card. The MIC has 60 x86 cores which are "weak" (a limited instruction set compared to a Xeon, and only 512 KB of cache each), but each is equipped with a SIMD vector unit twice as long as that of a Xeon core: 512 bits, i.e. a vector length of 16 in single precision or 8 in double precision. Because each card runs its own operating system, you have the impression of being logged in to a (very) parallel shared-memory Linux machine with 60 (weak) vector processors. You can immediately compile and run any parallel legacy program you already have.

However, Telemac is badly optimized for serial execution (we know why), so the serial cores, although numerous, cannot do better than Xeon CPU cores, and the (too few) vectorized parts cannot be sped up much with a vector length of 8... With one MIC I usually see a slowdown by a factor of 10 compared to 16 Xeon cores; at best, with all resources and tricks, it is 6 times slower.

What is annoying is that, with all (auto-)vectorization on, the results differ (a lot!) between consecutive runs on the MIC. Only switching vectorization off (-no-vec -no-simd) gives predictable results. I therefore checked all the forced vectorisation left over from the mighty Telemac@Cray past, but unfortunately in vain: throwing out all the IVDEPs changes nothing.

We are a bit concerned about the new processors: the coming "Knights Landing" will have 72 stronger Atom cores and a SIMD length doubled compared to the MIC, and can be used as the main node processor, not only as a coprocessor. Unfortunately, Telemac does not like the new many-core architecture.

Best regards,
Jacek

Potential vectorization errors in BIEF routines 10 years 5 months ago #13249

  • Lufia
Dear all,

Are these routines really called by telemac3d in the Wesel example? I put some simple WRITE statements into the routines to check whether they get called, but got no output.

I'm not an expert in BIEF, but from the comments in the source code it looks as if these parts of BIEF are used for the QUASI-BUBBLE elements?

Best regards,

Leo

Potential vectorization errors in BIEF routines 10 years 5 months ago #13250

  • jmhervouet
Hello,

Exactly. In fact, cg1112.f and cg1113.f can be vectorised without risk (and a compiler may well do it without being told); only cflp12.f and vc08bb.f may have backward dependencies that would require a specific numbering of elements, so I have removed the CDIR$ IVDEP from these last two subroutines for future versions.
Now, if we want to take advantage of vectorisation, we should work on the matrix-vector product with edge-based storage, which raises the question of how elements are numbered: a brand new subject...
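Whether a given numbering is safe for a given vector length can be checked mechanically, at least for the simplest kind of dependence, an indirect update through one index array. The sketch below is an illustration in Python, not a BIEF routine; the function name and the sample index array are hypothetical, and the actual dependencies in cflp12.f and vc08bb.f may be of a different form.

```python
def conflicting_blocks(ikle, lv):
    """Return the start positions of the blocks of lv consecutive
    elements in which some target index occurs more than once."""
    conflicts = []
    for start in range(0, len(ikle), lv):
        block = ikle[start:start + lv]
        if len(set(block)) < len(block):   # a repeated index in this block
            conflicts.append(start)
    return conflicts

ikle = [0, 1, 2, 3, 0, 4, 4, 5]     # hypothetical element-to-node map
print(conflicting_blocks(ikle, 4))  # the second block holds node 4 twice
print(conflicting_blocks(ikle, 2))  # no repeats with a shorter vector length
```

An empty result means that, for this index array and this vector length, ignoring the assumed dependence does not change the result of the update.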

Regards,

Jean-Michel

Potential vectorization errors in BIEF routines 10 years 5 months ago #13252

  • jaj
Hello,

Yes, it is unfortunately true that removing the potential vectorization errors in the routines in question does not help in the Telemac-3D Wesel example. When I noticed that the results are "non-reproducible" with vectorization on and reproducible with vectorization off, I immediately, as a veteran user of Cray and Fujitsu vector computers, searched the sources for forced vectorization directives and discovered the bugs mentioned above. In vain: the results on the MIC are still not reproducible. So this is not the end of the story. I think I have to try other examples; maybe this disease affects only Telemac-3D, or some specific part of it. Anyway, -no-vec -> no problem.

However, this might explain the strange "random" errors that also occur from time to time on the Xeon CPU (or other CPUs with a SIMD vector unit) when the Intel compiler, at higher optimisation levels, uses the SSE and/or AVX instruction sets. I was shocked to see one of these "Telemac ghost appearances" on a normal Xeon CPU, which is what triggered my more serious little investigation of the code.

Best regards,
Jacek

Potential vectorization errors in BIEF routines 10 years 5 months ago #13254

  • jaj
Dear Jean-Michel,

Quite a few years ago (ca. 2007?) we did a small study with Telemac-2D on the use of a truly parallel Hitachi vector computer. In theory this would be an ideal architecture for Telemac, which was written on a vector computer for a vector computer, with 70-80% of the code once vectorized (a very good result!), and then parallelized by domain decomposition. Perfect?

Unfortunately it turned out that, although all the data structures and most of the loops were still ready for vectorization (and therefore cause annoying, massive cache misses on serial processors...), we had to go back to the pre-2000 routines with all that element-by-element (EBE) storage in order to get vectorized runs, and then only on -one- vector processor. The reason was that we could sort only for EBE and not for edge-based storage and, of course, only whole meshes and not their partitions for MPI-parallel runs... And the sorting programs ran for hours and hours... Urgh! We gave up.

The situation is similar for the many-core architecture (MIC: cores augmented with SIMD vector units): because of the small caches, the vector data structures punish you with cache misses even more than on CPU cores, and because the sorting routines for forced vectorization are missing, you cannot use the vector units effectively. Compiler auto-vectorization (supposedly safe?) improves execution time by only ca. 5%.

So it seems we are stuck in the past, aren't we?
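For readers who have never seen such a sort, the following Python sketch shows one very simple greedy strategy. It is an assumption made for illustration and not the renumbering program used with Telemac: it takes a single target node per element, and the function name and data are hypothetical. It orders the elements so that any lv consecutive elements update distinct nodes, and it gives up when it gets stuck.

```python
def reorder_for_vector_length(ikle, lv):
    """Greedy reordering of elements so that any lv consecutive elements
    update distinct nodes. Returns the new element order, or None when
    this simple strategy gets stuck."""
    remaining = list(range(len(ikle)))
    order = []
    while remaining:
        # Nodes touched by the previous lv - 1 elements must be avoided.
        recent = {ikle[e] for e in order[-(lv - 1):]} if lv > 1 else set()
        for pos, elem in enumerate(remaining):
            if ikle[elem] not in recent:
                order.append(remaining.pop(pos))
                break
        else:
            return None   # every remaining element would repeat a node
    return order

ikle = [0, 0, 1, 1, 2, 2]                  # hypothetical element-to-node map
print(reorder_for_vector_length(ikle, 3))  # interleaves the three nodes
print(reorder_for_vector_length([0, 0, 0], 2))
```

Even this toy version is quadratic in the number of elements in the worst case, and it would have to be rerun for every mesh partition and every vector length.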

Best regards,
Jacek

Potential vectorization errors in BIEF routines 10 years 5 months ago #13253

  • Lufia
Hello,

I've started a small test of the WESEL example with the gfortran compiler and the standard compiler flags for openSUSE. My local machine is an Intel i5-3470; so far (after 3 runs) the results are reproducible, but more tests are needed.

Maybe the Intel compiler and its aggressive optimization/vectorization are the problem?

Best regards,

Leo

Potential vectorization errors in BIEF routines 10 years 5 months ago #13255

  • jaj
Dear Leo,

I must admit I use gfortran only sporadically; I have had some problems with it in the past (the UnTRIM user interface was "too modern"). Please check whether it forcibly vectorizes the loops in question (important) and whether they are executed at all in your example. With Intel, this would be -vec-report=3 or higher.

Yes, I do not like the Intel compiler optimizations either; they do far too much for speed at the cost of accuracy.

Best regards,
Jacek

Potential vectorization errors in BIEF routines 10 years 5 months ago #13256

  • jaj
Hello,

To close this thread: it turns out that the results of Telemac-3D (Wesel example) and Telemac-2D (Donau example) are perfectly reproducible between consecutive runs on the MIC (Xeon Phi) processor when, in addition to the Intel Fortran optimization -O2 (auto-vectorization on!), one selects the floating-point model "source", i.e. "-fp-model source". This means that all intermediate results are rounded to the source-defined precision. The default is "-fp-model fast=1", which seems to be not so good for Telemac. See the Intel compiler manual for details.

(This may also be a hint for other processors...)
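One plausible mechanism, offered here as an assumption and not established anywhere in this thread, is that a fast floating-point model lets the compiler reorder sums when it vectorizes a reduction: some leading values are handled one by one, the rest are split over several partial sums, and the count of leading values may depend on where the array happens to sit in memory. The Python sketch below models only that reordering; the function names, the lane count and the data are hypothetical.

```python
def simd_style_sum(values, lanes, peel):
    """Model of a vectorized reduction: `peel` leading values are summed
    one by one, the rest go round-robin into `lanes` partial sums, and
    the partial sums are combined at the end."""
    head = 0.0
    for v in values[:peel]:
        head += v
    partial = [0.0] * lanes
    for i, v in enumerate(values[peel:]):
        partial[i % lanes] += v
    total = head
    for p in partial:
        total += p
    return total

def source_order_sum(values):
    """Summation strictly in the order written in the source."""
    total = 0.0
    for v in values:
        total += v
    return total

data = [1e16, 1.0, -1e16, 1.0]  # hypothetical values chosen to expose rounding
print(simd_style_sum(data, lanes=2, peel=0))
print(simd_style_sum(data, lanes=2, peel=2))
print(source_order_sum(data))
```

In this model the same data gives different sums for different peel counts, while summation in source order has a single possible result, which would be consistent with the observation above.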

Best regards,
jaj