Thank you for all your replies. Your messages helped me understand how hard my optimization idea will be to implement.
Our test model consists of 754906 elements corresponding to 384691 nodes. In the future we plan to target domains of 20 million elements or more, so we are looking for ways to improve Telemac performance.
About the one-week run time: my mistake. My computer has two execution modes and I had chosen a buggy one, which is why it took so long.
After changing the mode, performance is much better.
"telemac computation capabilities had been tested and nowadays, the most used solution to reduce computational time is to use parallelism capabilities"
I know Telemac is one of the best-performing CFD codes, which is why I chose it. But even the best want to get better, and I believe Telemac will perform even better in the future.
My post here just shares my idea about improving/optimizing Telemac. Maybe someone has already done this for their own purposes, test cases or examples. Or maybe someone will share his or her experience, like: "Oh man, I attempted to modify this code one year ago and I failed" or "You might use the latest technologies or approaches like hybrid MPI + OpenMP or Process in Process".
Back to the code above: I think the main problem is that the compiler cannot automatically vectorize the loop because of possible data dependencies between iterations. In addition, from one iteration to the next the program has to jump back and forth in memory to reach the elements of X and Y, instead of accessing them with unit stride.
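To make the pattern concrete, here is a minimal C sketch of what I mean. It is not the TELEMAC source and not my Fortran test program: it assumes a scatter-add loop of the form x[g1[i]] += xa1[i] * y[g2[i]], uses 0-based indices, and the function names are made up for illustration.

```c
#include <assert.h>

/* Assumed shape of the original loop: gather from y and scatter-add into x
   in a single pass. Two iterations may update the same x[g1[i]], so the
   compiler has to assume a dependency and keep the loop scalar. */
void scatter_add_fused(double *x, const double *xa1, const double *y,
                       const int *g1, const int *g2, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        x[g1[i]] += xa1[i] * y[g2[i]];
}

/* Split form: the first loop writes z[i] with unit stride and has no
   dependency between iterations (only the read of y is indirect), so it is
   a candidate for vectorization. The second loop keeps the scalar
   scatter-add. z is a caller-provided work array of length n. */
void scatter_add_split(double *x, double *restrict z, const double *xa1,
                       const double *y, const int *g1, const int *g2, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        z[i] = xa1[i] * y[g2[i]];
    for (int i = 0; i < n; i++)
        x[g1[i]] += z[i];
}
```

As long as x and y do not share storage, both versions add the same products into x in the same order, so the results should be identical. Whether the split version is faster depends on the compiler and on the extra memory traffic for z.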
I wrote a small program to test my ideas. First, this program executes the original loop.
Second, I wrote a new subroutine sumMult(Z, XA1, Y, G1, G2, N) like this:
Third, I compiled the module OptimizedLib.f90 (which contains sumMult) with option -O3. The main program was still compiled with -O0.
Running the loop 10000 (ten thousand) times, with each of the six arrays (X, XA1, XA2, Y, G1, G2) holding about 280 thousand values (real or integer), I estimate that my modification saves about 59% of the computation time of the original loop, i.e. a speedup of roughly 2.4x.
Of course, it is too early to judge its performance. I still need to confirm that my suggestion works with the real program, real data and real hardware.
There are two big questions:
1. When compiling TELEMAC, is it possible to use different optimization options for different Fortran modules?
2. Are the six arrays (X, XA1, XA2, Y, G1, G2) independent of each other? For example, a change to X will not affect the values of the elements of the others, will it?
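Question 2 matters because reordering the loop is only safe if X and Y never share storage. Here is a small C sketch of what would go wrong otherwise. It again assumes a scatter-add of the form x[g1[i]] += a[i] * y[g2[i]] (an assumption, not the TELEMAC source), with 0-based indices, made-up data and a hypothetical function name.

```c
/* Runs the assumed loop on a 2-element array where y is the SAME array as x,
   once in the original single-pass form and once with the products computed
   first and scattered afterwards. Both start from {1, 2}. */
void alias_demo(double fused[2], double split[2])
{
    const int g1[2] = {0, 1};
    const int g2[2] = {1, 0};
    const double a[2] = {1.0, 1.0};
    double z[2];

    /* Single pass: iteration 1 already sees the value updated by iteration 0. */
    fused[0] = 1.0;
    fused[1] = 2.0;
    for (int i = 0; i < 2; i++)
        fused[g1[i]] += a[i] * fused[g2[i]];   /* ends as {3, 5} */

    /* Two passes: every product is computed from the old values. */
    split[0] = 1.0;
    split[1] = 2.0;
    for (int i = 0; i < 2; i++)
        z[i] = a[i] * split[g2[i]];
    for (int i = 0; i < 2; i++)
        split[g1[i]] += z[i];                  /* ends as {3, 3} */
}
```

If X and Y are always distinct arrays in TELEMAC, this situation cannot occur and the two forms agree.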
Thanks in advance