
TOPIC: TOMAWAC not running in parallel

TOMAWAC not running in parallel 7 years 6 months ago #26214

  • Phelype
  • Senior Boarder
  • Posts: 140
  • Thank you received: 64
Hello,

I have been experiencing a very strange behaviour in TOMAWAC when running in parallel.

I have been trying to run a simulation (link below) on two different computers: my own and an HPC cluster. The simulation files are exactly the same, but the results are completely different.

On the cluster the results are fine, with wave heights around 6-7 metres. But on my computer the results don't reach 1 metre anywhere in the domain.

So I tried alternately turning the surface wind and the boundary wave conditions on and off to see where the problem is. The results are:
-> Turning off the wind on my computer, the results are zeroed out.
-> Turning off the wind on the cluster, the results are slightly reduced.
-> Turning off the boundary on my computer, the results don't change.
-> Turning off the boundary on the cluster, the results become identical to my computer's with both wind and waves on.

The conclusion here is pretty obvious: the boundary conditions are not being taken into account when I run the simulation on my computer. But I have no clue what is going on, because the simulation is exactly the same, running on the same version of TOMAWAC, only on different computers.

Does anyone have an idea of what is going on?

Thanks in advance.

Best regards,

Phelype

Link to download the files: www.dropbox.com/s/4ifk9wa6x0nc16l/sim_ufrgs.rar?dl=0

TOMAWAC not running in parallel 7 years 6 months ago #26215

  • tfouquet
  • Moderator
  • Posts: 294
  • Thank you received: 112
Hello Phelype,

I would love to help you understand, but unfortunately I cannot access Dropbox (our internet policy), so could you please make your files available somewhere else, or simply upload them here.

Best regards

thierry

TOMAWAC not running in parallel 7 years 6 months ago #26217

  • Phelype
  • Senior Boarder
  • Posts: 140
  • Thank you received: 64
Hello Thierry,

I tried to upload the files here on the forum, but I think they exceed the size limit (~40 MB). Which server can you download from?

TOMAWAC not running in parallel 7 years 6 months ago #26222

  • tfouquet
  • Moderator
  • Posts: 294
  • Thank you received: 112
Hey Phelype

I tried to run your case on one processor; it is very big (one million elements). Are you sure you ran exactly the same case on one processor?

Secondly, I noticed that your NUMBER OF SUB-ITERATIONS is very big (more than 2000); in general that means your time step is far too big (or your mesh far too fine).
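
In the steering file that usually means lowering the propagation time step, for example something like this (the value below is purely indicative, you have to adapt it to your own mesh):

TIME STEP : 60.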

I will try to see if I spot something else, but I think we should start with a smaller problem.

best regards

Thierry

TOMAWAC not running in parallel 7 years 6 months ago #26224

  • Phelype
  • Senior Boarder
  • Posts: 140
  • Thank you received: 64
Thierry,

I forgot to mention: I used the same number of processors, 12, on both machines, in order to rule out the possibility of a parallelism issue.

And yes, the mesh is very large. It covers the whole Brazilian coast, with good refinement near the coast.

I hadn't noticed this issue with the number of sub-iterations. I'll change that in further simulations.

I can make a smaller mesh, with fewer elements, to perform the tests if you feel it is needed.

Thank you very much for your help.

Best regards,

Phelype

TOMAWAC not running in parallel 7 years 6 months ago #26255

  • tfouquet
  • Moderator
  • Posts: 294
  • Thank you received: 112
I realise that I misunderstood: I thought it was 12 processors on the cluster and one on your computer, and that it was a parallelism problem. So I don't see the reason for such a difference between the two computers. I guess it is a difference from one compiler to another; which compiler are you using?

I think that, as a first step, you don't need such a fine mesh. Try to reproduce the event with a far coarser mesh; it will be easier to debug.

TOMAWAC not running in parallel 7 years 6 months ago #26259

  • Phelype
  • Senior Boarder
  • Posts: 140
  • Thank you received: 64
I am using GFortran on both computers: GFortran 4.8.4 on both the cluster and my own machine.

The MPI, on the other hand, is different. On my computer I use OpenMPI, and on the cluster they use (according to the support staff) MPT 2.03.
But I ruled out the MPI as the cause, since (as far as I understand) it is only used for the communication at the boundary nodes between the sub-domains.
So, if it were the MPI, each core would still do its job correctly inside its own sub-domain, am I right?

I also tried running the simulation with TOMAWAC v6p3 and v7p2. Both give the same results.

I'll make a smaller mesh, once I'm finished I'll post it here.

TOMAWAC not running in parallel 7 years 5 months ago #26686

  • Phelype
  • Senior Boarder
  • Posts: 140
  • Thank you received: 64
Hello Thierry and community,

Recently I tried again to get TOMAWAC working in parallel. I tried with a small mesh, as Thierry suggested. It really helped with the debugging process, but I did not succeed.

I found out (to some extent) why the boundary conditions aren't being included in the calculation, but I don't know how to solve it.

Basically, what happens is that in the propagation step, when TOMAWAC calls the subroutine POST_INTERP (which calls BIEF_INTERP), the barycentric interpolation coefficients are mostly NaN. And when these NaNs enter the wave spectrum matrix, they enter as zero, so all the waves are killed. After that TOMAWAC integrates the source terms, which explains why the wind is still taken into account.
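
For reference, the way I spotted them was the usual "a NaN is not equal to itself" test, roughly like the standalone sketch below (SHP is only a placeholder name for the array of barycentric coefficients, not necessarily the real variable in the TOMAWAC sources):

      ! Standalone illustration of the NaN check used on the
      ! barycentric coefficients (SHP is only a placeholder name).
      PROGRAM CHECK_NAN
        IMPLICIT NONE
        DOUBLE PRECISION :: SHP(3), ZERO
        INTEGER :: I, NNAN
        ZERO = 0.D0
        SHP(1) = 0.2D0
        SHP(2) = 0.8D0
        SHP(3) = ZERO/ZERO                        ! 0/0 produces a NaN at run time
        NNAN = 0
        DO I = 1, 3
          IF (SHP(I).NE.SHP(I)) NNAN = NNAN + 1   ! true only for a NaN
        ENDDO
        PRINT *, 'NaN coefficients found: ', NNAN
      END PROGRAM CHECK_NAN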

So now the problem is: why are these barycentric interpolation coefficients NaN? I know they come from the method of characteristics, but I don't know how this method works, so I don't even know where to begin trying to solve the problem.

You'll find attached the files needed to start the simulation. The Fortran files are commented, indicating the steps towards the conclusion above (which I am not sure is correct).

Any help is much appreciated.

Best regards,

Phelype

P.S.: In case the upload doesn't work, here's a Google Drive link: drive.google.com/open?id=0B8r99pzQjmbQektidGF0V3JPUW8

TOMAWAC not running in parallel 7 years 5 months ago #26695

  • tfouquet
  • Moderator
  • Posts: 294
  • Thank you received: 112
Hello Phelype

In fact you're lucky: this is a bug I discovered yesterday. As it does not happen with every compiler, we had not encountered it before.
It happens only when you use spherical coordinates. The latitude of the origin point is not set, so the compiler sets it to NaN and it then propagates to other variables. I will add a keyword to the dictionary to set it by default, but meanwhile you can simply add "lambd0=48." in your file wac.F before the call to inbief.
I don't know yet what the default value should be, but in telemac2d this value is 48.
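
Just as a sketch of where the line goes (the surrounding code of wac.F is only hinted at in the comments, and 48. is simply the value borrowed from telemac2d):

!     In wac.F, right before the existing call to INBIEF: give the
!     latitude of the origin point an explicit value instead of
!     leaving it uninitialised.
      LAMBD0 = 48.D0
!     ... the existing CALL INBIEF(...) then follows unchanged ...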

Hope it helps

Thierry
The following user(s) said Thank You: Phelype

TOMAWAC not running in parallel 7 years 5 months ago #26698

  • Phelype
  • Senior Boarder
  • Posts: 140
  • Thank you received: 64
o_o

Well, I really thought it was a bigger problem.

So what does this latitude of the origin point physically mean? Is it the latitude of the first node of the mesh? Or of the southernmost node?
I would like to set it automatically, to avoid (my own) human error ;)

About the bug, I may have the explanation. In the configuration file, among the options gfortran can be used with, there is -finit-real, which sets the value that real variables get at initialization. In my configuration file this option is "-finit-real=nan", which explains the NaN; on the HPC cluster there is no such option, so the initial value ends up being (with my version of gfortran) a tiny number (2.0730915448995370E-317 in double precision). That's why the simulation runs there. Other compilers may have an equivalent option.
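
If anyone wants to see the difference, a tiny test like this (my own sketch, nothing from the TOMAWAC sources) shows it when compiled once with and once without "-finit-real=nan", e.g. "gfortran -finit-real=nan init_test.f90":

      ! init_test.f90 - prints an uninitialised real; with
      ! -finit-real=nan it starts out as NaN, without that flag you
      ! get whatever the compiler happens to leave there.
      PROGRAM INIT_TEST
        IMPLICIT NONE
        DOUBLE PRECISION :: LAMBD0
        PRINT *, 'LAMBD0 before any assignment: ', LAMBD0
        PRINT *, 'Is it NaN?                    ', LAMBD0.NE.LAMBD0
      END PROGRAM INIT_TEST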

Thank you very much for your help, Thierry.

All the best,

Phelype