TOPIC: Memory leak when using method of characteristics

Memory leak when using method of characteristics 12 years 1 day ago #6475

  • jeremie
  • Junior Boarder
  • Hydro-Quebec
  • Posts: 39
  • Thank you received: 7
Hi all,

We've been experiencing a memory leak for quite some time now with Telemac-2D on our cluster. The problem appears in parallel runs and is a real hindrance, since the calculation is terminated as soon as memory is saturated on the node with the least memory.

The problem only appears when TYPE OF ADVECTION = 1;5 is used. It disappears when another advection scheme is used for u and v (for example 6;5). The rate at which memory saturates is also a function of the number of processors.

This problem seems very esoteric to us, since it is sensitive to our choice of advection scheme. We've experienced it on versions V5P9 through V6P2. Could it be that the method of characteristics uses some specific libraries that are not compiled properly on our system?

Any suggestion as to what to look for, and where, would be highly appreciated. Our T2D distribution is compiled with Intel Fortran and we use Intel MPI (impi) for parallelism. Attached are some graphs of memory usage for two parallel cases (300 procs vs. 144 procs).

Regards,

jeremie


300cpu.75Tparh.jpg


144cpu.3Tparh.jpg
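
To compare runs such as the 300-processor and 144-processor cases above with numbers rather than by eye, one option is to fit a straight line through the memory samples of a node. The following is only a minimal sketch: the function names, the sample format (time in seconds, memory in MB) and the assumption that the growth is roughly linear are all hypothetical, not something taken from Telemac or from the attached graphs.

```python
def growth_rate(samples):
    """Least-squares slope of memory (MB) against time (s).

    samples: list of (time_s, memory_mb) pairs (hypothetical format),
    with at least two distinct time stamps.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same time stamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def seconds_until_full(samples, capacity_mb):
    """Extrapolate linearly from the last sample to the node capacity.

    Returns None when the fitted slope shows no growth.
    """
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    _, last_m = samples[-1]
    return max(0.0, (capacity_mb - last_m) / rate)
```

Feeding it one list of samples per run gives a growth rate in MB/s for each processor count, which makes it easier to see how the rate changes with the number of processors.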
The administrator has disabled public write access.

Memory leak when using method of characteristics 12 years 1 day ago #6477

  • jmhervouet
Hello,

I understand that the memory used grows with time. This has been reported once, on one machine, by Sogreah-Artelia (I do not know how they solved it). It could be a dynamic allocation that accumulates over time because the compiler does not release it (but in streamline.f, where the characteristics are computed, the memory is allocated once and for all). I would suspect some MPI-related effect when data are transferred from one processor to another. This is also sometimes an effect of automatic arrays with some compilers: if you declare X(NPOIN) locally and NPOIN is an argument, the array is allocated at every call, and with some compilers this leads to a crash (automatic arrays are not standard Fortran 77). We never observed this at EDF on our various machines, though we run very large numbers of iterations.

Hope that helps (but not sure...)

Jean-Michel

Memory leak when using method of characteristics 12 years 1 day ago #6478

  • jeremie
  • Junior Boarder
  • Hydro-Quebec
  • Posts: 39
  • Thank you received: 7
Thank you, Jean-Michel, for the quick response.

I forgot to mention that at first I suspected some of our user Fortran routines. Some of them loop repeatedly over all NPOIN points and also save some variables for later use. However, even simple cases without any customized Fortran exhibit the same behavior.

Our Telemac source code is pulled directly from the public Subversion repository and compiled as is. That is why I suspect my compilation configuration might matter. I will attach the systel.ini file once I have cleaned it up a little.

jeremie

Memory leak when using method of characteristics 11 years 11 months ago #6527

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Hi Jeremie,
As Jean-Michel wrote, we are facing a similar problem on our cluster with T3D.
For the moment, we have no solution and no idea about the origin of the problem.
Like you, we observe this phenomenon with the default executable in 6.1 and 6.2.

We had some exchanges with Jean-Michel and also Chi-Tuan, but with our model the problem never appears on the EDF computers...

We are trying to investigate the problem with BULL, the builder of our cluster, but without any success so far.
Personally, I have run a few tests on my Windows laptop and the same phenomenon seems to exist there, but with fewer processors the problem appears very slowly...

If you find something or want to discuss this particular problem, don't hesitate to contact me.

Best regards,
Christophe

Memory leak when using method of characteristics 10 years 4 weeks ago #14589

  • VALENTIN
Hi all,
We are facing the same problem with Telemac-2D v6p3 on our cluster.
Memory usage seems to grow as more processors are added.

Do you still have this problem, or have you found a solution since?

Thanks.

Memory leak when using method of characteristics 10 years 1 week ago #14803

  • VALENTIN
Can nobody help me? :( :(

Memory leak when using method of characteristics 10 years 1 week ago #14804

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
It is hard to help with so little information!
As the observed increase is very rapid, this sounds strange!
Could we have more information about your model (size), your computation configuration (steering file), and also your cluster?
Christophe

Memory leak when using method of characteristics 10 years 1 week ago #14805

  • jeremie
  • Junior Boarder
  • Hydro-Quebec
  • Posts: 39
  • Thank you received: 7
Hi Valentin,

Thank you for the reminder. Since my last post, we have found a temporary "hardware" solution.

On our cluster setup there appears to be a link between this memory leak and communication between our InfiniBand switches. Our cluster is made up of 40+ machines of different generations connected together through 2 switches (about 20 machines per network switch).

By chance, I ran into a case that didn't saturate memory and investigated a bit. I found out that if the calculation is kept on machines that are connected to the same network switch, the problem disappears. It reappears as soon as the calculation involves computers spread over 2 different switches.

That is the only fix we have found for now. It hinders our ability to run cases on the whole cluster, but that is not really a problem for us.


jeremie
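
For anyone who wants to try the same workaround, the idea of keeping a run on machines attached to a single switch can be sketched as below. This is only an illustration under stated assumptions: the host names, switch names and core counts are hypothetical, the host-to-switch mapping has to be supplied by hand, and the `host:cores` line format should be checked against the documentation of your own MPI launcher or scheduler.

```python
def hosts_on_one_switch(switch_of, cores_of, cores_needed):
    """Pick hosts that all hang off the same switch.

    switch_of: dict host -> switch name (hypothetical, filled in by hand)
    cores_of:  dict host -> number of usable cores
    Returns (switch, [hosts]) for the first switch that can supply
    cores_needed cores on its own, or None if no single switch can.
    """
    by_switch = {}
    for host, switch in switch_of.items():
        by_switch.setdefault(switch, []).append(host)

    for switch in sorted(by_switch):
        chosen, total = [], 0
        # Largest hosts first, so the host list stays short.
        for host in sorted(by_switch[switch], key=lambda h: (-cores_of[h], h)):
            chosen.append(host)
            total += cores_of[host]
            if total >= cores_needed:
                return switch, chosen
    return None


def machinefile_lines(hosts, cores_of):
    """Format one 'host:cores' line per host (assumed format)."""
    return ["%s:%d" % (host, cores_of[host]) for host in hosts]
```

If the function returns None, the case is too large for any single switch, which matches the limitation described above: the fix prevents running on the whole cluster.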
The following user(s) said Thank You: VALENTIN

Memory leak when using method of characteristics 10 years 1 week ago #14824

  • VALENTIN
Hello,
Thanks for the help.
We are running our model on an SGE cluster with 4 nodes of 10 processors each.
We only have one InfiniBand switch, so I think it's a different problem from yours, Jeremie. Thanks for the suggestion anyway.

Our model is attached below. Can you download it and run it on your cluster to check whether you have the same issue?
14_couplage_test_RAM.zip - 16.9 MB

Thanks.

Memory leak when using method of characteristics 10 years 1 week ago #14825

  • VALENTIN
Sorry for the previous post

The link to our model is:
www.filedropper.com/14couplagetestram

Thanks.
Moderators: borisb