TOPIC: telemac3d.py Segmentation fault when --ncsize > 61

telemac3d.py Segmentation fault when --ncsize > 61 3 years 4 months ago #38706

  • jdoe2
Hi,

I'm currently working on running TELEMAC-3D on large machines (64, 128, 224 CPUs). I chose to benchmark the compute nodes with examples provided in the TELEMAC source code (under ./examples/telemac3D/). The studies I launch are:
  • NonLinearWave
  • bendrans

Everything runs fine until I set --ncsize above 61. Then I get a segmentation fault (see the attached file for the complete output):

Program received signal SIGSEGV: Segmentation fault - invalid memory reference
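
To pin down exactly where the failure starts, one option is to bisect on --ncsize instead of trying values by hand. The sketch below is only an illustration: `run_ok` and `first_failing` are hypothetical helpers, not part of TELEMAC, and `my_case.cas` is a placeholder name. It assumes that telemac3d.py returns a non-zero exit status when the run crashes, and that every value above the threshold fails.

```python
import subprocess
from typing import Callable


def run_ok(ncsize: int, cas_file: str) -> bool:
    """Return True if telemac3d.py exits with status 0 for this ncsize.

    Assumption: a crashed run is reported through a non-zero exit status.
    """
    cmd = ["telemac3d.py", f"--ncsize={ncsize}", cas_file]
    return subprocess.run(cmd, capture_output=True).returncode == 0


def first_failing(lo: int, hi: int, ok: Callable[[int], bool]) -> int:
    """Bisect for the smallest n in (lo, hi] for which ok(n) is False.

    Preconditions (not checked): ok(lo) is True, ok(hi) is False, and
    once ok() turns False it stays False for all larger n.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ok(mid):
            lo = mid  # mid still works, the threshold is above it
        else:
            hi = mid  # mid fails, the threshold is at or below it
    return hi


if __name__ == "__main__":
    # "my_case.cas" is a placeholder steering file name.
    threshold = first_failing(1, 128, lambda n: run_ok(n, "my_case.cas"))
    print(f"first failing --ncsize: {threshold}")
```

Passing the check as a callable keeps the bisection independent of how a run is launched, so the same function can be tried with a stand-in check before spending time on real runs.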

Do you know why?

Besides this error, do you know of any limitations when running telemac3d.py on a single machine? What is the largest setup you have ever seen in terms of parallel CPUs (clustered or not)?

Thank you for your help,
Have a nice day :)

telemac3d.py Segmentation fault when --ncsize > 61 3 years 4 months ago #38707

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Hi
As most of the test cases are relatively small, benchmarking them with a large number of CPUs is not necessarily relevant...
Maybe you should choose an example with a larger mesh, such as malpasset...

Otherwise, it's not easy to pin down a segmentation fault except by using a debug configuration to check where the problem is.

Large telemac3d models have been run on much larger numbers of CPUs.
We know of some models running on more than 8000 processors...

Regards

PS: attachments named xxx.log are not allowed on the forum. Change the extension to .txt so that we can see the file.
Christophe

telemac3d.py Segmentation fault when --ncsize > 61 3 years 4 months ago #38708

  • jdoe2
Hi Christophe,

Sorry for the attachment extension.
Sadly, I have no more luck with malpasset, this time on a machine with 128 CPUs:

telemac3d.py --ncsize=64 -t t3d_malpasset-fine_p2.cas > out.txt 2>&1

Output in attachment with .txt extension.

Thank you very much for your help.
Have a nice day.

telemac3d.py Segmentation fault when --ncsize > 61 3 years 4 months ago #38711

  • jdoe2
I recompiled TELEMAC with debug options, and now we can see more information:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f1370c188b0 in ???
#1  0x7f1370c17ae3 in ???
#2  0x7f13706e483f in ???
#3  0x7f136efcf936 in ???
#4  0x7f136efbd732 in ???
#5  0x7f136efcf5b3 in ???
#6  0x7f136f11c46d in ???
#7  0x7f136f0d488c in ???
#8  0x7f136f090d7b in ???
#9  0x7f136f16efe3 in ???
#10  0x7f136f998655 in ???
#11  0x7f13703ea119 in ???
#12  0x7f13709bae61 in ???
#13  0x7f13709e917d in ???
#14  0x7f13709139a7 in ???
#15  0x5612684657c7 in p_init_
        at /home/telemac/v8p1r2/sources/utils/parallel/p_init.F:107
#16  0x56126819f958 in bief_init_
        at /home/telemac/v8p1r2/sources/utils/bief/bief_init.f:56
#17  0x561267e1380e in homere_telemac3d
        at /home/telemac/v8p1r2/sources/telemac3d/homere_telemac3d.f:83
#18  0x561267e1361c in main
        at /home/telemac/v8p1r2/sources/telemac3d/homere_telemac3d.f:48
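
Most of the frames above are unresolved (`???`), so it can help to filter the backtrace down to the frames that carry a symbol and a source location. The sketch below does this for the format shown in this post; `resolved_frames` is a hypothetical helper, and the two regular expressions are assumptions based only on the lines quoted here, not on a documented format. On this backtrace the first resolved frame is `p_init_`, which suggests that the crash happens in library code called from p_init.F:107, for which no debug symbols are available.

```python
import re

# Patterns inferred from the backtrace quoted in this thread (assumption).
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-f]+) in (\S+)\s*$")
LOCATION_RE = re.compile(r"^\s+at (.+):(\d+)\s*$")

SAMPLE = """\
#14  0x7f13709139a7 in ???
#15  0x5612684657c7 in p_init_
        at /home/telemac/v8p1r2/sources/utils/parallel/p_init.F:107
#16  0x56126819f958 in bief_init_
        at /home/telemac/v8p1r2/sources/utils/bief/bief_init.f:56
"""


def resolved_frames(backtrace: str) -> list:
    """Return the frames that have a symbol name, innermost first."""
    frames = []
    current = None
    for line in backtrace.splitlines():
        frame = FRAME_RE.match(line)
        if frame:
            number, _address, symbol = frame.groups()
            if symbol == "???":
                current = None  # unresolved frame, skip it
            else:
                current = {"frame": int(number), "symbol": symbol,
                           "file": None, "line": None}
                frames.append(current)
            continue
        location = LOCATION_RE.match(line)
        if location and current is not None:
            # An "at path:line" row belongs to the frame just above it.
            current["file"] = location.group(1)
            current["line"] = int(location.group(2))
    return frames


if __name__ == "__main__":
    for f in resolved_frames(SAMPLE):
        print(f["frame"], f["symbol"], f["file"], f["line"])
```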

telemac3d.py Segmentation fault when --ncsize > 61 3 years 4 months ago #38712

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Interesting, but probably not easy to track down...
Could you have a look at the domain partitions and check whether they all look "regular"?
The only times I've seen this kind of problem, I had a few domains with an "exotic" decomposition...
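
As a rough first check of "regularity", one could compare the number of nodes in each subdomain with the median and flag the outliers. This is only a sketch: `irregular_partitions` is a hypothetical helper, the 25% tolerance is an arbitrary assumption, and the per-subdomain node counts have to be extracted by the user from the partitioned files. Size is also a crude proxy, since a subdomain can have a normal size and still have an odd shape.

```python
from statistics import median


def irregular_partitions(sizes, tolerance=0.25):
    """Return the indices of subdomains whose node count deviates from
    the median by more than `tolerance` (as a fraction of the median).

    `sizes` is a list of node counts, one per subdomain (assumed input).
    The median is used so that a few extreme subdomains do not shift
    the reference value.
    """
    reference = median(sizes)
    return [i for i, size in enumerate(sizes)
            if abs(size - reference) > tolerance * reference]


if __name__ == "__main__":
    # Made-up node counts for illustration only.
    counts = [100, 102, 98, 10]
    print(irregular_partitions(counts))
```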

Regards
Christophe

telemac3d.py Segmentation fault when --ncsize > 61 3 years 4 months ago #38713

  • jdoe2
Thanks Christophe,

This behaviour appears with every example I try in the examples/telemac3d directory. Can you reproduce this bug? I use --ncsize 62 and -oversubscribe (this parameter is not responsible for the failure) with mpiexec, like this:

/usr/bin/mpiexec -n 62 -oversubscribe [...]

My systel configuration file is attached below.

telemac3d.py Segmentation fault when --ncsize > 61 3 years 4 months ago #38744

  • pham
  • Administrator
  • Posts: 1559
  • Thank you received: 602
Hello,

I ran the malpasset fine example both in 2D and 3D, for 72, 144, 288 and 576 cores and all have finished. The issue may be in your installation of TELEMAC.

Have you tried installing METIS 5.1 rather than 5.0, as suggested by the installation guide?
See wiki.opentelemac.org/doku.php?id=installation_on_linux

You can also have a look at systel.edf.cfg, where you can find many configurations.

Hope this helps,

Chi-Tuan

telemac3d.py Segmentation fault when --ncsize > 61 3 years 4 months ago #38766

  • jdoe2
Thank you very much Pham, this really helps.

I'll try to recompile with other options, as you suggested. I'll keep you posted!

Have a great day.