TOPIC: HPC Installation v8p1r1, segmentation fault

HPC Installation v8p1r1, segmentation fault 3 years 10 months ago #37467

  • cschwarz
Dear Telemac-team,

First of all, congratulations on the excellent forum.
I tried to solve my problem by checking previous posts, but without success.

I am compiling TELEMAC v8p1r1 on a cluster. The compilation finishes successfully ("My work is done"), but when I run the telemac2d/bump test case in parallel (1 node, 2 processors), I get a segmentation fault when METIS is called, which I have not been able to solve.

Error message:
... partitioning base files (geo, conlim, sections, zones and weirs)
+> /work/cebg/sw/telemac/v8p1r1/builds/eoleIntel/bin/partel < partel_T2DGEO.par >> partel$
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
partel 000000000047582D for__signal_handl Unknown Unknown
libpthread-2.17.s 00002AC22C9B25D0 Unknown Unknown Unknown
libmetis.so 00002AC22BAA8014 libmetis__CreateG Unknown Unknown
libmetis.so 00002AC22BAA7EF9 METIS_MeshToDual Unknown Unknown
libmetis.so 00002AC22BA90B09 METIS_PartMeshDua Unknown Unknown
partel 000000000042E217 Unknown Unknown Unknown
partel 000000000042EA53 Unknown Unknown Unknown
partel 00000000004334CF Unknown Unknown Unknown
partel 000000000040BF44 Unknown Unknown Unknown
partel 000000000040995E Unknown Unknown Unknown
libc-2.17.so 00002AC22CBE1495 __libc_start_main Unknown Unknown
partel 0000000000409869 Unknown Unknown Unknown
Traceback (most recent call last):
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/runcode.py", line 289, in <module>
main(None)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/runcode.py", line 272, in main
run_study(cas_file, code_name, options)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/run_cas.py", line 159, in run_stu$
run_hpc_cas(my_study, options)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/run_cas.py", line 117, in run_hpc$
my_study.partionning(options.use_link)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/study.py", line 411, in partionni$
use_link, i_part, s_concat)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/run.py", line 55, in run_partition
concat)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/run.py", line 138, in run_partel
+'\n\n'+log)
utils.exceptions.TelemacException: Could not split your file T2DGEO with the error as follows:

... The following command failed for the reason in the listing
/work/cebg/sw/telemac/v8p1r1/builds/eoleIntel/bin/partel < partel_T2DGEO.par >> partel_T2DGEO.$



It would be great if you could give me some advice on how to solve this issue. I attached the systel.cfg and pysource files for your reference.
I am using METIS 5.1.0, the Intel Fortran compiler 2018u3, and OpenMPI 3.1.1; the cluster uses Slurm for job management.

Best Regards,
Christian

File Attachment:

File Name: systel.caviness2.cfg
File Size: 2 KB

HPC Installation v8p1r1, segmentation fault 3 years 10 months ago #37500

  • yugi
Hi,

Are your compute nodes using the same compiler as the front-end node?
With your configuration, the partitioning is run on the front-end node while the computation is run on the compute nodes (through Slurm).

You can check that easily by running a TELEMAC example sequentially on the front-end node; add the option --mpi so that it does not do a Slurm submission.
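
A complementary check is to compare what the compiler wrapper reports on the front-end node with what it reports inside a Slurm allocation. The following is only a minimal sketch under stated assumptions: it assumes that `mpif90` and `srun` are on the PATH on both sides, and the helper names `first_line` and `compare_compilers` are hypothetical, not part of TELEMAC.

```python
import subprocess


def first_line(cmd, run=subprocess.run):
    """Return the first line of a command's standard output, stripped."""
    result = run(cmd, capture_output=True, text=True, check=True)
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else ""


def compare_compilers(run=subprocess.run):
    """Compare `mpif90 --version` on this node and on a compute node.

    Returns (local, remote, same). The `run` argument is injectable so the
    comparison logic can be exercised without a cluster.
    """
    local = first_line(["mpif90", "--version"], run=run)
    # One task on one node, executed through Slurm (assumed available).
    remote = first_line(
        ["srun", "-n", "1", "-N", "1", "mpif90", "--version"], run=run
    )
    return local, remote, local == remote
```

Calling `compare_compilers()` from the front-end node returns both version strings and a flag saying whether they match. A mismatch would suggest that binaries built on one side are being run on the other.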
There are 10 types of people in the world: those who understand binary, and those who don't.

HPC Installation v8p1r1, segmentation fault 3 years 10 months ago #37599

  • cschwarz
Dear Yugi,

Thank you for your response and suggestion.
I have now recompiled TELEMAC so that when I run a job, everything is run in the cluster queue system (Slurm).

I also recompiled the whole system using the Intel compilers (systel.caviness2.cfg) and the gfortran compilers (systel.caviness2g.cfg).

With the Intel-compiled version I get the same segmentation fault:

... partitioning base files (geo, conlim, sections, zones and weirs)
+> srun -n 1 -N 1 /work/cebg/sw/telemac/v8p1r1/builds/eoleIntel/bin/partel < partel_T2DGE$
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
partel 000000000047582D for__signal_handl Unknown Unknown
libpthread-2.17.s 00002AD3797C35D0 Unknown Unknown Unknown
libmetis.so 00002AD3788B9014 libmetis__CreateG Unknown Unknown
libmetis.so 00002AD3788B8EF9 METIS_MeshToDual Unknown Unknown
libmetis.so 00002AD3788A1B09 METIS_PartMeshDua Unknown Unknown
partel 000000000042E217 Unknown Unknown Unknown
partel 000000000042EA53 Unknown Unknown Unknown
partel 00000000004334CF Unknown Unknown Unknown
partel 000000000040BF44 Unknown Unknown Unknown
partel 000000000040995E Unknown Unknown Unknown
libc-2.17.so 00002AD3799F2495 __libc_start_main Unknown Unknown
partel 0000000000409869 Unknown Unknown Unknown
srun: error: r04n50: task 0: Exited with exit code 174
Traceback (most recent call last):
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/runcode.py", line 289, in <module>
main(None)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/runcode.py", line 272, in main
run_study(cas_file, code_name, options)

With the gfortran-compiled version I get:

... partitioning base files (geo, conlim, sections, zones and weirs)
+> srun -n 1 -N 1 /work/cebg/sw/telemac/v8p1r1/builds/eoleGfortran/bin/partel < partel_T2DGEO.par >> partel_T2DGEO.log
Current memory used: 0 bytes
Maximum memory used: 0 bytes
***Memory allocation failed for CreateGraphDual: nptr. Requested size: 6047313958608 bytes
STOP 0

... splitting / copying other input files

... checking the executable
> compiling objs
compiling: user_condin_h.f ... completed
compiling: user_corfon.f ... completed
compiling: user_utimp_telemac2d.f ... completed
/usr/bin/ld: cannot find crt1.o: No such file or directory
/usr/bin/ld: cannot find crti.o: No such file or directory
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lc
/usr/bin/ld: cannot find crtn.o: No such file or directory
collect2: error: ld returned 1 exit status
Traceback (most recent call last):
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/runcode.py", line 289, in <module>
main(None)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/runcode.py", line 272, in main
run_study(cas_file, code_name, options)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/run_cas.py", line 157, in run_study
run_local_cas(my_study, options)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/run_cas.py", line 43, in run_local_cas
my_study.compile_exe()
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/study.py", line 332, in compile_exe
self.cfg, self.code_name)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/process.py", line 644, in process_executable
str(code)+').\n '+tail)
utils.exceptions.TelemacException: Could not link your executable (runcode=1).

... The following command failed for the reason in the listing
mpif90 -fPIC -fconvert=big-endian -frecord-marker=4 -o "/home/2249/TELEMAC/bump/t2d_bump_FE.cas_2021-01-11-14h37min32s/out_user_fortran" /$

It would be great if you could provide some advice on how to resolve the errors shown above.

Best Regards,
Christian

File Attachment:

File Name: systel.caviness2_2021-01-11.cfg
File Size: 2 KB


File Attachment:

File Name: systel.caviness2g.cfg
File Size: 2 KB

HPC Installation v8p1r1, segmentation fault 3 years 10 months ago #37602

  • yugi
I think the issue is with your METIS installation.
Can you try reinstalling it?
There can be an issue in METIS with the integer precision.
Have a look at their forum about that (you need to modify a .h file in the METIS source).
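
In METIS 5.x the integer width is set by the `IDXTYPEWIDTH` define (32 or 64) in `include/metis.h`. The sketch below reads that value from header text and shows why a width mismatch can produce an absurd allocation request like the one in the gfortran log. The requested size there equals 8 × (176·2^32 + 730), the pattern one would expect if a small 32-bit count were read together with its neighbouring word as a single 64-bit integer. That reading is an interpretation, not something confirmed in this thread, and the function names are hypothetical.

```python
import re
import struct


def idx_type_width(header_text):
    """Return the IDXTYPEWIDTH value defined in metis.h text, or None."""
    match = re.search(
        r"^\s*#\s*define\s+IDXTYPEWIDTH\s+(\d+)", header_text, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def split_words(value):
    """Split a non-negative 64-bit value into (high, low) 32-bit words."""
    return divmod(value, 2**32)


def read_as_int64(low, high):
    """Reinterpret two adjacent little-endian int32 values as one int64."""
    return struct.unpack("<q", struct.pack("<ii", low, high))[0]


# Requested size from the gfortran log, assuming 8-byte index entries.
requested_bytes = 6047313958608
count = requested_bytes // 8
high, low = split_words(count)  # high == 176, low == 730
```

To use it, pass the contents of the `metis.h` that was actually used for the build, for example `idx_type_width(open(path_to_metis_h).read())`, where `path_to_metis_h` is a placeholder for your own install path. If the width reported there differs from the integer size the calling code passes, rebuilding METIS with a matching `IDXTYPEWIDTH` is the usual remedy.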

HPC Installation v8p1r1, segmentation fault 3 years 10 months ago #37658

  • cschwarz
Hey Yugi,

Thanks a lot. Yes, it was indeed a METIS problem: I reinstalled it and that fixed this error. partel now splits the domains successfully.

Unfortunately I now get a different error:

.. splitting / copying other input files

... checking the executable
> compiling objs
compiling: user_condin_h.f ... completed
compiling: user_corfon.f ... completed
compiling: user_utimp_telemac2d.f ... completed
/usr/bin/ld: cannot find crt1.o: No such file or directory
/usr/bin/ld: cannot find crti.o: No such file or directory
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lm
collect2: error: ld returned 1 exit status
Traceback (most recent call last):
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/runcode.py", line 289, in <module>
main(None)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/runcode.py", line 272, in main
run_study(cas_file, code_name, options)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/run_cas.py", line 157, in run_study
run_local_cas(my_study, options)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/run_cas.py", line 43, in run_local_cas
my_study.compile_exe()
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/study.py", line 332, in compile_exe
self.cfg, self.code_name)
File "/work/cebg/sw/telemac/v8p1r1/scripts/python3/execution/process.py", line 644, in process_executable
str(code)+').\n '+tail)
utils.exceptions.TelemacException: Could not link your executable (runcode=1).

... The following command failed for the reason in the listing
mpif90 -fPIC -fconvert=big-endian -frecord-marker=4 -lpthread -lm -o "/home/2249/TELEMAC/bump/t2d_bump_FE.cas_2021-01-17-15h30min02s/out_user_fort$

Would you have a suggestion on how to fix this one?
Thank you for your help so far,
Best Regards,
C

HPC Installation v8p1r1, segmentation fault 3 years 10 months ago #37667

  • yugi
OK, this means that some things are missing in your compiler installation.

You can have a look at the following links:
community.intel.com/t5/Intel-Fortran-Com...ols-File/td-p/967498

community.intel.com/t5/Intel-Fortran-Com...-missing/td-p/746539
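
For context, `crt1.o`, `crti.o` and `crtn.o` are C runtime startup objects, and `-lm`, `-lpthread` and `-lc` resolve to libraries that normally ship with the C library development package (for example glibc-devel on RHEL/CentOS or libc6-dev on Debian/Ubuntu). A quick way to see which of them are visible on a given node is sketched below. The directory list is an assumption and depends on the distribution, and `find_missing` is a hypothetical helper.

```python
from pathlib import Path

# Files the linker reported as missing (startup objects and C libraries).
NEEDED = ["crt1.o", "crti.o", "crtn.o", "libm.so", "libpthread.so", "libc.so"]

# Assumed search locations; the real ones depend on the distribution.
CANDIDATE_DIRS = ["/usr/lib64", "/usr/lib", "/usr/lib/x86_64-linux-gnu"]


def find_missing(names=NEEDED, dirs=CANDIDATE_DIRS):
    """Return the names that exist in none of the given directories."""
    return [
        name
        for name in names
        if not any((Path(d) / name).exists() for d in dirs)
    ]
```

Running `find_missing()` both on the front-end node and inside a Slurm job would show whether the development files are absent only on the compute nodes, which is where the user Fortran files are linked in this setup.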