TOPIC: V8p2r0 crashing when running in parallel

V8p2r0 crashing when running in parallel 3 years 5 months ago #38645

  • marcioxyz
  • Fresh Boarder
  • Posts: 14
  • Thank you received: 1
Hello

Telemac3D v8p2r0 compiled without errors using gfortran and METIS (the compilation config files are attached). I have Python 3.8.5 installed.
Telemac3D simulations run flawlessly on one core.

When I try to run in parallel using METIS I get this error:
====================================================================

... partitioning base files (geo, conlim, sections, zones and weirs)
+> /home/luke/telemac_v8p2r0/v8p2r0/builds/ubugfopenmpi/bin/partel < PARTEL.PAR >> partel_T3DGEO.log
/bin/sh: 1: cannot open PARTEL.PAR: No such file
Traceback (most recent call last):
File "/home/luke/telemac_v8p2r0/v8p2r0/scripts/python3/telemac3d.py", line 7, in <module>
main('telemac3d')
File "/home/luke/telemac_v8p2r0/v8p2r0/scripts/python3/runcode.py", line 271, in main
run_study(cas_file, code_name, options)
File "/home/luke/telemac_v8p2r0/v8p2r0/scripts/python3/execution/run_cas.py", line 157, in run_study
run_local_cas(my_study, options)
File "/home/luke/telemac_v8p2r0/v8p2r0/scripts/python3/execution/run_cas.py", line 31, in run_local_cas
my_study.partionning(options.use_link)
File "/home/luke/telemac_v8p2r0/v8p2r0/scripts/python3/execution/study.py", line 404, in partionning
run_partition(parcmd, self.cas, g_geo, g_fmt_geo, g_conlim,
File "/home/luke/telemac_v8p2r0/v8p2r0/scripts/python3/execution/run.py", line 51, in run_partition
run_partel(partel, geom, fmtgeom, conlim, ncsize, False,
File "/home/luke/telemac_v8p2r0/v8p2r0/scripts/python3/execution/run.py", line 132, in run_partel
log = "No log available check command:\n"+par_cmd
TypeError: can only concatenate str (not "list") to str
=====================================================================
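If I read the traceback correctly, the final TypeError is only a secondary failure: the script crashes while formatting an error message about the earlier "cannot open PARTEL.PAR" problem, because it concatenates a Python list of command fragments to a string. A minimal sketch of that failure mode (illustrative only, not the actual TELEMAC code, and the values are made up):

*************************************************************
# Illustrative sketch of the TypeError raised in run_partel above.
par_cmd = ["partel", "< PARTEL.PAR", ">> partel_T3DGEO.log"]   # a list of command fragments

# log = "No log available check command:\n" + par_cmd
# -> TypeError: can only concatenate str (not "list") to str

# Joining the fragments into one string first avoids the crash:
log = "No log available check command:\n" + " ".join(par_cmd)
print(log)
*************************************************************

Of course, fixing that message formatting would only reveal the real problem (the missing PARTEL.PAR file) instead of solving it.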

I tried a few possible solutions suggested in some forum threads such as:

> sudo apt install python-is-python3 python3-dev python3-pip libopenmpi-dev

> /usr/bin/env python (make sure it points to python3)

> Including this line in my compilation .cfg file:
incs_all: I /usr/lib/x86_64-linux-gnu/openmpi/include

> Deleted the following line from the .cfg file:
par_cmdexec: <config>/partel < PARTEL.PAR >> <partel.log>

I found a suggestion* from Yugi that changed the error to:

[C3PO:174614] *** An error occurred in MPI_Waitall
[C3PO:174614] *** reported by process [2673737729,1]
[C3PO:174614] *** on communicator MPI_COMM_WORLD
[C3PO:174614] *** MPI_ERR_TRUNCATE: message truncated
[C3PO:174614] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[C3PO:174614] *** and potentially your MPI job)

*Suggestion was:
"in par_cmd replace PARTEL.PAR by <partel.par>
This is something that changed in python3."
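If I apply that suggestion to the par_cmdexec line quoted above, the corrected entry in the .cfg file should presumably read something like this (my interpretation, not an official example):

par_cmdexec: <config>/partel < <partel.par> >> <partel.log>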

I also made a few other attempts with different configs in the last 7 days without success.
Any insight is welcome.
Thank you.

Márcio
Attachments:

V8p2r0 crashing when running in parallel 3 years 5 months ago #38646

  • c.coulet
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Hi
The first problem, as indicated, is related to the fact that there is no PARTEL.PAR file in your computation directory.
If the simulation works fine on 1 core, it also means that most of the installation is working well.
The most interesting point is the suggestion, as it seems the script starts to work once you change PARTEL.PAR to <partel.par>.
As Linux is case-sensitive, this makes sense.

Then the problem seems to be located in MPI, so you should try to investigate this point and first check whether you can run one simple MPI example, which is probably available.
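For instance, assuming mpi4py is installed (it is not shipped with TELEMAC, and a plain MPI hello-world compiled with mpicc would do just as well), a quick standalone check could look like this:

*************************************************************
# mpi_hello.py - minimal standalone MPI check (assumes mpi4py is installed)
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")
*************************************************************

Run it with something like "mpirun -np 2 python3 mpi_hello.py"; if even this fails, the problem is in the MPI installation itself rather than in TELEMAC.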

Hope this helps
Christophe

V8p2r0 crashing when running in parallel 3 years 5 months ago #38703

  • marcioxyz
  • Fresh Boarder
  • Posts: 14
  • Thank you received: 1
Hello Christophe

Thank you for the comment and suggestion.

I tried to run an MPI example from the tutorials by inserting the following line in the t3D_canal.cas file:
PARALLEL PROCESSORS = 2

and/or running from the command line:
telemac3d.py --ncsize=2 t3D_canal.cas

I guess the important part of the error is this:
=======================================
At line 136 of file /home/luke/telemac_v8p2r0/v8p2r0/sources/utils/bief/read_partel_info.f (unit = 17, file = 'T3DPAR00001-00000')
Fortran runtime error: Bad integer for item 2 in list input

Error termination. Backtrace:
#0 0x7f59c9737d01 in ???
#1 0x7f59c9738849 in ???
#2 0x7f59c973952f in ???
#3 0x7f59c9972ad3 in ???
(...)
=======================================


Line 136 of read_partel_info.f is the third line of the code below:
********************************************
      IF(NHALO.GT.0) THEN
        DO K=1,NHALO
          READ(NPAR,*) IF1,IF2,IF3,IF4,IF5,IF6,IF7
!
!         CORRECTS A BUG (IN IFAPAR THERE IS A CONFUSION BETWEEN PROCESSOR 0
!         AND LIQUID BOUNDARY BUT
!         IN CASE OF LIQUID BOUNDARY, THE ELEMENT BEHIND
!         IS GIVEN AS 0, SO BOTH CASES MAY BE DISTINGUISHED
!         HERE ALL BOUNDARIES (LIQUID OR SOLID) ARE SET AT -1
!
          IF(IF5.EQ.0) IF2=-1
          IF(IF6.EQ.0) IF3=-1
          IF(IF7.EQ.0) IF4=-1
!
          MESH%IFAPAR%I(6*(IF1-1)+1)=IF2
          MESH%IFAPAR%I(6*(IF1-1)+2)=IF3
          MESH%IFAPAR%I(6*(IF1-1)+3)=IF4
          MESH%IFAPAR%I(6*(IF1-1)+4)=IF5
          MESH%IFAPAR%I(6*(IF1-1)+5)=IF6
          MESH%IFAPAR%I(6*(IF1-1)+6)=IF7
        ENDDO
      ENDIF
!
      CLOSE(NPAR)
***********************************************

I tried to understand the code in this file but I could not reach a conclusion.
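To see what that READ statement actually chokes on, I suppose one could scan the partition file for non-integer fields with a small script like this (a hypothetical helper, not part of TELEMAC; it assumes the records are plain whitespace-separated integers, which is what the list-directed READ above expects):

*************************************************************
# check_par.py - hypothetical helper: flag non-integer tokens in a partition file
import sys

with open(sys.argv[1]) as f:   # e.g. python3 check_par.py T3DPAR00001-00000
    for lineno, line in enumerate(f, start=1):
        for item, token in enumerate(line.split(), start=1):
            try:
                int(token)
            except ValueError:
                print(f"line {lineno}: item {item} is not an integer: {token!r}")
*************************************************************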

Note: I'm using the same OpenMPI installation to run OpenFOAM in parallel. I also added some libs from the OpenMPI package that are not installed by default.

I would be glad if you could point me in the right direction.

Thank you.

V8p2r0 crashing when running in parallel 1 year 6 months ago #42596

  • Youenn
  • Fresh Boarder
  • Posts: 16
Hello

I am using the V8P4 version and I encountered exactly the same problem as you.

Did you manage to find what was wrong after all this time?

Jan Youenn

V8p2r0 crashing when running in parallel 1 year 6 months ago #42606

  • marcioxyz
  • Fresh Boarder
  • Posts: 14
  • Thank you received: 1
Hello Jan

Unfortunately I could not solve this issue; I have been running on only one processor since then. I intend to make a few more attempts soon. I'll get back here if I have any success.

Márcio

V8p2r0 crashing when running in parallel 1 year 5 months ago #42651

  • pham
  • Administrator
  • Posts: 1559
  • Thank you received: 602
Hello Marcio,

Have you tried to run one or several examples from the TELEMAC-3D database in parallel without being able to get them to finish correctly?

If not, you may have an installation issue.

In your log, you seem to have an issue with the generic file name T3DPAR (which corresponds to the STAGE-DISCHARGE CURVES FILE). It would be interesting to upload it. I suppose you have also run computations without this feature?

Chi-Tuan

V8p2r0 crashing when running in parallel 1 year 5 months ago #42789

  • marcioxyz
  • Fresh Boarder
  • Posts: 14
  • Thank you received: 1
Hello Chi-Tuan

Thank you for your answer.
I have tried five cases from the examples folder; all of them run on 1 processor but show the same error with 2 or more processors:

At line 136 of file /home/luke/telemac_v8p2r0/v8p2r0/sources/utils/bief/read_partel_info.f (unit = 15, file = 'T3DPAR00005-00000')
Fortran runtime error: Bad integer for item 2 in list input

I noticed that when ncsize is even, I get "Bad integer for item 2 in list input"; when it is odd, I get "Bad integer for item 3 in list input".

I don't think I used a stage-discharge file in any of the cases I ran, but I did use a hydrograph file (.hyd).

I think it is an installation problem, although I did not detect any suspicious message during compilation.
I will try installing on a different machine.

Best regards,
Márcio

V8p2r0 crashing when running in parallel 1 year 1 month ago #43413

Has anybody solved this problem? I have used the automatic Windows installer. One processor works, but any increase in the number of processors causes the system to fail. There must be a fundamental problem with the automatic installer.

V8p2r0 crashing when running in parallel 5 months 2 weeks ago #44926

  • phmusiedlak
  • Fresh Boarder
  • Posts: 11
  • Thank you received: 2
Hello everyone,

I got the same error on our local cluster when using 2 full nodes (hence 88 procs), but not when using 1 node, or 2 nodes with only 54 procs (these cases run without problems).

Version: 8p5r0 (I guess; I just cloned the repository from the main branch yesterday)

Cases: TELEMAC-2D Negretti (or my "English Channel" model using TPXO or PREVIMER)

command ran on cluster: telemac2d.py --ncsize=88 t2d_negretti.cas > log 2>&1
systel.cfg : mpi_cmdexec: mpirun -np <ncsize> <exename>
(I removed the "-machinefile MPI_HOSTFILE" because of www.opentelemac.org/index.php/kunena/12-...-in-the-system#38249 )

I have attached the log file, systel.cfg and pysource.gfortranHPC.sh

I tried adding the flags "--nctile=44" and "--ncnode=2" but it makes no difference.
I am running out of ideas of what to test or add to the mpirun command.

Any suggestions would be great!

Pierre-Henri
Attachments:

V8p2r0 crashing when running in parallel 5 months 1 week ago #44941

  • phmusiedlak
  • Fresh Boarder
  • Posts: 11
  • Thank you received: 2
Hello everyone,

I think I solved the problem.

More explanation of the problem:
- When splitting the mesh into a large enough number of parts (in my case around 80), the problem arises. I also had the problem on my local machine.
- The problem comes from partel, and more exactly METIS: when partitioning the domain (
telemac2d.py -w meshspli_2/ --split --ncsize=88 tel.cas > log 2>err
) and assigning the conditions to the parts, some parts are badly assigned. Some T2DPAR000XX-000XX files (an example is attached) got stars - ********* - in them, and the err file is:
munmap_chunk(): invalid pointer
munmap_chunk(): invalid pointer
corrupted size vs. prev_size
STOP 0

- I also got some "ISOLATED BOUNDARY POINT" messages in partel_T2DGEO.log
==> which is consistent with the error I get later when running
telemac2d.py --ncsize=88
: read_partel_info.f (unit = 16, file = 'T2...)
Fortran runtime error: Bad integer for item 4 in list input
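A quick way to spot the corrupted partition files (assuming the tell-tale sign is always those asterisk fields) would be something like this hypothetical helper:

*************************************************************
# find_bad_par.py - hypothetical helper: list partition files containing '*' fields
import glob

for path in sorted(glob.glob("T2DPAR*")):
    with open(path) as f:
        if any("*" in line for line in f):
            print("corrupted:", path)
*************************************************************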

(I also had some inconsistencies between my systel.cfg and pysource.gfortranHPC.sh in the definitions of METISHOME and LD_LIBRARY_PATH, which turned out not to have any influence - i.e. fixing them does not solve the problem.)

The solution:
- recompile metis-5.1.0 from http://glaros.dtc.umn.edu/gkhome/metis/metis/download
using this method: https://hydro-informatics.com/get-started/install-telemac.html#parallelism-install-metis (the telemac wiki method does not work for me)
- and recompile your telemac installation: compile_telemac.py --clean


A couple of days ago the telemac wiki was down, so I did not have access to this METIS link, only to the METIS GitHub repo - https://github.com/KarypisLab/METIS - which appears to cause the problem.

The only thing I am wondering is: why (the hell) did this problem not appear directly in parallel?


Pierre-Henri
Attachments: