TOPIC: Telemac2D simulation stops: no apparent error

Telemac2D simulation stops: no apparent error 6 years 4 months ago #30894

  • Araujo
Dear all,

I am using Telemac2D version 7p2r2 on a cluster. I have made many attempts to run a Telemac2D simulation covering about 20 days of simulated time. The problem is that the run stops after about 6, 7 or 8 simulated days, depending on the attempt, and I cannot see any apparent error. The LOG files look normal, without any error message:

ITERATION 7362000 TIME: 8 D 12 H 29 MN 59.9999 S ( 736199.9999 S)
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818
LIQUID BOUNDARY: VTU(1) = 0.12853300083357397
LIQUID BOUNDARY: VTV(1) = 0.59538900395316818


The end of the LSF file I get says:

ITERATION 7398000 TIME: 8 D 13 H 29 MN 59.9999 S ( 739799.9999 S)
DIFFUSION-PROPAGATION STEP
CVTRVF_POS_2 (SCHEME 13 OR 14): 10 ITERATIONS
GRACJG (BIEF) : 2 ITERATIONS, ABSOLUTE PRECISION: 0.8872706E-06
POSITIVE DEPTHS OBTAINED IN 10 ITERATIONS
DIFFUSION-PROPAGATION STEP
CVTRVF_POS_2 (SCHEME 13 OR 14): 10 ITERATIONS
GRACJG (BIEF) : 1 ITERATIONS, ABSOLUTE PRECISION: 0.6239400E-06
POSITIVE DEPTHS OBTAINED IN 10 ITERATIONS
BALANCE OF WATER VOLUME
VOLUME IN THE DOMAIN : 0.1643756E+10 M3
FLUX BOUNDARY 1: 34964.75 M3/S ( >0 : ENTERING <0 : EXITING )
FLUX BOUNDARY 2: -26070.62 M3/S ( >0 : ENTERING <0 : EXITING )
RELATIVE ERROR IN VOLUME AT T = 0.7398E+06 S : -0.8433504E-15
_____________
runcode::main:
:
|runCode: Fail to run
|/gpfs/software/openmpi/2.1.0/gcc/mellanox/bin/mpiexec -wdir /gpfs/home/hzh14unu/projects/BEEMS/sizewell_BLF/t2d/sizew_t2d_no_BLF_29days.cas_2018-07-15-17h48min30s -n 56 /gpfs/home/hzh14unu/projects/BEEMS/sizewell_BLF/t2d/sizew_t2d_no_BLF_29days.cas_2018-07-15-17h48min30s/out_princi_uv
|~~~~~~~~~~~~~~~~~~
|[i0073:15297] 55 more processes have sent help message help-mpi-common-cuda.txt / dlopen failed
|[i0073:15297] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
|[[62215,1],8][btl_openib_component.c:3531:handle_wc] from i0073 to: i0075 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 2204c00 opcode 1 vendor error 136 qp_idx 1
|[[62215,1],27][btl_openib_component.c:3531:handle_wc] from i0073 to: i0075 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 2889980 opcode 1 vendor error 136 qp_idx 0
|~~~~~~~~~~~~~~~~~~

zip error: Nothing to do! (*_p*.sortie)


Have you seen this kind of error before? Do you have any idea how I can solve this?
It does not seem to be related to my model setup: I have checked the results obtained up to the point where the simulation stopped, and they look fine and very similar to the observations.
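Since the run stops at a different simulated time on each attempt, it can help to extract the last completed iteration and any MPI transport errors from each listing and compare them across attempts. The following is a minimal Python sketch, not part of TELEMAC. The function name `summarize_listing` is hypothetical, and the two regular expressions are tailored to the exact lines quoted above; other versions or MPI stacks may print these lines differently.

```python
import re

# Matches lines such as:
# ITERATION 7398000 TIME: 8 D 13 H 29 MN 59.9999 S ( 739799.9999 S)
ITER_RE = re.compile(r"ITERATION\s+(\d+)\s+TIME:.*\(\s*([\d.]+)\s*S\)")

# Matches the OpenMPI openib lines quoted above, e.g.
# "... error polling LP CQ with status REMOTE ACCESS ERROR status number 10 ..."
MPI_ERR_RE = re.compile(r"error polling .* with status (.+?) status number (\d+)")


def summarize_listing(text):
    """Return (last_iteration, last_time_in_seconds, mpi_errors) for one listing.

    mpi_errors is a list of (status_text, status_number) tuples.
    The patterns are assumptions based on the excerpt in this thread.
    """
    last_iter = None
    last_time = None
    mpi_errors = []
    for line in text.splitlines():
        m = ITER_RE.search(line)
        if m:
            last_iter = int(m.group(1))
            last_time = float(m.group(2))
            continue
        m = MPI_ERR_RE.search(line)
        if m:
            mpi_errors.append((m.group(1), int(m.group(2))))
    return last_iter, last_time, mpi_errors
```

Calling it on the text of each failed run's listing (for example `summarize_listing(open(path).read())`, where `path` is whichever file holds the output) gives the stopping point and shows whether the same transport error appears every time.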

Thank you very much for your help.

Regards,

Amelia

Telemac2D simulation stops: no apparent error 6 years 4 months ago #30903

  • riadh
Hello Amelia

It is hard to guess what the problem is. Especially in parallel computations, error messages are not that explicit.
One possible cause is that the maximum time in the liquid boundary file has been reached.
To be sure, you have to generate the result file using gretel and then run the case using this result file as a restart file. The run must be in SERIAL in order to see an explicit error message.

I hope this helps
with my best regards

Riadh

Telemac2D simulation stops: no apparent error 6 years 4 months ago #30911

  • Araujo
Hi Riadh,

Thank you for your feedback. The curious thing is that the simulation does not stop after the same number of simulated days: sometimes it stops after about 7 days, other times after about 8.5 days (the longest it has run in a row). My liquid boundary file has only 1384 lines (corresponding to about 29 days of simulation, i.e. 2485800 s, with data spaced 30 min apart). Therefore, the issue presented in this post, http://www.opentelemac.org/index.php/assistance/forum5/21-telemac-3d/6521-liquid-boundaries-file, should not apply to my case. I have been using restart files to be able to run the simulation over the 29 days, and the results look fine. It will take a very long time if I run the first 8.5 days in serial to test!
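A quick way to rule out the "maximum time in the liquid boundary file" cause is to read the last time stamp in the file and compare it with the simulated time at which the run stops. The sketch below is a hedged Python example, not a TELEMAC tool. It assumes a plain-text layout with optional `#` comment lines, non-numeric header lines (variable names and units), and whitespace-separated records whose first column is the time in seconds. The function names are hypothetical.

```python
def liquid_boundary_time_range(lines):
    """Return (first_time, last_time, n_records) from liquid boundary file lines.

    Assumed layout: '#' comments, non-numeric header lines, then records
    whose first whitespace-separated column is the time in seconds.
    """
    times = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        first_token = stripped.split()[0]
        try:
            times.append(float(first_token))
        except ValueError:
            # Header line (e.g. variable names or units): skip it.
            continue
    if not times:
        raise ValueError("no numeric records found")
    return times[0], times[-1], len(times)


def covers_run(last_time, stop_time):
    """True if the file still has data at the simulated time where the run stopped."""
    return last_time >= stop_time
```

As a consistency check on the numbers quoted above: if two of the 1384 lines are header lines, the remaining 1382 records at 1800 s spacing starting from 0 end at 1381 × 1800 = 2485800 s. That is far beyond the roughly 739800 s at which the run stopped, so the file would not have been exhausted.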

To use Telemac version 7p2r2, I had to adapt my Fortran boundary conditions file princi_uv.f (which I had previously used with version 6.2) by changing some of the keywords that were no longer correct. Maybe I did something wrong in that file. I would appreciate it if you could have a quick look at it (it is attached).

Many thanks!

Regards,

Amelia

File Attachment:

File Name: princi_uv_2018-07-18.f
File Size: 23 KB

Telemac2D simulation stops: no apparent error 6 years 4 months ago #30940

  • riadh
Hello Amelia

To upgrade to release v7p2, your bord subroutine only needs lines 250 and 251 adapted, where you call the functions vtu and vtv.
The best way is to copy the bord subroutine of release v7p2r2 and make changes only at these lines. There are several other small parts of your princi file that are not up to date and which can have effects depending on the selection of keywords and/or the model itself.
To see these differences you can use any diff software (vimdiff, meld, kdiff, etc.).
On the other hand, for long simulations, be careful about the size of your result file, which can exceed the available disk size.

kind regards

Riadh

Telemac2D simulation stops: no apparent error 6 years 3 months ago #31010

  • Araujo
Hi Riadh,

Many thanks for your feedback. I have checked, and the size of my result files when the simulation stops is not always the same, but in all cases it is below 400 MB, which does not seem too big to me. How can I check the "available disk size"?

Thank you.

Regards,

Amelia
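One way to check the available disk size from the working directory is Python's standard `shutil.disk_usage`, sketched below. The function name `free_space_report` and the returned dictionary keys are hypothetical. Note the assumption that the filesystem-level figure is what matters: on a shared cluster, per-user quotas may be lower than the free space reported here, so the cluster's own quota tools or the administrator remain the authoritative source.

```python
import shutil


def free_space_report(path, needed_bytes):
    """Report free space on the filesystem holding `path`.

    `needed_bytes` is the expected size of the result file. Per-user
    quotas are NOT taken into account by shutil.disk_usage.
    """
    usage = shutil.disk_usage(path)
    return {
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "enough": usage.free >= needed_bytes,
    }
```

For example, `free_space_report(".", 400 * 1024**2)` checks whether the current directory's filesystem has room for a result file of about 400 MB.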

Telemac2D simulation stops: no apparent error 6 years 3 months ago #31011

  • c.coulet
Hi Amelia
You're right, 400 MB is not a big result file, so the problem is probably not a limit on the result file size.
The problem you report really seems to be linked to your cluster and not to Telemac itself...
So you should probably check with the administrator where the blocking point is.
As one idea, there is sometimes a limit on the duration of a computation, with the process being killed automatically when the duration exceeds that limit...

I agree with Riadh about running the simulation in sequential mode. Even though this will probably take very long, it will give some hints...

Another idea would be to find someone with a cluster who would agree to run your simulation, to see whether it also crashes there...

Hope this helps
Christophe

Telemac2D simulation stops: no apparent error 6 years 3 months ago #31012

  • Araujo
Hi C.Coulet,

Thank you very much for your feedback. I have started running the simulation in serial mode. I think it will take some days to reach the point of the first stop. Let's see if I get any clue.
The cluster I have been using is usually very busy. I have used 2 full nodes of 28 processors each (56 processors in total), so I am considering the possibility that the communication between the nodes/processors takes too long. Or maybe the cluster needs a newer version of OpenMPI...

Thanks.

Best regards,

Amelia

Telemac2D simulation stops: no apparent error 6 years 3 months ago #31062

  • Araujo
Hi Riadh and C.Coulet,

In the meantime, I have run a simulation on 1 node only (28 processors), and for the first time the simulation ran to the end without any error. So I have no doubt that the issues I had before were related to communication problems between the nodes/processors, caused by the cluster being too busy.

Thanks for all your suggestions and your support!

Kind Regards,

Amelia

Telemac2D simulation stops: no apparent error 6 years 3 months ago #31088

  • riadh
Glad to see that you got past your issue!

kind regards

Riadh