TOPIC: Parallel computing error

Parallel computing error 7 years 7 months ago #25967

  • acsantosr
Hi,
I am trying to run a model using the parallel option. When I assign more than 6 processors, Telemac2D shows the message below. I simplified my model in order to understand the routines; the real model is more complicated.

I also have two other questions:

1. Is it possible that the points (*.xyz) produce this kind of error?
2. Is Blue Kenue adequate for generating the mesh for parallel processing?

Thanks


//////////////////////////////////////////////////////////////

READ_MESH_INFO: TITLE= geo
NUMBER OF ELEMENTS: 262
NUMBER OF POINTS: 159

FORMAT NOT INDICATED IN TITLE

MXPTEL (BIEF) : MAXIMUM NUMBER OF ELEMENTS AROUND A POINT: 8
MAXIMUM NUMBER OF POINTS AROUND A POINT: 9
(GLOBAL MESH)
SEGBOR (BIEF) : NUMBER OF BOUNDARY SEGMENTS = 54
INCLUDING THOSE DUE TO DOMAIN DECOMPOSITION
CORRXY (BIEF):NO MODIFICATION OF COORDINATES

CHECKING THE MESH

SMALLEST DISTANCE BETWEEN TWO POINTS: 115.54755852569971
BETWEEN POINTS: 218 AND 222
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.

mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[5453,1],0]
Exit code: 1
_____________
runcode::main:
:
|runCode: Fail to run
|/usr/bin/mpiexec -wdir /home/geofisico/Dropbox/Paralelo/Telepar6/corrida_1.cas_2017-04-05-16h02min59s -n 6 /home/geofisico/Dropbox/Paralelo/Telepar6/corrida_1.cas_2017-04-05-16h02min59s/out_telemac2d
|~~~~~~~~~~~~~~~~~~
|Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
|
|Backtrace for this error:
|
|Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
|
|Backtrace for this error:
|
|Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
|
|Backtrace for this error:
|
|Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
|
|Backtrace for this error:
|
|Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
|
|Backtrace for this error:
|Operating system error: Cannot allocate memory
|Memory allocation failed in xmallocarray
|#0 0x7F7D86246E08
|#1 0x7F7D86245F90
|#2 0x7F7D859784AF
|#3 0x621A7C in paraco_
|#4 0x5D03D4 in parcom2_
|#5 0x5D1008 in parcom_
|#6 0x68D7A5 in checkmesh_
|#7 0x61F90B in almesh_
|#8 0x4CFED4 in point_telemac2d_
|#9 0x4ED293 in MAIN__ at homere_telemac2d.f:?
|#0 0x7F1619518E08
|#1 0x7F1619517F90
|#2 0x7F1618C4A4AF
|#3 0x621A7C in paraco_
|#4 0x5D03D4 in parcom2_
|#5 0x5D1008 in parcom_
|#6 0x68D7A5 in checkmesh_
|#7 0x61F90B in almesh_
|#8 0x4CFED4 in point_telemac2d_
|#9 0x4ED293 in MAIN__ at homere_telemac2d.f:?
|~~~~~~~~~~~~~~~~~~

Parallel computing error 7 years 7 months ago #25971

  • josekdiaz
Hello,

Depending on the version of Telemac, I've heard there are some issues with the "CHECKING THE MESH" keyword. Have you tried removing it and running the case again?

1) If the model runs with, say, 1 processor and doesn't with 6, it's an issue related to the domain split. Try increasing or decreasing the number of workers.

2) Blue Kenue is a decent mesh generator. There are certainly better options out there, but I believe this shouldn't be an issue if the meshing is done correctly (being careful with refinement zones, transitions, the interaction of the channel mesh with the domain mesh, etc.).

Regards,

José Díaz.
The following user(s) said Thank You: acsantosr

Parallel computing error 7 years 7 months ago #25974

  • acsantosr
;) /My work is done!!!

Thanks

Parallel computing error 7 years 7 months ago #25972

Hi,
If you remove 'CHECKING THE MESH' from your cas file, it should run for whatever number of processors you specify.

Blue Kenue (BK) generates a mesh that is appropriate for either single or parallel processing, because it is the MPI machinery, through the METIS package, that splits up the mesh for parallel processing.

regards
Tony C
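
As a rough illustration of the suggestion above, here is a minimal Python sketch that drops a keyword assignment from the text of a steering (cas) file. It assumes the keyword and its value sit on a single line written as `KEYWORD = value` or `KEYWORD : value`; the function name `drop_keyword` and the sample file content are hypothetical and not taken from the poster's case.

```python
import re


def drop_keyword(cas_text: str, keyword: str) -> str:
    """Return cas_text without the lines that assign `keyword`.

    Assumption: the assignment fits on one line, as in
    "CHECKING THE MESH = YES" or "CHECKING THE MESH : YES".
    """
    pattern = re.compile(r"^\s*" + re.escape(keyword) + r"\s*[:=]", re.IGNORECASE)
    kept = [line for line in cas_text.splitlines() if not pattern.match(line)]
    return "\n".join(kept) + "\n"


# Hypothetical steering-file fragment, for illustration only.
sample = (
    "TITLE = 'parallel test'\n"
    "PARALLEL PROCESSORS = 6\n"
    "CHECKING THE MESH = YES\n"
    "TIME STEP = 1.0\n"
)

cleaned = drop_keyword(sample, "CHECKING THE MESH")
print(cleaned)
```

Editing the line by hand in a text editor does the same job; the sketch is only useful when several cas files need the same change.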
The following user(s) said Thank You: acsantosr

Parallel computing error 7 years 7 months ago #25997

  • acsantosr
Hi!!
Thanks for your replies!
I removed the keyword CHECKING THE MESH and the simplified model ran. However, when I try to run the "complex" model, it doesn't work.

The message is:

Loading Options and Configurations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[ASCII-art version banner]


... parsing configuration file: /home/leno/opentelemac/v7p1r1/configs/systel.cfg


Running your CAS file for:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+> configuration: ubugfopenmpi
+> root: /home/leno/opentelemac/v7p1r1


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


... reading the main module dictionary

... processing the main CAS file(s)
+> running in English

... handling temporary directories

... checking coupling between codes

... checking parallelisation

... first pass at copying all input files
copying: geo.slf /media/leno/TITI2T/Dropbox/100_WorkSpace/Tele10/corrida_paralelo.cas_2017-04-06-08h57min29s/T2DGEO
copying: bc_sistema.cli /media/leno/TITI2T/Dropbox/100_WorkSpace/Tele10/corrida_paralelo.cas_2017-04-06-08h57min29s/T2DCLI
re-copying: /media/leno/TITI2T/Dropbox/100_WorkSpace/Tele10/corrida_paralelo.cas_2017-04-06-08h57min29s/T2DCAS
copying: telemac2d.dico /media/leno/TITI2T/Dropbox/100_WorkSpace/Tele10/corrida_paralelo.cas_2017-04-06-08h57min29s/T2DDICO

... checking the executable
re-copying: telemac2d /media/leno/TITI2T/Dropbox/100_WorkSpace/Tele10/corrida_paralelo.cas_2017-04-06-08h57min29s/out_telemac2d

... modifying run command to MPI instruction

... modifying run command to PARTEL instruction

... partitioning base files (geo, conlim, sections and zones)
+> /home/leno/opentelemac/v7p1r1/builds/ubugfopenmpi/bin/partel < PARTEL.PAR >> partel_T2DGEO.log
*** The MPI_Comm_f2c() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[leno-Lenovo-ideapad-700-15ISK:7871] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
runPartition:
|runPARTEL: Could not split your file T2DGEO (runcode=1) with the error as follows:
|
|... The following command failed for the reason above (or below)
|/home/leno/opentelemac/v7p1r1/builds/ubugfopenmpi/bin/partel < PARTEL.PAR >> partel_T2DGEO.log
|
| You may have forgotten to compile PARTEL with the appropriate compiler directive
| (add -DHAVE_MPI to your cmd_obj in your configuration file).
|
|Here is the log:
|
| PARTEL/PARRES: TELEMAC METISOLOGIC PARTITIONER
|
| REBEKKA KOPMANN & JACEK A. JANKOWSKI (BAW)
| JEAN-MICHEL HERVOUET (LNHE)
| CHRISTOPHE DENIS (SINETICS)
| YOANN AUDOUIN (LNHE)
|
| PARTEL (C) COPYRIGHT 2000-2002
| BUNDESANSTALT FUER WASSERBAU, KARLSRUHE
|
| METIS 5.0.2 (C) COPYRIGHT 2012
| REGENTS OF THE UNIVERSITY OF MINNESOTA
|
| BIEF 7.1 (C) COPYRIGHT 2012 EDF
|
| MAXIMUM NUMBER OF PARTITIONS: 100000
|
| --INPUT FILE NAME <INPUT_NAME>:
| INPUT: T2DGEO
| --INPUT FILE FORMAT <INPFORMAT> [MED,SERAFIN,SERAFIND]:
| INPUT: SERAFIN
| --BOUNDARY CONDITIONS FILE NAME:
| INPUT: T2DCLI
| --NUMBER OF PARTITIONS <NPARTS> [2 -100000]:
| INPUT: 2
| PARTITIONING METHOD <PMETHOD> [1 (METIS) OR 2 (SCOTCH)]:
| --INPUT: 1
| --CONTROL SECTIONS FILE NAME (OR RETURN) :
| NO SECTIONS
| --CONTROL ZONES FILE NAME (OR RETURN) :
| NO ZONES
| --GEOMETRY FILE NAME <INPUT_NAME>:
| INPUT: T2DGEO
| --GEOMETRY FILE FORMAT <GEOFORMAT> [MED,SERAFIN,SERAFIND]:
| INPUT: SERAFIN
|
| +---- PARTEL: BEGINNING
|
| READ_MESH_INFO: TITLE= newSelafin
| NUMBER OF ELEMENTS: 283042
| NUMBER OF POINTS: 142158
|
| FORMAT NOT INDICATED IN TITLE
|
| ONE-LEVEL MESH.
| NDP NODES PER ELEMENT: 3
| ELEMENT TYPE : 10
| NPOIN NUMBER OF MESH NODES: 142158
| NELEM NUMBER OF MESH ELEMENTS: 283042
|
| THE INPUT FILE ASSUMED TO BE 2D
| THERE ARE 1 TIME-DEPENDENT RECORDINGS
|
| THERE IS 7 LIQUID BOUNDARIES:
|
| BOUNDARY 1 :
| BEGINS AT BOUNDARY POINT: 252 , WITH GLOBAL NUMBER: 141281
| AND COORDINATES: 894627.8 1402157.
| ENDS AT BOUNDARY POINT: 281 , WITH GLOBAL NUMBER: 141562
| AND COORDINATES: 894710.1 1402297.
|
| BOUNDARY 2 :
| BEGINS AT BOUNDARY POINT: 543 , WITH GLOBAL NUMBER: 109242
| AND COORDINATES: 898369.1 1412088.
| ENDS AT BOUNDARY POINT: 557 , WITH GLOBAL NUMBER: 109340
| AND COORDINATES: 898378.8 1412166.
|
| BOUNDARY 3 :
| BEGINS AT BOUNDARY POINT: 768 , WITH GLOBAL NUMBER: 103095
| AND COORDINATES: 894480.9 1427473.
| ENDS AT BOUNDARY POINT: 779 , WITH GLOBAL NUMBER: 102993
| AND COORDINATES: 894332.8 1427515.
|
| BOUNDARY 4 :
| BEGINS AT BOUNDARY POINT: 836 , WITH GLOBAL NUMBER: 102041
| AND COORDINATES: 893365.9 1427819.
| ENDS AT BOUNDARY POINT: 853 , WITH GLOBAL NUMBER: 101846
| AND COORDINATES: 893240.7 1427872.
|
| BOUNDARY 5 :
| BEGINS AT BOUNDARY POINT: 908 , WITH GLOBAL NUMBER: 122073
| AND COORDINATES: 891395.2 1428052.
| ENDS AT BOUNDARY POINT: 921 , WITH GLOBAL NUMBER: 121880
| AND COORDINATES: 891227.2 1427982.
|
| BOUNDARY 6 :
| BEGINS AT BOUNDARY POINT: 1023 , WITH GLOBAL NUMBER: 113224
| AND COORDINATES: 880841.8 1420896.
| ENDS AT BOUNDARY POINT: 1042 , WITH GLOBAL NUMBER: 112955
| AND COORDINATES: 880646.0 1420812.
|
| BOUNDARY 7 :
| BEGINS AT BOUNDARY POINT: 1214 , WITH GLOBAL NUMBER: 111003
| AND COORDINATES: 881869.2 1405401.
| ENDS AT BOUNDARY POINT: 1243 , WITH GLOBAL NUMBER: 111498
| AND COORDINATES: 881933.2 1405304.
|
| THERE IS 7 SOLID BOUNDARIES:
|
| BOUNDARY 1 :
| BEGINS AT BOUNDARY POINT: 1243 , WITH GLOBAL NUMBER: 111498
| AND COORDINATES: 881933.2 1405304.
| ENDS AT BOUNDARY POINT: 252 , WITH GLOBAL NUMBER: 141281
| AND COORDINATES: 894627.8 1402157.
|
| BOUNDARY 2 :
| BEGINS AT BOUNDARY POINT: 281 , WITH GLOBAL NUMBER: 141562
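
The launcher's hint in the log above suggests checking that `-DHAVE_MPI` appears in the `cmd_obj` entry of the configuration file. A minimal Python sketch of that check follows, assuming the configuration file is INI-like text with one section per build configuration. The function name, the section names and the compiler command lines in `sample_cfg` are hypothetical; only `cmd_obj` and `-DHAVE_MPI` come from the log. Passing this check does not prove the build is correct, and the `MPI_Comm_f2c()` abort may have another cause.

```python
import configparser


def configs_missing_mpi_flag(cfg_text: str, flag: str = "-DHAVE_MPI"):
    """List the sections whose cmd_obj entry does not contain `flag`."""
    # interpolation=None keeps any "%" in a compiler command from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    missing = []
    for section in parser.sections():
        if not parser.has_option(section, "cmd_obj"):
            continue  # sections without a compile command are skipped
        if flag not in parser.get(section, "cmd_obj"):
            missing.append(section)
    return missing


# Hypothetical configuration text, for illustration only.
sample_cfg = """
[Configurations]
configs: serial_demo parallel_demo

[serial_demo]
cmd_obj: gfortran -c -O3 <mods> <incs> <f95name>

[parallel_demo]
cmd_obj: mpif90 -c -O3 -DHAVE_MPI <mods> <incs> <f95name>
"""

print(configs_missing_mpi_flag(sample_cfg))
```

To run it against a real file, read the file's text first and pass it to the function; the path shown in the log (`.../configs/systel.cfg`) is where this installation keeps it.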

Parallel computing error 7 years 7 months ago #25999

  • josekdiaz
Hello,

Before anything else, you should correct this line:
PRESCRIBED FLOWRATES = 425.0; 413.40; -483.3; -522.9; -480.0; 480.0; 412.2

It has more than 70 characters. You could remove the spaces and maybe split it over two lines, like this:
PRESCRIBED FLOWRATES =425.0;413.40;-483.3;-522.9;-480.0
;480.0;412.2

If necessary, do the same with the prescribed elevations. On the other hand, is your build correctly compiled for parallel runs (did you test it with, e.g., the malpasset case with multiple workers)?

Regards,

José Díaz.
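
The line-length advice above can be checked mechanically. The following Python sketch reports every line of a cas file that is longer than a given limit; the limit of 70 is the figure quoted in this thread and is passed as a parameter, and the function name `long_lines` is hypothetical. The two test strings are the original and the compacted versions of the line quoted in the post.

```python
def long_lines(cas_text: str, limit: int = 70):
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(cas_text.splitlines(), start=1)
        if len(line) > limit
    ]


# The line quoted in the post, before and after compaction.
original = (
    "PRESCRIBED FLOWRATES = 425.0; 413.40; -483.3; -522.9; -480.0; 480.0; 412.2"
)
compact = (
    "PRESCRIBED FLOWRATES =425.0;413.40;-483.3;-522.9;-480.0\n"
    ";480.0;412.2"
)

print(long_lines(original))  # the original line is reported as too long
print(long_lines(compact))   # the compacted version produces no report
```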

Parallel computing error 7 years 7 months ago #26001

  • acsantosr
Hi Jose,

I modified "t2d_malpasset-large.cas" by adding PROCESSEURS PARALLELES: 2, and it ran.

I deleted the spaces in the prescribed flowrates and elevations in my *.cas file, and it still does not run :unsure:

Parallel computing error 7 years 7 months ago #26000

  • acsantosr
The files:

Parallel computing error 7 years 7 months ago #26002

Hi,
As Jose has said, the line length in your cas file must be less than 70 characters, but that is not what is causing your error.

It is difficult to tell without the geo.slf file; I assume it is too large to post.

If there were something wrong with your MPI installation, the smaller model should not have run either. I do note that you are using a Dropbox directory. Could that be causing the difficulty?

If you find that the test case Jose recommended runs fine and gives you the correct result, it might be worth reducing the model size.

regards
Tony C
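
To follow up on the Dropbox question, a small Python sketch can flag a working directory that sits inside a synced folder before a run is launched. It only inspects the path components; the function name and the `markers` parameter are hypothetical, and whether a synced folder actually interferes with the run is a suspicion raised in this thread, not an established cause.

```python
from pathlib import PurePosixPath


def inside_synced_folder(path: str, markers=("Dropbox",)) -> bool:
    """Return True if any component of `path` equals one of `markers`."""
    parts = PurePosixPath(path).parts
    return any(marker in parts for marker in markers)


# Directories that appear in the logs of this thread.
work_dir = "/media/leno/TITI2T/Dropbox/100_WorkSpace/Tele10"
install_dir = "/home/leno/opentelemac/v7p1r1"

print(inside_synced_folder(work_dir))
print(inside_synced_folder(install_dir))
```

If the check returns True, copying the case to a plain local directory and rerunning is a cheap way to rule the synced folder in or out.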

Parallel computing error 7 years 7 months ago #26003

Also, try running it without the debug option in the cas file to see whether it runs.