
TOPIC: my simulation doesn't work for compute nodes

my simulation doesn't work for compute nodes 1 year 6 months ago #42471

  • josiastud (Senior Boarder, Posts: 132)
Hi everybody,
I have a homemade cluster:
- 3 nodes (desktops), 4 CPUs each,
- shared over NFS,
- TELEMAC installed on the master node and mirrored to the client nodes,
- Ubuntu Server 20.04 on all nodes (CentOS was too complicated for me to install),
- config file & bash file attached (s10.gfortran.dyn),
- simulation I tried: telemac2d.py t2d_gouttedo.cas --ncsize=12 --ncnode=3
It works with ncsize=4, but it fails with ncsize=12, 8 or 6, and also with ncnode=3 (I can't pass -np=12; telemac2d.py does not recognize it as an argument).
That's why I think only the master node is computing.

So here are my questions:
1) How do I make all my nodes run a simulation? (see the hostfile sketch just below)
2) How can I distribute several tasks across all my machines?
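
From what I understand, a minimal Open MPI hostfile for a 3-node, 4-core setup like mine would look something like this (node names are placeholders, and I am not sure yet how TELEMAC expects to receive it):

# hosts.txt - one line per machine; "slots" = how many processes Open MPI may start there
node1 slots=4
node2 slots=4
node3 slots=4

# then, in principle (executable path assumed):
mpirun --hostfile hosts.txt -np 12 <executable>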

Here is the error message:


... partitioning base files (geo, conlim, sections, zones and weirs)
+> /home/ben/telemac-mascaret/builds/S10.gfortran.dyn/bin/partel < partel_T2DGEO.par >> partel_T2DGEO.log
partel: malloc.c:4036: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
Failed during initial partitioning

At line 45 of file /home/ben/telemac-mascaret/sources/utils/special/extens.f
Fortran runtime error: End of file

Error termination. Backtrace:
#0 0x7f9630010d21 in ???
#1 0x7f9630011869 in ???
#2 0x7f963001254f in ???
#3 0x7f9630254e8a in ???
#4 0x7f963025ef36 in ???
#5 0x7f9630258584 in ???
#6 0x7f96302589d3 in ???
#7 0x7f96302ce660 in ???
#8 0x55ca2524e68c in ???
#9 0x55ca25237c54 in ???
#10 0x55ca25235882 in ???
#11 0x7f962fb33082 in __libc_start_main
at ../csu/libc-start.c:308
#12 0x55ca252358bd in ???
#13 0xffffffffffffffff in ???
munmap_chunk(): invalid pointer

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0 0x7f9630010d21 in ???
#1 0x7f963000fef5 in ???
#2 0x7f962fb5208f in ???
at /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3 0x7f962fb5200b in __GI_raise
at ../sysdeps/unix/sysv/linux/raise.c:51
#4 0x7f962fb31858 in __GI_abort
at /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79
#5 0x7f962fb9c26d in __libc_message
at ../sysdeps/posix/libc_fatal.c:155
#6 0x7f962fba42fb in malloc_printerr
at /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:5347
#7 0x7f962fba454b in munmap_chunk
at /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:2830
#8 0x7f963025b671 in ???
#9 0x7f963025b2ad in ???
#10 0x7f963025b4c9 in ???
#11 0x7f9630acef6a in _dl_fini
at /build/glibc-SzIz7B/glibc-2.31/elf/dl-fini.c:138
#12 0x7f962fb558a6 in __run_exit_handlers
at /build/glibc-SzIz7B/glibc-2.31/stdlib/exit.c:108
#13 0x7f962fb55a5f in __GI_exit
at /build/glibc-SzIz7B/glibc-2.31/stdlib/exit.c:139
#14 0x7f963001184d in ???
#15 0x7f963001254f in ???
#16 0x7f9630254e8a in ???
#17 0x7f963025ef36 in ???
#18 0x7f9630258584 in ???
#19 0x7f96302589d3 in ???
#20 0x7f96302ce660 in ???
#21 0x55ca2524e68c in ???
#22 0x55ca25237c54 in ???
#23 0x55ca25235882 in ???
#24 0x7f962fb33082 in __libc_start_main
at ../csu/libc-start.c:308
#25 0x55ca252358bd in ???
#26 0xffffffffffffffff in ???
Aborted (core dumped)
Traceback (most recent call last):
File "/home/ben/telemac-mascaret/scripts/python3/telemac2d.py", line 7, in <module>
main('telemac2d')
File "/home/ben/telemac-mascaret/scripts/python3/runcode.py", line 288, in main
run_study(cas_file, code_name, options)
File "/home/ben/telemac-mascaret/scripts/python3/execution/run_cas.py", line 169, in run_study
run_local_cas(my_study, options)
File "/home/ben/telemac-mascaret/scripts/python3/execution/run_cas.py", line 31, in run_local_cas
my_study.partionning(options.use_link)
File "/home/ben/telemac-mascaret/scripts/python3/execution/study.py", line 429, in partionning
run_partition(parcmd, self.cas, g_geo, g_fmt_geo, g_conlim,
File "/home/ben/telemac-mascaret/scripts/python3/execution/run.py", line 51, in run_partition
run_partel(partel, geom, fmtgeom, conlim, ncsize, False,
File "/home/ben/telemac-mascaret/scripts/python3/execution/run.py", line 133, in run_partel
raise TelemacException(
utils.exceptions.TelemacException: Could not split your file T2DGEO with the error as follows:

... The following command failed for the reason in the listing
/home/ben/telemac-mascaret/builds/S10.gfortran.dyn/bin/partel < partel_T2DGEO.par >> partel_T2DGEO.log


Here is the log:


+
+

PARTEL/PARRES: TELEMAC METISOLOGIC PARTITIONER



REBEKKA KOPMANN & JACEK A. JANKOWSKI (BAW)

JEAN-MICHEL HERVOUET (LNHE)

CHRISTOPHE DENIS (SINETICS)

YOANN AUDOUIN (LNHE)

PARTEL (C) COPYRIGHT 2000-2002

BUNDESANSTALT FUER WASSERBAU, KARLSRUHE



METIS 5.0.2 (C) COPYRIGHT 2012

REGENTS OF THE UNIVERSITY OF MINNESOTA



BIEF V8P4 (C) COPYRIGHT 2012 EDF

+
+





MAXIMUM NUMBER OF PARTITIONS: 100000



+
+



--INPUT FILE NAME <INPUT_NAME>:

INPUT: T2DGEO

--INPUT FILE FORMAT <INPFORMAT> [MED,SERAFIN,SERAFIND]:

INPUT: SERAFIN

--BOUNDARY CONDITIONS FILE NAME:

INPUT: T2DCLI

--NUMBER OF PARTITIONS <NPARTS> [2 -100000]:

INPUT: 12

PARTITIONING METHOD <PMETHOD> [1 (METIS) OR 2 (SCOTCH)]:

--INPUT: 1

--CONTROL SECTIONS FILE NAME (OR RETURN) :

NO SECTIONS

--CONTROL ZONES FILE NAME (OR RETURN) :

NO ZONES

--WEIR FILE NAME (OR RETURN) :

NO WEIRS

--GEOMETRY FILE NAME <INPUT_NAME>:

INPUT: T2DGEO

--GEOMETRY FILE FORMAT <GEOFORMAT> [MED,SERAFIN,SERAFIND]:

INPUT: SERAFIN

--CONCATENATE FILES <YES-NO>:

CONCATENATE: NO

+---- PARTEL: BEGINNING
+

FICHIER:T2DGEO





READ_MESH_INFO: TITLE= TELEMAC 2D : GOUTTE D'EAU DANS UN BASSIN$

NUMBER OF ELEMENTS: 8978

NUMBER OF POINTS: 4624



TYPE OF ELEMENT: TRIANGLE

TYPE OF BND ELEMENT: POINT



SINGLE PRECISION FORMAT (R4)





ONE-LEVEL MESH.

NDP NODES PER ELEMENT: 3

ELEMENT TYPE : 10

NPOIN NUMBER OF MESH NODES: 4624

NELEM NUMBER OF MESH ELEMENTS: 8978



THE INPUT FILE ASSUMED TO BE 2D

THERE ARE 1 TIME-DEPENDENT RECORDINGS



THERE IS 1 SOLID BOUNDARIES:



BOUNDARY 1 :

BEGINS AT BOUNDARY POINT: 1 , WITH GLOBAL NUMBER: 1

AND COORDINATES: 0.000000 0.000000

ENDS AT BOUNDARY POINT: 1 , WITH GLOBAL NUMBER: 1

AND COORDINATES: 0.000000 0.000000

THE MESH PARTITIONING STEP STARTS

BEGIN PARTITIONING WITH METIS

RUNTIME OF METIS 0.00000000 SECONDS

THE MESH PARTITIONING STEP HAS FINISHED

my simulation doesn't work for compute nodes 1 year 6 months ago #42475

  • josiastud (Senior Boarder, Posts: 132)
Hi everyone,
UPDATE:
I tried several modifications of the config file; the last one was to comment out this line:
#par_cmd_exec_edf: srun -n 1 -N 1 <config>/partel < <partel.par> >> <partel.log>

Anyway,
I can now run a simulation on any node from any node, but
I still can't run one simulation on 2 or 3 different nodes simultaneously.
For example, I can run
telemac2d.py t2d_gouttedo.cas --hosts=client_node1 --ncsize=4

but I cannot run

telemac2d.py t2d_gouttedo.cas --ncsize=6

There are not enough slots available in the system to satisfy the 6
slots that were requested by the application:

/home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h32min48s/out_user_fortran

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:

1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
Traceback (most recent call last):
File "/home/ben/telemac-mascaret/scripts/python3/telemac2d.py", line 7, in <module>
main('telemac2d')
File "/home/ben/telemac-mascaret/scripts/python3/runcode.py", line 288, in main
run_study(cas_file, code_name, options)
File "/home/ben/telemac-mascaret/scripts/python3/execution/run_cas.py", line 169, in run_study
run_local_cas(my_study, options)
File "/home/ben/telemac-mascaret/scripts/python3/execution/run_cas.py", line 65, in run_local_cas
my_study.run(options)
File "/home/ben/telemac-mascaret/scripts/python3/execution/study.py", line 644, in run
self.run_local()
File "/home/ben/telemac-mascaret/scripts/python3/execution/study.py", line 465, in run_local
run_code(self.run_cmd, self.sortie_file)
File "/home/ben/telemac-mascaret/scripts/python3/execution/run.py", line 182, in run_code
raise TelemacException('Fail to run\n'+exe)
utils.exceptions.TelemacException: Fail to run
mpirun -np 6 /home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h32min48s/out_user_fortran
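
If I read the Open MPI message above correctly, slots for the other nodes have to be declared explicitly, roughly like this (host names and core counts are assumptions):

# declare 4 cores on each of two nodes via the ":N" suffix mentioned in the message
mpirun --host node1:4,node2:4 -np 6 ./out_user_fortran
# or simply ignore the slot limit, as the message suggests
mpirun --oversubscribe -np 6 ./out_user_fortran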


I also tried:

telemac2d.py t2d_gouttedo.cas --ncnode=2 --nctile=4



Running your simulation(s) :
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



In /home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s:
mpirun -np 8 /home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s/out_user_fortran


/home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s/out_user_fortran: error while loading shared libraries: libtelemac2d.so: cannot open shared object file: No such file or directory
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
/home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s/out_user_fortran: error while loading shared libraries: libtelemac2d.so: cannot open shared object file: No such file or directory
/home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s/out_user_fortran: error while loading shared libraries: libtelemac2d.so: cannot open shared object file: No such file or directory
/home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s/out_user_fortran: error while loading shared libraries: libtelemac2d.so: cannot open shared object file: No such file or directory
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[15249,1],6]
Exit code: 127
Traceback (most recent call last):
File "/home/ben/telemac-mascaret/scripts/python3/telemac2d.py", line 7, in <module>
main('telemac2d')
File "/home/ben/telemac-mascaret/scripts/python3/runcode.py", line 288, in main
run_study(cas_file, code_name, options)
File "/home/ben/telemac-mascaret/scripts/python3/execution/run_cas.py", line 169, in run_study
run_local_cas(my_study, options)
File "/home/ben/telemac-mascaret/scripts/python3/execution/run_cas.py", line 65, in run_local_cas
my_study.run(options)
File "/home/ben/telemac-mascaret/scripts/python3/execution/study.py", line 644, in run
self.run_local()
File "/home/ben/telemac-mascaret/scripts/python3/execution/study.py", line 465, in run_local
run_code(self.run_cmd, self.sortie_file)
File "/home/ben/telemac-mascaret/scripts/python3/execution/run.py", line 182, in run_code
raise TelemacException('Fail to run\n'+exe)
utils.exceptions.TelemacException: Fail to run
mpirun -np 8 /home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s/out_user_fortran
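
My guess is that libtelemac2d.so is simply not visible to the processes started on the remote nodes. A sketch of what I would put in ~/.bashrc on every node (the lib path is an assumption based on my build directory):

# assumed location of the dynamic libraries for the S10.gfortran.dyn build
export LD_LIBRARY_PATH=/home/ben/telemac-mascaret/builds/S10.gfortran.dyn/lib:$LD_LIBRARY_PATH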

my simulation doesn't work for compute nodes 1 year 6 months ago #42476

  • josiastud (Senior Boarder, Posts: 132)
Hi, sorry for overloading this thread; I just think more details will make it easier to find the solution.

UPDATE 2:
When running a 3D simulation I get this error:
A process or daemon was unable to complete a TCP connection
to another process:
Local host: ben0
Remote host: ben1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

A process or daemon was unable to complete a TCP connection
to another process:[ben0:20402] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
Local host: ben0
Remote host: ben2
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
[ben0:20402] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
....

Even after disabling ufw with the command
ufw disable
I still get the same message (nearly the same: [ben0:20402] changes to [ben0:02527] and so on).
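
For completeness, a sketch of what I would check next, namely that the firewall is really off on every node and not only on the master (passwordless sudo assumed):

# hosts ben0, ben1, ben2 as in the message above
for h in ben0 ben1 ben2; do
    ssh "$h" 'sudo ufw status verbose; sudo ufw disable'
done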

Thanks for your help.