TOPIC: my simulation doesn't work for compute nodes
my simulation doesn't work for compute nodes 1 year 6 months ago #42471
Hi everybody,
I have a homemade cluster:
- 3 nodes (desktops), 4 CPUs each
- NFS shared file system
- TELEMAC installed on the master node and mirrored to the client nodes
- Ubuntu Server 20.04 on all nodes (CentOS was too complicated for me to install)
- config file and bash file attached (s10.gfortran.dyn)

The simulation I tried:

telemac2d.py t2d_gouttedo.cas --ncsize=12 --ncnode=3

It works with --ncsize=4, but fails with --ncsize=12, 8 or 6, and even with --ncnode=3. (I can't pass -np=12 either; telemac2d.py doesn't recognize it as an argument.) That's why I think only the master node is computing.

So here are my questions:
1) How do I make all my nodes run simulations?
2) How can I distribute several tasks across all my machines?

Here is the error message:

...
partitioning base files (geo, conlim, sections, zones and weirs)
+> /home/ben/telemac-mascaret/builds/S10.gfortran.dyn/bin/partel < partel_T2DGEO.par >> partel_T2DGEO.log
partel: malloc.c:4036: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
Failed during initial partitioning
At line 45 of file /home/ben/telemac-mascaret/sources/utils/special/extens.f
Fortran runtime error: End of file
Error termination. Backtrace:
#0  0x7f9630010d21 in ???
#1  0x7f9630011869 in ???
#2  0x7f963001254f in ???
#3  0x7f9630254e8a in ???
#4  0x7f963025ef36 in ???
#5  0x7f9630258584 in ???
#6  0x7f96302589d3 in ???
#7  0x7f96302ce660 in ???
#8  0x55ca2524e68c in ???
#9  0x55ca25237c54 in ???
#10  0x55ca25235882 in ???
#11  0x7f962fb33082 in __libc_start_main at ../csu/libc-start.c:308
#12  0x55ca252358bd in ???
#13  0xffffffffffffffff in ???
munmap_chunk(): invalid pointer
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0  0x7f9630010d21 in ???
#1  0x7f963000fef5 in ???
#2  0x7f962fb5208f in ??? at /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3  0x7f962fb5200b in __GI_raise at ../sysdeps/unix/sysv/linux/raise.c:51
#4  0x7f962fb31858 in __GI_abort at /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79
#5  0x7f962fb9c26d in __libc_message at ../sysdeps/posix/libc_fatal.c:155
#6  0x7f962fba42fb in malloc_printerr at /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:5347
#7  0x7f962fba454b in munmap_chunk at /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:2830
#8  0x7f963025b671 in ???
#9  0x7f963025b2ad in ???
#10  0x7f963025b4c9 in ???
#11  0x7f9630acef6a in _dl_fini at /build/glibc-SzIz7B/glibc-2.31/elf/dl-fini.c:138
#12  0x7f962fb558a6 in __run_exit_handlers at /build/glibc-SzIz7B/glibc-2.31/stdlib/exit.c:108
#13  0x7f962fb55a5f in __GI_exit at /build/glibc-SzIz7B/glibc-2.31/stdlib/exit.c:139
#14  0x7f963001184d in ???
#15  0x7f963001254f in ???
#16  0x7f9630254e8a in ???
#17  0x7f963025ef36 in ???
#18  0x7f9630258584 in ???
#19  0x7f96302589d3 in ???
#20  0x7f96302ce660 in ???
#21  0x55ca2524e68c in ???
#22  0x55ca25237c54 in ???
#23  0x55ca25235882 in ???
#24  0x7f962fb33082 in __libc_start_main at ../csu/libc-start.c:308
#25  0x55ca252358bd in ???
#26  0xffffffffffffffff in ???
Aborted (core dumped)
Traceback (most recent call last):
  File "/home/ben/telemac-mascaret/scripts/python3/telemac2d.py", line 7, in <module>
    main('telemac2d')
  File "/home/ben/telemac-mascaret/scripts/python3/runcode.py", line 288, in main
    run_study(cas_file, code_name, options)
  File "/home/ben/telemac-mascaret/scripts/python3/execution/run_cas.py", line 169, in run_study
    run_local_cas(my_study, options)
  File "/home/ben/telemac-mascaret/scripts/python3/execution/run_cas.py", line 31, in run_local_cas
    my_study.partionning(options.use_link)
  File "/home/ben/telemac-mascaret/scripts/python3/execution/study.py", line 429, in partionning
    run_partition(parcmd, self.cas, g_geo, g_fmt_geo, g_conlim,
  File "/home/ben/telemac-mascaret/scripts/python3/execution/run.py", line 51, in run_partition
    run_partel(partel, geom, fmtgeom, conlim, ncsize, False,
  File "/home/ben/telemac-mascaret/scripts/python3/execution/run.py", line 133, in run_partel
    raise TelemacException(
utils.exceptions.TelemacException: Could not split your file T2DGEO with the error as follows:
...
The following command failed for the reason in the listing
/home/ben/telemac-mascaret/builds/S10.gfortran.dyn/bin/partel < partel_T2DGEO.par >> partel_T2DGEO.log

Here is the log:

PARTEL/PARRES: TELEMAC METISOLOGIC PARTITIONER
REBEKKA KOPMANN & JACEK A. JANKOWSKI (BAW)
JEAN-MICHEL HERVOUET (LNHE)
CHRISTOPHE DENIS (SINETICS)
YOANN AUDOUIN (LNHE)
PARTEL (C) COPYRIGHT 2000-2002 BUNDESANSTALT FUER WASSERBAU, KARLSRUHE
METIS 5.0.2 (C) COPYRIGHT 2012 REGENTS OF THE UNIVERSITY OF MINNESOTA
BIEF V8P4 (C) COPYRIGHT 2012 EDF

MAXIMUM NUMBER OF PARTITIONS: 100000

--INPUT FILE NAME <INPUT_NAME>: INPUT: T2DGEO
--INPUT FILE FORMAT <INPFORMAT> [MED,SERAFIN,SERAFIND]: INPUT: SERAFIN
--BOUNDARY CONDITIONS FILE NAME: INPUT: T2DCLI
--NUMBER OF PARTITIONS <NPARTS> [2 -100000]: INPUT: 12
--PARTITIONING METHOD <PMETHOD> [1 (METIS) OR 2 (SCOTCH)]: INPUT: 1
--CONTROL SECTIONS FILE NAME (OR RETURN): NO SECTIONS
--CONTROL ZONES FILE NAME (OR RETURN): NO ZONES
--WEIR FILE NAME (OR RETURN): NO WEIRS
--GEOMETRY FILE NAME <INPUT_NAME>: INPUT: T2DGEO
--GEOMETRY FILE FORMAT <GEOFORMAT> [MED,SERAFIN,SERAFIND]: INPUT: SERAFIN
--CONCATENATE FILES <YES-NO>: CONCATENATE: NO

+---- PARTEL: BEGINNING ----+

FICHIER: T2DGEO
READ_MESH_INFO: TITLE= TELEMAC 2D : GOUTTE D'EAU DANS UN BASSIN$
NUMBER OF ELEMENTS: 8978
NUMBER OF POINTS: 4624
TYPE OF ELEMENT: TRIANGLE
TYPE OF BND ELEMENT: POINT
SINGLE PRECISION FORMAT (R4)
ONE-LEVEL MESH.
NDP NODES PER ELEMENT: 3
ELEMENT TYPE: 10
NPOIN NUMBER OF MESH NODES: 4624
NELEM NUMBER OF MESH ELEMENTS: 8978
THE INPUT FILE ASSUMED TO BE 2D
THERE ARE 1 TIME-DEPENDENT RECORDINGS
THERE IS 1 SOLID BOUNDARIES:
BOUNDARY 1:
  BEGINS AT BOUNDARY POINT: 1, WITH GLOBAL NUMBER: 1 AND COORDINATES: 0.000000 0.000000
  ENDS AT BOUNDARY POINT: 1, WITH GLOBAL NUMBER: 1 AND COORDINATES: 0.000000 0.000000
THE MESH PARTITIONING STEP STARTS
BEGIN PARTITIONING WITH METIS
RUNTIME OF METIS 0.00000000 SECONDS
THE MESH PARTITIONING STEP HAS FINISHED
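(Note: as the Open MPI message later in this thread explains, the number of processes that may start per machine comes from a hostfile with "slots=N" clauses; without one, Open MPI only counts the cores of the local machine, which would explain why --ncsize=4 works while 6, 8 or 12 do not. A minimal sketch of such a hostfile, assuming the nodes are named ben0, ben1 and ben2 as in the later posts, and that the file name machinefile.txt is arbitrary:

    # machinefile.txt -- declares how many MPI processes each node may host
    ben0 slots=4
    ben1 slots=4
    ben2 slots=4

    # hypothetical direct launch of the compiled case executable on 12 ranks,
    # bypassing the TELEMAC scripts, to test the hostfile:
    mpirun -np 12 -hostfile machinefile.txt ./out_user_fortran

The TELEMAC config file would then need to point its mpirun command at this hostfile.)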
my simulation doesn't work for compute nodes 1 year 6 months ago #42475
Hi everyone,
UPDATE: I tried several modifications of the config file; the last one was to comment out this line:

#par_cmd_exec_edf: srun -n 1 -N 1 <config>/partel < <partel.par> >> <partel.log>

With that I can now run a simulation on any node from any node, but I still can't run one simulation on 2 or 3 different nodes simultaneously. For example, I can run

telemac2d.py t2d_gouttedo.cas --hosts=client_node1 --ncsize=4

but I cannot run

telemac2d.py t2d_gouttedo.cas --ncsize=6

which fails with:

There are not enough slots available in the system to satisfy the 6 slots that were requested by the application:
/home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h32min48s/out_user_fortran
Either request fewer slots for your application, or make more slots available for use.
A "slot" is the Open MPI term for an allocatable unit where we can launch a process. The number of slots available are defined by the environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number of hardware threads instead of the number of processor cores, use the --use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the number of available slots when deciding the number of processes to launch.
Traceback (most recent call last):
  File "/home/ben/telemac-mascaret/scripts/python3/telemac2d.py", line 7, in <module>
    main('telemac2d')
  File "/home/ben/telemac-mascaret/scripts/python3/runcode.py", line 288, in main
    run_study(cas_file, code_name, options)
  File "/home/ben/telemac-mascaret/scripts/python3/execution/run_cas.py", line 169, in run_study
    run_local_cas(my_study, options)
  File "/home/ben/telemac-mascaret/scripts/python3/execution/run_cas.py", line 65, in run_local_cas
    my_study.run(options)
  File "/home/ben/telemac-mascaret/scripts/python3/execution/study.py", line 644, in run
    self.run_local()
  File "/home/ben/telemac-mascaret/scripts/python3/execution/study.py", line 465, in run_local
    run_code(self.run_cmd, self.sortie_file)
  File "/home/ben/telemac-mascaret/scripts/python3/execution/run.py", line 182, in run_code
    raise TelemacException('Fail to run\n'+exe)
utils.exceptions.TelemacException: Fail to run
mpirun -np 6 /home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h32min48s/out_user_fortran

I also tried

telemac2d.py t2d_gouttedo.cas --ncnode=2 --nctile=4

which fails with:

Running your simulation(s) :
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In /home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s:
mpirun -np 8 /home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s/out_user_fortran
/home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s/out_user_fortran: error while loading shared libraries: libtelemac2d.so: cannot open shared object file: No such file or directory
(the same "error while loading shared libraries: libtelemac2d.so" line is printed three more times, once per failing process)
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated.
The first process to do so was:
Process name: [[15249,1],6]
Exit code: 127
Traceback (most recent call last):
  (same call chain as above, ending in run.py, line 182, in run_code)
utils.exceptions.TelemacException: Fail to run
mpirun -np 8 /home/ben/telemac-mascaret/examples/telemac2d/gouttedo/t2d_gouttedo.cas_2023-04-30-13h34min33s/out_user_fortran
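(Note: exit code 127 with "error while loading shared libraries" usually means the processes started on the remote nodes do not inherit LD_LIBRARY_PATH, so the dynamic TELEMAC build cannot find its .so files there. A sketch of the usual Open MPI workaround; the library directory below is my assumption about where this dynamic build puts its shared objects:

    # forward the library search path to every rank; -x is the standard
    # Open MPI option for exporting an environment variable to all processes
    export LD_LIBRARY_PATH=$HOME/telemac-mascaret/builds/S10.gfortran.dyn/lib:$LD_LIBRARY_PATH
    mpirun -x LD_LIBRARY_PATH -np 8 ./out_user_fortran

Alternatively, setting LD_LIBRARY_PATH near the top of ~/.bashrc on every node also works, because mpirun starts the remote ranks through non-interactive ssh shells, which stop reading ~/.bashrc at its interactive-only guard.)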
my simulation doesn't work for compute nodes 1 year 6 months ago #42476
Hi, sorry for overloading this thread; I just think you need more details to be able to find the solution easily. So:
UPDATE 2: When running a 3D simulation I get this error:

A process or daemon was unable to complete a TCP connection to another process:
  Local host: ben0
  Remote host: ben1
This is usually caused by a firewall on the remote host. Please check that any firewall (e.g., iptables) has been disabled and try again.

A process or daemon was unable to complete a TCP connection to another process:
  Local host: ben0
  Remote host: ben2
This is usually caused by a firewall on the remote host. Please check that any firewall (e.g., iptables) has been disabled and try again.

[ben0:20402] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[ben0:20402] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
...

Even after disabling ufw with the command "ufw disable" I still get nearly the same message; only the process ID changes, e.g. [ben0:20402] becomes [ben0:02527], and so forth.

Thanks for your help.
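P.S. A quick way to test whether MPI itself can open TCP connections between the machines, independent of TELEMAC; a sketch assuming passwordless ssh between the nodes is already configured:

    # should print the three hostnames if inter-node TCP connections work
    mpirun -np 3 --host ben0,ben1,ben2 hostname

    # the firewall has to be disabled (or the ports opened) on every node,
    # not only on the master:
    sudo ufw disable    # run this on ben0, ben1 and ben2

If the hostname test hangs or aborts with the same TCP message, the block is in the network setup rather than in TELEMAC.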