
TOPIC: SLURM on HPC help with submitting jobs

SLURM on HPC help with submitting jobs 3 years 10 months ago #37613

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
Hi all,

I am trying to get my first ever TELEMAC (v8p1) run going on an HPC with SLURM. I am something of a rookie with compiling and HPCs, but I have learned a lot thanks to this forum.

I used systel.edf.cfg as an example and edited it to compile with the Intel compilers [siku.intel.dyn]. My cfg file is attached.

Compilation went fine, but I am having trouble submitting a job.

From the following lines in the cfg file
hpc_runcode_edf: bash <root>/scripts/submit_slurm.sh <id_log>
par_cmd_exec_edf: srun -n 1 -N 1 <config>/partel < <partel.par> >> <partel.log>

I created a submit_slurm.sh in <root>/scripts/ (not in <root>/scripts/python3/ - maybe this is the problem?) which is also attached.

Then I created another sh file in my run directory, TMAC_sbatch.sh also attached.
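For anyone following along: at this spot a submit_slurm.sh wrapper typically just hands the generated HPC_STDIN job script to sbatch. A minimal sketch, not runnable outside a SLURM cluster, and with the log handling being an assumption rather than the attached script:

```bash
#!/bin/bash
# Sketch of <root>/scripts/submit_slurm.sh. TELEMAC invokes it as
#   bash submit_slurm.sh <id_log>
# from the temporary run directory, where the generated HPC_STDIN sits.
id_log=$1                          # log-file name passed in via <id_log>
sbatch HPC_STDIN > "$id_log" 2>&1  # submit the generated job script
cat "$id_log"                      # echo "Submitted batch job NNN" back to the caller
```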

I guess the main question is: how do I properly start a run now? Sorry if this is very basic; I am doing my homework as much as possible, searching the forum and online, but I could not solve it yet. I did not share all the errors I get since I am not sure which ones are meaningful, but I suppose the main one is this:
File "/home/user/v8p1/scripts/python3/execution/run.py", line 182, in run_code
    raise TelemacException('Fail to run\n'+exe)
utils.exceptions.TelemacException: Fail to run
bash /home/user/v8p1/scripts/submit_slurm.sh <id_log>


Cheers
Onur
The administrator has disabled public write access.

SLURM on HPC help with submitting jobs 3 years 10 months ago #37632

  • Yunhao Song
  • OFFLINE
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
Hi Onur,

First, maybe you should try the latest version (v8p2r0), which is what the forum always recommends.

Since you finished the compilation, did you successfully run your test case in sequential mode? If you did, please try an MPI test before launching your case on multiple cores, to make sure the parallel configuration works properly.

I'm also using an HPC cluster with SLURM, so in my limited experience the error probably comes from your .cfg file. I have attached my .cfg file and sbatch file for your reference; I hope it helps.
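Such an MPI test can be as small as a Fortran hello-world along these lines (a generic sketch, to be compiled with the same MPI wrapper as in the cfg):

```fortran
! Minimal MPI sanity check (hello_world.f90): each rank prints its number.
program hello_world
  use mpi
  implicit none
  integer :: ierr, rank
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  print *, 'node', rank, ': Hello world'
  call MPI_Finalize(ierr)
end program hello_world
```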

Regards,
Yunhao
Attachments:
The following user(s) said Thank You: Kurum

SLURM on HPC help with submitting jobs 3 years 10 months ago #37645

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
Hi Yunhao,

Thank you so much for sharing your cfg file. Indeed, I found my mistake: I forgot to load the Intel MPI module. I realized that when I saw the line:
cmd_obj: mpiifort ....

I was basically compiling with Intel Fortran + OpenMPI.

I fixed my cfg, re-compiled, and all is good.

SLURM on HPC help with submitting jobs 3 years 10 months ago #37647

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
I might have spoken too soon. My run starts fine and I see all the partitioning done properly, but then the run "finishes" immediately, like below:
Submitted batch job 157451
... Your simulation (t2d_PLNG_S127_Q400_wsetup.cas) has been launched through the queue.
 
   +> You need to wait for completion before re-collecting files using the option --merge
 
My work is done

I check the queue (qs) and nothing is running. Did you run into this issue, Yunhao?

SLURM on HPC help with submitting jobs 3 years 10 months ago #37648

  • Yunhao Song
  • OFFLINE
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
The output you listed is normal for SLURM: 'My work is done' means your job has been successfully submitted to the queue. Your model will start running only once there are enough idle CPUs for your request. You can then use a command like 'sacct' or 'squeue' to check the state of your job. Did you check whether a file named 'slurm-157451.out' or 't2d_PLNG_S127_Q400_wsetup.cas.sortie' was generated in your running directory? The TELEMAC output listing will be there whether your job is running, failed, or completed.
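As a side note, the job ID that sbatch prints at submission can be captured and fed straight to those commands; a small sketch (the sample line is from Onur's log, the sacct/squeue flags are standard SLURM):

```shell
# Parse the job ID out of sbatch's one-line confirmation message.
msg="Submitted batch job 157451"         # example line from the log above
jobid=$(echo "$msg" | awk '{print $4}')  # the fourth field is the numeric ID
echo "$jobid"                            # -> 157451
# The job can then be queried with, e.g.:
#   sacct -j "$jobid" --format=JobID,State,Elapsed
#   squeue -j "$jobid"
```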

Cheers,
Yunhao

SLURM on HPC help with submitting jobs 3 years 10 months ago #37652

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
Hi Yunhao,

I just checked the slurm-157451.out right now:
+> module: ad / api / artemis / bief
               damocles  / gaia  / gretel  / hermes
               identify_liq_bnd  / khione  / mascaret  / nestor
               parallel  / partel  / postel3d  / sisyphe
               special  / stbtel  / telemac2d  / telemac3d
               tomawac / waqtel


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



My work is done


Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136):
MPID_Init(709).......:
MPIR_pmi_init(143)...: PMI2_Job_GetId returned 14
[... the same error stack repeated for each MPI rank ...]


Does this mean anything to you?

SLURM on HPC help with submitting jobs 3 years 10 months ago #37653

  • Yunhao Song
  • OFFLINE
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
Hi Onur,

It's highly recommended to run the MPI test before running your cases, as I mentioned in #37632, especially when you have just finished compiling a new version of TELEMAC. The error may come from the MPI settings in your configuration file: if you are using OpenMPI, replacing mpiifort in 'cmd_obj' and 'cmd_exe' with mpif90 should solve the problem. Hope this helps!
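In other words, the compiler wrapper named in the cfg has to match the MPI stack actually loaded; schematically (a fragment only, compiler flags elided, the remaining cfg keys unchanged):

```
# Intel MPI toolchain (needs the intelmpi module loaded):
cmd_obj: mpiifort -c [...]
cmd_exe: mpiifort [...]
# OpenMPI (with whichever Fortran compiler sits underneath):
cmd_obj: mpif90 -c [...]
cmd_exe: mpif90 [...]
```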

Yunhao

SLURM on HPC help with submitting jobs 3 years 10 months ago #37664

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
Hi again Yunhao,

I think the Intel MPI itself works fine:
[okurum@login1 work]$ mpiifort hello_world.f90 -o hello_world
[okurum@login1 work]$ mpiexec -n 4 ./hello_world
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
 node           0 : Hello world
 node           3 : Hello world
 node           2 : Hello world
 node           1 : Hello world

I looked up the error I was getting, i.e. "Fatal error in PMPI_Init: Other MPI error, error stack".

According to Intel's website, the error seems to be caused by not running on the correct nodes (the bold text below).
I have forwarded this problem to the cluster's support and will let you know what they come up with.
I am happy to try any other ideas you or anybody else have.

thanks for all your help
Cheers
Onur
Error Message: Fatal Error
Case 1
Error Message

Abort(1094543) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(653)......:
MPID_Init(860).............:
MPIDI_NM_mpi_init_hook(698): OFI addrinfo() failed
(ofi_init.h:698:MPIDI_NM_mpi_init_hook:No data available)

Cause
The current provider cannot be run on these nodes. The MPI application is run over the psm2 provider on the non-Intel® Omni-Path card or over the verbs provider on the non-InfiniBand*, non-iWARP, or non-RoCE card.

Solution
[b]Change the provider or run the MPI application on the right nodes.[/b] Use fi_info to get information about the current provider.

SLURM on HPC help with submitting jobs 3 years 10 months ago #37665

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
I should have added this: mpiexec works but srun doesn't, and it's the same error I get from TELEMAC.

srun -n 4 ./hello_world
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:

SLURM on HPC help with submitting jobs 3 years 10 months ago #37668

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
I found this on Stack Overflow:
"If you want to use srun with Intel MPI, an extra step is required. You first need to
export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so"

So I did that, and it got rid of the "Fatal error in PMPI_Init: Other MPI error, error stack":
srun -n 4 ./hello_world
 node           2 : Hello world
 node           1 : Hello world
 node           3 : Hello world
 node           0 : Hello world

So I added the line below to my config
export I_MPI_PMI_LIBRARY=/opt/software/slurm-20.11.0/lib/libpmi.so
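A quick sanity check that the exported path actually exists can save a debugging round; a generic sketch (the SLURM install path is the one quoted above and is cluster-specific):

```shell
# Hypothetical pre-flight check for the PMI library path (cluster-specific).
export I_MPI_PMI_LIBRARY=/opt/software/slurm-20.11.0/lib/libpmi.so
if [ ! -f "$I_MPI_PMI_LIBRARY" ]; then
    echo "warning: $I_MPI_PMI_LIBRARY not found; Intel MPI may abort in PMPI_Init under srun" >&2
fi
```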

You would think by now the issue would be solved...

Here is the final HPC_STDIN:
#!/bin/bash
#SBATCH --ntasks=28
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=4000M
#SBATCH --time=01:00:00
#SBATCH -o OK-%j.out               # Write job output
#SBATCH -e OK-%j.err               # Write job error
module restore telemacv8p2_modules
source /home/okurum/v8p1/configs/pysource.SikuAceNet.sh
config.py
export I_MPI_PMI_LIBRARY=/opt/software/slurm-20.11.0/lib/libpmi.so
srun -n 28 /home/okurum/test_v8p1/t2d_test.cas_2021-01-19-06h07min15s/out_user_fortran
exit

sq shows that the run is going, but after the initial partitioning nothing gets updated in /home/okurum/test_v8p1/t2d_test.cas_2021-01-19-06h07min15s

Checking the logs, it seems the model hangs here. My TPXO data is there and the path to it is correct in my cas file, so I am not sure why it hangs like this.




*************************************
* END OF MEMORY ORGANIZATION: *
*************************************

INITIALIZING TELEMAC2D FOR
INBIEF (BIEF): NOT A VECTOR MACHINE (ACCORDING TO YOUR DATA)
FONSTR : FRICTION COEFFICIENTS READ IN THE
GEOMETRY FILE
STRCHE (BIEF): NO MODIFICATION OF FRICTION


NUMBER OF LIQUID BOUNDARIES: 2

CORFON (TELEMAC2D): NO MODIFICATION OF BOTTOM

INITIALISATION BASED ON TPXO: