
TOPIC: SLURM on HPC help with submitting jobs

SLURM on HPC help with submitting jobs 3 years 10 months ago #37613

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
Hi all,

I am trying to get my first ever TELEMAC (v8p1) run going on an HPC with SLURM. I am something of a rookie with compiling and HPCs, but I have learned a lot thanks to this forum.

I used systel.edf.cfg as an example and edited it to compile with the Intel compilers [siku.intel.dyn]. My cfg file is attached.

Compilation went fine, but I am having trouble submitting a job.

From the following lines in the cfg file
hpc_runcode_edf: bash <root>/scripts/submit_slurm.sh <id_log>
par_cmd_exec_edf: srun -n 1 -N 1 <config>/partel < <partel.par> >> <partel.log>

I created a submit_slurm.sh in <root>/scripts/ (not in <root>/scripts/python3/ - maybe this is the problem?) which is also attached.

Then I created another sh file in my run directory, TMAC_sbatch.sh also attached.
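For anyone following along: at this spot a submit_slurm.sh wrapper typically just hands the generated HPC_STDIN job script to sbatch. A minimal sketch, not runnable outside a SLURM cluster, and with the log handling being an assumption rather than the attached script:

```bash
#!/bin/bash
# Sketch of <root>/scripts/submit_slurm.sh. TELEMAC invokes it as
#   bash submit_slurm.sh <id_log>
# from the temporary run directory, where the generated HPC_STDIN sits.
id_log=$1                          # log-file name passed in via <id_log>
sbatch HPC_STDIN > "$id_log" 2>&1  # submit the generated job script
cat "$id_log"                      # echo "Submitted batch job NNN" back to the caller
```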

I guess the main question is: how do I properly start a run now? Sorry if this is very basic; I am doing my homework as much as possible, searching the forum and online, but I could not solve it yet. I did not share all the errors I get since I am not sure which ones are meaningful, but I suppose the main one is this:
File "/home/user/v8p1/scripts/python3/execution/run.py", line 182, in run_code
    raise TelemacException('Fail to run\n'+exe)
utils.exceptions.TelemacException: Fail to run
bash /home/user/v8p1/scripts/submit_slurm.sh <id_log>


Cheers
Onur
The administrator has disabled public write access.

SLURM on HPC help with submitting jobs 3 years 10 months ago #37632

  • Yunhao Song
  • OFFLINE
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
Hi Onur,

First, maybe you should try the latest version (v8p2r0), which is what the forum always recommends.

Since you finished the compilation, did you successfully run your test case in sequential mode? If you did, please try an MPI test before launching your case on multiple cores, to make sure the parallel configuration works properly.

I'm also using an HPC cluster with SLURM, so in my limited experience the error probably comes from your .cfg file. I have attached my .cfg file and sbatch file for your reference; I hope it helps.
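Such an MPI test can be as small as a Fortran hello-world along these lines (a generic sketch, to be compiled with the same MPI wrapper as in the cfg):

```fortran
! Minimal MPI sanity check (hello_world.f90): each rank prints its number.
program hello_world
  use mpi
  implicit none
  integer :: ierr, rank
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  print *, 'node', rank, ': Hello world'
  call MPI_Finalize(ierr)
end program hello_world
```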

Regards,
Yunhao
Attachments:
The following user(s) said Thank You: Kurum

SLURM on HPC help with submitting jobs 3 years 10 months ago #37645

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
Hi Yunhao,

Thank you so much for sharing your cfg file. Indeed, I found my mistake: I forgot to load the Intel MPI module. I realized that when I saw the line:
cmd_obj: mpiifort ....

I was basically compiling with Intel Fortran + OpenMPI.

I fixed my cfg, re-compiled, and all is good.

SLURM on HPC help with submitting jobs 3 years 10 months ago #37647

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
I might have spoken too soon. My run starts fine and I see all the partitioning done properly, but then the run "finishes" immediately, like below:
Submitted batch job 157451
... Your simulation (t2d_PLNG_S127_Q400_wsetup.cas) has been launched through the queue.
 
   +> You need to wait for completion before re-collecting files using the option --merge
 
My work is done

I check the queue (qs) and nothing is running. Did you run into this issue, Yunhao?

SLURM on HPC help with submitting jobs 3 years 10 months ago #37648

  • Yunhao Song
  • OFFLINE
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
The output you listed is normal for SLURM: 'My work is done' means your job has been successfully submitted to the queue. Your model will start running only once there are enough idle CPUs for your request. You can then use a command like 'sacct' or 'squeue' to check the state of your job. Did you check whether a file named 'slurm-157451.out' or 't2d_PLNG_S127_Q400_wsetup.cas.sortie' was generated in your running directory? The TELEMAC output listing will be there whether your job is running, failed, or completed.
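As a side note, the job ID that sbatch prints at submission can be captured and fed straight to those commands; a small sketch (the sample line is from Onur's log, the sacct/squeue flags are standard SLURM):

```shell
# Parse the job ID out of sbatch's one-line confirmation message.
msg="Submitted batch job 157451"         # example line from the log above
jobid=$(echo "$msg" | awk '{print $4}')  # the fourth field is the numeric ID
echo "$jobid"                            # -> 157451
# The job can then be queried with, e.g.:
#   sacct -j "$jobid" --format=JobID,State,Elapsed
#   squeue -j "$jobid"
```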

Cheers,
Yunhao

SLURM on HPC help with submitting jobs 3 years 10 months ago #37652

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
Hi Yunhao,

I just checked the slurm-157451.out right now:
+> module: ad / api / artemis / bief
               damocles  / gaia  / gretel  / hermes
               identify_liq_bnd  / khione  / mascaret  / nestor
               parallel  / partel  / postel3d  / sisyphe
               special  / stbtel  / telemac2d  / telemac3d
               tomawac / waqtel


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



My work is done


Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136):
MPID_Init(709).......:
MPIR_pmi_init(143)...: PMI2_Job_GetId returned 14
[... the same error stack repeated for each MPI rank ...]


Does this mean anything to you?

SLURM on HPC help with submitting jobs 3 years 10 months ago #37653

  • Yunhao Song
  • OFFLINE
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
Hi Onur,

It's highly recommended to run the MPI test before running your cases, as I mentioned in #37632, especially when you have just finished compiling a new version of TELEMAC. The error may come from the MPI settings in your configuration file: if you are using OpenMPI, replacing mpiifort in 'cmd_obj' and 'cmd_exe' with mpif90 should solve the problem. Hope this helps!
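In other words, the compiler wrapper named in the cfg has to match the MPI stack actually loaded; schematically (a fragment only, compiler flags elided, the remaining cfg keys unchanged):

```
# Intel MPI toolchain (needs the intelmpi module loaded):
cmd_obj: mpiifort -c [...]
cmd_exe: mpiifort [...]
# OpenMPI (with whichever Fortran compiler sits underneath):
cmd_obj: mpif90 -c [...]
cmd_exe: mpif90 [...]
```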

Yunhao

SLURM on HPC help with submitting jobs 3 years 10 months ago #37664

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
Hi again Yunhao,

I think the Intel MPI itself works fine:
[okurum@login1 work]$ mpiifort hello_world.f90 -o hello_world
[okurum@login1 work]$ mpiexec -n 4 ./hello_world
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
 node           0 : Hello world
 node           3 : Hello world
 node           2 : Hello world
 node           1 : Hello world

I looked up the error I was getting, i.e. "Fatal error in PMPI_Init: Other MPI error, error stack".

According to Intel's website, the error seems to be caused by not running on the correct nodes (the bold text below).
I have forwarded this problem to the cluster's support and will let you know what they come up with.
I am happy to try any other ideas you or anybody else have.

thanks for all your help
Cheers
Onur
Error Message: Fatal Error
Case 1
Error Message

Abort(1094543) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(653)......:
MPID_Init(860).............:
MPIDI_NM_mpi_init_hook(698): OFI addrinfo() failed
(ofi_init.h:698:MPIDI_NM_mpi_init_hook:No data available)

Cause
The current provider cannot be run on these nodes. The MPI application is run over the psm2 provider on the non-Intel® Omni-Path card or over the verbs provider on the non-InfiniBand*, non-iWARP, or non-RoCE card.

Solution
[b]Change the provider or run the MPI application on the right nodes.[/b] Use fi_info to get information about the current provider.

SLURM on HPC help with submitting jobs 3 years 10 months ago #37665

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
I should have added this: mpiexec works but srun doesn't, and it's the same error I get from TELEMAC.

srun -n 4 ./hello_world
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:

SLURM on HPC help with submitting jobs 3 years 10 months ago #37668

  • Kurum
  • OFFLINE
  • Fresh Boarder
  • Posts: 23
  • Thank you received: 2
I found this on Stack Overflow:
"If you want to use srun with Intel MPI, an extra step is required. You first need to
export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so"

So I did that, and it got rid of the "Fatal error in PMPI_Init: Other MPI error, error stack":
srun -n 4 ./hello_world
 node           2 : Hello world
 node           1 : Hello world
 node           3 : Hello world
 node           0 : Hello world

So I added the line below to my config
export I_MPI_PMI_LIBRARY=/opt/software/slurm-20.11.0/lib/libpmi.so
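A quick sanity check that the exported path actually exists can save a debugging round; a generic sketch (the SLURM install path is the one quoted above and is cluster-specific):

```shell
# Hypothetical pre-flight check for the PMI library path (cluster-specific).
export I_MPI_PMI_LIBRARY=/opt/software/slurm-20.11.0/lib/libpmi.so
if [ ! -f "$I_MPI_PMI_LIBRARY" ]; then
    echo "warning: $I_MPI_PMI_LIBRARY not found; Intel MPI may abort in PMPI_Init under srun" >&2
fi
```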

You would think by now the issue would be solved...

Here is the final HPC_STDIN:
#!/bin/bash
#SBATCH --ntasks=28
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=4000M
#SBATCH --time=01:00:00
#SBATCH -o OK-%j.out               # Write job output
#SBATCH -e OK-%j.err               # Write job error
module restore telemacv8p2_modules
source /home/okurum/v8p1/configs/pysource.SikuAceNet.sh
config.py
export I_MPI_PMI_LIBRARY=/opt/software/slurm-20.11.0/lib/libpmi.so
srun -n 28 /home/okurum/test_v8p1/t2d_test.cas_2021-01-19-06h07min15s/out_user_fortran
exit

sq shows that the run is going, but after the initial partitioning nothing gets updated in /home/okurum/test_v8p1/t2d_test.cas_2021-01-19-06h07min15s

Checking the logs, it seems the model hangs here. My TPXO data is there and the path to it is correct in my cas file, so I am not sure why it hangs like this.




*************************************
* END OF MEMORY ORGANIZATION: *
*************************************

INITIALIZING TELEMAC2D FOR
INBIEF (BIEF): NOT A VECTOR MACHINE (ACCORDING TO YOUR DATA)
FONSTR : FRICTION COEFFICIENTS READ IN THE
GEOMETRY FILE
STRCHE (BIEF): NO MODIFICATION OF FRICTION


NUMBER OF LIQUID BOUNDARIES: 2

CORFON (TELEMAC2D): NO MODIFICATION OF BOTTOM

INITIALISATION BASED ON TPXO: