
TOPIC: V8P1R0 parallel issue

V8P1R0 parallel issue 3 years 10 months ago #37517

  • yugi
  • openTELEMAC Guru
  • Posts: 851
  • Thank you received: 244
From the hello world example, it looks like on your cluster srun -n 4 is not equivalent to mpirun -n 4 but only launches hello world 4 times.

You should not be using srun for mpi_cmdexec.
You should be using mpirun, or whatever command is used to submit an MPI job on your cluster.

As for why it is working with python27, it could be the same thing.

Hope it helps.
There are 10 types of people in the world: those who understand binary, and those who don't.
The administrator has disabled public write access.

V8P1R0 parallel issue 3 years 10 months ago #37521

  • Yunhao Song
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
On my cluster, srun together with an sbatch job script is the recommended way to launch MPI programs.

I tried the mpirun command for mpi_cmdexec and got the attached error, which points to the same mismatched ncsize issue but lists more information.

With mpiexec, a segmentation fault appeared instead; please see the second error log...
Attachments:

V8P1R0 parallel issue 3 years 10 months ago #37525

  • yugi
  • openTELEMAC Guru
  • Posts: 851
  • Thank you received: 244
I think you should try getting the hello world example to work first.

Maybe the command name is not mpirun. Can you check that mpirun is indeed the one associated with the mpiifort you are using by running:
which mpiifort
which mpirun
which mpiexec
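
The same check can be scripted. The sketch below is only an illustration: the helper names locate_tools and same_install are made up here, and "all three tools live in the same directory" is a rough heuristic for "they belong to the same MPI installation", not a guarantee.

```python
import os
import shutil


def locate_tools(names=("mpiifort", "mpirun", "mpiexec")):
    """Return {tool name: path found on PATH, or None} (same idea as `which`)."""
    return {name: shutil.which(name) for name in names}


def same_install(paths):
    """True if every tool was found and all of them sit in one directory.

    Assumption: tools shipped by one MPI installation share a bin directory.
    """
    if not paths or any(p is None for p in paths.values()):
        return False
    return len({os.path.dirname(p) for p in paths.values()}) == 1


if __name__ == "__main__":
    tools = locate_tools()
    for name, path in tools.items():
        print(f"{name}: {path or 'not found'}")
    print("same directory:", same_install(tools))
```

If the paths point to different directories, the compiler wrapper and the launcher probably come from different MPI modules.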

Are you using the same version of MPI for the python27 config?

V8P1R0 parallel issue 3 years 10 months ago #37527

  • Yunhao Song
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
The MPI version I was using was intel/2019. After switching to the same version (2017) as used for the python27 config, the hello_world example worked with srun, as shown in Fig. 1.

However, the 'More processors requested than permitted' error still appeared when running the t3d example, which is the same situation I posted in #37516. It seems that, in my case with python3, the right ncsize is not passed to the job submission system...
Attachments:

V8P1R0 parallel issue 3 years 10 months ago #37528

  • yugi
  • openTELEMAC Guru
  • Posts: 851
  • Thank you received: 244
Could you post the HPC_STDIN as well as the command you use to run?

Also, can you try adding --ncsize=4 --nctile=4 to your command?

V8P1R0 parallel issue 3 years 10 months ago #37529

  • Yunhao Song
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
Please find attached the generated HPC_STDIN and the sbatch file used to run the example.

I talked to the expert for my cluster, and he thinks the problem is that the lines below are missing from the generated HPC_STDIN.
#SBATCH -p hpxg
#SBATCH -n <ncsize>

In the HPC_STDIN generated by the previous version with the python27 config, these two lines are present; please see HPC_STDIN(py27).txt
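
A quick way to confirm this kind of diagnosis is to scan the generated script for the directives the cluster expects. The sketch below is only an illustration: missing_directives is a made-up helper, and the sample script text is invented for the example, not the actual HPC_STDIN from this thread.

```python
def missing_directives(script_text, required_prefixes):
    """Return the required prefixes that no line of the script starts with."""
    lines = [line.strip() for line in script_text.splitlines()]
    return [
        prefix
        for prefix in required_prefixes
        if not any(line.startswith(prefix) for line in lines)
    ]


# Invented sample: a job script whose #SBATCH lines were lost.
generated = """#!/bin/bash
<py_runcode>
"""

# The two directives named by the cluster expert.
required = ["#SBATCH -p", "#SBATCH -n"]

print("missing:", missing_directives(generated, required))
```

Running the same check on the python27 and python3 versions of the file shows at a glance which directives were dropped.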

V8P1R0 parallel issue 3 years 10 months ago #37530

  • yugi
  • openTELEMAC Guru
  • Posts: 851
  • Thank you received: 244
OK, I see the issue.
In python3 there is a problem with lines starting with # in the systel.cfg file.
You can have a look at systel.edf.cfg for a workaround (search for sbatch_tag).
Basically, you define a custom keyword that holds the # tag and use it in the definition of hpc_stdin.

That should do the trick
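
The behaviour can be reproduced with Python 3's standard configparser alone. The sketch below is a minimal illustration, not the TELEMAC code: the [demo] section and the final str.replace step are assumptions standing in for whatever the TELEMAC scripts actually do when they expand [sbatch_tag].

```python
import configparser

# A continuation line starting with '#' is treated as a comment and dropped;
# the same directive written through a placeholder keyword survives.
CFG_TEXT = """\
[demo]
sbatch_tag: #SBATCH
hpc_stdin: #!/bin/bash
  #SBATCH -n <ncsize>
  [sbatch_tag] -p hpxg
  <py_runcode>
"""

parser = configparser.RawConfigParser()
parser.read_string(CFG_TEXT)

raw = parser.get("demo", "hpc_stdin")
print("--- value as parsed ---")
print(raw)

# Stand-in for the placeholder expansion (assumption, see the note above).
expanded = raw.replace("[sbatch_tag]", parser.get("demo", "sbatch_tag"))
print("--- after expanding [sbatch_tag] ---")
print(expanded)
```

The '#SBATCH -n <ncsize>' line never reaches the parsed value, while the line written with [sbatch_tag] is kept and becomes a normal directive after expansion. The first line '#!/bin/bash' is safe because it follows the key on the same line.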
The following user(s) said Thank You: Yunhao Song

V8P1R0 parallel issue 3 years 10 months ago #37532

  • Yunhao Song
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
The problem is finally solved!

I should have read systel.edf.cfg carefully, as it already points out the solution:
# Dirty hack as there is a bug withing configparser in py3 that removes lines starting with #

Thank you Yugi, I really appreciate your help!
Yunhao

V8P1R0 parallel issue 3 years 3 months ago #38927

  • captain
  • Junior Boarder
  • Posts: 31
Hi, I have the same problem. I replaced '#PBS' with '<PBS>', but it doesn't seem to work. I would like to know how to modify the #PBS format. Thanks.

V8P1R0 parallel issue 3 years 3 months ago #38928

  • Yunhao Song
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
Hi,

You can try to use the format below in your configuration file to bypass the issue.

sbatch_tag:#PBS
hpc_stdin_edf: #!/bin/bash
  [sbatch_tag] --job-name=<jobname>
  [sbatch_tag] --output=<jobname>-<time>.out
  [sbatch_tag] --time=<walltime>
  [sbatch_tag] --ntasks=<ncsize>
  [sbatch_tag] --partition=<queue>
  [sbatch_tag] --exclusive
  [sbatch_tag] --nodes=<ncnode>
  [sbatch_tag] --ntasks-per-node=<nctile>
  source <root>/configs/pysource.<configName>.sh
  <py_runcode>

Hope this helps.
The following user(s) said Thank You: captain