TOPIC: V8P1R0 parallel issue

V8P1R0 parallel issue 4 years 9 months ago #35297

  • Yunhao Song
  • Senior Boarder
  • Posts: 118
  • Thank you received: 9
Hi Yoann,

I tried your suggestions, and with --ncsize=4 the example started to run but was extremely slow: it took more than 2 hours to finish, whereas in sequential mode it needs only 5 seconds. Please refer to the attached HPC_STDIN. I then switched to Python 2.7 when compiling v8p1r0 and it worked well, so I'll stick with it for the moment.

Best,
Yunhao
Attachments:
The administrator has disabled public write access.

V8P1R0 parallel issue 4 years 9 months ago #35298

  • yugi
  • openTELEMAC Guru
  • Posts: 851
  • Thank you received: 244
This seems weird.
Is your HPC_STDIN the same with Python 2?
There are 10 types of people in the world: those who understand binary, and those who don't.

V8P1R0 parallel issue 4 years 9 months ago #35299

  • Yunhao Song
Almost the same; with Python 2.7, srun was used. Please check the file below.
Attachments:

V8P1R0 parallel issue 4 years 9 months ago #35300

  • yugi
OK, I think the issue could come from Anaconda, as it is the only difference between the two.
I have sometimes had problems with it because Anaconda ships a lot of pre-compiled libraries that can cause conflicts.
The following user(s) said Thank You: Yunhao Song

V8P1R0 parallel issue 3 years 11 months ago #37362

  • Yunhao Song
Dear Yugi,

I am taking the liberty of continuing this thread because the parallel problem is still not solved with the latest v8p2 on our HPC. Since you mentioned in previous posts that it could be caused by Anaconda, should I give up using Anaconda 3.7 and install Python 3 and NumPy on the HPC myself?

For reference, before this update we had been running parallel models using v8p0 with Anaconda 2.7 and Intel Parallel Studio, and everything went well.

Any suggestion is appreciated.
Yunhao

V8P1R0 parallel issue 3 years 11 months ago #37363

  • yugi
If you are not using the API, there are fewer issues with Anaconda.

Internally we use a local installation of Python and use pip to install additional packages.
This does the job, as you can specify at installation time which compiler and libraries to use.

You quickly run into issues when combining TELEMAC and Anaconda because Anaconda will use its own Fortran compiler and its own MPI. We also have issues with HDF5, as Anaconda uses its own build.

So yes I would recommend installing it yourself.
The following user(s) said Thank You: Yunhao Song
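
One way to see whether the interpreter that launches TELEMAC is an Anaconda one, and whether the MPI launcher found first on PATH lives inside that same installation, is a short check like the one below. This is only a minimal sketch and not part of TELEMAC: the helper names are hypothetical, it relies on conda installations and environments containing a conda-meta directory, and it assumes the launcher is called mpirun, mpiexec or srun.

```python
import os
import shutil
import sys


def is_conda_prefix(prefix):
    """Return True if `prefix` looks like a conda installation or environment."""
    # conda keeps its package records in <prefix>/conda-meta
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


def is_inside(path, prefix):
    """Return True if `path` is located under the directory `prefix`."""
    path = os.path.realpath(path)
    prefix = os.path.realpath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


def report(prefix=sys.prefix, launchers=("mpirun", "mpiexec", "srun")):
    """Build a list of human-readable findings about the current setup."""
    lines = ["python prefix: %s (conda: %s)" % (prefix, is_conda_prefix(prefix))]
    for name in launchers:
        found = shutil.which(name)
        if found is None:
            lines.append("%s: not found on PATH" % name)
        elif is_inside(found, prefix):
            lines.append("%s: %s (inside the Python prefix)" % (name, found))
        else:
            lines.append("%s: %s" % (name, found))
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

If a launcher is reported as being inside the Python prefix, the MPI picked up at run time may not be the one TELEMAC was compiled against.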

V8P1R0 parallel issue 3 years 11 months ago #37401

  • Yunhao Song
Hi Yugi,

To check whether the problem in my case was caused by Anaconda, I installed Python 3.9.1 locally and used pip3.9 to install NumPy 1.19.4, but the error is still there:
srun: error: Unable to create step for job 14383883: More processors requested than permitted

The command used to run the example case is
python3 $HOMETEL/scripts/python3/telemac3d.py t3d_waq3d_aed2.cas --ncsize=4

The maximum number of CPUs allocated to my account is 16, so this is really weird. Please find the STDIN, pysource, config, bash_profile and partel log files in the attachment. At this point I really don't know where to start.

V8P1R0 parallel issue 3 years 10 months ago #37497

  • yugi
Hi,

Could you post the run log?
Also, can you try running a simple program with the same HPC_STDIN? Something like a simple hello world.
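
As a first check that does not even need MPI, a script along these lines can be launched through the same HPC_STDIN in place of the TELEMAC executable. It is only a sketch: it reads the SLURM_PROCID, SLURM_NTASKS and SLURM_JOB_ID environment variables that Slurm sets for tasks started by srun, and the function name is hypothetical.

```python
import os
import socket


def describe_task(env=None):
    """Return a one-line description of the Slurm task running this process."""
    env = os.environ if env is None else env
    # These variables are set by Slurm; they are absent outside a Slurm job.
    rank = env.get("SLURM_PROCID", "unset")
    ntasks = env.get("SLURM_NTASKS", "unset")
    job_id = env.get("SLURM_JOB_ID", "unset")
    host = socket.gethostname()
    return "hello from task %s of %s (job %s) on %s" % (rank, ntasks, job_id, host)


if __name__ == "__main__":
    print(describe_task())
```

With 4 tasks one would expect four lines with task ids 0 to 3. The same id on every line may indicate that the processes were not started as tasks of a single job step.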

V8P1R0 parallel issue 3 years 10 months ago #37516

  • Yunhao Song
Hi yugi,

By run log, do you mean the sortie file? The slurm-[jobID].out file contains nothing but:
srun: error: Unable to create step for job 14383883: More processors requested than permitted

I ran the MPI hello world test as you suggested, and the result also looks suspicious: every process reports the same node id, 0, instead of (0;1;2;3), as shown in the snapshot.

To bypass this problem I switched to Python27+v8p2r0, and the TELEMAC example finished in parallel with no error. The second picture compares the Slurm job information for Python27+v8p2r0 and Python3+v8p2r0. Notably, Python3+v8p2r0 failed to pass the value of [ncsize] to the job submission system, so only one CPU was allocated, which triggered the error:
More processors (ncsize=4) requested than permitted (Allocated CPU=1)
I think this is where the problem comes from. Could you shed more light on how to solve it?

Many thanks,
Yunhao
Attachments:
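
The mismatch described above can be checked directly in the generated HPC_STDIN before it is submitted. The sketch below assumes the file requests tasks with a standard Slurm `#SBATCH --ntasks=N` (or `-n N`) directive, treats a missing directive as a single task (Slurm's usual default), and compares that with the value given to --ncsize. The function names are hypothetical and this is not part of the TELEMAC scripts.

```python
import re

# Matches "#SBATCH --ntasks=4", "#SBATCH --ntasks 4" and "#SBATCH -n 4"
_NTASKS = re.compile(r"^#SBATCH\s+(?:--ntasks(?:=|\s+)|-n\s*)(\d+)\s*$", re.MULTILINE)


def requested_tasks(hpc_stdin_text):
    """Return the task count requested in a batch script, or None if absent."""
    match = _NTASKS.search(hpc_stdin_text)
    return int(match.group(1)) if match else None


def check_ncsize(hpc_stdin_text, ncsize):
    """Return a short message comparing --ncsize with the batch request."""
    tasks = requested_tasks(hpc_stdin_text)
    if tasks is None:
        # Assumption: without an explicit request Slurm allocates one task.
        tasks = 1
    if ncsize > tasks:
        return "mismatch: ncsize=%d but only %d task(s) requested" % (ncsize, tasks)
    return "ok: ncsize=%d fits in %d task(s)" % (ncsize, tasks)
```

For example, `check_ncsize(open("HPC_STDIN").read(), 4)` reports a mismatch when the batch file asks for fewer than 4 tasks, which is the situation that makes srun refuse to create the job step.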

V8P1R0 parallel issue 4 years 9 months ago #35302

  • Chen
yugi wrote:
And what command did you type ?

I used the command: telemac2d.py -s CASFILE.cas