
TOPIC: Running parallel mode using Linux Cluster

Running parallel mode using Linux Cluster 6 years 2 months ago #31418

  • MohdAlaa
Hi,

I'm trying to run TELEMAC in parallel mode on the cluster, but the run fails.

I have attached the config file and the error report.

Any advice?

I'm using qsub:

#$ -S /bin/bash
#$ -pe mpi 20
#$ -l intel=true
#$ -j y
#$ -o /home/ma211/OWHS/
#$ -N telemac2D20
#$ -V

###module load mpi
module add mpi

###module load anaconda
source activate mypython
source /home/ma211/cfg/pysource.sh

#echo Start: 'date'
echo "Start: " $(date)

cd /home/ma211/OWHS/
#telemac2d.py OWHS.cas --ncsize=20
telemac2d.py OWHS.cas --ncsize=20

echo "End: " $(date)
#echo "End: " 'date'

source deactivate mypython


Thanks,
Mohammed

File Attachment:

File Name: systel.gfort.cfg
File Size: 2 KB

Running parallel mode using Linux Cluster 6 years 2 months ago #31421

  • MohdAlaa
I couldn't upload the error report, so I will copy it here:

Running your simulation(s) :
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



mpiexec -machinefile MPI_HOSTFILE -n 20 /home/ma211/OWHS/OWHS.cas_2018-09-11-00h07min44s/out_telemac2d


There are not enough slots available in the system to satisfy the 20 slots
that were requested by the application:
/home/ma211/OWHS/OWHS.cas_2018-09-11-00h07min44s/out_telemac2d

Either request fewer slots for your application, or make more slots available
for use.
_____________
runcode::main:
:
|runCode: Fail to run
|mpiexec -machinefile MPI_HOSTFILE -n 20 /home/ma211/OWHS/OWHS.cas_2018-09-11-00h07min44s/out_telemac2d
|~~~~~~~~~~~~~~~~~~
|
|~~~~~~~~~~~~~~~~~~
End: Tue 11 Sep 00:08:13 BST 2018

Running parallel mode using Linux Cluster 6 years 2 months ago #31426

  • yugi
Hi,

The error seems to be in your qsub script.
Did you try running a simple MPI hello world program?
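
A quick variant of that check, without compiling anything, is to launch a plain command such as hostname through the same launcher from inside the job script. This is only a sketch: it exercises the launcher and the slot allocation, not the MPI library itself, and the helper name launch_check is made up for illustration. NSLOTS is the slot count that Grid Engine sets for a parallel-environment job.

```shell
#!/bin/bash
# Launch-level sanity check (a sketch, not part of TELEMAC).
# launch_check is a hypothetical helper name.
# Usage: launch_check <nprocs> <launcher> [launcher options...]
launch_check() {
    local nprocs=$1
    shift
    local out got
    # Start `hostname` once per requested process through the launcher.
    out=$("$@" -n "$nprocs" hostname) || return 1
    # One non-empty output line is expected per started process.
    got=$(printf '%s\n' "$out" | grep -c .)
    echo "requested=$nprocs started=$got"
    [ "$got" -eq "$nprocs" ]
}

# Only run the check inside a Grid Engine job where mpiexec is available.
if command -v mpiexec >/dev/null 2>&1 && [ -n "${NSLOTS:-}" ]; then
    launch_check "$NSLOTS" mpiexec
fi
```

If this already fails with the "not enough slots" message, the problem is in the launch setup rather than in TELEMAC.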

Maybe you can try replacing the value of "mpi_cmdexec" in your configuration file with just "mpiexec -n <ncsize> <exename>".
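
The original failure is consistent with the machine file not listing the 20 slots that the scheduler granted, although nothing in the error report confirms this. On Grid Engine clusters the granted hosts are listed in the file named by PE_HOSTFILE, one line per host with the slot count in the second column. A minimal sketch of summing those slots and rewriting them in Open MPI's "host slots=N" hostfile form follows. The function names and the output file name are hypothetical, and whether a machine file should be built this way on a given cluster is an assumption to check with its administrators.

```shell
#!/bin/bash
# Sketch: inspect the Grid Engine host list (helper names are hypothetical).

# Sum the slot counts (2nd column) of a PE_HOSTFILE-style file.
total_slots() {
    awk '{ n += $2 } END { print n + 0 }' "$1"
}

# Rewrite "host nslots queue range" lines as "host slots=nslots".
pe_to_hostfile() {
    awk 'NF >= 2 { print $1 " slots=" $2 }' "$1"
}

# Only meaningful inside a Grid Engine parallel-environment job.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "Slots granted: $(total_slots "$PE_HOSTFILE")"
    # machinefile.openmpi is an illustrative file name.
    pe_to_hostfile "$PE_HOSTFILE" > machinefile.openmpi
fi
```

Comparing the reported slot total with the value passed to --ncsize shows quickly whether the job asked for more processes than it was granted.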

By the way, you can also include your qsub file within the configuration file.

Have a look at systel.edf.cfg (config athos.intel) and systel.cis-hydra.cfg.
There are 10 types of people in the world: those who understand binary, and those who don't.

Running parallel mode using Linux Cluster 6 years 2 months ago #31433

  • MohdAlaa
Thanks Yugi,

The job script should be OK. This is the university cluster, and they use this script for all other job submissions.

We changed the config to: mpi_cmdexec: mpirun --prefix /usr/lib64/openmpi -np <ncsize> <exename>

Now it is working, but I got these messages:

Running your simulation(s) :
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



mpirun --prefix /usr/lib64/openmpi -np 48 /home/ma211/Orkney_450m/OW36_450m.cas_2018-09-11-17h38min15s/out_telemac2d


node20.22502hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22507hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node19.20595hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22503hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22504hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25920hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22509hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22511hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node19.20597hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22512hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node19.20598hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25918hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node19.20596hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6557hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22518hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25921hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25922hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25923hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6565hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22516hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22520hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22514hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6559hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22521hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6564hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6567hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22523hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6562hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25925hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node19.20599hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node19.20600hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node19.20601hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6558hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6568hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22525hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25924hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6570hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25926hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6575hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25928hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6574hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node19.20603hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6580hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22524hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node20.22527hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25933hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node24.25930hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node22.6577hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
MASTER PROCESSOR NUMBER 0 OF THE GROUP OF 48
EXECUTABLE FILE: /home/ma211/Orkney_450m/OW36_450m.cas_2018-09-11-17h38min15s/A.EXE

Nevertheless, the run used 48 cores and completed the simulation normally. What do these messages mean?

Thanks,

Mohammed

Running parallel mode using Linux Cluster 6 years 2 months ago #31440

  • yugi
This seems to be a Red Hat bug.
It is explained here:
bugzilla.redhat.com/show_bug.cgi?id=1408316

The solution seems to be to update the libfabric library.
You can pass that information on to the IT service of the cluster.

At least your computation is working.
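
The device named in the messages, /dev/hfi1_0, belongs to an Intel Omni-Path adapter, so each rank apparently waits for hardware that the node may not have and then carries on over another transport. Until the library is updated, a check like the one below can tell whether a node has the device at all. It is a sketch: has_hfi_device is a hypothetical helper, and the "--mca pml ob1" option it prints (which asks Open MPI to use its ob1 point-to-point layer) is a commonly used workaround that is not part of the bug report above and should be verified on the cluster before relying on it.

```shell
#!/bin/bash
# Sketch: check for the Omni-Path device named in the warnings.
# has_hfi_device is a hypothetical helper; the path argument is optional.
has_hfi_device() {
    local dev=${1:-/dev/hfi1_0}
    [ -e "$dev" ]
}

if has_hfi_device; then
    echo "Omni-Path device present on $(uname -n)"
else
    echo "No /dev/hfi1_0 on $(uname -n); the probe is expected to time out here"
    # Assumption: untested on this cluster, printed only as a suggestion.
    echo "Possible workaround to test: mpirun --mca pml ob1 -np <ncsize> <exename>"
fi
```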