
TOPIC: Running TELEMAC-2D on hpc cluster

Running TELEMAC-2D on hpc cluster 7 years 8 months ago #25744

  • drslump
Hello,

I'm trying to run TELEMAC-2D on an intra-university HPC cluster running Scientific Linux, with Torque (PBS) for job scheduling. From what I have read, I need to compile the code and submit the job. I'm wondering if it's possible to submit the job simply by writing a job script that points to the .cas file.
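
(For illustration, a hand-written PBS job script of the kind described might look like the sketch below; it assumes telemac2d.py is on the PATH of the compute nodes, a working TELEMAC configuration, and a hypothetical case file model.cas:)

#!/bin/bash
#PBS -N telemac_test              # illustrative job name
#PBS -l nodes=1:ppn=8             # illustrative resources
#PBS -q batch                     # queue name is cluster-specific
cd $PBS_O_WORKDIR                 # run from the directory the job was submitted from
telemac2d.py -s model.cas --ncsize 8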

Any and all help is much appreciated.

Sho

Running TELEMAC-2D on hpc cluster 7 years 8 months ago #25745

  • sebourban
  • OFFLINE
  • Administrator
  • Principal Scientist
  • Posts: 814
  • Thank you received: 219
Hello,

The TELEMAC system can already do that -- have a look at the configuration file systel.cis-hydra.cfg, where 5 options are shown. Let me know if you use something other than qsub (#PBS).
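
(For orientation, the submission-related entries in such a configuration take roughly the shape below, following the style of the configurations quoted later in this thread and assuming qsub:)

hpc_stdin: #!/bin/bash
   #PBS -N <jobname>
   #PBS -l nodes=<nctile>:ppn=<ncnode>
   cd <wdir>
   <py_runcode>
#
hpc_runcode: chmod 755 <hpc_stdin>; qsub <hpc_stdin>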

Hope this helps,
Sébastien.
The following user(s) said Thank You: drslump

Running TELEMAC-2D on hpc cluster 7 years 8 months ago #25746

  • drslump
Dear Sebastien,

Thank you for your help. After looking at the options, I've decided to use HYDRY (Python, HPC queue). I understand that I should replace all the placeholders in <>, but I'm not sure what they stand for.

Is there perhaps a list of descriptions/explanations, like a dictionary for these?

<ncsize>
<exename>
<config>
<PARTEL.PAR>
<partel.log>
<hpc_stdin>
<jobid>
<mods>
<incs>
<f95name>
<objs>
<libs>
brief: parallel mode on the HPC queue, using python script within
   the queue.
   In that case, the file partitioning and assembly are done by
   the python within the HPC queue.
   The only difference with hydra is the call to <py_runcode> within
   the HPC_STDIN instead of <mpi_cmdexec>.
   Note also that hpc_runcode replaces hpc_cmdexec
#
mpi_hosts:    mgmt01
mpi_cmdexec: /apps/openmpi/1.6.5/gcc/4.7.2/bin/mpiexec -wdir <wdir> -n <ncsize> <exename>
#
par_cmdexec:   <config>/partel  < PARTEL.PAR >> <partel.log>
#
hpc_stdin: #!/bin/bash
   #PBS -S /bin/sh
   #PBS -o <sortiefile>
   #PBS -e <exename>.err
   #PBS -N <jobname>
   #PBS -l nodes=<nctile>:ppn=<ncnode>
   #PBS -q highp
   source /etc/profile.d/modules.sh
   module load gcc/4.7.2 openmpi/1.6.5/gcc/4.7.2 python/2.7.2
   PATH=$PATH:$HOME/bin:~/opentelemac/trunk/scripts/python27
   export PATH
   cd <wdir>
   <py_runcode>
   exit
#
hpc_runcode:   chmod 755 <hpc_stdin>; qsub <hpc_stdin>
#
hpc_depend: -W depend=afterok:<jobid>
#
cmd_obj:    gfortran -c -O3 -fconvert=big-endian -DHAVE_MPI -frecord-marker=4 <mods> <incs> <f95name>
cmd_exe:    /apps/openmpi/1.6.5/gcc/4.7.2/bin/mpif90 -fconvert=big-endian -frecord-marker=4 -lpthread -v -lm -o <exename> <objs> <libs>
#

Best regards,
Sho

Running TELEMAC-2D on hpc cluster 7 years 8 months ago #25748

  • sebourban
  • OFFLINE
  • Administrator
  • Principal Scientist
  • Posts: 814
  • Thank you received: 219
Hello,

Good choice.

Actually you should not replace anything - TELEMAC will do that automatically.
However, you need a few more command-line options when you run your simulation. Something like:
telemac2d.py -s CASFile  -c hydry --ncsize 96 --nctile 8 --jobname myTELEMACname
where:
  • --ncsize 96 is your chosen total number of cores / domain splits for your particular simulation. This value of 96 will also be placed in your CASFile under the appropriate keyword.
  • --nctile 8 is the number of nodes over which your cores are spread, i.e. 96 / 8 = 12 cores per node, as if you had 8 twelve-core computers in a rack.
  • -c hydry selects your specific configuration, in case your config file holds more than one. If your config file contains only one configuration and is called systel.cfg, then it will be used by default and you don't need the -c hydry option.
  • --jobname myTELEMACname is the reference name you see in your queue once launched. This is optional and is only used to track jobs.
  • ...

That should do it. Apart from these extra options, running TELEMAC stays the same whether you run locally or on a cluster.
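
(A small hedged sketch of the sequence on the login node; qstat is only used here to confirm the submitted job appears under the chosen name:)

telemac2d.py -s CASFile -c hydry --ncsize 96 --nctile 8 --jobname myTELEMACname
qstat -u $USER     # the job should be listed as myTELEMACname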

hope this helps,

Sébastien.
The following user(s) said Thank You: drslump

Running TELEMAC-2D on hpc cluster 7 years 8 months ago #25758

  • drslump
Dear Sebastien,

Thank you for the detailed explanation.

The job is now being picked up by the scheduler, but it runs for a few seconds and is then killed.

Here is the error file returned from the HPC:
/var/spool/pbs/mom_priv/jobs/5819754.egeon2.SC: line 1: telemac2d.py: command not found
/var/spool/pbs/mom_priv/jobs/5819754.egeon2.SC: line 3: mpi_hosts:: command not found
/var/spool/pbs/mom_priv/jobs/5819754.egeon2.SC: line 4: syntax error near unexpected token `<'
/var/spool/pbs/mom_priv/jobs/5819754.egeon2.SC: line 4: `mpi_cmdexec: /apps/openmpi/1.6.5/gcc/4.7.2/bin/mpiexec -wdir <wdir> -n <ncsize> <exename>'

It seems that it is not able to find telemac2d.py. The TELEMAC software is installed in the directory /TELEMAC/v7p2r0, but I am not sure where to put this information in the configuration file.
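
(As a side note, the PATH line inside the hpc_stdin block below is what makes telemac2d.py visible inside the job; a hedged sketch pointing it at an install under /TELEMAC/v7p2r0 would look like this, where the scripts/python27 subdirectory is only an assumption mirroring the trunk layout used earlier in the thread:)

   PATH=$PATH:$HOME/bin:/TELEMAC/v7p2r0/scripts/python27   # assumed layout of the install
   export PATH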

The cluster I am using offers the following MPI implementations: OpenMPI, MPICH2 and MVAPICH2.

Here is my configuration file:
telemac2d.py -s goulais_manning.cas --ncsize 36 --nctile 3 --jobname telemac_test_case
#
mpi_hosts:    mgmt01
mpi_cmdexec: /apps/openmpi/1.6.5/gcc/4.7.2/bin/mpiexec -wdir <wdir> -n <ncsize> <exename>
#
par_cmdexec:   <config>/partel  < PARTEL.PAR >> <partel.log>
#
hpc_stdin: #!/bin/bash
   #PBS -S /bin/sh
   #PBS -o <sortiefile>
   #PBS -e <exename>.err
   #PBS -N <jobname>
   #PBS -l nodes=<nctile>:ppn=<ncnode>
   #PBS -q highp
   source /etc/profile.d/modules.sh
   module load gcc/4.7.2 openmpi/1.6.5/gcc/4.7.2 python/2.7.2
   PATH=$PATH:$HOME/bin:~/opentelemac/trunk/scripts/python27
   export PATH
   cd <wdir>
   <py_runcode>
   exit
#
hpc_runcode:   chmod 755 <hpc_stdin>; qsub <hpc_stdin>
#
hpc_depend: -W depend=afterok:<jobid>
#
cmd_obj:    gfortran -c -O3 -fconvert=big-endian -DHAVE_MPI -frecord-marker=4 <mods> <incs> <f95name>
cmd_exe:    /apps/openmpi/1.6.5/gcc/4.7.2/bin/mpif90 -fconvert=big-endian -frecord-marker=4 -lpthread -v -lm -o <exename> <objs> <libs>
#

Best regards,
Sho

Running TELEMAC-2D on hpc cluster 6 years 3 months ago #31192

  • Spinel
  • OFFLINE
  • Fresh Boarder
  • Posts: 25
Good morning,

Allow me to continue this thread with my own problem.
The main issue is that I cannot run telemac2d in parallel mode on an HPC cluster.

I have no problem running in serial mode (ncsize = 1).

The command I use:
telemac2d.py -s Tent.cas -c altixgforopmpihpc --ncsize 2 --jobname SebJobName --email <email address>

Here attached are the error and the systel files.

My environment:
HPC cluster (www.lncc.br/altix-xe/)
OS: GNU/Linux (Linux altix-xe.hpc.lncc.br 2.6.18-194.el5 #1 SMP Fri Apr 2 14:58:14 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux)
GNU Fortran (GCC) 4.7.2
Python 2.7.13
mpirun (Open MPI) 1.8.5

Could it be a problem with the MPI library? A problem with the hostfile (it is not clear to me what exactly it is)? With mpi_hosts? With mpirun vs mpiexec (the HPC website mentions that mpirun should be used)?

Any help/suggestion will be welcome. :ohmy:

Running TELEMAC-2D on hpc cluster 6 years 3 months ago #31193

  • c.coulet
  • OFFLINE
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Hi

If you take the time to look at the error file, you will find the message:
ERROR -1 DURING CALL OF GET_DATA_TIME_SRF:READ
ERROR TEXT: UNKNOWN ERROR

PLANTE: PROGRAM STOPPED AFTER AN ERROR
RETURNING EXIT CODE: 2

This means it is not an MPI problem or anything similar, but a problem with TELEMAC reading some values from a SELAFIN file...

Since you wrote that it works in serial, I suppose there is a problem during the partitioning step. And as there is only one SELAFIN file in your case (the geometry), you should look into partel_T2DGEO.log.
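
(A quick hedged way to check, assuming you look inside the temporary working directory that the run creates next to the case file:)

cd <temporary directory of the run>
tail -n 20 partel_T2DGEO.log   # should end with PARTEL: NORMAL TERMINATION if partitioning went fine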

Regards
Christophe

Running TELEMAC-2D on hpc cluster 6 years 3 months ago #31194

  • Spinel
  • OFFLINE
  • Fresh Boarder
  • Posts: 25
Thank you for the help.

It seems that partel is not encountering any problem...
(---- PARTEL: NORMAL TERMINATION ----+ at the end of the partel_T2DGEO.log)

I should also mention that in the systel file I have 2 versions of the configuration (1: direct, 2: with qsub).
When I run the first in parallel mode, I receive the following warnings (but it runs):
mlx4: There is a mismatch between the kernel and the userspace libraries: Kernel does not support XRC. Exiting.
CMA: unable to open RDMA device

The second version (which is the one I need) stops with the error mentioned in the previous message.

Running TELEMAC-2D on hpc cluster 6 years 3 months ago #31195

  • Spinel
  • OFFLINE
  • Fresh Boarder
  • Posts: 25
I should also add that, as advised by the error message, I modified mpi_cmdexec to:

mpi_cmdexec: /hpc/openmpi-1.8.5-gcc47/bin/mpiexec --mca orte_base_help_aggregate -wdir <wdir> -n <ncsize> <exename>
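
(Side note: the Open MPI --mca switch takes a parameter name and a value, so the suppression hint from the error message is normally written with a trailing 0; a hedged sketch of the corrected line:)

mpi_cmdexec: /hpc/openmpi-1.8.5-gcc47/bin/mpiexec --mca orte_base_help_aggregate 0 -wdir <wdir> -n <ncsize> <exename>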

I received a new error message (attached here).

Any advice?

Running TELEMAC-2D on hpc cluster 6 years 3 months ago #31200

  • Spinel
  • OFFLINE
  • Fresh Boarder
  • Posts: 25
Just to add some information.
I finally changed the configuration in the systel file from a Python-HPC queue to an Mpiexec-HPC queue (inspired by systel.cis-hydra.cfg).

It works.
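
(For readers following along: in the configuration quoted earlier in this thread, the brief comment explains that the Python-in-queue variant calls <py_runcode> inside HPC_STDIN and uses hpc_runcode, while the hydra-style variant calls <mpi_cmdexec> and uses hpc_cmdexec; a minimal sketch of the changed lines, under those assumptions:)

hpc_stdin: #!/bin/bash
   #PBS -N <jobname>
   #PBS -l nodes=<nctile>:ppn=<ncnode>
   cd <wdir>
   <mpi_cmdexec>
#
hpc_cmdexec: chmod 755 <hpc_stdin>; qsub <hpc_stdin>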