
TOPIC: Running Telemac-3D with python script on cluster

Running Telemac-3D with python script on cluster 12 years 2 months ago #5605

  • sumit
  • sumit's Avatar
Forgot the attachment.

File Attachment:

File Name: systel.cis-Sumit.txt
File Size: 1 KB
The administrator has disabled public write access.

Running Telemac-3D with python script on cluster 12 years 2 months ago #5606

  • sebourban
  • sebourban's Avatar
  • OFFLINE
  • Administrator
  • Principal Scientist
  • Posts: 814
  • Thank you received: 219
Hello,

You seem to use the following command for mpirun (which calls mpiexec):
mpirun -machinefile "$TMPDIR/machines" -np $NSLOTS ./xhpl

but I am not sure whether the variables $TMPDIR and $NSLOTS are replaced properly.
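A quick way to see that failure mode (a sketch, not from the original post): if the queue system never sets NSLOTS, the -np argument expands to nothing and mpirun fails in confusing ways.

```shell
# Sketch: simulate NSLOTS being unset, as can happen outside the queue system.
unset NSLOTS
# ${NSLOTS:-UNSET} prints UNSET when the variable is missing, making the
# problem visible instead of silently passing an empty -np argument.
echo "mpirun -np ${NSLOTS:-UNSET} ./xhpl"
```

Adding a line like `echo "NSLOTS=$NSLOTS"` at the top of the real job script shows whether the scheduler actually filled the variable in.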

Try the following instead, where <ncsize>, <wdir> and <exename> are replaced automatically with the correct values by runcode.py. Put

<mpi_cmdexec>

in your script; it is then replaced by the value of the key below (or just put that value directly in your script):

mpi_cmdexec: mpiexec -wdir <wdir> -n <ncsize> <exename>

...
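The placeholder substitution runcode.py performs can be sketched with sed (the values below are hypothetical):

```shell
# Template line as it appears in systel.cfg; runcode.py fills in the <...> keys.
tmpl='mpiexec -wdir <wdir> -n <ncsize> <exename>'
# Hypothetical run: working directory /home/user/run, 24 cores, executable ./out_telemac3d
echo "$tmpl" | sed -e 's|<wdir>|/home/user/run|' \
                   -e 's|<ncsize>|24|' \
                   -e 's|<exename>|./out_telemac3d|'
```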
Hope this helps.
Sébastien

Running Telemac-3D with python script on cluster 12 years 2 months ago #5607

  • sebourban
  • sebourban's Avatar
  • OFFLINE
  • Administrator
  • Principal Scientist
  • Posts: 814
  • Thank you received: 219
... or maybe you are not using an HPC queue system, in which case you should just use:

mpi_cmdexec: mpiexec -wdir <wdir> -n <ncsize> <exename>

then runcode.py will replace the values of the <...> fields.

Hope this helps,

Sébastien

Running Telemac-3D with python script on cluster 12 years 2 months ago #5608

  • sebourban
  • sebourban's Avatar
  • OFFLINE
  • Administrator
  • Principal Scientist
  • Posts: 814
  • Thank you received: 219
Ok - I just saw your config file:

here it is with a few corrections -- hope this helps:
mpi_cmdexec: mpirun -machinefile "<wdir>/../machines" -np <ncsize> <exename>
#
##### remove this command ... par_cmdexec:   <config>/partel_prelim; python <root>/pytel/utils/partitioning.py
#
# <jobname> and <email> need to be provided on the TELEMAC command line #BSUB -u <email> \n #BSUB -N
hpc_stdin: #!/bin/bash
   #$ -V                                # forward your current environment to the execution environment
   #$ -pe openmpi 12                    # no of cores requested
   #$ -S /bin/bash                      # shell it will be executed in
   #$ -l h_rt=6:00:00                   # time it will be executed for
   #$ -j y                             # join stderr into stdout
   #$ -o MBNB.out
   <mpi_cmdexec>
   exit
#
hpc_cmdexec:   chmod 755 <hpc_stdin>; qsub < <hpc_stdin>

Running Telemac-3D with python script on cluster 12 years 2 months ago #5612

  • sumit
  • sumit's Avatar
Hello Sébastien,

Thanks a lot for all your help. I believe I have almost solved the problem. Attached is my latest configuration file. If you would be kind enough to take a look at the .cfg file, you will see that I am requesting 24 cores. When I submit the job from the command line, I can see 24 instances of the executable running in "top" on the terminal, but I don't see anything in qstat.
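Processes visible in top but absent from qstat usually mean MPI was started directly on the login node rather than through the scheduler; a minimal self-contained illustration (the script name run_telemac.sh is hypothetical):

```shell
# Hypothetical launcher that calls mpirun directly instead of submitting a job.
cat > run_telemac.sh <<'EOF'
mpirun -np 24 ./out_executable
EOF
# If the launcher never goes through qsub, the scheduler knows nothing about
# the run, so qstat stays empty while top shows the processes.
grep -q 'qsub' run_telemac.sh || echo "run_telemac.sh never submits to the queue"
```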

File Attachment:

File Name: systel.cis-Sumit_2012-09-17.txt
File Size: 1 KB


I have also written to my cluster admin and am hoping that this will be resolved soon. If you have any more suggestions, please let me know.

Best regards,
Sumit

Running Telemac-3D with python script on cluster 12 years 1 month ago #5723

  • sumit
  • sumit's Avatar
Hello Sébastien,

After a lot of trials and tribulations, and based on the suggestions of my system admin, we have configured the shell script so that it calls the python script from inside the shell script. Please see the attachment.

This script works fine up to 12 processors; the system on which I am running TELEMAC has a single node with 12 processors. But when I go beyond 12 processors (24, 36, ...), the simulation is not picked up.
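One plausible explanation (an assumption, not confirmed in the thread): a single node offers 12 slots, so runs of 24 or 36 processes need a machine file spanning several nodes. A sketch with hypothetical host names, using Open MPI's hostfile syntax:

```shell
# Two 12-core nodes give the 24 slots a 24-process run needs.
cat > machines <<'EOF'
node01 slots=12
node02 slots=12
EOF
# mpirun -machinefile machines -np 24 <exename> would then spread processes
# across both nodes; count the entries as a quick sanity check.
wc -l < machines
```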

If you have any further suggestions, kindly let me know. I am very thankful for all your help.

Best regards
Sumit

File Attachment:

File Name: CLUST.txt
File Size: 2 KB

Running Telemac-3D with python script on cluster 9 years 9 months ago #15619

  • julesleguern
  • julesleguern's Avatar
Hello Sébastien,

I'm using a Linux cluster and I compiled Telemac (v7p0, python) with this systel.cfg file successfully.


File Attachment:

File Name: systel_2015-01-28.txt
File Size: 1 KB



Now, when I launch the malpasset test case, I get the following error:

core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 192003
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 192003
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
mpdtrace: cannot connect to local mpd (/tmp/13687.1.all.q/mpd2.console_comcluster17_jleguern); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)


Loading Options and Configurations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

... parsing configuration file: systel.cfg


Running your CAS file for:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+> configuration: debgfopenmpi
+> root: /export/home/jleguern/v7p0
+> version v7p0


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


... reading the main module dictionary

... processing the main CAS file(s)
+> simulation en Francais

... checking parallelisation

... handling temporary directories

... checking coupling between codes

... first pass at copying all input files
copying: geo_malpasset-small.slf /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-01-28-17h34min57s/T2DGEO
copying: t2d_malpasset-small.f /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-01-28-17h34min57s/t2dfort.f
copying: geo_malpasset-small.cli /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-01-28-17h34min57s/T2DCLI
copying: f2d_malpasset-small.slf /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-01-28-17h34min57s/T2DREF
re-copying: /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-01-28-17h34min57s/T2DCAS
copying: telemac2d.dico /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-01-28-17h34min57s/T2DDICO

... checking the executable
re-copying: t2d_malpasset-small /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-01-28-17h34min57s/out_t2d_malpasset-small

... modifying run command to MPI instruction

... modifying run command to PARTEL instruction
partitioning: T2DGEO
+> /export/home/jleguern/v7p0/builds/debgfopenmpi/bin/partel < PARTEL.PAR >> partel_T2DGEO.log
... The following command failed for the reason above
/export/home/jleguern/v7p0/builds/debgfopenmpi/bin/partel < PARTEL.PAR >> partel_T2DGEO.log


I don't know if it comes from my cfg file or from my script.sh.
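Aside: the mpdtrace error at the top of the log means MPICH2's mpd process manager was not running; it is normally started before mpiexec, roughly like this (a sketch; it assumes MPICH2's mpd tools are on the PATH):

```
mpd --daemon    # start a single-host mpd in the background
mpdtrace        # should now print this host's name instead of the error
mpdallexit      # shut the daemon down again after the run
```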


File Attachment:

File Name: script_2015-01-28.txt
File Size: 2 KB



Do you have any suggestions, or more up-to-date scripts and cfg files?

Best regards.

Jules

Running Telemac-3D with python script on cluster 9 years 9 months ago #15624

  • c.coulet
  • c.coulet's Avatar
  • OFFLINE
  • Moderator
  • Posts: 3722
  • Thank you received: 1031
Hi

Are you able to run the partel command? It seems it doesn't work.

Another point, why don't you use the hpc configuration in your systel.cfg file?

Regards
Christophe

Running Telemac-3D with python script on cluster 9 years 9 months ago #15643

  • julesleguern
  • julesleguern's Avatar
Hello,

Thank you for your reply. Indeed, there is a problem with PARTEL. I don't know why, but partel stops after calling Metis.

I don't use the hpc configuration because I don't know how to configure it. Is there a manual which explains all the options of the cfg file? It's not easy to get a good configuration with all the options that Telemac needs.

I don't know if I am doing the right thing by compiling Telemac with the cfg file (see above) and then launching the command: qsub script.sh, which refers to the script.sh above.

In the examples of cfg files, the script.sh is included in the cfg.
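For reference, the hpc keys shown earlier in this thread (post #5608) embed the submission script in the cfg, so that runcode.py writes and qsubs the job script itself and no separate script.sh is needed; a condensed sketch (values hypothetical):

```
mpi_cmdexec: mpiexec -wdir <wdir> -n <ncsize> <exename>
hpc_stdin: #!/bin/bash
   #$ -V
   #$ -pe openmpi <ncsize>
   #$ -S /bin/bash
   <mpi_cmdexec>
   exit
hpc_cmdexec: chmod 755 <hpc_stdin>; qsub < <hpc_stdin>
```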

Regards.

Jules

Running Telemac-3D with python script on cluster 9 years 9 months ago #15885

  • konsonaut
  • konsonaut's Avatar
  • OFFLINE
  • openTELEMAC Guru
  • Posts: 413
  • Thank you received: 144
Hello,

I encounter the same problem as Jules when launching a simulation with Python, using the newly installed Telemac version v7p0r0 on our HPC cluster.
The error message is:

File Attachment:

File Name: VSC2_Error_message.txt
File Size: 2 KB


and here the Partel log file:

File Attachment:

File Name: partel_T2DGEO.txt
File Size: 67 KB


which says at the end:
BEGIN PARTITIONING WITH METIS
ERROR: TRY TO RUN PARTEL WITH A SERIAL CONFIGURATION
PLANTE: PROGRAM STOPPED AFTER AN ERROR
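This message indicates partel must be run serially. In this thread the scripts drive partel with a command of the form below (as in Jules' log); the thing to check is whether the partel binary under the build's bin directory was compiled with a serial (non-MPI) configuration:

```
<config>/bin/partel < PARTEL.PAR >> partel_T2DGEO.log
```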

The config file is:

File Attachment:

File Name: systel.cis-vsc2.cfg
File Size: 5 KB


For any help I would be very glad.


Best regards,
Clemens