TOPIC: Problem launching a T3D case on a Linux cluster

Problem launching a T3D case on a Linux cluster 8 years 9 months ago #19685

  • pilou1253
Hi all!

I am trying for the first time to launch a T3D case on a Linux cluster (enCORE, Hartree Centre facility).

My case already runs successfully in parallel on my Windows PC. I use version 7.0.1, but the run fails on the cluster: the program stops in Damocles (see the attached error log).

The shell command is pointing at the config file, so it is strange that the program is not finding the config information.

I don't think the problem comes from the cluster installation, but please check the attached config file.

I also attach my steering file.

Does anyone have an idea why my case is not running?

Thank you in advance!

Best regards
PL


File Attachment:

File Name: runcode_sh.txt
File Size: 1 KB


File Attachment:

File Name: errorlog.txt
File Size: 2 KB


File Attachment:

File Name: cas3d_2016-02-09.txt
File Size: 11 KB


File Attachment:

File Name: systel.cis-redhat.cfg
File Size: 3 KB

Problem launching a T3D case on a Linux cluster 8 years 9 months ago #19688

  • yugi
Are you using Perl? I ask because your script launches telemac3d and not telemac3d.py.

It looks like you are launching the telemac3d executable directly rather than the environment script, hence the error.

I would recommend using Python.

Hope it helps.
There are 10 types of people in the world: those who understand binary, and those who don't.

Problem launching a T3D case on a Linux cluster 8 years 9 months ago #19691

  • pilou1253
Hi,

Thanks for your reply. I followed the existing TELEMAC instructions for this cluster, and the cluster support team has been helping me launch the case (but they are now out of ideas).

I assume, then, that they use Perl. In any case, I tried Python instead, but the cluster does not seem to have all the required modules (at least numpy is missing). I hope they can look into that.

Could the error I get be explained by this script issue?

Best regards
PL

Problem launching a T3D case on a Linux cluster 8 years 9 months ago #19709

  • yugi
NumPy is a dependency, so it is expected that the Python scripts do not work without it.

As for the error you get with Perl, I think it comes from your runcode_sh script.
Try replacing the line:
export PATH=/gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/builds/encore/bin:/bin:${PATH}
with
export PATH=/gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/scripts/perl5:/bin:${PATH}

That should solve the problem.

Hope it helps.
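
To see which file a command name such as telemac3d actually resolves to for a given directory order, a small check along the following lines may help. It is only a sketch: the helper name resolve_command is hypothetical, and it relies solely on the standard-library function shutil.which.

```python
import os
import shutil


def resolve_command(name, path_dirs):
    """Return the full path `name` resolves to when only `path_dirs`
    are searched in order (first match wins), or None if not found."""
    return shutil.which(name, path=os.pathsep.join(path_dirs))


if __name__ == "__main__":
    # Check against the PATH of the current shell session.
    current_dirs = os.environ.get("PATH", "").split(os.pathsep)
    print("telemac3d resolves to:", resolve_command("telemac3d", current_dirs))
```

If the printed path lies under builds/.../bin, the compiled executable is being picked up rather than the launcher script.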

Problem launching a T3D case on a Linux cluster 8 years 9 months ago #19711

  • pilou1253
Hi,

Thanks. They have used Python, and they need to add the matplotlib module, which was missing. I hope this fixes the problem.
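
A quick way to check whether the modules mentioned in this thread (numpy and matplotlib) are visible to a given Python interpreter is sketched below. The function name missing_modules is hypothetical, and the list of modules is only what was reported missing here, not a complete list of TELEMAC requirements.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # numpy and matplotlib are the two modules reported missing in this thread.
    missing = missing_modules(["numpy", "matplotlib"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

Run it with the same interpreter that launches the TELEMAC scripts, since a cluster can have several Python installations.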

Best regards
PL

Problem launching a T3D case on a Linux cluster 8 years 9 months ago #19769

  • pilou1253
Hi again,

Unfortunately the problem was not fixed.
The program can now be launched, but there seems to be a problem with PARTEL / METIS.

This is the error message I get from Python:
... reading the main module dictionary

... processing the main CAS file(s)
    +> running in English

... checking parallelisation

... handling temporary directories

... checking coupling between codes

... first pass at copying all input files
    re-copying:  /gpfs/stfc/local/HCP012/jjd66/pxl28-jjd66/cas.cas_2016-02-15-17h55min08s/T2DCAS
       copying:  randvillkor.cli /gpfs/stfc/local/HCP012/jjd66/pxl28-jjd66/cas.cas_2016-02-15-17h55min08s/T2DCLI
       copying:  geo.slf /gpfs/stfc/local/HCP012/jjd66/pxl28-jjd66/cas.cas_2016-02-15-17h55min08s/T2DGEO
       copying:  telemac2d.dico /gpfs/stfc/local/HCP012/jjd66/pxl28-jjd66/cas.cas_2016-02-15-17h55min08s/T2DDICO

... checking the executable
    re-copying:  telemac2d /gpfs/stfc/local/HCP012/jjd66/pxl28-jjd66/cas.cas_2016-02-15-17h55min08s/out_telemac2d

... modifying run command to MPI instruction

... modifying run command to PARTEL instruction

... partitioning base files (geo, conlim, sections and zones)
    +>  /gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/builds/encore/bin/partel < PARTEL.PAR >> partel_T2DGEO.log
... The following command failed for the reason above
/gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/builds/encore/bin/partel < PARTEL.PAR >> partel_T2DGEO.log

I also attach the partel log, the config file (config [encore]) and my launch command script.

Any help would be greatly appreciated!

Thank you in advance!

Best regards
PL


File Attachment:

File Name: partel_T2DGEO.log_2016-02-15.txt
File Size: 3 KB


File Attachment:

File Name: PARTEL.PAR.txt
File Size: 0 KB


File Attachment:

File Name: systel.cis-redhat.cfg.txt
File Size: 3 KB


File Attachment:

File Name: runcode.sh.txt
File Size: 1 KB

Problem launching a T3D case on a Linux cluster 8 years 9 months ago #19775

  • yugi
You get this error if you compiled without the option -DHAVE_MPI in the cmd_obj entry of your systel.cfg.
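
As a rough way to verify this, the sketch below reads a configuration and reports whether the cmd_obj entry of a given section contains -DHAVE_MPI. The sample text, the compiler command in it, and the function name has_mpi_flag are all hypothetical illustrations; the sketch also assumes the file is simple enough for Python's configparser to read, which may not hold for every systel.cfg.

```python
import configparser

# Hypothetical, heavily shortened configuration text for illustration only.
SAMPLE_CFG = """\
[Configurations]
configs: encore

[encore]
cmd_obj: mpif90 -c -O3 -DHAVE_MPI <mods> <incs> <f95name>
"""


def has_mpi_flag(cfg_text, config_name, flag="-DHAVE_MPI"):
    """Return True if `flag` appears as a separate token in the cmd_obj
    entry of section `config_name`, False otherwise."""
    # interpolation=None keeps any '%' characters in the commands literal.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(cfg_text)
    cmd_obj = parser.get(config_name, "cmd_obj", fallback="")
    return flag in cmd_obj.split()


if __name__ == "__main__":
    print("encore has -DHAVE_MPI:", has_mpi_flag(SAMPLE_CFG, "encore"))
```

To check a real file, read its contents into a string and pass that instead of SAMPLE_CFG. Note that the flag only takes effect after the sources are recompiled.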

Hope it helps.

Problem launching a T3D case on a Linux cluster 8 years 9 months ago #19779

  • pilou1253
Thanks.
Now we have a segmentation fault:
... partitioning base files (geo, conlim, sections and zones)
+> /gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/builds/encore2/bin/partel < PARTEL.PAR >> partel_T2DGEO.log

Program received signal 11 (SIGSEGV): Segmentation fault.

partel_T2DGEO.log is empty.

Any idea?

Thanks in advance!
PL

Problem launching a T3D case on a Linux cluster 8 years 9 months ago #19780

  • pilou1253
Hi again,

Here is the traceback error:

... modifying run command to MPI instruction

... modifying run command to PARTEL instruction

... partitioning base files (geo, conlim, sections and zones)
+> /gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/builds/encore2/bin/partel < PARTEL.PAR >> partel_T2DGEO.log

Program received signal 11 (SIGSEGV): Segmentation fault.
^CTraceback (most recent call last):
  File "/gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/scripts/python27/runcode.py", line 1470, in <module>
    main(None)
  File "/gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/scripts/python27/runcode.py", line 1452, in main
    runCAS(cfgname,cfg,codeName,casFiles,options)
  File "/gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/scripts/python27/runcode.py", line 1084, in runCAS
    runPartition(parcmd,GLOGEO,CONLIM,ncsize,options.bypass,section_name,zone_name)
  File "/gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/scripts/python27/runcode.py", line 615, in runPartition
    runPARTEL(partel,geom,conlim,ncsize,bypass,section_name,zone_name)
  File "/gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/scripts/python27/runcode.py", line 659, in runPARTEL
    tail,code = mes.runCmd(p,bypass)
  File "/gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/scripts/python27/utils/messages.py", line 137, in runCmd
    returncode = sp.call(exe,shell=True)
  File "/gpfs/stfc/local/apps/gcc/python/2.7.8/lib/python2.7/subprocess.py", line 522, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/gpfs/stfc/local/apps/gcc/python/2.7.8/lib/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/gpfs/stfc/local/apps/gcc/python/2.7.8/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt


Any tips?

Thanks in advance!
PL

Problem launching a T3D case on a Linux cluster 8 years 9 months ago #19798

  • yugi
Hi,

Could you try to rerun the command:
/gpfs/stfc/local/HCP012/jjd66/cxh71-jjd66/v7p0r1/builds/encore2/bin/partel < PARTEL.PAR
in your temporary folder and show the listing.

Which version of METIS are you using? The one in optionals?
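
For capturing that listing in one go, a minimal Python sketch follows. It runs a command with a file as standard input, collects everything the command prints, and describes how it ended. The function names run_with_stdin and describe_returncode are hypothetical; the sketch starts the program directly (no shell), so on POSIX systems a crash from a signal shows up as a negative return code.

```python
import signal
import subprocess


def describe_returncode(code):
    """Translate a subprocess return code into a short description."""
    if code == 0:
        return "exited normally"
    if code < 0:
        # On POSIX, a negative code means the child was killed by a signal.
        try:
            return "killed by " + signal.Signals(-code).name
        except ValueError:
            return "killed by signal %d" % -code
    return "exited with status %d" % code


def run_with_stdin(argv, stdin_path):
    """Run `argv` with the file `stdin_path` as standard input.
    Return (combined stdout/stderr text, description of the exit)."""
    with open(stdin_path, "rb") as stdin_file:
        proc = subprocess.run(argv, stdin=stdin_file,
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = proc.stdout.decode(errors="replace")
    return output, describe_returncode(proc.returncode)
```

Called from inside the temporary folder with the full partel path quoted above as the only element of argv and "PARTEL.PAR" as stdin_path, it returns the listing together with a note such as "killed by SIGSEGV" if the segmentation fault occurs again.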