TOPIC: Problem with PARTEL (Linux cluster python v7p0)

Problem with PARTEL (Linux cluster python v7p0) 9 years 10 months ago #15541

  • julesleguern
Hi everyone,

I'm trying to use Telemac on a Linux cluster. I compiled Telemac successfully on the cluster. Here is my .cfg file:

File Attachment:

File Name: systel.txt
File Size: 1 KB



Now I'm trying to launch an example (confluence), not through an SGE script but directly on the front-end node, just to see whether Telemac works. I use this command:

python /export/home/jleguern/v7p0/scripts/python27/runcode.py telemac2d -c debgfopenmpi -f systel.cfg -s /export/home/jleguern/v7p0/confluence/t2d_confluence.cas

I also set PARALLEL PROCESSORS = 8 in the .cas file.

I have attached the error log here:

File Attachment:

File Name: log.txt
File Size: 9 KB


The simulation stops when PARTEL reads the T2DGEO file.

Here is an example of the script.sh that I would like to launch:


File Attachment:

File Name: script.txt
File Size: 2 KB




If anyone has a suggestion, it would be appreciated.

Best regards.
Jules
The administrator has disabled public write access.

Problem with PARTEL (Linux cluster python v7p0) 9 years 10 months ago #15542

  • julesleguern
I forgot the partel_T2DGEO.log file


File Attachment:

File Name: partel_T2DGEO.txt
File Size: 4 KB

Problem with PARTEL (Linux cluster python v7p0) 9 years 10 months ago #15550

  • c.coulet
  • Moderator
Hi
There is clearly a problem with the partitioning step.
Did you generate the geometry file yourself? If yes, how?
It seems PARTEL is unable to read the first line (the title).
If you downloaded it, then since you appear to be running on Linux, the transfer may have gone wrong, and running dos2unix on the file could solve the problem.

Hope this helps
Christophe

Problem with PARTEL (Linux cluster python v7p0) 9 years 10 months ago #15552

  • Lufia
This looks like a little-endian / big-endian problem: your SELAFIN file has the wrong endianness, so PARTEL fails when reading it.

Maybe it works if you put the following line in your .bashrc?

export F_UFMTENDIAN=little
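
One way to check the byte order of the file before changing any setting is to look at its first four bytes. The sketch below is only an illustration under two assumptions that are not stated in this thread: that the geometry file is a Fortran unformatted sequential file with 4-byte record markers, and that its first record is the 80-character title. The function name detect_byte_order is hypothetical.

```python
import struct

# Assumption: the first SELAFIN record is the 80-character title, so the
# leading 4-byte record marker should decode to 80 in the right byte order.
TITLE_RECORD_LENGTH = 80


def detect_byte_order(path):
    """Guess the byte order of a file from its first Fortran record marker."""
    with open(path, "rb") as handle:
        marker = handle.read(4)
    if len(marker) < 4:
        raise ValueError("file is too short to contain a record marker")
    if struct.unpack(">i", marker)[0] == TITLE_RECORD_LENGTH:
        return "big"
    if struct.unpack("<i", marker)[0] == TITLE_RECORD_LENGTH:
        return "little"
    # Neither reading gives 80: the file may be damaged or not a SELAFIN file.
    return "unknown"
```

Calling detect_byte_order on the geometry file of the test case should then suggest whether the runtime needs to read big-endian or little-endian data.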

Problem with PARTEL (Linux cluster python v7p0) 9 years 10 months ago #15556

  • riadh
Hello


Did you try running the case without parallelism? (Comment out PARALLEL PROCESSORS = ... or set it to 0.)
In any case, set DEBUGGER = 1 to see exactly where it stops.

with my best regards
Riadh
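
For switching these two keywords back and forth between runs, a small helper can edit the steering file text. This is only a sketch: set_keyword is a hypothetical name, and it assumes each keyword sits alone on its own line in the form KEYWORD = value or KEYWORD : value, with commented lines starting with a slash. Anything else on a matched line (such as a trailing comment) is overwritten, and a keyword that is not found is appended at the end of the text.

```python
import re


def set_keyword(cas_text, keyword, value):
    """Return cas_text with `keyword` set to `value`, appending it if absent."""
    # Match an uncommented line that starts with the keyword, then '=' or ':'.
    pattern = re.compile(
        r"^[ \t]*" + re.escape(keyword) + r"[ \t]*[:=].*$",
        re.MULTILINE,
    )
    new_line = f"{keyword} = {value}"
    if pattern.search(cas_text):
        # A function replacement avoids backslash escapes being interpreted.
        return pattern.sub(lambda match: new_line, cas_text)
    return cas_text.rstrip("\n") + "\n" + new_line + "\n"


if __name__ == "__main__":
    sample = "PARALLEL PROCESSORS = 8\n"
    sample = set_keyword(sample, "PARALLEL PROCESSORS", 0)
    sample = set_keyword(sample, "DEBUGGER", 1)
    print(sample)
```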

Problem with PARTEL (Linux cluster python v7p0) 9 years 10 months ago #15565

  • julesleguern
Hi everyone,

Thank you for your answers.

I tried running dos2unix on the geometry file, but it's a binary file, so that didn't work. Then I tried export F_UFMTENDIAN=big, and that helped: PARTEL no longer stops when reading the title, but it stops a few steps later, when reading the x and y coordinates:


File Attachment:

File Name: partel_confluence.txt
File Size: 2 KB



So I tried another example, just to see. With the malpasset example, PARTEL does not stop at the same step as in the confluence case:

File Attachment:

File Name: partel_malpasset.txt
File Size: 3 KB


Finally, I tried launching the confluence case with 0 parallel processors:

File Attachment:

File Name: steering_file_confluence_np0.txt
File Size: 5 KB



I don't know why PARTEL stops at different steps depending on the test case.
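
To see whether the geometry files themselves are intact, one option is to walk through their records independently of PARTEL. The sketch below assumes (this is not confirmed in the thread) that each file is a Fortran unformatted sequential file in which every record is wrapped in a leading and a trailing 4-byte length marker. The function name record_lengths and its parameters are hypothetical.

```python
import struct


def record_lengths(path, byte_order=">", limit=10):
    """List the payload lengths of the first `limit` Fortran sequential records.

    byte_order is ">" for big-endian markers and "<" for little-endian ones.
    """
    lengths = []
    with open(path, "rb") as handle:
        while len(lengths) < limit:
            head = handle.read(4)
            if not head:
                break  # clean end of file
            if len(head) < 4:
                raise ValueError("truncated leading record marker")
            (size,) = struct.unpack(byte_order + "i", head)
            if size < 0:
                raise ValueError(f"record {len(lengths) + 1}: negative length")
            handle.seek(size, 1)  # skip the payload without loading it
            tail = handle.read(4)
            if len(tail) < 4:
                raise ValueError(f"record {len(lengths) + 1} is truncated")
            if struct.unpack(byte_order + "i", tail)[0] != size:
                raise ValueError(f"record {len(lengths) + 1}: markers disagree")
            lengths.append(size)
    return lengths
```

If the markers are consistent for one byte order on every test case, the files are probably fine and the problem is more likely on the reading side (compiler or runtime settings). If the title is 80 characters long, the first reported length should be 80.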

Any idea?

Best regards.

Jules

Problem with PARTEL (Linux cluster python v7p0) 9 years 9 months ago #15674

  • yugi
  • openTELEMAC Guru
Hi,

You should try adding the appropriate option to your compiler command to switch to big endian, instead of using the environment variable. For the ifort compiler, the option is:
-convert big_endian

Try recompiling the code with the following modifications in your systel file:
[debgfopenmpi]
#
options:    parallel mpi
#
par_cmdexec:   <config>/partel < PARTEL.PAR >> <partel.log>
#
mpi_cmdexec:   mpirun -wdir <wdir> -n <ncsize> <exename>
mpi_hosts:
#
cmd_obj:    mpiifort -c -O3 -convert big_endian -DHAVE_MPI <mods> <incs> <f95name>
cmd_lib:    ar cru <libname> <objs>
cmd_exe:    mpiifort -o <exename> <objs> <libs>
#
mods_all:   -I <config>
#
libs_partel:        /export/home/jleguern/metis/lib/libmetis.a
libs_all:    -lz -lstdc++ -lm

Hope it helps.
There are 10 types of people in the world: those who understand binary, and those who don't.

Problem with PARTEL (Linux cluster python v7p0) 9 years 9 months ago #15676

  • julesleguern
Hello Yugi,

I recompiled Telemac with your cfg file. Now, when I launch the malpasset example, I get this:


core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 192003
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 192003
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
mpdtrace: cannot connect to local mpd (/tmp/13778.1.all.q/mpd2.console_comcluster08_jleguern); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)


Loading Options and Configurations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

... parsing configuration file: systel.cfg


Running your CAS file for:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+> configuration: debgfopenmpi
+> root: /export/home/jleguern/v7p0
+> version v7p0


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


... reading the main module dictionary

... processing the main CAS file(s)
+> simulation en Francais

... checking parallelisation

... handling temporary directories

... checking coupling between codes

... first pass at copying all input files
copying: geo_malpasset-small.slf /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-02-02-10h27min09s/T2DGEO
copying: t2d_malpasset-small.f /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-02-02-10h27min09s/t2dfort.f
copying: geo_malpasset-small.cli /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-02-02-10h27min09s/T2DCLI
copying: f2d_malpasset-small.slf /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-02-02-10h27min09s/T2DREF
re-copying: /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-02-02-10h27min09s/T2DCAS
copying: telemac2d.dico /export/home/jleguern/v7p0/malpasset/t2d_malpasset-small.cas_2015-02-02-10h27min09s/T2DDICO

... checking the executable
... The following command failed for the reason above
mpiifort -c -O3 -convert big_endian -DHAVE_MPI -I /export/home/jleguern/v7p0/builds/debgfopenmpi/lib/utils/special -I /export/home/jleguern/v7p0/builds/debgfopenmpi/lib/utils/parallel -I /export/home/jleguern/v7p0/builds/debgfopenmpi/lib/utils/damocles -I /export/home/jleguern/v7p0/builds/debgfopenmpi/lib/utils/bief -I /export/home/jleguern/v7p0/builds/debgfopenmpi/lib/sisyphe -I /export/home/jleguern/v7p0/builds/debgfopenmpi/lib/tomawac -I /export/home/jleguern/v7p0/builds/debgfopenmpi/lib/telemac2d t2dfort.f



The simulation stops before PARTEL.

Do you have another idea?

Best regards

Jules

Problem with PARTEL (Linux cluster python v7p0) 9 years 9 months ago #15679

  • yugi
  • openTELEMAC Guru
Can you try the following actions:

1. run
mpd &
before running the case

2. post the return of the command
mpiifort -show

Hope it helps.
There are 10 types of people in the world: those who understand binary, and those who don't.

Problem with PARTEL (Linux cluster python v7p0) 9 years 9 months ago #15685

  • julesleguern
Hello,

I tried running mpd & before running the case, and I get the same error. Here is the output of the mpd & command:

[jleguern@comcluster ~]$ mpd &
[1] 3861
[jleguern@comcluster ~]$ /export/apps/intel/impi/3.2.0.011/bin64/mpdlib.py:27: DeprecationWarning: The popen2 module is deprecated. Use the subprocess module.
import sys, os, signal, popen2, socket, select, inspect
/export/apps/intel/impi/3.2.0.011/bin64/mpdlib.py:37: DeprecationWarning: the md5 module is deprecated; use hashlib instead
from md5 import new as md5new

And this is the output of the mpiifort -show command:

[jleguern@comcluster ~]$ mpiifort -show
ifort -I/export/apps/intel/impi/3.2.0.011/include64 -I/export/apps/intel/impi/3.2.0.011/include64 -L/export/apps/intel/impi/3.2.0.011/lib64 -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker $libdir -Xlinker -rpath -Xlinker /opt/intel/mpi-rt/3.2 -lmpi -lmpiif -lmpigi -lrt -lpthread -ldl


Thanks, Yugi