
TOPIC: Installation problem - parallel Ubuntu 18.04

Installation problem - parallel Ubuntu 18.04 5 years 3 months ago #34256

  • BCPServerTeam
Hello,

Firstly, thanks to the forum posts and documentation, I've been able to get quite close to a working system.

But like others, I am having trouble getting Telemac to compile, specifically in parallel mode on Ubuntu 18.04.

I want to run Telemac in parallel across a number of servers, each with many cores - we don't have an HPC setup here, but will use Open MPI.

I believed I got Telemac to compile, but could see it was only using 1 core. Upon investigating, I found these lines echoed out by Telemac2d.py when running the t2d_malpasset.cas example file:

... modifying run command to MPI instruction

... modifying run command to PARTEL instruction

... partitioning base files (geo, conlim, sections, zones and weirs)
    +> /home/mpiuser/telemac-mascaret/v8p0r2/builds/ubugfmpich2/bin/partel < PARTEL.PAR >> partel_T2DGEO.log
STOP 1
runPartition:
   |runPARTEL: Could not split your file T2DGEO (runcode=1) with the error as follows:
   |
   |... The following command failed for the reason above (or below)
   |/home/mpiuser/telemac-mascaret/v8p0r2/builds/ubugfmpich2/bin/partel < PARTEL.PAR >> partel_T2DGEO.log
   |
   |      You may have forgotten to compile PARTEL with the appropriate compiler directive
   |        (add -DHAVE_MPI to your cmd_obj in your configuration file).
   |
   |
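The hint at the end of this message points at the cmd_obj line of the configuration file, where a misspelled define is easy to miss. A minimal sketch for listing every -D flag that actually appears on the cmd_obj lines, so that a typo stands out; the helper name list_defines is made up for this illustration and is not part of Telemac:

```shell
# list_defines CFG
# Print each distinct -D... flag found on the (uncommented) cmd_obj lines
# of the given systel configuration file, one per line.
list_defines() {
    grep '^cmd_obj:' "$1" | grep -o -e '-D[A-Za-z0-9_]*' | sort -u
}
```

Calling it as `list_defines /path/to/your/systel.cfg` (the path is a placeholder) should show -DHAVE_MPI for a parallel configuration; anything that merely looks similar is worth a second look.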


I spotted a typo in my systel cfg file: it said "-DAHVE_MPI". I corrected this and recompiled, but am now left with a different error.

When I recompile using 'compileTELEMAC.py --clean', I get the following output:


   - completed: .../v8p0r2/sources/utils/special/declarations_special.F
[                                                                ]   0%  | ---s

   - completed: .../v8p0r2/sources/utils/partel/c_binding.F
[\\\\\\\\\\\                                                       ]  16%  | 3s

   - completed: .../v8p0r2/sources/utils/special/check_allocate.f
[\\\\\\\\\\\\\\\\\\\\\\                                            ]  33%  | 2s

   - completed: .../v8p0r2/sources/utils/special/plante.F
[\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\                                 ]  50%  | 1s

   - completed: .../v8p0r2/sources/utils/partel/partel_prelim.F
[\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\                      ]  66%  | 1s

   - completed: .../v8p0r2/sources/utils/partel/partitioner.F
[\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\           ]  83%  | 1s
ar: `u' modifier ignored since `D' is the default (see `U')
ar: `u' modifier ignored since `D' is the default (see `U')
Driving: /usr/bin/gfortran -fconvert=big-endian -frecord-marker=4 -lpthread -v -o /home/mpiuser/telemac-mascaret/v8p0r2/builds/ubugfmpich2/bin/partel_prelim c_binding.o partitioner.o partel_prelim.o /home/mpiuser/telemac-mascaret/v8p0r2$
Using built-in specs.
COLLECT_GCC=/usr/bin/gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-vers$
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
Reading specs from /usr/lib/gcc/x86_64-linux-gnu/7/libgfortran.spec
rename spec lib to liborig
COLLECT_GCC_OPTIONS='-fconvert=big-endian' '-frecord-marker=4' '-v' '-o' '/home/mpiuser/telemac-mascaret/v8p0r2/builds/ubugfmpich2/bin/partel_prelim' '-L/usr/lib/openmpi/lib/libmpi.so' '-L/home/mpiuser/telemac-mascaret/metis-5.1.0/lib/l$
COMPILER_PATH=/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/7/../../../../lib/:/lib/x86_64-linux-gnu/:/lib/../lib/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib/:/u$
COLLECT_GCC_OPTIONS='-fconvert=big-endian' '-frecord-marker=4' '-v' '-o' '/home/mpiuser/telemac-mascaret/v8p0r2/builds/ubugfmpich2/bin/partel_prelim' '-L/usr/lib/openmpi/lib/libmpi.so' '-L/home/mpiuser/telemac-mascaret/metis-5.1.0/lib/l$
 /usr/lib/gcc/x86_64-linux-gnu/7/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/7/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper -plugin-opt=-fresolution=/tmp/ccGZDDW3.res -plugin-opt=-pass-through=-lgcc_s -plu$
c_binding.o: In function `__c_binding_MOD_mymetis_partmeshdual':
c_binding.F:(.text+0x1f): undefined reference to `METIS_PartMeshDual'
collect2: error: ld returned 1 exit status

I am guessing this has something to do with metis. Perhaps I will have to recompile it with a different compiler, but I am unsure how to tell cmake to do this, and I could be completely wrong about the cause.
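One way to rule the METIS build itself in or out before recompiling anything is to check whether the library exports the symbol the linker is complaining about. A minimal sketch, assuming binutils' nm is installed; has_symbol is a hypothetical helper name and /path/to/libmetis.a is a placeholder for wherever your METIS library ended up:

```shell
# has_symbol LIB SYMBOL
# Succeed if LIB exports SYMBOL as an external symbol.
# nm -g lists external symbols; grep -w matches the whole symbol name only.
has_symbol() {
    nm -g "$1" 2>/dev/null | grep -qw "$2"
}

# Placeholder path: point METIS_LIB at your own library file.
METIS_LIB=${METIS_LIB:-/path/to/libmetis.a}
if has_symbol "$METIS_LIB" METIS_PartMeshDual; then
    echo "METIS_PartMeshDual found in $METIS_LIB"
else
    echo "METIS_PartMeshDual not found in $METIS_LIB (or the file could not be read)"
fi
```

If the symbol is found, the library is probably fine and the link line (the order and form of the -L/-l arguments) is the more likely suspect.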

Thanks for any help, in advance.

Regards
The administrator has disabled public write access.

Installation problem - parallel Ubuntu 18.04 5 years 3 months ago #34268

  • BCPServerTeam
Yesterday I spent time building a new server, and ran through a fresh build. I thought this was worth doing in case something got messed up with the previous servers - I think loading directories from NFS may have caused some issues with clock skew.

So this time it is a standalone server, working with local data.

It now fails with a different error (see attached file).

I built the server with the following commands:
#configure IPv4 details
sudo nano /etc/netplan/01-netcfg.yaml
sudo netplan apply
sudo apt-get update && sudo apt-get upgrade

#set timezone
sudo timedatectl set-timezone Europe/London

#disable IPv6
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo nano /etc/default/grub
sudo update-grub

# add software to server.
sudo apt-get install openmpi* libopenmpi-dev gfortran python subversion build-essential mayavi2 nfs-common
sudo apt-get install python-numpy python-scipy python-matplotlib

#make folder and get telemac.
mkdir ~/telemac-mascaret && cd ~/telemac-mascaret
svn co --username ot-svn-public --password telemac1* http://svn.opentelemac.org/svn/opentelemac/tags/v8p0r2

# Copy metis out of the telemac optionals directory into ~/source/metis-5.1.0 and build it, installing the output to ~/telemac-mascaret/metis-5.1.0:
mkdir ~/source/ && cd ~/telemac-mascaret/v8p0r2/optionals && cp -R ./metis-5.1.0/ ~/source/ && cd ~/source/metis-5.1.0
cmake -D CMAKE_INSTALL_PREFIX=~/telemac-mascaret/metis-5.1.0 .
make
make install

# copy & edit the pysource template:
cd ~/telemac-mascaret/v8p0r2/configs/ && cp pysource.template.sh pysource.bop.sh
nano pysource.bop.sh

# configure systelcfg:
nano /home/ot/telemac-mascaret/v8p0r2/configs/systel.bop.cfg

sudo reboot

config.py
compileTELEMAC.py

From this, I get an error (attached, as it is quite long).

Does anyone have any idea why this is? Again, it looks like something to do with metis and partel.

I did wonder if it is related to this post - www.opentelemac.org/index.php/kunena/12-...mpatible-with-partel - but that was some time ago and primarily affected v6, so I am guessing it has long since been resolved during the development of v7 and v8.
Attachments:

Installation problem - parallel Ubuntu 18.04 5 years 3 months ago #34270

  • yugi
  • OFFLINE
  • openTELEMAC Guru
  • Posts: 851
  • Thank you received: 244
Hi,

I think the issue is in your configuration file (systel.bop.cfg).
Could you post it here?
There are 10 types of people in the world: those who understand binary, and those who don't.

Installation problem - parallel Ubuntu 18.04 5 years 3 months ago #34271

  • BCPServerTeam
Sure.

It's a bit of a mess - I have tried many things and read a lot before finally posting for help on the forum :)

A lot of it is commented out!

I got the system working in 'ubugfortrans' mode, which I believe is non-parallel. I believe we are going to be running a big simulation, so we need parallel. We are hoping to use MPI across some nodes - we should be able to use 40 or so CPUs.

The problem is with the ubugfmpich2 bit, which is the one I'm interested in.

I changed the name of the file to systel.txt but it's really systel.bop.cfg

Thanks
Attachments:

Installation problem - parallel Ubuntu 18.04 5 years 3 months ago #34277

  • yugi
Could you try with the systel file I attached?
I made a few corrections that should help.
Attachments:
The following user(s) said Thank You: BCPServerTeam

Installation problem - parallel Ubuntu 18.04 5 years 3 months ago #34280

  • BCPServerTeam
Great! That worked by the look of it. It compiled, and I've got it running the t2d_malpasset-fine.cas example as I write this, and all of the CPUs are 100% with telemac processes.

So the main fixes were to switch to 'mpif90', to not put '-L <lib name>' in the libs_all argument (just the path to metis itself), and to get rid of incs_all altogether.

Thanks for your help Yugi. Hope this thread helps others too.
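For anyone wondering why switching to mpif90 makes a separate incs_all unnecessary: the wrapper can be asked which compile and link flags it adds on your behalf. A minimal sketch, assuming the Open MPI wrapper (which understands --showme:compile and --showme:link) with a fallback to -show for MPICH-style wrappers; the function name is made up for this illustration:

```shell
# show_mpi_wrapper_flags
# Print the include and library flags that the mpif90 wrapper passes to the
# underlying Fortran compiler, or a short message if mpif90 is not installed.
show_mpi_wrapper_flags() {
    if command -v mpif90 >/dev/null 2>&1; then
        # Open MPI wrappers understand --showme; MPICH wrappers use -show
        if mpif90 --showme:compile 2>/dev/null; then
            mpif90 --showme:link
        else
            mpif90 -show
        fi
    else
        echo "mpif90 not found on PATH"
    fi
}

show_mpi_wrapper_flags
```

On an Open MPI install the output should contain the -I and -L/-l entries for MPI, which is what previously had to be written by hand in the configuration file.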

Installation problem - parallel Ubuntu 18.04 5 years 3 months ago #34281

  • yugi
The -L should be followed by a library directory, not a library file. As for mpif90, using it ensures you are using the right MPI paths (it contains the -I and -L for MPI).
As for incs_all, because I used mpif90 we do not need it anymore.

But I think the only thing making it crash was the -L: it was cancelling the library after it.
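The -L rule above can be checked mechanically. A minimal sketch that scans a string of linker flags and reports any -L whose argument looks like a library file (ending in .a or .so) rather than a directory; find_bad_L is a hypothetical helper, and /opt/example/lib and -lexample are placeholder values, not taken from the attached configuration:

```shell
# find_bad_L "FLAGS"
# Print every -L option in FLAGS whose argument looks like a library file
# instead of a directory. Handles both "-L/x/libfoo.a" and "-L /x/libfoo.a".
find_bad_L() {
    prev=""
    for tok in $1; do   # unquoted on purpose: split the flags on whitespace
        case "$prev $tok" in
            "-L "*.a|"-L "*.so|"-L "*.so.*) echo "-L $tok" ;;
        esac
        case "$tok" in
            -L?*.a|-L?*.so|-L?*.so.*) echo "$tok" ;;
        esac
        prev="$tok"
    done
}

# The first flag is the one visible in the failing link line earlier in the thread;
# the other two are placeholders showing the accepted "-L<dir> -l<name>" form.
find_bad_L "-L/usr/lib/openmpi/lib/libmpi.so -L/opt/example/lib -lexample"
```

The accepted forms are either a directory after -L combined with -l<name>, or simply the full path to the library file with no -L in front of it, which is what ended up in libs_all here.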
The following user(s) said Thank You: Htun Pyae Sone, BCPServerTeam

Installation problem - parallel Ubuntu 18.04 5 years 3 months ago #34283

  • BCPServerTeam
Thanks Yugi, that's great information :-)

Installation problem - parallel Ubuntu 18.04 5 years 3 months ago #34309

  • flanagan
  • OFFLINE
  • Fresh Boarder
  • Posts: 28
  • Thank you received: 8
Thanks a lot for your answers.
I will check on Monday if the configuration of the model corresponds to what you advise me to do.
cheers