TOPIC: can't run in parallel - maybe installation errors

can't run in parallel - maybe installation errors 11 years 2 weeks ago #10863

  • 716469
Dear Users,

Could anyone please advise me on the following?
I have successfully run Telemac on my own PC for many months. Now I am trying to install it on a cluster to run it in parallel. The platform is Linux Ubuntu, the compiler is gfortran and the MPI library is openmpi_gnu-1.6.4.
It compiled fine. It runs on the cluster as a serial process but it does not run in parallel. I think I have not configured the systel.cis-ubuntu.cfg and systel-all.ini files correctly in terms of the parallel settings.
Also, for some reason, when I run it on the cluster in serial it works for telemac cas files, but when I try to run it for postel cas files (runcode.py postel3d cas.file) it does not work; I just get a message that the CAS failed to run.
So I have two issues here: first, I cannot run in parallel; second, I cannot run postel in serial even though telemac runs OK.
I have attached two systel files; maybe somebody could tell me what I have done wrong there. Also, are there any more environment variables to be set up on the cluster apart from PATH=/..../pytel and SYSTELCFG=/...../systel.cis-ubuntu.cfg ?

Thanks in advance.

Violet

can't run in parallel - maybe installation errors 11 years 2 weeks ago #10864

  • yugi
Hi,

You are using v6p2, right?

Since v6p2, Telemac-Mascaret requires version 5.x.x of metis. That could be a source of error.

Could you post the error you get on the cluster for a parallel run, and the error for postel as well?

For the environment variables, you can add USETELCFG=ubugfopenmpi, which specifies which configuration to use.
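
A quick way to confirm that the three variables mentioned in this thread (PATH, SYSTELCFG and USETELCFG) are actually visible in the shell or job that launches Telemac might look like the minimal Python sketch below. The function names are hypothetical and are not part of Telemac; the check only tests whether each variable is set, not whether its value is correct for your cluster.

```python
import os

# Variables named in this thread; their values are site-specific and are
# not validated here (assumption: being set is all we can check generically).
REQUIRED_VARS = ("PATH", "SYSTELCFG", "USETELCFG")


def check_telemac_env(env=None):
    """Map each required variable to its value, or None if unset or empty."""
    env = os.environ if env is None else env
    return {name: env.get(name) or None for name in REQUIRED_VARS}


def missing_vars(env=None):
    """Return the names of the required variables that are unset or empty."""
    return [name for name, value in check_telemac_env(env).items() if value is None]


if __name__ == "__main__":
    for name, value in check_telemac_env().items():
        print(f"{name} = {value if value is not None else '<not set>'}")
```

Running it once on the login node and once from inside a PBS job would show whether the batch environment is missing something that the interactive shell has.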

Hope it helps
There are 10 types of people in the world: those who understand binary, and those who don't.

can't run in parallel - maybe installation errors 11 years 2 weeks ago #10865

  • 716469
Thanks Yugi,

Yes, you are right, I am using v6p2. I have metis-4.0.3, but I see there is a newer version, 5.1.0. I will try this now and see if it works; I hope it will. If I get any errors I will post them on the forum. Thank you very much for your advice, much appreciated.

Kind Regards!

Violet

can't run in parallel - maybe installation errors 11 years 2 weeks ago #10867

  • 716469
Hi Yugi,

I have installed metis-5.1.0. I hope I did everything correctly, but it was a little bit messy in comparison with the older version, metis-4.0.3.
I have compiled Telemac and it was fine, and I can run cas files on the cluster in serial but not in parallel. The postel3d failure also still persists when I run on the cluster in serial.
When I run in parallel the job goes into the queue but nothing happens, and I do not get any results or error messages. Maybe it is a local cluster problem and I did not write the PBS script correctly.
I attach the error I get when I try to run postel3d, although it does not really show much. I also attach systel-ubuntu.cfg, systel.all.cfg and the PBS script (renamed to .cfg type, as otherwise you might not be able to open it), just in case.
Please advise. Thanks for your help.

Kind Regards!

Violet

can't run in parallel - maybe installation errors 11 years 2 weeks ago #10873

  • yugi
Hi,

A few things:
- In the .cfg file you posted there is still a link to metis 4.1. Did you change the value of libs_parallel in the .cfg file?
- In the PBS script the command should be:
python runcode.py telemac2d -s t2d_wave.cas --host=$PBS_NODEFILE -np ${NPROCS}
instead of:
${MPIRUN} -machinefile $PBS_NODEFILE -np ${NPROCS} python runcode.py telemac2d -s t2d_wave.cas
- I could not find the postel3d error file, only a systel.ini file.
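
The first two points above can be sketched in Python as follows. This is only an illustration under stated assumptions: `find_metis_versions` is a hypothetical helper that scans the text of a .cfg file for any metis version string, so that a leftover reference to an old metis is easy to spot, and `build_runcode_command` simply assembles the command line exactly as written in this reply. The `--host` and `-np` options are copied from the reply and have not been checked against any particular Telemac release.

```python
import re


def find_metis_versions(cfg_text):
    """Return the distinct metis versions mentioned in cfg_text, in order of appearance."""
    versions = []
    for match in re.finditer(r"metis[-_ ]?(\d+(?:\.\d+)*)", cfg_text, flags=re.IGNORECASE):
        if match.group(1) not in versions:
            versions.append(match.group(1))
    return versions


def build_runcode_command(cas_file, nodefile, nprocs, module="telemac2d"):
    """Assemble the runcode.py call quoted in this thread as an argument list.

    runcode.py is launched directly and is not wrapped in mpirun.
    """
    return [
        "python", "runcode.py", module,
        "-s", cas_file,
        f"--host={nodefile}",
        "-np", str(nprocs),
    ]


if __name__ == "__main__":
    # Hypothetical cfg fragment; the paths are made up for illustration only.
    sample = "libs_parallel: /opt/metis-4.0.3/libmetis.a /opt/metis-5.1.0/lib/libmetis.a"
    print(find_metis_versions(sample))
    print(" ".join(build_runcode_command("t2d_wave.cas", "$PBS_NODEFILE", 8)))
```

If the scan reports more than one version, or only a 4.x version, the parallel section of the .cfg file is worth rechecking before resubmitting the job.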

can't run in parallel - maybe installation errors 11 years 2 weeks ago #10876

  • 716469
Hi Yugi,

Thanks a mil for the help with the PBS script and for looking into my problem. Actually, sorry, I sent you the old systel.cis-ubuntu.cfg file that had metis-4.0.3. I changed it to metis-5.1.0 yesterday and it still was not working. But at least I now know, thanks to you, that my PBS script is right, as I did not know how to do that correctly.
In my systel.cis-ubuntu.cfg I updated only the ubugfortran and ubugfopenmpi configurations; I did not do anything with ubunsun etc. but left them there. Maybe I should get rid of them? Also, in the systel-all.ini file I am not sure which option applies to my cluster machine. I have Linux + OpenMPI, and the closest one to mine is Linux debian GFORTRAN + MPICH, unless I have to create one for Linux + OpenMPI. I also have HOSTTYPE=gfortran_linux, which is for Linux debian GFORTRAN + MPICH, so maybe I have to enter something else there. I have not really amended the systel-all file except for HOSTTYPE; maybe I should have done some more work there? Or maybe there are some missing environment variables that have to be set up specifically for parallel runs?

Why postel3d does not work on the cluster in serial is a mystery to me too, as telemac3d and telemac2d run on the cluster in serial with no problems, so postel should run too.

Just one remark: my desktop PC is 32-bit but the cluster is 64-bit. Do you think that could cause the problem?

Thank you so much for your time, it is really much appreciated.

Kind Regards!

Violet

can't run in parallel - maybe installation errors 11 years 2 weeks ago #10877

  • 716469
Hi Yugi,

Sorry, I just read your email again and see that I never sent you my postel3d error message. Please see attached. Thanks.

Kind Regards!

Violet

can't run in parallel - maybe installation errors 11 years 2 weeks ago #10879

  • yugi
Thanks for the information.
The systel.ini file is not used by Telemac with Python, so do not bother with it.
The other configurations in the .cfg file do not matter, so do not bother with them either.

Can you run in interactive mode on your cluster (i.e. connect directly to a node so that you can launch programs directly without going through PBS)?

As for postel3d, could you try going into the temporary folder generated by postel3d.py and rerunning the executable file?
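
Rerunning the executable from inside the temporary folder could be sketched in Python as below. This is a rough illustration under assumptions: the helper names are hypothetical, the sketch assumes a POSIX system, and it assumes the executable is the only file in the folder with the execute bit set, which may not hold on every installation. The point is to run the compiled binary itself rather than passing it back to runcode.py.

```python
import os
import subprocess


def find_executables(folder):
    """List regular files in folder that have the execute bit set."""
    found = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            found.append(name)
    return found


def rerun(folder, exe_name):
    """Run exe_name with folder as the working directory.

    Returns (return code, combined stdout and stderr as text).
    """
    exe_path = os.path.join(os.path.abspath(folder), exe_name)
    result = subprocess.run(
        [exe_path],
        cwd=folder,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout
```

For example, `find_executables(tmp_dir)` followed by `rerun(tmp_dir, name)` on the returned file name would print whatever the binary writes before it stops, which is often more informative than the wrapper's summary message.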

Just a quick question: have you tried the new version of Telemac, v6p3r1? It has better error handling.

Hope it helps

can't run in parallel - maybe installation errors 11 years 2 weeks ago #10881

  • 716469
Hi Yugi,

Thanks again for helping me. Yes, I can run telemac2d and telemac3d from the terminal in interactive mode (I was calling it running in serial on the cluster; I think it is the same thing), but I cannot run postel3d there.
I have not run postel from the temporary folder before, so I hope I have done it correctly now. There is an output folder with the time in its name; I went in there and found these files: .sortie; CONFIG; out-postel3dv6p2; POSCAS; POSDICO; POSGEO and POSPRE. So I reran: runcode.py postel3d -s out-postel3dv6p2 and got some activity on the screen, but it was basically telling me that there is no value for keyword ......(long list of codes). I probably picked the wrong file to run. Sorry.

I tried v6p3 as well but got lots of warnings when compiling. However, I did that when I had metis-4.0.3, so I can try again now with metis-5.1.0.

Thanks for the directions, your help is much appreciated.

Kind Regards!

Violet

can't run in parallel - maybe installation errors 11 years 2 weeks ago #10885

  • 716469
Hi Yugi,

Sorry to bother you again. I followed your advice and compiled v6p3, and it went fine. Thanks a mil. I can run telemac2d and telemac3d in interactive mode on the cluster (in serial), but again not postel. I got an error message that I think might be simple to sort out. I got the same one when I ran a telemac cas file; I just entered Parallel Processors = 1 and it ran fine even though it was not in parallel. But I do not know how to do that for the postel cas file, as there is no such option. I attach the postel error message.
However, I still cannot run a cas file on the cluster in parallel; it goes into the job queue but nothing happens. Maybe I need to specify some cluster directories in the systel.cfg file?
Thanks for helping me.

Violet