
TOPIC: Recollection Problem

Recollection Problem 12 years 8 months ago #4084

  • bhunter
Not necessarily only a Telemac3D problem, but this is the program I am working with. I am having a problem with the "recollection" of the results after a run using multiple processors. It occurs on one of my machines but not on the other.

On the first machine, the end of the run is as follows:
--- WATER ---
INITIAL MASS : 9100673.
FINAL MASS : 8795247.
MASS LEAVING THE DOMAIN (OR SOURCE) : 305552.9
MASS LOSS : -127.7485

EXITING MPI

CORRECT END OF RUN

ELAPSE TIME :
36 MINUTES
16 SECONDS
recollectioning: T3DHYD
recollectioning: T3DRES
copying: t-2d-10m-1sf.res
copying: t-10m-1sf.res

My work is done


On the second machine it gets to the message "EXITING MPI" and exits to the terminal command prompt without any error messages (not even when the DEBUGGER variable is set in the steering file). The temporary directory remains, the partitioned result files are present, and the computation has finished.

I have tried, without any luck, to identify why the recollection routine ("gretel" I believe) is being called on one machine and not on the other. The message "EXITING MPI" is not in the runcode.py script and I have not yet found where it is in the source code.

Does anyone have any suggestions?
If nothing else, I would appreciate the command-line format to run the recollection routine, and a list of any libraries that may have to be copied to the directory containing the computation partitions in order to run it.

Machine and software information below

Thanks in advance
Bruce


The two machines:
Machine 1 (recollection finishes) - notebook with Intel Core2 CPU (dual core)
Machine 2 (no recollection) - desktop with AMD 4 core processor

Both machines should be configured the same: running Ubuntu 11.10, a TELEMAC installation with Python (2.7.2) and OpenMPI. The only difference is that Machine 1 also has MPICH2 installed, but from what I can see all links point to OpenMPI. This has been identified as a problem in other discussions but should not be a concern here, as the computation is running over multiple processors.

The installation compiled successfully on both machines. On Machine 2 I had to remove the "-lz" flag from the "cmd_exe =" command in the systelcfg file, as per other discussions. I am not sure why this had to be done on the second machine and not on the first.
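For anyone making the same change, it amounts to dropping the trailing -lz from the link command. The lines below are only illustrative, as the exact compiler flags depend on the installation:

cmd_exe = gfortran -o <exename> <objs> <libs> -lz    # original entry
cmd_exe = gfortran -o <exename> <objs> <libs>        # after removing -lz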

Re: Recollection Problem 12 years 8 months ago #4087

  • sebourban
  • Administrator
  • Principal Scientist
Hello,

Indeed, GRETEL is the program that re-assembles the results once the simulation is completed. Here is a little more information on the procedure ...

The running script (Python in your case: telemac3d.py, which calls runcode.py) acts only as a manager for the whole procedure. It creates the temporary directory, copies the input files, calls on PARTEL (to split the domain if parallel), calls on the executable of the TELEMAC module, calls on GRETEL (to assemble the results if parallel), copies the output files and finally deletes the temporary directory. If running in scalar mode, the call to the TELEMAC executable is just that. If running in parallel mode, that call is in fact a call to your MPI command/service (mpiexec...), which itself includes the name of the TELEMAC executable.

This is why you will not find TELEMAC error messages in the Python running scripts, and also why you will not find MPI error messages in either the TELEMAC sources or the running scripts.
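As a rough shell-level sketch of that sequence for a parallel run on 4 processors (the file names and the PARTEL details are illustrative, not the exact calls made by runcode.py):

mkdir tmp_dir && cp <input files> tmp_dir/   # create the temporary directory, copy the inputs
cd tmp_dir
partel < partel.par              # split the domain if parallel
mpiexec -n 4 out_telemac3d       # call the TELEMAC executable through MPI
gretel_autop < gretel.par        # assemble the results if parallel
cp T3DRES <back to case dir>     # copy the outputs; the temporary directory is then deleted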

The problem you have seems to be related to your installation of MPI (OpenMPI in your case) -- it is possible that it is processor dependent. Is it possible that the versions differ? Or can you set your gfortran optimisation to no higher than -O3?
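To compare the two installations, the following should reveal which MPI each machine actually runs and which compiler versions are in use (standard commands on Ubuntu):

readlink -f $(which mpiexec)   # shows which MPI the mpiexec link resolves to
mpiexec --version              # reports the MPI flavour and version
gfortran --version             # compare the compiler versions between machines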

It seems your simulation fails before it finishes, and therefore before it goes into GRETEL. (I am not 100% sure myself why the -lz has to be removed on some computers.)

Please note that every (binary) file in the temporary directory is also a SELAFIN file (T2DRES and T2DHYD, etc.). You can therefore look at these individually.

Finally, here is the information to run GRETEL (on Linux from within your temporary directory). You can find this in the runGRETEL function of runcode.py:

gretel_autop < gretel.par
where gretel.par is a local file of 3 lines (1: name of the file to be recollected; 2: name of the global GEO file, i.e. T2DGEO; 3: the number of splits), for instance:
T2DRES
T2DGEO
2
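
For instance, for a Telemac3D case split over 4 processors, the whole manual step could look like this from within the temporary directory (a sketch; adjust the file names to your case):

printf 'T3DRES\nT3DGEO\n4\n' > gretel.par   # file to recollect, global GEO file, number of splits
gretel_autop < gretel.par                   # re-assembles the partitioned T3DRES into a single file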

Hope this helps.

Sébastien.

Re: Recollection Problem 12 years 7 months ago #4207

  • ails
  • Senior Boarder
Hello,

A few more comments :

a) "-lz" stands for the ZLIBrary (data compression).
It's only requested by Telemac when support for the MED library is asked (not your case for instance).
This library is not standard on Linux (supported on Debian at least).
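On Ubuntu/Debian you can check for it and install it with the standard packages:

ldconfig -p | grep libz           # check whether the zlib runtime is present
sudo apt-get install zlib1g-dev   # install the development library that -lz links against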

b) "EXITING MPI" is a fortran message which can be found in the parallel library (p_exit I think). It is written by every process in the PE*log files, located in the temporary directory.
As Sebastien told you, it may happen that your job stops before exiting MPI, preventing the job to go the Gretel step.
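A quick check from within the temporary directory shows which processes reached the end (the exact PE* file naming varies between versions):

grep -l 'EXITING MPI' PE*   # per-process logs that reached the end
grep -L 'EXITING MPI' PE*   # per-process logs that did not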

Regards,

Fabien

Re: Recollection Problem 12 years 7 months ago #4212

  • bhunter
Sebastien/Fabien

Thanks for the information.

Regarding the "-lz" flag: on my laptop I have been playing around with a number of programs, and I would guess that the required libraries were installed, which allowed the TELEMAC system to compile with the flag included in the systel.cfg file. The desktop has only the OS and TELEMAC, hence the missing libraries.

For your and others' reference regarding exiting before recollection of the partitions: I ran a couple of tests, recompiling the TELEMAC routines using MPICH and/or compiling at a lower optimisation. None of the scenarios corrected the problem. When I have the time I will attempt this on another couple of boxes to see whether it is a function of the processor type (AMD vs Intel). If not, I will have a look into the routine. However, it is not a big problem to perform the recollection manually.

Thanks again

Bruce

Re: Recollection Problem 12 years 7 months ago #4228

  • c.coulet
  • Moderator
Hi
Gretel doesn't use MPI functions nor METIS. I'm quite sure that compiler options on Gretel don't change the result.

It's strange that it works on one machine and doesn't work on the other.
As you said it works manually, I'm thinking of a configuration problem. Maybe you should have a look at the configuration of each computer to see the differences?
Is your problem fully reproducible?

Regards
Christophe

Re: Recollection Problem 12 years 7 months ago #4229

  • bhunter
As I can use Gretel manually, this problem is not a big concern for me at present. However, if or when I set up systems for others I may have to look into things more deeply. I was hoping that this problem had been seen previously and there was a known solution. When I have the time, I will experiment to try to find the cause.
Areas to look at: like in the old DOS days, add a pause between routines to allow memory flushing and disk access to finish (see the sketch below); recompile with options for the specific processor; try a different compiler.
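For the pause idea, the shell-level equivalent would be something like the following between the MPI run and the recollection (a sketch only; the actual calls are made inside runcode.py):

mpiexec -n 4 out_telemac3d   # the parallel computation
sync && sleep 2              # flush pending disk writes before recollecting
gretel_autop < gretel.par    # then attempt the recollection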

Additional background information.
Since I performed the installation, and based on a quick review of the two machines, they should be configured the same (same versions of the OS (Xubuntu), gfortran and Python), with the same directory structure and configuration files (except for the "-lz" flag removed from systel.cfg on the desktop) and the latest source files. The only major differences are the processors and the RAM available. As I indicated above, the desktop is a clean machine and I would have expected no problems with it.

The problem is consistent on the desktop for all runs performed with multiple processors (2-4 processors). I have run the same models on the laptop without experiencing the problem, so I don't think it is an error code generated by one of the routines.

I have an additional problem on the desktop that may be related. At the completion of a run using Telemac3D coupled with Sisyphe, the result files are copied from the temp directory, but the temp directory is not removed and some of the summary information (time of run) is not printed to the screen. Sorry, I don't have the error message but will post it when my model has finished. These models are NOT run on multiple processors, as I have portions of the bed being non-erodible and I have not gone through the code to figure out how to transfer this information to the multiple partitions. Telemac3D run uncoupled finishes in a normal manner on both machines.