TOPIC: TELEMAC 3D hanging

TELEMAC 3D hanging 11 years 11 months ago #6693

  • duttas
Hello,


I have been running TELEMAC 3D on our cluster. Until now it has worked fine, but in the last few simulations the code hangs just before the simulation finishes.

To elaborate: say the simulation has to run for 79,800 s with a 0.5 s time step (159,600 iterations in total), and the graphical output interval ("graphical printout") is set to 1200 iterations. What happens is that, after running fine for 158,392 iterations, the simulation stops progressing. No further output is written to the result (.slf) files, and nothing more is added to the "***.sh.o****" file, which contains the "listing printout" (updated every 2 iterations).
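
The figures above can be checked with a few lines of arithmetic. The sketch below takes its values from this post and assumes (our assumption, not confirmed by the TELEMAC documentation) that graphical outputs fall on exact multiples of the 1200-iteration period counted from iteration 0. Under that assumption, iteration 158,392 lies only 8 iterations before the graphical output at iteration 158,400.

```python
# Worked check of the run parameters quoted in the post.
# Assumption: graphical outputs occur at exact multiples of GRAPHIC_PERIOD,
# counting from iteration 0.

DURATION_S = 79_800        # total simulated time [s]
TIME_STEP_S = 0.5          # time step [s]
GRAPHIC_PERIOD = 1_200     # graphical printout period [iterations]
LISTING_PERIOD = 2         # listing printout period [iterations]
HANG_ITERATION = 158_392   # last iteration reached before the hang


def run_summary():
    total_iterations = round(DURATION_S / TIME_STEP_S)
    listing_printouts = total_iterations // LISTING_PERIOD
    # Next graphical output at or after the hang iteration (ceiling division).
    next_output = -(-HANG_ITERATION // GRAPHIC_PERIOD) * GRAPHIC_PERIOD
    return {
        "total_iterations": total_iterations,
        "listing_printouts": listing_printouts,
        "next_graphic_output": next_output,
        "iterations_before_output": next_output - HANG_ITERATION,
    }


if __name__ == "__main__":
    for key, value in run_summary().items():
        print(f"{key}: {value}")
```

The same numbers also show the scale of the listing: 79,800 listing printouts over the full run.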

If I then stop the simulation using "qdel ***job number***", an error file is generated saying "close failed, error no. 32, broken pipe".
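
For reference, the number in that message is a standard C errno value. On Linux, 32 is EPIPE ("Broken pipe"), which a process gets when it writes to or flushes a pipe whose reading end has already gone away. That fits a job being killed while its output was still being piped. Whether it says anything about the original hang is an assumption, not something the error itself shows. Python's standard library can decode such numbers:

```python
import errno
import os

# Error number quoted in the message produced after qdel.
code = 32

# Symbolic name and message for this errno value on the current platform.
# On Linux (and macOS/BSD) 32 is EPIPE, "Broken pipe".
name = errno.errorcode.get(code, "unknown")
message = os.strerror(code)
print(f"errno {code}: {name} ({message})")
```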

TELEMAC 3D hanging 11 years 11 months ago #6694

  • pham
  • Administrator
  • Posts: 1559
  • Thank you received: 602
Hello duttas,

Can you please send the whole error file and the last lines of the listing printout file? (Writing every 2 iterations for a 159,600-iteration run is quite frequent; your file must be big!)
Moreover, can you look at the PE***.LOG files in the temporary directory (one per subdomain), in particular at their sizes? If one of them has a different size, that subdomain is suspicious. The exception is if you have written something special for a specific location: some error messages are written only for specific subdomains, not for all of them, so please read the suspicious files carefully. The listing printout file is only the PE***.LOG file of one specific subdomain, not a combination of all of them.
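
The size check suggested above can be scripted. This is a minimal sketch, assuming the per-subdomain logs match the glob pattern `PE*.LOG` in the run's temporary directory. The pattern is inferred from the "PE***.LOG" naming in this thread, and the exact file names depend on the TELEMAC version. The script reports the files whose size differs from the most common size.

```python
from collections import Counter
from pathlib import Path


def suspicious_logs(tmp_dir, pattern="PE*.LOG"):
    """Return {file name: size in bytes} for the logs whose size differs
    from the most common size among all files matching `pattern`."""
    sizes = {p.name: p.stat().st_size for p in sorted(Path(tmp_dir).glob(pattern))}
    if not sizes:
        return {}
    # The size shared by most subdomains is taken as the "normal" one.
    typical_size, _ = Counter(sizes.values()).most_common(1)[0]
    return {name: size for name, size in sizes.items() if size != typical_size}


if __name__ == "__main__":
    import sys

    # Usage: python check_logs.py /path/to/temporary/directory
    tmp_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, size in suspicious_logs(tmp_dir).items():
        print(f"{name}: {size} bytes (differs from the majority)")
```

As noted above, a differing size is only a hint. The flagged files still need to be read.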

Hope this helps,

Chi-Tuan

TELEMAC 3D hanging 11 years 11 months ago #6704

  • duttas
Thank you for your answer.

I checked the PE***.LOG files. One of them (domain 7 out of 15; I was using 16 processors) is slightly larger than the others, and the only difference is that it contains one extra output record (which is not complete).
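
The comparison described above can also be automated with Python's `difflib`. This is a minimal sketch: it diffs the last `tail` lines of a normal log against the suspicious one and returns the lines found only in the suspicious file. The `tail` default of 200 is an arbitrary choice that keeps the diff fast on very long logs, and it assumes the difference is near the end, as reported here.

```python
import difflib
from pathlib import Path


def extra_lines(reference_log, suspicious_log, tail=200):
    """Return the lines that appear only in `suspicious_log`, comparing the
    last `tail` lines of each file."""
    ref = Path(reference_log).read_text(errors="replace").splitlines()[-tail:]
    sus = Path(suspicious_log).read_text(errors="replace").splitlines()[-tail:]
    # autojunk=False: log files repeat many lines, which the default
    # heuristic would otherwise treat as junk.
    matcher = difflib.SequenceMatcher(a=ref, b=sus, autojunk=False)
    extra = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            extra.extend(sus[j1:j2])
    return extra


# Usage (file names are placeholders):
# print("\n".join(extra_lines("PE_normal.LOG", "PE_suspicious.LOG")))
```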

Neither the general nor the domain-specific LOG files contain any error messages; they simply stop updating. The odd thing is that, although the job does not seem to be moving forward, it is not removed from the cluster queue.

The last lines of the listing printout file are:

ADVECTION-DIFFUSION OF VELOCITIES STEP
PROPAGATION AND DIFFUSION WITH WAVE EQUATION

TELEMAC 3D hanging 11 years 11 months ago #6708

  • pham
  • Administrator
Hello,

When you say "in the last few simulations the code just hangs just before the simulation finishes", is it the same simulation you are trying to run, or different ones? Does it stop at the same physical time (whether or not it is a different simulation)? What did you change between when it worked and now? If you change the number of time steps (to fewer or more than the iteration at which it stops), does your simulation stop at the same time step?
Before deleting the computation with qdel, can you check the time of the last change in the temporary directory? It is a way to see whether there is a problem in your computation, e.g. (at least) one core waiting for another one.
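
On Linux, `ls -lt` in the temporary directory gives this information directly. An equivalent Python sketch lists the most recently modified files with their timestamps. If all subdomain logs stopped changing at about the same time while the job is still queued, that is consistent with the "one core waiting for another" hypothesis, though it does not prove it.

```python
import time
from pathlib import Path


def last_modified(tmp_dir, count=10):
    """Return the `count` most recently modified files in `tmp_dir` as
    (local modification time, file name) pairs, newest first."""
    files = [p for p in Path(tmp_dir).iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [
        (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(p.stat().st_mtime)), p.name)
        for p in files[:count]
    ]


if __name__ == "__main__":
    import sys

    # Usage: python last_changes.py /path/to/temporary/directory
    tmp_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for stamp, name in last_modified(tmp_dir):
        print(stamp, name)
```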
You do not seem to reach the wall time, but I want to check with you that this is not the problem (it was the problem for a previous user, as he admitted).

In any case, can you try running the same simulation with a different number of cores (e.g. 8, 12 or 32 if you can) and check whether the problem still occurs? Alternatively, you can run the computation up to just before the iteration where the problem occurs, and then restart from that field.
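
To pick the restart point, note that a restart can only start from a time for which a result record exists. This sketch assumes the result file holds records only every 1200 iterations (the graphical printout period), counted from iteration 0. Under that assumption, the latest usable record before the hang at iteration 158,392 is at iteration 157,200 (78,600 s), which leaves 2,400 iterations for the continued run. The sketch does not cover how the restart itself is configured in the steering file.

```python
# Restart-point arithmetic under the stated assumption that result records
# exist only at multiples of GRAPHIC_PERIOD, counted from iteration 0.

TIME_STEP_S = 0.5          # time step [s]
GRAPHIC_PERIOD = 1_200     # graphical printout period [iterations]
TOTAL_ITERATIONS = 159_600
HANG_ITERATION = 158_392   # last iteration reached before the hang


def restart_plan():
    # Last record completed before the hang: floor to the printout period.
    restart_iteration = (HANG_ITERATION // GRAPHIC_PERIOD) * GRAPHIC_PERIOD
    return {
        "restart_iteration": restart_iteration,
        "restart_time_s": restart_iteration * TIME_STEP_S,
        "remaining_iterations": TOTAL_ITERATIONS - restart_iteration,
    }


if __name__ == "__main__":
    for key, value in restart_plan().items():
        print(f"{key}: {value}")
```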

Hope this helps,

Chi-Tuan