TOPIC: Parallel Simulation

Parallel Simulation 12 years 1 month ago #5862

  • sumit
Dear All,

I would be greatly obliged if someone could look at my script file and tell me why my jobs are not picked up when I go beyond 12 processors.

For your information, the cluster I am using has 12 processors per node. My system admin suggests setting some flags in TELEMAC so that I can see what is going wrong, but I am not sure how to set such flags.

The attached "Bode.out" file shows what happens when I increase the number of procs from 12 to 24. Clust.txt is my submission script file.

File Attachment:

File Name: CLUST_2012-10-09.txt
File Size: 2 KB

Any and all help is greatly appreciated.

Best regards,
Sumit
The administrator has disabled public write access.

Parallel Simulation 12 years 1 month ago #5863

  • sumit
Here is the out file:

File Attachment:

File Name: Bode.txt
File Size: 1 KB

Parallel Simulation 12 years 1 month ago #5879

  • ails
  • Senior Boarder
  • Posts: 140
  • Thank you received: 17
Dear Sumit,

I'm a bit suspicious about the machinefile you're using. Nothing happens after launching mpirun.

- What does $TMPDIR/machines contain? -machinefile expects a file listing the nodes provided by the job scheduler (with as many lines as $ncsize).
- The pe_hostfile you posted looks right. Does the $pe_hostfile variable refer to this file? If so, mpirun -machinefile $pe_hostfile ... may work.
- Can you run interactive jobs on your cluster? That would make it easier to debug the mpirun sequence.
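
To illustrate the "as many lines as $ncsize" point, here is a minimal Python sketch that expands a pe_hostfile into a one-host-per-process list and checks its length. It assumes the Grid Engine convention where each pe_hostfile line reads "host slots queue processor-range"; the node and queue names below are made up, and the helper names are hypothetical, not part of TELEMAC.

```python
def expand_pe_hostfile(pe_hostfile_text):
    """Return one hostname per slot from pe_hostfile-style text.

    Assumption: each line looks like "<host> <slots> <queue> <processor-range>".
    """
    hosts = []
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        hosts.extend([host] * slots)
    return hosts


def check_machinefile(hosts, ncsize):
    """True when there is exactly one machinefile line per requested process."""
    return len(hosts) == ncsize


# Hypothetical allocation: two nodes with 12 slots each.
sample = (
    "node01 12 all.q@node01 UNDEFINED\n"
    "node02 12 all.q@node02 UNDEFINED\n"
)
machines = expand_pe_hostfile(sample)
print(len(machines), check_machinefile(machines, 24))
```

If the list comes out shorter than the number of processes requested, that would be one possible reason for a run failing beyond a single node.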

As a backup solution: with OpenMPI, you can also use orterun instead of mpirun. The sequence is very similar: orterun -hostfile machinefile -np ...

Best regards,

Fabien Decung

Parallel Simulation 12 years 1 month ago #5890

  • sumit
Dear Fabien,

I am attaching my script file as well as my out file. I have not been able to solve the problem so far.

File Attachment:

File Name: CLUST_2012-10-11.txt
File Size: 2 KB

In CLUST.txt, line 35 contains -machinefile "$TMPDIR/machines", but it is commented out and the next line is used instead.

Strangely, line 35 works fine and lets me submit jobs with 12 procs; my sys-admin tells me that $TMPDIR/machines is generated by the scheduler and that I don't need to worry about it. The problem is that neither line 35 nor line 36 can submit jobs with more than 12 procs. Using line 36, I get the message shown in the attached .out file.

File Attachment:

File Name: Bode_2012-10-11.txt
File Size: 1 KB


Please let me know if you have any more ideas.

Best regards,
Sumit

Parallel Simulation 12 years 1 month ago #5892

  • ails
Dear Sumit,

I can only suggest that you try a lighter script, because the errors may come from your script (machinefile?), from the MPI library (though I'm pretty confident in your admin) or from Telemac itself.

So, you should separate these tasks:
(1) submit a job with the -t option of runcode.py. It will generate both the TMP directory and the executable;
(2) write a very minimal script in which you run the executable directly with mpirun from the TMP directory (see the attached script). Try with 12 and 24 procs.

=> If it doesn't work, try adapting the mpirun syntax.
=> If it works straight away, I guess there is something wrong in your config and/or the runcode.py script.
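
As a rough illustration of step (2), the sketch below only assembles the mpirun command lines for 12 and 24 processes; it does not launch anything. The executable name ./my_telemac_exe and the machinefile name machines are placeholders (the real names depend on what the -t run generated), and the helper function is hypothetical.

```python
def mpirun_command(executable, nprocs, machinefile=None):
    """Build an mpirun argument list; -np and -machinefile are standard mpirun options."""
    cmd = ["mpirun", "-np", str(nprocs)]
    if machinefile is not None:
        cmd += ["-machinefile", machinefile]
    cmd.append(executable)
    return cmd


# Placeholder names: replace with the executable found in the TMP directory.
for nprocs in (12, 24):
    print(" ".join(mpirun_command("./my_telemac_exe", nprocs, "machines")))
```

To actually run one of these from the TMP directory, the list could be passed to subprocess.run(cmd, cwd=tmp_dir, check=True), with tmp_dir pointing at the generated directory.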

Some ideas / improvements :
(1) Delete anything related to the HPC mode in your config file ("hpc_cmdexec: chmod 755 <hpc_stdin>; qsub < <hpc_stdin>", HPC in "options")
(2) Instead, add the QSUB submission at the end of your bash script where you only run "telemac2d.py Bode2DNew.cas" (see the attached sample script, not tested however)
(3) And then, I see no reason for re-generating the config file.

Best regards,

Fabien Decung

Parallel Simulation 12 years 1 month ago #5907

  • sumit
Dear Fabien,

Thanks a ton for all your suggestions. I had another round of talks with my sys-admin, we decided to go with the gcc compiler, and everything worked out like a song :)

Once again thanks a lot,

Best regards,
Sumit

User-Subroutine-Compilation-Problem 12 years 1 month ago #5988

  • sumit
Dear All,

I have a minor problem and maybe someone can help me. Please find attached my submission script, "CLUST-new.txt".

File Attachment:

File Name: CLUST-new.txt
File Size: 2 KB


I am trying to run telemac3d together with a user subroutine. Without any user subroutine everything works, but when I add one I get the following out file.

File Attachment:

File Name: MBNBOUT.txt
File Size: 7 KB


Please let me know if you have any suggestions on how I can fix this.

Thanks,
Sumit

Parallel Simulation 12 years 1 month ago #5989

  • sebourban
  • Administrator
  • Principal Scientist
  • Posts: 814
  • Thank you received: 219
Hello,

This is strange: if it works when compiling your entire system, it should work the same way when compiling at run time.

Your error is "cannot find -lz", i.e. the linker cannot find the zlib library.
I do not remember exactly when you need that library (please check other answers by Fabien on this forum), but if you need it, try searching for zlib1g-dev (the Debian/Ubuntu name; the yum equivalent is zlib-devel) with your system's package manager (yum, aptitude, etc.) and install it.

I then recommend that you recompile your system with -m "clean system".

Hope this helps.
Sébastien.

Parallel Simulation 12 years 1 month ago #5990

  • sumit
Hello Sebastien,

Thanks a ton for your suggestion. But this is strange: when I run "which lz" on the command line, it points to /usr/bin/lz, yet the error is still there.

And again if I don't use the user subroutine everything runs just fine.

Thanks,
Sumit

Parallel Simulation 12 years 1 month ago #5991

  • sebourban
Yes, I thought I remembered Fabien saying you might not need the library.

Note that the executable lz (which you found in /usr/bin/) has nothing to do with the library "z" (also called zlib, I believe). With gfortran, you add a library to your compilation with the option "-l" followed by the name of the library, here "z", i.e. "-lz".
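
To make the difference concrete: "which lz" searches PATH for a program named lz, whereas "-lz" asks the linker for a file named libz.so or libz.a in its library directories. The Python sketch below is a simplified imitation of that lookup, not the linker's real algorithm; the function names are hypothetical, and the demonstration uses empty files in a throwaway directory rather than real system paths.

```python
import os
import shutil
import tempfile


def link_candidates(name):
    """File names that "-l<name>" looks for: shared library first, then static."""
    return ["lib%s.so" % name, "lib%s.a" % name]


def find_link_library(name, search_dirs):
    """Return the first matching library path in search_dirs, or None."""
    for directory in search_dirs:
        for filename in link_candidates(name):
            path = os.path.join(directory, filename)
            if os.path.exists(path):
                return path
    return None


# Demonstration with empty placeholder files in a temporary directory.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "libz.so.1"), "w").close()  # runtime library only
before = find_link_library("z", [demo_dir])
open(os.path.join(demo_dir, "libz.so"), "w").close()  # what a -dev package adds
after = find_link_library("z", [demo_dir])
print(before, after)
shutil.rmtree(demo_dir)
```

The first lookup finds nothing because only the versioned runtime file is present, which is the typical situation before the development package is installed.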

Glad to see this helped.

Sébastien.