TOPIC: Parallel-failed after adding Batch scripts into configuration file

Parallel-failed after adding Batch scripts into configuration file 7 years 9 months ago #25060

  • huyquangtran
Hi,

Previously, I successfully ran TELEMAC on up to 8 processors, but only on a single node. Now I want to run TELEMAC across multiple nodes (just 2 nodes for now), so I added the following lines to the configuration file:

hpc_stdin: #!/bin/bash
#SBATCH -p physical
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --mem 16G
#SBATCH --time=0:60:00

I got the errors below. Could someone help? Thanks a lot.

Best Regards
Huy

The MPI_Comm_f2c() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[spartan-rc001.hpc.unimelb.edu.au:14593] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
There are not enough slots available in the system to satisfy the 16 slots
that were requested by the application:
/home/huyquangtran/telemac/v7p2/LINUX-RUN/T3D.cas_2017-01-31-17h43min34s/out_tide_wind

Either request fewer slots for your application, or make more slots available
for use.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

... merging separated result files

+> T3D.cas
collecting: T3DRES
runRecollection:
|runGRETEL: Could not split your file T3DRES (runcode=1) with the error as follows:
|
|... The following command failed for the reason above (or below)
|/home/huyquangtran/telemac/v7p2/builds/parallel/bin/gretel < gretel_T3DRES.par >> gretel_T3DRES.log
|
|Here is the log:
|
| GRETEL: TELEMAC MERGER
| VERSION V7P2R0
| HOLGER WEIL BEER (BAW)
| JEAN-MICHEL HERVOUET (LNHE)
| YOANN AUDOUIN (LNHE)
| GRETEL (C) COPYRIGHT 2003-2012
| BUNDESANSTALT FUER WASSERBAU, KARLSRUHE
|
| MAXIMUM NUMBER OF PARTITIONS: 100000
|
| --GLOBAL GEOMETRY FILE:
| INPUT: T3DGEO
| --GEOMETRY FILE FORMAT <FFORMAT> [MED,SERAFIN,SERAFIND]:
| INPUT: SERAFIN
| --RESULT FILE:
| INPUT: T3DRES
| --RESULT FILE FORMAT <FFORMAT> [MED,SERAFIN,SERAFIND]:
| INPUT: SERAFIN
| --NUMBER OF PARTITIONS <NPARTS> [2 -100000]:
| INPUT: 16
| --NUMBER OF PLANES:
| INPUT: 10
|
| ERREUR 2 LORS DE L APPEL A OPEN_MESH_SRF:OPEN
| TEXTE DE L'ERROR : UNKNOWN_ELT_TYPE_ERR
|
| PLANTE: PROGRAM STOPPED AFTER AN ERROR
| RETURNING EXIT CODE: 2
|
slurmstepd: error: Exceeded step memory limit at some point.
srun: error: spartan-rc001: task 0: Exited with exit code 1
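
One thing that stands out in the output above: the launcher asks for 16 slots, while the batch header requests `--ntasks=8`. The thread does not say where the 16 comes from, so the following is only a minimal sketch of a guard that could run before the solver starts. The function name `check_slots` is hypothetical; `SLURM_NTASKS` is the variable Slurm sets inside a job when a task count is requested, and the fallback value 8 is taken from the header above.

```shell
#!/bin/bash
# Hypothetical guard: compare the number of MPI processes a run will ask
# for with the number of tasks Slurm actually allocated.

# check_slots REQUESTED ALLOCATED
# Prints a diagnosis; returns 0 when the request fits, 1 otherwise.
check_slots() {
    local requested=$1
    local allocated=$2
    if [ "$requested" -gt "$allocated" ]; then
        echo "requested $requested MPI processes but only $allocated slots are allocated"
        return 1
    fi
    echo "ok: $requested MPI processes fit in $allocated slots"
    return 0
}

# Values from the thread: the log asks for 16 slots, the header sets --ntasks=8.
# Inside a real job SLURM_NTASKS holds the allocated task count.
check_slots 16 "${SLURM_NTASKS:-8}" || true
```

If the guard reports a mismatch, either the number of processors requested for the run or the `#SBATCH --ntasks` value would need to change so that the two agree.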

Parallel-failed after adding Batch scripts into configuration file 7 years 9 months ago #25061

  • yugi
Hi,

Could you show what is in your temporary folder?
This is an error you usually get when a file is missing or incorrect.
Could you check that all the PE000...log files finished properly?
There are 10 types of people in the world: those who understand binary, and those who don't.

Parallel-failed after adding Batch scripts into configuration file 7 years 9 months ago #25062

  • huyquangtran
Hi Yoann,

The list of files in the temporary folder is attached.

The total size is about 5.1 GB.

Thanks
Huy

1_2017-01-31.jpg

Parallel-failed after adding Batch scripts into configuration file 7 years 9 months ago #25063

  • yugi
There does not seem to be any T3DRES0000...
It looks like your job did not run in parallel.
Look at the beginning of the listing; you should see a warning telling you that.
You can find the listing in the file T2D_cas...sortie
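
A quick way to make that check repeatable is to count the per-partition result files in the temporary folder. This is a minimal sketch only: the function name `count_partition_files` is hypothetical, and the pattern (result file name followed by a digit) is an assumption based on the `T3DRES0000...` form mentioned above, so adjust it to what your run actually produces.

```shell
#!/bin/bash
# count_partition_files DIR PREFIX
# Counts files in DIR whose names are PREFIX followed by at least one digit
# (assumed naming of per-partition files; adjust to your run).
count_partition_files() {
    local dir=$1
    local prefix=$2
    local n=0
    local f
    for f in "$dir"/"$prefix"[0-9]*; do
        # With no match the glob stays literal, so check the file exists
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example: count partition files for T3DRES in the given folder
# (defaults to the current directory).
count_partition_files "${1:-.}" T3DRES
```

A count of 0 would be consistent with a run that never started in parallel; a count equal to the number of partitions suggests the solver at least opened its per-partition outputs.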

Parallel-failed after adding Batch scripts into configuration file 7 years 9 months ago #25064

  • c.coulet
Hi
Maybe there is also a problem with the MPI installation...
If the daemon doesn't run on all nodes, the MPI execution aborts!
Regards
Christophe

Parallel-failed after adding Batch scripts into configuration file 7 years 9 months ago #25065

  • huyquangtran
Dear Yoann and Christophe,

I was able to run in parallel on several nodes using the cloud partition, but it is sometimes very slow (running on a single node with 8 processors is no problem). My HPC admin advises against running multi-node jobs on the cloud partition, so I went back to the physical partition.

Do you think the error could be caused by a storage limit? For example, I have only 10 GB of space to save data on the HPC.

Thanks & Best Regards
Huy

Parallel-failed after adding Batch scripts into configuration file 7 years 9 months ago #25073

  • huyquangtran
Hi Yoann,

The listing in the file T2D_cas...sortie shows:


There are not enough slots available in the system to satisfy the 16 slots
that were requested by the application:
/home/huyquangtran/telemac/v7p2/LINUX-RUN/T3D.cas_2017-02-01-14h21min07s/out_tide_wind

Either request fewer slots for your application, or make more slots available
for use

Do you have any ideas on how to solve the problem?

Best Regards
Huy

Parallel-failed after adding Batch scripts into configuration file 7 years 9 months ago #25074

  • yugi
Looks like multi-node is not working.
You should try running a basic hello world.
Something like this:
!  Fortran example: each MPI rank prints one line
      program hello
      implicit none
      include 'mpif.h'
      integer rank, size, ierror

!     Start MPI, then query the total rank count and this rank's id
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      print*, 'node', rank, 'of', size, ': Hello world'
      call MPI_FINALIZE(ierror)
      end program hello

Parallel-failed after adding Batch scripts into configuration file 7 years 9 months ago #25075

  • huyquangtran
Hi Yoann,

Do you mean testing hello.f in Fortran by running the command:

gfortran hello.f -o executable?

Thanks
Huy

Parallel-failed after adding Batch scripts into configuration file 7 years 9 months ago #25076

  • yugi
I mean:
* Compile the hello Fortran program as you said.
* Then try to run it on multiple nodes using your batch file for TELEMAC.
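
The two steps above can be sketched as a small batch script that reuses the header from the first post. Several things here are assumptions rather than anything confirmed in the thread: the file name `hello_mpi.sbatch` is hypothetical, the program is assumed to be saved as `hello.f`, `mpif90` is assumed to be the MPI compiler wrapper available on the cluster (plain `gfortran` usually needs extra flags to find `mpif.h`), and `srun` is assumed to be able to launch MPI processes there.

```shell
#!/bin/bash
# Write a hypothetical batch script that builds and runs the hello-world test.
cat > hello_mpi.sbatch <<'EOF'
#!/bin/bash
#SBATCH -p physical
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --time=0:05:00

# Assumption: mpif90 is the MPI compiler wrapper on this cluster
mpif90 hello.f -o hello

# One process per allocated task; some sites need "mpirun ./hello" instead
srun ./hello
EOF

echo "wrote hello_mpi.sbatch; submit it with: sbatch hello_mpi.sbatch"
```

If the multi-node launch works, the job output should contain one line per rank, with ranks 0 through 7. Note that gfortran treats a `.f` file as fixed-form source, so statements must start in column 7 or later; a `.f90` extension selects free form instead.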