Hi José,
I'm touching on a lot of points here, so please let me know if you'd like more detail in any particular area.
We've tried a few external resources, including ARCHER and AWS, and we also have an internal cluster.
If you're reasonably familiar with installing OpenTELEMAC on Linux, getting it compiled and running on most clusters is fairly straightforward. Figuring out the compilers, MPI versions and the job submission process is probably the most time-consuming part.
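To give a flavour of the job submission side, here's a minimal Python sketch of the sort of wrapper you might use to generate and submit a SLURM batch script for a run. The module names, partition, case file and the telemac2d.py invocation are placeholders; substitute whatever your cluster and your OpenTELEMAC build actually use.

#!/usr/bin/env python3
"""Sketch: generate and submit a SLURM batch script for an OpenTELEMAC run.

Module names, the partition and the launcher call are illustrative only;
adjust them to match your cluster and your OpenTELEMAC configuration.
"""
import subprocess
from pathlib import Path


def submit_telemac_job(cas_file: str, cores: int = 120, hours: int = 24) -> str:
    """Write an sbatch script for the given steering (.cas) file and submit it."""
    stem = Path(cas_file).stem
    script = f"""#!/bin/bash
#SBATCH --job-name=telemac_{stem}
#SBATCH --ntasks={cores}
#SBATCH --time={hours}:00:00
#SBATCH --partition=compute        # placeholder partition name

# Placeholder environment setup: load the compiler/MPI stack that
# OpenTELEMAC was built against on this cluster.
module load gcc openmpi

# telemac2d.py is the 2D launcher shipped with OpenTELEMAC;
# --ncsize sets the number of MPI processes for the run.
telemac2d.py {cas_file} --ncsize={cores}
"""
    script_path = Path(f"run_{stem}.sh")
    script_path.write_text(script)

    # sbatch prints "Submitted batch job <id>" on success.
    result = subprocess.run(["sbatch", str(script_path)],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(submit_telemac_job("my_model.cas", cores=120, hours=24))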
ARCHER have recently reduced their prices to 5p per core hour, which makes it very easy to justify. We've never had any problems getting the capacity we need, though note that it's very much optimised for short, huge-scale runs: a job that will take 24 hours might wait 24 hours before it starts, but if you can run the same job efficiently on 24 times the number of nodes, it will probably start within an hour.
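To put the 5p figure in context, a quick back-of-the-envelope sum (the job size is purely illustrative):

# ARCHER cost at 5p per core hour (job size illustrative).
rate = 0.05                      # £ per core hour
core_hours = 120 * 24            # a 24-hour run on 120 cores
print(f"120 cores x 24 h = {core_hours} core hours = £{core_hours * rate:.2f}")  # £144.00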
We've used AWS with some Python wrappers from MIT called StarCluster, which let us easily spin up multiple nodes and set up a cluster on demand. For OpenTELEMAC, we've found these give great performance if you only use one node per run. However, when testing multi-node runs, performance tailed off very quickly, to the point where even a two-node job was not worth running. This might be a fixable problem, but with the speed and cost of ARCHER we've not really looked into it. For small jobs, though, it's so easy to set up that it's definitely worth a mention.
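For reference, the on-demand workflow looks roughly like the sketch below, driven here from Python via the StarCluster CLI. It assumes a cluster template is already defined in ~/.starcluster/config (AMI, instance type, node count) with OpenTELEMAC baked into the image; the cluster tag, paths and run command are placeholders, so check the exact commands against the StarCluster docs.

#!/usr/bin/env python3
"""Sketch of an on-demand AWS run using the StarCluster CLI from Python.

Assumes a cluster template is already defined in ~/.starcluster/config and
that the AMI has OpenTELEMAC installed; names and paths are placeholders.
"""
import subprocess


def sc(*args):
    """Run a StarCluster CLI command and fail loudly if it errors."""
    subprocess.run(["starcluster", *args], check=True)


tag = "telemac-run"   # placeholder cluster name

sc("start", tag)                                        # spin up the EC2 nodes
sc("put", tag, "my_model/", "/home/sgeadmin/my_model")  # copy the model files up
# Run the case on the master node (single-node runs scaled best for us).
sc("sshmaster", tag,
   "cd /home/sgeadmin/my_model && telemac2d.py my_model.cas --ncsize=16")
sc("terminate", tag)   # prompts before tearing the nodes down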
Internally we have a cluster with InfiniBand networking, running CentOS 7 and built on xCAT and the SLURM resource manager, which covers our base requirements. If internal utilisation is high, we will usually use ARCHER for OpenTELEMAC.
Pricing estimates are hard: it depends entirely on your modelling demands. Very roughly, a single compute node with ~30 cores may cost around £3-4k, though note that not all nodes are equal. There are a few additional costs, like networking and disks, but compute nodes should be the biggest cost.
Potential utilisation is probably the most important thing to consider. If you have an entirely flat demand of 120 cores 24/7, buy the hardware. Use it for 3 years plus and you can probably max it out at near 100% at less than half the cost of the equivalent external resource, including power, cooling and admin time.
But more likely you'll have much less predictable demand. In our case I've found anything above 75% is unsustainable: unless you have very uniform demand, you may get a maximum of 50% to 75% utilisation over time, and there could still be peaks you'll need external resources for. Of course, this depends heavily on the urgency of your runs too. If you don't have deadlines, you can submit jobs to a queue and not worry how long they take to complete; then you'll be able to keep the queue full and get much higher utilisation.
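To make the cost comparison concrete, here's a rough hardware-only sum using the figures above (5p per core hour, roughly £3.5k per ~30-core node). Power, cooling and admin time are deliberately left out, and in practice they account for much of the remaining internal cost.

# Rough comparison for 120 cores of demand over 3 years (hardware only;
# power, cooling and admin time are excluded and add a lot on top).
archer_rate = 0.05                 # £ per core hour
node_cost = 3500.0                 # £ per ~30-core node (rough midpoint of £3-4k)
hardware = 4 * node_cost           # 4 x ~30 cores ~= 120 cores -> £14,000

hours = 3 * 365 * 24               # 3 years of wall-clock hours
for utilisation in (1.0, 0.75, 0.5):
    delivered = 120 * hours * utilisation          # core hours actually used
    print(f"{utilisation:>4.0%} utilisation: hardware alone works out at "
          f"£{hardware / delivered:.4f}/core hour vs £{archer_rate:.2f} on ARCHER")

The gap narrows a lot once the overheads and less-than-perfect utilisation are included, which is where the "less than half the cost" figure above comes from.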
Hope this helps,
Rob