Hi José,
I'm touching on a lot of points here, so please let me know if you'd like more detail in any particular area.
We've tried a few external resources, including ARCHER and AWS, and we also have an internal cluster.
If you're reasonably familiar with installing OpenTELEMAC on Linux, getting it compiled and running on most clusters is fairly straightforward. Figuring out the compilers, MPI versions and the job submission process is probably the most time-consuming part.
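To give a flavour of the job submission side, here's a minimal Python sketch of the sort of wrapper you might use to generate and submit a SLURM batch script for a run. The module names, partition, case file and the telemac2d.py invocation are placeholders; substitute whatever your cluster and your OpenTELEMAC build actually use.

#!/usr/bin/env python3
"""Sketch: generate and submit a SLURM batch script for an OpenTELEMAC run.

Module names, the partition and the launcher call are illustrative only;
adjust them to match your cluster and your OpenTELEMAC configuration.
"""
import subprocess
from pathlib import Path


def submit_telemac_job(cas_file: str, cores: int = 120, hours: int = 24) -> str:
    """Write an sbatch script for the given steering (.cas) file and submit it."""
    stem = Path(cas_file).stem
    script = f"""#!/bin/bash
#SBATCH --job-name=telemac_{stem}
#SBATCH --ntasks={cores}
#SBATCH --time={hours}:00:00
#SBATCH --partition=compute        # placeholder partition name

# Placeholder environment setup: load the compiler/MPI stack that
# OpenTELEMAC was built against on this cluster.
module load gcc openmpi

# telemac2d.py is the 2D launcher shipped with OpenTELEMAC;
# --ncsize sets the number of MPI processes for the run.
telemac2d.py {cas_file} --ncsize={cores}
"""
    script_path = Path(f"run_{stem}.sh")
    script_path.write_text(script)

    # sbatch prints "Submitted batch job <id>" on success.
    result = subprocess.run(["sbatch", str(script_path)],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(submit_telemac_job("my_model.cas", cores=120, hours=24))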
ARCHER have recently reduced their prices to 5p per core hour, which makes it very easy to justify. We've never had any problems getting the capacity we need, though note that it's very much optimised for short, huge-scale runs: a job that will take 24 hours might wait 24 hours before it starts, but if you can run the same job efficiently on 24 times the number of nodes, it will probably start within an hour.
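To put the 5p figure in context, a quick back-of-the-envelope sum (the job size is purely illustrative):

# ARCHER cost at 5p per core hour (job size illustrative).
rate = 0.05                      # £ per core hour
core_hours = 120 * 24            # a 24-hour run on 120 cores
print(f"120 cores x 24 h = {core_hours} core hours = £{core_hours * rate:.2f}")  # £144.00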
We've used AWS with some Python wrappers from MIT called StarCluster, which let us easily spin up multiple nodes and set up a cluster on demand. For OpenTELEMAC, we've found these give great performance if you only use one node per run. However, when testing multi-node runs, performance tailed off very quickly, to the point where even a two-node job was not worth running. This might be a fixable problem, but with the speed and cost of ARCHER we've not really looked into it. For small jobs, though, it's so easy to set up that it's definitely worth a mention.
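For reference, the on-demand workflow looks roughly like the sketch below, driven here from Python via the StarCluster CLI. It assumes a cluster template is already defined in ~/.starcluster/config (AMI, instance type, node count) with OpenTELEMAC baked into the image; the cluster tag, paths and run command are placeholders, so check the exact commands against the StarCluster docs.

#!/usr/bin/env python3
"""Sketch of an on-demand AWS run using the StarCluster CLI from Python.

Assumes a cluster template is already defined in ~/.starcluster/config and
that the AMI has OpenTELEMAC installed; names and paths are placeholders.
"""
import subprocess


def sc(*args):
    """Run a StarCluster CLI command and fail loudly if it errors."""
    subprocess.run(["starcluster", *args], check=True)


tag = "telemac-run"   # placeholder cluster name

sc("start", tag)                                        # spin up the EC2 nodes
sc("put", tag, "my_model/", "/home/sgeadmin/my_model")  # copy the model files up
# Run the case on the master node (single-node runs scaled best for us).
sc("sshmaster", tag,
   "cd /home/sgeadmin/my_model && telemac2d.py my_model.cas --ncsize=16")
sc("terminate", tag)   # prompts before tearing the nodes down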
Internally we have a cluster with InfiniBand networking, running CentOS 7 and built on xCAT and the SLURM resource manager, which covers our base requirements. If internal utilisation is high, we will usually use ARCHER for OpenTELEMAC.
Pricing estimates are hard: it depends entirely on your modelling demands. Very roughly, a single compute node with ~30 cores may cost around £3-4k, though note that not all nodes are equal. There are a few additional costs, like networking and disks, but compute nodes should be the biggest cost.
Potential utilisation is probably the most important thing to consider. If you have an entirely flat demand of 120 cores 24/7, buy the hardware. Use it for 3 years plus and you can probably max it out at near 100% at less than half the cost of the equivalent external resource, including power, cooling and admin time.
But more likely you'll have much less predictable demand. In our case I've found anything above 75% is unsustainable: unless you have very uniform demand, you may get a maximum of 50% to 75% utilisation over time, and there could still be peaks you'll need external resources for. Of course, this depends heavily on the urgency of your runs too. If you don't have deadlines, you can submit jobs to a queue and not worry how long they take to complete; then you'll be able to keep the queue full and get much higher utilisation.
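To make the cost comparison concrete, here's a rough hardware-only sum using the figures above (5p per core hour, roughly £3.5k per ~30-core node). Power, cooling and admin time are deliberately left out, and in practice they account for much of the remaining internal cost.

# Rough comparison for 120 cores of demand over 3 years (hardware only;
# power, cooling and admin time are excluded and add a lot on top).
archer_rate = 0.05                 # £ per core hour
node_cost = 3500.0                 # £ per ~30-core node (rough midpoint of £3-4k)
hardware = 4 * node_cost           # 4 x ~30 cores ~= 120 cores -> £14,000

hours = 3 * 365 * 24               # 3 years of wall-clock hours
for utilisation in (1.0, 0.75, 0.5):
    delivered = 120 * hours * utilisation          # core hours actually used
    print(f"{utilisation:>4.0%} utilisation: hardware alone works out at "
          f"£{hardware / delivered:.4f}/core hour vs £{archer_rate:.2f} on ARCHER")

The gap narrows a lot once the overheads and less-than-perfect utilisation are included, which is where the "less than half the cost" figure above comes from.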
Hope this helps,
Rob