0 votes
by (550 points)

Piggybacking off an earlier question (re: new HPC user)


Running my script on a local machine with 32 cores using the command

python3 ./atrialBARS_reEntry_v1.1.py --np 32

the simulation time is only ~40 minutes, as seen here.

However, I cannot reproduce the same run time on the HPC. I'm using the Graham cluster in Ontario, Canada. Our HPC uses SLURM as the resource manager, and I have the following job script:


#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=4G
#SBATCH --time=0-02:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 petsc/3.15.0 scipy-stack/2023b opencarp/13.0

mpiexec python3 atrialBARS_reEntry_v1.1.py --overwrite-behaviour overwrite


This run has an estimated completion time of 6 hours, although I have seen instances where it goes down to 3 hours.

Now, when I run "./atrialBARS_reEntry.py --np 32" as I do on my local machine, I get an error that there aren't enough slots available in the system to satisfy the 32 slots requested by the application.

I've also tried mpirun --oversubscribe to brute-force my way in, to no avail. I've tried various combinations of MPI and Python flags, but I just can't seem to find the right one.

I've reached out to our HPC's support team, but I'm equally clueless as to what to tell them about the problem, other than that I cannot replicate the performance of my local machine on the HPC. Moreover, is this openCARP-specific? OpenMPI-specific? HPC-specific?

Any help would be appreciated.

Karl

related to an answer for: Running tuneCV in a HPC

1 Answer

+1 vote
by (7.8k points)
selected by
 
Best answer

Hi!

Assuming your script atrialBARS_reEntry_v1.1.py runs the openCARP binary, here are some performance considerations:

As a starting point, you want to use as many processes (SLURM calls them tasks) as you have *physical* cores on a node. That is, you don't want to count hyperthreading cores as physical cores. Also, OpenMP is not used in openCARP itself (but it is in some utility binaries such as igb*).
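If you are unsure how many physical cores a node actually has, you can check it yourself; the snippet below is just a quick sketch using standard Linux and SLURM tools (run lscpu on a compute node, e.g. in an interactive job, and note that the exact output layout varies between lscpu versions and sites):

# physical cores = Socket(s) x Core(s) per socket; Thread(s) per core > 1 means hyperthreading
lscpu | grep -E '^(Socket|Core|Thread)'

# or ask SLURM what it knows about the nodes (%c CPUs, %X sockets, %Y cores/socket, %Z threads/core)
sinfo -o "%n %c %X %Y %Z" | head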

Your cluster seems to mostly use the E5-2683 v4 CPU in a dual-socket configuration, so you have 32 physical CPU cores per node. I suggest changing your SLURM options to:

#SBATCH --account=def-someuser
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=0-02:00:00
export OMP_NUM_THREADS=1

python3 ./atrialBARS_reEntry_v1.1.py --np 32
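For completeness, a full job script combining these directives with the module loads from your original script could look like the sketch below (the account name, module versions and memory request are carried over from your script, and the carputils script is assumed to launch MPI itself, as in the suggestion above; adjust everything to your setup):

#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --mem-per-cpu=4G          # carried over from the original script; adjust if needed
#SBATCH --time=0-02:00:00

# one MPI rank per physical core; no OpenMP threading
export OMP_NUM_THREADS=1

module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 petsc/3.15.0 scipy-stack/2023b opencarp/13.0

# no explicit mpiexec here; the script is assumed to handle the MPI launch via --np
python3 ./atrialBARS_reEntry_v1.1.py --np 32

Submit it with sbatch as usual.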
by (550 points)
edited by
Thank you so much, Aurel!

As you know, the difference between CPUs and tasks wasn't very clear to me. I had the impression that cpus = physical cores. I need to learn more about SLURM directives.

My run time on the HPC is now 30 minutes faster than on my local machine, although I get this warning:

stty: 'standard input': Inappropriate ioctl for device
stty: 'standard input': Inappropriate ioctl for device
stty: 'standard input': Inappropriate ioctl for device
stty: 'standard input': Inappropriate ioctl for device
stty: 'standard input': Inappropriate ioctl for device

Is this something I should be worried about?

I will most likely ask for more cores (up to 128) in the future as I run more complex simulations.

If I can only ask for 32 physical CPUs per node, could I change to --nodes=2 to recruit more physical CPUs? Or would this only distribute the same number of physical CPUs across two compute nodes?
by (7.8k points)
To use CPUs across multiple nodes, you need to change the "--nodes" SLURM option and adjust the "--np" option of the Python script to match number-of-nodes times tasks-per-node.

E.g., for 128 processes, use:

#SBATCH --account=def-someuser
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=0-02:00:00
export OMP_NUM_THREADS=1

python3 ./atrialBARS_reEntry_v1.1.py --np 128
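If you want to double-check that the allocation matches what the script expects, you could echo the standard SLURM environment variables at the top of the job; this is just a small sanity-check sketch:

echo "nodes:          $SLURM_JOB_NUM_NODES"
echo "tasks per node: $SLURM_NTASKS_PER_NODE"
echo "total tasks:    $SLURM_NTASKS"    # should match the value passed to --np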


The fastest simulation time is achieved when using approx. 30K degrees of freedom per CPU core. As such, you can compute something like:

(Number of compute nodes to use) = floor((Number of vertices in your mesh) / 30000 / 32)

E.g., for a 5-million-vertex mesh, that would be --nodes=5.
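As a quick back-of-the-envelope check in the job script itself (a sketch assuming the ~30K DOF/core rule above and 32 physical cores per node; the vertex count is just an example value):

NVERT=5000000        # number of vertices in your mesh (example)
DOF_PER_CORE=30000   # target degrees of freedom per CPU core
CORES_PER_NODE=32    # physical cores per Graham node
echo $(( NVERT / DOF_PER_CORE / CORES_PER_NODE ))   # integer division acts like floor -> 5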

The number of CPUs used does not need to be a power of 2, although mesh partitioning is faster and better if it is.
by (550 points)
Thank you, Aurel!