0 votes
by (120 points)
Hello,

We have been using openCARP to perform S1-S2 activation threshold studies. We are noticing some weird behaviour on our clusters and are wondering if you could point us in the right direction.

When we run our simulations on a single node, we see very good scaling up to 80 cores on our large-memory node. However, when we try to use two or three 32-core nodes together, performance is significantly worse. For example, a simulation for just S1 takes 8 minutes with 80 cores, but with 3 nodes (96 cores) it takes 30 minutes. Do you have any idea what could be going wrong here?

We have compiled openCARP following the instructions given in the documentation. For PETSc, the only change we have made is to build against Open MPI, as that is what is installed on our clusters. The specific commit of the code we are using is 675501a5e0e0fac521aa3e1ef2950c2f9012457b. We have also tried different meshes for our simulations and observe the same behaviour.
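In case it matters, the change is only in how PETSc's configure finds MPI, roughly along these lines (a sketch only; the remaining options are exactly those from the openCARP documentation):

# sketch: build PETSc against the cluster's Open MPI via its compiler wrappers;
# every other configure option is taken unchanged from the openCARP instructions
./configure --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90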

Please let me know if you require any additional information.

Thanks!

2 Answers

0 votes
by (180 points)
Hey!

The first thing I would check is the architecture of the different nodes. Do you know what kind of hardware is used on them? It might be as simple as the 3 smaller nodes being older and having slower CPUs overall than the single large-memory node.

By using multiple nodes you introduce additional kinds of overhead, such as communication between nodes and synchronization between MPI processes. Depending on the size of your problem, the time spent on communication can grow faster than the time saved on computation.

You could try profiling tools to see where the extra time is spent in your simulation.
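For a first, low-effort look you could attach an LD_PRELOAD-based profiler such as mpiP to a single-node and a multi-node run and compare the MPI share of the runtime; just a sketch, the library path is a placeholder for wherever mpiP lives on your cluster:

# sketch: same simulation, profiled once on one node and once on three nodes
mpirun -n 80 -x LD_PRELOAD=/path/to/libmpiP.so ./openCARP +F s1.par -simID prof_1node
mpirun -n 96 -x LD_PRELOAD=/path/to/libmpiP.so ./openCARP +F s1.par -simID prof_3nodes
# mpiP writes a *.mpiP text report per run; its MPI time section shows how much
# of the wall time each rank spends inside MPI calls vs. in the application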

For now that's all I could think of. Hope it helps!

Best,

Tobias
by (120 points)
Hi Tobias,

Thanks for the response. The 3 nodes we are using are identical and connected with 100 Gbit InfiniBand. When I use 32 cores of our large-memory node, the simulation takes the same time as on one of our smaller compute nodes.

We have had some help from our cluster support team in benchmarking the code and they have not noticed anything obvious.

We will continue to look into profiling. We just find this very strange because the slowdown happens as soon as we introduce an additional node. Do you think the mesh we are using could have any effect on this?

Thanks,
Kyle Klenk
by (180 points)
Can you give some details about this simulation? Mesh degrees of freedom and simulation parameters (especially IO-related settings). Maybe that can point us in the right direction.
by (120 points)
I hope this is helpful regarding the mesh; my colleague generated it and is more familiar with the specifics. This is what they wrote about it for me:

All simulations are performed on a 10 mm × 10 mm × 1 mm cuboidal domain. The spatial domain is discretized with openCARP’s default settings, that is, with piecewise linear tetrahedral finite elements and a mesh resolution of 0.1 mm. When applied to the considered domain, this leads to a discretization with 112,211 nodes and 500,000 elements. The default settings of openCARP are also used for the temporal discretization.
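As a quick sanity check on those numbers: 10 mm / 0.1 mm gives 100 element intervals in x and y and 1 mm / 0.1 mm gives 10 in z, so 101 × 101 × 11 = 112,211 grid nodes, and 100 × 100 × 10 = 100,000 hexahedral cells split into 5 tetrahedra each gives the 500,000 elements.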

As for the settings, here is what is in our .par file:

############### physical regions ##############
num_phys_regions     = 2
phys_region[0].name  = "Intracellular domain"
phys_region[0].ptype = 0
phys_region[0].num_IDs = 1
phys_region[0].ID[0] = 1
phys_region[1].name  = "Extracellular domain"
phys_region[1].ptype = 1
phys_region[1].num_IDs = 1
phys_region[1].ID[0] = 1

############### ionic setup ###################
num_imp_regions      = 1
imp_region[0].im     = Shannon

############## stimulus setup #################
num_stim             =      3
stimulus[0].name     = "S1"
stimulus[0].stimtype =      1
stimulus[0].duration =      2.
stimulus[0].start    =      0.
stimulus[0].npls     =      1
stimulus[0].x0       = -50.0 #in um
stimulus[0].xd       = 323.6 #thickness in um
stimulus[0].y0       = -50.0
stimulus[0].yd       = 323.6
stimulus[0].z0       = 950.0
stimulus[0].zd       = 100.0
stimulus[1].name     = "Ground"
stimulus[1].stimtype =      3
stimulus[1].x0       = -50.0
stimulus[1].xd       = 10100.0
stimulus[1].y0       = -50.0
stimulus[1].yd       = 10100.0
stimulus[1].z0       = -50.0
stimulus[1].zd       = 100.0

################# Simulation parameters #################
bidomain = 1
tend    =  70.
spacedt = 1.0
timedt = 1.0
parab_solve = 1
vofile = "vm.igb"

# Number of events to detect
num_LATs  = 1

# Event 1: activation
lats[0].ID         = ACTs
lats[0].all        = 1
lats[0].measurand  = 0
lats[0].threshold  = 0
lats[0].mode       = 0

# Event Monitor
sentinel_ID = 0
t_sentinel = 10.0
t_sentinel_start = 0.0

I hope this is what you were looking for.
by (180 points)
There seem to be some parameters missing for the stimulus, since I cannot get it to activate the tissue and the simulation stops early due to the sentinel. If you give me the correct parameters I can check if I get the same behavior on our HPC system.

On a side note, please change to the stim[] definitions for stimuli in future simulations, since stimulus[] is deprecated.
by (120 points)
My apologies, we have been passing the stimulus strength in as a command-line parameter. We run the simulations with a lower bound of 22500000 and an upper bound of 72500000 as we search for the S1 threshold. The full command we use to start a simulation is below.

mpirun -n 96 ./openCARP +F s1.par -stimulus[0].strength 72500000
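
For context, the search itself is a plain bisection over the stimulus strength; the sketch below only illustrates that loop (the "captured" helper and the file it checks are placeholders, our actual tooling differs):

#!/bin/bash
# illustrative bisection over the S1 stimulus strength -- not our exact tooling
captured () {
  # placeholder check: treat a non-empty activation file in the run directory as capture
  test -s "$1/init_acts_ACTs-thresh.dat"
}
lo=22500000
hi=72500000
while [ $((hi - lo)) -gt 1000000 ]; do
  mid=$(( (lo + hi) / 2 ))
  mpirun -n 96 ./openCARP +F s1.par -simID thr_${mid} -stimulus[0].strength ${mid}
  if captured "thr_${mid}"; then
    hi=$mid   # capture: the threshold is at or below this strength
  else
    lo=$mid   # no capture: the threshold is above this strength
  fi
done
echo "S1 threshold bracketed between $lo and $hi"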
by (180 points)
Hey!

I have run the experiment on our HPC system in different configurations and I can confirm your observations. What I observed was an excessive increase in the computation time of the ionic models (look at the ODE_stats.dat file written to your output directory) as soon as you move to 2 or more nodes.

I currently suspect that something goes wrong in the partitioning of the mesh and I will look at it more closely. Essentially, I noticed that when you use multiple nodes with X tasks each, the mesh is only partitioned into X blocks instead of X*nodes blocks. If you want to see it yourself, you can use the -gridout_p and -output_level parameters to get more output regarding the partitioning. However, I am not quite sure yet how this is connected to the increased computation times in the ODEs.
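
If anyone wants to reproduce that check, something along these lines should do; only -output_level is shown with a concrete value here, for -gridout_p please look up the required value in the parameter documentation:

mpirun -n 64 ./openCARP +F s1.par -simID part_check -output_level 5
# the log then lists the per-rank partition sizes; adding -gridout_p with the
# value from the parameter docs additionally writes out the partitioned grid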
0 votes
by (8.1k points)
Hi!

On most HPC systems, there is a scheduler taking care of distributing the MPI processes across the allocated machines / compute nodes. Without it, you would need to use a hostfile to tell MPI how to distribute the processes, which is rarely used in practice.
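
Just to illustrate what that manual route would look like with Open MPI (placeholder node names, not a recommendation here):

# hosts.txt -- placeholder node names; "slots" is the number of ranks placed on each node
node01 slots=32
node02 slots=32
node03 slots=32

mpirun -n 96 --hostfile hosts.txt ./openCARP +F s1.par -simID manual_hosts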

The scheduling system you use and its configuration have not been addressed in the question or the prior answers. I would suggest having a look there.

To be able to help you troubleshoot, we would also need you to post the submission script you used.

Best, Aurel
by (180 points)

In my case, when I tried to reproduce it, it was on bwunicluster using SLURM with a script like this:

#!/bin/bash
#SBATCH --job-name=multiple_np80
#SBATCH --nodes=2
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1
#SBATCH --output=multiple_np80.out
#SBATCH --time=0:10:00
#SBATCH --partition=multiple
source /home/USER/.bashrc
# No Hybrid MPI/OpenMP
export OMP_NUM_THREADS=1
cd EP_benchmark_uc2
mpirun --bind-to core --map-by core openCARP +F parameters.par -simID multiple_np80

This is pretty much what the carputils template for bwunicluster would do. 

by (8.1k points)
Please try

#!/bin/bash
#SBATCH --job-name=multiple_np80
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --output=multiple_np80.out
#SBATCH --time=0:10:00
source /home/$USER/.bashrc
# No Hybrid MPI/OpenMP
export OMP_NUM_THREADS=1
cd EP_benchmark_uc2
mpirun openCARP +F parameters.par -simID multiple_np80
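
If you want to verify the task layout, a quick check (just a suggestion) is to add a line like this right before the mpirun call:

srun hostname | sort | uniq -c   # should show 40 tasks on each of the 2 nodes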

Cheers, Aurel
by (120 points)

In my case I was using a cluster with Slurm and doing the following:

salloc --account=hpc_c_giws_clark --mem=0 --nodes=3 --ntasks-per-node=32 --time=4:00:00

mpirun -n 96 ./openCARP +F /globalhome/kck540/HPC/openCARP-Projects/s1_simulations/s1.par -stimulus[0].strength 7250000
by (180 points)
Thanks Aurel! In my case this is the solution. Turns out I actually misunderstood the --ntasks option for longer than I'd like to admit...
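
For anyone who stumbles over the same thing: --ntasks requests a total number of MPI tasks for the whole job, not a per-node count, so the two variants below ask for very different things:

#SBATCH --nodes=2
#SBATCH --ntasks=40             # 40 MPI ranks in total, spread over the 2 nodes

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40    # 40 MPI ranks on each node, i.e. 80 in total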
by (8.1k points)
Kyle, your parameters look good. The poor scaling must be due to the combination of a small local problem size and a sub-optimal interconnect. A 100 Gbit interconnect sounds like a lot, but throughput matters less than latency: solving the problem in parallel requires the exchange of many small messages with minimal latency, which is why InfiniBand is much better suited than Gbit Ethernet. As a consequence, you might only scale efficiently down to larger local problem sizes (e.g. 100K - 200K) than what is reported in the literature (20K - 50K).

Please report the local problem sizes you have. You can see them using the "-output_level 5" option. You should see a small table with the ranks and their local number of elements / indices. The distribution should be pretty even, so you only need to post one set of values. Also, index multiplicities are worth reporting.
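
As a back-of-the-envelope figure based on the mesh you posted: 112,211 nodes and 500,000 elements spread over 96 ranks is only about 1,200 nodes and 5,200 elements per rank, far below even the 20K - 50K range above, so communication dominating the runtime would not be surprising.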

Best, Aurel