Introduction
To maximize an HPC application's memory bandwidth and floating-point performance, its processes need to be distributed evenly across the VM, utilizing all sockets, NUMA domains and L3 caches.
In hybrid parallel applications, each process has several threads associated with it, and it is recommended to keep a process and its threads on the same L3 cache to maximize data sharing and reuse.
Optimal process/thread placement on Azure AMD processor VMs (e.g. HB120rs_v2, the HBv3 series and NDv4) is further complicated because these VMs can have 4 to 30 NUMA domains and many L3 caches.
There are several Linux tools (e.g. numactl and taskset) that can control the placement of processes. MPI libraries also provide arguments and environment variables to control process pinning.
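To make the idea of pinning concrete, the following minimal Python sketch does for the current process roughly what taskset does from the command line: it asks Linux which cores the process may run on and then restricts it to a subset. It relies on os.sched_getaffinity and os.sched_setaffinity, which are only available on Linux. The helper name pin_to_cores is hypothetical and is not part of any tool discussed in this post.

```python
import os


def pin_to_cores(cores):
    """Restrict the calling process to the given core ids (Linux only).

    Returns the resulting affinity set, or None on platforms that do
    not expose the sched_*affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # 0 means "the calling process"
    wanted = set(cores) & allowed       # never request a core we may not use
    if not wanted:
        return allowed                  # nothing usable requested; leave as is
    os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Hypothetical choice: try to keep this process on cores 0-3.
    print("affinity now:", pin_to_cores(range(4)))
```

Threads created after the call inherit the new affinity, which is why launchers usually set it before the application starts.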
We will focus on the HPCX and Intel MPI pinning/mapping syntax.
In this post we will discuss a tool that can help HPC applications pin processes/threads in an optimal manner on HPC specialty VMs. The tool can show you the VM NUMA topology, show where your processes are currently pinned (with warnings if suboptimal pinning is detected) and print the optimal MPI pinning syntax for the HPCX and Intel MPI libraries.
HPC application pinning tool
The tool is called “check_app_pinning.py” and is in the azurehpc GitHub repository.
git clone https://github.com/Azure/azurehpc.git
See the experimental/check_app_pinning directory.
Tool syntax
./check_app_pinning.py -h
usage: check_app_pinning.py [-h] [-anp APPLICATION_PATTERN] [-ppa] [-f]
                            [-tnp TOTAL_NUMBER_PROCESSES]
                            [-ntpp NUMBER_THREADS_PER_PROCESS]
                            [-mt {openmpi,intel}]

optional arguments:
  -h, --help            show this help message and exit
  -anp APPLICATION_PATTERN, --application_name_pattern APPLICATION_PATTERN
                        Select the application pattern to check [string]
                        (default: None)
  -ppa, --print_pinning_syntax
                        Print MPI pinning syntax (default: False)
  -f, --force           Force printing MPI pinning syntax (i.e ignore
                        warnings) (default: False)
  -tnp TOTAL_NUMBER_PROCESSES, --total_number_processes TOTAL_NUMBER_PROCESSES
                        Total number of MPI processes per VM (used with -ppa)
                        (default: None)
  -ntpp NUMBER_THREADS_PER_PROCESS, --number_threads_per_process NUMBER_THREADS_PER_PROCESS
                        Number of threads per process (used with -ppa)
                        (default: None)
  -mt {openmpi,intel}, --mpi_type {openmpi,intel}
                        Select which type of MPI to generate pinning syntax
                        (used with -ppa) (default: None)
Tool prerequisites
The “check_app_pinning.py” tool requires python3 and the hwloc package to be installed before it is run.
On CentOS-HPC, these packages can be installed using yum as follows:
sudo yum install -y python3 hwloc
Some examples
You need to run the tool on the HPC compute VM running your application. If you want to run it on all the HPC VMs involved in your MPI application, you can use a parallel shell like pdsh or pssh.
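As an illustration of the parallel-shell approach, the sketch below builds a pdsh command line that would run the tool on several VMs at once. It only constructs the command and does not execute it. The host names and the tool path are placeholders, and the helper name pdsh_command is hypothetical; the sketch also assumes the tool sits at the same path on every VM.

```python
import shlex


def pdsh_command(hosts, tool_args, tool_path="./check_app_pinning.py"):
    """Build (but do not run) a pdsh command as an argument list.

    pdsh takes a comma-separated target list via -w, followed by the
    remote command as a single string.
    """
    remote = shlex.join([tool_path, *tool_args])
    return ["pdsh", "-w", ",".join(hosts), remote]


if __name__ == "__main__":
    # Placeholder host names; substitute the names of your own VMs.
    cmd = pdsh_command(["vm0", "vm1", "vm2", "vm3"], ["-anp", "hpcapp"])
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the host names and path match your cluster.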
Generate MPI pinning for hybrid parallel application running on HB120-64rs_v3
You are using 4 HB120-64rs_v3 VMs and would like to know the correct HPCX MPI syntax to pin a total of 64 MPI processes with 4 threads per process (i.e. 16 MPI processes per VM).
check_app_pinning.py -ppa -tnp 16 -ntpp 4

Virtual Machine (Standard_HB120-64rs_v3, cghb64v3) Numa topology

NumaNode id   Core ids              GPU ids
============  ====================  ==========
0             ['0-15']              []
1             ['16-31']             []
2             ['32-47']             []
3             ['48-63']             []

L3Cache id    Core ids
============  ====================
0             ['0-3']
1             ['4-7']
2             ['8-11']
3             ['12-15']
4             ['16-19']
5             ['20-23']
6             ['24-27']
7             ['28-31']
8             ['32-35']
9             ['36-39']
10            ['40-43']
11            ['44-47']
12            ['48-51']
13            ['52-55']
14            ['56-59']
15            ['60-63']

Process/thread openmpi MPI Mapping/pinning syntax for 16 processes and 4 threads per process

--map-by ppr:4:numa:pe=4
The first section of output shows the NUMA topology, i.e. how many NUMA domains and L3 caches there are, and which core ids are in each NUMA domain and L3 cache. It also shows how many GPUs you have and which NUMA domain each GPU id belongs to.
You will see warnings if the combination of processes and threads is not optimal for this VM (e.g. too many or too few processes/threads, or a thread count that will not fit in an L3 cache). By default, if you get a warning the MPI placement syntax will not be generated until the warning is corrected, but you can override this default with the -f (or --force) option to ignore the warning and generate the MPI placement syntax anyway.
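The kinds of checks described here can be sketched in a few lines of Python. The rules below are illustrative guesses based only on the warning examples mentioned in this post (too many or too few processes/threads, threads not fitting in an L3 cache); they are not the tool's actual logic, and the function name placement_warnings is hypothetical.

```python
def placement_warnings(procs_per_vm, threads_per_proc, total_cores, cores_per_l3):
    """Return a list of human-readable warnings for a proposed layout."""
    warnings = []
    needed = procs_per_vm * threads_per_proc
    if needed > total_cores:
        warnings.append(
            f"{needed} threads requested but only {total_cores} cores are available"
        )
    elif needed < total_cores:
        warnings.append(f"only {needed} of {total_cores} cores would be used")
    if threads_per_proc > cores_per_l3:
        warnings.append(
            f"{threads_per_proc} threads per process will not fit in an "
            f"L3 cache of {cores_per_l3} cores"
        )
    return warnings


if __name__ == "__main__":
    # HB120-64rs_v3 values from the report: 64 cores, 4 cores per L3 cache.
    print(placement_warnings(16, 4, total_cores=64, cores_per_l3=4))
    print(placement_warnings(16, 8, total_cores=64, cores_per_l3=4))
```

With 16 processes and 4 threads each, every process fills exactly one L3 cache on this VM, so the sketch reports nothing; doubling the thread count trips both checks.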
In the above example you would copy the HPCX MPI pinning syntax and launch your application like this:
mpirun -np 64 --map-by ppr:4:numa:pe=4 ./mpi_executable
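The arithmetic behind the ppr:4:numa:pe=4 string can be sketched as follows: 16 processes per VM spread over 4 NUMA domains gives 4 processes per NUMA domain, and each process is bound to 4 cores (pe=4). The sketch also lists which cores each local rank would receive, under the assumption that ranks fill each NUMA domain in core order. All function names are hypothetical, and this is not how the tool itself derives the string.

```python
def expand_core_ranges(ranges):
    """Turn ['0-3', '8'] into [0, 1, 2, 3, 8]."""
    cores = []
    for item in ranges:
        first, _, last = item.partition("-")
        cores.extend(range(int(first), int(last or first) + 1))
    return cores


def openmpi_map_by(procs_per_vm, threads_per_proc, numa_count):
    """Build the --map-by string for an even spread over NUMA domains."""
    if procs_per_vm % numa_count:
        raise ValueError("processes do not divide evenly across NUMA domains")
    return f"--map-by ppr:{procs_per_vm // numa_count}:numa:pe={threads_per_proc}"


def expected_cores(numa_cores, procs_per_numa, threads_per_proc):
    """Core ids per local rank, assuming ranks fill each NUMA domain in order."""
    layout = []
    for ranges in numa_cores:
        cores = expand_core_ranges(ranges)
        for p in range(procs_per_numa):
            layout.append(cores[p * threads_per_proc:(p + 1) * threads_per_proc])
    return layout


if __name__ == "__main__":
    # NUMA topology copied from the HB120-64rs_v3 report.
    topology = [["0-15"], ["16-31"], ["32-47"], ["48-63"]]
    print(openmpi_map_by(16, 4, len(topology)))
    for rank, cores in enumerate(expected_cores(topology, 4, 4)):
        print(f"local rank {rank:2d} -> cores {cores}")
```

Under that assumption each local rank lands on exactly one L3 cache (cores 0-3, 4-7, and so on). Open MPI's --report-bindings option to mpirun prints the bindings actually applied, which is a direct way to confirm the layout.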
If you would prefer Intel MPI placement syntax, add the -mt intel option and the following Intel MPI pinning syntax will be generated:
export I_MPI_PIN_DOMAIN=4:compact
In this case, add the above line to your script before your mpirun command.
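If the job is launched from a Python wrapper rather than a shell script, the same setting can be passed through the environment. The sketch below builds such an environment without launching anything. Setting OMP_NUM_THREADS is an assumption (it only applies if the threads are OpenMP threads), and the helper name intel_mpi_env is hypothetical.

```python
import os


def intel_mpi_env(threads_per_proc, layout="compact", base_env=None):
    """Return a copy of the environment with Intel MPI pinning variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Domain size equals the number of threads per process, e.g. "4:compact".
    env["I_MPI_PIN_DOMAIN"] = f"{threads_per_proc}:{layout}"
    # Assumption: the application's threads are OpenMP threads.
    env["OMP_NUM_THREADS"] = str(threads_per_proc)
    return env


if __name__ == "__main__":
    env = intel_mpi_env(4)
    print("I_MPI_PIN_DOMAIN =", env["I_MPI_PIN_DOMAIN"])
    # To launch, pass env to subprocess.run, for example:
    # subprocess.run(["mpirun", "-np", "64", "./mpi_executable"], env=env)
```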
Show where a hybrid MPI application is running on HB120rs_v2
You have a hybrid parallel MPI application (called hpcapp) running on 8 HB120rs_v2 VMs. You would like to see where the processes and threads are running and whether they are placed optimally on HB120rs_v2.
On one of the HB120rs_v2 VMs, execute:
check_app_pinning.py -anp hpcapp

Virtual Machine (Standard_HB120_v2) Numa topology

NumaNode id   Core ids              GPU ids
============  ====================  ==========
0             ['0-3']               []
1             ['4-7']               []
2             ['8-11']              []
3             ['12-15']             []
4             ['16-19']             []
5             ['20-23']             []
6             ['24-27']             []
7             ['28-31']             []
8             ['32-35']             []
9             ['36-39']             []
10            ['40-43']             []
11            ['44-47']             []
12            ['48-51']             []
13            ['52-55']             []
14            ['56-59']             []
15            ['60-63']             []
16            ['64-67']             []
17            ['68-71']             []
18            ['72-75']             []
19            ['76-79']             []
20            ['80-83']             []
21            ['84-87']             []
22            ['88-91']             []
23            ['92-95']             []
24            ['96-99']             []
25            ['100-103']           []
26            ['104-107']           []
27            ['108-111']           []
28            ['112-115']           []
29            ['116-119']           []

Application (hpcapp) mapping/pinning

PID           Total Threads      Running Threads    Last core id    Core id mapping    Numa Node ids    GPU ids
============  =================  =================  ==============  =================  ===============  ===============
13405         7                  4                  0               0                  [0]              []
13406         7                  4                  4               4                  [1]              []
13407         7                  4                  8               8                  [2]              []
13408         7                  4                  12              12                 [3]              []

Warning: 4 threads are mapped to 1 core(s), for pid (13405)
Warning: 4 threads are mapped to 1 core(s), for pid (13406)
Warning: 4 threads are mapped to 1 core(s), for pid (13407)
Warning: 4 threads are mapped to 1 core(s), for pid (13408)
Note: you do not need to provide the full application name, just enough of the name to uniquely identify it.
The first part of the output is the same as before: it shows the VM NUMA topology.
The second section of the output, titled “Application (hpcapp) mapping/pinning”, shows how many processes and threads are running and on which core ids.
PID: The process identifier, a unique number that identifies each process.
Total Threads: The total number of threads associated with each PID.
Running Threads: The number of threads belonging to each PID that are currently in a run state.
Last core id: Identifies the last core id the PID was running on.
Core id mapping: Shows the PID’s CPU affinity, i.e. which core ids the PID can run on.
Numa Node ids: Shows which NUMA domains the core ids listed under “Core id mapping” belong to.
GPU ids: Shows which GPU ids belong to the NUMA domains listed under “Numa Node ids”.
If you see warnings at the end of the report, the tool has identified possibly suboptimal process placement. Check the warnings to make sure your application is running as expected.
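For readers curious where columns like these can come from, the sketch below gathers similar per-thread information on Linux by reading /proc/<pid>/task/<tid>/stat and querying the CPU affinity. It is an independent illustration, not the tool's implementation. The function names are hypothetical, and the field positions follow the proc(5) manual page (field 3 is the state, field 39 is the CPU the thread last ran on).

```python
import os


def parse_stat_line(line):
    """Return (state, last_cpu) from one /proc/<pid>/task/<tid>/stat line."""
    # The command name (field 2) is parenthesised and may contain spaces,
    # so split on the LAST ')' before splitting the remaining fields.
    rest = line.rsplit(")", 1)[1].split()
    state = rest[0]            # field 3: R = running, S = sleeping, ...
    last_cpu = int(rest[36])   # field 39: CPU the thread last ran on
    return state, last_cpu


def thread_report(pid):
    """List (tid, state, last_cpu) for every thread of pid; empty without procfs."""
    task_dir = f"/proc/{pid}/task"
    threads = []
    if not os.path.isdir(task_dir):
        return threads
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as fh:
                state, cpu = parse_stat_line(fh.read())
        except (OSError, IndexError, ValueError):
            continue           # the thread exited while we were reading
        threads.append((int(tid), state, cpu))
    return threads


if __name__ == "__main__":
    pid = os.getpid()
    report = thread_report(pid)
    running = sum(1 for _, state, _ in report if state == "R")
    print("total threads:", len(report), "running threads:", running)
    if hasattr(os, "sched_getaffinity"):
        print("core id mapping:", sorted(os.sched_getaffinity(pid)))
```

A process whose running-thread count exceeds the size of its affinity set is the situation flagged by the "threads are mapped to 1 core(s)" warnings in the report.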
Check HPC application running on ND96asr_v4 (A100)
We can check where the processes and threads of an application (called hello) are running on a VM with GPUs, such as ND96asr_v4, and whether the processes are pinned optimally to utilize all the GPUs.
./check_app_pinning_new.py -anp hello
Virtual Machine (Standard_ND96asr_v4) Numa topology

NumaNode id   Core ids              GPU ids
============  ====================  ==========
0             ['0-23']              [3, 2]
1             ['24-47']             [1, 0]
2             ['48-71']             [7, 6]
3             ['72-95']             [5, 4]

L3Cache id    Core ids
============  ====================
0             ['0-3']
1             ['4-7']
2             ['8-11']
3             ['12-15']
4             ['16-19']
5             ['20-23']
6             ['24-27']
7             ['28-31']
8             ['32-35']
9             ['36-39']
10            ['40-43']
11            ['44-47']
12            ['48-51']
13            ['52-55']
14            ['56-59']
15            ['60-63']
16            ['64-67']
17            ['68-71']
18            ['72-75']
19            ['76-79']
20            ['80-83']
21            ['84-87']
22            ['88-91']
23            ['92-95']

Application (hello) Mapping/pinning

PID           Total Threads      Running Threads    Last core id    Core id mapping    Numa Node ids    GPU ids
============  =================  =================  ==============  =================  ===============  ===============
32473         6                  0                  0               0                  [0]              [3, 2]
32474         6                  2                  24              24                 [1]              [1, 0]
32475         6                  2                  48              48                 [2]              [7, 6]
32476         6                  2                  72              72                 [3]              [5, 4]

Warning: 2 threads are mapped to 1 core(s), for pid (32474)
Warning: 2 threads are mapped to 1 core(s), for pid (32475)
Warning: 2 threads are mapped to 1 core(s), for pid (32476)
Warning: Virtual Machine has 8 GPU's, but only 6 threads are running
In this case we see that the tool has identified 8 A100 GPUs, shown which NUMA domain each GPU id is located in, and flagged possibly suboptimal pinning with warnings.
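One way to use the NUMA-to-GPU mapping in the report is to plan a layout in which each GPU is driven by a process pinned to cores in the same NUMA domain. The sketch below assumes one process per GPU and an even split of each NUMA domain's cores between its GPUs. Both are illustrative assumptions, not recommendations from the tool, and the function name gpu_local_placement is hypothetical.

```python
# NUMA node -> GPU ids and core ranges, copied from the ND96asr_v4 report.
NUMA_GPUS = {0: [3, 2], 1: [1, 0], 2: [7, 6], 3: [5, 4]}
NUMA_CORES = {0: range(0, 24), 1: range(24, 48), 2: range(48, 72), 3: range(72, 96)}


def gpu_local_placement(numa_gpus, numa_cores):
    """Return {gpu_id: (numa_node, [core ids])} with one process per GPU."""
    placement = {}
    for node, gpus in numa_gpus.items():
        if not gpus:
            continue                       # NUMA domain without a GPU
        cores = list(numa_cores[node])
        share = len(cores) // len(gpus)    # even split of the domain's cores
        for slot, gpu in enumerate(sorted(gpus)):
            placement[gpu] = (node, cores[slot * share:(slot + 1) * share])
    return placement


if __name__ == "__main__":
    for gpu, (node, cores) in sorted(gpu_local_placement(NUMA_GPUS, NUMA_CORES).items()):
        print(f"GPU {gpu}: NUMA node {node}, cores {cores[0]}-{cores[-1]}")
```

Note that the GPU ids do not follow the NUMA node order on this VM (GPU 0 sits in NUMA node 1), so a mapping derived from the report is safer than assuming GPU i belongs to the i-th block of cores.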
Summary
Proper placement of processes and threads on HPC VMs is important for optimal performance.
Working out the correct placement of processes/threads is more complicated on VMs with many NUMA domains and L3 caches, such as the HB, HBv2, HBv3 and NDv4 series.
This post discussed a tool that assists with optimal placement of processes/threads on Azure HPC VMs.