Advanced Data Solutions : Optimal MPI Process Placement for Azure HB Series VMs

For MPI applications, optimal pinning of processes can significantly improve application performance on undersubscribed systems. Before AMD introduced the chiplet design a few years ago, getting optimal performance only required deciding whether the application ran better with all processes on one socket or balanced equally across the sockets. With the chiplet design, placement became more complicated. The following is a link to a diagram that may help to better understand the chiplet design.

In the chiplet design, AMD has essentially integrated a number of smaller CPUs into one socket with 64 cores (8 to 16 smaller CPUs with 4 to 8 cores each). To get the most performance from each core, it is important to balance the amount of L3 cache and memory bandwidth per core. We discuss below how to do this for the following Azure HB VM types using Intel MPI and OpenMPI/HPC-X.

Azure HB VM:

This instance comes with 60 AMD Naples cores. Each socket contains 8 NUMA domains with 4 cores each. One 4-core NUMA domain is held back for the hypervisor, leaving 15 NUMA domains for the user. When undersubscribing the VM to get the desired resources per core, it is desirable to balance the L3 cache and memory bandwidth equally between cores. To do this, select 15, 30, 45, or 60 cores per node, that is, 1, 2, 3, or 4 cores in each NUMA domain.

Metric	HB60rs	HB60rs	HB60rs	HB60rs
Cores used (physical)	15	30	45	60
RAM (GB)	224	224	224	224
Network BW (Gb/s)	100	100	100	100
Memory BW (GB/s)	250	250	250	250
RAM/core (GB)	14.93	7.47	4.98	3.73
Network BW/core (Gb/s)	6.67	3.33	2.22	1.67
Memory BW/core (GB/s)	16.67	8.33	5.56	4.17
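
The per-core rows in the table are simply the node totals divided by the number of cores in use. A minimal sketch that reproduces them for HB60rs, using the totals from the table; the helper name per_core is made up for this illustration and is not part of any MPI or Azure tool:

```shell
#!/usr/bin/env bash
# per_core TOTAL CORES: print TOTAL/CORES rounded to two decimals.
# (per_core is a hypothetical helper, used only for this illustration.)
per_core() {
    awk -v total="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", total / cores }'
}

# Node totals for HB60rs, taken from the table.
ram=224; net_bw=100; mem_bw=250

for cores in 15 30 45 60; do
    printf '%2d cores: RAM/core=%s  network BW/core=%s  memory BW/core=%s\n' \
        "$cores" "$(per_core "$ram" "$cores")" \
        "$(per_core "$net_bw" "$cores")" "$(per_core "$mem_bw" "$cores")"
done
```

Swapping in the totals of another VM size gives the per-core figures for that size in the same way.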

OpenMPI 4 / HPC-X:

Note: To print the process bindings before the application runs, add the flag --report-bindings.

--bind-to core --map-by ppr:1:numa (15 cores)

--bind-to core --map-by ppr:2:numa (30 cores)

--bind-to core --map-by ppr:3:numa (45 cores)
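
As a rough illustration of how these flags fit into a full launch line, the sketch below only prints an mpirun command; it does not run anything. It assumes the 15 user-visible NUMA domains per HB VM described above, so the ranks per node are 15 times the ranks per NUMA domain. The helper name ompi_cmd and the application path ./my_app are placeholders, and a multi-node run would also need a hostfile or scheduler integration that is not shown here.

```shell
#!/usr/bin/env bash
# ompi_cmd NODES NUMA_PER_NODE RANKS_PER_NUMA APP
# Print (not run) an mpirun command line that places RANKS_PER_NUMA ranks
# in every NUMA domain. ompi_cmd is a hypothetical helper for illustration.
ompi_cmd() {
    local nodes=$1 numa_per_node=$2 ranks_per_numa=$3 app=$4
    local ppn=$(( numa_per_node * ranks_per_numa ))   # ranks per node
    local np=$(( nodes * ppn ))                       # total ranks
    echo "mpirun -np $np --bind-to core --map-by ppr:${ranks_per_numa}:numa --report-bindings $app"
}

# Two HB nodes (15 NUMA domains each), 2 ranks per NUMA domain: 30 per node.
ompi_cmd 2 15 2 ./my_app
```

Printing the command first makes it easy to check the total rank count against the intended ranks per node before submitting the job.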

Intel MPI:

Note: To print the process pinning before the application runs, set the environment variable I_MPI_DEBUG=4.

15 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<60;i+=4) for (j=0;j<1;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')

30 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<60;i+=4) for (j=0;j<2;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')

45 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<60;i+=4) for (j=0;j<3;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')
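
The bc and sed pipelines above all generate the same kind of list: the first few core IDs of every 4-core NUMA domain, joined with commas. For readers who prefer to avoid bc, the same list can be sketched in plain bash. The function name pin_list is made up for this example.

```shell
#!/usr/bin/env bash
# pin_list TOTAL_CORES NUMA_SIZE PER_NUMA
# Print a comma-separated list with the first PER_NUMA core IDs of every
# NUMA_SIZE-core NUMA domain. pin_list is a hypothetical helper name.
pin_list() {
    local total=$1 numa_size=$2 per_numa=$3
    local ids=() i j
    for (( i = 0; i < total; i += numa_size )); do
        for (( j = 0; j < per_numa; j++ )); do
            ids+=( $(( i + j )) )
        done
    done
    local IFS=,          # join the array elements with commas
    echo "${ids[*]}"
}

pin_list 60 4 1   # 15 PPN on HB: 0,4,8,...,56
pin_list 60 4 2   # 30 PPN on HB: 0,1,4,5,...,56,57
```

The result can be passed in the same way as before, for example -env I_MPI_PIN_PROCESSOR_LIST=$(pin_list 60 4 2).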

Azure HBv2 VM:

This instance comes with 120 AMD Rome cores. Each socket exposes 15 NUMA domains with 4 cores each, after two 4-core NUMA domains are held back for the hypervisor. When undersubscribing the HBv2 VM to get the desired resources per core, it is desirable to balance the L3 cache and memory bandwidth equally between cores. To do this, select 30, 60, 90, or 120 cores per node.

Metric	HB120rs_v2	HB120rs_v2	HB120rs_v2	HB120rs_v2
Cores used (physical)	30	60	90	120
RAM (GB)	448	448	448	448
Network BW (Gb/s)	200	200	200	200
Cost/Hr	3.92	3.92	3.92	3.92
Memory BW (GB/s)	345	345	345	345
RAM/core (GB)	14.93	7.47	4.98	3.73
Network BW/core (Gb/s)	6.67	3.33	2.22	1.67
Memory BW/core (GB/s)	11.50	5.75	3.83	2.88

If you want to undersubscribe your VM to get the optimal amount of resources per core for your application, you can pin your processes to get the optimal placement for 30, 60, or 90 cores. To do this, add the following flags (OpenMPI/HPC-X) or environment variables (Intel MPI) to your MPI jobs.

OpenMPI 4 / HPC-X:

Note: To print the process bindings before the application runs, add the flag --report-bindings.

--bind-to core --map-by ppr:1:numa (30 cores)

--bind-to core --map-by ppr:2:numa (60 cores)

--bind-to core --map-by ppr:3:numa (90 cores)

Intel MPI:

Note: To print the process pinning before the application runs, set the environment variable I_MPI_DEBUG=4.

30 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<120;i+=4) for (j=0;j<1;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')

60 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<120;i+=4) for (j=0;j<2;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')

90 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<120;i+=4) for (j=0;j<3;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')

Azure HBv3 VM:

This instance comes with 120 AMD Milan cores. Each socket contains 2 NUMA domains with 30 cores each. 2 cores from each of 4 chiplets are held back for the hypervisor. When undersubscribing the HBv3 VM to get the desired resources per core, it is desirable to balance the L3 cache and memory bandwidth equally between cores. To do this, select 16, 32, 64, 96, or 120 cores per node. To simplify optimal process placement for our customers, we provide additional HBv3 VM sizes (HB120-16rs_v3, HB120-32rs_v3, HB120-64rs_v3, HB120-96rs_v3) alongside the standard HB120rs_v3 size. The table below shows the resources per core for the various sizes.

Metric	HB120-16rs_v3	HB120-32rs_v3	HB120-64rs_v3	HB120-96rs_v3	HB120rs_v3
Cores (physical)	16	32	64	96	120
RAM (GB)	448	448	448	448	448
Network BW (Gb/s)	200	200	200	200	200
Cost/Hr	3.92	3.92	3.92	3.92	3.92
Memory BW (GB/s)	345	345	345	345	345
RAM/core (GB)	28.00	14.00	7.00	4.67	3.73
Network BW/core (Gb/s)	12.50	6.25	3.13	2.08	1.67
Memory BW/core (GB/s)	21.56	10.78	5.39	3.59	2.88

If you are using the HB120rs_v3 size and want to undersubscribe your VM to get the optimal amount of resources per core for your application, you can pin your processes to the same cores used by the 16, 32, 64, or 96 core VM sizes. To do this, add the following flags (OpenMPI/HPC-X) or environment variables (Intel MPI) to your MPI jobs.

OpenMPI 4 / HPC-X:

Note: To print the process bindings before the application runs, add the flag --report-bindings.

16 PPN:

--bind-to cpulist:ordered --cpu-set 0,8,16,24,30,38,46,54,60,68,76,84,90,98,106,114

32 PPN:

--bind-to cpulist:ordered --cpu-set 0,1,8,9,16,17,24,25,30,31,38,39,46,47,54,55,60,61,68,69,76,77,84,85,90,91,98,99,106,107,114,115

64 PPN:

--bind-to cpulist:ordered --cpu-set 0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27,30,31,32,33,38,39,40,41,46,47,48,49,54,55,56,57,60,61,62,63,68,69,70,71,76,77,78,79,84,85,86,87,90,91,92,93,98,99,100,101,106,107,108,109,114,115,116,117

96 PPN:

--bind-to cpulist:ordered --cpu-set 0,1,2,3,4,5,8,9,10,11,12,13,16,17,18,19,20,21,24,25,26,27,28,29,30,31,32,33,34,35,38,39,40,41,42,43,46,47,48,49,50,51,54,55,56,57,58,59,60,61,62,63,64,65,68,69,70,71,72,73,76,77,78,79,80,81,84,85,86,87,88,89,90,91,92,93,94,95,98,99,100,101,102,103,106,107,108,109,110,111,114,115,116,117,118,119

Intel MPI:

Note: To print the process pinning before the application runs, set the environment variable I_MPI_DEBUG=4.

16 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,8,16,24,30,38,46,54,60,68,76,84,90,98,106,114

32 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,1,8,9,16,17,24,25,30,31,38,39,46,47,54,55,60,61,68,69,76,77,84,85,90,91,98,99,106,107,114,115

64 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27,30,31,32,33,38,39,40,41,46,47,48,49,54,55,56,57,60,61,62,63,68,69,70,71,76,77,78,79,84,85,86,87,90,91,92,93,98,99,100,101,106,107,108,109,114,115,116,117

96 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,8,9,10,11,12,13,16,17,18,19,20,21,24,25,26,27,28,29,30,31,32,33,34,35,38,39,40,41,42,43,46,47,48,49,50,51,54,55,56,57,58,59,60,61,62,63,64,65,68,69,70,71,72,73,76,77,78,79,80,81,84,85,86,87,88,89,90,91,92,93,94,95,98,99,100,101,102,103,106,107,108,109,110,111,114,115,116,117,118,119
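
The long core lists above follow a regular pattern, so they can be generated instead of typed. The sketch below assumes a layout inferred from those lists rather than from official documentation: four NUMA domains starting at cores 0, 30, 60, and 90, each with chiplet groups starting at offsets 0, 8, 16, and 24 (the last group has only 6 cores). The function name hbv3_cpu_list is made up for this example. For K=1, 2, and 4 it reproduces the 16, 32, and 64 PPN lists; K=6 selects the first six cores of every group for 96 PPN.

```shell
#!/usr/bin/env bash
# hbv3_cpu_list K: print the first K core IDs of each of the 16 chiplet
# groups on HB120rs_v3 as a comma-separated list.
# Assumed layout (inferred from the lists in this post): NUMA domains start
# at 0, 30, 60, 90 and groups start at offsets 0, 8, 16, 24 within each.
# hbv3_cpu_list is a hypothetical helper; K must be between 1 and 6.
hbv3_cpu_list() {
    local k=$1 ids=() numa grp j
    if (( k < 1 || k > 6 )); then
        echo "K must be between 1 and 6" >&2
        return 1
    fi
    for numa in 0 30 60 90; do
        for grp in 0 8 16 24; do
            for (( j = 0; j < k; j++ )); do
                ids+=( $(( numa + grp + j )) )
            done
        done
    done
    local IFS=,          # join the array elements with commas
    echo "${ids[*]}"
}

hbv3_cpu_list 1   # 16 PPN
hbv3_cpu_list 2   # 32 PPN
```

The output can be used with either MPI, for example --bind-to cpulist:ordered --cpu-set $(hbv3_cpu_list 4) or -genv I_MPI_PIN_PROCESSOR_LIST=$(hbv3_cpu_list 4).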

Posted at https://sl.advdat.com/3vEdMLA

Friday, June 18, 2021
