Traditional OpenMP* is a fork-join parallel programming technology. The program starts with a single master thread running serial code; when it reaches a parallel region, the master thread forks a team of threads and distributes sub-tasks among them. The master thread then waits at a barrier until all threads have completed their sub-tasks, the team is terminated, and the master thread continues running serial code.
Usually we optimize OpenMP code in these areas:
1. Reduce the serial portion where possible.
2. Balance the sub-task workloads across threads.
3. Reduce wait time and wait counts; choosing the right granularity for sub-tasks is sometimes important here.
4. Reduce unnecessary locks.
All of the above are tuning at the algorithm level; a small sketch illustrating points 3 and 4 follows below.
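As a toy illustration (not taken from this article), the fragment below uses a reduction clause instead of a critical section, so each thread accumulates a private partial sum with no lock, and an explicit chunk size to control sub-task granularity. The loop body, problem size, and chunk size are arbitrary placeholders:

/* illustration only: reduction avoids a per-iteration lock,
   and the chunk size controls sub-task granularity */
#include <stdio.h>

#define N 10000000

int main(void)
{
    double sum = 0.0;
    int i;

    #pragma omp parallel for reduction(+:sum) schedule(dynamic, 10000)
    for (i = 0; i < N; i++)
        sum += 1.0 / (double)(i + 1);

    printf("sum = %f\n", sum);
    return 0;
}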
In this article, three development tools from Intel® Cluster Studio XE will be used: the Intel® C++ Compiler, the Intel® MPI Library, and Intel® VTune™ Amplifier XE.
First of all, set up the tools’ environment:
> source /opt/intel/ics/2015.0.3.032/composer_xe_2015/bin/compilervars.sh intel64
> source /opt/intel/ics/2015.0.3.032/impi_latest/bin64/mpivars.sh
> source /opt/intel/ics/2015.0.3.032/vtune_amplifier_xe_2015/amplxe-vars.sh
I used Intel Cluster Studio XE 2015.0.3.032, which includes:
icc version 15.0.3 (gcc version 4.4.7 compatibility)
Intel(R) VTune(TM) Amplifier XE 2015 Update 3 (build 403110)
Intel(R) MPI Library for Linux* OS, Version 5.0 Update 3 Build 20150128 (build id: 11250)
Here is a simple program that computes Pi with an OpenMP implementation (pi_omp.c).
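The original listing of pi_omp.c is not reproduced here; a minimal sketch of how such a program is commonly written (numerical integration of 4/(1+x²), with the step count as an assumed placeholder) might look like this:

/* pi_omp.c -- minimal sketch; the article's actual listing may differ */
#include <stdio.h>

#define NUM_STEPS 100000000

int main(void)
{
    double step = 1.0 / (double)NUM_STEPS;
    double sum = 0.0;
    int i;

    /* each thread accumulates a private partial sum via the reduction */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < NUM_STEPS; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("Computed value of Pi by using OpenMP: %.9f\n", step * sum);
    return 0;
}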
Build the program with the Intel C++ compiler:
> icc -g -O2 -openmp pi_omp.c -o pi_omp
Run the program on the cluster:
> time OMP_NUM_THREADS=4 ./pi_omp
Computed value of Pi by using OpenMP: 3.141592654
real 0m0.147s
Run VTune Amplifier XE
> OMP_NUM_THREADS=4 amplxe-cl -c advanced-hotspots -knob collection-detail=stack-and-callcount -- ./pi_omp
In the Bottom-up report, change the “Grouping:” selector to “OpenMP Region / Thread / Function / Call Stack” and observe the “Potential Gain” column, which shows how much time is lost to imbalance or serial spinning. In this case, the theoretical potential gain was only 4.2%.
Another approach is to use MPI hybrid code to compute Pi in parallel (pi_mpi.c). This MPI application can run as multiple processes on a cluster system.
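Again, the original listing of pi_mpi.c is not reproduced here. Since the build command below passes no OpenMP flag, the following is a minimal pure-MPI sketch of the same integration; the step count and the interleaved work decomposition are assumptions, and the author's actual (hybrid) listing may differ:

/* pi_mpi.c -- minimal sketch; the article's actual listing may differ */
#include <stdio.h>
#include <mpi.h>

#define NUM_STEPS 100000000

int main(int argc, char *argv[])
{
    int rank, size, i;
    double step, x, local_sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    step = 1.0 / (double)NUM_STEPS;

    /* each rank integrates an interleaved subset of the steps */
    for (i = rank; i < NUM_STEPS; i += size) {
        x = (i + 0.5) * step;
        local_sum += 4.0 / (1.0 + x * x);
    }

    /* combine the per-rank partial sums on rank 0 */
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Computed value of Pi by using Intel MPI: %.9f\n", step * pi);

    MPI_Finalize();
    return 0;
}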
Build the program with the Intel C++ compiler and Intel MPI Library v5.0.2 or above:
> mpiicc -g -O2 pi_mpi.c -o pi_mpi
Run the program on the cluster:
> time mpirun -n 8 ./pi_mpi
Computed value of Pi by using Intel MPI: 3.141592654
real 0m0.212s
Run VTune Amplifier XE
> mpirun -gtool "amplxe-cl -r result3 -c advanced-hotspots:all=node-wide" -n 8 ./pi_mpi
Let’s see the results in the VTune Amplifier XE report.
I started eight processes of the MPI hybrid code, and the workloads of the eight processes were in fact balanced. You can drill down to the function level to find hotspots and further narrow down to the source level.
Observe that the inter-process communication traffic going through the pmi_proxy process was low.
Next, you may want to know the CPU consumption on this cluster system while the MPI program is running.
Based on the report above, you can see that two threads were running on each of cores 5, 9, and 13; nearly two threads’ worth of work ran on core 1, while core 2 carried only a little work.
In order to pin each of the eight processes to its own physical core (the nodes in the system had Hyper-Threading enabled), I used:
> I_MPI_PIN_DOMAIN=2 mpirun -gtool "amplxe-cl -r result5 -c advanced-hotspots:all=node-wide" -n 8 ./pi_mpi
As a result, the eight processes ran on eight physical cores and the workloads were balanced.