Traditional OpenMP* is a fork-join parallel programming technology. The program starts with a single master thread running serial code; when it reaches a parallel region, the master thread forks a team of threads and distributes sub-tasks among them. The master thread then waits at a barrier until all threads have completed their sub-tasks, the team is terminated, and the master thread continues running serial code.
Usually we optimize OpenMP code in these areas:
1. Reduce the serial portion where possible.
2. Balance the sub-task workloads across threads.
3. Reduce wait time and wait counts; choosing the right granularity for sub-tasks is sometimes important here.
4. Reduce unnecessary locks.
All of the above are tuning at the algorithm level; a small sketch illustrating points 3 and 4 follows below.
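As a toy illustration (not taken from this article), the fragment below uses a reduction clause instead of a critical section, so each thread accumulates a private partial sum with no lock, and an explicit chunk size to control sub-task granularity. The loop body, problem size, and chunk size are arbitrary placeholders:

/* illustration only: reduction avoids a per-iteration lock,
   and the chunk size controls sub-task granularity */
#include <stdio.h>

#define N 10000000

int main(void)
{
    double sum = 0.0;
    int i;

    #pragma omp parallel for reduction(+:sum) schedule(dynamic, 10000)
    for (i = 0; i < N; i++)
        sum += 1.0 / (double)(i + 1);

    printf("sum = %f\n", sum);
    return 0;
}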
In this article, three development tools from Intel® Cluster Studio XE will be used: the Intel® C++ Compiler, the Intel® MPI Library, and Intel® VTune™ Amplifier XE.
First of all, set up the tools’ environment:
> source /opt/intel/ics/2015.0.3.032/composer_xe_2015/bin/compilervars.sh intel64
> source /opt/intel/ics/2015.0.3.032/impi_latest/bin64/mpivars.sh
> source /opt/intel/ics/2015.0.3.032/vtune_amplifier_xe_2015/amplxe-vars.sh
I used Intel Cluster Studio XE 2015.0.3.032, which includes:
icc version 15.0.3 (gcc version 4.4.7 compatibility)
Intel(R) VTune(TM) Amplifier XE 2015 Update 3 (build 403110)
Intel(R) MPI Library for Linux* OS, Version 5.0 Update 3 Build 20150128 (build id: 11250)
Here is a simple program that computes Pi with an OpenMP implementation (pi_omp.c).
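The original listing of pi_omp.c is not reproduced here; a minimal sketch of how such a program is commonly written (numerical integration of 4/(1+x²), with the step count as an assumed placeholder) might look like this:

/* pi_omp.c -- minimal sketch; the article's actual listing may differ */
#include <stdio.h>

#define NUM_STEPS 100000000

int main(void)
{
    double step = 1.0 / (double)NUM_STEPS;
    double sum = 0.0;
    int i;

    /* each thread accumulates a private partial sum via the reduction */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < NUM_STEPS; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("Computed value of Pi by using OpenMP: %.9f\n", step * sum);
    return 0;
}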
Build the program with the Intel C++ compiler:
> icc -g -O2 -openmp pi_omp.c -o pi_omp
Run the program on the cluster:
> time OMP_NUM_THREADS=4 ./pi_omp
Computed value of Pi by using OpenMP: 3.141592654
real 0m0.147s
Run VTune Amplifier XE
> OMP_NUM_THREADS=4 amplxe-cl -c advanced-hotspots -knob collection-detail=stack-and-callcount -- ./pi_omp
In the Bottom-up report, change the “Grouping:” selector to “OpenMP Region / Thread / Function / Call Stack” and observe the “Potential Gain” column, which shows how much time is lost to imbalance or serial spinning. In this case, the theoretical potential gain was only 4.2%.
Another approach is to use MPI hybrid code to compute Pi in parallel (pi_mpi.c). This MPI application can run as multiple processes on a cluster system.
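Again, the original listing of pi_mpi.c is not reproduced here. Since the build command below passes no OpenMP flag, the following is a minimal pure-MPI sketch of the same integration; the step count and the interleaved work decomposition are assumptions, and the author's actual (hybrid) listing may differ:

/* pi_mpi.c -- minimal sketch; the article's actual listing may differ */
#include <stdio.h>
#include <mpi.h>

#define NUM_STEPS 100000000

int main(int argc, char *argv[])
{
    int rank, size, i;
    double step, x, local_sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    step = 1.0 / (double)NUM_STEPS;

    /* each rank integrates an interleaved subset of the steps */
    for (i = rank; i < NUM_STEPS; i += size) {
        x = (i + 0.5) * step;
        local_sum += 4.0 / (1.0 + x * x);
    }

    /* combine the per-rank partial sums on rank 0 */
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Computed value of Pi by using Intel MPI: %.9f\n", step * pi);

    MPI_Finalize();
    return 0;
}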
Build the program with the Intel C++ compiler and Intel MPI Library v5.0.2 or above:
> mpiicc -g -O2 pi_mpi.c -o pi_mpi
Run the program on the cluster:
> time mpirun -n 8 ./pi_mpi
Computed value of Pi by using Intel MPI: 3.141592654
real 0m0.212s
Run VTune Amplifier XE
> mpirun -gtool "amplxe-cl -r result3 -c advanced-hotspots:all=node-wide" -n 8 ./pi_mpi
Let’s see the results in the VTune Amplifier XE report.
I started eight processes of the MPI hybrid code, and the workloads of the eight processes were in fact balanced. You can drill down to the function level to find hotspots and further narrow down to the source level.
Observe that the inter-process communication traffic going through the pmi_proxy process was low.
Next, you may want to know the CPU consumption on this cluster system while the MPI program is running.
Based on the report above, you can see that two threads were running on each of cores 5, 9, and 13; nearly two threads’ worth of work ran on core 1, while core 2 carried only a little work.
In order to pin each of the eight processes to its own physical core (the nodes in the system had Hyper-Threading enabled), I used:
> I_MPI_PIN_DOMAIN=2 mpirun -gtool "amplxe-cl -r result5 -c advanced-hotspots:all=node-wide" -n 8 ./pi_mpi
As a result, the eight processes ran on eight physical cores and the workloads were balanced.