Use VTune™ Amplifier XE 2015 to analyze MPI Hybrid code

Traditional OpenMP* follows a fork-join parallel programming model. The program starts with a single master thread executing serial code. When the master thread reaches a parallel region, it forks a team of threads and distributes sub-tasks among them; the threads run in parallel until they all complete their sub-tasks and meet at a barrier, the team is terminated, and the master thread continues with the serial code.
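As a rough illustration of the fork-join model (this snippet is not from the article's sample code), serial code runs on the master thread, a parallel region forks a team of threads, and execution joins back at the implicit barrier:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("serial code, master thread only\n");        /* serial part */

    #pragma omp parallel                                 /* fork: create a team of threads */
    {
        printf("parallel region, thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                                    /* join: implicit barrier, team ends */

    printf("serial code again, master thread only\n");  /* back to serial part */
    return 0;
}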

OpenMP code is usually optimized along these lines (see the sketch after this list for points 2 and 3):
1. Reduce the serial portion where possible.
2. Balance the workload of sub-tasks across threads.
3. Reduce wait time and wait counts; choosing the right granularity for sub-tasks is often important here.
4. Remove unnecessary locks.
All of the above are algorithm-level tuning.
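As a hedged illustration of points 2 and 3, a loop whose iterations have uneven cost may balance better with a dynamic schedule and a suitable chunk size (the chunk value 64 below is only an example, not a recommendation):

#include <stdio.h>

/* hypothetical task whose cost varies with i */
static double work(int i)
{
    double s = 0.0;
    int k;
    for (k = 0; k < i % 1000; k++)
        s += k * 0.5;
    return s;
}

int main(void)
{
    double sum = 0.0;
    int i;

    /* dynamic scheduling hands out chunks of 64 iterations as threads become free,
       which can reduce imbalance when iteration cost is irregular */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:sum)
    for (i = 0; i < 100000; i++)
        sum += work(i);

    printf("sum = %f\n", sum);
    return 0;
}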

In this article, three development tools will be used: the Intel® C++ Compiler, the Intel® MPI Library, and Intel® VTune™ Amplifier XE, all part of Intel® Cluster Studio XE.

First of all, set up the tools' environment:

> source /opt/intel/ics/2015.0.3.032/composer_xe_2015/bin/compilervars.sh intel64
> source /opt/intel/ics/2015.0.3.032/impi_latest/bin64/mpivars.sh
> source /opt/intel/ics/2015.0.3.032/vtune_amplifier_xe_2015/amplxe-vars.sh

I used Intel Cluster Studio 2015.0.3.032, which includes:
icc version 15.0.3 (gcc version 4.4.7 compatibility)
Intel(R) VTune(TM) Amplifier XE 2015 Update 3 (build 403110)
Intel(R) MPI Library for Linux* OS, Version 5.0 Update 3 Build 20150128 (build id: 11250)

Here is a simple program to compute Pi with an OpenMP implementation (pi_omp.c).
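The listing of pi_omp.c is not reproduced here; a minimal sketch of such a program, assuming the usual numerical integration of 4/(1+x*x) over [0,1] with an OpenMP reduction, could look like this:

#include <stdio.h>

#define NUM_STEPS 100000000L   /* arbitrary example value */

int main(void)
{
    double step = 1.0 / (double)NUM_STEPS;
    double sum = 0.0;
    long i;

    /* integrate 4/(1+x*x) over [0,1]; each thread accumulates into the reduction variable */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < NUM_STEPS; i++) {
        double x = ((double)i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("Computed value of Pi by using OpenMP:  %.9f\n", sum * step);
    return 0;
}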
Build this program with the Intel C++ compiler:
> icc -g -O2 -openmp pi_omp.c -o pi_omp
Run the program on the cluster:
> time OMP_NUM_THREADS=4 ./pi_omp
Computed value of Pi by using OpenMP:  3.141592654
real    0m0.147s

Run VTune Amplifier XE
> OMP_NUM_THREADS=4 amplxe-cl -c advanced-hotspots -knob collection-detail=stack-and-callcount -- ./pi_omp

In the Bottom-up report, change "Grouping:" to "OpenMP Region / Thread / Function / Call Stack" and observe "Potential Gain", which shows how much time was lost to imbalance or serial spinning. In this case there was only a 4.2% potential gain in theory.

Another approach is to use MPI hybrid code to compute Pi in parallel (pi_mpi.c). This MPI application can run as multiple processes on a cluster system.
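pi_mpi.c is not shown in the article either; assuming the same integration is split across MPI ranks and reduced to rank 0, a minimal MPI-only sketch could be (the real hybrid code may additionally spawn OpenMP threads inside each rank):

#include <stdio.h>
#include <mpi.h>

#define NUM_STEPS 100000000L   /* arbitrary example value, as above */

int main(int argc, char *argv[])
{
    int rank, size;
    long i;
    double step, local = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    step = 1.0 / (double)NUM_STEPS;

    /* each rank integrates a strided subset of the steps */
    for (i = rank; i < NUM_STEPS; i += size) {
        double x = ((double)i + 0.5) * step;
        local += 4.0 / (1.0 + x * x);
    }

    /* sum the partial results on rank 0 */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Computed value of Pi is by using Intel MPI:  %.9f\n", pi * step);

    MPI_Finalize();
    return 0;
}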
Build this program with the Intel C++ compiler and the Intel MPI Library v5.0.2 or above:
> mpiicc -g -O2 pi_mpi.c -o pi_mpi 
Run the program on the cluster:
> time mpirun -n 8 ./pi_mpi
Computed value of Pi is by using Intel MPI:  3.141592654

real    0m0.212s

Run VTune Amplifier XE

> mpirun -gtool "amplxe-cl -r result3 -c advanced-hotspots:all=node-wide" -n 8 ./pi_mpi

Let's see the results in the VTune Amplifier XE report.

I started eight MPI processes, and their workloads turned out to be balanced. You can drill down to the function level to find hotspots and then narrow down to the source level.
Observe that the communication traffic of the processes, shown under the pmi_proxy process, was low.

Next, you may want to know the CPU consumption on this cluster system when running the MPI program.

Based on the above report, you can see that two threads ran on each of cores 5, 9, and 13; core 1 carried almost two threads' worth of work, while core 2 still did a little work.

In order to run the eight processes on eight physical cores (the nodes had Hyper-Threading enabled, so each physical core exposes two logical processors), I set I_MPI_PIN_DOMAIN=2 so that each pinning domain spans two logical processors, i.e. one physical core:
> I_MPI_PIN_DOMAIN=2 mpirun -gtool "amplxe-cl -r result5 -c advanced-hotspots:all=node-wide" -n 8 ./pi_mpi

As a result, the eight processes ran on eight physical cores and the workloads were balanced.

