We are at the stage where most developers do parallel programming on multicore platforms, and we are preparing to step into the Many Integrated Core (MIC) architecture.
The Intel® Xeon Phi™ coprocessor (based on the MIC architecture) combines many Intel CPU cores into a single chip, which connects to an Intel Xeon processor (the Host) through the PCI Express bus. The Intel Xeon Phi coprocessor runs a full-service Linux* operating system and communicates with the Host.
The Intel Xeon Phi coprocessor offers high peak floating-point performance (FLOPS), wide memory bandwidth, many hardware threads running in parallel, and a robust Vector Processing Unit (VPU) that processes 512-bit SIMD instructions.
My test machine has an Intel® Core™ i7 CPU at 1.6 GHz with 6 cores and Hyper-Threading (HT) technology, and it also contains an Intel® Xeon Phi™ coprocessor running at 1.1 GHz.
In this article, I will walk through some experiments (cases with samples) that show:
- How to use Intel® C/C++ Composer XE to recompile code for the Intel Xeon Phi coprocessor
- How to run it on the MIC device and use VTune™ Amplifier XE to analyze the performance
Preparation:
- Ensure the MIC device is running: use “service mpss status” to check, and use “service mpss start” to start it if it is stopped
- Ensure Intel® C/C++ Composer XE, the Intel® MPI Library, and Intel® VTune™ Amplifier XE are installed on the system, then set up their environments. For example:
- source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64
- source /opt/intel/impi/4.1.0/bin64/mpivars.sh
- source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh
Case 1: Use OpenMP* for Pi calculation, to run on the Xeon Host and the MIC device
1. Compile and run the Pi-OMP code on the multicore system, and analyze performance
a. # icc -g -O3 -openmp -openmp-report omp_pi.c -o pi
omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
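The source of omp_pi.c is not listed in this article. As a reference, a minimal sketch that would produce the output below might look like the following (the step count is my assumption, not the original value):

/* omp_pi.c -- hypothetical sketch: integrate 4/(1+x^2) over [0,1] */
#include <stdio.h>

#define NUM_STEPS 2000000000L   /* assumed step count */

int main(void)
{
    double step = 1.0 / (double)NUM_STEPS;
    double sum = 0.0;
    long i;

    /* the OpenMP-defined loop the compiler reports as parallelized */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < NUM_STEPS; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("Computed value of Pi by using OpenMP: %.9f\n", sum * step);
    return 0;
}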
b. Run program
# time ./pi
Computed value of Pi by using OpenMP: 3.141592654
real 0m11.008s
user 2m8.496s
sys 0m0.179s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect lightweight-hotspots -- ./pi
Observing the result opened in amplxe-gui, we can see:
√ Workloads are balanced across the threads
√ Each core is fully utilized
2. Compile the Pi-OMP code on the Host, run it on the MIC, and analyze performance
a. # icc -g -O3 -mmic -openmp -openmp-report omp_pi.c -o pi
omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
# scp pi mic0:/root   (copy the program to the device)
Before running a native MIC program, you first have to copy the MIC runtime libraries to the device:
# scp -r /opt/intel/composer_xe_2013.1.117/compiler/lib/mic/* mic0:/lib64
b. Run program
# time ssh mic0 /root/pi
Computed value of Pi by using OpenMP: 3.141592654
real 0m2.524s
user 0m0.010s
sys 0m0.003s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/pi
Observing the result opened in amplxe-gui, we can see:
√ There are 241 threads working in parallel
√ Workloads are balanced across the threads
√ Each thread took ~2s, including time spent in Linux and the OpenMP library
√ The cores are not fully utilized
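By the way, if you want to experiment with how many OpenMP threads the native run uses and where they are placed, the Intel OpenMP runtime's environment variables can be passed through ssh. For example (the values below are illustrative assumptions, not tuned recommendations):
# ssh mic0 "OMP_NUM_THREADS=120 KMP_AFFINITY=scatter /root/pi"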
Case 2: Use MPI for Pi calculation, to run on the Xeon Host and the MIC device
1. Compile and run the Pi-MPI code on the multicore system, and analyze performance
a. # mpiicc -g -O3 mpi_pi.c -o pi
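The source of mpi_pi.c is likewise not listed here. A minimal sketch consistent with the output below (step count assumed) could be:

/* mpi_pi.c -- hypothetical sketch: each rank integrates a slice of
   4/(1+x^2); the partial sums are combined with MPI_Reduce. */
#include <stdio.h>
#include <mpi.h>

#define NUM_STEPS 2000000000L   /* assumed step count */

int main(int argc, char *argv[])
{
    int rank, size;
    long i;
    double x, local = 0.0, pi = 0.0;
    double step = 1.0 / (double)NUM_STEPS;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    /* round-robin distribution of the iterations across the ranks */
    for (i = rank; i < NUM_STEPS; i += size) {
        x = (i + 0.5) * step;
        local += 4.0 / (1.0 + x * x);
    }
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0) {
        printf("Computed value of Pi by using MPI: %.9f\n", pi * step);
        printf("Elapsed time: %.2f seconds\n", t1 - t0);
    }
    MPI_Finalize();
    return 0;
}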
b. Run program
# time mpirun -n 12 ./pi
Computed value of Pi by using MPI: 3.141592654
Elapsed time: 21.72 seconds
real 0m21.760s
user 4m20.592s
sys 0m0.104s
c. Use VTune™ Amplifier XE to analyze (note that lightweight-hotspots is not supported for an MPI program on a single node, because the PMU resource cannot be reused across the processes)
# mpirun -n 12 amplxe-cl -r mpi_res_host -collect hotspots -- ./pi
(There will be 12 result directories generated, one for each of the 12 processes; you can pick any one to analyze.)
Observing the result opened in amplxe-gui, we can see:
√ The 12 MPI ranks ran on 12 cores respectively, with each core fully utilized
√ Each core ran a single thread
2. Compile the Pi-MPI code on the Host, run it on the MIC, and analyze performance
a. # mpiicc -g -O3 -mmic mpi_pi.c -o pi
# scp pi mic0:/root   (copy the program to the device)
Copy the Intel MPI binaries and libraries onto the device:
# scp /opt/intel/impi/4.1.0.024/mic/bin/* mic0:/bin
# scp /opt/intel/impi/4.1.0.024/mic/lib/* mic0:/lib64
b. Run program
# time ssh mic0 /bin/mpiexec -n 240 /root/pi
Computed value of Pi by using MPI: 3.141592654
Elapsed time: 14.95 seconds
real 0m19.570s
user 0m0.010s
sys 0m0.003s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots -r mpi_res_target --search-dir all:rp=./ -- ssh mic0 /bin/mpiexec -n 240 /root/pi
(This is quite different from the Host: all the threads' info is stored in a single result directory.)
Observing the result opened in amplxe-gui, we can see:
√ Most of the time, the MPI processes are working in parallel
√ All cores are fully utilized most of the time
√ The Pi calculation itself takes only ~13s
√ But vmlinux and the runtime libraries take more time, probably the “reduction” work between the ranks
Case 3: Use Threading Building Blocks (TBB) for Pi calculation, to run on the Xeon Host and the MIC device
1. Compile and run the Pi-TBB code on the multicore system, and analyze performance
a. # icpc -g -O3 -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x -ltbb_debug tbb_pi.cpp -o pi
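The source of tbb_pi.cpp is not listed either; a minimal sketch using the classic tbb::parallel_reduce body form, where operator() does the partial integration (step count assumed), might look like this:

// tbb_pi.cpp -- hypothetical sketch: split the integration range with
// tbb::parallel_reduce; operator() accumulates the partial sums.
#include <cstdio>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

const long NUM_STEPS = 2000000000L;   // assumed step count

struct PiBody {
    double sum;
    PiBody() : sum(0.0) {}
    PiBody(PiBody &, tbb::split) : sum(0.0) {}   // splitting constructor
    void operator()(const tbb::blocked_range<long> &r) {
        double step = 1.0 / (double)NUM_STEPS;
        for (long i = r.begin(); i != r.end(); ++i) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
    }
    void join(const PiBody &other) { sum += other.sum; }   // merge partial sums
};

int main()
{
    PiBody body;
    tbb::parallel_reduce(tbb::blocked_range<long>(0, NUM_STEPS), body);
    std::printf("Computed value of Pi by using TBB: %.9f\n",
                body.sum / (double)NUM_STEPS);
    return 0;
}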
b. Run program
# time ./pi
Computed value of Pi by using TBB: 3.141592654
real 0m10.887s
user 2m9.637s
sys 0m0.008s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect lightweight-hotspots -- ./pi
Observing the result opened in amplxe-gui, we can see:
√ operator() takes ~10s on each thread
√ Workloads are balanced across the threads
√ All 12 cores are fully utilized by the 12 threads
2. Compile the Pi-TBB code on the Host, run it on the MIC, and analyze performance
a. # icpc -g -O3 -mmic -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x tbb_pi.cpp /opt/intel/composer_xe_2013.1.117/tbb/lib/mic/libtbb_debug.so.2 -o pi -lpthread
# scp pi mic0:/root   (copy the program to the device)
Also, the TBB libraries need to be copied to the MIC device:
# scp -r /opt/intel/composer_xe_2013.1.117/tbb/lib/mic/* mic0:/lib64
b. Run program
# time ssh mic0 /root/pi
Computed value of Pi by using TBB: 3.141592654
real 0m3.265s
user 0m0.010s
sys 0m0.003s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/pi
Observing the result opened in amplxe-gui, we can see:
√ There are 166 threads working in parallel
√ Workloads are balanced across the threads
√ Each thread took ~3.25s, including time spent in operator(), Linux, and the TBB library
√ The cores are not fully utilized
Case 4: Use OpenMP* for a matrix application, to run on the Xeon Host and the MIC device
1. Compile and run the matrix code on the multicore system, and analyze performance
a. # icc -g -O3 -openmp -openmp-report -vec-report matrix.c -o matrix
matrix.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
matrix.c(18): (col. 12) remark: PERMUTED LOOP WAS VECTORIZED
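The source of matrix.c is not shown; based on the compiler remarks above, a minimal sketch of the kind of code involved (the matrix size and initialization are my assumptions) could be:

/* matrix.c -- hypothetical sketch: naive matrix multiplication with an
   OpenMP-parallelized outer loop; icc may permute and vectorize the
   inner loops, as the "PERMUTED LOOP WAS VECTORIZED" remark suggests. */
#include <stdio.h>

#define N 2048   /* assumed matrix order */

static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    int i, j, k;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0;
        }

    /* the OpenMP-defined loop reported as parallelized */
    #pragma omp parallel for private(j, k)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];

    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}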
b. Run program
# time ./matrix
real 0m7.408s
user 1m17.586s
sys 0m0.344s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect lightweight-hotspots -- ./matrix
Observing the result opened in amplxe-gui, we can see:
√ Workloads are balanced across the threads
√ Each core is fully utilized (~1,200% CPU for 6 cores with HT)
2. Compile the matrix code on the Host, run it on the MIC, and analyze performance
a. # icc -g -O3 -mmic -openmp -openmp-report -vec-report matrix.c -o matrix
matrix.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
matrix.c(18): (col. 12) remark: PERMUTED LOOP WAS VECTORIZED
# scp matrix mic0:/root   (copy the program to the device)
(If you have not already done so, copy the MIC runtime libraries to the device before running the native MIC program.)
b. Run program
# time ssh mic0 /root/matrix
real 0m1.695s
user 0m0.008s
sys 0m0.007s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/matrix
Observing the result opened in amplxe-gui, we can see:
√ There are 242 threads working in parallel
√ Workloads are balanced across the threads
√ Each thread took ~1.08s, including time spent in Linux and the OpenMP library
√ The cores are not fully utilized
Conclusion:
Your HPC applications might be well suited to running on the Intel Xeon Phi coprocessor, so it is time to start working on the MIC architecture. Intel C/C++ Composer XE helps you generate code for the MIC, and VTune Amplifier XE helps you analyze its performance.