We are at the stage where most developers do parallel programming on multicore platforms, and we are preparing to step into the Many Integrated Core (MIC) architecture.
The Intel® Xeon Phi™ coprocessor (based on the MIC architecture) combines many Intel CPU cores into a single chip, which connects to an Intel Xeon processor (the Host) through the PCI Express bus. The Intel Xeon Phi coprocessor runs a full-service Linux* operating system and communicates with the Host.
The Intel Xeon Phi coprocessor offers high peak floating-point performance (FLOPS), wide memory bandwidth, many hardware threads running in parallel, and a robust Vector Processing Unit (VPU) that processes 512-bit SIMD instructions.
My test machine has an Intel® Core™ i7 CPU at 1.6 GHz with 6 cores and Hyper-Threading (HT) technology, and it also contains an Intel® Xeon Phi™ coprocessor running at 1.1 GHz.
In this article, I will walk through some experiments (cases with samples) that show:
- How to use Intel® C/C++ Composer XE to recompile code for the Intel Xeon Phi coprocessor
- How to run it on the MIC device and use VTune™ Amplifier XE to analyze the performance
Preparation:
- Ensure the MIC device is running: use “service mpss status” to check, and use “service mpss start” to start it if it is stopped
- Ensure Intel® C/C++ Composer XE, the Intel® MPI Library, and Intel® VTune™ Amplifier XE are installed on the system, then set up their environments. For example:
- source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64
- source /opt/intel/impi/4.1.0/bin64/mpivars.sh
- source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh
Case 1: Use OpenMP* for Pi calculation, to run on the Xeon Host and the MIC device
1. Compile and run the Pi-OMP code on the multicore system, and analyze performance
a. # icc -g -O3 -openmp -openmp-report omp_pi.c -o pi
omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
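The source of omp_pi.c is not listed in this article. As a reference, a minimal sketch that would produce the output below might look like the following (the step count is my assumption, not the original value):

/* omp_pi.c -- hypothetical sketch: integrate 4/(1+x^2) over [0,1] */
#include <stdio.h>

#define NUM_STEPS 2000000000L   /* assumed step count */

int main(void)
{
    double step = 1.0 / (double)NUM_STEPS;
    double sum = 0.0;
    long i;

    /* the OpenMP-defined loop the compiler reports as parallelized */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < NUM_STEPS; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("Computed value of Pi by using OpenMP: %.9f\n", sum * step);
    return 0;
}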
b. Run program
# time ./pi
Computed value of Pi by using OpenMP: 3.141592654
real 0m11.008s
user 2m8.496s
sys 0m0.179s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect lightweight-hotspots -- ./pi
Observing the result opened in amplxe-gui, we can see:
√ Workloads are balanced across the threads
√ Each core is fully utilized
2. Compile the Pi-OMP code on the Host, run it on the MIC, and analyze performance
a. # icc -g -O3 -mmic -openmp -openmp-report omp_pi.c -o pi
omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
# scp pi mic0:/root   (copy the program to the device)
Before running a native MIC program, you first have to copy the MIC runtime libraries to the device:
# scp -r /opt/intel/composer_xe_2013.1.117/compiler/lib/mic/* mic0:/lib64
b. Run program
# time ssh mic0 /root/pi
Computed value of Pi by using OpenMP: 3.141592654
real 0m2.524s
user 0m0.010s
sys 0m0.003s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/pi
Observing the result opened in amplxe-gui, we can see:
√ There are 241 threads working in parallel
√ Workloads are balanced across the threads
√ Each thread took ~2s, including time spent in Linux and the OpenMP library
√ The cores are not fully utilized
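By the way, if you want to experiment with how many OpenMP threads the native run uses and where they are placed, the Intel OpenMP runtime's environment variables can be passed through ssh. For example (the values below are illustrative assumptions, not tuned recommendations):
# ssh mic0 "OMP_NUM_THREADS=120 KMP_AFFINITY=scatter /root/pi"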
Case 2: Use MPI for Pi calculation, to run on the Xeon Host and the MIC device
1. Compile and run the Pi-MPI code on the multicore system, and analyze performance
a. # mpiicc -g -O3 mpi_pi.c -o pi
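The source of mpi_pi.c is likewise not listed here. A minimal sketch consistent with the output below (step count assumed) could be:

/* mpi_pi.c -- hypothetical sketch: each rank integrates a slice of
   4/(1+x^2); the partial sums are combined with MPI_Reduce. */
#include <stdio.h>
#include <mpi.h>

#define NUM_STEPS 2000000000L   /* assumed step count */

int main(int argc, char *argv[])
{
    int rank, size;
    long i;
    double x, local = 0.0, pi = 0.0;
    double step = 1.0 / (double)NUM_STEPS;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    /* round-robin distribution of the iterations across the ranks */
    for (i = rank; i < NUM_STEPS; i += size) {
        x = (i + 0.5) * step;
        local += 4.0 / (1.0 + x * x);
    }
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0) {
        printf("Computed value of Pi by using MPI: %.9f\n", pi * step);
        printf("Elapsed time: %.2f seconds\n", t1 - t0);
    }
    MPI_Finalize();
    return 0;
}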
b. Run program
# time mpirun -n 12 ./pi
Computed value of Pi by using MPI: 3.141592654
Elapsed time: 21.72 seconds
real 0m21.760s
user 4m20.592s
sys 0m0.104s
c. Use VTune™ Amplifier XE to analyze (note that lightweight-hotspots is not supported for an MPI program on a single node, because the PMU resource cannot be reused across the processes)
# mpirun -n 12 amplxe-cl -r mpi_res_host -collect hotspots -- ./pi
(There will be 12 result directories generated, one for each of the 12 processes; you can pick any one to analyze.)
Observing the result opened in amplxe-gui, we can see:
√ The 12 MPI ranks ran on 12 cores respectively, with each core fully utilized
√ Each core ran a single thread
2. Compile the Pi-MPI code on the Host, run it on the MIC, and analyze performance
a. # mpiicc -g -O3 -mmic mpi_pi.c -o pi
# scp pi mic0:/root   (copy the program to the device)
Copy the Intel MPI binaries and libraries onto the device:
# scp /opt/intel/impi/4.1.0.024/mic/bin/* mic0:/bin
# scp /opt/intel/impi/4.1.0.024/mic/lib/* mic0:/lib64
b. Run program
# time ssh mic0 /bin/mpiexec -n 240 /root/pi
Computed value of Pi by using MPI: 3.141592654
Elapsed time: 14.95 seconds
real 0m19.570s
user 0m0.010s
sys 0m0.003s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots -r mpi_res_target --search-dir all:rp=./ -- ssh mic0 /bin/mpiexec -n 240 /root/pi
(This is quite different from the Host: all the threads' info is stored in a single result directory.)
Observing the result opened in amplxe-gui, we can see:
√ Most of the time, the MPI processes are working in parallel
√ All cores are fully utilized most of the time
√ The Pi calculation itself takes only ~13s
√ But vmlinux and the runtime libraries take more time, probably the “reduction” work between the ranks
Case 3: Use Threading Building Blocks (TBB) for Pi calculation, to run on the Xeon Host and the MIC device
1. Compile and run the Pi-TBB code on the multicore system, and analyze performance
a. # icpc -g -O3 -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x -ltbb_debug tbb_pi.cpp -o pi
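The source of tbb_pi.cpp is not listed either; a minimal sketch using the classic tbb::parallel_reduce body form, where operator() does the partial integration (step count assumed), might look like this:

// tbb_pi.cpp -- hypothetical sketch: split the integration range with
// tbb::parallel_reduce; operator() accumulates the partial sums.
#include <cstdio>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

const long NUM_STEPS = 2000000000L;   // assumed step count

struct PiBody {
    double sum;
    PiBody() : sum(0.0) {}
    PiBody(PiBody &, tbb::split) : sum(0.0) {}   // splitting constructor
    void operator()(const tbb::blocked_range<long> &r) {
        double step = 1.0 / (double)NUM_STEPS;
        for (long i = r.begin(); i != r.end(); ++i) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
    }
    void join(const PiBody &other) { sum += other.sum; }   // merge partial sums
};

int main()
{
    PiBody body;
    tbb::parallel_reduce(tbb::blocked_range<long>(0, NUM_STEPS), body);
    std::printf("Computed value of Pi by using TBB: %.9f\n",
                body.sum / (double)NUM_STEPS);
    return 0;
}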
b. Run program
# time ./pi
Computed value of Pi by using TBB: 3.141592654
real 0m10.887s
user 2m9.637s
sys 0m0.008s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect lightweight-hotspots -- ./pi
Observing the result opened in amplxe-gui, we can see:
√ operator() takes ~10s on each thread
√ Workloads are balanced across the threads
√ All 12 cores are fully utilized by the 12 threads
2. Compile the Pi-TBB code on the Host, run it on the MIC, and analyze performance
a. # icpc -g -O3 -mmic -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x tbb_pi.cpp /opt/intel/composer_xe_2013.1.117/tbb/lib/mic/libtbb_debug.so.2 -o pi -lpthread
# scp pi mic0:/root   (copy the program to the device)
Also, the TBB libraries need to be copied to the MIC device:
# scp -r /opt/intel/composer_xe_2013.1.117/tbb/lib/mic/* mic0:/lib64
b. Run program
# time ssh mic0 /root/pi
Computed value of Pi by using TBB: 3.141592654
real 0m3.265s
user 0m0.010s
sys 0m0.003s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/pi
Observing the result opened in amplxe-gui, we can see:
√ There are 166 threads working in parallel
√ Workloads are balanced across the threads
√ Each thread took ~3.25s, including time spent in operator(), Linux, and the TBB library
√ The cores are not fully utilized
Case 4: Use OpenMP* for a matrix application, to run on the Xeon Host and the MIC device
1. Compile and run the matrix code on the multicore system, and analyze performance
a. # icc -g -O3 -openmp -openmp-report -vec-report matrix.c -o matrix
matrix.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
matrix.c(18): (col. 12) remark: PERMUTED LOOP WAS VECTORIZED
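The source of matrix.c is not shown; based on the compiler remarks above, a minimal sketch of the kind of code involved (the matrix size and initialization are my assumptions) could be:

/* matrix.c -- hypothetical sketch: naive matrix multiplication with an
   OpenMP-parallelized outer loop; icc may permute and vectorize the
   inner loops, as the "PERMUTED LOOP WAS VECTORIZED" remark suggests. */
#include <stdio.h>

#define N 2048   /* assumed matrix order */

static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    int i, j, k;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0;
        }

    /* the OpenMP-defined loop reported as parallelized */
    #pragma omp parallel for private(j, k)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];

    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}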
b. Run program
# time ./matrix
real 0m7.408s
user 1m17.586s
sys 0m0.344s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect lightweight-hotspots -- ./matrix
Observing the result opened in amplxe-gui, we can see:
√ Workloads are balanced across the threads
√ Each core is fully utilized (~1,200% CPU for 6 cores with HT)
2. Compile the matrix code on the Host, run it on the MIC, and analyze performance
a. # icc -g -O3 -mmic -openmp -openmp-report -vec-report matrix.c -o matrix
matrix.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
matrix.c(18): (col. 12) remark: PERMUTED LOOP WAS VECTORIZED
# scp matrix mic0:/root   (copy the program to the device)
(If you have not already done so, copy the MIC runtime libraries to the device before running the native MIC program.)
b. Run program
# time ssh mic0 /root/matrix
real 0m1.695s
user 0m0.008s
sys 0m0.007s
c. Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/matrix
Observing the result opened in amplxe-gui, we can see:
√ There are 242 threads working in parallel
√ Workloads are balanced across the threads
√ Each thread took ~1.08s, including time spent in Linux and the OpenMP library
√ The cores are not fully utilized
Conclusion:
Your HPC applications might be well suited to running on the Intel Xeon Phi coprocessor, so it is time to start working on the MIC architecture. Intel C/C++ Composer XE helps you generate code for the MIC, and VTune Amplifier XE helps you analyze its performance.