When developing multithreaded applications, the user should protect the critical (sensitive) code regions that threads execute, so that threads access shared memory without data conflicts. Most of the time, the user relies on a critical section, mutex, semaphore, atomic operation, event, or another kind of “lock” to protect a critical region and make it non-reentrant.
If one thread holds a lock while it accesses shared memory, other threads must wait at the entry of the critical region until that thread releases the lock. The user should therefore take care to protect every critical region where a data conflict might occur. VTune(TM) Amplifier XE provides an analysis type named Locks and Waits, which profiles and reports the wait time and wait count of synchronization objects.
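As a minimal, portable sketch of the lock-based approach described above, the following uses std::mutex from standard C++ instead of a Windows CRITICAL_SECTION. The function name add_under_lock and its parameters are made up for illustration and do not come from the example later in this article.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: each of num_threads threads adds `value` to a shared
// total `repeats` times, taking a mutex around every update.
double add_under_lock(int num_threads, int repeats, double value)
{
    double total = 0.0;     // shared data
    std::mutex total_lock;  // protects `total`

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int r = 0; r < repeats; ++r) {
                // Enter the critical region; other threads block here.
                std::lock_guard<std::mutex> guard(total_lock);
                total += value;
            }   // guard goes out of scope and releases the lock
        });
    }
    for (std::thread &w : workers) {
        w.join();           // wait for every worker before reading the result
    }
    return total;
}
```

Calling add_under_lock(4, 1000, 1.0) should return 4000.0 on every run. Without the lock_guard line the updates would race and the result could come out lower.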
Intel® Transactional Synchronization Extensions (Intel® TSX) allow the processor to protect critical sections without serializing threads on a lock. With transactional memory, the processor executes a critical section speculatively as a transaction, so threads do not need to acquire a lock when they manipulate shared data, as long as the transaction commits without a conflict. This technology is supported in Haswell processors. Intel TSX offers two software interfaces: Hardware Lock Elision (HLE), which adds instruction prefixes to existing lock-based code so that the processor can elide the lock, and Restricted Transactional Memory (RTM), which adds instructions that explicitly begin and end a transaction. Transactional memory is intended to make parallel programming easier.
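Because Intel TSX is only available on some processors, a program may want to check for it before executing any transactional instruction. The sketch below reads the RTM feature flag (CPUID leaf 7, subleaf 0, EBX bit 11). The function name cpu_has_rtm is hypothetical, and the code assumes either the Microsoft-compatible __cpuid/__cpuidex intrinsics or the __get_cpuid_count helper from the GCC/Clang cpuid.h header.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Hypothetical helper: returns true when CPUID reports RTM support.
bool cpu_has_rtm()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 0);                    // regs[0] = highest supported leaf
    if (regs[0] < 7) return false;
    __cpuidex(regs, 7, 0);               // structured extended feature flags
    return (regs[1] & (1 << 11)) != 0;   // EBX bit 11 = RTM
#elif defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return (ebx & (1u << 11)) != 0;      // EBX bit 11 = RTM
#else
    return false;                        // not an x86 processor
#endif
}
```

A program could call cpu_has_rtm() once at startup and use its lock-based path when the function returns false.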
VTune Amplifier XE 2015 Beta provides two metrics: Transactional Cycles and Abort Cycles.
Transactional Cycles - measures the number of cycles spent in transactions.
Abort Cycles - measures the number of cycles spent in transactions that were eventually aborted. A transaction can abort for many reasons. One is a data conflict: one logical processor reads a location inside a transactional region while another logical processor writes to it, so the transaction cannot complete. Intel TSX detects data conflicts at the granularity of a cache line, so unrelated variables that share a cache line can also conflict. A transaction may also abort because of limited transactional resources, for example when the amount of data accessed in the region exceeds an implementation-specific capacity. Some instructions, such as CPUID and I/O instructions, may also cause a transactional execution to abort. In general, transactional aborts cause wasted cycles.
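When a transaction started with _xbegin() aborts, the value that _xbegin() returns encodes the abort cause, which maps onto the reasons listed above. The following sketch decodes such a status word into text. The constants mirror the bit positions of the _XABORT_* macros in immintrin.h but are redefined here under made-up names so that the sketch builds without TSX support; describe_abort is a hypothetical helper, not part of any Intel API.

```cpp
#include <string>

// Bit positions of the RTM abort status (same layout as the _XABORT_* macros).
const unsigned ABORT_EXPLICIT = 1u << 0; // aborted by an explicit _xabort()
const unsigned ABORT_RETRY    = 1u << 1; // the transaction may succeed on retry
const unsigned ABORT_CONFLICT = 1u << 2; // data conflict with another logical processor
const unsigned ABORT_CAPACITY = 1u << 3; // transactional buffer capacity exceeded
const unsigned ABORT_DEBUG    = 1u << 4; // debug breakpoint was hit
const unsigned ABORT_NESTED   = 1u << 5; // abort occurred in a nested transaction

// Hypothetical helper: turns an abort status into a space-separated description.
// Only meaningful for abort statuses, not for the "transaction started" value.
std::string describe_abort(unsigned status)
{
    std::string text;
    if (status & ABORT_EXPLICIT) {
        // Bits 31:24 carry the 8-bit code that was passed to _xabort().
        text += "explicit(code " + std::to_string((status >> 24) & 0xffu) + ") ";
    }
    if (status & ABORT_RETRY)    text += "retry ";
    if (status & ABORT_CONFLICT) text += "conflict ";
    if (status & ABORT_CAPACITY) text += "capacity ";
    if (status & ABORT_DEBUG)    text += "debug ";
    if (status & ABORT_NESTED)   text += "nested ";
    if (text.empty()) return "other";    // no flag set, e.g. an unsupported instruction
    text.pop_back();                     // drop the trailing space
    return text;
}
```

A conflict abort that is worth retrying, for instance, has both the retry and conflict bits set, which this helper reports as "retry conflict".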
Here is a simple example that uses Intel TSX with VTune Amplifier XE 2015 Beta.
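//=====Pisolution.cpp=====
#include <stdio.h>
#include <windows.h>
#include <time.h>
#include "immintrin.h" // Intel TSX (RTM) intrinsics: _xbegin, _xend, _xabort

const long num_steps = 800000000;
const int num_threads = 16;
double step = 0.0, pi = 0.0;
static CRITICAL_SECTION cs;
#ifdef TSX
const int max_tsx_retries = 8;         // transactional attempts before using the fallback lock
static volatile LONG fallbackLock = 0; // 0 = free, 1 = held by a thread in the fallback path
#endif

DWORD WINAPI threadFunction(LPVOID pArg)
{
    double partialSum = 0.0, x; // local to each thread
    int myNum = *((int *)pArg);
    for ( int i=myNum; i<num_steps; i+=num_threads ) // use every num_threads step
    {
        x = (i + 0.5)*step;
        partialSum += 4.0 / (1.0 + x*x); // compute partial sums at each thread
    }
#ifdef TSX
    // A transaction can abort at any time; _xbegin() then returns an abort status
    // instead of _XBEGIN_STARTED, so retry a few times and then fall back to a lock.
    bool committed = false;
    for ( int attempt=0; attempt<max_tsx_retries && !committed; ++attempt )
    {
        while ( fallbackLock ) _mm_pause(); // wait while a thread is in the fallback path
        if ( _xbegin() == _XBEGIN_STARTED ) // start of the transactional region
        {
            if ( fallbackLock ) _xabort(0xff); // reading the lock lets a fallback writer abort us
            pi += partialSum * step; // add partial to global final answer
            _xend(); // commit the transaction
            committed = true;
        }
    }
    if ( !committed ) // fallback path: a simple spin lock
    {
        while ( InterlockedExchange(&fallbackLock, 1) ) _mm_pause();
        pi += partialSum * step;
        InterlockedExchange(&fallbackLock, 0);
    }
#else
    EnterCriticalSection(&cs); // traditional mechanism to protect the critical region
    pi += partialSum * step; // add partial to global final answer
    LeaveCriticalSection(&cs);
#endif
    return 0;
}

void WinThread_Pi()
{
    HANDLE threadHandles[num_threads];
    int tNum[num_threads];
    InitializeCriticalSection(&cs);
    for ( int i=0; i<num_threads; ++i )
    {
        tNum[i] = i;
        threadHandles[i] = CreateThread( NULL,             // Security attributes
                                         0,                // Stack size
                                         threadFunction,   // Thread function
                                         (LPVOID)&tNum[i], // Data for thread func()
                                         0,                // Thread start mode
                                         NULL);            // Returned thread ID
    }
    WaitForMultipleObjects(num_threads, threadHandles, TRUE, INFINITE);
    for ( int i=0; i<num_threads; ++i )
        CloseHandle(threadHandles[i]); // release the thread handles
    DeleteCriticalSection(&cs);
}

int main()
{
    clock_t start, stop;
    // Computing pi by using Windows Threads
    pi = 0.0;
    step = 1.0 / (double)num_steps;
    start = clock();
    WinThread_Pi();
    stop = clock();
    printf ("Computed value of Pi by using WinThreads: %12.9f\n", pi);
    printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);
    return 0;
}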
Set up the C++ Composer XE and VTune Amplifier XE environments:
"C:\Program Files (x86)\Intel\VTune Amplifier XE 2015\amplxe-vars.bat"
"C:\Program Files (x86)\Intel\Composer XE 2015.0.070\bin\compilervars.bat" Intel64
Build two binaries, one that uses a “lock” and one that uses Intel TSX:
icl /Zi /Qno-inline-function /DEBUG Pisolution.cpp /o Pi.exe
icl /Zi /Qxcore-avx2 /Qno-inline-function /DTSX /DEBUG Pisolution.cpp /o Pi_tsx.exe
Test performance by running Pi.exe and Pi_tsx.exe and comparing the elapsed time that each one prints.
Then run the tsx-exploration analysis on Pi_tsx.exe:
amplxe-cl -collect tsx-exploration -- Pi_tsx.exe
View the report in VTune Amplifier XE.
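Notes:
1. Include the Composer XE intrinsics header file, immintrin.h.
2. Use _xbegin() and _xend() to mark the transactional code region, and check the value that _xbegin() returns. The region runs transactionally only when _xbegin() returns _XBEGIN_STARTED; after an abort, execution resumes at _xbegin() with an abort status, so the code needs a fallback path.
3. As a limitation, VTune Amplifier XE 2015 Beta does not yet support tsx-exploration with stack sampling enabled.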