Using Intel® TSX with VTune(TM) Amplifier XE 2015 Beta to measure transaction time and aborts in your code

When developing multithreaded applications, the user should protect critical sections, the code regions in which threads access shared memory, so that concurrent accesses do not conflict. Most of the time this is done with a critical section, mutex, semaphore, atomic operation, event, or another kind of “lock” that keeps more than one thread from executing the protected code at the same time.

When one thread holds a lock and is accessing shared memory, other threads must wait at the entry of the critical section until that thread releases the lock. The user should therefore take care to protect every code region where a data conflict might occur. VTune(TM) Amplifier XE provides an analysis type named Locks and Waits, which profiles and reports the wait time and wait count of all synchronization objects.
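As a minimal sketch of the lock-based pattern described above, the following portable C++ function lets several threads add to one shared total under a lock. It uses std::mutex from the standard library rather than a Windows CRITICAL_SECTION, and the name locked_sum and its parameters are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads threads each add 1.0 to a shared total
// adds_per_thread times, taking the lock around every single update.
double locked_sum(int num_threads, int adds_per_thread)
{
    assert(num_threads > 0 && adds_per_thread >= 0);
    double total = 0.0;   // shared data
    std::mutex guard;     // protects total
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t)
    {
        workers.emplace_back([&] {
            for (int i = 0; i < adds_per_thread; ++i)
            {
                // Critical section: only one thread at a time gets past this line.
                std::lock_guard<std::mutex> lock(guard);
                total += 1.0;
            }
        });
    }
    for (auto& w : workers) w.join();  // wait for every thread to finish
    return total;
}
```

Every thread that reaches the lock while another thread holds it has to wait, and that waiting is the kind of time a Locks and Waits analysis reports.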

Intel® Transactional Synchronization Extensions (Intel® TSX) (read this for more information) allow the processor to execute critical sections without actually acquiring a lock. With transactional memory, threads no longer need to take out locks when manipulating shared data in memory, as long as their accesses do not conflict. This technology is supported in Haswell processors. Intel TSX offers two software interfaces: Hardware Lock Elision (HLE), which works at the instruction level by adding prefixes to existing lock code so that the processor can elide the lock, and Restricted Transactional Memory (RTM), which the example in this article uses through the _xbegin() and _xend() intrinsics. Lock-free access through transactional memory is becoming a mainstream way to make parallel programming easier.
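Because Intel TSX is available only on some processors, a program may want to check for it before executing transactional instructions. The sketch below reads CPUID leaf 7 (sub-leaf 0), where EBX bit 4 reports HLE and EBX bit 11 reports RTM. The names TsxSupport, decode_tsx_bits, and query_tsx_support are hypothetical, and the code assumes either a Microsoft-compatible compiler with intrin.h or GCC/Clang with cpuid.h.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid, __cpuidex
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>    // __get_cpuid_count
#endif

struct TsxSupport { bool hle; bool rtm; };

// Decode the EBX value returned by CPUID leaf 7, sub-leaf 0.
constexpr TsxSupport decode_tsx_bits(std::uint32_t ebx)
{
    return TsxSupport{ ((ebx >> 4) & 1u) != 0,     // bit 4: HLE
                       ((ebx >> 11) & 1u) != 0 };  // bit 11: RTM
}

// Query the current processor. Reports no support when leaf 7 is
// unavailable or when the target is not an x86 processor.
inline TsxSupport query_tsx_support()
{
    std::uint32_t ebx = 0;
#if defined(_MSC_VER)
    int regs[4] = { 0, 0, 0, 0 };
    __cpuid(regs, 0);                  // regs[0] = highest supported leaf
    if (regs[0] >= 7)
    {
        __cpuidex(regs, 7, 0);
        ebx = static_cast<std::uint32_t>(regs[1]);
    }
#elif defined(__x86_64__) || defined(__i386__)
    unsigned a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid_count(7, 0, &a, &b, &c, &d))  // returns 0 if leaf 7 is missing
        ebx = b;
#endif
    return decode_tsx_bits(ebx);
}
```

A program could call query_tsx_support() once at startup and take the lock-based path whenever rtm is false.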

VTune Amplifier XE 2015 Beta provides two metrics: Transactional Cycles and Abort Cycles.

Transactional Cycles - measures the number of cycles spent in transactions.
Abort Cycles - measures the number of cycles spent in transactions that were eventually aborted. Transactions abort for many reasons. One is a data conflict: one logical processor reads a location that another logical processor writes, so the transaction cannot complete successfully. Intel TSX detects data conflicts at the granularity of a cache line, so accesses to different variables that share a cache line can also conflict. Aborts may also occur because transactional resources are limited, for example when the amount of data accessed in the region exceeds an implementation-specific capacity. Some instructions, such as CPUID and I/O instructions, may also cause a transactional execution to abort. In general, transactional aborts cause wasted cycles.
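The abort causes listed above are also visible to software: when an RTM transaction aborts, _xbegin() returns a status word whose low bits describe the cause. The sketch below decodes such a word into text. The kAbort* constants are hypothetical names that mirror the values of the _XABORT_* macros in immintrin.h, redefined here so the code compiles on machines without RTM support. describe_abort and abort_code are likewise illustrative helpers, not part of any Intel API.

```cpp
#include <cassert>
#include <string>

// Values mirror _XABORT_EXPLICIT, _XABORT_RETRY, _XABORT_CONFLICT,
// _XABORT_CAPACITY, _XABORT_DEBUG and _XABORT_NESTED from immintrin.h.
constexpr unsigned kAbortExplicit = 1u << 0;  // aborted by _xabort()
constexpr unsigned kAbortRetry    = 1u << 1;  // a retry may succeed
constexpr unsigned kAbortConflict = 1u << 2;  // data conflict with another logical processor
constexpr unsigned kAbortCapacity = 1u << 3;  // transactional buffering capacity exceeded
constexpr unsigned kAbortDebug    = 1u << 4;  // debug breakpoint hit
constexpr unsigned kAbortNested   = 1u << 5;  // abort happened in a nested transaction

// Turn an abort status word into a comma-separated description.
std::string describe_abort(unsigned status)
{
    std::string out;
    auto add = [&](unsigned mask, const char* name) {
        if (status & mask)
        {
            if (!out.empty()) out += ", ";
            out += name;
        }
    };
    add(kAbortExplicit, "explicit");
    add(kAbortConflict, "conflict");
    add(kAbortCapacity, "capacity");
    add(kAbortDebug, "debug");
    add(kAbortNested, "nested");
    add(kAbortRetry, "retry-may-succeed");
    // Some aborts (for example one caused by CPUID) may set none of these bits.
    return out.empty() ? "no cause bits set" : out;
}

// The 8-bit value passed to _xabort() is returned in bits 31:24.
unsigned abort_code(unsigned status)
{
    assert(status & kAbortExplicit);  // only meaningful for explicit aborts
    return (status >> 24) & 0xffu;
}
```

Logging describe_abort(status) on the fallback path is one way to see why transactions in a particular region keep aborting, which complements the Abort Cycles metric.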

Here is a simple example that uses Intel TSX and VTune Amplifier XE 2015 Beta.

//=====Pisolution.cpp=====
#include <stdio.h>
#include <windows.h>
#include <time.h>
#include "immintrin.h"
const long num_steps = 800000000;
const int num_threads = 16; 
double step = 0.0, pi = 0.0;
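static CRITICAL_SECTION cs;
#ifdef TSX
static volatile LONG fallbackHeld = 0;  // nonzero while a thread updates pi under cs
#endif
DWORD WINAPI threadFunction(LPVOID pArg)
{
    double partialSum = 0.0, x;  // local to each thread
    int myNum = *((int *)pArg);
    for ( int i=myNum; i<num_steps; i+=num_threads )  // use every num_threads step
    {
        x = (i + 0.5)*step;
        partialSum += 4.0 / (1.0 + x*x);  // compute partial sums at each thread
    }
    
#ifdef TSX
    // _xbegin() returns _XBEGIN_STARTED when the transaction starts; after an
    // abort, execution resumes here with an abort status instead
    if (_xbegin() == _XBEGIN_STARTED)
    {
        if (fallbackHeld) _xabort(0xff);  // a thread is on the lock path: do not race with it
        pi += partialSum * step;  // add partial to global final answer
        _xend();  // commit the transaction
    }
    else
    {
        // the transaction aborted: fall back to the traditional lock
        EnterCriticalSection(&cs);
        InterlockedExchange(&fallbackHeld, 1);  // aborts transactions that have read the flag
        pi += partialSum * step;  // add partial to global final answer
        InterlockedExchange(&fallbackHeld, 0);
        LeaveCriticalSection(&cs);
    }
#else
    EnterCriticalSection(&cs);  // traditional mechanism to protect the critical code area
    pi += partialSum * step;  // add partial to global final answer
    LeaveCriticalSection(&cs);
#endif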
    return 0;
}
void WinThread_Pi()
{
    HANDLE threadHandles[num_threads];
    int tNum[num_threads];
    InitializeCriticalSection(&cs);
    for ( int i=0; i<num_threads; ++i )
    {
        tNum[i] = i;
        threadHandles[i] = CreateThread( NULL,            // Security attributes
                                         0,               // Stack size
                                         threadFunction,  // Thread function
                                         (LPVOID)&tNum[i],// Data for thread func()
                                         0,               // Thread start mode
                                         NULL);           // Returned thread ID
    }
    WaitForMultipleObjects(num_threads, threadHandles, TRUE, INFINITE);
}
 
int main()
{
    clock_t start, stop;
    // Computing pi by using Windows Threads
    pi = 0.0;
    step = 1.0 / (double)num_steps;
    start = clock();
    WinThread_Pi();
    stop = clock();
    printf ("Computed value of Pi by using WinThreads: %12.9f\n", pi);
    printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);
    return 0;
}

Set up the C++ Composer XE and VTune Amplifier XE environments:
"C:\Program Files (x86)\Intel\VTune Amplifier XE 2015\amplxe-vars.bat"
"C:\Program Files (x86)\Intel\Composer XE 2015.0.070\bin\compilervars.bat" Intel64

Build two binaries, one that uses “locks” and one that uses Intel TSX:
icl /Zi /Qno-inline-function /DEBUG Pisolution.cpp /o Pi.exe
icl /Zi /Qxcore-avx2 /Qno-inline-function /DTSX /DEBUG Pisolution.cpp /o Pi_tsx.exe

Test performance by running both binaries and comparing the elapsed times they report.


Then run tsx-exploration with pi_tsx.exe:
amplxe-cl -collect tsx-exploration -- pi_tsx.exe

See the report in VTune Amplifier XE.
  

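Notes:
1.    Include the Composer XE intrinsics header file, immintrin.h.
2.    Use _xbegin() and _xend() to mark the start and end of the transactional code region, and check the value that _xbegin() returns, because execution resumes there when a transaction aborts.
3.    As a limitation, VTune Amplifier XE 2015 Beta does not yet support tsx-exploration with stack sampling enabled.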

