Optimization of a Parallel Aplication for Multi-Core Environments

(This work was done by Vivek Lingegowda during his internship at Intel.)

From Amdahl's law we know that the performance of a parallel application in a multicore environment is limited by its non-parallel part. Parallel applications inescapably have non-parallel parts that form bottlenecks while trying to gain the performance by adding more number of cores. We must reduce the impact of these bottlenecks by either finding a way to parallelize the non-parallel parts or by making them faster by using efficient algorithms and data structures to improve the performance of the non-parallel parts.

In this blog we will look in to how we improved an application, Canneal, that was limited by its serial portion.

Canneal is an application in the PARSEC benchmarking suite. PARSEC is an opensource chip-multiprocessor benchmarking tool (http://parsec.cs.princeton.edu/). The application takes 2.5 million netlists with fanins and fanouts as input from file and calculates the minimum routing cost of a chip design. The application has a serial grid construction phase followed by parallel route calculation phase and then a serial final routing calculation phase. The graph below shows the performance scaling of the application when run with varying number of threads on a system with 80 logical cores. [Performance Ratio: Application run time with 1 thread/Application run time with x threads].

Figure 1 Pre-Optimization Performance Scaling of Canneal Application

Figure 2 below is MPSTAT data showing the CPU loads at different instants of time when the application is running with 80 threads.

Figure 2: MPSTAT data showing the CPU load at different instants of time

The serial grid construction and the serial final routing calculation amounts to 67% and 14% of the total run time of the application respectively when run with 80 threads. Grid construction uses C++ STL maps to store and search 2.5 million netlists based on netlist names; netlist names are unique lower case strings with maximum length of 5 characters. C++ STL Map internally uses Red Black tree to store data. Red-Black trees take O(nlogn) time to construct and O(logn) to search.

In the following section we will describe one approach to optimize the application.

Optimizing Grid Construction:
Here instead C++ STL map we use a five dimensional array to store and search netlists. Indexes to the array are derived from individual characters in a netlist name as below. This array based approach takes O(1) to search and O(n) for grid construction.

const int ARRAY_MAX = 27;  const int INDEX_HASH = 96;
Netlist* _netlistArray[ARRAY_MAX][ARRAY_MAX][ARRAY_MAX][ARRAY_MAX][ARRAY_MAX];
Index1 = netlistName[0] – INDEX_HASH; Index2 = netlistName[1] – INDEX_HASH;  
Index3 = netlistName[2] – INDEX_HASH; Index4 = netlistName[3] – INDEX_HASH;
Index5 = netlistName[4] – INDEX_HASH;

Optimizing Final Routing calculation:
Here we parallelize the code by dividing the array of 2.5 million netlists among threads in the application to sum the routing values. The final routing is the sum of the routing values computed by all the threads.

The below graph compares the performance scaling of the Canneal application with optimized implementation.