Inherent Parallelism and Speedup Estimation of Sequential Programs

Sesha Kalyur and Nagaraja G.S, “Inherent Parallelism and Speedup Estimation of Sequential Programs”, Annals of Emerging Technologies in Computing (AETiC), Print ISSN: 2516-0281, Online ISSN: 2516-029X, pp. 62-77, Vol. 5, No. 2, 1st April 2021, Published by International Association of Educators and Researchers (IAER), DOI: 10.33166/AETiC.2021.02.006, Available: http://aetic.theiaer.org/archive/v5/v5n2/p6.html. Research Article


Previous Work
Early Parallel Conversion of programs was entirely a manual activity. Parallel code paths in the program were identified, and each path was handed out to a task. Tasks were implemented as full-fledged Processes or Threads, the latter being the more efficient counterpart in terms of resources consumed [2]. This procedure is error-prone and tedious, and so research was carried out to seek better techniques. The next step in the evolution of Parallel Programming was the advent of special parallel languages, or constructs in existing languages offered as directives to the compiler, with parallel conversion supervised by the programmer [3][4][5][6]. Notable among them are OpenMP and MPI, which are also industry-standard technologies [7][8][9][10].
Performance estimation and measurement are important from two angles. Measurement done early in the compilation cycle can aid the choice of optimization and conversion techniques. Measurement done later in the pipeline can be more accurate and can help ascertain the quality and accuracy of the earlier projections. A great deal of research has been devoted to performance assessment, including parallel performance [52][53][54][55][56][57][58][59][60][61][62][63][64].

Asterix
Caliper is a parallel measurement, prediction and estimation module. It is part of the compilation pipeline of Asterix, our compiler, optimizer and parallel converter. We provide a high-level view of each of the Asterix modules next:

Paracite
This module is essentially the front end of Asterix, where lexical analysis, syntax analysis and semantic analysis occur. The input to this phase is a program in an imperative language, and its outcome is the equivalent program in ASIF, the Intermediate Representation (IR) of Asterix [65].

ASIF
ASIF is an acronym for Asterix Intermediate Format, the language that mainly comprises an IR instruction set invented for the Asterix compiler suite. It is based on the three-address instruction format, with an explicit Opcode followed by the Result and two Source operands.

Caliper
Caliper reads the code in ASIF format and makes a coarse estimate of the nascent parallel opportunities that exist in the given program. This gives users a starting point against which to position their reference performance. A later section discusses the topic exhaustively [1].

Graft
Graft performs the bulk of the analysis work on the IR code in ASIF format. The result of the analysis is represented in the form of several tables and graphs, which are consulted to identify code transformation opportunities, including optimizations and parallel conversions.

3PO
3PO stands for Parallel Performance Predictor and Oracle. This module is a fine-grain performance estimation and prediction module, which reports at the local block level as well as the global program level, and uses several mathematical models, one for each transformation category. The various 3PO sub-models are categorized by the nature of the transformation or parallel conversion. Accordingly, there are transformations that improve instruction counts, transformations that improve cache latency, and transformations that enable other transformations, including parallel conversions [66].
The main performance numbers reported are Inherent Parallel Potential (IPO) and Expected Speedup from Parallel Conversion (ESP), with obvious connotations for parallel conversion. For transformations the numbers are similar but carry slightly different semantics: Inherent Speedup Potential (ISP) and Expected Speedup from Transformation (EST), computed using the appropriate category model.

Transgraph
The Transgraph module is in charge of generating code transformations that are beneficial from a performance perspective. Some of the transformations are solely concerned with generating code that is parallel friendly. The input and output of the module are IR in ASIF form, along with supplementary IR structures such as graphs and tables.

Paragraph
The Paragraph module actually generates the parallel code. The basic unit of parallel code, conceptually a task, is called a Prune, a name morphed from the phrase Parallel IR Unit. Each Prune is assigned to an independent processing element in a virtual topology, and this mapping is preserved for the entire duration of the application's existence. The input to the module is IR code and IR supplements from Transgraph; the output is IR in Prune form.

Pigeon
Pigeon is a word that originates from the phrase Parallel Code Generator. It is the module that converts Prunes into executable versions. These executable Prunes are called Proxies (singular Proxy), a name that evolved from the phrase Parallel Execution Unit. Proxies are generated and assigned to their respective execution units in an actual physical topology in a later phase. These mappings are subject to change during the life cycle of the application.

AIDE
AIDE stands for Asterix Integrated Development Environment, a graphical tool that displays the important results of the compilation process, from the source code to the generation of Prunes and Proxies and their interdependence [67]. The various views include Annotated Source and ASIF IR, Caliper Predictions, 3PO Oracles, and Prunes and Proxies with their distribution and orchestration.

Concerto
This module, as the name suggests, is the Distributor, Coordinator and Orchestration Manager of the Proxies in action. It chooses the mapping of Proxies to their respective processing elements, manages their remote executions, and also provides synchronization primitives. In a NUMA distributed environment, it also decides how to partition data between the Proxies, manages mapping to processing elements, and provides communication primitives for data sharing [68]. Actual mapping is handled by a sub-module of Concerto called the Topology Mapper (TOPMAP for short), which offers a choice of different mapping algorithms [69][70].

CALIPER
The CALIPER module is responsible for providing the user with a base expectation of the parallel performance inherent in the program under consideration. This prediction can help dictate the choice of transformations to apply to the program, including the parallel conversion decisions. The higher-level syntactic structures of an imperative program offer impedance to the effective computation of performance estimates and predictions. Each program is unique from the perspective of the collection of syntactic structures constituting it, and these structures pose unique difficulties for estimation and prediction. We refer to this trait of a program as its Shape, and to the transformations applied to strip the Shape of a program as Program-Shape-Flattening.
Input to the CALIPER module consists of IR in ASIF format. CALIPER performs the Program-Shape-Flattening transformations, namely Function-Call-Expansion, Loop-Unrolling and Control-Predication, which are described individually later. The output from the CALIPER module is the performance estimate, in the form of Maximum-Available-Parallelism (MAP), and the performance prediction, in the form of Speedup-After-Parallel-Conversion (SAP); both terms are described later. The following paragraphs describe the steps involved in CALIPER operation, followed by definitions of the performance metrics CALIPER reports.

Function Call Expansion
The purpose of Function-Call-Expansion is to replace all function calls with the code that constitutes the function body. It should be noted that this is a recursive process, which stops only after all user-defined functions have been expanded. Library functions and system calls are normally not considered for call expansion; they are treated like any other instruction, which suffices for coarse estimates. A user program loaded with library calls and system calls may skew the prediction somewhat, but this is not the case with the majority of real-world programs.
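As an illustration of the effect (a hand-worked sketch, not actual Asterix output), consider a hypothetical user-defined function and its call site before and after expansion; the names square, caller_before and caller_after are our own:

```c
/* Hypothetical user-defined function subject to Function-Call-Expansion. */
static double square(double x) { return x * x; }

/* Before expansion: the call boundary hides the function body. */
static double caller_before(double v) {
    return square(v) + 1.0;
}

/* After expansion: the body replaces the call site. A library call
 * such as printf would be left untouched by this transformation. */
static double caller_after(double v) {
    return v * v + 1.0;
}
```

Both versions compute the same value; only the call structure differs, which is what makes the flattened form easier to analyze.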

Loop Unrolling
As a result of Loop-Unrolling, all Loops and Multi-Loops are replaced with their respective code blocks, and the instructions making up the entry and exit conditions and the loop-back jumps are removed.
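The effect can be illustrated on a small fixed-trip-count loop; this is a hand-worked sketch under our own naming, not Asterix output:

```c
/* Before flattening: each iteration pays for the entry test and the
 * loop-back jump. */
static int sum_rolled(const int a[4]) {
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += a[i];
    return sum;
}

/* After full unrolling: straight-line copies of the body; the entry/exit
 * conditions and loop-back jumps are gone. */
static int sum_unrolled(const int a[4]) {
    int sum = 0;
    sum += a[0];
    sum += a[1];
    sum += a[2];
    sum += a[3];
    return sum;
}
```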

Control Predication
Control Predication is a transformation that replaces Conditional Blocks with equivalent Predicated Blocks. Conditional statements are another hindrance to the correct estimation of performance. Most architectures provide Predicated-Execution of instructions to varying degrees; however, all of them support a Conditional-Move instruction. Used with predicates to compute the condition of the move, and combined with regular instructions computing into temporary result variables, Conditional-Move offers a powerful and compelling way to implement Control-Predication.
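The temporary-variable pattern described above can be sketched in C, where the ternary operator plays the role of the conditional move (compilers commonly lower it to a cmov-style instruction); HALF_COUNT and the function names are our assumptions:

```c
#define HALF_COUNT 4   /* assumed constant, for illustration only */

/* Branching form: the addend depends on a control decision. */
static double branchy(double z, int i) {
    if (i < HALF_COUNT)
        z += i * i;
    else
        z += 2 * i;
    return z;
}

/* Predicated form: both arms are computed unconditionally into
 * temporaries, and a conditional move selects between them, leaving
 * no control dependence in the block. */
static double predicated(double z, int i) {
    double t_then = (double)(i * i);
    double t_else = (double)(2 * i);
    z += (i < HALF_COUNT) ? t_then : t_else;
    return z;
}
```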

Maximum Available Parallelism
Maximum-Available-Parallelism (MAP) is a metric that reports the amount of parallelism present in a given program, as a percentage. For instance, a MAP of 33% means that one third of the code is parallel convertible, and the remaining 66% of the code is serial in nature. It should be noted that this number takes into consideration all the dependencies that exist in the program, both data and control.

Speedup After Parallel Conversion
Speedup-After-Parallel-Conversion (SAP) is a metric that reports the benefit of parallel conversion, as a floating point number. In the example discussed earlier, since 33% of the code is subject to parallel conversion, the effective run time is determined by the 66% serial part, and the expected speedup would be 1.52.
Figure-2 illustrates the different steps involved in the operation of the CALIPER module. Translated IR code in ASIF format is fed to the Inliner module, which carries out the expansion of all function calls. The modified IR is fed to the next module in the chain, the Unroller, which unrolls all loops. Its output is sent to the Predicator, which converts all conditionals in the IR to predicated statements. The output of this module is shape-sanitized IR, ready for performance estimation.

Performance Estimation Equations
Performance estimation and prediction, for both the serial and parallel versions, revolve around the parameters defined below, along with the equations for computing them.

Serial Execution Cycles
Since we are measuring performance in a coarse fashion here, we do not account for individual instruction differences: each instruction counts as one cycle, and the memory hierarchy is not considered in these computations. Fine-grained estimation comes in a later pass, which uses the 3PO model with its built-in cycle-accurate simulator, called Kinetics, which includes hardware-accurate models of cache, memory and storage. The workings of 3PO and Kinetics are the subject matter of a different paper and are not discussed further here. The equation for Serial-Execution-Cycles is: C_CYC = N_INC. Here, C_CYC is the count of cycles to run the serial version of the program, and N_INC is the instruction count for the given program.

Parallel Execution Cycles
Computation of the parallel execution cycles is more involved, and requires a check for data dependence between operands and results belonging to different instructions. Since Shape-Flattening has eliminated control dependencies of all kinds, they are no longer an issue; a later subsection describes the Shape-Flattening algorithm in more detail. Calculating Parallel-Execution-Cycles involves classifying instructions, based on their data dependence, into different equivalence classes. Instructions belonging to the same equivalence class are data dependent on one another, and so their ordinal order of issue must be honoured to maintain correctness. Instructions belonging to different classes have no data dependencies, and hence allow concurrent execution. Once the equivalence classes have been finalized, the execution time is dictated by the longest running equivalence class. The algorithm for creating the dependence equivalence classes is given in a following subsection.
The equation for computing the parallel execution cycles is: C_PAR = MAX(EQC_1, EQC_2, ..., EQC_n), where C_PAR is the parallel cycle count and EQC_1, EQC_2, ..., EQC_n are the total cycles needed to execute the instructions of the individual equivalence classes in serial fashion.
The equation to compute Maximum Available Parallelism (MAP) is: MAP = ((C_SER - C_PAR) / C_SER) × 100, where MAP is a measure of the inherent parallelism available in a program, reported as a percentage of the total program instructions, C_PAR is the number of cycles required to run the parallel version of the program, and C_SER is the cycle count for the serial version of the program.
The equation to compute the Speedup After Parallel Conversion (SAP) is: SAP = C_SER / C_PAR, where SAP is an estimate of how much faster the program will run after parallel conversion, C_PAR is the number of cycles required to run the parallel version of the program, and C_SER is the cycle count for the serial version of the program.
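The three equations above can be sketched directly in C, assuming the per-class cycle totals are already known; the function names are ours:

```c
/* C_PAR = MAX(EQC_1, ..., EQC_n): the longest running equivalence
 * class dictates the parallel run time. */
static long parallel_cycles(const long *eqc, int n) {
    long max = 0;
    for (int i = 0; i < n; i++)
        if (eqc[i] > max)
            max = eqc[i];
    return max;
}

/* MAP = ((C_SER - C_PAR) / C_SER) * 100, as a percentage. */
static double map_percent(long c_ser, long c_par) {
    return 100.0 * (double)(c_ser - c_par) / (double)c_ser;
}

/* SAP = C_SER / C_PAR, as a floating point speedup factor. */
static double sap(long c_ser, long c_par) {
    return (double)c_ser / (double)c_par;
}
```

For example, three classes of 4, 2 and 2 cycles give C_SER = 8, C_PAR = 4, hence MAP = 50% and SAP = 2.0.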

Program Shape Flattening
As mentioned earlier, program syntax structures such as Functions, Loops and Conditionals are a hindrance to effective estimation and prediction of performance. So, as a first step, it is essential to flatten these high-level language structures and then proceed with the estimation.
In the following paragraphs, we give brief procedures in algorithmic form for these preparatory steps towards estimation.

Parallel Equivalence Classes
Parallel Equivalence Classes are sets of items that each satisfy a single property. In the context of Parallel Conversion, this means sets of instructions that can be executed concurrently. It should be noted, however, that instructions within a particular class must be executed serially, to satisfy the property of the equivalence class. When the instructions of a program are organized into equivalence classes, the run time of the program is reduced from the time spent executing all instructions serially to the run time of the longest running equivalence class.
What follows is the algorithm to create the Equivalence Classes, also referred to here as Dependence Classes. Once they are created, it becomes trivial to assess the run time and predict performance. The equivalence class creation algorithm is given below:

14: Get Source1 Operand(S1, Ins) // fetch source1 operand of instruction
15: Get Source2 Operand(S2, Ins) // fetch source2 operand of instruction
16: Merge(R, S1) // merge class S1 into class R and update global parallel equiv. class list
17: Merge(R, S2) // merge class S2 into class R and update global parallel equiv. class list
18: end for
19: end procedure
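One plausible implementation of the class construction, sketched with a union-find structure, is shown below; the data layout (Ins) and all names are our assumptions, since only the merging steps of the published listing are reproduced above:

```c
#define MAX_INS  64
#define MAX_REGS 16

/* Hypothetical three-address instruction: a result register and two
 * source registers; -1 marks an unused operand. */
typedef struct { int result, src1, src2; } Ins;

static int parent[MAX_INS];    /* union-find forest over instructions */

/* Find the class representative, with path compression. */
static int find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Merge two classes, as in Merge(R, S) of the listing. */
static void merge(int a, int b) {
    int ra = find(a), rb = find(b);
    if (ra != rb)
        parent[ra] = rb;
}

/* Each instruction joins the class of the last writer of each of its
 * source operands (a def-use data dependence). */
static void build_classes(const Ins *ins, int n) {
    int last_def[MAX_REGS];
    for (int i = 0; i < n; i++) parent[i] = i;
    for (int r = 0; r < MAX_REGS; r++) last_def[r] = -1;
    for (int i = 0; i < n; i++) {
        if (ins[i].src1 >= 0 && last_def[ins[i].src1] >= 0)
            merge(i, last_def[ins[i].src1]);
        if (ins[i].src2 >= 0 && last_def[ins[i].src2] >= 0)
            merge(i, last_def[ins[i].src2]);
        last_def[ins[i].result] = i;   /* record defining instruction */
    }
}
```

Two independent dependence chains end up in two different classes, which may then execute concurrently, while the instructions inside each class keep their issue order.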

Long Dependence Sequences
Certain programs exhibit long dependence sequences, which can lead to loss of parallelism and produce fewer than the optimal number of parallel classes. To prevent this, a heuristic based on the concept of an Instruction Threshold (IT) is proposed, where IT is the number of instructions in a class which would force the class to become an independent parallel class. IT is a tuneable; for instance, it can be set to 32 instructions, meaning that if the class size is less than IT the merger proceeds, and otherwise the merger is skipped. To implement this, at the time of Parallel Class mergers a check is made to see whether the class lengths meet the IT threshold. If the criterion is met, the instruction which acts as the key in both classes is hoisted out of the classes, and a unique class is made from that instruction. A dependence is set from the new class holding the hoisted instruction to the existing classes. New keys for the two existing classes are defined with the result operand of the lowest-numbered instruction in each class. This operation is applied recursively to both classes as long as the IT criterion holds. These IT checks are enough to ensure that optimum parallelization is preserved. While calculating the parallel instruction count, care should be taken to account for the serial paths which precede the parallel classes and add their instruction counts to the sum.
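The merge guard implied by the heuristic can be sketched as follows; this shows only the threshold check, not the hoisting step, and the name should_merge is ours:

```c
#define IT 32   /* tuneable Instruction Threshold from the text */

/* Proceed with a merger only while both classes are below the
 * threshold; otherwise the heuristic splits instead of merging,
 * preserving parallelism in long dependence sequences. */
static int should_merge(int size_a, int size_b) {
    return size_a < IT && size_b < IT;
}
```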

Analysis
To better understand the internals of Caliper, we study a simple program with a function, a loop and a conditional, to see how it is transformed as it passes through the shape flattening exercises; finally, Caliper analyzes the resulting ASIF-IR program to generate its report.

Input File to Caliper (calfun.c)
Given below is a simple C program with a function, a loop and a conditional. The program, which is passed as input to Caliper, is self-explanatory.
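A minimal program consistent with the fragments shown in the later transformation steps (the HALF_COUNT test, the two arms z += i * i and z += 2 * i, an eight-iteration loop, and a final printf) would look like the following; the names COUNT, step and calfun are our assumptions, not necessarily the original listing:

```c
#include <stdio.h>

#define COUNT 8
#define HALF_COUNT (COUNT / 2)

/* Hypothetical reconstruction of the user-defined function that is
 * later inlined by Caliper. */
static double step(double z, int i) {
    if (i < HALF_COUNT)
        z += i * i;     /* first half of the iterations */
    else
        z += 2 * i;     /* second half of the iterations */
    return z;
}

/* Driver corresponding to the loop and final printf of calfun.c. */
static double calfun(void) {
    double z = 0.0;
    for (int i = 0; i < COUNT; i++)
        z = step(z, i);
    printf("z = %lf\n", z);
    return z;
}
```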

Calfun.c after Function In-lining by Caliper (calfun_inl.c)
The first transformation applied to calfun.c is function inlining, and the program listed below is output as a result of that transformation. Lines 6-9 of the program represent the function which was inlined.

6:     if (i < HALF_COUNT)
7:         z += i * i;
8:     else
9:         z += 2 * i;
10: }

Calfun_inl.c after Control Predication by Caliper (calfun_pred.c)
The program below is output by Caliper as a result of the Control Predication transformation, where the If-conditional block is predicated, as seen on line 6.

6:     z += (i < HALF_COUNT) ? i * i : 2 * i;
7:     printf("z = %lf\n", z);
8: }

Calfun_pred.c after Loop Unrolling by Caliper (calfun_unl.c)
The final transform applied by Caliper is loop unrolling, and the following program is output, as seen on lines 5-20.

Calfun_unl.c after ASIF-IR generation by Caliper (calfun.s)
The following ASIF-IR is the resulting program after all transformations, with the high-level code translated to IR. Lines 5-25 show the results. To save space, only iterations 0, 1 and 7 are shown, with the others snipped.

CALIPER Parallel Estimates (calfun.csv)
After the ASIF-IR code is passed to Caliper, it creates the required Equivalence Classes, calculates the MAP and SAP metrics, and generates the output in the form of a CSV file, as shown below:

(1), Serial Instruction Count, SIN, 58

As seen from the table, Caliper provides parallel performance estimates which none of the other state-of-the-art compilers provide, although all of them provide optimization-related diagnostics at some basic level. Based on our findings, we conclude that Caliper is the only working Parallel Performance Estimation and Prediction solution available at this time.

Conclusion
Caliper was developed to aid parallel programmers in their endeavours, by providing an estimate of the yield resulting from parallel conversion of a given program. Caliper works on programs in ASIF-IR format, an internal representation developed as part of our compiler framework. As a preliminary step, Caliper performs Program Shape Flattening transformations to ease the subsequent steps. It performs symbolic analysis of the ASIF-IR instructions representing the given program and classifies them into Equivalence Classes based on their dependence behaviour. These classes, which host dependent instructions, are themselves dependence free with respect to one another and are eligible to operate in interleaved fashion. Once the instructions are arranged in this fashion, it becomes easy to compute the serial and parallel runtimes: the serial runtime is the sequential runtime of all the instructions making up the program, and the parallel runtime is the runtime of the class that runs the longest. Based on these two numbers, two metrics useful to the programmer are reported. Maximum Available Parallelism (MAP) points out the inherent parallel potential of a given program. Speedup after Parallelization (SAP) complements it by reporting the estimated speedup resulting from parallel conversion. At the time of writing there are no known technologies comparable to Caliper, and we conclude that Caliper is a one-of-its-kind parallelization technology.