The Unfairness of Good Syntax

Datetime: 2017-04-04          Topic: C Program

The unfairness of good syntax - bad syntax is a problem; good syntax is not a solution.

The second law of thermodynamics states that as energy is transferred or transformed, more and more of it is wasted. The second law also states that there is a natural tendency of any isolated system to degenerate into a more disordered state. When looking at their teenagers’ bedrooms, most parents observe this phenomenon. We can also observe the asymmetric law of dirtiness: For an object to become clean, other objects have to become dirty, by an equal amount. It is asymmetric, since objects can get dirty without other objects becoming clean in the process.

There is also an asymmetric law of syntax: Good syntax does not solve programming problems; programming remains hard. And parallel programming remains much harder. However, bad syntax can really get in the way.

In a previous blog post I expressed concern that some advancements in syntax may be overrated, as if they could deliver on the promise of performance portability. I would like to balance that in this blog and show that the right syntactic capability can indeed be useful.

For that, let’s look at the Black-Scholes equation. It is a partial differential equation that models the value of options.

For European options there are closed-form solutions for both call options and put options.
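The formulas themselves appear to have been images in the original post. For reference, the standard closed-form prices of a European call C and put P are:

    C = S * N(d1) - X * e^(-rT) * N(d2)
    P = X * e^(-rT) * N(-d2) - S * N(-d1)

    d1 = (ln(S/X) + (r + sigma^2/2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)

where N(.) is the cumulative distribution function of the standard normal, S is the current stock price, X the strike price, T the time to expiry in years, r the riskless rate, and sigma the volatility. Since N(-d) = 1 - N(d), the put formula matches the (1 - cnd) form used in the code below.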

The Black-Scholes solution is occasionally used to showcase platform capabilities, such as improved performance of compiler-generated code. For that, we need to turn the formula into a benchmark. The code implementing the formula, producing results for both call and put options, could be written as:

#include <math.h> // for sqrt(), log(), exp()

void BlackScholesBody(
    double& call, //Call option price
    double& put,  //Put option price
    double Sf,    //Current stock price
    double Xf,    //Option strike price
    double Tf,    //Option years
    double Rf,    //Riskless rate of return
    double Vf     //Stock volatility
) {
    double sqrtT = sqrt(Tf);
    double    d1 = (log(Sf / Xf) + (Rf + 0.5 * Vf * Vf) * Tf) / (Vf * sqrtT);
    double    d2 = d1 - Vf * sqrtT;
    double  cnd1 = CND(d1); //CND() is the cumulative normal distribution
    double  cnd2 = CND(d2);
    //Calculate Call and Put simultaneously
    double expRT = exp(-Rf * Tf);
    call = Sf * cnd1 - Xf * expRT * cnd2;
    put  = Xf * expRT * (1.0 - cnd2) - Sf * (1.0 - cnd1);
}
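The code relies on a helper CND(), the cumulative distribution function of the standard normal, which the article does not show. A minimal sketch in terms of the C math library's erf() could look as follows; production benchmarks typically use a tuned polynomial approximation instead:

```cpp
#include <cmath>

// Cumulative distribution function of the standard normal:
// CND(d) = 0.5 * (1 + erf(d / sqrt(2)))
static double CND(double d) {
    return 0.5 * (1.0 + std::erf(d / std::sqrt(2.0)));
}
```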

In order to turn the code into a benchmark that can be scaled to run longer, we can set up arrays of input values and arrays of output values. The BlackScholesBody() function is set up to receive the elements of the call and put arrays it is supposed to populate. We then need to write a loop that iterates over the arrays and invokes the function we wrote. The loop should be both parallel and vector, and we use the syntax available in OpenMP* 4.0 to express that intent:

#pragma omp parallel for simd
    for (int i = 0; i < m_optionCount; i++) {
        BlackScholesBody(Call[i], Put[i], stockPrice[i], optStrike[i], optYears[i], R, V);
    }

Now we have a parallel and vector loop calling a function. To make things interesting, let’s assume that the loop and the function are in different compilation units. That would be the case, for example, if Alice is providing a library function and Bob is a client who is using it. In order to actually execute the function as a vector function, OpenMP provides the ability to declare the function as a SIMD-enabled function. By doing that, the developer is instructing the compiler to generate two variants of the function, one for regular use, and one for use from vector loops. We then annotate the function with:

#pragma omp declare simd

void BlackScholesBody() …

When compiling the variant of the function intended to be called from vector loops, the compiler generates code as if the function were the body of a vector loop. Each parameter is treated as a short vector of values. For example, when targeting the Intel® AVX2 ISA extension, the compiler packs four doubles into a register to be used as an argument: four consecutive elements of stockPrice[] will be loaded into a YMM register and passed as an argument to BlackScholesBody().

Much of the benefit comes from the opportunity to turn the mathematical operations into SIMD operations. For example, the expression d2 = d1 - Vf * sqrtT; will now be vectorized across four consecutive calls to BlackScholesBody(). Similarly, the last two expressions in the function, whose side effect delivers the results back to the call and put arrays, will also be vectorized. This brings up a problem: the assignments into the call array appear as four store operations to the compiler that generates the code for the function, and the compiler is not aware of any relationship between those four stores; the four memory locations are not known to be related. However, we know, and in our scenario Bob knows, that these are four consecutive array elements. When Alice annotates the library function as a SIMD-enabled function, she can use OpenMP syntax to qualify the parameters. As written, the parameters are addresses of consecutive array elements. The OpenMP syntax to express that relationship is the linear clause. In our example, it will look like:

#pragma omp declare simd linear (call,put)

void BlackScholesBody() …

With this information, the compiler can generate vector store instructions when assigning the values into the call and put arrays. An alternative is to pass the base addresses of the arrays together with an index. The function declaration would then look like:

#pragma omp declare simd uniform(call,put) linear(idx)
void BlackScholesBody(
    double* call, //Call option price
    double* put,  //Put option price
    int    idx,   //Index into the arrays
    double Sf,    //Current stock price
    double Xf,    //Option strike price
    double Tf,    //Option years
    double Rf,    //Riskless rate of return
    double Vf     //Stock volatility
);

which will achieve the same effect: The stores into the call and put arrays will be implemented using vector stores instead of multiple scalar stores or scatter operations.
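Putting the pieces together, here is a self-contained sketch of this second variant. Note that CND() and the driver run_blackscholes() are my own illustrative assumptions, not from the article; a real benchmark would use a tuned CND approximation and timed repetitions. As a sanity check, the driver verifies put-call parity, C - P = S - X * exp(-R*T), which the closed-form solution satisfies exactly up to rounding:

```cpp
#include <cmath>
#include <vector>
#include <algorithm>

// Assumed helper: the article does not show CND(); this uses std::erf.
static double CND(double d) {
    return 0.5 * (1.0 + std::erf(d / std::sqrt(2.0)));
}

#pragma omp declare simd uniform(call, put) linear(idx)
void BlackScholesBody(
    double* call,  // Call option prices (base address, uniform)
    double* put,   // Put option prices (base address, uniform)
    int     idx,   // Index into the arrays (linear)
    double  Sf,    // Current stock price
    double  Xf,    // Option strike price
    double  Tf,    // Option years
    double  Rf,    // Riskless rate of return
    double  Vf)    // Stock volatility
{
    double sqrtT = std::sqrt(Tf);
    double d1 = (std::log(Sf / Xf) + (Rf + 0.5 * Vf * Vf) * Tf) / (Vf * sqrtT);
    double d2 = d1 - Vf * sqrtT;
    double expRT = std::exp(-Rf * Tf);
    call[idx] = Sf * CND(d1) - Xf * expRT * CND(d2);
    put[idx]  = Xf * expRT * (1.0 - CND(d2)) - Sf * (1.0 - CND(d1));
}

// Assumed driver: price n options, then return the worst violation of
// put-call parity, C - P = S - X * exp(-R*T), as a correctness check.
double run_blackscholes(int n) {
    const double R = 0.02, V = 0.30;
    std::vector<double> Call(n), Put(n), stockPrice(n), optStrike(n), optYears(n);
    for (int i = 0; i < n; i++) {
        stockPrice[i] = 50.0 + i % 50;  // arbitrary sample inputs
        optStrike[i]  = 60.0;
        optYears[i]   = 1.0;
    }
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        BlackScholesBody(Call.data(), Put.data(), i,
                         stockPrice[i], optStrike[i], optYears[i], R, V);
    double maxErr = 0.0;
    for (int i = 0; i < n; i++) {
        double parity = stockPrice[i] - optStrike[i] * std::exp(-R * optYears[i]);
        maxErr = std::max(maxErr, std::fabs((Call[i] - Put[i]) - parity));
    }
    return maxErr;
}
```

Built with an OpenMP-capable compiler (for example, g++ -O2 -fopenmp), the declare simd and parallel for simd pragmas take effect; without OpenMP the pragmas are simply ignored and the code still computes the same results, just serially.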

The performance impact of course depends on many details, but it can be as significant as doubling the performance of the loop; a useful part of the syntax that is worth keeping in mind.