The reason for using a different implementation only for Visual C++ is as follows:
Also, fesetround and _controlfp_s perform operations like the one shown in the fesetround example below.
Using -DKV_FASTROUND, the rounding mode is switched faster than with these two functions. However, it can only be used in the 64-bit mode of Intel CPUs.
See kv/hwround.hpp for implementation details.
With -DKV_NOHWROUND, directed rounding is emulated without using the hardware rounding mode. For details of the emulation algorithm, see emu.pdf (Jun 9, 2014; slightly modified on Apr. 27, 2023; in Japanese).
[Note] The emulation of directed rounding by -DKV_NOHWROUND is designed on the assumption that double arithmetic conforms to IEEE 754. This holds in many standard environments, but the 32-bit environment of Intel CPUs may not comply with IEEE 754 (the x87 FPU keeps intermediate results in extended precision, which can cause double rounding), so the kv library refuses to compile in such an environment (changed in version 0.4.54). Concretely, after #include <cfloat>, the macro FLT_EVAL_METHOD is checked; if it is not 0, compilation stops with an error. If you really want to use KV_NOHWROUND in a 32-bit Intel CPU environment, you can add the compile options -msse2 -mfpmath=sse to use SSE2, which sets the above macro to 0 and allows compilation. We decided not to support compilers that do not define this macro, or CPUs that do not have SSE2, because they are too old.
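For illustration, such a compile-time check might look like the following minimal sketch (the exact test in the library headers may differ):

```cpp
#include <cfloat>

// Refuse to build when intermediate double results may be kept in extended
// precision (e.g. the x87 FPU in 32-bit mode), because the emulation of
// directed rounding assumes every operation is rounded to binary64.
#if !defined(FLT_EVAL_METHOD) || (FLT_EVAL_METHOD != 0)
#error "KV_NOHWROUND requires FLT_EVAL_METHOD == 0 (try -msse2 -mfpmath=sse on 32-bit x86)"
#endif
```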
The code below shows how the directed-rounding operations are realized under each setting, taking the upward-rounded addition add_up as an example.
In standard environments (C99), fesetround from <fenv.h> is used:

```cpp
#include <fenv.h>

static double add_up(const double& x, const double& y) {
    volatile double r, x1 = x, y1 = y;
    fesetround(FE_UPWARD);       // switch to rounding toward +infinity
    r = x1 + y1;
    fesetround(FE_TONEAREST);    // restore rounding to nearest
    return r;
}
```
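For Visual C++, _controlfp_s from <float.h> is used instead of fesetround. A minimal sketch of how add_up could look with it (assuming the standard MSVC CRT interface; not necessarily identical to the code in kv/hwround.hpp):

```cpp
#include <float.h>

static double add_up(const double& x, const double& y) {
    volatile double r, x1 = x, y1 = y;
    unsigned int cw;
    _controlfp_s(&cw, _RC_UP, _MCW_RC);    // round toward +infinity
    r = x1 + y1;
    _controlfp_s(&cw, _RC_NEAR, _MCW_RC);  // restore rounding to nearest
    return r;
}
```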
With -DKV_FASTROUND, the rounding mode is switched by writing the SSE2 MXCSR register directly via _mm_getcsr/_mm_setcsr:

```cpp
#include <emmintrin.h>

// precomputed MXCSR values; bits 13-14 are the rounding-control field
unsigned long int reg = _mm_getcsr();
unsigned long int nearest = (reg & ~0x00006000) | 0x00000000;
unsigned long int down    = (reg & ~0x00006000) | 0x00002000;
unsigned long int up      = (reg & ~0x00006000) | 0x00004000;
unsigned long int chop    = (reg & ~0x00006000) | 0x00006000;

static double add_up(const double& x, const double& y) {
    volatile double r, x1 = x, y1 = y;
    _mm_setcsr(up);        // round toward +infinity
    r = x1 + y1;
    _mm_setcsr(nearest);   // restore rounding to nearest
    return r;
}
```
With -DKV_NOHWROUND, the emulation is built from the error-free transformation twosum and the successor function succ:

```cpp
#include <cmath>
#include <limits>

// error-free transformation of addition: x = fl(a+b) and x + y = a + b exactly
static void twosum(const double& a, const double& b, double& x, double& y) {
    double tmp;
    x = a + b;
    if (std::fabs(a) > std::fabs(b)) {
        tmp = x - a;
        y = b - tmp;
    } else {
        tmp = x - b;
        y = a - tmp;
    }
}

// successor function: returns a double slightly larger than x,
// used to adjust the result upward without changing the rounding mode
static double succ(const double& x) {
    static const double th1 = std::ldexp(1., -969);
    static const double th2 = std::ldexp(1., -1021);
    static const double c1 = std::ldexp(1., -53) + std::ldexp(1., -105);
    static const double c2 = std::ldexp(1., -1074);
    static const double c3 = std::ldexp(1., 53);
    static const double c4 = std::ldexp(1., -53);
    double a, c, e;
    a = std::fabs(x);
    if (a >= th1) return x + a * c1;
    if (a < th2) return x + c2;
    c = c3 * x;
    e = c1 * std::fabs(c);
    return (c + e) * c4;
}

static double add_up(const double& x, const double& y) {
    double r, r2;
    twosum(x, y, r, r2);
    if (r == std::numeric_limits<double>::infinity()) {
        return r;
    } else if (r == -std::numeric_limits<double>::infinity()) {
        if (x == -std::numeric_limits<double>::infinity() || y == -std::numeric_limits<double>::infinity()) {
            return r;
        } else {
            return -(std::numeric_limits<double>::max)();
        }
    }
    if (r2 > 0.) {
        return succ(r);
    }
    return r;
}
```
With -DKV_USE_AVX512 (together with -mavx512f), AVX-512 instructions with an embedded rounding mode are used, so no rounding mode switch is necessary:

```cpp
#include <immintrin.h>

static double add_up(const double& x, const double& y) {
    volatile __m128d x1, y1, z1;
    double r;
    x1 = _mm_set_sd(x);
    y1 = _mm_set_sd(y);
    // addition with the rounding mode (toward +infinity) embedded in the instruction
    z1 = _mm_add_round_sd(x1, y1, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
    _mm_store_sd(&r, z1);
    return r;
}
```
The error-free transformation twoproduct (used, for example, in the emulation of multiplication and in dd arithmetic) is implemented by default with Veltkamp splitting:

```cpp
// Veltkamp splitting: a = x + y, where x and y each fit in half-width
// so that their products can be computed exactly
static void split(const double& a, double& x, double& y) {
    static const double sigma = (double)((1L << 27) + 1);
    double tmp;
    tmp = a * sigma;
    x = tmp - (tmp - a);
    y = a - x;
}

// error-free transformation of multiplication: x = fl(a*b) and x + y = a*b,
// with scaling to avoid overflow of intermediate values
static void twoproduct(const double& a, const double& b, double& x, double& y) {
    static const double th = std::ldexp(1., 996);
    static const double c1 = std::ldexp(1., -28);
    static const double c2 = std::ldexp(1., 28);
    static const double th2 = std::ldexp(1., 1023);
    double na, nb, a1, a2, b1, b2;
    x = a * b;
    if (std::fabs(a) > th) {
        na = a * c1; nb = b * c2;
    } else if (std::fabs(b) > th) {
        na = a * c2; nb = b * c1;
    } else {
        na = a; nb = b;
    }
    split(na, a1, a2);
    split(nb, b1, b2);
    if (std::fabs(x) > th2) {
        y = ((((a1 * 0.5) * b1 - (x * 0.5)) * 2 + a2 * b1) + a1 * b2) + a2 * b2;
    } else {
        y = (((a1 * b1 - x) + a2 * b1) + a1 * b2) + a2 * b2;
    }
}
```
With -DKV_USE_TPFMA, twoproduct is implemented with the fused multiply-add instead:

```cpp
static void twoproduct(const double& a, const double& b, double& x, double& y) {
    x = a * b;
    y = fma(a, b, -x);  // exact error of the product, provided fma is truly fused
}
```

This option seems to be usable on Intel CPUs from Haswell onward, with an OS and compiler that are not too old. Whether fma behaves as a true fused multiply-add in a given environment can be checked, for example, with a small test program like the one sketched below.
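A minimal sketch of such a check (an illustrative test, not necessarily the procedure used by the library): with a true fused multiply-add, fma(a, b, -a*b) recovers the exact rounding error of the product, while a non-fused fallback returns 0.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // a * b = 1 + 2^-26 + 2^-54 exactly; rounding to double drops the 2^-54 term
    double a = 1.0 + std::ldexp(1.0, -27);
    double b = 1.0 + std::ldexp(1.0, -27);
    double x = a * b;               // rounds to 1 + 2^-26
    double y = std::fma(a, b, -x);  // a true FMA recovers the residual 2^-54
    if (y == std::ldexp(1.0, -54)) {
        std::printf("fma recovers the exact residual; twoproduct via fma looks safe\n");
    } else {
        std::printf("fma does not behave as a fused multiply-add here\n");
    }
    return 0;
}
```

Note that a correct result alone does not guarantee that a hardware FMA instruction is emitted; the library call may fall back to a slower software routine. Whether the hardware instruction is actually used (and hence whether the option pays off in speed) can be confirmed by timing or by inspecting the generated assembly, typically after compiling with -mfma or -march=native.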
First, benchmark results on an Intel CPU are shown. The computer environment used is as follows:

- Intel Core i5 11400 2.6GHz/4.4GHz, 32GB memory
- Ubuntu 22.04 LTS
- gcc 11.2.0
As explained above, in some cases the calculation speed is improved by the -DKV_USE_TPFMA option. For those cases, the calculation times both without and with the option are shown.
| precision (type of endpoint) | compile option (in addition to -O3 -DNDEBUG) | calculation time (without -DKV_USE_TPFMA) | calculation time (with -DKV_USE_TPFMA) |
|---|---|---|---|
| double | none | 6.65 sec | |
| double | -DKV_FASTROUND | 3.74 sec | |
| double | -DKV_NOHWROUND | 7.88 sec | 6.05 sec |
| double | -DKV_USE_AVX512 -mavx512f | 2.74 sec | |
| dd | none | 37.20 sec | 33.58 sec |
| dd | -DKV_FASTROUND | 25.08 sec | 21.52 sec |
| dd | -DKV_NOHWROUND | 92.24 sec | 71.81 sec |
| dd | -DKV_USE_AVX512 -mavx512f | 16.47 sec | |
Next, benchmark results on an M1 MacBook Air are described. Currently, ARM CPUs seem to incur a large execution-time penalty for hardware rounding mode changes, so the emulation (-DKV_NOHWROUND) is the fastest, which is a disappointing result.
The computer environment used is as follows:

- Apple M1 MacBook Air 3.2GHz, 16GB memory
- macOS Monterey 12.4
- gcc 12.2.0
| precision (type of endpoint) | compile option (in addition to -O3 -DNDEBUG) | calculation time (without -DKV_USE_TPFMA) | calculation time (with -DKV_USE_TPFMA) |
|---|---|---|---|
| double | none | 24.45 sec | |
| double | -DKV_FASTROUND | 22.36 sec | |
| double | -DKV_NOHWROUND | 9.08 sec | 6.03 sec |
| dd | none | 108.8 sec | 100.6 sec |
| dd | -DKV_FASTROUND | 102.4 sec | 94.64 sec |
| dd | -DKV_NOHWROUND | 84.60 sec | 53.76 sec |