Last Update: 16 Sep., 2022

Changing Rounding Mode and Compile Options

Masahide Kashiwagi

- "interval arithmetic with *double* endpoint" using interval.hpp + rdouble.hpp
- "interval arithmetic with *dd* endpoint" using interval.hpp + dd.hpp + rdd.hpp
- "interval arithmetic with *mpfr* endpoint" using interval.hpp + mpfr.hpp + rmpfr.hpp

- In Microsoft Visual C++, _controlfp_s provided in float.h is used.
- Otherwise, fesetround provided in fenv.h is used.

The reason for the different implementation for Visual C++ is as follows:

- Visual C++ 2012 and earlier did not implement fenv.h.
- Visual C++ 2013 and 2015 had a fatal bug in which the directions of upward rounding and downward rounding were reversed.
- (Visual C++ 2017 finally behaves correctly.)

Also, fesetround and _controlfp_s perform the following operations:

1. Read the control register.
2. Rewrite the bits corresponding to the rounding mode in the value read.
3. Write the value back to the control register.

This can be sped up as follows:

- Steps 1 and 2 are executed once at program start, and the resulting control register values are stored for each rounding mode.
- To change the rounding mode, only step 3 is performed: the stored value is simply written.

The -DKV_FASTROUND option enables this faster method of changing the rounding mode. However, it can only be used in 64-bit mode on Intel CPUs.

See kv/hwround.hpp for implementation details.

For details of the algorithm, see emu.pdf (Jun 9, 2014, in Japanese).

See

- kv/rdouble-nohwround.hpp (interval arithmetic with *double* endpoint)
- kv/rdd-nohwround.hpp (interval arithmetic with *dd* endpoint)

[Note] The emulation of directed rounding enabled by -DKV_NOHWROUND assumes that double arithmetic conforms to IEEE 754. This holds in most standard environments, but 32-bit Intel CPU environments may not comply with IEEE 754, so the kv library cannot be compiled in such environments (changed in version 0.4.54). Concretely, after #include <cfloat>, the macro FLT_EVAL_METHOD is checked; if it is not 0, compilation fails with an error. If you really want to use KV_NOHWROUND in a 32-bit Intel CPU environment, add the -msse2 -mfpmath=sse compile options to use SSE2, which sets the above macro to 0 and allows compilation. Compilers that do not define this macro, and CPUs without SSE2, are considered too old and are not supported.

On CPUs with the AVX-512 instruction set, the rounding direction can be specified per instruction, and the -DKV_USE_AVX512 option takes advantage of this. AVX-512 is available on CPUs such as:

- Xeon Phi Knights Landing
- Skylake-X (Core i9 7900X, etc.)
- Skylake-SP (Xeon Platinum/Gold/Silver/Bronze, etc.)

See

- kv/rdouble-avx512.hpp (interval arithmetic with *double* endpoint)
- kv/rdd-avx512.hpp (interval arithmetic with *dd* endpoint)

With fesetround, upward-rounded addition is implemented as follows:

```cpp
#include <fenv.h>

static double add_up(const double& x, const double& y) {
    volatile double r, x1 = x, y1 = y;
    fesetround(FE_UPWARD);
    r = x1 + y1;
    fesetround(FE_TONEAREST);
    return r;
}
```

With -DKV_FASTROUND, the MXCSR register values for each rounding mode are computed once at program start and stored, so changing the rounding mode requires only a single register write:

```cpp
#include <emmintrin.h>

// Executed once at program start: read MXCSR and precompute the
// register value for each rounding mode.
unsigned long int reg = _mm_getcsr();
unsigned long int nearest = (reg & ~0x00006000) | 0x00000000;
unsigned long int down    = (reg & ~0x00006000) | 0x00002000;
unsigned long int up      = (reg & ~0x00006000) | 0x00004000;
unsigned long int chop    = (reg & ~0x00006000) | 0x00006000;

static double add_up(const double& x, const double& y) {
    volatile double r, x1 = x, y1 = y;
    _mm_setcsr(up);
    r = x1 + y1;
    _mm_setcsr(nearest);
    return r;
}
```

With -DKV_NOHWROUND, directed rounding is emulated in ordinary floating-point arithmetic: upward-rounded addition computes an error-free sum with twosum and, if the rounding error is positive, moves to the next representable double with succ:

```cpp
#include <cmath>
#include <limits>

static void twosum(const double& a, const double& b, double& x, double& y) {
    double tmp;
    x = a + b;
    if (std::fabs(a) > std::fabs(b)) {
        tmp = x - a;
        y = b - tmp;
    } else {
        tmp = x - b;
        y = a - tmp;
    }
}

static double succ(const double& x) {
    static const double th1 = std::ldexp(1., -969);
    static const double th2 = std::ldexp(1., -1021);
    static const double c1 = std::ldexp(1., -53) + std::ldexp(1., -105);
    static const double c2 = std::ldexp(1., -1074);
    static const double c3 = std::ldexp(1., 53);
    static const double c4 = std::ldexp(1., -53);
    double a, c, e;
    a = std::fabs(x);
    if (a >= th1) return x + a * c1;
    if (a < th2) return x + c2;
    c = c3 * x;
    e = c1 * std::fabs(c);
    return (c + e) * c4;
}

static double add_up(const double& x, const double& y) {
    double r, r2;
    twosum(x, y, r, r2);
    if (r == std::numeric_limits<double>::infinity()) {
        return r;
    } else if (r == -std::numeric_limits<double>::infinity()) {
        if (x == -std::numeric_limits<double>::infinity()
            || y == -std::numeric_limits<double>::infinity()) {
            return r;
        } else {
            return -(std::numeric_limits<double>::max)();
        }
    }
    if (r2 > 0.) {
        return succ(r);
    }
    return r;
}
```

With -DKV_USE_AVX512, the rounding direction is embedded in each instruction, so the rounding mode in the control register never needs to be changed:

```cpp
#include <immintrin.h>

static double add_up(const double& x, const double& y) {
    volatile __m128d x1, y1, z1;
    double r;
    x1 = _mm_set_sd(x);
    y1 = _mm_set_sd(y);
    z1 = _mm_add_round_sd(x1, y1, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
    _mm_store_sd(&r, z1);
    return r;
}
```

twoproduct, the error-free transform of multiplication, is normally implemented by Dekker's splitting, with scaling to avoid overflow of intermediate values:

```cpp
#include <cmath>

static void split(const double& a, double& x, double& y) {
    static const double sigma = (double)((1L << 27) + 1);
    double tmp;
    tmp = a * sigma;
    x = tmp - (tmp - a);
    y = a - x;
}

static void twoproduct(const double& a, const double& b, double& x, double& y) {
    static const double th = std::ldexp(1., 996);
    static const double c1 = std::ldexp(1., -28);
    static const double c2 = std::ldexp(1., 28);
    static const double th2 = std::ldexp(1., 1023);
    double na, nb, a1, a2, b1, b2;
    x = a * b;
    if (std::fabs(a) > th) {
        na = a * c1; nb = b * c2;
    } else if (std::fabs(b) > th) {
        na = a * c2; nb = b * c1;
    } else {
        na = a; nb = b;
    }
    split(na, a1, a2);
    split(nb, b1, b2);
    if (std::fabs(x) > th2) {
        y = ((((a1 * 0.5) * b1 - (x * 0.5)) * 2 + a2 * b1) + a1 * b2) + a2 * b2;
    } else {
        y = (((a1 * b1 - x) + a2 * b1) + a1 * b2) + a2 * b2;
    }
}
```

With -DKV_USE_TPFMA, twoproduct is implemented using a single fma instruction:

```cpp
#include <cmath>

static void twoproduct(const double& a, const double& b, double& x, double& y) {
    x = a * b;
    y = fma(a, b, -x);
}
```

This option seems to be usable on Intel CPUs from Haswell onward, with an OS and compiler that are not too old. Whether the option can be used can be checked by the following procedure:

- Compile test/test-fma.cc. If it does not compile, the fma instruction cannot be used.
- If the execution result is "fma works!", there is no problem; if it is "fma is broken!", the fma instruction does not have sufficient accuracy and the -DKV_USE_TPFMA option cannot be used.
- For an appropriate program using dd, compare the speeds with and without the -DKV_USE_TPFMA option. On a system where fma is executed by emulation, an abnormal slowdown will be observed.

The computer environment used is as follows:

Intel Core i5-11400 2.6 GHz/4.4 GHz, 32 GB memory

Ubuntu 22.04 LTS

gcc 11.2.0

As explained in the previous section, in some cases the calculation speed improves with the -DKV_USE_TPFMA option. In those cases, the calculation times both with and without the option are shown.

precision (type of endpoint) | compile option (in addition to -O3 -DNDEBUG) | time without -DKV_USE_TPFMA | time with -DKV_USE_TPFMA
---|---|---|---
double | none | 6.65 sec |
double | -DKV_FASTROUND | 3.74 sec |
double | -DKV_NOHWROUND | 7.88 sec | 6.05 sec
double | -DKV_USE_AVX512 -mavx512f | 2.74 sec |
dd | none | 37.20 sec | 33.58 sec
dd | -DKV_FASTROUND | 25.08 sec | 21.52 sec
dd | -DKV_NOHWROUND | 92.24 sec | 71.81 sec
dd | -DKV_USE_AVX512 -mavx512f | 16.47 sec |

Next, benchmark results on an M1 MacBook Air are described. Currently, ARM CPUs seem to incur a large execution-time penalty for hardware rounding mode changes, so emulation is the fastest, which is a disappointing result.

The computer environment used is as follows:

Apple M1 MacBook Air 3.2 GHz, 16 GB memory

macOS Monterey 12.4

gcc 12.2.0

precision (type of endpoint) | compile option (in addition to -O3 -DNDEBUG) | time without -DKV_USE_TPFMA | time with -DKV_USE_TPFMA
---|---|---|---
double | none | 24.45 sec |
double | -DKV_FASTROUND | 22.36 sec |
double | -DKV_NOHWROUND | 9.08 sec | 6.03 sec
dd | none | 108.8 sec | 100.6 sec
dd | -DKV_FASTROUND | 102.4 sec | 94.64 sec
dd | -DKV_NOHWROUND | 84.60 sec | 53.76 sec