Last Update: 8 Nov., 2017

Changing Rounding Mode and Compile Options

Masahide Kashiwagi

- "interval arithmeric with *double* endpoint" using interval.hpp + rdouble.hpp
- "interval arithmeric with *dd* endpoint" using interval.hpp + dd.hpp + rdd.hpp
- "interval arithmeric with *mpfr* endpoint" using interval.hpp + mpfr.hpp + rmpfr.hpp

- in Microsoft Visual C++, using _controlfp_s provided in float.h.
- otherwise, using fesetround provided in fenv.h.

The reason of different implementation only for Visual C++ is as follows:

- Visual C++ 2012 and earlier did not implement fenv.h .
- In Visual C++ 2013, 2015, there was a fatal bug that the directions of upward rounding and downward rounding are reversed.
- (In Visual C++ 2017 it finally became normal.)

Also, fesetround and _controlfp_s do the following operations:

- 1. Read the control register
- 2. Rewrite the bits corresponding to the rounding mode of the read data.
- 3. Write to the control register

- 1. and 2. are executed at program start and the value of the contorl register is stored.
- In 3., just write the stored value.

Using -DKV_FASTROUND speeds up using these two methods. But, It can only be used in Intel CPU's 64bit mode.

See kv/hwround.hpp for implementation details.

For details of the algorithm, see emu.pdf (Jun 9, 2014, in Japanese).

See

- kv/rdouble-nohwround.hpp (interval arithmetic with *double* endpoint)
- kv/rdd-nohwround.hpp (interval arithmetic with *dd* endpoint)

[Note] Emulation of directed rounding by -DKV_NOHWROUND is designed on the premise that the operation of double conforms to IEEE 754, so it works in many standard environments, but in Intel CPU's 32bit mode it may not work correctly. In the 32bit mode of the Intel CPU, calculation is executed by FPU which has internal 80bit register, so the calculation result based on IEEE 754 may not be obtained in some cases.

- Xeon Phi Knights Landing
- Skylake-X (Core i9 7900X, etc.)
- Skylake-SP (Xeon Platinum/Gold/Silver/Bronze, etc.)

See

- kv/rdouble-avx512.hpp (interval arithmetic with *double* endpoint)
- kv/rdd-avx512.hpp (interval arithmetic with *dd* endpoint)

#include <fenv.h> static double add_up(const double& x, const double& y) { volatile double r, x1 = x, y1 = y; fesetround(FE_UPWARD); r = x1 + y1; fesetround(FE_TONEAREST); return r; }

#include <emmintrin.h> unsigned long int reg = _mm_getcsr(); unsigned long int nearest = (reg & ~0x00006000) | 0x00000000; unsigned long int down = (reg & ~0x00006000) | 0x00002000; unsigned long int up = (reg & ~0x00006000) | 0x00004000; unsigned long int chop = (reg & ~0x00006000) | 0x00006000; static double add_up(const double& x, const double& y) { volatile double r, x1 = x, y1 = y; _mm_setcsr(up); r = x1 + y1; _mm_setcsr(nearest); return r; }

#include <cmath> #include <limits> static void twosum(const double& a, const double& b, double& x, double& y) { double tmp; x = a + b; if (std::fabs(a) > std::fabs(b)) { tmp = x - a; y = b - tmp; } else { tmp = x - b; y = a - tmp; } } static double succ(const double& x) { static const double th1 = std::ldexp(1., -969); static const double th2 = std::ldexp(1., -1021); static const double c1 = std::ldexp(1., -53) + std::ldexp(1., -105); static const double c2 = std::ldexp(1., -1074); static const double c3 = std::ldexp(1., 53); static const double c4 = std::ldexp(1., -53); double a, c, e; a = std::fabs(x); if (a >= th1) return x + a * c1; if (a < th2) return x + c2; c = c3 * x; e = c1 * std::fabs(c); return (c + e) * c4; } static double add_up(const double& x, const double& y) { double r, r2; twosum(x, y, r, r2); if (r == std::numeric_limits<double>::infinity()) { return r; } else if (r == -std::numeric_limits<double>::infinity()) { if (x == -std::numeric_limits≤double>::infinity() || y == -std::numeric_limits::infinity()) { return r; } else { return -(std::numeric_limits≤double>::max)(); } } if (r2 > 0.) { return succ(r); } return r; }

#include <immintrin.h> static double add_up(const double& x, const double& y) { volatile __m128d x1, y1, z1; double r; x1 = _mm_set_sd(x); y1 = _mm_set_sd(y); z1 = _mm_add_round_sd(x1, y1, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC); _mm_store_sd(&r, z1); return r; }

static void split(const double& a, double& x, double& y) { static const double sigma = (double)((1L << 27) + 1); double tmp; tmp = a * sigma; x = tmp - (tmp - a); y = a - x; } static void twoproduct(const double& a, const double& b, double& x, double& y) { static const double th = std::ldexp(1., 996); static const double c1 = std::ldexp(1., -28); static const double c2 = std::ldexp(1., 28); static const double th2 = std::ldexp(1., 1023); double na, nb, a1, a2, b1, b2; x = a * b; if (std::fabs(a) > th) { na = a * c1; nb = b * c2; } else if (std::fabs(b) > th) { na = a * c2; nb = b * c1; } else { na = a; nb = b; } split(na, a1, a2); split(nb, b1, b2); if (std::fabs(x) > th2) { y = ((((a1 * 0.5) * b1 - (x * 0.5)) * 2 + a2 * b1) + a1 * b2) + a2 * b2; } else { y = (((a1 * b1 - x) + a2 * b1) + a1 * b2) + a2 * b2; } }

static void twoproduct(const double& a, const double& b, double& x, double& y) { x = a * b; y = fma(a, b, -x); }This option seems to be usable for Intel CPUs after Haswell and OS/compiler which are not so old. The usability of the option can be checked by the following procedure:

- compile test/test-fma.cc . If you can not compile it, you can not use the f ma instruction.
- If the execution result is "fma works!", there is no problem, but if it is "fma is broken!", the fma instruction does not have sufficient accu racy and the -DKV_USE_TPFMA option can not be used.
- For appropriate program using dd, compare the speeds with and without the -DKV_USE_TPFMA option. If the case of a system where fma is executed by emulation, abnormal speed down will be observed.

The computer environment used is as follows:

Intel Core i9 7900X 3.3GHz/4.3GHz, Memory 128G

Ubuntu 16.04 LTS

gcc 5.4.0

As explained in the previous section, in some cases the calculation speed will be improved with the -DKV_USE_TPFMA option. Calculation time of both is shown in that cases.

precision (type of endpoint) | compile option | calculation time | |
---|---|---|---|

without -DKV_USE_TPFMA | with -DKV_USE_TPFMA | ||

double | -O3 | 10.81 sec | |

-O3 -DKV_FASTROUND | 8.40 sec | ||

-O3 -DKV_NOHWROUND | 19.91 sec | 12.68 sec | |

-O3 -DKV_USE_AVX512 -mavx512f | 5.55 sec | ||

dd | -O3 | 54.10 sec | 44.94 sec |

-O3 -DKV_FASTROUND | 46.08 sec | 36.72 sec | |

-O3 -DKV_NOHWROUND | 185.7 sec | 118.9 sec | |

-O3 -DKV_USE_AVX512 -mavx512f | 21.42 sec |

The calculation time is the average measured 3 times.