The reason for using a different implementation only for Visual C++ is as follows:
Also, fesetround and _controlfp_s perform operations like the one shown in the fesetround example below.
Using -DKV_FASTROUND, the rounding mode is switched faster than with these two functions. However, it can only be used in the 64-bit mode of Intel CPUs.
See kv/hwround.hpp for implementation details.
With -DKV_NOHWROUND, directed rounding is emulated without using the hardware rounding mode. For details of the emulation algorithm, see emu.pdf (Jun 9, 2014; slightly modified on Apr. 27, 2023; in Japanese).
[Note] The emulation of directed rounding by -DKV_NOHWROUND is designed on the assumption that double arithmetic conforms to IEEE 754. This holds in many standard environments, but the 32-bit environment of Intel CPUs may not comply with IEEE 754 (the x87 FPU keeps intermediate results in extended precision, which can cause double rounding), so the kv library refuses to compile in such an environment (changed in version 0.4.54). Concretely, after #include <cfloat>, the macro FLT_EVAL_METHOD is checked; if it is not 0, compilation stops with an error. If you really want to use KV_NOHWROUND in a 32-bit Intel CPU environment, you can add the compile options -msse2 -mfpmath=sse to use SSE2, which sets the above macro to 0 and allows compilation. We decided not to support compilers that do not define this macro, or CPUs that do not have SSE2, because they are too old.
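For illustration, such a compile-time check might look like the following minimal sketch (the exact test in the library headers may differ):

```cpp
#include <cfloat>

// Refuse to build when intermediate double results may be kept in extended
// precision (e.g. the x87 FPU in 32-bit mode), because the emulation of
// directed rounding assumes every operation is rounded to binary64.
#if !defined(FLT_EVAL_METHOD) || (FLT_EVAL_METHOD != 0)
#error "KV_NOHWROUND requires FLT_EVAL_METHOD == 0 (try -msse2 -mfpmath=sse on 32-bit x86)"
#endif
```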
The code below shows how the directed-rounding operations are realized under each setting, taking the upward-rounded addition add_up as an example.
In standard environments (C99), fesetround from <fenv.h> is used:

```cpp
#include <fenv.h>

static double add_up(const double& x, const double& y) {
    volatile double r, x1 = x, y1 = y;
    fesetround(FE_UPWARD);       // switch to rounding toward +infinity
    r = x1 + y1;
    fesetround(FE_TONEAREST);    // restore rounding to nearest
    return r;
}
```
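For Visual C++, _controlfp_s from <float.h> is used instead of fesetround. A minimal sketch of how add_up could look with it (assuming the standard MSVC CRT interface; not necessarily identical to the code in kv/hwround.hpp):

```cpp
#include <float.h>

static double add_up(const double& x, const double& y) {
    volatile double r, x1 = x, y1 = y;
    unsigned int cw;
    _controlfp_s(&cw, _RC_UP, _MCW_RC);    // round toward +infinity
    r = x1 + y1;
    _controlfp_s(&cw, _RC_NEAR, _MCW_RC);  // restore rounding to nearest
    return r;
}
```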
With -DKV_FASTROUND, the rounding mode is switched by writing the SSE2 MXCSR register directly via _mm_getcsr/_mm_setcsr:

```cpp
#include <emmintrin.h>

// precomputed MXCSR values; bits 13-14 are the rounding-control field
unsigned long int reg = _mm_getcsr();
unsigned long int nearest = (reg & ~0x00006000) | 0x00000000;
unsigned long int down    = (reg & ~0x00006000) | 0x00002000;
unsigned long int up      = (reg & ~0x00006000) | 0x00004000;
unsigned long int chop    = (reg & ~0x00006000) | 0x00006000;

static double add_up(const double& x, const double& y) {
    volatile double r, x1 = x, y1 = y;
    _mm_setcsr(up);        // round toward +infinity
    r = x1 + y1;
    _mm_setcsr(nearest);   // restore rounding to nearest
    return r;
}
```
With -DKV_NOHWROUND, the emulation is built from the error-free transformation twosum and the successor function succ:

```cpp
#include <cmath>
#include <limits>

// error-free transformation of addition: x = fl(a+b) and x + y = a + b exactly
static void twosum(const double& a, const double& b, double& x, double& y) {
    double tmp;
    x = a + b;
    if (std::fabs(a) > std::fabs(b)) {
        tmp = x - a;
        y = b - tmp;
    } else {
        tmp = x - b;
        y = a - tmp;
    }
}

// successor function: returns a double slightly larger than x,
// used to adjust the result upward without changing the rounding mode
static double succ(const double& x) {
    static const double th1 = std::ldexp(1., -969);
    static const double th2 = std::ldexp(1., -1021);
    static const double c1 = std::ldexp(1., -53) + std::ldexp(1., -105);
    static const double c2 = std::ldexp(1., -1074);
    static const double c3 = std::ldexp(1., 53);
    static const double c4 = std::ldexp(1., -53);
    double a, c, e;
    a = std::fabs(x);
    if (a >= th1) return x + a * c1;
    if (a < th2) return x + c2;
    c = c3 * x;
    e = c1 * std::fabs(c);
    return (c + e) * c4;
}

static double add_up(const double& x, const double& y) {
    double r, r2;
    twosum(x, y, r, r2);
    if (r == std::numeric_limits<double>::infinity()) {
        return r;
    } else if (r == -std::numeric_limits<double>::infinity()) {
        if (x == -std::numeric_limits<double>::infinity() || y == -std::numeric_limits<double>::infinity()) {
            return r;
        } else {
            return -(std::numeric_limits<double>::max)();
        }
    }
    if (r2 > 0.) {
        return succ(r);
    }
    return r;
}
```
With -DKV_USE_AVX512 (together with -mavx512f), AVX-512 instructions with an embedded rounding mode are used, so no rounding mode switch is necessary:

```cpp
#include <immintrin.h>

static double add_up(const double& x, const double& y) {
    volatile __m128d x1, y1, z1;
    double r;
    x1 = _mm_set_sd(x);
    y1 = _mm_set_sd(y);
    // addition with the rounding mode (toward +infinity) embedded in the instruction
    z1 = _mm_add_round_sd(x1, y1, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
    _mm_store_sd(&r, z1);
    return r;
}
```
The error-free transformation twoproduct (used, for example, in the emulation of multiplication and in dd arithmetic) is implemented by default with Veltkamp splitting:

```cpp
// Veltkamp splitting: a = x + y, where x and y each fit in half-width
// so that their products can be computed exactly
static void split(const double& a, double& x, double& y) {
    static const double sigma = (double)((1L << 27) + 1);
    double tmp;
    tmp = a * sigma;
    x = tmp - (tmp - a);
    y = a - x;
}

// error-free transformation of multiplication: x = fl(a*b) and x + y = a*b,
// with scaling to avoid overflow of intermediate values
static void twoproduct(const double& a, const double& b, double& x, double& y) {
    static const double th = std::ldexp(1., 996);
    static const double c1 = std::ldexp(1., -28);
    static const double c2 = std::ldexp(1., 28);
    static const double th2 = std::ldexp(1., 1023);
    double na, nb, a1, a2, b1, b2;
    x = a * b;
    if (std::fabs(a) > th) {
        na = a * c1; nb = b * c2;
    } else if (std::fabs(b) > th) {
        na = a * c2; nb = b * c1;
    } else {
        na = a; nb = b;
    }
    split(na, a1, a2);
    split(nb, b1, b2);
    if (std::fabs(x) > th2) {
        y = ((((a1 * 0.5) * b1 - (x * 0.5)) * 2 + a2 * b1) + a1 * b2) + a2 * b2;
    } else {
        y = (((a1 * b1 - x) + a2 * b1) + a1 * b2) + a2 * b2;
    }
}
```
With -DKV_USE_TPFMA, twoproduct is implemented with the fused multiply-add instead:

```cpp
static void twoproduct(const double& a, const double& b, double& x, double& y) {
    x = a * b;
    y = fma(a, b, -x);  // exact error of the product, provided fma is truly fused
}
```

This option seems to be usable on Intel CPUs from Haswell onward, with an OS and compiler that are not too old. Whether fma behaves as a true fused multiply-add in a given environment can be checked, for example, with a small test program like the one sketched below.
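A minimal sketch of such a check (an illustrative test, not necessarily the procedure used by the library): with a true fused multiply-add, fma(a, b, -a*b) recovers the exact rounding error of the product, while a non-fused fallback returns 0.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // a * b = 1 + 2^-26 + 2^-54 exactly; rounding to double drops the 2^-54 term
    double a = 1.0 + std::ldexp(1.0, -27);
    double b = 1.0 + std::ldexp(1.0, -27);
    double x = a * b;               // rounds to 1 + 2^-26
    double y = std::fma(a, b, -x);  // a true FMA recovers the residual 2^-54
    if (y == std::ldexp(1.0, -54)) {
        std::printf("fma recovers the exact residual; twoproduct via fma looks safe\n");
    } else {
        std::printf("fma does not behave as a fused multiply-add here\n");
    }
    return 0;
}
```

Note that a correct result alone does not guarantee that a hardware FMA instruction is emitted; the library call may fall back to a slower software routine. Whether the hardware instruction is actually used (and hence whether the option pays off in speed) can be confirmed by timing or by inspecting the generated assembly, typically after compiling with -mfma or -march=native.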
First, benchmark results on an Intel CPU are shown. The computer environment used is as follows:

- Intel Core i5 11400 2.6GHz/4.4GHz, 32GB memory
- Ubuntu 22.04 LTS
- gcc 11.2.0
As explained above, in some cases the calculation speed is improved by the -DKV_USE_TPFMA option. For those cases, the calculation times both without and with the option are shown.
| precision (type of endpoint) | compile option (in addition to -O3 -DNDEBUG) | calculation time (without -DKV_USE_TPFMA) | calculation time (with -DKV_USE_TPFMA) |
|---|---|---|---|
| double | none | 6.65 sec | |
| double | -DKV_FASTROUND | 3.74 sec | |
| double | -DKV_NOHWROUND | 7.88 sec | 6.05 sec |
| double | -DKV_USE_AVX512 -mavx512f | 2.74 sec | |
| dd | none | 37.20 sec | 33.58 sec |
| dd | -DKV_FASTROUND | 25.08 sec | 21.52 sec |
| dd | -DKV_NOHWROUND | 92.24 sec | 71.81 sec |
| dd | -DKV_USE_AVX512 -mavx512f | 16.47 sec | |
Next, benchmark results on an M1 MacBook Air are described. Currently, ARM CPUs seem to incur a large execution-time penalty for hardware rounding mode changes, so the emulation (-DKV_NOHWROUND) is the fastest, which is a disappointing result.
The computer environment used is as follows:

- Apple M1 MacBook Air 3.2GHz, 16GB memory
- macOS Monterey 12.4
- gcc 12.2.0
| precision (type of endpoint) | compile option (in addition to -O3 -DNDEBUG) | calculation time (without -DKV_USE_TPFMA) | calculation time (with -DKV_USE_TPFMA) |
|---|---|---|---|
| double | none | 24.45 sec | |
| double | -DKV_FASTROUND | 22.36 sec | |
| double | -DKV_NOHWROUND | 9.08 sec | 6.03 sec |
| dd | none | 108.8 sec | 100.6 sec |
| dd | -DKV_FASTROUND | 102.4 sec | 94.64 sec |
| dd | -DKV_NOHWROUND | 84.60 sec | 53.76 sec |