Numerical precision

1. Control of the numerical precision

The library lets users specify the target numerical precision of computations. Optimized implementations can use this target to select appropriate algorithms or data types, trading accuracy that is not needed for performance.

Here, the default parameters determining the target numerical precision and range are defined. Following the IEEE Standard for Floating-Point Arithmetic (IEEE 754), precision refers to the number of significand bits (including the implicit leading bit), and range refers to the number of exponent bits.

The default values correspond to IEEE 754 double precision (binary64):

  • 53 bits of precision (52 explicit + 1 implicit)
  • 11 bits for the exponent

QMCKL_DEFAULT_PRECISION 53
QMCKL_DEFAULT_RANGE 11
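
As a quick sanity check, the defaults above can be compared against the platform's double type. The following is a minimal sketch, assuming a C99 compiler whose double is IEEE 754 binary64. The helper name exponent_bits_double is hypothetical, and the two macros are redefined locally only so that the example compiles on its own.

```c
#include <float.h>   /* DBL_MANT_DIG */
#include <limits.h>  /* CHAR_BIT */

/* Local copies of the defaults listed above, redefined here only so the
   sketch is self-contained; real code would take them from the library. */
#ifndef QMCKL_DEFAULT_PRECISION
#define QMCKL_DEFAULT_PRECISION 53
#endif
#ifndef QMCKL_DEFAULT_RANGE
#define QMCKL_DEFAULT_RANGE 11
#endif

/* Hypothetical helper: number of exponent bits of a binary double,
   computed as total bits - 1 sign bit - stored significand bits. */
static int exponent_bits_double(void) {
    const int total_bits  = (int) (sizeof(double) * CHAR_BIT);
    const int stored_bits = DBL_MANT_DIG - 1;  /* implicit leading bit is not stored */
    return total_bits - 1 - stored_bits;
}
```

On a binary64 platform, DBL_MANT_DIG equals QMCKL_DEFAULT_PRECISION and exponent_bits_double() equals QMCKL_DEFAULT_RANGE.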

The following functions set and get the precision and range parameters. The precision parameter is an integer between 2 and 53 (the number of significand bits), and the range parameter is an integer between 2 and 11 (the number of exponent bits).

These parameters allow fine-grained control over the accuracy-performance tradeoff. Lower precision may enable faster computation on specialized hardware or with optimized algorithms, at the cost of reduced accuracy.

The setter functions return a qmckl_exit_code: QMCKL_SUCCESS upon successful update, or QMCKL_FAILURE if the parameters are invalid. The getter functions return the requested value as a 32-bit integer.
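
The bounds and return behavior described above can be sketched as follows. This is an illustration only, not the library's implementation: the struct numprec_sketch, the SKETCH_* codes, and the sketch_* functions are all hypothetical stand-ins for the context and the exit codes.

```c
#include <stddef.h>  /* NULL */
#include <stdint.h>  /* int32_t */

/* Hypothetical stand-ins for the library's exit codes. */
enum { SKETCH_SUCCESS = 0, SKETCH_FAILURE = 1 };

/* Hypothetical stand-in for the part of the context holding the parameters. */
typedef struct {
    int32_t precision;  /* significand bits, 2..53 */
    int32_t range;      /* exponent bits, 2..11 */
} numprec_sketch;

/* Store the precision if it lies in [2, 53]; otherwise leave it unchanged. */
static int sketch_set_precision(numprec_sketch *np, int precision) {
    if (np == NULL || precision < 2 || precision > 53) return SKETCH_FAILURE;
    np->precision = (int32_t) precision;
    return SKETCH_SUCCESS;
}

/* Store the range if it lies in [2, 11]; otherwise leave it unchanged. */
static int sketch_set_range(numprec_sketch *np, int range) {
    if (np == NULL || range < 2 || range > 11) return SKETCH_FAILURE;
    np->range = (int32_t) range;
    return SKETCH_SUCCESS;
}

/* Getters return the stored values as 32-bit integers. */
static int32_t sketch_get_precision(const numprec_sketch *np) { return np->precision; }
static int32_t sketch_get_range(const numprec_sketch *np)     { return np->range; }
```

For example, starting from the defaults (53, 11), requesting 24 bits of precision succeeds, whereas requesting 64 bits fails and leaves the stored value untouched.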

2. Precision

qmckl_set_numprec_precision modifies the parameter for the numerical precision in the context.

qmckl_exit_code qmckl_set_numprec_precision(const qmckl_context context, const int precision);

qmckl_get_numprec_precision returns the value of the numerical precision in the context.

int32_t qmckl_get_numprec_precision(const qmckl_context context);

3. Range

qmckl_set_numprec_range modifies the parameter for the numerical range in a given context.

qmckl_exit_code qmckl_set_numprec_range(const qmckl_context context, const int range);

interface
   integer (qmckl_exit_code) function qmckl_set_numprec_range(context, range) bind(C)
     use, intrinsic :: iso_c_binding
     import
     integer (qmckl_context), intent(in), value :: context
     integer (c_int32_t), intent(in), value :: range
   end function qmckl_set_numprec_range
end interface

qmckl_get_numprec_range returns the value of the numerical range in the context.

int32_t qmckl_get_numprec_range(const qmckl_context context);

4. Helper functions

4.1. Epsilon

qmckl_get_numprec_epsilon returns \(\epsilon = 2^{1-n}\) where n is the precision. We need to remove the sign bit from the precision.

double qmckl_get_numprec_epsilon(const qmckl_context context);
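
To make the formula concrete: for the default n = 53 it gives 2^-52 (about 2.22e-16), the machine epsilon of binary64, and for n = 24 it gives 2^-23, the machine epsilon of binary32. The sketch below evaluates the formula exactly as stated. It is not necessarily how the library computes the value, and the name epsilon_sketch is hypothetical.

```c
#include <stdint.h>  /* uint64_t */

/* Hypothetical helper: epsilon = 2^(1-n) for n significand bits.
   Returns 0.0 when n is outside the documented bounds [2, 53]. */
static double epsilon_sketch(int n) {
    if (n < 2 || n > 53) return 0.0;
    const uint64_t pow2 = (uint64_t) 1 << (n - 1);  /* 2^(n-1), at most 2^52 */
    return 1.0 / (double) pow2;                     /* exact: reciprocal of a power of two */
}
```

Because 2^(n-1) is at most 2^52, both the conversion to double and the division are exact, so no rounding error enters the result.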

4.2. Testing the number of unchanged bits

To test that a given approximation keeps a given number of bits unchanged, we need a function that returns the number of unchanged bits in the range, and in the precision.

For this, we first count by how many units in the last place (ulps) two numbers differ.

int32_t qmckl_test_precision_64(double a, double b);
int32_t qmckl_test_precision_32(float a, float b);
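
One possible way to implement such a count for doubles is sketched below. It assumes IEEE 754 binary64 doubles and finite inputs, and it takes "unchanged bits" to mean 64 minus the bit length of the ulp distance, which is an assumption about the convention. All function names are hypothetical; this is not the library's code.

```c
#include <stdint.h>  /* int32_t, int64_t, uint64_t */
#include <string.h>  /* memcpy */

/* Map a double onto an integer scale that is monotonic in its value, so
   adjacent doubles map to adjacent integers (+0.0 and -0.0 both map to 0). */
static int64_t ordered_bits_sketch(double x) {
    int64_t i;
    memcpy(&i, &x, sizeof i);              /* well-defined type punning */
    return (i < 0) ? (INT64_MIN - i) : i;  /* negative values: flip the ordering */
}

/* Number of units in the last place separating a and b (NaN is not handled). */
static uint64_t ulp_distance_64_sketch(double a, double b) {
    const int64_t ia = ordered_bits_sketch(a);
    const int64_t ib = ordered_bits_sketch(b);
    /* Unsigned subtraction avoids signed overflow for operands of opposite sign. */
    return (ia >= ib) ? (uint64_t) ia - (uint64_t) ib
                      : (uint64_t) ib - (uint64_t) ia;
}

/* Assumed convention: 64 minus the number of bits needed to write the distance. */
static int32_t unchanged_bits_64_sketch(double a, double b) {
    uint64_t diff = ulp_distance_64_sketch(a, b);
    int32_t bits = 64;
    while (diff != 0) {
        diff >>= 1;
        --bits;
    }
    return bits;
}
```

With this convention, two identical numbers give 64, and two adjacent doubles (1 ulp apart) give 63.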

5. Approximate functions

5.1. Exponential

Fast exponential function, adapted from Johan Rade's implementation (https://gist.github.com/jrade/293a73f89dfef51da6522428c857802d). It is based on Schraudolph's paper:

N. Schraudolph, "A Fast, Compact Approximation of the Exponential Function", Neural Computation 11, 853–862 (1999). (available at https://nic.schraudolph.org/pubs/Schraudolph99.pdf)
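
The idea behind this family of approximations is that the bit pattern of an IEEE 754 double, read as an integer, is roughly a linear function of the base-2 logarithm of its value. Computing a*x + b in floating point and reinterpreting the integer part as a double therefore approximates exp(x). The sketch below is a C rendering of that idea for binary64. The function name is hypothetical, the shift constant is quoted from memory of the cited gist and should be treated as an assumption, and the library's actual function may differ.

```c
#include <stdint.h>  /* uint64_t */
#include <string.h>  /* memcpy */

/* Hypothetical sketch of a Schraudolph-style exponential for binary64.
   The relative error is of the order of a few percent. */
static double fast_exp_sketch(double x) {
    const double two52 = 4503599627370496.0;       /* 2^52: weight of one exponent step */
    const double a  = two52 / 0.6931471805599453;  /* 2^52 / ln(2) */
    /* 1023 is the exponent bias; the small shift balances the error (assumed value). */
    const double b  = two52 * (1023.0 - 0.04367744890362246);
    const double lo = two52;                       /* bit pattern of the smallest normal number */
    const double hi = two52 * 2047.0;              /* bit pattern of +infinity */

    if (x != x) return x;                          /* propagate NaN */

    double y = a * x + b;
    if (y < lo)      y = 0.0;                      /* underflow: return 0 */
    else if (y > hi) y = hi;                       /* overflow: return +infinity */

    const uint64_t n = (uint64_t) y;               /* integer part becomes the bit pattern */
    double r;
    memcpy(&r, &n, sizeof r);
    return r;
}
```

For instance, fast_exp_sketch(0.0) yields a value within a few percent of 1, which illustrates the accuracy traded for speed.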

Author: TREX CoE

Created: 2026-06-05 Fri 11:22
