Floating Point in C --- CS:APP

All versions of C provide two different floating-point data types: float and double. On machines that support IEEE floating point, these data types correspond to single- and double-precision floating point. In addition, the machines use the round-to-even rounding mode. Unfortunately, since the C standards do not require the machine to use IEEE floating point, there are no standard methods to change the rounding mode or to get special values such as -0, +∞, -∞, or NaN.

Most systems provide a combination of include (‘.h’) files and procedure libraries to provide access to these features, but the details vary from one system to another. For example, the GNU compiler GCC defines program constants INFINITY and NAN when the following sequence occurs in the program file:

#define _GNU_SOURCE 1

#include <math.h>
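
As a minimal sketch of how these constants might be used, assuming GCC (or another compiler whose math.h supplies INFINITY, NAN, and the signbit macro) on an IEEE machine, the following helpers produce and test the special values. The function names is_negative_zero, is_nan_by_self_compare, and print_special_values are illustrative and do not come from any library.

```c
#define _GNU_SOURCE 1
#include <math.h>
#include <stdio.h>

/* Returns 1 if x is negative zero. Because -0.0 == 0.0 is true,
   the sign bit has to be inspected separately with signbit(). */
int is_negative_zero(double x) {
    return x == 0.0 && signbit(x) != 0;
}

/* Returns 1 if x is NaN, relying on the IEEE rule that NaN
   compares unequal to every value, including itself. */
int is_nan_by_self_compare(double x) {
    return x != x;
}

/* Prints the four special values. The exact text printed for each
   one depends on the C library. */
void print_special_values(void) {
    double pos_inf = INFINITY;
    double neg_inf = -INFINITY;
    double neg_zero = -0.0;
    double not_a_number = NAN;
    printf("%f %f %f %f\n", pos_inf, neg_inf, neg_zero, not_a_number);
}
```

Calling print_special_values() from main shows how the local C library spells each value. The self-comparison test for NaN can fail if the compiler is told to assume no NaNs (for example with aggressive floating-point optimization flags), so it should be read as an illustration rather than a portable idiom.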

More recent versions of C, including ISO C99, include a third floating-point data type, long double. For many machines and compilers, this data type is equivalent to the double data type. For Intel-compatible machines, however, GCC implements this data type using an 80-bit “extended precision” format, providing a much larger range and precision than does the standard 64-bit format.
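
Whether long double actually offers more precision than double on a given system can be checked with the limits in float.h. The sketch below assumes a C99 compiler and a binary floating-point radix (FLT_RADIX equal to 2), so that the MANT_DIG macros count bits. The function names are illustrative.

```c
#include <float.h>
#include <stdio.h>

/* Number of significand digits (bits, when FLT_RADIX is 2) that
   long double has beyond double. A result of 0 means the two types
   have the same precision on this machine and compiler. */
int long_double_extra_bits(void) {
    return LDBL_MANT_DIG - DBL_MANT_DIG;
}

/* Prints the storage size and significand width of each type. */
void print_float_formats(void) {
    printf("float:       %zu bytes, %d significand bits\n",
           sizeof(float), FLT_MANT_DIG);
    printf("double:      %zu bytes, %d significand bits\n",
           sizeof(double), DBL_MANT_DIG);
    printf("long double: %zu bytes, %d significand bits\n",
           sizeof(long double), LDBL_MANT_DIG);
}
```

With GCC on an Intel-compatible machine one would expect long double to report a wider significand than double, reflecting the 80-bit extended format. The reported storage size can be larger than 10 bytes because of alignment padding, and other platforms may report no difference at all.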

When casting values between int, float, and double formats, the program changes the numeric values and the bit representations as follows (assuming a 32-bit int):

  • From int to float, the number cannot overflow, but it may be rounded.
  • From int or float to double, the exact numeric value can be preserved because double has both greater range (i.e., the range of representable values) and greater precision (i.e., the number of significant bits).
  • From double to float, the value can overflow to +∞ or -∞, since the range is smaller. Otherwise, it may be rounded, because the precision is smaller.
  • From float or double to int, the value will be rounded toward zero. For example, 1.999 will be converted to 1, while -1.999 will be converted to -1. Furthermore, the value may overflow. The C standards do not specify a fixed result for this case. Intel-compatible microprocessors designate the bit pattern [10…00] as an integer indefinite value. Any conversion from floating point to integer that cannot assign a reasonable integer approximation yields this value. Thus, the expression (int) +1e10 yields -2147483648, generating a negative value from a positive one.
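
The conversion rules above can be probed with a few small helpers. This is a sketch under the same assumptions as the list (a 32-bit int and IEEE single- and double-precision formats). The function names are illustrative.

```c
#include <math.h>

/* Returns 1 if x is exactly representable as a float. The volatile
   store forces the value to be rounded to float precision before
   it is compared. */
int int_fits_float_exactly(int x) {
    volatile float f = (float)x;
    return (double)f == (double)x;
}

/* Returns 1 if x survives a round trip through double. With a
   32-bit int and a 53-bit significand this holds for every int. */
int int_fits_double_exactly(int x) {
    return (int)(double)x == x;
}

/* Returns 1 if a finite double becomes infinite when narrowed to
   float. Assumes IEEE conversion behavior. */
int overflows_float(double x) {
    float narrowed = (float)x;
    return !isinf(x) && isinf(narrowed);
}

/* Converts by rounding toward zero. Only meaningful when the
   truncated value fits in an int; C does not specify the result
   for out-of-range inputs such as 1e10. */
int truncate_to_int(double x) {
    return (int)x;
}
```

For instance, 16777216 (2^24) fits in a float exactly, while 16777217 (2^24 + 1) needs 25 significant bits and is rounded. The out-of-range case is deliberately left unexercised here: the integer indefinite value appears when the conversion is performed by the processor at run time, and a compiler that evaluates the cast at compile time may produce a different result.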