On the Precision Attainable with Various Floating-Point Number Systems

For scientific computations on a digital computer the set of real numbers is usually approximated by a finite set F of "floating-point" numbers. We compare the numerical accuracy possible with different choices of F having approximately the same range and requiring the same word length. In particular, we compare different choices of base (or radix) in the usual floating-point systems. The emphasis is on the choice of F, not on the details of the number representation or the arithmetic, but both rounded and truncated arithmetic are considered. Theoretical results are given, and some simulations of typical floating-point computations (forming sums, solving systems of linear equations, finding eigenvalues) are described. If the leading fraction bit of a normalized base-2 number is not stored explicitly (saving a bit), and the criterion is to minimize the mean square roundoff error, then base 2 is best. If unnormalized numbers are allowed, so the first bit must be stored explicitly, then base 4 (or sometimes base 8) is the best of the usual systems.
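The base comparison summarized above can be illustrated with a small simulation. The following Python sketch is not the paper's method; it is a minimal illustration under stated assumptions: inputs are logarithmically distributed, arithmetic is rounded (not truncated), and for equal word length and range base 4 is given one more fraction bit than base 2 and base 16 two more, because a larger base needs fewer exponent bits. The function names and the parameter `t` (the number of stored fraction bits for base 2) are hypothetical.

```python
import math


def round_to_format(x, k, frac_bits):
    """Round x > 0 to the nearest number with base 2**k and frac_bits fraction bits."""
    _, e2 = math.frexp(x)            # x = m * 2**e2 with 0.5 <= m < 1
    e = -(-e2 // k)                  # ceil(e2 / k): smallest e with 2**(k*e) > x
    shift = frac_bits - k * e
    q = round(math.ldexp(x, shift))  # fraction expressed in units of the last place
    return math.ldexp(q, -shift)


def rms_relative_error(k, frac_bits, samples=120_000, octaves=12):
    """RMS relative rounding error over inputs spread evenly in log2(x).

    Assumption: 12 octaves is a multiple of k for k = 1, 2, 4, so every
    base sees a whole number of exponent periods.
    """
    total = 0.0
    for i in range(samples):
        x = 2.0 ** ((i + 0.5) * octaves / samples)
        rel = (round_to_format(x, k, frac_bits) - x) / x
        total += rel * rel
    return math.sqrt(total / samples)


def compare_bases(t=8):
    """Compare formats of (assumed) equal word length; t is a hypothetical size."""
    return {
        "base 2, hidden bit": rms_relative_error(1, t + 1),   # implicit leading bit
        "base 2, explicit bit": rms_relative_error(1, t),
        "base 4": rms_relative_error(2, t + 1),               # one exponent bit saved
        "base 16": rms_relative_error(4, t + 2),              # two exponent bits saved
    }


if __name__ == "__main__":
    for name, err in compare_bases().items():
        print(f"{name:22s} {err:.3e}")
```

Under these assumptions the simulated ordering should agree with the abstract: base 2 with a hidden bit gives the smallest RMS error, and base 4 beats base 2 once the leading bit has to be stored explicitly. Base 8 is left out because its gain in fraction bits is not a whole number, which may be why the abstract qualifies it with "sometimes".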

[20]  Richard P. Brent On the Precision Attainable with Various Floating-Point Number Systems , 1973, IEEE Trans. Computers.
