Using accurate arithmetics to improve numerical reproducibility and stability in parallel applications

Numerical reproducibility and stability of large-scale scientific simulations, especially climate modeling, on distributed-memory parallel computers are becoming critical issues. In particular, global summation of distributed arrays is the operation most susceptible to rounding errors, and the propagation and accumulation of those errors cause uncertainty in the final simulation results. We analyzed several accurate summation methods and found that two are particularly effective at improving (or ensuring) reproducibility and stability: Kahan's self-compensated summation and Bailey's double-double precision summation. We provide an MPI operator, MPI_SUMDD, that works with MPI collective operations to ensure a scalable implementation on a large number of processors. The final methods are particularly simple to adopt in practical codes.
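As a rough illustration of the two summation methods named above, the following Python sketch compares plain summation, Kahan's compensated summation, and a double-double accumulation. It is not the authors' implementation and does not use MPI: all function names (`naive_sum`, `kahan_sum`, `two_sum`, `dd_add`, `dd_sum`, `dd_reduce`) are hypothetical, and `dd_reduce` merely stands in for the role a custom reduction operator would play when combining per-process partial sums.

```python
import math


def naive_sum(values):
    """Plain left-to-right summation in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation: carry the rounding error of each add forward."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - comp            # apply the correction to the incoming term
        t = total + y           # low-order bits of y may be lost here
        comp = (t - total) - y  # recover what was lost (with opposite sign)
        total = t
    return total


def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def dd_add(x, y):
    """Add two double-double values, each stored as a (hi, lo) pair."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    hi = s + e
    lo = e - (hi - s)
    return hi, lo


def dd_sum(values):
    """Local sum of plain doubles, accumulated as a (hi, lo) pair."""
    acc = (0.0, 0.0)
    for v in values:
        acc = dd_add(acc, (v, 0.0))
    return acc


def dd_reduce(partials):
    """Combine per-process (hi, lo) partial sums (assumed stand-in for a
    custom reduction operator); returns the rounded double result."""
    acc = (0.0, 0.0)
    for p in partials:
        acc = dd_add(acc, p)
    return acc[0]


if __name__ == "__main__":
    # One large term followed by many terms too small to register on their own.
    data = [1.0] + [1e-16] * 10000
    two_parts = [dd_sum(data[:5000]), dd_sum(data[5000:])]
    five_parts = [dd_sum(data[i:i + 2500]) for i in range(0, len(data), 2500)]
    print("naive :", naive_sum(data))
    print("kahan :", kahan_sum(data))
    print("dd    :", dd_reduce(two_parts), dd_reduce(five_parts))
    print("fsum  :", math.fsum(data))
```

In this toy data set the naive sum never moves off 1.0, because each small term is below half a unit in the last place of the running total. The compensated and double-double sums retain those contributions, and the double-double result does not depend on how the data is split into partial sums, which is the property that matters for reproducibility across processor counts.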
