Comparison Criticality in Sorting Algorithms

Fault tolerance techniques often presume that the end-user computation must complete flawlessly. Though such strict correctness is natural and easy to explain, it is increasingly unaffordable for extreme-scale computations, and it is blind to possible preferences among errors should errors prove inevitable. In a case study on traditional sorting algorithms, we explore a criticality measure defined over expected fault damage rather than probability of correctness. We discover novel 'error structure' in even the most familiar algorithms, and observe that different plausible error measures can qualitatively alter criticality relationships, suggesting the importance of explicit error measures and criticality in the wise deployment of the limited spare resources likely to be available in future extreme-scale computers.

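To make the idea concrete, the sketch below estimates per-comparison criticality in a familiar sort. It is a minimal illustration under stated assumptions, not the paper's experimental harness: it injects a single transient fault by flipping the outcome of exactly one comparison in insertion sort, then scores the damaged output under two plausible error measures (total positional displacement and remaining inversions). All names and the choice of measures are illustrative assumptions.

```python
# Illustrative sketch (not the paper's actual setup): flip exactly one
# comparison outcome in insertion sort and measure the resulting damage
# under two plausible error measures.

def insertion_sort(data, flip_index=None):
    """Sort a copy of `data`. If `flip_index` is given, invert the result of
    the comparison whose running index equals it (a single transient fault).
    Returns the (possibly damaged) output and the number of comparisons."""
    a = list(data)
    comparisons = 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            outcome = a[j - 1] > a[j]
            if comparisons == flip_index:
                outcome = not outcome        # inject the single fault
            comparisons += 1
            if not outcome:
                break
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
    return a, comparisons

def displacement_error(output, reference):
    """Sum of absolute positional displacements from the fault-free order
    (one plausible damage measure; assumes distinct elements)."""
    pos = {v: i for i, v in enumerate(reference)}
    return sum(abs(i - pos[v]) for i, v in enumerate(output))

def inversion_error(output):
    """Number of out-of-order pairs left in the output (another measure)."""
    n = len(output)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if output[i] > output[j])

def comparison_criticality(data):
    """Flip each comparison in turn and report the damage it causes."""
    reference, n_cmp = insertion_sort(data)
    rows = []
    for k in range(n_cmp):
        damaged, _ = insertion_sort(data, flip_index=k)
        rows.append((k, displacement_error(damaged, reference),
                     inversion_error(damaged)))
    return rows

if __name__ == "__main__":
    sample = [5, 3, 8, 1, 9, 2, 7]
    for k, disp, inv in comparison_criticality(sample):
        print(f"comparison {k:2d}: displacement = {disp:2d}, "
              f"inversions = {inv:2d}")
```

Even on small inputs, the two measures can rank the same comparisons differently, which is the kind of effect the abstract points to when it notes that the choice of error measure can qualitatively alter criticality relationships.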