Probe scheduling for efficient detection of silent failures

Most systems for discovering silent failures work in two phases: a continuous monitoring phase that detects the presence of failures through probe packets, and a localization phase that pinpoints the faulty element(s). This separation is important because localization requires significantly more resources than detection and should be initiated only when a fault is present. We focus on improving the efficiency of the detection phase, where the goal is to balance probing overhead against the cost of longer failure detection times. We formulate a general model that unifies the treatment of probe scheduling mechanisms, stochastic or deterministic, and of different cost objectives: minimizing average detection time (SUM) or worst-case detection time (MAX). We then focus on two classes of schedules. Memoryless schedules are a subclass of stochastic schedules that is simple and suitable for distributed deployment. We show that optimal memoryless schedules can be computed efficiently by convex programs (for SUM objectives) or linear programs (for MAX objectives) and, perhaps surprisingly, are guaranteed to have expected detection times that are not too far from the stochastic optima, which are NP-hard to compute. Deterministic schedules allow us to bound the maximum (rather than expected) cost of undetected faults but, like stochastic schedules, are NP-hard to optimize. We develop novel efficient deterministic schedulers with provable approximation ratios. An extensive simulation study on real networks demonstrates significant performance gains of our memoryless and deterministic schedulers over previous approaches. Our unified treatment also facilitates a clear comparison between different objectives and scheduling mechanisms.
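To make the memoryless setting concrete, the following is a minimal sketch under stated assumptions, not the optimization method of the paper. It assumes that each probe (test) covers a set of network elements, that at every time step one probe is drawn independently from a fixed distribution, and that a failed element is detected the first time a probe covering it is sent. Under these assumptions the detection time of an element is geometric, so its expectation is the reciprocal of the total probability of the probes that cover it. All names (`expected_detection_times`, `sum_cost`, `max_cost`, `simulate_detection`) and the per-element weights are hypothetical and introduced only for illustration.

```python
import random


def hit_probability(tests, probs, element):
    """Probability that a single random probe covers `element`."""
    return sum(p for t, p in zip(tests, probs) if element in t)


def expected_detection_times(tests, probs, elements):
    """Expected detection time per element under a memoryless schedule.

    Assumption: one probe is drawn per step, independently, so the
    detection time is geometric with parameter q and has mean 1/q.
    """
    times = {}
    for e in elements:
        q = hit_probability(tests, probs, e)
        times[e] = float("inf") if q == 0 else 1.0 / q
    return times


def sum_cost(times, weights):
    """SUM-style objective: weighted sum of expected detection times."""
    return sum(weights[e] * times[e] for e in times)


def max_cost(times, weights):
    """MAX-style objective: worst weighted expected detection time."""
    return max(weights[e] * times[e] for e in times)


def simulate_detection(tests, probs, element, rng):
    """Draw probes until one covers `element`; return the number of steps."""
    if hit_probability(tests, probs, element) == 0:
        raise ValueError("element is never probed")
    steps = 0
    while True:
        steps += 1
        probe = rng.choices(tests, weights=probs, k=1)[0]
        if element in probe:
            return steps


# Hypothetical toy instance: three probes over elements a, b, c.
tests = [frozenset("ab"), frozenset("bc"), frozenset("c")]
probs = [0.5, 0.25, 0.25]
weights = {"a": 1.0, "b": 1.0, "c": 1.0}
times = expected_detection_times(tests, probs, "abc")
```

In this toy instance element `b` is covered by two probes and is therefore detected faster in expectation than `a` or `c`. Choosing `probs` to minimize `sum_cost` or `max_cost` is the optimization step that the abstract attributes to convex and linear programs, respectively; this sketch only evaluates a given distribution.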
