Why Does Software Fail and What Should Be Done About It?

  • Date June 11, 2015
  • Hour 9.30 am
  • Room GSSI Main Lecture Hall
  • Speaker Kishor S. Trivedi (ECE Dept., Duke University, Durham, NC)

Most large-scale systems contain a significant amount of software. Several recent studies have established that most system outages are due to software faults. Traditional methods of fault avoidance, fault removal based on extensive testing and debugging, and fault tolerance based on design or data diversity are found wanting. The key challenge, then, is how to provide highly dependable software. We discuss a new view of fault tolerance for software-based systems. We classify software faults into Bohrbugs and Mandelbugs, and identify aging-related bugs as a subtype of the latter. Traditional methods were designed to deal with Bohrbugs. The next challenge, then, is to develop mitigation methods for Mandelbugs in general and aging-related bugs in particular. We submit that mitigation methods for Mandelbugs utilize environmental diversity. Restarting the application, failing over to an identical replica (hot, warm, or cold), and rebooting the OS are examples of mitigation techniques that rely on environmental diversity. We discuss environmental diversity from both experimental and analytic points of view. We also discuss software-aging-related faults, for which it is possible to use a proactive environmental diversity technique known as software rejuvenation.
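
The difference between reacting to an aging-related failure and rejuvenating proactively can be illustrated with a toy model. The sketch below is not taken from the talk: the `AgingService` class, its `capacity` limit, and the fixed rejuvenation `interval` are all hypothetical, chosen only to show that a restart changes the execution environment (here, a leaked-resource counter) without changing the code.

```python
class AgingService:
    """Toy service with an aging-related bug: each request leaks one unit."""

    def __init__(self, capacity=5):
        self.capacity = capacity  # hypothetical resource limit
        self.leaked = 0           # accumulated environment state

    def handle(self, request):
        # The same request succeeds or fails depending on accumulated state.
        if self.leaked >= self.capacity:
            raise RuntimeError("resource exhaustion")
        self.leaked += 1
        return f"ok:{request}"

    def restart(self):
        # A restart resets the environment; the code itself is unchanged.
        self.leaked = 0


def serve_reactive(service, requests):
    """Restart only after a failure, then retry the same request once."""
    results, failures = [], 0
    for r in requests:
        try:
            results.append(service.handle(r))
        except RuntimeError:
            failures += 1
            service.restart()
            results.append(service.handle(r))
    return results, failures


def serve_with_rejuvenation(service, requests, interval):
    """Proactively restart every `interval` requests, before a failure occurs."""
    results, failures = [], 0
    for i, r in enumerate(requests):
        if i > 0 and i % interval == 0:
            service.restart()  # planned rejuvenation
        try:
            results.append(service.handle(r))
        except RuntimeError:
            failures += 1
            service.restart()
            results.append(service.handle(r))
    return results, failures


if __name__ == "__main__":
    _, reactive_failures = serve_reactive(AgingService(), range(12))
    _, proactive_failures = serve_with_rejuvenation(AgingService(), range(12), interval=4)
    print(reactive_failures, proactive_failures)
```

In this model the reactive policy still exposes users to failures, while rejuvenating at an interval shorter than the time to exhaustion avoids them at the cost of planned restarts. Choosing that interval well is a separate modeling question that the sketch does not address.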