The Reliability Challenge: Maintenance in the Broader Context of High Reliability Organizations

Today’s blog post is by Paul R. Schulman, a Senior Research Fellow at the University of California, Berkeley’s Center for Catastrophic Risk Management. Paul originally wrote these ideas in response to Andy Russell’s and Lee Vinsel’s New York Times op-ed. We found Paul’s reflections, which draw on work he and his colleagues have been doing for years, genuinely insightful, and they have informed our thinking and wider projects ever since.

Professors Andy Russell and Lee Vinsel raise an important point in their July 22 New York Times Sunday Review article, “Let’s Get Excited About Maintenance!” Attention to and financing of maintenance may indeed be a distant second priority in many organizations compared to innovation, with its exciting promise of new services and of market advantages over competitors.

But undervaluing maintenance is really a piece of a larger problem: a weak commitment to reliability in services, in maintenance work itself and, most importantly, in avoiding catastrophic failure in the key infrastructures on which we are increasingly dependent. My colleagues and I at the Center for Catastrophic Risk Management at the University of California, Berkeley have been studying lapses in reliability across a wide range of organizations for many years. But this undervaluing of reliability is not simply a result of the romance of innovation in these organizations. (In fact, a number of them are not too excited about innovation either.)

One problem we have seen widely in play is that reliability does not have a measurable value that can be set against competing values in modern organizations. Reliability is in essence about the failures and accidents that don’t happen. How do you measure that value against a new building or a new technology that can be seen, and that adds something new to capacity, output or sales? It might be argued that prevention of a costly failure is the value of reliability: the dog that doesn’t bark is providing a valuable silence. And, of course, people and organizations do buy insurance against future accidents. But insurance does not prevent accidents; it may even make them more likely in some cases. And insurance payments, over what turn out to be uneventful years, are often seen in retrospect as a wasted investment.

This is one reason why investments in reliability, including preventive maintenance, are often difficult to make in organizations. Executives’ careers are not likely to be advanced by committing organizational funds to prevention. Can they point proudly and claim credit from customers or shareholders for something that didn’t happen? Was a failure guaranteed to happen without those funds? Was it absolutely prevented by spending them?

Even regulatory agencies face this dilemma when they contemplate approving rate increases for public utilities on behalf of safety. Rarely does the public support higher service rates to reduce the likelihood of an accident. Yet after an accident, the public will retrospectively condemn the regulator for laxness in its oversight responsibility. Reliability investments, including maintenance, only probabilistically reduce failures and accidents. If one happened anyway, would the public be satisfied that the agency had at least delayed the accident or made its occurrence less likely? Ironically, post-failure investments in rebuilding, restoring or replacing are far easier to sell, because their added value is so much clearer and more determinate than that of prevention.

But under-investing in reliability is about more than perverse incentives. There are also cognitive and organizational challenges to enhancing reliability and safety. Cognitively, accident research offers abundant examples of a root, or even proximate, cause being “representational error”: the misperception, misunderstanding or mis-specification, by supervisors, maintenance staff or even operators, of the systems they are working with. This also applies to their appraisal of risk. In a surprising number of accident reports, someone in a key position refers to the failure of the system in words to the effect of: “I had no idea it could do that.”

Other major challenges to reliability are organizational. One classic review of a variety of catastrophic accidents by organizational analyst Barry Turner bears the title “Causes of Disaster: Sloppy Management”. Poor communication and poor decision-making processes have been linked to many catastrophic accidents. Further, reliability in service or safety is becoming less and less a property of single organizations. Many modern infrastructures and technical systems are really networked systems. Their reliability is no longer under the exclusive control of a single organization, its management and its operators. Their reliability depends on essential inputs from other organizations, and their service or product outputs are in turn the reliability inputs of downstream organizations. There are now many players among diverse organizations, each with its own culture, interests and, often, proprietary information.

It turns out that we do not really know how to manage well for reliability as an inter-organizational property. We don’t even know how to regulate for it. Many regulatory organizations silo regulation into separate divisions and departments organized around individual utilities and the organizations that own them. They are not equipped to understand or regulate interconnected infrastructure risk.

The reliability challenge is widespread in its effects, and there are no simple answers to it. In the few organizations we have studied that have made exceptional commitments to reliability and safety, a few things appear in common. First, they are managing or regulating technical systems with such catastrophic potential (nuclear power, nuclear weapons or commercial aviation) that public dread of their failure provides a strong social, political and regulatory foundation for high levels of reliability. Second, we have observed a distinctive role played by key people in a variety of operational and maintenance settings in these organizations. We have come to term them “reliability professionals”. These individuals internalize a responsibility for things turning out right, even beyond their official job descriptions. They are generally very good at pattern recognition and at combining formal deductive principles with experiential knowledge. They often follow what psychologist Gary Klein has called “recognition-primed” decision-making: fitting incoming information about a problem into previously experienced situations while remaining very sensitive to differences or anomalies in current conditions. In this way reliability professionals are always on guard against representational error.

Factors such as these help to create and sustain a culture that countervails against laxity and error, one that makes it less likely that people will shortchange prevention efforts and more likely that organizations will make substantial investments in reliability and safety, even when those investments yield only probabilistic returns.

Professors Russell and Vinsel admirably raise an important issue in citing the lack of excitement about, and the neglect of, maintenance in many organizations. The larger problem of discounting reliability, however, should also be the subject of increasing attention and social urgency.