Sources of technology failure

A recurring theme in Understanding Society is the topic of technology failure — air disasters, chemical plant explosions, deep drilling accidents. This diagram is intended to capture several dimensions of failure causes that have been discussed. The categories identified here include organizational dysfunctions, behavioral shortcomings, system failures, and regulatory dysfunctions. Each of these broad categories has contributed to the occurrence of major technology disasters, and often most or all of them are involved.

System failures. 2005 Texas City refinery explosion. A complex technology system involves a dense set of sub-systems that have multiple failure modes and multiple ways of affecting other sub-systems. As Charles Perrow points out, those system interactions are often “tightly coupled”, which means that there is very little time in which operators can attempt to diagnose the source of a failure before harmful effects have proliferated to other sub-systems. A pump fails in a cooling loop; an exhaust valve is stuck in the closed position; and nuclear fuel rods are left uncooled for less than a minute before they generate enough heat to boil away the coolant water. Related to tight coupling is the feature of complex interactions: A influences B, C, and D; B and D influence A; and C’s change of state leads to further unexpected behavior in D. The causal chains here are not linear, so once again operators and engineers are hard pressed to diagnose the root cause of an anomalous behavior in time to save the system from meltdown or catastrophic failure.
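The combination of tight coupling and complex interaction can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the graph encodes the A/B/C/D example from the paragraph above, while the function names and the per-hop delays (5 seconds versus one hour) are hypothetical choices made for the example, not figures from Perrow or from any accident report.

```python
from collections import deque

# Influence graph taken from the A/B/C/D example in the text:
# A influences B, C, D; B and D influence A; C influences D.
INFLUENCES = {
    "A": ["B", "C", "D"],
    "B": ["A"],
    "C": ["D"],
    "D": ["A"],
}


def time_to_disturbance(graph, start, delay):
    """Spread a disturbance breadth-first; each hop takes `delay` seconds.

    Returns a dict mapping every reachable sub-system to the earliest
    time (in seconds) at which the disturbance reaches it.
    """
    reached = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in reached:
                reached[neighbor] = reached[node] + delay
                queue.append(neighbor)
    return reached


def feedback_partners(graph, node):
    """Sub-systems that `node` influences and that influence `node` back."""
    return [n for n in graph.get(node, []) if node in graph.get(n, [])]


# Hypothetical delays: seconds per hop (tight) versus an hour per hop (loose).
tight = time_to_disturbance(INFLUENCES, "C", delay=5)
loose = time_to_disturbance(INFLUENCES, "C", delay=3600)

print("tight coupling:", tight)
print("loose coupling:", loose)
print("feedback partners of A:", feedback_partners(INFLUENCES, "A"))
```

In this toy model the same anomaly in C reaches every sub-system in either case; what differs is how long operators have to diagnose it before it does. The feedback partners show why the causal chain is not linear: a disturbance that reaches A comes back to it through B and D.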

And then there are failures that originate in problems in the original design of the system and its instruments. Nancy Leveson identifies many such design failures in “The Role of Software in Spacecraft Accidents” (link). A refinery example makes the same point: the explosion at the Texas City refinery (link) occurred in part because the level transmitter for the splitter tower only measured liquid height up to ten feet, the maximum permissible height of the column of flammable liquid in the tower. Above that level it only produced an alarm, which was routinely ignored. As a result the operators had no way of knowing that the liquid had risen to 80 feet and then to the top of the tower, leading to a release and subsequent fire and explosion (CSB Final Report Texas City BP): an overflow accident. And sometimes the overall system had no formal design process at all; as Andrew Hopkins observes,

Processing plants evolve and grow over time. A study of petroleum refineries in the US has shown that “the largest and most complex refineries in the sample are also the oldest … Their complexity emerged as a result of historical accretion. Processes were modified, added, linked, enhanced and replaced over a history that greatly exceeded the memories of those who worked in the refinery.” (Lessons from Longford, 33)

This implies that the whole system is not fully understood by any of the participants — executives, managers, engineers, or skilled operators.
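The instrument-design problem described above for the Texas City level transmitter can also be sketched in a few lines. This is a deliberately simplified model, not a description of the actual instrument: it assumes that the reading simply saturates at the top of its range, the function names are hypothetical, and the ten-foot range and 80-foot level are the figures given in the text.

```python
TRANSMITTER_MAX_FT = 10.0  # top of the instrument's range, as given in the text


def transmitter_reading(true_level_ft, max_range_ft=TRANSMITTER_MAX_FT):
    """Level shown to operators: the true level clipped to the instrument range.

    Assumption for this sketch: the reading saturates at the range maximum.
    """
    return min(max(true_level_ft, 0.0), max_range_ft)


def high_level_alarm(true_level_ft, max_range_ft=TRANSMITTER_MAX_FT):
    """The only signal available at or above the range maximum: on or off."""
    return true_level_ft >= max_range_ft


for level in (5.0, 10.0, 80.0):
    print(
        f"true level {level:5.1f} ft -> "
        f"reading {transmitter_reading(level):4.1f} ft, "
        f"alarm {high_level_alarm(level)}"
    )
```

Under this assumption a true level of 10 feet and a true level of 80 feet produce exactly the same reading and the same alarm state, so once the alarm has come to be routinely ignored, nothing in the control room distinguishes a marginal overfill from a catastrophic one.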

Organizational dysfunctions. Deepwater Horizon. A very wide range of organizational dysfunctions can be identified in case studies of technology disasters, from refineries to patient safety accidents. These include excessive cost reduction mandated by corporate decisions; inadequate safety culture embodied in leaders, operators, and day-to-day operations; poor inter-unit communications, where one unit concludes that a hazardous operation should be suspended but another unit doesn’t get the message; poor training and supervision; and conflicting priorities within the organization. Further problems arise higher in the organization: top managers are subject to production pressures that lead them to resist shutting down a process while anomalies are sorted out; higher-level managers sometimes lack the technical knowledge needed to recognize when a given signal or alarm may be potentially catastrophic; known process risks are poorly communicated within large companies; and large firms exercise inadequate oversight of subcontractor performance and responsibilities. Two pervasive problems are identified in a great many case studies: relentless cost containment initiatives to increase efficiency and profitability, and a lack of commitment to (and understanding of) an enterprise-wide culture of safety. In particular, it is common for executives and governing boards of high-risk enterprises to declare that “safety is our number-one priority” when what they actually focus on is “days-lost” measures of injuries in the workplace. But this conception of safety fails completely to identify system risks. (Andrew Hopkins makes a very persuasive case for the use of “safety case” regulation and detailed HAZOP analysis for a complex operation as a whole; link.)

Behavioral shortcomings. Bhopal toxic gas release, Texas City refinery accident. No organization works like a Swiss watch. Rather, specific individuals occupy positions of responsibility, and they sometimes perform their duties only imperfectly. A control room supervisor is distracted at the end of his shift and fails to provide critical information to the supervisor on the incoming shift. Process inspectors sometimes take shortcuts and certify processes that in fact contain critical sources of failure; or inspectors yield to management pressure to overlook “minor” deviations from regulations. A maintenance crew deviates from training and protocol in order to complete tasks on time, resulting in a minor accident that leads to a cascade of more serious events. Directors of separate units within a process facility fail to inform each other of anomalies that may affect the safety of other sub-systems. Staff at each level have an incentive to conceal mistakes and “near-misses” that could otherwise be corrected.

Regulatory shortcomings. Longford gas plant, Davis-Besse nuclear plant incidents, East Palestine Norfolk Southern Railway accident. Risky industries plainly require regulation. But regulatory frameworks are often seriously flawed by known dysfunctions (link, link, link): industry capture (nuclear power industry); inadequate resources (NRC); inadequate enforcement tools (Chemical Safety Board); the revolving door from industry to regulatory staff and back to industry; vulnerability to “anti-regulation” ideology expressed by industry and sympathetic legislators; and many of the dysfunctions already mentioned under the categories of organizational and behavioral shortcomings. The system of delegated regulation has been appealing to both industry and government officials. In this system central oversight is exercised by the regulatory agency, but the technical experts of the industry itself are called upon to assess critical safety features of the process being regulated. This approach substantially reduces the budget that government must provide to the regulatory agency. It is used by the Federal Aviation Administration in its oversight of airframe safety. However, the experience of the Boeing 737 MAX failures has shown that the system of delegated regulation is vulnerable to distortion by the manufacturing companies that it oversees (link).

Here is Andrew Hopkins’s multi-level analysis of the Longford Esso gas plant accident (link). This diagram illustrates each of the categories of failure mentioned here.

Consider this alternative universe. It is a world in which CEOs, executives, directors, and staff in risky enterprises have taken the time to read four to six detailed case studies of major technology accidents and have absorbed the complexity of the kinds of dysfunctions that can lead to serious disasters. Instructive case studies might include the Longford Esso gas plant explosion, the 2005 Texas City refinery explosion, the Columbia Space Shuttle disaster, the Boeing 737 MAX failure, the BP Deepwater Horizon disaster, and the Davis-Besse nuclear power plant incidents. These case studies would give enterprise leaders and staff a much more detailed understanding of the kinds of organizational and system failure that can be expected to occur in risky enterprises, and leaders and managers would be much better prepared to prevent failures like these in the future. It would be a safer world.