System effects

Quite a few posts here have focused on the question of emergence in social ontology, the idea that there are causal processes and powers at work at the level of social entities that do not correspond to similar properties at the individual level. Here I want to raise a related question, the notion that an important aspect of the workings of the social world derives from “system effects” of the organizations and institutions through which social life transpires. A system accident or effect is one that derives importantly from the organization and configuration of the system itself, rather than the specific properties of the units.

What are some examples of system effects? Consider these phenomena:

  • Flash crashes in stock markets as a result of automated trading
  • Under-reporting of land values in agrarian fiscal regimes 
  • Grade inflation in elite universities 
  • Increase in product defect frequency following a reduction in inspections 
  • Rising frequency of industrial errors at the end of work shifts 

Here is how Nancy Leveson describes systems causation in Engineering a Safer World: Systems Thinking Applied to Safety:

Safety approaches based on systems theory consider accidents as arising from the interactions among system components and usually do not specify single causal variables or factors. Whereas industrial (occupational) safety models and event chain models focus on unsafe acts or conditions, classic system safety models instead look at what went wrong with the system’s operation or organization to allow the accident to take place. (KL 977)

Charles Perrow offers a taxonomy of systems as a hierarchy of composition in Normal Accidents: Living with High-Risk Technologies:

Consider a nuclear plant as the system. A part will be the first level — say a valve. This is the smallest component of the system that is likely to be identified in analyzing an accident. A functionally related collection of parts, as, for example, those that make up the steam generator, will be called a unit, the second level. An array of units, such as the steam generator and the water return system that includes the condensate polishers and associated motors, pumps, and piping, will make up a subsystem, in this case the secondary cooling system. This is the third level. A nuclear plan has around two dozen subsystems under this rough scheme. They all come together in the fourth level, the nuclear plant or system. Beyond this is the environment. (65)

Large socioeconomic systems like capitalism and collectivized socialism have system effects — chronic patterns of low productivity and corruption in the latter case, a tendency to inequality and immiseration in the former case. In each case the observed effect is the result of embedded features of property and labor in the two systems that result in specific kinds of outcomes. And an important dimension of social analysis is to uncover the ways in which ordinary actors pursuing ordinary goals within the context of the two systems, lead to quite different outcomes at the level of the “mode of production”. And these effects do not depend on there being a distinctive kind of actor in each system; in fact, one could interchange the actors and still find the same macro-level outcomes.

Here is a preliminary effort at a definition for this concept in application to social organizations:

A system effect is an outcome that derives from the embedded characteristics of incentive and opportunity within a social arrangement that lead normal actors to engage in activity leading to the hypothesized aggregate effect.

Once we see what the incentive and opportunity structures are, we can readily see why some fraction of actors modify their behavior in ways that lead to the outcome. In this respect the system is the salient causal factor rather than the specific properties of the actors — change the system properties and you will change the social outcome.

 

When we refer to system effects we often have unintended consequences in mind — unintended both by the individual actors and the architects of the organization or practice. But this is not essential; we can also think of examples of organizational arrangements that were deliberately chosen or designed to bring about the given outcome. In particular, a given system effect may be intended by the designer and unintended by the individual actors. But when the outcomes in question are clearly dysfunctional or “catastrophic”, it is natural to assume that they are unintended. (This, however, is one of the specific areas of insight that comes out of the new institutionalism: the dysfunctional outcome may be favorable for some sets of actors even as they are unfavorable for the workings of the system as a whole.)

 
Another common assumption about system effects is that they are remarkably stable through changes of actors and efforts to reverse the given outcome. In this sense they are thought to be somewhat beyond the control of the individuals who make up the system. The only promising way of undoing the effect is to change the incentives and opportunities that bring it about. But to the extent that a given configuration has emerged along with supporting mechanisms protecting it from deformation, changing the configuration may be frustratingly difficult.

Safety and its converse are often described as system effects. By this is often meant two things. First, there is the important insight that traditional accident analysis favors “unit failure” at the expense of more systemic factors. And second, there is the idea that accidents and failures often result from “tightly linked” features of systems, both social and technical, in which variation in one component of a system can have unexpected consequences for the operation of other components of the system. Charles Perrow describes the topic of loose and tight coupling in social systems in Normal Accidents; 89 ff,)

Philosophy and the study of technology failure

image: Adolf von Menzel, The Iron Rolling Mill (Modern Cyclopes)
 

Readers may have noticed that my current research interests have to do with organizational dysfunction and largescale technology failures. I am interested in probing the ways in which organizational failures and dysfunctions have contributed to large accidents like Bhopal, Fukushima, and the Deepwater Horizon disaster. I’ve had to confront an important question in taking on this research interest: what can philosophy bring to the topic that would not be better handled by engineers, organizational specialists, or public policy experts?

One answer is the diversity of viewpoint that a philosopher can bring to the discussion. It is evident that technology failures invite analysis from all of these specialized experts, and more. But there is room for productive contribution from reflective observers who are not committed to any of these disciplines. Philosophers have a long history of taking on big topics outside the defined canon of “philosophical problems”, and often those engagements have proven fruitful. In this particular instance, philosophy can look at organizations and technology in a way that is more likely to be interdisciplinary, and perhaps can help to see dimensions of the problem that are less apparent from a purely disciplinary perspective.

There is also a rationale based on the terrain of the philosophy of science. Philosophers of biology have usually attempted to learn as much about the science of biology as they can manage, but they lack the level of expertise of a research biologist, and it is rare for a philosopher to make an original contribution to the scientific biological literature. Nonetheless it is clear that philosophers have a great deal to add to scientific research in biology. They can contribute to better reasoning about the implications of various theories, they can probe the assumptions about confirmation and explanation that are in use, and they can contribute to important conceptual disagreements. Biology is in a better state because of the work of philosophers like David Hull and Elliot Sober.

Philosophers have also made valuable contributions to science and technology studies, bringing a viewpoint that incorporates insights from the philosophy of science and a sensitivity to the social groundedness of technology. STS studies have proven to be a fruitful place for interaction between historians, sociologists, and philosophers. Here again, the concrete study of the causes and context of large technology failure may be assisted by a philosophical perspective.

There is also a normative dimension to these questions about technology failure for which philosophy is well prepared. Accidents hurt people, and sometimes the causes of accidents involve culpable behavior by individuals and corporations. Philosophers have a long history of contribution to these kinds of problems of fault, law, and just management of risks and harms.

Finally, it is realistic to say that philosophy has an ability to contribute to social theory. Philosophers can offer imagination and critical attention to the problem of creating new conceptual schemes for understanding the social world. This capacity seems relevant to the problem of describing, analyzing, and explaining largescale failures and disasters.

The situation of organizational studies and accidents is in some ways more hospitable for contributions by a philosopher than other “wicked problems” in the world around us. An accident is complicated and complex but not particularly obscure. The field is unlike quantum mechanics or climate dynamics, which are inherently difficult for non-specialists to understand. The challenge with accidents is to identify a multi-layered analysis of the causes of the accident that permits observers to have a balanced and operative understanding of the event. And this is a situation where the philosopher’s perspective is most useful. We can offer higher-level descriptions of the relative importance of different kinds of causal factors. Perhaps the role here is analogous to messenger RNA, providing a cross-disciplinary kind of communications flow. Or it is analogous to the role of philosophers of history who have offered gentle critique of the cliometrics school for its over-dependence on a purely statistical approach to economic history.

So it seems reasonable enough for a philosopher to attempt to contribute to this set of topics, even if the disciplinary expertise a philosopher brings is more weighted towards conceptual and theoretical discussions than undertaking original empirical research in the domain.

What I expect to be the central finding of this research is the idea that a pervasive and often unrecognized cause of accidents is a systemic organizational defect of some sort, and that it is enormously important to have a better understanding of common forms of these deficiencies. This is a bit analogous to a paradigm shift in the study of accidents. And this view has important policy implications. We can make disasters less frequent by improving the organizations through which technology processes are designed and managed.

System safety

An ongoing thread of posts here is concerned with organizational causes of large technology failures. The driving idea is that failures, accidents, and disasters usually have a dimension of organizational causation behind them. The corporation, research office, shop floor, supervisory system, intra-organizational information flow, and other social elements often play a key role in the occurrence of a gas plant fire, a nuclear power plant malfunction, or a military disaster. There is a tendency to look first and foremost for one or more individuals who made a mistake in order to explain the occurrence of an accident or technology failure; but researchers such as Perrow, Vaughan, Tierney, and Hopkins have demonstrated in detail the importance of broadening the lens to seek out the social and organizational background of an accident.

It seems important to distinguish between system flaws and organizational dysfunction in considering all of the kinds of accidents mentioned here. We might specify system safety along these lines. Any complex process has the potential for malfunction. Good system design means creating a flow of events and processes that make accidents inherently less likely. Part of the task of the designer and engineer is to identify chief sources of harm inherent in the process — release of energy, contamination of food or drugs, unplanned fission in a nuclear plant — and design fail-safe processes so that these events are as unlikely as possible. Further, given the complexity of contemporary technology systems it is critical to attempt to anticipate unintended interactions among subsystems — each of which is functioning correctly but that lead to disaster in unusual but possible interaction scenarios.

In a nuclear processing plant, for example, there is the hazard of radioactive materials being brought into proximity with each other in a way that creates unintended critical mass. Jim Mahaffey’s Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima offers numerous examples of such unintended events, from the careless handling of plutonium scrap in a machining process to the transfer of a fissionable liquid from a vessel of one shape to another. We might try to handle these risks as an organizational problem: more and better training for operatives about the importance of handling nuclear materials according to established protocols, and effective supervision and oversight to ensure that the protocols are observed on a regular basis. But it is also possible to design the material processes within a nuclear plant in a way that makes unintended criticality virtually impossible — for example, by storing radioactive solutions in containers that simply cannot be brought into close proximity with each other.

Nancy Leveson is a national expert on defining and applying principles of system safety. Her book Engineering a Safer World: Systems Thinking Applied to Safety is a thorough treatment of her thinking about this subject. She offers a handful of compelling reasons for believing that safety is a system-level characteristic that requires a systems approach: the fast pace of technological change, reduced ability to learn from experience, the changing nature of accidents, new types of hazards, increasing complexity and coupling, decreasing tolerance for single accidents, difficulty in selecting priorities and making tradeoffs , more complex relationships between humans and automation, and changing regulatory and public view of safety (kl 130 ff.). Particularly important in this list is the comment about complexity and coupling: “The operation of some systems is so complex that it defies the understanding of all but a few experts, and sometimes even they have incomplete information about the system’s potential behavior” (kl 137). 

Given the fact that safety and accidents are products of whole systems, she is critical of the accident methodology generally applied to serious industrial, aerospace, and chemical accidents. This methodology involves tracing the series of events that led to the outcome, and identifying one or more events as the critical cause of the accident. However, she writes:

In general, event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the or industry. An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate evens.A narrow focus on technological components and pure engineering activities or a similar narrow focus on operator errors may lead to ignoring some of the most important factors in terms of preventing future accidents. (kl 452)

Here is a definition of system safety offered later in ESW in her discussion of the emergence of the concept within the defense and aerospace fields in the 1960s:

System Safety … is a subdiscipline of system engineering. It was created at the same time and for the same reasons. The defense community tried using the standard safety engineering techniques on their complex new systems, but the limitations became clear when interface and component interaction problems went unnoticed until it was too late, resulting in many losses and near misses. When these early aerospace accidents were investigated, the causes of a large percentage of them were traced to deficiencies in design, operations, and management. Clearly, big changes were needed. System engineering along with its subdiscipline, System Safety, were developed to tackle these problems. (kl 1007)

Here Leveson mixes system design and organizational dysfunctions as system-level causes of accidents. But much of her work in this book and her earlier Safeware: System Safety and Computers gives extensive attention to the design faults and component interactions that lead to accidents — what we might call system safety in the narrow or technical sense.

A systems engineering approach to safety starts with the basic assumption that some properties of systems, in this case safety, can only be treated adequately in the context of the social and technical system as a whole. A basic assumption of systems engineering is that optimization of individual components or subsystems will not in general lead to a system optimum; in fact, improvement of a particular subsystem may actually worsen the overall system performance because of complex, nonlinear interactions among the components. (kl 1007) 

Overall, then, it seems clear that Leveson believes that both organizational features and technical system characteristics are part of the systems that created the possibility for accidents like Bhopal, Fukushima, and Three Mile Island. Her own accident model designed to help identify causes of accidents, STAMP (Systems-Theoretic Accident Model and Processes) emphasizes both kinds of system properties.

Using this new causality model … changes the emphasis in system safety from preventing failures to enforcing behavioral safety constraints. Component failure accidents are still included, but or conception of causality is extended to include component interaction accidents. Safety is reformulated as a control problem rather than a reliability problem. (kl 1062)

In this framework, understanding why an accident occurred requires determining why the control was ineffective. Preventing future accidents requires shifting from a focus on preventing failures to the broader goal of designing and implementing controls that will enforce the necessary constraints. (kl 1084)

 Leveson’s brief analysis of the Bhopal disaster in 1984 (kl 384 ff.) emphasizes the organizational dysfunctions that led to the accident — and that were completely ignored by the Indian state’s accident investigation of the accident: out-of-service gauges, alarm deficiencies, inadequate response to prior safety audits, shortage of oxygen masks, failure to inform the police or surrounding community of the accident, and an environment of cost cutting that impaired maintenance and staffing. “When all the factors, including indirect and systemic ones, are considered, it becomes clear that the maintenance worker was, in fact, only a minor and somewhat irrelevant player in the loss. Instead, degradation in the safety margin occurred over time and without any particular single decision to do so but simply as a series of decisions that moved the plant slowly toward a situation where any slight error would lead to a major accident” (kl 447).

What the boss wants to hear …

According to David Halberstam in his outstanding history of the war in Vietnam, The Best and the Brightest, a prime cause of disastrous decision-making by Presidents Kennedy and Johnson was an institutional imperative in the Defense Department to come up with a set of facts that conformed to what the President wanted to hear. Robert McNamara and McGeorge Bundy were among the highest-level miscreants in Halberstam’s account; they were determined to craft an assessment of the situation on the ground in Vietnam that conformed best with their strategic advice to the President.

Ironically, a very similar dynamic led to one of modern China’s greatest disasters, the Great Leap Forward famine in 1959. The Great Helmsman was certain that collective agriculture would be vastly more productive than private agriculture; and following the collectivization of agriculture, party officials in many provinces obliged this assumption by reporting inflated grain statistics throughout 1958 and 1959. The result was a famine that led to at least twenty million excess deaths during a two-year period as the central state shifted resources away from agriculture (Frank DikötterMao’s Great Famine: The History of China’s Most Devastating Catastrophe, 1958-62).

More mundane examples are available as well. When information about possible sexual harassment in a given department is suppressed because “it won’t look good for the organization” and “the boss will be unhappy”, the organization is on a collision course with serious problems. When concerns about product safety or reliability are suppressed within the organization for similar reasons, the results can be equally damaging, to consumers and to the corporation itself. General Motors, Volkswagen, and Michigan State University all seem to have suffered from these deficiencies of organizational behavior. This is a serious cause of organizational mistakes and failures. It is impossible to make wise decisions — individual or collective — without accurate and truthful information from the field. And yet the knowledge of higher-level executives depends upon the truthful and full reporting of subordinates, who sometimes have career incentives that work against honesty.

So how can this unhappy situation be avoided? Part of the answer has to do with the behavior of the leaders themselves. It is important for leaders to explicitly and implicitly invite the truth — whether it is good news or bad news. Subordinates must be encouraged to be forthcoming and truthful; and bearers of bad news must not be subject to retaliation. Boards of directors, both private and public, need to make clear their own expectations on this score as well: that they expect leading executives to invite and welcome truthful reporting, and that they expect individuals throughout the organization to provide truthful reporting. A culture of honesty and transparency is a powerful antidote to the disease of fabrications to please the boss.

Anonymous hotlines and formal protection of whistle-blowers are other institutional arrangements that lead to greater honesty and transparency within an organization. These avenues have the advantage of being largely outside the control of the upper executives, and therefore can serve as a somewhat independent check on dishonest reporting.

A reliable practice of accountability is also a deterrent to dishonest or partial reporting within an organization. The truth eventually comes out — whether about sexual harassment, about hidden defects in a product, or about workplace safety failures. When boards of directors and organizational policies make it clear that there will be negative consequences for dishonest behavior, this gives an ongoing incentive of prudence for individuals to honor their duties of honesty within the organization.

This topic falls within the broader question of how individual behavior throughout an organization has the potential for giving rise to important failures that harm the public and harm the organization itself. 


Empowering the safety officer?

How can industries involving processes that create large risks of harm for individuals or populations be modified so they are more capable of detecting and eliminating the precursors of harmful accidents? How can nuclear accidents, aviation crashes, chemical plant explosions, and medical errors be reduced, given that each of these activities involves large bureaucratic organizations conducting complex operations and with substantial inter-system linkages? How can organizations be reformed to enhance safety and to minimize the likelihood of harmful accidents?

One of the lessons learned from the Challenger space shuttle disaster is the importance of a strongly empowered safety officer in organizations that deal in high-risk activities. This means the creation of a position dedicated to ensuring safe operations that falls outside the normal chain of command. The idea is that the normal decision-making hierarchy of a large organization has a built-in tendency to maintain production schedules and avoid costly delays. In other words, there is a built-in incentive to treat safety issues with lower priority than most people would expect.

If there had been an empowered safety officer in the launch hierarchy for the Challenger launch in 1986, there is a good chance this officer would have listened more carefully to the Morton-Thiokol engineering team’s concerns about low temperature damage to O-rings and would have ordered a halt to the launch sequence until temperatures in Florida raised to the critical value. The Rogers Commission faulted the decision-making process leading to the launch decision in its final report on the accident (The Report of the Presidential Commission on the Space Shuttle Challenger Accident – The Tragedy of Mission 51-L in 1986 – Volume OneVolume TwoVolume Three).

This approach is productive because empowering a safety officer creates a different set of interests in the management of a risky process. The safety officer’s interest is in safety, whereas other decision makers are concerned about revenues and costs, public relations, reputation, and other instrumental goods. So a dedicated safety officer is empowered to raise safety concerns that other officers might be hesitant to raise. Ordinary bureaucratic incentives may lead to underestimating risks or concealing faults; so lowering the accident rate requires giving some individuals the incentive and power to act effectively to reduce risks.

Similar findings have emerged in the study of medical and hospital errors. It has been recognized that high-risk activities are made less risky by empowering all members of the team to call a halt in an activity when they perceive a safety issue. When all members of the surgical team are empowered to halt a procedure when they note an apparent error, serious operating-room errors are reduced. (Here is a report from the American College of Obstetricians and Gynecologists on surgical patient safety; link. And here is a 1999 National Academy report on medical error; link.)

The effectiveness of a team-based approach to safety depends on one central fact. There is a high level of expertise embodied in the staff operating a surgical suite, an engineering laboratory, or a drug manufacturing facility. By empowering these individuals to stop a procedure when they judge there is an unrecognized error in play, this greatly extend the amount of embodied knowledge involved in a process. The surgeon, the commanding officer, or the lab director is no longer the sole expert whose judgments count.

But it also seems clear that these innovations don’t work equally well in all circumstances. Take nuclear power plant operations. In Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima James Mahaffey documents multiple examples of nuclear accidents that resulted from the efforts of mid-level workers to address an emerging problem in an improvised way. In the case of nuclear power plant safety, it appears that the best prescription for safety is to insist on rigid adherence to pre-established protocols. In this case the function of a safety officer is to monitor operations to ensure protocol conformance — not to exercise independent judgment about the best way to respond to an unfavorable reactor event.

It is in fact an interesting exercise to try to identify the kinds of operations in which these innovations are likely to be effective.

Here is a fascinating interview in Slate with Jim Bagian, a former astronaut, one-time director of the Veteran Administration’s National Center for Patient Safety, and distinguished safety expert; link. Bagian emphasizes the importance of taking a system-based approach to safety. Rather than focusing on finding blame for specific individuals whose actions led to an accident, Bagian emphasizes the importance of tracing back to the institutional, organizational, or logistic background of the accident. What can be changed in the process — of delivering medications to patients, of fueling a rocket, or of moving nuclear solutions around in a laboratory — that make the likelihood of an accident substantially lower?

The safety principles involved here seem fairly simple: cultivate a culture in which errors and near-misses are reported and investigated without blame; empower individuals within risky processes to halt the process if their expertise and experience indicates the possibility of a significant risky error; create individuals within organizations whose interests are defined in terms of the identification and resolution of unsafe practices or conditions; and share information about safety within the industry and with the public.

Mechanisms, singular and general

Let’s think again about the semantics of causal ascriptions. Suppose that we want to know what  caused a building crane to collapse during a windstorm. We might arrive at an account something like this:

  • An unusually heavy gust of wind at 3:20 pm, in the presence of this crane’s specific material and structural properties, with the occurrence of the operator’s effort to adjust the crane’s extension at 3:21 pm, brought about cascading failures of structural elements of the crane, leading to collapse at 3:25 pm.

The process described here proceeds from the “gust of wind striking the crane” through an account of the material and structural properties of the device, incorporating the untimely effort by the operator to readjust the device’s extension, leading to a cascade from small failures to a large failure. And we can identify the features of causal necessity that were operative at the several links of the chain.

Notice that there are few causal regularities or necessary and constant conjunctions in this account. Wind does not usually bring about the collapse of cranes; if the operator’s intervention had occurred a few minutes earlier or later, perhaps the failure would not have occurred; and small failures do not always lead to large failures. Nonetheless, in the circumstances described here there is causal necessity extending from the antecedent situation at 3:15 pm to the full catastrophic collapse at 3:25 pm.

Does this narrative identify a causal mechanism? Are we better off describing this as a sequences of cause-effect sequences, none of which represents a causal mechanism per se? Or, on the contrary, can we look at the whole sequence as a single causal mechanism — though one that is never to be repeated? Does a causal mechanism need to be a recurring and robust chain of events, or can it be a highly unique and contingent chain?

Most mechanisms theorists insist on a degree of repeatability in the sequences that they describe as “mechanisms”. A causal mechanism is the triggering pathway through which one event leads to the production of another event in a range of circumstances in an environment. Fundamentally a causal mechanism is a “molecule” of causal process which can recur in a range of different social settings.

For example:

  • X typically brings about O.

Whenever this sequence of events occurs, in the appropriate timing, the outcome O is produced. This ensemble of events {X, O} is a single mechanism.

And here is the crucial point: to call this a mechanism requires that this sequence recurs in multiple instances across a range of background conditions.

This suggests an answer to the question about the collapsing crane: the sequence from gust to operator error to crane collapse is not a mechanism, but is rather a unique causal sequence. Each part of the sequence has a causal explanation available; each conveys a form of causal necessity in the circumstances. But the aggregation of these cause-effect connections falls short of constituting a causal mechanism because the circumstances in which it works are all but unique. A satisfactory causal explanation of the internal cause-effect pairs will refer to real repeatable mechanisms — for example, “twisting a steel frame leads to a loss of support strength”. But the concatenation does not add up to another, more complex, mechanism.

Contrast this with “stuck valve” accidents in nuclear power reactors. Valves control the flow of cooling fluids around the critical fuel. If the fuel is deprived of coolant it rapidly overheats and melts. A “stuck valve-loss of fluid-critical overheating” sequence is a recognized mechanism of nuclear meltdown, and has been observed in a range of nuclear-plant crises. It is therefore appropriate to describe this sequence as a genuine causal mechanism in the creation of a nuclear plant failure.

(Stuart Glennan takes up a similar question in “Singular and General Causal Relations: A Mechanist Perspective”; link.)

Nuclear accidents

 
diagrams: Chernobyl reactor before and after
 

Nuclear fission is one of the world-changing discoveries of the mid-twentieth century. The atomic bomb projects of the United States led to the atomic bombing of Japan in August 1945, and the hope for limitless electricity brought about the proliferation of a variety of nuclear reactors around the world in the decades following World War II. And, of course, nuclear weapons proliferated to other countries beyond the original circle of atomic powers.

Given the enormous energies associated with fission and the dangerous and toxic properties of radioactive components of fission processes, the possibility of a nuclear accident is a particularly frightening one for the modern public. The world has seen the results of several massive nuclear accidents — Chernobyl and Fukushima in particular — and the devastating results they have had on human populations and the social and economic wellbeing of the regions in which they occurred.

Safety is therefore a paramount priority in the nuclear industry, both in research labs and military and civilian applications. So what is the situation of safety in the nuclear sector? Jim Mahaffey’s Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima is a detailed and carefully researched attempt to answer this question. And the information he provides is not reassuring. Beyond the celebrated and well-known disasters at nuclear power plants (Three Mile Island, Chernobyl, Fukushima), Mahaffey refers to hundreds of accidents involving reactors, research laboratories, weapons plants, and deployed nuclear weapons that have had less public awareness. These accidents resulted in a very low number of lives lost, but their frequency is alarming. They are indeed “normal accidents” (Perrow, Normal Accidents: Living with High-Risk Technologies. For example:

  • a Japanese fishing boat is contaminated by fallout from Castle Bravo test of hydrogen bomb; lots of radioactive fish at the markets in Japan (March 1, 1954) (kl 1706)
  • one MK-6 atomic bomb is dropped on Mars Bluff, South Carolina, after a crew member accidentally pulled the emergency bomb release handle (February 5, 1958) (kl 5774)
  • Fermi 1 liquid sodium plutonium breeder reactor experiences fuel meltdown during startup trials near Detroit (October 4, 1966) (kl 4127)

Mahaffey also provides detailed accounts of the most serious nuclear accidents and meltdowns during the past forty years, Three Mile Island, Chernobyl, and Fukushima.

The safety and control of nuclear weapons is of particular interest. Here is Mahaffey’s summary of “Broken Arrow” events — the loss of atomic and fusion weapons:

Did the Air Force ever lose an A-bomb, or did they just misplace a few of them for a short time? Did they ever drop anything that could be picked up by someone else and used against us? Is humanity going to perish because of poisonous plutonium spread that was snapped up by the wrong people after being somehow misplaced? Several examples will follow. You be the judge. 

Chuck Hansen [

U.S. Nuclear Weapons – The Secret History

] was wrong about one thing. He counted thirty-two “Broken Arrow” accidents. There are now sixty-five documented incidents in which nuclear weapons owned by the United States were lost, destroyed, or damaged between 1945 and 1989. These bombs and warheads, which contain hundreds of pounds of high explosive, have been abused in a wide range of unfortunate events. They have been accidentally dropped from high altitude, dropped from low altitude, crashed through the bomb bay doors while standing on the runway, tumbled off a fork lift, escaped from a chain hoist, and rolled off an aircraft carrier into the ocean. Bombs have been abandoned at the bottom of a test shaft, left buried in a crater, and lost in the mud off the coast of Georgia. Nuclear devices have been pounded with artillery of a foreign nature, struck by lightning, smashed to pieces, scorched, toasted, and burned beyond recognition. Incredibly, in all this mayhem, not a single nuclear weapon has gone off accidentally, anywhere in the world. If it had, the public would know about it. That type of accident would be almost impossible to conceal. (kl 5527)

There are a few common threads in the stories of accident and malfunction that Mahaffey provides. First, there are failures of training and knowledge on the part of front-line workers. The physics of nuclear fission are often counter-intuitive, and the idea of critical mass does not fully capture the danger of a quantity of fissionable material. The geometry of the storage of the material makes a critical difference in going critical. Fissionable material is often transported and manipulated in liquid solution; and the shape and configuration of the vessel in which the solution is held makes a difference to the probability of exponential growth of neutron emission — leading to runaway fission of the material. Mahaffey documents accidents that occurred in nuclear materials processing plants that resulted from plant workers applying what they knew from industrial plumbing to their efforts to solve basic shop-floor problems. All too often the result was a flash of blue light and the release of a great deal of heat and radioactive material.

Second, there is a fault at the opposite end of the knowledge spectrum — the tendency of expert engineers and scientists to believe that they can solve complicated reactor problems on the fly. This turned out to be a critical problem at Chernobyl (kl 6859).

The most difficult problem to handle is that the reactor operator, highly trained and educated with an active and disciplined mind, is liable to think beyond the rote procedures and carefully scheduled tasks. The operator is not a computer, and he or she cannot think like a machine. When the operator at NRX saw some untidy valve handles in the basement, he stepped outside the procedures and straightened them out, so that they were all facing the same way. (kl 2057)

There are also clear examples of inappropriate supervision in the accounts shared by Mahaffey. Here is an example from Chernobyl.

[Deputy chief engineer] Dyatlov was enraged. He paced up and down the control panel, berating the operators, cursing, spitting, threatening, and waving his arms. He demanded that the power be brought back up to 1,500 megawatts, where it was supposed to be for the test. The operators, Toptunov and Akimov, refused on grounds that it was against the rules to do so, even if they were not sure why. 

Dyatlov turned on Toptunov. “You lying idiot! If you don’t increase power, Tregub will!”  

Tregub, the Shift Foreman from the previous shift, was officially off the clock, but he had stayed around just to see the test. He tried to stay out of it. 

Toptunov, in fear of losing his job, started pulling rods. By the time he had wrestled it back to 200 megawatts, 205 of the 211 control rods were all the way out. In this unusual condition, there was danger of an emergency shutdown causing prompt supercriticality and a resulting steam explosion. At 1: 22: 30 a.m., a read-out from the operations computer advised that the reserve reactivity was too low for controlling the reactor, and it should be shut down immediately. Dyatlov was not worried. “Another two or three minutes, and it will be all over. Get moving, boys! (kl 6887)

This was the turning point in the disaster.

A related fault is the intrusion of political and business interests into the design and conduct of high-risk nuclear actions. Leaders want a given outcome without understanding the technical details of the processes they are demanding; subordinates like Toptunov are eventually cajoled or coerced into taking the problematic actions. The persistence of advocates for liquid sodium breeder reactors represents a higher-level example of the same fault. Associated with this role of political and business interests is an impulse towards secrecy and concealment when accidents occur and deliberate understatement of the public dangers created by an accident — a fault amply demonstrated in the Fukushima disaster.

Atomic Accidents provides a fascinating history of events of which most of us are unaware. The book is not primarily intended to offer an account of the causes of these accidents, but rather the ways in which they unfolded and the consequences they had for human welfare. (Generally speaking his view is that nuclear accidents in North America and Western Europe have had remarkably few human casualties.) And many of the accidents he describes are exactly the sorts of failures that are common in all largescale industrial and military processes.

(Largescale technology failure has come up frequently here. See these posts for analysis of some of the organizational causes of technology failure (link, link, link).)

Accident analysis and systems thinking

Complex socio-technical systems fail; that is, accidents occur. And it is enormously important for engineers and policy makers to have a better way of thinking about accidents than is the current protocol following an air crash, a chemical plant fire, or the release of a contaminated drug. We need to understand better what the systems and organizational causes of an accident are; even more importantly, we need to have a basis for improving the safe functioning of complex socio-technical systems by identifying better processes and better warning indicators of impending failure.

A long-term leader in the field of systems-safety thinking is Nancy Leveson, a professor of aeronautics and astronautics at MIT and the author of Safeware: System Safety and Computers (1995) and Engineering a Safer World: Systems Thinking Applied to Safety (2012). Leveson has been a particular advocate for two insights: looking at safety as a systems characteristic, and looking for the organizational and social components of safety and accidents as well as the technical event histories that are more often the focus of accident analysis. Her approach to safety and accidents involves looking at a technology system in terms of the set of controls and constraints that have been designed into the process to prevent accidents. “Accidents are seen as resulting from inadequate control or enforcement of constraints on safety-related behavior at each level of the system development and system operations control structures.” (25)

The abstract for her essay “A New Accident Model for Engineering Safety” (link) captures both points.

New technology is making fundamental changes in the etiology of accidents and is creating a need for changes in the explanatory mechanisms used. We need better and less subjective understanding of why accidents occur and how to prevent future ones. The most effective models will go beyond assigning blame and instead help engineers to learn as much as possible about all the factors involved, including those related to social and organizational structures. This paper presents a new accident model founded on basic systems theory concepts. The use of such a model provides a theoretical foundation for the introduction of unique new types of accident analysis, hazard analysis, accident prevention strategies including new approaches to designing for safety, risk assessment techniques, and approaches to designing performance monitoring and safety metrics.

The accident model she describes in this article and elsewhere is STAMP (Systems-Theoretic Accident Model and Processes). Here is a short description of the approach.

In STAMP, systems are viewed as interrelated components that are kept in a state of dynamic equilibrium by feedback loops of information and control. A system in this conceptualization is not a static design—it is a dynamic process that is continually adapting to achieve its ends and to react to changes in itself and its environment. The original design must not only enforce appropriate constraints on behavior to ensure safe operation, but the system must continue to operate safely as changes occur. The process leading up to an accident (loss event) can be described in terms of an adaptive feedback function that fails to maintain safety as performance changes over time to meet a complex set of goals and values…. 

The basic concepts in STAMP are constraints, control loops and process models, and levels of control. (12)

The other point of emphasis in Leveson’s treatment of safety is her consistent effort to include the social and organizational forms of control that are a part of the safe functioning of a complex technological system.

Event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management deficiencies, and flaws in the safety culture of the company or industry. An accident model should encourage a broad view of accident mechanisms that expands the investigation from beyond the proximate events. (6)

She treats the organizational backdrop of the technology process in question as being a crucial component of the safe functioning of the process.

Social and organizational factors, such as structural deficiencies in the organization, flaws in the safety culture, and inadequate management decision making and control are directly represented in the model and treated as complex processes rather than simply modeling their reflection in an event chain. (26)

And she treats organizational features as another form of control system (along the lines of Jay Forrester’s early definitions of systems in Industrial Dynamics.

Modeling complex organizations or industries using system theory involves dividing them into hierarchical levels with control processes operating at the interfaces between levels (Rasmussen, 1997). Figure 4 shows a generic socio-technical control model. Each system, of course, must be modeled to reflect its specific features, but all will have a structure that is a variant on this one. (17)

Here is figure 4:

The approach embodied in the STAMP framework is that safety is a systems effect, dynamically influenced by the control systems embodied in the total process in question.

In STAMP, systems are viewed as interrelated components that are kept in a state of dynamic equilibrium by feedback loops of information and control. A system in this conceptualization is not a static design—it is a dynamic process that is continually adapting to achieve its ends and to react to changes in itself and its environment. The original design must not only enforce appropriate constraints on behavior to ensure safe operation, but the system must continue to operate safely as changes occur. The process leading up to an accident (loss event) can be described in terms of an adaptive feedback function that fails to maintain safety as performance changes over time to meet a complex set of goals and values. (12) 

And:

In systems theory, systems are viewed as hierarchical structures where each level imposes constraints on the activity of the level beneath it—that is, constraints or lack of constraints at a higher level allow or control lower-level behavior (Checkland, 1981). Control laws are constraints on the relationships between the values of system variables. Safety-related control laws or constraints therefore specify those relationships between system variables that constitute the nonhazardous system states, for example, the power must never be on when the access door is open. The control processes (including the physical design) that enforce these constraints will limit system behavior to safe changes and adaptations. (17)

Leveson’s understanding of systems theory brings along with it a strong conception of “emergence”. She argues that higher levels of systems possess properties that cannot be reduced to the properties of the components, and that safety is one such property:

In systems theory, complex systems are modeled as a hierarchy of levels of organization, each more complex than the one below, where a level is characterized by having emergent or irreducible properties. Hierarchy theory deals with the fundamental differences between one level of complexity and another. Its ultimate aim is to explain the relationships between different levels: what generates the levels, what separates them, and what links them. Emergent properties associated with a set of components at one level in a hierarchy are related to constraints upon the degree of freedom of those components. (11)

But her understanding of “irreducible” seems to be different from that commonly used in the philosophy of science. She does in fact believe that these higher-level properties can be explained by the system of properties at the lower levels — for example, in this passage she asks “… what generates the levels” and how the emergent properties are “related to constraints” imposed on the lower levels. In other words, her position seems to be similar to that advanced by Dave Elder-Vass (link): emergent properties are properties at a higher level that are not possessed by the components, but which depend upon the interactions and composition of the lower-level components.

The domain of safety engineering and accident analysis seems like a particularly suitable place for Bayesian analysis. It seems unavoidable that accident analysis involves both frequency-based probabilities (e.g. the frequency of pump failure) and expert-based estimates of the likelihood of a particular kind of failure (e.g. the likelihood that a train operator will slacken attention to track warnings in response to company pressure on timetable). Bayesian techniques are suitable for the task of combining these various kinds of estimates of risk into a unified calculation.

The topic of safety and accidents is particularly relevant to Understanding Society because it expresses very clearly the causal complexity of the social world in which we live. And rather than simply ignoring that complexity, the systematic study of accidents gives us an avenue for arriving at better ways of representing, modeling, and intervening in parts of that complex world.

 

System safety engineering

Why do complex technologies so often fail, and fail in such unexpected ways? Why is it so difficult for hospitals, chemical plants, and railroads to design their processes in such a way as to dramatically reduce the accident rate? How should we attempt to provide systematic analysis of the risks that a given technology presents and the causes of accidents that sometimes ensue? Earlier posts have looked at the ways that sociologists have examined this problem (link, link, link); but how do gifted engineers address the issue?

Nancy Leveson’s current book, Engineering a Safer World: Systems Thinking Applied to Safety (2012), is an outstanding introduction to system safety engineering. This book brings forward the pioneering work that she did in Safeware: System Safety and Computers (1994) with new examples and new contributions to the field of safety engineering.

Leveson’s basic insight, here and in her earlier work, is that technical failure is rarely the result of the failure of a single component. Instead, failures result from multiple incidents involving the components, and unintended interactions among the components. So safety is a feature of the system as a whole, not of the individual sub-systems and components. Here is how she puts the point in Engineering a Safer World:

Safety is a system property, not a component property, and must be controlled at the system level, not the component level. (kl 263)

Traditional risk and failure analysis focuses on specific pathways that lead to accidents, identifying potential points of failure and the singular “causes” of the accident (most commonly including operator error). Leveson believes that this approach is no longer helpful. Instead she argues for what she calls a “new accident model” — a better and more comprehensive way of analyzing the possibilities of accident scenarios and the causes of actual accidents. This new conception has several important parts (kl 877-903):

  • expand accident analysis by forcing consideration of factors other than component failures and human errors
  • provide a more scientific way to model accidents that produces a better and less subjective understanding of why the accident occurred
  • include system design errors and dysfunctional system interactions
  • allow for and encourage new types of hazard analyses and risk assessments 
  • shift the emphasis in the role of humans in accidents from errors … to focus on the mechanisms and factors that shape human behavior
  • encourage a shift in the emphasis in accident analysis from “cause” … to understanding accidents in terms of reasons, that is, why the events and errors occurred
  • allow for and encourage multiple viewpoints and multiple interpretations when appropriate
  • assist in defining operational metrics and analyzing performance data

Leveson is particularly dissatisfied with the formal apparatus in use in engineering and elsewhere when it comes to analysis of safety and accident causation, and she argues that there are a number of misleading conflations in the field that need to be addressed. One of these is the conflation between reliability and safety. Reliability is an assessment of the performance of a component relative to its design. But Leveson points out that systems like automobiles, chemical plants, and weapons systems can all consist of components that are highly reliable and yet that give rise to highly destructive and unanticipated accidents.

So thinking about accidents in terms of component failure is a serious misreading of the nature of the technologies with which we interact every day. Instead she argues that safety engineering must be systems engineering:

The solution, I believe, lies in creating approaches to safety based on modern systems thinking and systems theory. (kl 88)

One important part of a better understanding of accidents and safety is a recognition of the fact of complexity in contemporary technology systems — interactive complexity, dynamic complexity, decompositional complexity, and nonlinear complexity (kl 139). Each of these forms of complexity makes it more difficult to anticipate possible accidents, and more difficult to assign discrete accident pathways to the occurrence of an accident.

Accidents are complex processes involving the entire sociotechnical system. Traditional event-chain models cannot describe this process adequately. (kl 496)

Leveson is highly critical of iterative safety engineering — what she calls the “fly-fix-fly” approach. Given the severity of outcomes that are possible when it comes to control systems for nuclear weapons, the operations of nuclear reactors, or the air traffic control system, we need to be able to do better than simply improving safety processes following an accident (kl 148).

The model that she favors is called STAMP (Systems-Theoretic Accident Model and Processes; kl 1059). This model replaces the linear component-by-component analysis of technical devices with a system-level representation of their functioning. The STAMP approach begins with an effort to identify crucial safety constraints for a given system. (For example, in the Union Carbide plant at Bhopal, “never allow MIC to come in contact with water”; in design of the Mars Polar Lander, “don’t allow the spacecraft to impact the planet surface with more than a maximum force” (kl 1074); in design of public water systems, “water quality must not be compromises” (kl 1205).) Once the constraints are specified, the issue of control arises; what are the internal and external processes that ensure that the constraints are continuously satisfied? This devolves into a set of questions about system design and system administration; the instrumentation that is developed to measure compliance with the constraint and the management systems that are in place to ensure continuous compliance.

Also of interest in the book is Leveson’s description of a new systems-level way of analyzing the hazards associated with a device or technology, STPA (System-Theoretic Process Analysis) (kl 2732). She describes STPA as the hazards analysis associated with the risks identified by STAMP:

STPA has two main steps:

  1. Identify the potential for inadequate control of the system that could lead to a hazardous state.
  2. Determine how each potentially hazardous control action identified in step 1 could occur. (kl 2758)

Here is an example of the process through which an STPA risk analysis proceeds for NASA (kl 2995).

It would be very interesting to see how an engineer would employ the STAMP and STPA methodologies to evaluate the risks and hazards associated with swarms of autonomous vehicles. Each vehicle is a system that can be analyzed using the STAMP methodology. But likewise the workings of an expressway with hundreds of autonomous vehicles (perhaps interspersed with less predictable human drivers) is also a system with complex characteristics.

Each individual vehicle has a hierarchical system of control designed to ensure safe transportation of its passengers and the vehicle itself; what are the failure modes for this control system? And what about the swarm — given that each vehicle is responsive to the other vehicles around it, how will individual cars respond to unusual circumstances (a jack-knifed truck blocking all three lanes, let’s say)? It would appear that autonomous vehicles create the kinds of novel hazards with which Leveson begins her book — complexity, non-linear relationships, emergent properties of the whole that are unexpected given the expected operations of the components. The fly-fix-fly approach would suggest the deployment of a certain number of experimental vehicles and then evaluate their interactions in real-world settings. A more disciplined approach using the methodologies of STAMP and STPA would make systematic efforts to identify and control the pathways through which accidents can occur.

Here is a simulated swarm of autonomous vehicles:

 

But accidents happen; neither software nor control systems are perfect. So what would be the result of one disabling fender-bender in the intersection, followed by a half dozen more; followed by a gigantic pileup of robo-cars?

Kathleen Tierney on disaster and resilience

The fact of large-scale technology failure has come up fairly often in Understanding Society (link, link, link). There are a couple of reasons for this. One is that our society is highly technology-dependent, relying on more and more densely interlinked and concentrated systems of production and delivery that are subject to unexpected but damaging forms of failure. So it is a pressingly important problem for us to have a better understanding of technology failure than we do today. The other reason that examples of technology failure are frequent here is that it seems pretty clear that failures of this kind are generally social and organizational failures (in part), not simply technological failures. So the study of technology failure is a good way of examining the weaknesses and strengths of various organizational forms — from the firm or plant to the vast regulatory agency. I have highlighted the work of Charles Perrow as being especially useful in this context, especially Normal Accidents: Living with High-Risk Technologies and The Next Catastrophe: Reducing Our Vulnerabilities to Natural, Industrial, and Terrorist Disasters.

Kathleen Tierney has studied disasters very extensively, and her recent The Social Roots of Risk: Producing Disasters, Promoting Resilience is an important contribution. Tierney is both an academic and a practitioner; she is an expert on earthquake science and preparedness and serves as director of the Natural Hazards Center at the University of Colorado. The topics of disaster and technology failure are linked; natural disasters (earthquakes, tsunami, hurricanes) are often the cause of ensuing technology failures of enormous magnitude. Here is Tierney’s over-riding framework of analysis:

The general answer is that disasters of all types occur as a consequence of common sets of social activities and processes that are well understood on the basis of both social science theory and empirical data. Put simply, the organizing idea for this books is that disasters and their impacts are socially produced, and that the forces driving the production of disaster are embedded in the social order itself. As the case studies and research findings discussed throughout the book will show, this is equally true whether the culprit in question is a hurricane, flood, earthquake, or a bursting speculative bubble. The origins of disaster lie not in nature, and not in technology, but rather in the ordinary everyday workings of society itself. (4-5)

This is one of Tierney’s key premises — that disasters are socially produced and socially constituted. Her other major theme is the notion of resilience — the idea that social characteristics exist that make one set of social arrangements more resilient  than another to harm in the face of natural catastrophe. Features of resilience involve —

preexisting, planned, and naturally emerging activities that make societies and communities better able to cope, adapt, and sustain themselves when disasters occur, and also to develop ways of recovering following such events. (5)

Tierney is often drawn to the alliteration of “risk and resilience”. “Risk” is the possibility of serious disturbance to the integrity of a system. “Risk” is a compound of likelihood of a type of disturbance and the damage created by that eventuality. Here is Tierney’s capsule definition:

Risk is commonly conceptualized as the answer to three questions: What can go wrong? How likely is it? And what are the consequences? (11)

“Resilience”, by contrast, is a feature of the system in response to such a disturbance. So the concepts of risk and resilience do not operate on the same level. A more apt opposition is fragility and resilience. (Tierney sometimes refers to brittle institutions.)  Some institutional arrangements are like glass — a sharp tap and they fall into a mound of shards. Others are more like a starfish — able to recover form and function following even very damaging encounters with the world. Both kinds of systems are subject to risk, and the probability of a given disturbance may be the same in the two instances. The difference between them is how well they recover from the realization of risk. But the damage that results from the same disturbance is much greater in a fragile system than a resilient system. And Tierney makes a crucial point for all of us in the twenty-first century: we need to be exerting ourselves to create social systems and communities that are substantially more resilient than they currently are.

A very important example of non-resilient trends in twenty-first century life is the spread of ultra-tall buildings in global cities. There are a variety of reasons why developers and urban leaders like ultra-tall structures — reasons that largely have to do with prestige. But Tierney points out in expert detail the degree to which these buildings are unreasonably fragile in face of disaster: they shed vast quantities of glass, they concentrate people and business in a way that invites terrorist attack, they exist in vulnerable systems of electricity and water that are crucial to their hour-to-hour functioning. A major earthquake in San Francisco has the potential to leave the buildings standing but the populations living within them stranded without light or elevators, and the emergency responders one hundred flights of stairs away from the emergencies they need to confront (63ff.).

The most fundamental and intractable source of hazard for our society that Tierney highlights is the likelihood of failure of government regulatory and safety organizations to carry out their stated missions of protecting the safety and health of the public. Like Perrow in The Next Catastrophe, she finds instance after instance of cases where the public’s interest would be best served by a regulation or prohibition of a certain kind of risky activity (residential and commercial development in flood or earthquake zones, for example) but where powerful economic interests (corporations, local developers) have the overwhelming ability to block sensible and prudent regulations in this space. “Economic power on this scale is easily translated into political power, with important consequences for risk buildup” (91). Tierney offers the case of the Japanese nuclear industry as an example of a concentrated and powerful set of organizations that were able to succeed in creating siting decisions and safety regulations that served their interests rather than the interests of the general public.

As nuclear power emerged as a major source of energy in Japan, communities were essentially bribed into accepting nuclear plants, with the promise of jobs for young workers and support for schools and community projects; also, extensive propaganda efforts were launched…. Then, once government and industry succeeded in getting communities to accept the presence of nuclear plants, the natural tendency was to locate multiple reactors at nuclear sites to achieve economies of scale and to avoid having to repeat costly charm offensives in large numbers of communities. (92)

In Tierney’s view, the problem of regulatory capture by the economically powerful is perhaps the largest obstacle to our ability to create a rational and prudent plan for managing risks in the future (94). (Here is an earlier post on the quiet use of economic power; link.)

The Social Roots of Risk is rich in detail and deeply insightful into the sociology of risk in a large democratic corporation-centered society. The hazards she identifies concerning the failure of our institutions to devise genuinely prudent policies around foreseeable risks (earthquake, hurricane, flood, terrorism, nuclear or chemical plant malfunction, train disaster, …) are deeply alarming. The public and our governments need to absorb these lessons and design for more resilient societies and communities, exactly as Tierney and Perrow argue.

%d bloggers like this: