Nuclear accidents

[Diagrams: the Chernobyl reactor before and after the accident]

Nuclear fission is one of the world-changing discoveries of the mid-twentieth century. The United States’ atomic bomb project led to the bombing of Hiroshima and Nagasaki in August 1945, and the hope of limitless electricity brought about the proliferation of a variety of nuclear reactors around the world in the decades following World War II. And, of course, nuclear weapons proliferated to other countries beyond the original circle of atomic powers.

Given the enormous energies associated with fission and the dangerous, toxic properties of the radioactive byproducts of fission, the possibility of a nuclear accident is a particularly frightening one for the modern public. The world has seen several massive nuclear accidents — Chernobyl and Fukushima in particular — and the devastating effects they have had on human populations and on the social and economic wellbeing of the regions in which they occurred.

Safety is therefore a paramount priority in the nuclear industry, both in research labs and in military and civilian applications. So what is the track record of safety in the nuclear sector? Jim Mahaffey’s Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima is a detailed and carefully researched attempt to answer this question. And the information he provides is not reassuring. Beyond the well-known disasters at nuclear power plants (Three Mile Island, Chernobyl, Fukushima), Mahaffey documents hundreds of accidents involving reactors, research laboratories, weapons plants, and deployed nuclear weapons that have received far less public attention. These accidents cost remarkably few lives, but their frequency is alarming. They are indeed “normal accidents” (Perrow, Normal Accidents: Living with High-Risk Technologies). For example:

  • a Japanese fishing boat is contaminated by fallout from the Castle Bravo hydrogen bomb test, and radioactive fish turn up in markets across Japan (March 1, 1954) (kl 1706)
  • an MK-6 atomic bomb is dropped on Mars Bluff, South Carolina, after a crew member accidentally pulls the emergency bomb release handle (February 5, 1958) (kl 5774)
  • the Fermi 1 liquid-sodium-cooled plutonium breeder reactor suffers a partial fuel meltdown during startup trials near Detroit (October 4, 1966) (kl 4127)

Mahaffey also provides detailed accounts of the most serious nuclear accidents and meltdowns of the past forty years: Three Mile Island, Chernobyl, and Fukushima.

The safety and control of nuclear weapons is of particular interest. Here is Mahaffey’s summary of “Broken Arrow” events — the loss of atomic and fusion weapons:

Did the Air Force ever lose an A-bomb, or did they just misplace a few of them for a short time? Did they ever drop anything that could be picked up by someone else and used against us? Is humanity going to perish because of poisonous plutonium spread that was snapped up by the wrong people after being somehow misplaced? Several examples will follow. You be the judge. 

Chuck Hansen [U.S. Nuclear Weapons – The Secret History] was wrong about one thing. He counted thirty-two “Broken Arrow” accidents. There are now sixty-five documented incidents in which nuclear weapons owned by the United States were lost, destroyed, or damaged between 1945 and 1989. These bombs and warheads, which contain hundreds of pounds of high explosive, have been abused in a wide range of unfortunate events. They have been accidentally dropped from high altitude, dropped from low altitude, crashed through the bomb bay doors while standing on the runway, tumbled off a fork lift, escaped from a chain hoist, and rolled off an aircraft carrier into the ocean. Bombs have been abandoned at the bottom of a test shaft, left buried in a crater, and lost in the mud off the coast of Georgia. Nuclear devices have been pounded with artillery of a foreign nature, struck by lightning, smashed to pieces, scorched, toasted, and burned beyond recognition. Incredibly, in all this mayhem, not a single nuclear weapon has gone off accidentally, anywhere in the world. If it had, the public would know about it. That type of accident would be almost impossible to conceal. (kl 5527)

There are a few common threads in the stories of accident and malfunction that Mahaffey provides. First, there are failures of training and knowledge on the part of front-line workers. The physics of nuclear fission is often counter-intuitive, and the idea of critical mass does not fully capture the danger of a quantity of fissionable material. The geometry in which the material is stored makes a critical difference to whether it goes critical. Fissionable material is often transported and manipulated in liquid solution, and the shape and configuration of the vessel in which the solution is held makes a difference to the probability of exponential growth of neutron emission — leading to runaway fission of the material. Mahaffey documents accidents in nuclear materials processing plants that resulted from plant workers applying what they knew from industrial plumbing to everyday shop-floor problems. All too often the result was a flash of blue light and the release of a great deal of heat and radioactive material.

Second, there is a fault at the opposite end of the knowledge spectrum — the tendency of expert engineers and scientists to believe that they can solve complicated reactor problems on the fly. This turned out to be a critical problem at Chernobyl (kl 6859).

The most difficult problem to handle is that the reactor operator, highly trained and educated with an active and disciplined mind, is liable to think beyond the rote procedures and carefully scheduled tasks. The operator is not a computer, and he or she cannot think like a machine. When the operator at NRX saw some untidy valve handles in the basement, he stepped outside the procedures and straightened them out, so that they were all facing the same way. (kl 2057)

There are also clear examples of inappropriate supervision in the accounts shared by Mahaffey. Here is an example from Chernobyl.

[Deputy chief engineer] Dyatlov was enraged. He paced up and down the control panel, berating the operators, cursing, spitting, threatening, and waving his arms. He demanded that the power be brought back up to 1,500 megawatts, where it was supposed to be for the test. The operators, Toptunov and Akimov, refused on grounds that it was against the rules to do so, even if they were not sure why. 

Dyatlov turned on Toptunov. “You lying idiot! If you don’t increase power, Tregub will!”  

Tregub, the Shift Foreman from the previous shift, was officially off the clock, but he had stayed around just to see the test. He tried to stay out of it. 

Toptunov, in fear of losing his job, started pulling rods. By the time he had wrestled it back to 200 megawatts, 205 of the 211 control rods were all the way out. In this unusual condition, there was danger of an emergency shutdown causing prompt supercriticality and a resulting steam explosion. At 1:22:30 a.m., a read-out from the operations computer advised that the reserve reactivity was too low for controlling the reactor, and it should be shut down immediately. Dyatlov was not worried. “Another two or three minutes, and it will be all over. Get moving, boys!” (kl 6887)

This was the turning point in the disaster.

A related fault is the intrusion of political and business interests into the design and conduct of high-risk nuclear actions. Leaders want a given outcome without understanding the technical details of the processes they are demanding; subordinates like Toptunov are eventually cajoled or coerced into taking the problematic actions. The persistence of advocates for liquid sodium breeder reactors represents a higher-level example of the same fault. Associated with this role of political and business interests is an impulse towards secrecy and concealment when accidents occur and deliberate understatement of the public dangers created by an accident — a fault amply demonstrated in the Fukushima disaster.

Atomic Accidents provides a fascinating history of events of which most of us are unaware. The book is not primarily intended to offer an account of the causes of these accidents, but rather of the ways in which they unfolded and the consequences they had for human welfare. (Generally speaking, his view is that nuclear accidents in North America and Western Europe have had remarkably few human casualties.) And many of the accidents he describes are exactly the sorts of failures that are common in all large-scale industrial and military processes.

(Large-scale technology failure has come up frequently here. See these posts for analysis of some of the organizational causes of technology failure (link, link, link).)

Accident analysis and systems thinking

Complex socio-technical systems fail; that is, accidents occur. And it is enormously important for engineers and policy makers to have a better way of thinking about accidents than the protocols currently followed after an air crash, a chemical plant fire, or the release of a contaminated drug. We need a better understanding of the systemic and organizational causes of accidents; even more importantly, we need a basis for improving the safe functioning of complex socio-technical systems by identifying better processes and better warning indicators of impending failure.

A long-term leader in the field of systems-safety thinking is Nancy Leveson, a professor of aeronautics and astronautics at MIT and the author of Safeware: System Safety and Computers (1995) and Engineering a Safer World: Systems Thinking Applied to Safety (2012). Leveson has been a particular advocate for two insights: looking at safety as a systems characteristic, and looking for the organizational and social components of safety and accidents as well as the technical event histories that are more often the focus of accident analysis. Her approach to safety and accidents involves looking at a technology system in terms of the set of controls and constraints that have been designed into the process to prevent accidents. “Accidents are seen as resulting from inadequate control or enforcement of constraints on safety-related behavior at each level of the system development and system operations control structures.” (25)

The abstract for her essay “A New Accident Model for Engineering Safety” (link) captures both points.

New technology is making fundamental changes in the etiology of accidents and is creating a need for changes in the explanatory mechanisms used. We need better and less subjective understanding of why accidents occur and how to prevent future ones. The most effective models will go beyond assigning blame and instead help engineers to learn as much as possible about all the factors involved, including those related to social and organizational structures. This paper presents a new accident model founded on basic systems theory concepts. The use of such a model provides a theoretical foundation for the introduction of unique new types of accident analysis, hazard analysis, accident prevention strategies including new approaches to designing for safety, risk assessment techniques, and approaches to designing performance monitoring and safety metrics.

The accident model she describes in this article and elsewhere is STAMP (Systems-Theoretic Accident Model and Processes). Here is a short description of the approach.

In STAMP, systems are viewed as interrelated components that are kept in a state of dynamic equilibrium by feedback loops of information and control. A system in this conceptualization is not a static design—it is a dynamic process that is continually adapting to achieve its ends and to react to changes in itself and its environment. The original design must not only enforce appropriate constraints on behavior to ensure safe operation, but the system must continue to operate safely as changes occur. The process leading up to an accident (loss event) can be described in terms of an adaptive feedback function that fails to maintain safety as performance changes over time to meet a complex set of goals and values…. 

The basic concepts in STAMP are constraints, control loops and process models, and levels of control. (12)

The other point of emphasis in Leveson’s treatment of safety is her consistent effort to include the social and organizational forms of control that are a part of the safe functioning of a complex technological system.

Event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management deficiencies, and flaws in the safety culture of the company or industry. An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events. (6)

She treats the organizational backdrop of the technology process in question as being a crucial component of the safe functioning of the process.

Social and organizational factors, such as structural deficiencies in the organization, flaws in the safety culture, and inadequate management decision making and control are directly represented in the model and treated as complex processes rather than simply modeling their reflection in an event chain. (26)

And she treats organizational features as another form of control system (along the lines of Jay Forrester’s early definitions of systems in Industrial Dynamics).

Modeling complex organizations or industries using system theory involves dividing them into hierarchical levels with control processes operating at the interfaces between levels (Rasmussen, 1997). Figure 4 shows a generic socio-technical control model. Each system, of course, must be modeled to reflect its specific features, but all will have a structure that is a variant on this one. (17)

[Figure 4: a generic socio-technical control model]

The approach embodied in the STAMP framework is that safety is a systems effect, dynamically influenced by the control systems embodied in the total process in question.

In systems theory, systems are viewed as hierarchical structures where each level imposes constraints on the activity of the level beneath it—that is, constraints or lack of constraints at a higher level allow or control lower-level behavior (Checkland, 1981). Control laws are constraints on the relationships between the values of system variables. Safety-related control laws or constraints therefore specify those relationships between system variables that constitute the nonhazardous system states, for example, the power must never be on when the access door is open. The control processes (including the physical design) that enforce these constraints will limit system behavior to safe changes and adaptations. (17)
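To make the idea of a safety-related control law concrete, here is a minimal sketch in Python (my own illustration, not code from Leveson): it encodes the constraint quoted above, that the power must never be on while the access door is open, together with a simple control action that enforces it.

```python
from dataclasses import dataclass

@dataclass
class ProcessState:
    """Hypothetical state of a controlled physical process."""
    power_on: bool
    door_open: bool

def violates_constraint(state: ProcessState) -> bool:
    """Safety constraint: the power must never be on while the access door is open."""
    return state.power_on and state.door_open

def control_action(state: ProcessState) -> ProcessState:
    """One controller in the feedback loop: observe the state, enforce the constraint.

    In STAMP terms, the constraint (not the reliability of any one component)
    defines the safe states, and the control structure is what keeps the
    system inside them.
    """
    if violates_constraint(state):
        # Enforce the constraint by cutting power (an interlock).
        return ProcessState(power_on=False, door_open=state.door_open)
    return state

# Example: the door is opened while power is on; the controller restores a safe state.
unsafe = ProcessState(power_on=True, door_open=True)
safe = control_action(unsafe)
assert not violates_constraint(safe)
```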

Leveson’s understanding of systems theory brings along with it a strong conception of “emergence”. She argues that higher levels of systems possess properties that cannot be reduced to the properties of the components, and that safety is one such property:

In systems theory, complex systems are modeled as a hierarchy of levels of organization, each more complex than the one below, where a level is characterized by having emergent or irreducible properties. Hierarchy theory deals with the fundamental differences between one level of complexity and another. Its ultimate aim is to explain the relationships between different levels: what generates the levels, what separates them, and what links them. Emergent properties associated with a set of components at one level in a hierarchy are related to constraints upon the degree of freedom of those components. (11)

But her understanding of “irreducible” seems to be different from that commonly used in the philosophy of science. She does in fact believe that these higher-level properties can be explained by the system of properties at the lower levels — for example, in this passage she asks “… what generates the levels” and how the emergent properties are “related to constraints” imposed on the lower levels. In other words, her position seems to be similar to that advanced by Dave Elder-Vass (link): emergent properties are properties at a higher level that are not possessed by the components, but which depend upon the interactions and composition of the lower-level components.

The domain of safety engineering and accident analysis seems like a particularly suitable place for Bayesian analysis. It seems unavoidable that accident analysis involves both frequency-based probabilities (e.g. the frequency of pump failure) and expert-based estimates of the likelihood of a particular kind of failure (e.g. the likelihood that a train operator will slacken attention to track warnings in response to company pressure on the timetable). Bayesian techniques are suitable for the task of combining these various kinds of estimates of risk into a unified calculation.
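As a rough illustration of that kind of combination, here is a minimal sketch (my own, with invented numbers) that treats an expert's judgment about a pump's failure-on-demand probability as a Beta prior and updates it with observed frequency data using the standard conjugate Beta-Binomial update.

```python
# A minimal sketch (invented numbers, not from the post): combine an expert's
# judgment about a pump's failure-on-demand probability, expressed as a Beta
# prior, with observed frequency data, using the conjugate Beta-Binomial update.

def beta_binomial_update(prior_alpha, prior_beta, failures, demands):
    """Beta prior + Binomial data -> Beta posterior (conjugate update)."""
    return prior_alpha + failures, prior_beta + (demands - failures)

# Hypothetical inputs: the expert's belief is roughly "1 failure in 100 demands"
# (Beta(1, 99)); plant records show 2 failures in 500 demands.
alpha, beta = beta_binomial_update(prior_alpha=1.0, prior_beta=99.0,
                                   failures=2, demands=500)
posterior_mean = alpha / (alpha + beta)
print(f"Posterior mean failure probability per demand: {posterior_mean:.4f}")
```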

The topic of safety and accidents is particularly relevant to Understanding Society because it expresses very clearly the causal complexity of the social world in which we live. And rather than simply ignoring that complexity, the systematic study of accidents gives us an avenue for arriving at better ways of representing, modeling, and intervening in parts of that complex world.


System safety engineering

Why do complex technologies so often fail, and fail in such unexpected ways? Why is it so difficult for hospitals, chemical plants, and railroads to design their processes in such a way as to dramatically reduce the accident rate? How should we attempt to provide systematic analysis of the risks that a given technology presents and the causes of accidents that sometimes ensue? Earlier posts have looked at the ways that sociologists have examined this problem (link, link, link); but how do gifted engineers address the issue?

Nancy Leveson’s recent book, Engineering a Safer World: Systems Thinking Applied to Safety (2012), is an outstanding introduction to system safety engineering. This book brings forward the pioneering work that she did in Safeware: System Safety and Computers (1995) with new examples and new contributions to the field of safety engineering.

Leveson’s basic insight, here and in her earlier work, is that technical failure is rarely the result of the failure of a single component. Instead, failures result from multiple incidents involving the components, and unintended interactions among the components. So safety is a feature of the system as a whole, not of the individual sub-systems and components. Here is how she puts the point in Engineering a Safer World:

Safety is a system property, not a component property, and must be controlled at the system level, not the component level. (kl 263)

Traditional risk and failure analysis focuses on specific pathways that lead to accidents, identifying potential points of failure and the singular “causes” of the accident (most commonly including operator error). Leveson believes that this approach is no longer helpful. Instead she argues for what she calls a “new accident model” — a better and more comprehensive way of analyzing the possibilities of accident scenarios and the causes of actual accidents. This new conception has several important parts (kl 877-903):

  • expand accident analysis by forcing consideration of factors other than component failures and human errors
  • provide a more scientific way to model accidents that produces a better and less subjective understanding of why the accident occurred
  • include system design errors and dysfunctional system interactions
  • allow for and encourage new types of hazard analyses and risk assessments 
  • shift the emphasis in the role of humans in accidents from errors … to focus on the mechanisms and factors that shape human behavior
  • encourage a shift in the emphasis in accident analysis from “cause” … to understanding accidents in terms of reasons, that is, why the events and errors occurred
  • allow for and encourage multiple viewpoints and multiple interpretations when appropriate
  • assist in defining operational metrics and analyzing performance data

Leveson is particularly dissatisfied with the formal apparatus in use in engineering and elsewhere when it comes to the analysis of safety and accident causation, and she argues that there are a number of misleading conflations in the field that need to be addressed. One of these is the conflation between reliability and safety. Reliability is an assessment of the performance of a component relative to its design. But Leveson points out that systems like automobiles, chemical plants, and weapons systems can consist entirely of components that are highly reliable and yet give rise to highly destructive and unanticipated accidents.
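A contrived toy example may help make the distinction vivid (this is my own illustration, not one of Leveson's cases): both components below do exactly what their specifications require, yet the system can still reach a hazardous state.

```python
# A contrived toy example (not one of Leveson's): each component does exactly
# what its specification says, i.e. it is perfectly "reliable", yet the system
# can still reach a hazardous state because of how the components interact.

def relief_controller(pressure: float, maintenance_mode: bool) -> bool:
    """Open the relief valve on overpressure, unless maintenance mode is set
    (the designers added the lockout so the valve cannot cycle while being serviced)."""
    return pressure > 100.0 and not maintenance_mode

def maintenance_scheduler(hours_since_service: float) -> bool:
    """Request maintenance mode after a fixed service interval, exactly as designed."""
    return hours_since_service >= 1000.0

# Both components behave per spec. But if an overpressure event happens to
# coincide with a scheduled service window, the relief valve stays shut:
# a system-level hazard with no component failure anywhere.
pressure = 120.0
maintenance = maintenance_scheduler(hours_since_service=1000.0)
valve_open = relief_controller(pressure, maintenance)
print(f"Overpressure: {pressure > 100.0}, relief valve open: {valve_open}")  # True, False
```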

So thinking about accidents in terms of component failure is a serious misreading of the nature of the technologies with which we interact every day. Instead she argues that safety engineering must be systems engineering:

The solution, I believe, lies in creating approaches to safety based on modern systems thinking and systems theory. (kl 88)

One important part of a better understanding of accidents and safety is a recognition of the fact of complexity in contemporary technology systems — interactive complexity, dynamic complexity, decompositional complexity, and nonlinear complexity (kl 139). Each of these forms of complexity makes it more difficult to anticipate possible accidents, and more difficult to assign discrete accident pathways to the occurrence of an accident.

Accidents are complex processes involving the entire sociotechnical system. Traditional event-chain models cannot describe this process adequately. (kl 496)

Leveson is highly critical of iterative safety engineering — what she calls the “fly-fix-fly” approach. Given the severity of outcomes that are possible when it comes to control systems for nuclear weapons, the operations of nuclear reactors, or the air traffic control system, we need to be able to do better than simply improving safety processes following an accident (kl 148).

The model that she favors is called STAMP (Systems-Theoretic Accident Model and Processes; kl 1059). This model replaces the linear component-by-component analysis of technical devices with a system-level representation of their functioning. The STAMP approach begins with an effort to identify crucial safety constraints for a given system. (For example, in the Union Carbide plant at Bhopal, “never allow MIC to come in contact with water”; in the design of the Mars Polar Lander, “don’t allow the spacecraft to impact the planet surface with more than a maximum force” (kl 1074); in the design of public water systems, “water quality must not be compromised” (kl 1205).) Once the constraints are specified, the issue of control arises: what are the internal and external processes that ensure that the constraints are continuously satisfied? This devolves into a set of questions about system design and system administration: the instrumentation that is developed to measure compliance with the constraint and the management systems that are in place to ensure continuous compliance.
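Here is a rough sketch of the kind of bookkeeping this implies: each safety constraint is paired with the controls meant to enforce it and the feedback used to verify continuous compliance. The constraints paraphrase the examples above; the specific controls and feedback channels listed are hypothetical illustrations, not claims about the actual plants or spacecraft.

```python
# A rough sketch of the bookkeeping the STAMP approach calls for: each safety
# constraint is paired with the controls meant to enforce it and the feedback
# used to verify continuous compliance. The constraints paraphrase the examples
# above; the "controls" and "feedback" entries are hypothetical illustrations.

safety_constraints = [
    {
        "system": "chemical plant",
        "constraint": "MIC must never come in contact with water",
        "controls": ["physical isolation of water lines", "maintenance procedures"],
        "feedback": ["tank temperature and pressure instrumentation"],
    },
    {
        "system": "planetary lander",
        "constraint": "spacecraft must not impact the surface above a maximum force",
        "controls": ["descent engine control logic", "touchdown sensing"],
        "feedback": ["altitude and velocity telemetry"],
    },
]

def unenforced(constraints):
    """Flag constraints that have no identified control or no feedback channel."""
    return [c["constraint"] for c in constraints
            if not c["controls"] or not c["feedback"]]

print(unenforced(safety_constraints))  # ideally an empty list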

Also of interest in the book is Leveson’s description of a new systems-level way of analyzing the hazards associated with a device or technology, STPA (System-Theoretic Process Analysis) (kl 2732). She describes STPA as the hazards analysis associated with the risks identified by STAMP:

STPA has two main steps:

  1. Identify the potential for inadequate control of the system that could lead to a hazardous state.
  2. Determine how each potentially hazardous control action identified in step 1 could occur. (kl 2758)
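Here is a hedged sketch of what step 1 might look like in practice, using the commonly cited categories of unsafe control action (not provided when needed, provided when unsafe, provided too early or too late, stopped too soon or applied too long); the control actions, the hazard judgment, and the single flagged entry are all invented for illustration.

```python
from itertools import product

# A hypothetical sketch of STPA step 1: for each control action, consider the
# standard ways it can become unsafe and record which combinations lead to a
# hazardous system state. The control actions and the hazard judgment below
# are invented for illustration.

control_actions = ["open relief valve", "close relief valve"]
unsafe_modes = [
    "not provided when needed",
    "provided when unsafe",
    "provided too early or too late",
    "stopped too soon or applied too long",
]

def leads_to_hazard(action, mode):
    """Stand-in for the analyst's judgment (or a model-based check)."""
    return action == "open relief valve" and mode == "not provided when needed"

unsafe_control_actions = [
    (action, mode)
    for action, mode in product(control_actions, unsafe_modes)
    if leads_to_hazard(action, mode)
]

# Step 2 would then ask, for each flagged entry, how that unsafe control
# action could actually come about (component faults, missing feedback,
# flawed process models, organizational pressures, and so on).
print(unsafe_control_actions)
```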

[Figure: an example of the process through which an STPA risk analysis proceeds for a NASA project (kl 2995)]

It would be very interesting to see how an engineer would employ the STAMP and STPA methodologies to evaluate the risks and hazards associated with swarms of autonomous vehicles. Each vehicle is a system that can be analyzed using the STAMP methodology. But likewise the workings of an expressway with hundreds of autonomous vehicles (perhaps interspersed with less predictable human drivers) is also a system with complex characteristics.

Each individual vehicle has a hierarchical system of control designed to ensure safe transportation of its passengers and of the vehicle itself; what are the failure modes for this control system? And what about the swarm — given that each vehicle is responsive to the other vehicles around it, how will individual cars respond to unusual circumstances (a jack-knifed truck blocking all three lanes, let’s say)? It would appear that autonomous vehicles create the kinds of novel hazards with which Leveson begins her book — complexity, non-linear relationships, emergent properties of the whole that are unexpected given the expected operations of the components. The fly-fix-fly approach would suggest deploying a certain number of experimental vehicles and then evaluating their interactions in real-world settings. A more disciplined approach using the methodologies of STAMP and STPA would make systematic efforts to identify and control the pathways through which accidents can occur.
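To make the point that the interesting hazards are system-level, here is a toy sketch of such a swarm (entirely hypothetical, and not a real model of autonomous driving): a line of simulated vehicles follows a simple constant-time-gap rule, the lead vehicle stops suddenly, and the question is whether the swarm as a whole ever violates a minimum-separation constraint even though every individual controller behaves as designed.

```python
# A toy sketch, not a real autonomous-driving model: five simulated vehicles
# follow a constant-time-gap rule, the lead vehicle stops suddenly (the
# jack-knifed truck), and we check whether the swarm ever violates a
# minimum-separation constraint even though every controller works as designed.

MIN_GAP = 2.0     # metres: the system-level safety constraint
TIME_GAP = 1.5    # seconds: each controller's desired time headway
MAX_BRAKE = 6.0   # m/s^2: braking limit for each vehicle
DT = 0.1          # simulation step in seconds

positions = [100.0, 80.0, 60.0, 40.0, 20.0]   # lead vehicle first
speeds = [20.0] * 5                            # m/s

min_gap_seen = float("inf")
for _ in range(300):                           # 30 seconds of simulated time
    speeds[0] = 0.0                            # the lead vehicle is stopped
    new_speeds = speeds[:]
    for i in range(1, len(positions)):
        gap = positions[i - 1] - positions[i]
        target = max(0.0, (gap - MIN_GAP) / TIME_GAP)            # desired speed
        accel = max(-MAX_BRAKE, min(2.0, (target - speeds[i]) / DT))
        new_speeds[i] = max(0.0, speeds[i] + accel * DT)
    speeds = new_speeds
    positions = [p + v * DT for p, v in zip(positions, speeds)]
    min_gap_seen = min(min_gap_seen,
                       min(positions[i - 1] - positions[i]
                           for i in range(1, len(positions))))

# With these particular numbers the followers cannot brake in time, so the
# constraint is violated: the hazard is a property of the swarm, not of any
# single vehicle's controller.
print(f"Smallest inter-vehicle gap over the run: {min_gap_seen:.1f} m")
print("Constraint violated" if min_gap_seen < MIN_GAP else "Constraint maintained")
```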

[Image: a simulated swarm of autonomous vehicles]


But accidents happen; neither software nor control systems are perfect. So what would be the result of one disabling fender-bender, followed by a half dozen more, followed by a gigantic pileup of robo-cars?

Kathleen Tierney on disaster and resilience

The fact of large-scale technology failure has come up fairly often in Understanding Society (link, link, link). There are a couple of reasons for this. One is that our society is highly technology-dependent, relying on more and more densely interlinked and concentrated systems of production and delivery that are subject to unexpected but damaging forms of failure. So it is a pressingly important problem for us to have a better understanding of technology failure than we do today. The other reason that examples of technology failure are frequent here is that it seems pretty clear that failures of this kind are generally social and organizational failures (in part), not simply technological failures. So the study of technology failure is a good way of examining the weaknesses and strengths of various organizational forms — from the firm or plant to the vast regulatory agency. I have highlighted the work of Charles Perrow as being especially useful in this context, especially Normal Accidents: Living with High-Risk Technologies and The Next Catastrophe: Reducing Our Vulnerabilities to Natural, Industrial, and Terrorist Disasters.

Kathleen Tierney has studied disasters very extensively, and her recent The Social Roots of Risk: Producing Disasters, Promoting Resilience is an important contribution. Tierney is both an academic and a practitioner; she is an expert on earthquake science and preparedness and serves as director of the Natural Hazards Center at the University of Colorado. The topics of disaster and technology failure are linked; natural disasters (earthquakes, tsunamis, hurricanes) are often the cause of ensuing technology failures of enormous magnitude. Here is Tierney’s overriding framework of analysis:

The general answer is that disasters of all types occur as a consequence of common sets of social activities and processes that are well understood on the basis of both social science theory and empirical data. Put simply, the organizing idea for this book is that disasters and their impacts are socially produced, and that the forces driving the production of disaster are embedded in the social order itself. As the case studies and research findings discussed throughout the book will show, this is equally true whether the culprit in question is a hurricane, flood, earthquake, or a bursting speculative bubble. The origins of disaster lie not in nature, and not in technology, but rather in the ordinary everyday workings of society itself. (4-5)

This is one of Tierney’s key premises — that disasters are socially produced and socially constituted. Her other major theme is the notion of resilience — the idea that certain social characteristics make one set of social arrangements more resilient to harm than another in the face of natural catastrophe. Features of resilience involve —

preexisting, planned, and naturally emerging activities that make societies and communities better able to cope, adapt, and sustain themselves when disasters occur, and also to develop ways of recovering following such events. (5)

Tierney is often drawn to the alliteration of “risk and resilience”. “Risk” is the possibility of serious disturbance to the integrity of a system; it is a compound of the likelihood of a type of disturbance and the damage that such an event would create. Here is Tierney’s capsule definition:

Risk is commonly conceptualized as the answer to three questions: What can go wrong? How likely is it? And what are the consequences? (11)
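Those three questions are often operationalized as a set of scenario, likelihood, and consequence entries. A minimal sketch, with invented numbers purely for illustration:

```python
# A minimal sketch (invented numbers): risk expressed as a set of
# (scenario, likelihood, consequence) entries, matching the three questions
# quoted above, with two simple summaries.

scenarios = [
    {"what": "levee overtopped in storm surge", "p_per_year": 0.01, "loss": 5_000},
    {"what": "levee breach from structural failure", "p_per_year": 0.002, "loss": 50_000},
]

expected_annual_loss = sum(s["p_per_year"] * s["loss"] for s in scenarios)
worst_case = max(scenarios, key=lambda s: s["loss"])

print(f"Expected annual loss (arbitrary units): {expected_annual_loss:.0f}")
print(f"Largest-consequence scenario: {worst_case['what']}")
```

Tierney's point, of course, is that what gets onto the scenario list in the first place, and whose losses are counted, is itself socially produced.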

“Resilience”, by contrast, is a feature of the system in response to such a disturbance. So the concepts of risk and resilience do not operate on the same level. A more apt opposition is fragility and resilience. (Tierney sometimes refers to brittle institutions.) Some institutional arrangements are like glass — a sharp tap and they fall into a mound of shards. Others are more like a starfish — able to recover form and function following even very damaging encounters with the world. Both kinds of systems are subject to risk, and the probability of a given disturbance may be the same in the two instances. The difference between them is how well they recover from the realization of risk: the damage that results from the same disturbance is much greater in a fragile system than in a resilient one. And Tierney makes a crucial point for all of us in the twenty-first century: we need to exert ourselves to create social systems and communities that are substantially more resilient than they currently are.

A very important example of non-resilient trends in twenty-first century life is the spread of ultra-tall buildings in global cities. There are a variety of reasons why developers and urban leaders like ultra-tall structures — reasons that largely have to do with prestige. But Tierney points out in expert detail the degree to which these buildings are unreasonably fragile in the face of disaster: they shed vast quantities of glass, they concentrate people and businesses in a way that invites terrorist attack, and they depend on vulnerable systems of electricity and water that are crucial to their hour-to-hour functioning. A major earthquake in San Francisco has the potential to leave the buildings standing but the populations living within them stranded without light or elevators, and the emergency responders one hundred flights of stairs away from the emergencies they need to confront (63ff.).

The most fundamental and intractable source of hazard for our society that Tierney highlights is the likelihood of failure of government regulatory and safety organizations to carry out their stated missions of protecting the safety and health of the public. Like Perrow in The Next Catastrophe, she finds instance after instance of cases where the public’s interest would be best served by a regulation or prohibition of a certain kind of risky activity (residential and commercial development in flood or earthquake zones, for example) but where powerful economic interests (corporations, local developers) have the overwhelming ability to block sensible and prudent regulations in this space. “Economic power on this scale is easily translated into political power, with important consequences for risk buildup” (91). Tierney offers the case of the Japanese nuclear industry as an example of a concentrated and powerful set of organizations that were able to succeed in creating siting decisions and safety regulations that served their interests rather than the interests of the general public.

As nuclear power emerged as a major source of energy in Japan, communities were essentially bribed into accepting nuclear plants, with the promise of jobs for young workers and support for schools and community projects; also, extensive propaganda efforts were launched…. Then, once government and industry succeeded in getting communities to accept the presence of nuclear plants, the natural tendency was to locate multiple reactors at nuclear sites to achieve economies of scale and to avoid having to repeat costly charm offensives in large numbers of communities. (92)

In Tierney’s view, the problem of regulatory capture by the economically powerful is perhaps the largest obstacle to our ability to create a rational and prudent plan for managing risks in the future (94). (Here is an earlier post on the quiet use of economic power; link.)

The Social Roots of Risk is rich in detail and deeply insightful into the sociology of risk in a large democratic corporation-centered society. The hazards she identifies concerning the failure of our institutions to devise genuinely prudent policies around foreseeable risks (earthquake, hurricane, flood, terrorism, nuclear or chemical plant malfunction, train disaster, …) are deeply alarming. The public and our governments need to absorb these lessons and design for more resilient societies and communities, exactly as Tierney and Perrow argue.

Regulatory thrombosis

Charles Perrow is a leading researcher on the sociology of organizations, and he is a singular expert on accidents and system failures. Several of his books are classics in their field — Normal Accidents: Living with High-Risk Technologies, The Next Catastrophe: Reducing Our Vulnerabilities to Natural, Industrial, and Terrorist Disasters, and Organizing America: Wealth, Power, and the Origins of Corporate Capitalism. So it is very striking to find that Perrow is highly skeptical about the ability of governmental organizations in the United States to protect the public from large failures and disasters of various kinds — hurricanes, floods, chemical plant fires, software failures, terrorism. His assessment of organizations such as the Federal Emergency Management Agency, the Department of Homeland Security, or the Nuclear Regulatory Commission is dismal. Here is his summary assessment of the Department of Homeland Security:

We should not expect too much of organizations, but the DHS is extreme in its dysfunctions. As with all organizations, the DHS has been used by its masters and outsiders for purposes that are beyond its mandate, and the usage of the DHS has been extreme. One major user of the DHS is Congress. While Congress is the arm of the government that is closest to the people, it is also the one that is most influenced by corporations and local interest groups that do not have the interests of the larger community in mind. (The Next Catastrophe, kl 205)

The most alarming chapters of The Next Catastrophe concern the failures of US agencies to effectively and intelligently organize preparations that will genuinely make us safer. Perrow provides extended analyses of the Department of Homeland Security, FEMA, and the Nuclear Regulatory Commission in their respective functions — securing the country against the consequences of terrorist attack, preparing for and responding to major environmental disasters like Katrina, and securing nuclear power plants and spent fuel storage dumps against accident and attack. In chapter after chapter he documents the most egregious and frightening failures of each of these agencies. 

The level of organizational ineptitude that he documents in the performance of these agencies is staggering — one has the impression that a particularly gifted group of high school seniors could have done a better job of responding to the Katrina disaster. (“You’re doing a great job, Brownie!”) And the disarray that he documents in these organizations is genuinely frightening. He walks through plausible scenarios in which a group of a dozen determined attackers could disable the cooling systems for spent fuel rods at existing nuclear power plants, with catastrophic release of radiation affecting millions of people within 50 miles of the accident. (These scenarios are all the more believable now that we’ve seen what happened to the cooling ponds at Fukushima.)

What this all suggests is that the U.S. government and our political culture do a particularly bad job of creating organizational intelligence in response to crucial national challenges. By this I mean an effective group of bureaus with a clear mission, committed executive leadership, consistent communication and collaboration among agencies, and a demonstrated ability to formulate and carry out rational plans in addressing identified risks. (Perrow’s general assessment of the French nuclear power system seems to be that it is more effective in maintaining safe operations and protecting nuclear materials against attack.) And the US government’s ability to provide this kind of intelligent risk abatement seems particularly weak.

Perrow doesn’t endorse the general view that organizations can never succeed in accomplishing the functions we assign to them — hospitals, police departments, even labor unions. Instead, there seem to be particular reasons why large regulatory agencies in the United States have proven particularly inept. In his account, the most faulty organizations are those designed to regulate risky activities and those charged with creating prudent long-term plans for the future. So what are the reasons for failure in these kinds of organizations?

One major part of his assessment focuses on the role that economic and political power plays in deforming the operations of major organizations to serve the interests of the powerful. Regulatory agencies are “captured” by the powerful industries they are supposed to oversee, whether through influence on the executive branch or through merciless lobbying of the legislative branch. Energy companies pressure Congress and the NRC to privatize security at nuclear power plants — with what would otherwise be comical results when it comes to testing the resulting level of security at numerous plants. Private security forces are given advance notice of the time and nature of the simulated attack — and even so, half the attacks are successful.

Another major source of dysfunction that Perrow identifies in the case of the Department of Homeland Security is the workings of Congressional politics. Committee chairs resist losing scope for their committees, so the oversight process remains disjointed and disruptive to the functioning of the agencies. Senators from low-population states block the distribution of DHS funds intended to enhance the ability of first responders to be effective in the first hours of an incident, in order to secure higher levels of funding for their own low-risk populations. So California receives only roughly 13% of the per-capita level of funding for anti-terrorism functions that Vermont or Wyoming receives. And of course the funds available through Homeland Security become a major prize for lobbyists, corporations, and other interested parties — with resulting congressional pressure on DHS strategies and priorities.

Another culprit in this story of failure is the conservative penchant for leaving everything to private enterprise. As Michael Brown put the point during his tenure as director of FEMA, “The general idea—that the business of government is not to provide services, but to make sure that they are provided—seems self-evident to me” (kl 1867). The sustained ideological war against government regulation that has been underway since the Reagan administration has had disastrous consequences when it comes to safety. Activities like nuclear power generation, chemical production, air travel, drug manufacturing, and residential development in hurricane or forest fire zones are all too risky to be left to private initiative and self-regulation. We need strong, well-resourced, well-staffed, and independent regulatory systems for these activities, and increasingly our scorecard on each of these dimensions is in the failing range.

Overall it appears that Perrow believes that agencies like DHS and FEMA would function better if they were under the clear authority of the executive branch rather than under Congressional oversight and direction. Presidential authority would not guarantee success — witness George W. Bush’s hapless management of the first iteration of Homeland Security within the White House — but the odds are better. Under a President with a clearly stated and implemented priority for effective management of the risk of terrorism, the planning and coordination needed would have a greater likelihood of success.

It often sounds as though Perrow is faulting these organizations for defects that are inherent in all large organizations. But it seems fairer to say that his analysis does not identify a general feature of organizations that leads to failure in these cases, but rather a situational fact having to do with the power of business to resist regulation and the susceptibility of Congress and the President to political pressures that hamstring effective regulatory organizations. Perrow does refer to specific organizational hazards — bad executive leadership, faltering morale, inability to collaborate across agencies, excessively hierarchical architecture — but the heart of his argument lies elsewhere. The key problems spiral back to the inordinate power that corporations have in the United States, and the distortions they create in Congress and the executive branch. The risks that any sober and independent assessment would identify as highest priority are ignored in pursuit of more immediate political or personal gain. It is the specifics of the US political system, rather than general defects of large organizations per se, that lead to the bad outcomes that Perrow identifies. There are strong democracies that do a much better job of regulating risky industries and planning for disasters than we do — for example, France and Germany. (Here is a discussion of nuclear safety systems in France published in Nature, and a discussion of nuclear safety in Germany published by the German Federal Ministry for the Environment, Nature Conservation and Nuclear Safety.)

It is significant that even though Perrow endorses the need for strengthened regulatory agencies, he doesn’t think this would be enough to prevent major catastrophes in the future. So he advocates strongly for reducing the concentration of hazards and populations. As a society, he argues, we need to come to grips with the fact that there are some kinds of activities we should simply not engage in anymore — intensive residential building in hurricane and forest fire zones, placement of chemical and nuclear plants near cities, routing rail tankers of chlorine through cities like Baltimore and Chicago. (For that matter, a reasonable conclusion one can draw from his account of near-disasters at Indian Point in New York and Davis-Besse near Toledo is that nuclear power is simply too high a risk to continue to tolerate.) Here is a clear statement of the gravity of the culture change this would require:

But what if FEMA were given a mandate to deal with settlement density, escape routes, building codes, and concentrations of hazardous materials in vulnerable sites? We would need a change in our mindset to make basic vulnerabilities such as the size of cities in risky areas and the amounts of hazardous materials in urban areas as high a priority as rescue and relief. (kl 1141)

So who will provide the political will that is needed to reverse course on nuclear and chemical regulation? The public seems to believe (falsely, it would appear) that the NRC is a rigorous and independent agency and that nuclear plants are unlikely to melt down. There isn’t much public concern about these risks, and legislators are therefore free to ignore them as well. (Here is an earlier post on “quiet politics” that is relevant.) So where will the political demand for strong regulation come from? Will we need to wait for the bad news we’ve managed by good fortune to have avoided up to this point?

Organizational failure as a meso cause

A recurring issue in the past few months here has been the validity of meso-level causal explanations of social phenomena. It is self-evident that we attribute causal powers to meso entities in ordinary interactions with the social world. We assent to statements like these; they make sense to us.

  • Reorganization of the city’s service departments led to less corruption in Chicago.
  • Poor oversight and a culture of hyper-masculinity encourage sexual harassment in the workplace.
  • Divided command and control of military forces leads to ineffective response to attack.
  • Mining culture is conducive to social resistance.

We can gain some clarity on the role played by appeals to meso-level factors by considering a concrete example in detail. Military failure is a particularly interesting example to consider. Warfare proceeds through complex interlocking social organizations; it depends on information collection and interpretation; it requires the coordination of sometimes independent decision-makers; it involves antagonists striving deliberately to interfere with each other’s strategies; and it often leads to striking and almost incomprehensible failures. Eliot Cohen and John Gooch’s Military Misfortunes: The Anatomy of Failure in War is a highly interesting study of military failure that makes substantial use of organizational sociology and the sociology of failure more broadly, so it provides a valuable example to consider.

Here are a few framing ideas that guide Cohen and Gooch in their analysis and selection of cases.

True military “misfortunes” — as we define them — can never be justly laid at the door of any one commander. They are failures of the organization, not of the individual. The other thing the failures we shall examine have in common is their apparently puzzling nature. Although something has clearly gone wrong, it is hard to see what; rather, it seems that fortune — evenly balanced between both sides at the outset — has turned against one side and favored the other. These are the occasions when it seems that the outcome of the battle depended at least as much on one side’s mishandling of the situation as on the other’s skill in exploiting a position of superiority … The causes of organizational failure in the military world are not easy to discern. (3)

From the start, then, Cohen and Gooch are setting their focus on a meso-level factor — features of organizations and their interrelations within a complex temporally extended period of activity.  They note that historians often start with the commander — the familiar explanation of failures based on “operator error” — as the culprit.  But as they argue in the cases they consider, this effort is as misguided in the case of military disaster as it is in other kinds of failure.  Much more fundamental are the organizational failures and system interactions that led to the misfortune. Take Field Marshal Douglas Haig, whose obstinate commitment to aggressive offense in the situation of trench warfare has been bitterly criticized as block-headed:

Not only was the high command confronted by a novel environment; it was also imprisoned in a system that made it well-nigh impossible to meet the challenges of trench warfare. The submissive obedience of Haig’s subordinates, which Forester took for blinkered ignorance and whole-hearted support, was in reality the unavoidable consequence of the way in which the army high command functioned as an organization under its commander in chief. (13)

So why are organizations so central to the explanation of military failure?

Wherever people come together to carry out purposeful activity, organizations spring into being. The more complex and demanding the task, the more ordered and integrated the organization. … Men form organizations, but they also work with systems. Whenever technological components are linked together in order to carry out a particular scientific or technological activity, the possibility exists that the normal sequence of events the system has been designed to carry out may go awry when failures in two or more components interact in an unexpected way. (21, 22)

And here is the crucial point: organizations and complexes of organizations (systems) have characteristics that sometimes produce features of coordinated action that are both unexpected and undesired.  A certain way of training officers may have been created in order to facilitate unity in combat; but it may also create a mentality of subordinacy that makes it difficult for officers to take appropriate independent action.  A certain system for managing the flow of materiel to the fighting forces may work fine in the leisurely circumstances of peace but quickly overload under the exigencies of war.  Weapon systems designed for central Europe may prove unreliable in North Africa.

Cohen and Gooch place organizational learning and information processing at the core of their theories of military failure. They identify three kinds of failure: “failure to learn, failure to anticipate, and failure to adapt” (26). As a failure to learn, they cite the US Army’s failure to learn from the French experience in Vietnam before designing its own strategies in the Vietnam War. And they emphasize the unexpected interactions that can occur between different components of a complex organization like the military. They recommend a “layered” analysis: “We look for the interactions between these organizations, as well as assess how well they performed their proper tasks and missions” (52).

The cases they consider correspond to this classification of failure.  They examine failure to learn in the case of American antisubmarine warfare in 1942; failure to anticipate in the case of the Israel Defense Forces’ failure on the Golan Heights, 1973; and failure to adapt in the British disaster at Gallipoli, 1915.  Their example of aggregate failure, involving all three failures, is the defeat of the American Eighth Army in Korea, 1950.  And they reserve the grand prize, catastrophic failure, for the collapse of the French army and air force in 1940.

Each of these cases illustrates the authors’ central thesis: that organizational failures are at the heart of many or most large military failures. The example of the failure of the American antisubmarine campaign in 1942 off the east coast of the United States is particularly clear. German submarines preyed at will on American shipping, placing a large question mark over the ability of the Americans to continue to resupply Allied forces in Europe. The failure of American antisubmarine warfare was perplexing because the British navy had already developed proven and effective methods of ASW, and the American navy was aware of those methods. Unhappily, Cohen and Gooch argue, the US Navy did not succeed in learning from this detailed wartime experience.

The factors that Cohen and Gooch cite include: insufficient patrol boats early in the campaign; insufficient training for pilots and patrol vessel crews; crucial failures in operational intelligence (“collection, organization, interpretation and dissemination of many different kinds of data”; 75); and, most crucially, organizational failures.

A prompt and accurate intelligence assessment would mean nothing if the analysts could not communicate that assessment directly to commanders on the scene, if those commanders did not have operational control over the various air and naval assets they required to protect shipping and sink U-boats, if they saw no reason to heed that intelligence, or if they had no firm notion of what to do about it. The working out of correct standard tactics … could have no impact if destroyer skippers did not know or would not apply them. Moreover, as the U-boats changed their tactics and equipment …, the antisubmarine forces needed to adopt compensating tactical changes and technological innovation. (76)

This contrasts with the British case:

The British system worked because it had developed an organizational structure that enabled the Royal Navy and RAF to make use of all of the intelligence at their disposal, to analyze it swiftly and accurately, and to disseminate it immediately to those who needed to have it. (77)

So why did the US Navy not adopt the British system of organization?  Here too the authors find organizational causes:

If the United States Navy had thought seriously about adapting its organization to the challenges of ASW in a fashion similar to that chosen by the British, it would have required major changes in how existing organizations operated, and in no case would this have been more true than that of intelligence. (89)

So in this case, Cohen and Gooch find that the failure of the US Navy to conduct effective antisubmarine warfare off the coast of the United States resides centrally in features of the organization of the Navy itself, and in collateral organizations like the intelligence service.

This is a good example of an effort to explain concrete and complex historical outcomes using the theoretical resources of organizational sociology. And the language of causation is woven throughout the narrative. The authors make a credible case for the view that the organizational features they identify caused, or contributed to the causes of, the large instances of military failure they describe.
