System safety

An ongoing thread of posts here is concerned with organizational causes of large technology failures. The driving idea is that failures, accidents, and disasters usually have a dimension of organizational causation behind them. The corporation, research office, shop floor, supervisory system, intra-organizational information flow, and other social elements often play a key role in the occurrence of a gas plant fire, a nuclear power plant malfunction, or a military disaster. There is a tendency to look first and foremost for one or more individuals who made a mistake in order to explain the occurrence of an accident or technology failure; but researchers such as Perrow, Vaughan, Tierney, and Hopkins have demonstrated in detail the importance of broadening the lens to seek out the social and organizational background of an accident.

It seems important to distinguish between system flaws and organizational dysfunction in considering all of the kinds of accidents mentioned here. We might specify system safety along these lines. Any complex process has the potential for malfunction. Good system design means creating a flow of events and processes that makes accidents inherently less likely. Part of the task of the designer and engineer is to identify the chief sources of harm inherent in the process — release of energy, contamination of food or drugs, unplanned fission in a nuclear plant — and to design fail-safe processes so that these events are as unlikely as possible. Further, given the complexity of contemporary technology systems, it is critical to attempt to anticipate unintended interactions among subsystems — each of which may be functioning correctly on its own, yet which combine in unusual but possible scenarios to produce disaster.

In a nuclear processing plant, for example, there is the hazard of radioactive materials being brought into proximity with each other in a way that creates unintended critical mass. Jim Mahaffey’s Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima offers numerous examples of such unintended events, from the careless handling of plutonium scrap in a machining process to the transfer of a fissionable liquid from a vessel of one shape to another. We might try to handle these risks as an organizational problem: more and better training for operatives about the importance of handling nuclear materials according to established protocols, and effective supervision and oversight to ensure that the protocols are observed on a regular basis. But it is also possible to design the material processes within a nuclear plant in a way that makes unintended criticality virtually impossible — for example, by storing radioactive solutions in containers that simply cannot be brought into close proximity with each other.

Nancy Leveson is a national expert on defining and applying principles of system safety. Her book Engineering a Safer World: Systems Thinking Applied to Safety is a thorough treatment of her thinking about this subject. She offers a handful of compelling reasons for believing that safety is a system-level characteristic that requires a systems approach: the fast pace of technological change, reduced ability to learn from experience, the changing nature of accidents, new types of hazards, increasing complexity and coupling, decreasing tolerance for single accidents, difficulty in selecting priorities and making tradeoffs, more complex relationships between humans and automation, and changing regulatory and public view of safety (kl 130 ff.). Particularly important in this list is the comment about complexity and coupling: “The operation of some systems is so complex that it defies the understanding of all but a few experts, and sometimes even they have incomplete information about the system’s potential behavior” (kl 137).

Given the fact that safety and accidents are products of whole systems, she is critical of the accident methodology generally applied to serious industrial, aerospace, and chemical accidents. This methodology involves tracing the series of events that led to the outcome, and identifying one or more events as the critical cause of the accident. However, she writes:

In general, event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry. An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events. A narrow focus on technological components and pure engineering activities or a similar narrow focus on operator errors may lead to ignoring some of the most important factors in terms of preventing future accidents. (kl 452)

Here is a definition of system safety offered later in ESW in her discussion of the emergence of the concept within the defense and aerospace fields in the 1960s:

System Safety … is a subdiscipline of system engineering. It was created at the same time and for the same reasons. The defense community tried using the standard safety engineering techniques on their complex new systems, but the limitations became clear when interface and component interaction problems went unnoticed until it was too late, resulting in many losses and near misses. When these early aerospace accidents were investigated, the causes of a large percentage of them were traced to deficiencies in design, operations, and management. Clearly, big changes were needed. System engineering along with its subdiscipline, System Safety, were developed to tackle these problems. (kl 1007)

Here Leveson mixes system design and organizational dysfunctions as system-level causes of accidents. But much of her work in this book and her earlier Safeware: System Safety and Computers gives extensive attention to the design faults and component interactions that lead to accidents — what we might call system safety in the narrow or technical sense.

A systems engineering approach to safety starts with the basic assumption that some properties of systems, in this case safety, can only be treated adequately in the context of the social and technical system as a whole. A basic assumption of systems engineering is that optimization of individual components or subsystems will not in general lead to a system optimum; in fact, improvement of a particular subsystem may actually worsen the overall system performance because of complex, nonlinear interactions among the components. (kl 1007) 

Overall, then, it seems clear that Leveson believes that both organizational features and technical system characteristics are part of the systems that created the possibility for accidents like Bhopal, Fukushima, and Three Mile Island. Her own accident model, STAMP (Systems-Theoretic Accident Model and Processes), is designed to help identify the causes of accidents and emphasizes both kinds of system properties.

Using this new causality model … changes the emphasis in system safety from preventing failures to enforcing behavioral safety constraints. Component failure accidents are still included, but our conception of causality is extended to include component interaction accidents. Safety is reformulated as a control problem rather than a reliability problem. (kl 1062)

In this framework, understanding why an accident occurred requires determining why the control was ineffective. Preventing future accidents requires shifting from a focus on preventing failures to the broader goal of designing and implementing controls that will enforce the necessary constraints. (kl 1084)
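
To make the contrast concrete, here is a minimal sketch in Python of the difference between checking that each component is individually reliable and enforcing a system-level safety constraint over component interactions. The tank-and-valve scenario and all names are hypothetical illustrations of my own, not Leveson's formalism or the STAMP notation.

```python
from dataclasses import dataclass

@dataclass
class Valve:
    name: str
    working: bool   # component-level reliability: has the part failed?
    is_open: bool   # current operating state

def all_components_reliable(valves):
    """Reliability view: the system is 'safe' if no component has failed."""
    return all(v.working for v in valves)

def safety_constraint_satisfied(feed, relief, tank_level, max_level):
    """Control view: enforce a constraint on how the components interact.
    The feed valve must not be open while the tank is near capacity and the
    relief path is closed, even though every component is individually 'working'."""
    return not (tank_level >= max_level and feed.is_open and not relief.is_open)

# Every component is working, yet the interaction violates the safety constraint.
feed = Valve("feed", working=True, is_open=True)
relief = Valve("relief", working=True, is_open=False)

print(all_components_reliable([feed, relief]))                # True
print(safety_constraint_satisfied(feed, relief, 0.98, 0.95))  # False
```

The point of the toy example is that a pure reliability analysis reports nothing wrong, while the constraint check flags a component-interaction hazard of the kind Leveson's control framing is meant to capture.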

Leveson’s brief analysis of the Bhopal disaster in 1984 (kl 384 ff.) emphasizes the organizational dysfunctions that led to the accident — and that were completely ignored by the Indian state’s investigation of the accident: out-of-service gauges, alarm deficiencies, inadequate response to prior safety audits, shortage of oxygen masks, failure to inform the police or surrounding community of the accident, and an environment of cost cutting that impaired maintenance and staffing. “When all the factors, including indirect and systemic ones, are considered, it becomes clear that the maintenance worker was, in fact, only a minor and somewhat irrelevant player in the loss. Instead, degradation in the safety margin occurred over time and without any particular single decision to do so but simply as a series of decisions that moved the plant slowly toward a situation where any slight error would lead to a major accident” (kl 447).

Patient safety

An issue which is of concern to anyone who receives treatment in a hospital is the topic of patient safety. How likely is it that there will be a serious mistake in treatment — wrong-site surgery, incorrect medication or radiation dose, exposure to a hospital-acquired infection? The current evidence is alarming. (Martin Makary et al estimate that over 250,000 deaths per year result from medical mistakes — making medical error now the third leading cause of mortality in the United States (link).) And when these events occur, where should we look for assigning responsibility — at the individual providers, at the systems that have been implemented for patient care, at the regulatory agencies responsible for overseeing patient safety?

Medical accidents commonly demonstrate a complex interaction of factors, from the individual provider to the technologies in use to failures of regulation and oversight. We can look at a hospital as a place where caring professionals do their best to improve the health of their patients while scrupulously avoiding errors. Or we can look at it as an intricate system involving the recording and dissemination of information about patients and the administration of procedures to them (surgery, medication, radiation therapy). In this sense a hospital is similar to a factory with multiple intersecting locations of activity. Finally, we can look at it as an organization — a system of division of labor, cooperation, and supervision by large numbers of staff whose joint efforts lead to health and accidents alike. Obviously each of these perspectives is partially correct. Doctors, nurses, and technicians are carefully and extensively trained to diagnose and treat their patients. The technology of the hospital — the digital patient record system, the devices that administer drugs, the surgical robots — can be designed better or worse from a safety point of view. And the social organization of the hospital can be effective and safe, or it can be dysfunctional and unsafe. So all three aspects are relevant both to safe operations and to the possibility of chronic lack of safety.

So how should we analyze the phenomenon of patient safety? What factors can be identified that distinguish high-safety hospitals from low-safety ones? What lessons can be learned from the study of accidents and mistakes that cumulatively lead to a hospital’s patient safety record?

The view that primarily emphasizes expertise and training of individual practitioners is very common in the healthcare industry, and yet this approach is not particularly useful as a basis for improving the safety of healthcare systems. Skill and expertise are necessary conditions for effective medical treatment; but the other two zones of accident space are probably more important for reducing accidents — the design of treatment systems and the organizational features that coordinate the activities of the various individuals within the system.

Dr. James Bagian is a strong advocate for the perspective of treating healthcare institutions as systems. Bagian considers both technical systems characteristics of processes and the organizational forms through which these processes are carried out and monitored. And he is very skilled at teasing out some of the ways in which features of both system and organization lead to avoidable accidents and failures. I recall his description of a safety walkthrough he had done in a major hospital. He said that during the tour he noticed a number of nurses’ stations which were covered with yellow sticky notes. He observed that this is both a symptom and a cause of an accident-prone organization. It means that individual caregivers were obligated to remind themselves of tasks and exceptions that needed to be observed. Far better was to have a set of systems and protocols that made sticky notes unnecessary. Here is the abstract from a short summary article by Bagian on the current state of patient safety:

Abstract

The traditional approach to patient safety in health care has ranged from reticence to outward denial of serious flaws. This undermines the otherwise remarkable advances in technology and information that have characterized the specialty of medical practice. In addition, lessons learned in industries outside health care, such as in aviation, provide opportunities for improvements that successfully reduce mishaps and errors while maintaining a standard of excellence. This is precisely the call in medicine prompted by the 1999 Institute of Medicine report “To Err Is Human: Building a Safer Health System.” However, to effect these changes, key components of a successful safety system must include: (1) communication, (2) a shift from a posture of reliance on human infallibility (hence “shame and blame”) to checklists that recognize the contribution of the system and account for human limitations, and (3) a cultivation of non-punitive open and/or de-identified/anonymous reporting of safety concerns, including close calls, in addition to adverse events.

(Here is the Institute of Medicine study to which Bagian refers; link.)

Nancy Leveson is an aeronautical and software engineer who has spent most of her career devoted to designing safe systems. Her book Engineering a Safer World: Systems Thinking Applied to Safety is a recent presentation of her theories of systems safety. She applies these approaches to problems of patient safety with several co-authors in “A Systems Approach to Analyzing and Preventing Hospital Adverse Events” (link). Here is the abstract and summary of findings for that article:

Objective:

This study aimed to demonstrate the use of a systems theory-based accident analysis technique in health care applications as a more powerful alternative to the chain-of-event accident models currently underpinning root cause analysis methods.

Method:

A new accident analysis technique, CAST [Causal Analysis based on Systems Theory], is described and illustrated on a set of adverse cardiovascular surgery events at a large medical center. The lessons that can be learned from the analysis are compared with those that can be derived from the typical root cause analysis techniques used today.

Results:

The analysis of the 30 cardiovascular surgery adverse events using CAST revealed the reasons behind unsafe individual behavior, which were related to the design of the system involved and not negligence or incompetence on the part of individuals. With the use of the system-theoretic analysis results, recommendations can be generated to change the context in which decisions are made and thus improve decision making and reduce the risk of an accident.

Conclusions:

The use of a systems-theoretic accident analysis technique can assist in identifying causal factors at all levels of the system without simply assigning blame to either the frontline clinicians or technicians involved. Identification of these causal factors in accidents will help health care systems learn from mistakes and design system-level changes to prevent them in the future.

Crucial in this article is this research group’s effort to identify causes “at all levels of the system without simply assigning blame to either the frontline clinicians or technicians involved”. The key result is this: “The analysis of the 30 cardiovascular surgery adverse events using CAST revealed the reasons behind unsafe individual behavior, which were related to the design of the system involved and not negligence or incompetence on the part of individuals.”

Bagian, Leveson, and others make a crucial point: in order to substantially increase the performance of hospitals and the healthcare system more generally when it comes to patient safety, it will be necessary to extend the focus of safety analysis from individual incidents and agents to the systems and organizations through which these accidents were possible. In other words, attention to systems and organizations is crucial if we are to significantly reduce the frequency of medical and hospital mistakes.

(The Makary et al estimate of 250,000 deaths caused by medical error has been questioned on methodological grounds. See Aaron Carroll’s thoughtful rebuttal (NYT 8/15/16; link).)

What the boss wants to hear …

According to David Halberstam in his outstanding history of the war in Vietnam, The Best and the Brightest, a prime cause of disastrous decision-making by Presidents Kennedy and Johnson was an institutional imperative in the Defense Department to come up with a set of facts that conformed to what the President wanted to hear. Robert McNamara and McGeorge Bundy were among the highest-level miscreants in Halberstam’s account; they were determined to craft an assessment of the situation on the ground in Vietnam that conformed best with their strategic advice to the President.

Ironically, a very similar dynamic led to one of modern China’s greatest disasters, the Great Leap Forward famine in 1959. The Great Helmsman was certain that collective agriculture would be vastly more productive than private agriculture; and following the collectivization of agriculture, party officials in many provinces obliged this assumption by reporting inflated grain statistics throughout 1958 and 1959. The result was a famine that led to at least twenty million excess deaths during a two-year period as the central state shifted resources away from agriculture (Frank Dikötter, Mao’s Great Famine: The History of China’s Most Devastating Catastrophe, 1958-62).

More mundane examples are available as well. When information about possible sexual harassment in a given department is suppressed because “it won’t look good for the organization” and “the boss will be unhappy”, the organization is on a collision course with serious problems. When concerns about product safety or reliability are suppressed within the organization for similar reasons, the results can be equally damaging, to consumers and to the corporation itself. General Motors, Volkswagen, and Michigan State University all seem to have suffered from these deficiencies of organizational behavior. This is a serious cause of organizational mistakes and failures. It is impossible to make wise decisions — individual or collective — without accurate and truthful information from the field. And yet the knowledge of higher-level executives depends upon the truthful and full reporting of subordinates, who sometimes have career incentives that work against honesty.

So how can this unhappy situation be avoided? Part of the answer has to do with the behavior of the leaders themselves. It is important for leaders to explicitly and implicitly invite the truth — whether it is good news or bad news. Subordinates must be encouraged to be forthcoming and truthful; and bearers of bad news must not be subject to retaliation. Boards of directors, both private and public, need to make clear their own expectations on this score as well: that they expect leading executives to invite and welcome truthful reporting, and that they expect individuals throughout the organization to provide truthful reporting. A culture of honesty and transparency is a powerful antidote to the disease of fabrications to please the boss.

Anonymous hotlines and formal protection of whistle-blowers are other institutional arrangements that lead to greater honesty and transparency within an organization. These avenues have the advantage of being largely outside the control of the upper executives, and therefore can serve as a somewhat independent check on dishonest reporting.

A reliable practice of accountability is also a deterrent to dishonest or partial reporting within an organization. The truth eventually comes out — whether about sexual harassment, about hidden defects in a product, or about workplace safety failures. When boards of directors and organizational policies make it clear that there will be negative consequences for dishonest behavior, this gives an ongoing incentive of prudence for individuals to honor their duties of honesty within the organization.

This topic falls within the broader question of how individual behavior throughout an organization has the potential for giving rise to important failures that harm the public and harm the organization itself. 


Regulatory failure

When we think of the issues of health and safety that exist in a modern complex economy, it is impossible to imagine that these social goods will be produced in sufficient quantity and quality by market forces alone. Safety and health hazards are typically regarded as “externalities” by private companies — if they can be “dumped” on the public without cost, this is good for the profitability of the company. And state regulation is the appropriate remedy for this tendency of a market-based economy to chronically produce hazards and harms, whether in the form of environmental pollution, unsafe foods and drugs, or unsafe industrial processes. David Moss and John Cisternino’s New Perspectives on Regulation provides some genuinely important perspectives on the role and effectiveness of government regulation in an epoch which has been shaped by virulent efforts to reduce or eliminate regulations on private activity. This volume is a report from the Tobin Project.

It is poignant to read the optimism that the editors and contributors have — in 2009 — about the resurgence of support for government regulation. The financial crisis of 2008 had stimulated a vigorous round of regulation of financial institutions, and most of the contributors took this as a harbinger of fresh public support for regulation more generally. Of course, events have shown this confidence to be sadly mistaken; the dismantling of Federal regulatory regimes by the Trump administration threatens to take the country back to the period described by Upton Sinclair in the early part of the prior century. But what this demonstrates is the great importance of the Tobin Project. We need to build a public understanding and consensus around the unavoidable necessity of effective and pervasive regulatory regimes in environment, health, product safety, and industrial safety.

Here is how Mitchell Weiss, Executive Director of the Tobin Project, describes the project culminating in this volume:

To this end, in the fall of 2008 the Tobin Project approached leading scholars in the social sciences with an unusual request: we asked them to think about the topic of economic regulation and share key insights from their fields in a manner that would be accessible to both policymakers and the public. Because we were concerned that a conventional literature survey might obscure as much as it revealed, we asked instead that the writers provide a broad sketch of the most promising research in their fields pertaining to regulation; that they identify guiding principles for policymakers wherever possible; that they animate these principles with concrete policy proposals; and, in general, that they keep academic language and footnotes to a minimum. (5)

The lead essay is provided by Joseph Stiglitz, who looks more closely than economists of previous decades had done at the real consequences of market failure. Stiglitz puts the point about market failure very crisply:

Only under certain ideal circumstances may individuals, acting on their own, obtain “pareto efficient” outcomes, that is, situations in which no one can be made better off without making another worse off. These individuals involved must be rational and well informed, and must operate in competitive marketplaces that encompass a full range of insurance and credit markets. In the absence of these ideal circumstances, there exist government interventions that can potentially increase societal efficiency and/or equity. (11)
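
Stated in conventional textbook notation (a standard formalization offered for reference, not drawn from the volume itself), an allocation is Pareto efficient if no feasible alternative makes some agent strictly better off while leaving every agent at least as well off:

```latex
% Standard definition of Pareto efficiency (conventional notation, not from the volume).
% X is the set of feasible allocations; u_i is the utility function of agent i.
x^{*} \in X \text{ is Pareto efficient} \iff
\nexists\, x \in X \;\text{such that}\;
u_i(x) \geq u_i(x^{*}) \;\forall i
\;\text{and}\;
u_j(x) > u_j(x^{*}) \;\text{for some } j.
```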

And regulation is unpopular — with the businesses, landowners, and other powerful agents whose actions are constrained.

By its nature, a regulation restricts an individual or firm from doing what it otherwise would have done. Those whose behavior is so restricted may complain about, say, their loss of profits and potential adverse effects on innovation. But the purpose of government intervention is to address potential consequences that go beyond the parties directly involved, in situations in which private profit is not a good measure of social impact. Appropriate regulation may even advance welfare-enhancing innovations. (13)

Stiglitz pays attention to the pervasive problem of “regulatory capture”:

The current system has made regulatory capture too easy. The voices of those who have benefited from lax regulation are strong; the perspectives of the investment community have been well represented. Among those whose perspectives need to be better represented are the laborers whose jobs would be lost by macro-mismanagement, and the pension holders whose pension funds would be eviscerated by excessive risk taking.

One of the arguments for a financial products safety commission, which would assess the efficacy and risks of new products and ascertain appropriate usage, is that it would have a clear mandate, and be staffed by people whose only concern would be protecting the safety and efficacy of the products being sold. It would be focused on the interests of the ordinary consumer and investors, not the interests of the financial institutions selling the products. (18)

It is very interesting to read Stiglitz’s essay with attention to the economic focus he offers. His examples all come from the financial industry — the risk at hand in 2008-2009. But the arguments apply equally profoundly to manufacturing, the pharmaceutical and food industries, energy industries, farming and ranching, and the for-profit education sector. At the same time the institutional details are different, and an essay on this subject with a focus on nuclear or chemical plants would probably identify a different set of institutional barriers to effective regulation.

Also particularly interesting is the contribution by Michael Barr, Eldar Shafir, and Sendhil Mullainathan on how behavioral perspectives on “rational action” can lead to more effective regulatory regimes. This essay pays close attention to the findings of experimental economics and behavioral economics, and the deviations from “pure economic rationality” that are pervasive in ordinary economic decision making. These features of decision-making are likely to be relevant to the effectiveness of a regulatory regime as well. Further, it suggests important areas of consumer behavior that are particularly subject to exploitative practices by financial companies — creating a new need for regulation of these kinds of practices. Here is how they summarize their approach:

We propose a different approach to regulation. Whereas the classical perspective assumes that people generally know what is important and knowable, plan with insight and patience, and carry out their plans with wisdom and self-control, the central gist of the behavioral perspective is that people often fail to know and understand things that matter; that they misperceive, misallocate, and fail to carry out their intended plans; and that the context in which people function has great impact on their behavior, and, consequently, merits careful attention and constructive work. In our framework, successful regulation requires integrating this richer view of human behavior with our understanding of markets. Firms will operate on the contour defined by this psychology and will respond strategically to regulations. As we describe above, because firms have a great deal of latitude in issue framing, product design, and so on, they have the capacity to affect behavior and circumvent or pervert regulatory constraints. Ironically, firms’ capacity to do so is enhanced by their interaction with “behavioral” consumers (as opposed to the hypothetically rational actors of neoclassical economic theory), since so many of the things a regulator would find very hard to control (for example, frames, design, complexity, etc.) can greatly influence consumers’ behavior. The challenge of behaviorally informed regulation, therefore, is to be well designed and insightful both about human behavior and about the behaviors that firms are likely to exhibit in response to both consumer behavior and regulation. (55)

The contributions to this volume are very suggestive with regard to the issues of product safety, manufacturing safety, food and drug safety, and the like, which constitute the larger core of the need for regulatory regimes. And the challenges faced in the areas of financial regulation discussed here are likely to prove illuminating in other sectors as well.

 

Empowering the safety officer?

How can industries involving processes that create large risks of harm for individuals or populations be modified so they are more capable of detecting and eliminating the precursors of harmful accidents? How can nuclear accidents, aviation crashes, chemical plant explosions, and medical errors be reduced, given that each of these activities involves large bureaucratic organizations conducting complex operations and with substantial inter-system linkages? How can organizations be reformed to enhance safety and to minimize the likelihood of harmful accidents?

One of the lessons learned from the Challenger space shuttle disaster is the importance of a strongly empowered safety officer in organizations that deal in high-risk activities. This means the creation of a position dedicated to ensuring safe operations that falls outside the normal chain of command. The idea is that the normal decision-making hierarchy of a large organization has a built-in tendency to maintain production schedules and avoid costly delays. In other words, there is a built-in incentive to treat safety issues with lower priority than most people would expect.

If there had been an empowered safety officer in the launch hierarchy for the Challenger launch in 1986, there is a good chance this officer would have listened more carefully to the Morton-Thiokol engineering team’s concerns about low-temperature damage to O-rings and would have ordered a halt to the launch sequence until temperatures in Florida rose to the critical value. The Rogers Commission faulted the decision-making process leading to the launch decision in its final report on the accident (The Report of the Presidential Commission on the Space Shuttle Challenger Accident – The Tragedy of Mission 51-L in 1986 – Volume One, Volume Two, Volume Three).

This approach is productive because empowering a safety officer creates a different set of interests in the management of a risky process. The safety officer’s interest is in safety, whereas other decision makers are concerned about revenues and costs, public relations, reputation, and other instrumental goods. So a dedicated safety officer is empowered to raise safety concerns that other officers might be hesitant to raise. Ordinary bureaucratic incentives may lead to underestimating risks or concealing faults; so lowering the accident rate requires giving some individuals the incentive and power to act effectively to reduce risks.

Similar findings have emerged in the study of medical and hospital errors. It has been recognized that high-risk activities are made less risky by empowering all members of the team to call a halt in an activity when they perceive a safety issue. When all members of the surgical team are empowered to halt a procedure when they note an apparent error, serious operating-room errors are reduced. (Here is a report from the American College of Obstetricians and Gynecologists on surgical patient safety; link. And here is a 1999 National Academy report on medical error; link.)

The effectiveness of a team-based approach to safety depends on one central fact: there is a high level of expertise embodied in the staff operating a surgical suite, an engineering laboratory, or a drug manufacturing facility. Empowering these individuals to stop a procedure when they judge there is an unrecognized error in play greatly extends the amount of embodied knowledge brought to bear on the process. The surgeon, the commanding officer, or the lab director is no longer the sole expert whose judgments count.

But it also seems clear that these innovations don’t work equally well in all circumstances. Take nuclear power plant operations. In Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima James Mahaffey documents multiple examples of nuclear accidents that resulted from the efforts of mid-level workers to address an emerging problem in an improvised way. In the case of nuclear power plant safety, it appears that the best prescription for safety is to insist on rigid adherence to pre-established protocols. In this case the function of a safety officer is to monitor operations to ensure protocol conformance — not to exercise independent judgment about the best way to respond to an unfavorable reactor event.

It is in fact an interesting exercise to try to identify the kinds of operations in which these innovations are likely to be effective.

Here is a fascinating interview in Slate with Jim Bagian, a former astronaut, one-time director of the Veterans Administration’s National Center for Patient Safety, and distinguished safety expert; link. Bagian emphasizes the importance of taking a system-based approach to safety. Rather than focusing on finding blame for specific individuals whose actions led to an accident, Bagian emphasizes the importance of tracing back to the institutional, organizational, or logistic background of the accident. What can be changed in the process — of delivering medications to patients, of fueling a rocket, or of moving nuclear solutions around in a laboratory — that makes the likelihood of an accident substantially lower?

The safety principles involved here seem fairly simple: cultivate a culture in which errors and near-misses are reported and investigated without blame; empower individuals within risky processes to halt the process if their expertise and experience indicates the possibility of a significant risky error; designate individuals within organizations whose interests are defined in terms of the identification and resolution of unsafe practices or conditions; and share information about safety within the industry and with the public.

Technology lock-in accidents

image: diagram of molten salt reactor

Organizational and regulatory features are sometimes part of the causal background of important technology failures. This is particularly true in the history of nuclear power generation. The promise of peaceful uses of atomic energy was enormously attractive at the end of World War II. In abstract terms the possibility of generating useable power from atomic reactions was quite simple. What was needed was a controllable fission reaction in which the heat produced by fission could be captured to run a steam-powered electrical generator.

The technical challenges presented by harnessing nuclear fission in a power plant were large. Fissionable material needed to be produced as useable fuel sources. A control system needed to be designed to maintain the level of fission at a desired level. And, most critically, a system for removing heat from the fissioning fuel needed to be designed so that the reactor core would not overheat and melt down, releasing energy and radioactive materials into the environment.

Early reactor designs took different approaches to the heat-removal problem. Liquid metal reactors used a metal like sodium as the fluid that would run through the core removing heat to a heat sink for dispersal; and water reactors used pressurized water to serve that function. The sodium breeder reactor design appeared to be a viable approach, but incidents like the Fermi 1 disaster near Detroit cast doubt on the wisdom of using this approach. The reactor design that emerged as the dominant choice in civilian power production was the light water reactor. But light water reactors presented their own technological challenges, including most especially the risk of a massive steam explosion in the event of a power interruption to the cooling plant. In order to obviate this risk reactor designs involved multiple levels of redundancy to ensure that no such power interruption would occur. And much of the cost of construction of a modern light water power plant is dedicated to these systems — containment vessels, redundant power supplies, etc. In spite of these design efforts, however, light water reactors at Three Mile Island and Fukushima did in fact melt down under unusual circumstances — with particularly devastating results in Fukushima. The nuclear power industry in the United States essentially died as a result of public fears of the possibility of meltdown of nuclear reactors near populated areas — fears that were validated by several large nuclear disasters.

What is interesting about this story is that there was an alternative reactor design, developed by US nuclear scientists and engineers in the 1950s, that involved a significantly different solution to the problem of harnessing the heat of a nuclear reaction and that posed a dramatically lower risk of meltdown and radioactive release. This is the molten salt reactor, first developed at the Oak Ridge National Laboratory as part of the loopy idea of creating an atomic-powered aircraft that could remain aloft for months. This reactor design operates at atmospheric pressure, and the technological challenges of maintaining a molten salt cooling system are readily solved. Because no water is involved in the cooling system, the greatest danger in a nuclear power plant, a violent steam explosion, is eliminated entirely: molten salt will not flash to steam. Chinese nuclear energy researchers are currently developing a next generation of molten salt reactors, and there is a good chance that they will succeed in designing a reactor system that is both more efficient in terms of cost and dramatically safer in terms of low-probability, high-cost accidents (link). This technology also has the advantage of making much more efficient use of the nuclear fuel, leaving a dramatically smaller amount of radioactive waste to dispose of.

So why did the US nuclear industry abandon the molten-salt reactor design? This seems to be a situation of lock-in by an industry and a regulatory system. Once the industry settled on the light water reactor design, it was implemented by the Nuclear Regulatory Commission in terms of the regulations and licensing requirements for new nuclear reactors. It was subsequently extremely difficult for a utility company or a private energy corporation to invest in the research and development and construction costs that would be associated with a radical change of design. There is currently an effort by an American company to develop a new-generation molten salt reactor, and the process is inhibited by the knowledge that it will take a minimum of ten years to gain certification and licensing for a possible commercial plant to be based on the new design (link).

This story illustrates the possibility that a process of technology development may get locked into a particular approach that embodies substantial public risk, and it may be all but impossible to subsequently adopt a different approach. In another context Thomas Hughes refers to this as technological momentum, and it is clear that there are commercial, institutional, and regulatory reasons for this “stickiness” of a major technology once it is designed and adopted. In the case of nuclear power the inertia associated with light water reactors is particularly unfortunate, given that it blocked other solutions that were both safer and more economical.

(Here is a valuable review of safety issues in the nuclear power industry; link.)

Nuclear accidents

 
diagrams: Chernobyl reactor before and after
 

Nuclear fission is one of the world-changing discoveries of the mid-twentieth century. The atomic bomb projects of the United States led to the atomic bombing of Japan in August 1945, and the hope for limitless electricity brought about the proliferation of a variety of nuclear reactors around the world in the decades following World War II. And, of course, nuclear weapons proliferated to other countries beyond the original circle of atomic powers.

Given the enormous energies associated with fission and the dangerous and toxic properties of radioactive components of fission processes, the possibility of a nuclear accident is a particularly frightening one for the modern public. The world has seen the results of several massive nuclear accidents — Chernobyl and Fukushima in particular — and the devastating results they have had on human populations and the social and economic wellbeing of the regions in which they occurred.

Safety is therefore a paramount priority in the nuclear industry, in research labs as well as in military and civilian applications. So what is the situation of safety in the nuclear sector? Jim Mahaffey’s Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima is a detailed and carefully researched attempt to answer this question. And the information he provides is not reassuring. Beyond the celebrated and well-known disasters at nuclear power plants (Three Mile Island, Chernobyl, Fukushima), Mahaffey refers to hundreds of accidents involving reactors, research laboratories, weapons plants, and deployed nuclear weapons that have had less public awareness. These accidents resulted in a very low number of lives lost, but their frequency is alarming. They are indeed “normal accidents” (Perrow, Normal Accidents: Living with High-Risk Technologies). For example:

  • a Japanese fishing boat is contaminated by fallout from Castle Bravo test of hydrogen bomb; lots of radioactive fish at the markets in Japan (March 1, 1954) (kl 1706)
  • one MK-6 atomic bomb is dropped on Mars Bluff, South Carolina, after a crew member accidentally pulled the emergency bomb release handle (February 5, 1958) (kl 5774)
  • Fermi 1 liquid sodium plutonium breeder reactor experiences fuel meltdown during startup trials near Detroit (October 4, 1966) (kl 4127)

Mahaffey also provides detailed accounts of the most serious nuclear accidents and meltdowns during the past forty years: Three Mile Island, Chernobyl, and Fukushima.

The safety and control of nuclear weapons is of particular interest. Here is Mahaffey’s summary of “Broken Arrow” events — the loss of atomic and fusion weapons:

Did the Air Force ever lose an A-bomb, or did they just misplace a few of them for a short time? Did they ever drop anything that could be picked up by someone else and used against us? Is humanity going to perish because of poisonous plutonium spread that was snapped up by the wrong people after being somehow misplaced? Several examples will follow. You be the judge. 

Chuck Hansen [U.S. Nuclear Weapons – The Secret History] was wrong about one thing. He counted thirty-two “Broken Arrow” accidents. There are now sixty-five documented incidents in which nuclear weapons owned by the United States were lost, destroyed, or damaged between 1945 and 1989. These bombs and warheads, which contain hundreds of pounds of high explosive, have been abused in a wide range of unfortunate events. They have been accidentally dropped from high altitude, dropped from low altitude, crashed through the bomb bay doors while standing on the runway, tumbled off a fork lift, escaped from a chain hoist, and rolled off an aircraft carrier into the ocean. Bombs have been abandoned at the bottom of a test shaft, left buried in a crater, and lost in the mud off the coast of Georgia. Nuclear devices have been pounded with artillery of a foreign nature, struck by lightning, smashed to pieces, scorched, toasted, and burned beyond recognition. Incredibly, in all this mayhem, not a single nuclear weapon has gone off accidentally, anywhere in the world. If it had, the public would know about it. That type of accident would be almost impossible to conceal. (kl 5527)

There are a few common threads in the stories of accident and malfunction that Mahaffey provides. First, there are failures of training and knowledge on the part of front-line workers. The physics of nuclear fission are often counter-intuitive, and the idea of critical mass does not fully capture the danger of a quantity of fissionable material. The geometry in which the material is stored makes a crucial difference to whether it goes critical. Fissionable material is often transported and manipulated in liquid solution; and the shape and configuration of the vessel in which the solution is held makes a difference to the probability of exponential growth of neutron emission — leading to runaway fission of the material. Mahaffey documents accidents that occurred in nuclear materials processing plants that resulted from plant workers applying what they knew from industrial plumbing to their efforts to solve basic shop-floor problems. All too often the result was a flash of blue light and the release of a great deal of heat and radioactive material.
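
The geometric point can be illustrated with a rough back-of-the-envelope comparison. Neutron leakage scales roughly with a vessel's surface area while neutron production scales with its volume, so transferring the same solution from a tall, narrow container into a squat one lowers the surface-to-volume ratio and moves the system toward criticality. The sketch below uses illustrative dimensions of my own, not figures from Mahaffey:

```python
import math

def cylinder_surface_to_volume(radius_cm, height_cm):
    """Surface-to-volume ratio (per cm) of a closed cylindrical vessel."""
    surface = 2 * math.pi * radius_cm * (radius_cm + height_cm)
    volume = math.pi * radius_cm ** 2 * height_cm
    return surface / volume

# Roughly the same ~50 liters of solution held in two differently shaped vessels
# (dimensions are illustrative only, not taken from any real incident).
tall_narrow = cylinder_surface_to_volume(radius_cm=10, height_cm=159)   # ~50,000 cm^3
squat_wide = cylinder_surface_to_volume(radius_cm=25, height_cm=25.5)   # ~50,000 cm^3

# The squat vessel leaks proportionally fewer neutrons per unit volume, which is
# why a solution that is subcritical in one container can go critical after a transfer.
print(f"tall, narrow vessel S/V: {tall_narrow:.3f} per cm")
print(f"squat, wide vessel  S/V: {squat_wide:.3f} per cm")
```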

Second, there is a fault at the opposite end of the knowledge spectrum — the tendency of expert engineers and scientists to believe that they can solve complicated reactor problems on the fly. This turned out to be a critical problem at Chernobyl (kl 6859).

The most difficult problem to handle is that the reactor operator, highly trained and educated with an active and disciplined mind, is liable to think beyond the rote procedures and carefully scheduled tasks. The operator is not a computer, and he or she cannot think like a machine. When the operator at NRX saw some untidy valve handles in the basement, he stepped outside the procedures and straightened them out, so that they were all facing the same way. (kl 2057)

There are also clear examples of inappropriate supervision in the accounts shared by Mahaffey. Here is an example from Chernobyl.

[Deputy chief engineer] Dyatlov was enraged. He paced up and down the control panel, berating the operators, cursing, spitting, threatening, and waving his arms. He demanded that the power be brought back up to 1,500 megawatts, where it was supposed to be for the test. The operators, Toptunov and Akimov, refused on grounds that it was against the rules to do so, even if they were not sure why. 

Dyatlov turned on Toptunov. “You lying idiot! If you don’t increase power, Tregub will!”  

Tregub, the Shift Foreman from the previous shift, was officially off the clock, but he had stayed around just to see the test. He tried to stay out of it. 

Toptunov, in fear of losing his job, started pulling rods. By the time he had wrestled it back to 200 megawatts, 205 of the 211 control rods were all the way out. In this unusual condition, there was danger of an emergency shutdown causing prompt supercriticality and a resulting steam explosion. At 1:22:30 a.m., a read-out from the operations computer advised that the reserve reactivity was too low for controlling the reactor, and it should be shut down immediately. Dyatlov was not worried. “Another two or three minutes, and it will be all over. Get moving, boys!” (kl 6887)

This was the turning point in the disaster.

A related fault is the intrusion of political and business interests into the design and conduct of high-risk nuclear actions. Leaders want a given outcome without understanding the technical details of the processes they are demanding; subordinates like Toptunov are eventually cajoled or coerced into taking the problematic actions. The persistence of advocates for liquid sodium breeder reactors represents a higher-level example of the same fault. Associated with this role of political and business interests is an impulse towards secrecy and concealment when accidents occur and deliberate understatement of the public dangers created by an accident — a fault amply demonstrated in the Fukushima disaster.

Atomic Accidents provides a fascinating history of events of which most of us are unaware. The book is not primarily intended to offer an account of the causes of these accidents, but rather of the ways in which they unfolded and the consequences they had for human welfare. (Generally speaking, his view is that nuclear accidents in North America and Western Europe have had remarkably few human casualties.) And many of the accidents he describes are exactly the sorts of failures that are common in all large-scale industrial and military processes.

(Large-scale technology failure has come up frequently here. See these posts for analysis of some of the organizational causes of technology failure (link, link, link).)

Declining industries

Why is it so difficult for leaders in various industries and sectors to seriously address the existential threats that sometimes arise? Planning for marginal changes in the business environment is fairly simple; problems can be solved, costs can be cut, and the firm can stay in the black. But how about more radical but distant threats? What about the grocery sector when confronted by Amazon’s radical steps in food selling? What about Polaroid or Kodak when confronted by the rise of digital photography in the 1990s? What about the US steel industry in the 1960s when confronted with rising Asian competition and declining manufacturing facilities?

From the outside these companies and sectors seem like dodos incapable of confronting the threats that imperil them. They seem to be ignoring oncoming train wrecks simply because these catastrophes are still in the distant future. And yet the leaders in these companies were generally speaking talented, motivated men and women. So what are the organizational or cognitive barriers that arise to make it difficult for leaders to successfully confront the biggest threats they face?

Part of the answer seems to be the fact that distant hazards seem smaller than the more immediate and near-term challenges that an organization must face; so there is a systematic bias towards myopic decision-making. This sounds like a Kahneman-Tversky kind of cognitive shortcoming.

A second possible explanation is that it is easy enough to persuade oneself that distant threats will either resolve themselves organically or that the organization will discover novel solutions in the future. This seems to be part of the reason that climate-change foot-draggers take the position they do: that “things will sort out”, “new technologies will help solve the problems in the future.” This sounds like a classic example of weakness of the will — an unwillingness to rationally confront hard truths about the future that ought to influence choices today but often don’t.

Then there is the timeframe of accountability that is in place in government, business, and non-profit organizations alike. Leaders are rewarded and punished for short-term successes and failures, not prudent long-term planning and preparation. This is clearly true for term-limited elected officials, but it is equally true for executives whose stakeholders evaluate performance based on quarterly profits rather than long-term objectives and threats.

We judge harshly those leaders who allow their firms or organizations to perish because of a chronic failure to plan for substantial change in the environments in which they will need to operate in the future. Nero is not remembered kindly for his dedication to his fiddle. And yet at any given time, many industries are in precisely that situation. What kind of discipline and commitment can protect organizations against this risk?

This is an interesting question in the abstract. But it is also a challenging question for people who care about the long-term viability of colleges and universities. Are there forces at work today that will bring about existential crises for universities in twenty years (enrollments, tuition pressure, technology change)? Are there technological or organizational choices that should be made today that would help to avert those crises in the future? And are university leaders taking the right steps to prepare their institutions for the futures they will face in several decades?

Gaining compliance

Organizations always involve numerous staff members whose behavior has the potential for creating significant risk for individuals and the organization but who are only loosely supervised. This situation unavoidably raises principal-agent problems. Let’s assume that the great majority of staff members are motivated by good intentions and ethical standards. That means that there are a small number of individuals whose behavior is not ethical and well intentioned. What arrangements can an organization put in place to prevent bad behavior and protect individuals and the integrity of the organization?

For certain kinds of bad behavior there are well understood institutional arrangements that work well to detect and deter the wrong actions. This is especially true for business transactions, purchasing, control of cash, expense reporting and reimbursement, and other financial processes within the organization. The audit and accounting functions within almost every sophisticated organization permit a reasonably high level of confidence in the likelihood of detection of fraud, theft, and misreporting. This doesn’t mean that corrupt financial behavior does not occur; but audits make it much more difficult to succeed in persistent dishonest behavior. So an organization with an effective audit function is likely to have a reasonably high level of compliance in the areas where standard audits can be effectively conducted.

A second kind of compliance effort has to do with the culture and practice of observer reporting of misbehavior. Compliance hotlines allow individuals who have observed (or suspected) bad behavior to report that behavior to responsible agents who are obligated to investigate these allegations. Policies that require reporting of certain kinds of bad behavior to responsible officers of the organization — sexual harassment, racial discrimination, or fraudulent actions, for example — should have the effect of revealing some kinds of misbehavior, and deterring others from engaging in bad behavior. So a culture and expectation of reporting is helpful in controlling bad behavior.

A third approach that some organizations take to compliance is to place a great deal of emphasis on the moral culture of the organization — shared values, professional duty, and role responsibilities. Leaders can support and facilitate a culture of voluntary adherence to the values and policies of the organization, so that virtually all members of the organization fall in the “well-intentioned” category. The thrust of this approach is to make large efforts at eliciting voluntary good behavior. Business professor David Hess has done a substantial amount of research on these final two topics (link, link).

Each of these organizational mechanisms has some efficacy. But unfortunately they do not suffice to create an environment where we can be highly confident that serious forms of misconduct do not occur. In particular, reporting and culture are only partially efficacious when it comes to private and covert behavior like sexual assault, bullying, and discriminatory speech and behavior in the workplace. This leads to an important question: are there more intrusive mechanisms of supervision and observation that would permit organizations to discover patterns of misconduct even if they remain unreported by observers and victims? Are there better ways for an organization to ensure that no one is subject to the harmful actions of a predator or harasser?

A more active strategy for an organization committed to eliminating sexual assault is to attempt to predict the environments where inappropriate interpersonal behavior is possible and to redesign the setting so the behavior is substantially less likely. For example, a hospital may require that any physical examinations of minors must be conducted in the presence of a chaperone or other health professional. A school of music or art may require that after-hours private lessons are conducted in semi-public locations. These rules would deprive a potential predator of the seclusion needed for the bad behavior. And the practitioner who is observed violating the rule would then be suspect and subject to further investigation and disciplinary action.

Here is a perhaps farfetched idea: a “behavior audit” that is periodically performed in settings where inappropriate covert behavior is possible. We might imagine a process in which a random set of people who might have been in a position to be subject to inappropriate behavior is periodically selected for interview. These individuals would then be interviewed with an eye to surfacing negative or harmful experiences they may have had. The process might be carried out for groups of patients, students, athletes, performers, or auditioners in the entertainment industry, and the goal would be to uncover traces of the kinds of sexual harassment and assault that are at the heart of recent revelations in a wide range of industries and organizations. Such an audit would occasionally reveal a pattern of previously unknown behavior requiring further investigation, while the more frequent result would be negative. The process would support a higher level of confidence that the organization has reasonably good knowledge of the frequency and scope of bad behavior, and a better basis for putting a plan of remediation in place.
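
A minimal sketch of the sampling step of such a behavior audit, assuming nothing more than a roster of people who were in a position to be affected; the function name and parameters here are illustrative only:

```python
import random

def select_audit_sample(roster, sample_size, seed=None):
    """Randomly choose individuals to interview in a periodic behavior audit.

    roster: identifiers of people who were in a position to be subject to
    the behavior under review (patients, students, athletes, performers).
    A fixed seed makes the selection reproducible for audit records.
    """
    rng = random.Random(seed)
    if sample_size >= len(roster):
        return list(roster)
    return rng.sample(roster, sample_size)

# Example: interview 25 of the 400 students taught privately this term.
students = [f"student-{i:03d}" for i in range(400)]
interviewees = select_audit_sample(students, sample_size=25, seed=2024)
```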

All of these organizational strategies serve fundamentally as attempts to solve principal-agent problems within the organization. The principals of the organization have expectations about the norms that ought to govern behavior within the organization. These mechanisms are intended to increase the likelihood that there is conformance between the principal’s expectations and the agent’s behavior. And, when they fail, several of these mechanisms are intended to make it more likely that bad behavior is identified and corrected.

(Here is an earlier post treating scientific misconduct as a principal-agent problem; link.)

Trust and organizational effectiveness

It is fairly well agreed that organizations require a degree of trust among the participants in order for the organization to function at all. But what does this mean? How much trust is needed? How is trust cultivated among participants? And what are the mechanisms through which trust enhances organizational effectiveness?

The minimal requirements of cooperation presuppose a certain level of trust. As A plans and undertakes a sequence of actions designed to bring about Y, his or her efforts must rely upon the coordination promised by other actors. If A does not have a sufficiently high level of confidence in B’s assurances and compliance, then A will be rationally compelled to choose another series of actions. If Larry Bird had not trusted his teammate Dennis Johnson to cut to the basket, the famous 1987 steal would never have turned into the game-winning layup.


First, what do we mean by trust in the current context? Each actor in an organization or group has intentions, engages in behavior, and communicates with other actors. Part of communication is often in the form of sharing information and agreeing upon a plan of coordinated action. Agreeing upon a plan in turn often requires statements and commitments from various actors about the future actions they will take. Trust is the circumstance that permits others to rely upon those statements and commitments. We might say, then, that A trusts B just in case —

  • A believes that when B asserts P, this is an honest expression of B’s beliefs.
  • A believes that when B says he/she will do X, this is an honest commitment on B’s part and B will carry it out (absent extraordinary reasons to the contrary).
  • A believes that when B asserts that his/her actions will be guided by his/her best understanding of the purposes and goals of the organization, this is a truthful expression.
  • A believes that B’s future actions, observed and unobserved, will be consistent with his/her avowals of intentions, values, and commitments.

So what are some reasons why mistrust might rear its ugly head between actors in an organization? Why might A fail to trust B?

  • A may believe that B’s private interests are driving B’s actions (rather than adherence to prior commitments and values).
  • A may believe that B suffers from weakness of the will, an inability to carry out his honest intentions.
  • A may believe that B manipulates his statements of fact to suit his private interests.
  • Or less dramatically: A may not have high confidence in these features of B’s behavior.
  • B may have no real interest or intention in behaving in a truthful way.

And what features of organizational life and practice might be expected to either enhance inter-personal trust or to undermine it?

Trust is enhanced by individuals having the opportunity to get acquainted with their collaborators in a more personal way — to see from non-organizational contexts that they are generally well intentioned; that they make serious efforts to live up to their stated intentions and commitments; and that they are generally honest. So perhaps there is a rationale for the bonding exercises that many companies undertake for their workers.

Likewise, trust is enhanced by the presence of a shared and practiced commitment to the value of trustworthiness. An organization itself can enhance trust among its participants by performing the actions that its participants expect of it. For example, an organization that abruptly and without consultation ends an important employee benefit undermines employees’ trust that the organization has their best interests at heart. This abrogation of prior obligations may in turn lead individuals to behave in less trustworthy ways, and lead others to have lower levels of trust in each other.

How does enhancing trust promise to bring about higher levels of organizational effectiveness? Fundamentally this comes down to the value of teamwork and the burden of unnecessary transaction costs. If every expense report requires investigation, the resources spent on accountants will be much greater than in a situation where only the outlying reports are questioned. If each vice president needs to defend himself or herself against the possibility that another vice president is conspiring against him or her, then less time and energy are available for the work of the organization. If the CEO does not have high confidence that her executive team will work wholeheartedly to bring about a successful implementation of a risky investment, then the CEO will choose less risky investments.
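
To make the transaction-cost point concrete, here is a hedged sketch of an “audit only the outliers” rule for expense reports; the threshold and the sample figures are illustrative assumptions rather than a description of any real audit policy:

```python
from statistics import median

def flag_outlying_reports(amounts, multiple=3.0):
    """Return the indices of expense reports whose amounts exceed a
    multiple of the median, so only those are routed for manual review."""
    typical = median(amounts)
    return [i for i, amount in enumerate(amounts)
            if amount > multiple * typical]

# Example: six routine reports pass untouched; only the extreme one
# (index 4) is queued for investigation.
reports = [120, 95, 140, 110, 3200, 130, 105]
print(flag_outlying_reports(reports))  # [4]
```

The design choice mirrors the point in the paragraph above: review effort scales with the handful of anomalous cases rather than with the total volume of transactions, which is what makes a baseline of trust economically valuable.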

In other words, trust is crucial for collaboration and teamwork. And an organization that manages to cultivate a high level of trust among its participants is likely to perform better than one that depends primarily on supervision and enforcement.
