The US Chemical Safety Board

The Federal agency responsible for investigating chemical and petrochemical accidents in the United States is the Chemical Safety Board (link). The mission of the Board is described in these terms:

The CSB is an independent federal agency charged with investigating industrial chemical accidents. Headquartered in Washington, DC, the agency’s board members are appointed by the President and confirmed by the Senate.
The CSB’s mission is to “drive chemical safety change through independent investigation to protect people and the environment.”
The CSB’s vision is “a nation safe from chemical disasters.”
The CSB conducts root cause investigations of chemical accidents at fixed industrial facilities. Root causes are usually deficiencies in safety management systems, but can be any factor that would have prevented the accident if that factor had not occurred. Other accident causes often involve equipment failures, human errors, unforeseen chemical reactions or other hazards. The agency does not issue fines or citations, but does make recommendations to plants, regulatory agencies such as the Occupational Safety and Health Administration (OSHA) and the Environmental Protection Agency (EPA), industry organizations, and labor groups. Congress designed the CSB to be non-regulatory and independent of other agencies so that its investigations might, where appropriate, review the effectiveness of regulations and regulatory enforcement.

CSB was legislatively conceived on the model of the National Transportation Safety Board, and its sole responsibility is to investigate major chemical accidents in the United States and report its findings to the public. It is not subordinate to OSHA or EPA, but it collaborates with those (and other) Federal agencies as appropriate (link). It has no enforcement powers; its function is to investigate, report, and recommend when serious chemical or petrochemical accidents have occurred.

One of its most important investigations concerned the March 23, 2005 explosion at BP’s Texas City refinery. The blast killed 15 workers, injured more than 170 others, and destroyed much of the refinery’s infrastructure. CSB conducted an extensive investigation into the “root causes” of the accident and assigned substantial responsibility to BP’s corporate management of the facility. Here is the final report of that investigation (link), and here is a video prepared by CSB summarizing its main findings (link).

The key findings of the CSB report focus on the responsibility of BP management for the accident. Here is a summary of the CSB assessment of root causes:

The BP Texas City tragedy is an accident with organizational causes embedded in the refinery’s culture. The CSB investigation found that organizational causes linked the numerous safety system failures that extended beyond the ISOM unit. The organizational causes of the March 23, 2005, ISOM explosion are

  • BP Texas City lacked a reporting and learning culture. Reporting bad news was not encouraged, and often Texas City managers did not effectively investigate incidents or take appropriate corrective action.
  • BP Group lacked focus on controlling major hazard risk. BP management paid attention to, measured, and rewarded personal safety rather than process safety.
  • BP Group and Texas City managers provided ineffective leadership and oversight. BP management did not implement adequate safety oversight, provide needed human and economic resources, or consistently model adherence to safety rules and procedures.
  • BP Group and Texas City did not effectively evaluate the safety implications of major organizational, personnel, and policy changes.

Underlying almost all of these failures to manage this complex process with a priority on “process safety” rather than personal safety alone was a corporate mandate for cost reduction:

In late 2004, BP Group refining leadership ordered a 25 percent budget reduction “challenge” for 2005. The Texas City Business Unit Leader asked for more funds based on the conditions of the Texas City plant, but the Group refining managers did not, at first, agree to his request. Initial budget documents for 2005 reflect a proposed 25 percent cutback in capital expenditures, including on compliance, HSE, and capital expenditures needed to maintain safe plant operations.[208] The Texas City Business Unit Leader told the Group refining executives that the 25 percent cut was too deep, and argued for restoration of the HSE and maintenance-related capital to sustain existing assets in the 2005 budget. The Business Unit Leader was able to negotiate a restoration of less than half the 25 percent cut; however, he indicated that the news of the budget cut negatively affected workforce morale and the belief that the BP Group and Texas City managers were sincere about culture change. (176)

And what about corporate accountability? What did BP have to pay in recompense for its faulty management of the Texas City refinery and the subsequent damages to workers and local residents? The answer is: remarkably little. OSHA assessed a fine of $50.6 million for BP’s violations of safety regulations (link, link), and the company committed to spend at least $500 million on corrective steps within the plant to protect the safety of workers. This was a record fine at the time; and yet it might very well be seen by BP corporate executives as a modest cost of doing business in this industry. It does not seem to be of the magnitude that would lead to fundamental change of culture, action, and management within the company.

 
Following an urgent recommendation from the CSB, BP commissioned a major independent review of safety at all five of its US-based refineries. This study became the Baker Panel REPORT OF THE BP U.S. REFINERIES INDEPENDENT SAFETY REVIEW PANEL (JANUARY 2007) (link). The Baker Panel consisted of highly qualified experts on industrial and technological safety who were well positioned to assess BP’s safety management and culture in the operation of its five US-based refineries. The Panel was specifically directed to refrain from assigning responsibility for the Texas City disaster and to focus its efforts on assessing the safety culture and management systems then in place in BP’s five refineries. Here are some central findings:

  • Based on its review, the Panel believes that BP has not provided effective process safety leadership and has not adequately established process safety as a core value across all its five U.S. refineries.
  • BP has not always ensured that it identified and provided the resources required for strong process safety performance at its U.S. refineries. Despite having numerous staff at different levels of the organization that support process safety, BP does not have a designated, high-ranking leader for process safety dedicated to its refining business.
  • The Panel also found that BP did not effectively incorporate process safety into management decision-making. BP tended to have a short-term focus, and its decentralized management system and entrepreneurial culture have delegated substantial discretion to U.S. refinery plant managers without clearly defining process safety expectations, responsibilities, or accountabilities.
  • BP has not instilled a common, unifying process safety culture among its U.S. refineries.
  • While all of BP’s U.S. refineries have active programs to analyze process hazards, the system as a whole does not ensure adequate identification and rigorous analysis of those hazards.
  • The Panel’s technical consultants and the Panel observed that BP does have internal standards and programs for managing process risks. However, the Panel’s examination found that BP’s corporate safety management system does not ensure timely compliance with internal process safety standards and programs at BP’s five U.S. refineries.
  • The Panel also found that BP’s corporate safety management system does not ensure timely implementation of external good engineering practices that support and could improve process safety performance at BP’s five U.S. refineries. (Summary of findings, xii-xiii)
These findings largely validate and support the critical assessment of BP’s safety management practices in the CSB report.

It seems clear that an important part of the substantial improvement in aviation safety over the past fifty years is the effective investigation and reporting provided by the NTSB. NTSB is an authoritative and respected bureau of experts whom the public trusts when it comes to discovering the causes of aviation disasters. The CSB has a much shorter institutional history — it was authorized in 1990 and began operations in 1998 — but we need to ask a parallel question here as well: Does the CSB provide a strong lever for improving safety practices in the chemical and petrochemical industries through its accident investigations; or are industry actors largely free to continue their poor management practices indefinitely, safe in the knowledge that large chemical accidents are rare and the costs of occasional liability judgments are manageable?

Pervasive organizational and regulatory failures

It is intriguing to observe how pervasive organizational and regulatory failures are in our collective lives. Once you are sensitized to these factors, you see them everywhere. A good example is in the business section of today’s print version of the New York Times, August 1, 2019. There are at least five stories in this section that reflect the consequences of organizational and regulatory failure.

The first and most obvious story is one that has received frequent mention in Understanding Society: the Boeing 737 Max disaster. In a story titled “FAA oversight of Boeing scrutinized”, the reporters describe a Senate hearing on FAA oversight held earlier this week. Members of the Senate Appropriations Committee questioned the process the FAA currently uses to certify new aircraft.

Citing the Times story, Ms. Collins raised concerns over “instances in which FAA managers appeared to be more concerned with Boeing’s production timeline, rather than the safety recommendations of its own engineers.”

Senator Jack Reed referred to the need for a culture change to rebalance the relationship between regulator and industry. Agency officials continued to defend the certification process, which delegates 96% of the work of certification to the manufacturer.

This story highlights two common sources of organizational and regulatory failure. The first is “production pressure” coming from the owner of a risky process, involving timing, supply of product, and profitability. This pressure pushes the organization hard to achieve its goals, often at the expense of safety and sound design. The second factor is the structural imbalance between powerful companies running complex and costly processes and the safety agencies tasked with overseeing and regulating their behavior. The regulatory agency, in this case the FAA, is under-resourced and lacks the expert staff needed to carry out serious, in-depth technical oversight. The article does not identify a third factor, noted in prior posts on the Boeing disaster: the influence that Boeing has on legislators, government officials, and the executive branch.

A second relevant story (on the same page as the Boeing story) concerns charges filed in Germany against the former CEO of Audi for his role in the vehicle emissions scandal, part of Volkswagen’s long-standing, deliberate effort to deceive regulators about the emissions characteristics of its diesel engines and exhaust systems. The charges against the Audi executive involve ordering the development of software designed to cheat diesel emissions testing. This ongoing story is primarily one of corporate dysfunction, in which corporate leaders engaged in unethical and dishonest activities on behalf of the company. Regulatory failure is not a prominent part of the story, because the deception was so carefully calculated that it is difficult to see how normal regulatory testing could have defeated it. Here the pressing problem is to understand how professional, experienced executives could have been led to undertake such actions, and how the corporation was vulnerable to this kind of improper behavior at multiple levels. Presumably there were staff throughout these automobile companies who were aware of the improper behavior; the story quotes a mid-level staff person who writes in an email that “we won’t make it without a few dirty tricks.” So the difficult question for these corporations is why their internal systems were inadequate to detect and stop dangerously improper behavior. The costs to Volkswagen and Audi in liability judgments and government penalties are vast, and surely outweigh any possible gains from the deception; in the United States alone these costs exceed $22 billion.

A similar story, this time from the tech industry, concerns Cisco Systems’ settlement of civil claims “that it sold video surveillance technology that it knew had a significant security flaw to federal, state and local government agencies.” Here again we find a case of corporate dishonesty concerning some of its central products, leading to public exposure of the malfeasance. The hard question is: what systems are in place at companies like Cisco to ensure ethical and honest presentation of the characteristics and potential defects of the products they sell? The imperative of always maximizing profits and reducing costs leads to many kinds of dysfunction within organizations, but this is a well understood hazard. So profit-based companies need to have active and effective programs in place that encourage and enforce honest and safe practices by managers, executives, and frontline workers. Plainly those programs broke down at Cisco, Volkswagen, and Audi. (One of the very useful features of Tom Beauchamp’s book Case Studies in Business, Society, and Ethics is the light Beauchamp sheds through case studies on the genesis of unethical and dishonest behavior within a corporate setting.)

Now we go on to Christopher Flavelle’s story about home-building in flood zones. From a social point of view, it makes no sense to continue to build homes, hotels, and resorts in flood zones. The increasing destructiveness of violent storms and extreme weather events has been evident at least since the devastation of Hurricane Katrina. Flavelle writes:

There is overwhelming scientific consensus that rising temperatures will increase the frequency and severity of coastal flooding caused by hurricanes, storm surges, heavy rain and tidal floods. At the same time there is the long-term threat of rising seas pushing the high-tide line inexorably inland.

However, Flavelle reports research by Climate Central showing that in eight states the rate of home-building in flood zones since 2010 has exceeded the rate of home-building outside flood zones. So what are the institutional and behavioral factors that produce this amazingly perverse outcome? The article points to the incentives of local municipalities to generate property-tax revenues, and of potential homeowners drawn by urban sprawl and the desire for second homes on the water. Here is a tragically short-sighted development official in Galveston who finds that “the city has been able to deal with the encroaching water, through the installation of pumps and other infrastructure upgrades”: “You can build around it, at least for the circumstances today. It’s really not affected the vitality of things here on the island at all.” The factor that is not emphasized in this article is the role played by the National Flood Insurance Program in the problem of coastal (and riverine) development. If flood insurance rates were calculated in terms of the true riskiness of the proposed residence, hotel, or resort, then it would no longer be economically attractive to do the development. But, as the article makes clear, local officials do not like that answer because it interferes with “development” and property-tax growth. ProPublica has an excellent 2013 story on the perverse incentives created by the National Flood Insurance Program and the inequitable benefits it provides to wealthier home-owners and developers (link). Here is an article by Christine Klein and Sandra Zellmer in the SMU Law Review on the dysfunctions of Federal flood policy (link):

Taken together, the stories reveal important lessons, including the inadequacy of engineered flood control structures such as levees and dams, the perverse incentives created by the national flood insurance program, and the need to reform federal leadership over flood hazard control, particularly as delegated to the Army Corps of Engineers.

Here is a final story from the business section of the New York Times illustrating organizational and regulatory dysfunctions — this time from the interface between the health industry and big tech. The story describes an effort by DeepMind researchers to use artificial intelligence techniques to provide early diagnosis of hard-to-predict medical conditions like “acute kidney injury” (AKI). The approach proceeds by analyzing large numbers of patient medical records and attempting to identify precursor conditions that would predict the occurrence of AKI. The primary analytical tool mentioned in the article is the set of algorithms associated with neural networks. In this instance the organizational / regulatory dysfunction is latent rather than explicit, and it has to do with patient privacy. DeepMind is a business unit within Alphabet, Google’s parent company, and DeepMind researchers gained access to large volumes of patient data from the UK National Health Service. There is now regulatory concern in both the UK and the US about the privacy of patients whose data may wind up in the DeepMind analysis and ultimately in Google’s direct control. “Some critics question whether corporate labs like DeepMind are the right organization to handle the development of technology with such broad implications for the public.” Here the issue is a complicated one. It is of course a good thing to be able to diagnose disorders like AKI in time to correct them. But the misuse and careless custody of user data by numerous big tech companies, Facebook especially, suggests that sensitive personal data like medical files need to be carefully secured by effective legislation and regulation. And so far the regulatory system appears inadequate for the protection of individual privacy in a world of massive databases and largescale computing capabilities. The recent $5 billion FTC settlement imposed on Facebook, large as it is, may not suffice to change the company’s business practices (link).

(I didn’t find anything in the sports section today that illustrates organizational and regulatory dysfunction, but of course these kinds of failures occur in professional and college sports as well. Think of doping scandals in baseball, cycling, and track and field, sexual abuse scandals in gymnastics and swimming, and efforts by top college football programs to evade NCAA regulations on practice time and academic performance.)

Soviet nuclear disasters: Kyshtym

The 1986 meltdown of reactor number 4 at the Chernobyl Nuclear Power Plant was the greatest nuclear disaster the world has yet seen. Less well known is the Kyshtym disaster of 1957, a catastrophic explosion in an underground nuclear waste storage facility at the Mayak plutonium production complex that released massive amounts of radioactive material across the Eastern Ural region of the USSR. Information about the disaster was tightly restricted by Soviet authorities, with predictably bad consequences.

Zhores Medvedev was one of the first qualified scientists to provide information and hypotheses about the Kyshtym disaster. His book Nuclear Disaster in the Urals was written while he was in exile in Great Britain and appeared in 1980. It is fascinating that his reasoning was based on his study of ecological, biological, and environmental research done by Soviet scientists between 1957 and 1980. Medvedev was able to piece together the extent of the contamination and the general nature of its cause from data on radioactive contamination of lakes and streams in the region that had been included incidentally in scientific reports of the period.

It is very interesting to find that scientists in the United States were surprisingly skeptical about Medvedev’s assertions. W. Stratton et al. published a review analysis in Science in 1979 (link) that found Medvedev’s reasoning unpersuasive.

A steam explosion of one tank is not inconceivable but is most improbable, because the heat generation rate from a given amount of fission products is known precisely and is predictable. Means to dissipate this heat would be a part of the design and could be made highly reliable. (423)

They offer an alternative hypothesis about any possible radioactive contamination in the Kyshtym region — the handful of multimegaton nuclear weapons tests conducted by the USSR in the Novaya Zemlya area.

We suggest that the observed data can be satisfied by postulating localized fallout (perhaps with precipitation) from explosion of a large nuclear weapon, or even from more than one explosion, because we have no limits on the length of time that fallout continued. (425)

And they consider weather patterns during the relevant time period to argue that these tests could have been the source of the radiation contamination identified by Medvedev. Novaya Zemlya is over 1,000 miles north of Kyshtym (some 20 degrees of latitude), so fallout from the nuclear tests is a possible alternative hypothesis, but a farfetched one. They conclude:

We can only conclude that, though a radiation release incident may well be supported by the available evidence, the magnitude of the incident may have been grossly exaggerated, the source chosen uncritically, and the dispersal mechanism ignored. Even so we find it hard to believe that an area of this magnitude could become contaminated and the event not discussed in detail or by more than one individual for more than 20 years. (425)

The heart of their skepticism rests on an indefensible assumption: that Soviet science, engineering, and management were entirely capable of designing and implementing a safe system for nuclear waste storage. They were perhaps right about the scientific and engineering capabilities of the Soviet system, but the management systems in place were woefully inadequate. Their account assumed a straightforward application of engineering knowledge to the problem, and it failed to take into account the defects of organization and oversight that were rampant within Soviet industrial systems. And in the end the core of Medvedev’s claims has been validated.

Another report, compiled by Los Alamos scientists and released in 1982, concluded unambiguously that Medvedev was mistaken and that the widespread ecological devastation in the region resulted from small and gradual processes of contamination rather than a massive explosion of waste materials (link). Here is the conclusion put forward by the study’s authors:

What then did happen at Kyshtym? A disastrous nuclear accident that killed hundreds, injured thousands, and contaminated thousands of square miles of land? Or, a series of relatively minor incidents, embellished by rumor, and severely compounded by a history of sloppy practices associated with the complex? The latter seems more highly probable.

So Medvedev is dismissed.

After the collapse of the USSR, voluminous records about the Kyshtym disaster became available from secret Soviet files, and those records make it plain that US scientists had badly misjudged the nature of the event. Medvedev was much closer to the truth than Stratton and his colleagues or the authors of the Los Alamos report.

A scientific report based on Soviet-era documents that were released after the fall of the Soviet Union appeared in the Journal of Radiological Protection in 2017 (A V Akleyev et al 2017; link). Here is their brief description of the accident:

Starting in the earliest period of Mayak PA activities, large amounts of liquid high-level radioactive waste from the radiochemical facility were placed into long-term controlled storage in metal tanks installed in concrete vaults. Each full tank contained 70–80 tons of radioactive wastes, mainly in the form of nitrate compounds. The tanks were water-cooled and equipped with temperature and liquid-level measurement devices. In September 1957, as a result of a failure of the temperature-control system of tank #14, cooling-water delivery became insufficient and radioactive decay caused an increase in temperature followed by complete evaporation of the water, and the nitrate salt deposits were heated to 330 °C–350 °C. The thermal explosion of tank #14 occurred on 29 September 1957 at 4:20 pm local time. At the time of the explosion the activity of the wastes contained in the tank was about 740 PBq [5, 6]. About 90% of the total activity settled in the immediate vicinity of the explosion site (within distances less than 5 km), primarily in the form of coarse particles. The explosion gave rise to a radioactive plume which dispersed into the atmosphere. About 2 × 10⁶ Ci (74 PBq) was dispersed by the wind (north-northeast direction with wind velocity of 5–10 m s⁻¹) and caused the radioactive trace along the path of the plume [5]. Table 1 presents the latest estimates of radionuclide composition of the release used for reconstruction of doses in the EURT area. The mixture corresponded to uranium fission products formed in a nuclear reactor after a decay time of about 1 year, with depletion in ¹³⁷Cs due to a special treatment of the radioactive waste involving the extraction of ¹³⁷Cs [6]. (R20-21)

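As a quick check of the arithmetic in this passage, using the standard conversion 1 Ci = 3.7 × 10¹⁰ Bq, the activity carried off by the plume works out to

$$
2 \times 10^{6}\ \text{Ci} \times 3.7 \times 10^{10}\ \tfrac{\text{Bq}}{\text{Ci}} = 7.4 \times 10^{16}\ \text{Bq} = 74\ \text{PBq},
$$

which is indeed about 10% of the 740 PBq stored in the tank, consistent with the statement that roughly 90% of the total activity settled within 5 km of the explosion site.
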
Here is the region of radiation contamination (EURT) that Akleyev et al identify:

This region covers a large area of about 23,000 square kilometers (8,880 square miles). Plainly Akleyev et al. describe a massive disaster: a very large explosion in an underground nuclear waste storage facility, large-scale dispersal of nuclear materials, and the evacuation of the population throughout a large region. This is very close to the description provided by Medvedev.

A somewhat surprising finding of the Akleyev study is that the exposed population did not show dramatically worse health outcomes and mortality relative to unexposed populations. For example, “Leukemia mortality rates over a 30-year period after the accident did not differ from those in the group of unexposed people” (R30). Their epidemiological study for cancers overall likewise indicates only a small effect of accidental radiation exposure on cancer incidence:

The attributable risk (AR) of solid cancer incidence in the EURTC, which gives the proportion of excess cancer cases out of the sum of excess and baseline cases, calculated according to the linear model, made up 1.9% over the whole follow-up period. Therefore, only 27 cancer cases out of 1426 could be associated with accidental radiation exposure of the EURT population. AR is highest in the highest dose groups (250–500 mGy and >500 mGy) and exceeds 17%.

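The attributable-risk figure quoted here can be reproduced directly from the case counts given in the passage:

$$
AR = \frac{\text{excess cases}}{\text{excess cases} + \text{baseline cases}} = \frac{27}{1426} \approx 0.019 = 1.9\%.
$$

In other words, fewer than 2 in 100 of the solid cancer cases observed in the exposed cohort over the whole follow-up period are attributable to the accident.
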
So why did the explosion occur? James Mahaffey examines the case in detail in Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima. Here is his account:

In the crash program to produce fissile bomb material, a great deal of plutonium was wasted in the crude separation process. Production officials decided that instead of being dumped irretrievably into the river, the plutonium that had failed to precipitate out, remaining in the extraction solution, should be saved for future processing. A big underground tank farm was built in 1953 to hold processed fission waste. Round steel tanks were installed in banks of 20, sitting on one large concrete slab poured at the bottom of an excavation, 27 feet deep. Each bank was equipped with a heat exchanger, removing the heat buildup from fission-product decay using water pipes wrapped around the tanks. The tanks were then buried under a backfill of dirt. The tanks began immediately to fill with various waste solutions from the extraction plant, with no particular distinction among the vessels. The tanks contained all the undesirable fission products, including cobalt-60, strontium-90, and cesium-137, along with unseparated plutonium and uranium, with both acetate and nitrate solutions pumped into the same volume. One tank could hold probably 100 tons of waste product. 

In 1956, a cooling-water pipe broke leading to one of the tanks. It would be a lot of work to dig up the tank, find the leak, and replace the pipe, so instead of going to all that trouble, the engineers in charge just turned off the water and forgot about it. 

A year passed. Not having any coolant flow and being insulated from the harsh Siberian winter by the fill dirt, the tank retained heat from the fission-product decay. Temperature inside reached 660° Fahrenheit, hot enough to melt lead and cast bullets. Under this condition, the nitrate solutions degraded into ammonium nitrate, or fertilizer, mixed with acetates. The water all boiled away, and what was left was enough solidified ANFO explosive to blow up Sterling Hall several times, being heated to the detonation point and laced with dangerous nuclides. [189]

Sometime before 11:00 P.M. on Sunday, September 29, 1957, the bomb went off, throwing a column of black smoke and debris reaching a kilometer into the sky, accented with larger fragments burning orange-red. The 160-ton concrete lid on the tank tumbled upward into the night like a badly thrown discus, and the ground thump was felt many miles away. Residents of Chelyabinsk rushed outside and looked at the lighted display to the northwest, as 20 million curies of radioactive dust spread out over everything sticking above ground. The high-level wind that night was blowing northeast, and a radioactive plume dusted the Earth in a tight line, about 300 kilometers long. This accident had not been a runaway explosion in an overworked Soviet production reactor. It was the world’s first “dirty bomb,” a powerful chemical explosive spreading radioactive nuclides having unusually high body burdens and guaranteed to cause havoc in the biosphere. The accidentally derived explosive in the tank was the equivalent of up to 100 tons of TNT, and there were probably 70 to 80 tons of radioactive waste thrown skyward. (KL 5295)

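Mahaffey’s figure of 20 million curies is, to within rounding, the same total activity that the Akleyev study reports for the contents of the tank:

$$
20 \times 10^{6}\ \text{Ci} \times 3.7 \times 10^{10}\ \tfrac{\text{Bq}}{\text{Ci}} = 7.4 \times 10^{17}\ \text{Bq} = 740\ \text{PBq}.
$$

The two accounts agree on the size of the inventory; they differ mainly in that Akleyev et al. attribute only about a tenth of it (74 PBq) to the long-range plume, with the rest settling near the site.
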
So what were the primary organizational and social causes of this disaster? One is the haste in nuclear design and construction created by Stalin’s insistence on moving the Soviet nuclear weapons program forward as rapidly as possible. As is evident in the Chernobyl case as well, the political pressures on engineers and managers that followed from these priorities often led to disastrous decisions and actions. A second is the institutionalized system of secrecy that surrounded industry generally, the military specifically, and the nuclear industry most especially. A third is the casual attitude taken by Soviet officials towards the health and wellbeing of the population. And a final cause highlighted by Mahaffey’s account is the low level of attention given at the plant level to the safety and maintenance of highly risky facilities. Stratton et al. based their analysis on the fact that the heat-generating characteristics of nuclear waste were well understood and that effective means existed for controlling those risks. That may be, but what they failed to anticipate is that these risks would be fundamentally disregarded on the ground and in the supervisory system above the Mayak complex.

(It is interesting to note that Mahaffey himself underestimates the amount of information that is now available about the effects of the disaster. He writes that “studies of the effects of this disaster are extremely difficult, as records do not exist, and previous residents are hard to track down” (kl 5330). But the Akleyev study mentioned above provides extensive health data about the affected population, made possible by records that were collected during the Soviet era but kept concealed until after the collapse of the USSR.)

 

Nuclear power plant siting decisions

Readers may be skeptical about the practical importance of the topic of nuclear power plant siting decisions, since very few new nuclear plants have been proposed or approved in the United States for decades. However, the topic is one for which there is an extensive historical record, and it is a process that illuminates the challenge government faces in balancing risk and benefit, private gain and public cost. Moreover, siting inherently raises issues that are of concern both to the public in general (throughout a state or region of the country) and to the citizens who live in close proximity to the recommended site. The NIMBY problem is unavoidable — it is someone’s backyard, and it is a worrisome neighbor. So this is a good case through which to think creatively about the responsibilities of government for ensuring the public good in the face of risky private activity, and about the detailed institutions of regulation and oversight that would make wise public outcomes more likely.

I’ve been thinking quite a bit recently about technology failure, government regulation, and risky technologies, and there is a lot to learn about these subjects by looking at the history of nuclear power in the United States. Two books in particular have been interesting to me. Neither is particularly recent, but both shed valuable light on the public-policy context of nuclear decision-making. The first is Joan Aron’s account of the processes that led to the cancellation of the Shoreham nuclear power plant on Long Island (Licensed To Kill?: The Nuclear Regulatory Commission and the Shoreham Power Plant), and the second is Donald Stever, Jr.’s account of the licensing process for the Seabrook nuclear power plant in Seabrook and The Nuclear Regulatory Commission: The Licensing of a Nuclear Power Plant. Both are fascinating books and well worthy of study as a window into government decision-making and regulation. Stever’s book is especially interesting because it is a highly capable analysis of the licensing process, both at the state level and at the level of the NRC, and because Stever himself was a participant: as an assistant attorney general in New Hampshire, he was assigned the role of Counsel for the Public throughout the state’s licensing proceedings.

Joan Aron’s 1997 book Licensed to Kill? is a detailed case study of the effort to establish the Shoreham nuclear power plant on Long Island in the 1980s. LILCO had proposed the plant to meet rising demand for electricity as Long Island’s population and energy use grew. And Long Island is a long, narrow island on which traffic congestion at certain times of day is legendary. Evacuation planning was both crucial and, in the end, perhaps impossible.

This is an intriguing story, because it led eventually to the NRC’s cancellation of the plant’s operating license after construction was complete. And the cancellation resulted largely from the effectiveness of public opposition and interest-group political pressure. Aron provides a detailed account of the decisions made by the public utility company LILCO, the AEC and NRC, New York state and local authorities, and citizen activist groups that led to the costliest failed investment in the history of nuclear power in the United States.

In 1991 the NRC made the decision to rescind the operating license for the Shoreham plant, after it had been completed at a cost of over $5 billion but before it had delivered a single kilowatt-hour of commercial electricity.

Aron’s basic finding is that the project collapsed in a costly fiasco because of a loss of trust among the diverse stakeholders: LILCO, the Long Island public, state and local agencies and officials, scientific experts, and the Nuclear Regulatory Commission. The Long Island tabloid Newsday played a role as well, sensationalizing every step and contributing to public distrust of the process. Aron finds that the NRC and LILCO underestimated the need for full analysis of the safety and emergency preparedness issues raised by the plant’s design, including the problem of evacuating, in the event of disaster, a largely inaccessible island with a population of two million. LILCO’s decision to upscale the capacity of the plant in the middle of the process contributed to the failure as well. And the Three Mile Island disaster in 1979 gave new urgency to the concerns of citizens living within fifty miles of the Shoreham site about the risks of a nuclear plant.

As we have seen, Shoreham failed to operate because of intense public opposition, in which the governor played a key role, inspired in part by the utility’s management incompetence and distrust of the NRC. Inefficiencies in the NRC licensing process were largely irrelevant to the outcome. The public by and large ignored NRC’s findings and took the nonsafety of the plant for granted. (131)

The most influential issue was public safety: would it be possible to perform an orderly evacuation of the population near the plant in the event of a serious emergency? Clarke and Perrow (included in Helmut Anheier, ed., When Things Go Wrong: Organizational Failures and Breakdowns) provide an extensive analysis of the failures that occurred during tests of the emergency evacuation plan designed by LILCO. As they demonstrate, the errors that occurred during the evacuation test were both “normal” and potentially deadly.

One thing that comes out of both books is the fact that the commissioning and regulatory processes are far from ideal examples of the rational development of sound public policy. Rather, business interests, institutional shortcomings, lack of procedural knowledge by committee chairs, and dozens of other factors lead to outcomes that appear to fall far short of what the public needs. But in addition to ordinary intrusions into otherwise rational policy deliberations, there are other reasons to believe that decision-making is more complicated and less rational than a simple model of rational public policy formation would suggest. Every decision-maker brings a set of “framing assumptions” about the reality concerning which he or she is deliberating. These framing assumptions impose an unavoidable kind of cognitive bias into collective decision-making. A business executive brings a worldview to the question of regulation of risk that is quite different from that of an ecologist or an environmental activist. This is different from the point often made about self-interest; our framing assumptions do not feel like expressions of self-interest, but rather simply secure convictions about how the world works and what is important in the world. This is one reason why the work of social scientists like Scott Page (The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies) on the value of diversity in problem-solving and decision-making is so important: by bringing multiple perspectives and cognitive frames to a problem, we are more likely to get a balanced decision that gives appropriate weight to the legitimate interests and concerns of all involved.

Here is an interesting concrete illustration of cognitive bias (with a generous measure of self-interest as well) in Stever’s discussion of siting decisions for nuclear power plants:

From the time a utility makes the critical in-house decision to choose a site, any further study of alternatives is necessarily negative in approach. Once sufficient corporate assets have been sunk into the chosen site to produce data adequate for state site review, the company’s management has a large enough stake in it to resist suggestions that a full study of site alternatives be undertaken as a part of the state (or for that matter as a part of the NEPA) review process. Hence, the company’s methodological approach to evaluating alternates to the chosen site will always be oriented toward the desired conclusion that the chosen site is superior. (Stever 1980: 30)

This is the bias of sunk costs, both inside the organization and in the cognitive frames of independent decision makers in state agencies.

Stever’s central point here is a very important one: the pace of site selection favors the energy company’s choices over the concerns and preferences of affected groups, because the company has already dedicated substantial resources to developing its preferred site proposal. Likewise, scientific experts have a difficult time making their concerns about habitat or traffic flow heard in this context.

But here is a crucial thing to observe: the siting decision is only one of dozens in the development of a new power plant, which is itself only one of hundreds of government / business decisions made every year. What Stever describes is a structural bias in the regulatory process, not a one-off flaw. At bottom, this is the task that government faces when considering the creation of a new nuclear power plant: “to assess the various public and private costs and benefits of a site proposed by a utility” (32); and Stever’s analysis makes it doubtful that existing public processes do this in a consistent and effective way. Stever argues that government needs to have more of a role in site selection, not less (as pro-market advocates would prefer): “The kind of social and environmental cost accounting required for a balanced initial assessment of, and development of, alternative sites should be done by a public body acting not as a reviewer of private choices, but as an active planner” (32).

Notice how this scheme shifts the pace and process from the company to the relevant state agency. The preliminary site selection and screening is done by a state site-planning agency, with input then invited from the utility companies and interest groups, along with a formal environmental assessment. This places the power squarely in the hands of the government agency rather than the private owner of the plant — reflecting the overriding interest the public has in ensuring health, safety, and environmental controls.

Stever closes a chapter on regulatory issues with these cogent recommendations (38-39):

  1. Electric utility companies should not be responsible for decisions concerning early nuclear-site planning.
  2. Early site identification, evaluation, and inventorying are public responsibilities that should be undertaken by a public agency, with formal participation by utilities and interest groups, based upon criteria developed by the state legislature.
  3. Prior to the use of a particular site, the state should prepare a complete environmental assessment for it, and hold adjudicatory hearings on contested issues.
  4. Further effort should be made toward assessing the public risk of nuclear power plant sites.
  5. In areas like New England, characterized by geographically small states and high energy demand, serious efforts should be made to develop regional site planning and evaluation.
  6. Nuclear licensing reform should focus on the quality of decision-making.
  7. There should be a continued federal presence in nuclear site selection, and the resolution of environmental problems should not be delegated entirely to the states. 

(It is very interesting to me that I have not been able to locate a full organizational study of the Nuclear Regulatory Commission itself.)

The Morandi Bridge collapse and regulatory capture

Lower image: Eugenio Ceroni and Luca Cozzi, Ponte Morandi – Autopsia di una strage

A recurring topic in Understanding Society is the question of the organizational causes that lie in the background of major accidents and technological disasters. One such disaster is the catastrophic collapse of the Morandi Bridge in Genoa in August 2018, which resulted in the deaths of 43 people. Was this a technological failure, a design failure — or, more importantly, a failure in which private and public organizational factors led to the disaster?

A major story in the New York Times on March 5, 2019 (link) makes it clear that social and organizational causes were central to this horrendous failure. (What could be more terrifying than having the highway bridge under your vehicle collapse to the earth 150 feet beneath you?) In this case it is evident from the Times coverage that a major cause of the disaster was the relationship between Autostrade per l’Italia, the private company that managed the bridge and derived enormous profit from it, and the government ministries responsible for regulating and supervising the safe operation of highways and bridges.

In a sign of the arrogance of wealth and power involved in the relationship, the Benetton family threatened a multimillion-dollar lawsuit against the economist Marco Ponti, who had served on an expert panel advising the government and had made strong statements about the one-sided relationship. The threat was not acted upon, but the abuse of power is clear.

This appears to be a textbook case of “regulatory capture”, a situation in which the private owners of a risky enterprise or activity use their economic power to influence or intimidate the government regulatory agencies that nominally oversee their activities. “Autostrade reaped huge profits and acquired so much power that the state became a largely passive regulator” (NYT March 5, 2019). Moreover, independent governmental oversight was crippled by the fact that “the company effectively regulated itself — because Autostrade’s parent company owned the inspection company responsible for safety checks on the Morandi Bridge” (NYT). The Times quotes Carlo Scarpa, an economics professor at the University of Brescia:

Any investor would have been worried about bidding. The Benettons, though, knew the system and they understood that the Ministry of Infrastructure and Transport, which was supposed to supervise the whole thing, was weak. They were able to calculate the weight the company would have in the political arena. (NYT March 5, 2019)

And this seems to have worked out as the family expected:

Autostrade became a political powerhouse, acquiring clout that the Ministry of Infrastructure and Transport, perpetually underfunded and employing a small fraction of the staff, could not match. (NYT March 5, 2019)

The story notes that the private company made a great deal of money from this contract, but that the state also benefited financially. “Autostrade has poured billions of euros into state coffers, paying nearly 600 million euros a year in corporate taxes, V.A.T. and license fees.”

The story also surfaces other social factors that played a role in the disaster, including opposition by Genoa residents to the construction involved in creating a potential bypass to the bridge.

Here is what the Times story has to say about the inspections that occurred:

Beyond fixing blame for the bridge collapse, a central question of the Morandi tragedy is what happened to safety inspections. The answer is that the inspectors worked for Autostrade more than for the state. For decades, Spea Engineering, a Milan-based company, has performed inspections on the bridge. If nominally independent, Spea is owned by Autostrade’s parent company, Atlantia, and Autostrade is also Spea’s largest customer. Spea’s offices in Rome and elsewhere are housed inside Autostrade. One former bridge design engineer for Spea, Giulio Rambelli, described Autostrade’s control over Spea as “absolute.” (NYT March 5, 2019)

The story notes that this relationship raises the possibility of conflicts of interest that are prohibited in other countries. The story quotes Professor Giuliano Fonderico: “All this suggests a system failure.”

The failure appears to be first and foremost a failure of the state to fulfill its obligations of regulation and oversight of dangerous activities. By ceding any real and effective system of safety inspection to the business firms who are benefitting from the operations of the bridge, the state has essentially given up its responsibility of ensuring the safety of the public.

It is also worth underlining the point made in the article about the huge mismatch between the capacities of the business firms in question and the agencies nominally charged with regulating and overseeing them. This is a failure at a higher, systemic level, since it highlights the power imbalance that almost always exists between large corporations and the government agencies charged with overseeing their activities.

Here is an editorial from the Guardian that makes some similar points (link). There do not appear to be any book-length treatments of the Morandi Bridge disaster available in English. Here is an Italian book on the subject by Eugenio Ceroni and Luca Cozzi, Ponte Morandi – Autopsia di una strage: I motivi tecnici, le colpe, gli errori. Quel che si poteva fare e non si è fatto, which appears to be a technical civil-engineering analysis of the collapse. The Kindle translate option using Bing is helpful for non-Italian readers wanting to get the thrust of this short book. In that engineering analysis, inadequate inspection and incomplete maintenance remediation emerge as key factors in the collapse.

Philosophy and the study of technology failure

image: Adolf von Menzel, The Iron Rolling Mill (Modern Cyclopes)
 

Readers may have noticed that my current research interests have to do with organizational dysfunction and largescale technology failures. I am interested in probing the ways in which organizational failures and dysfunctions have contributed to large accidents like Bhopal, Fukushima, and the Deepwater Horizon disaster. I’ve had to confront an important question in taking on this research interest: what can philosophy bring to the topic that would not be better handled by engineers, organizational specialists, or public policy experts?

One answer is the diversity of viewpoint that a philosopher can bring to the discussion. It is evident that technology failures invite analysis from all of these specialized experts, and more. But there is room for productive contribution from reflective observers who are not committed to any of these disciplines. Philosophers have a long history of taking on big topics outside the defined canon of “philosophical problems”, and often those engagements have proven fruitful. In this particular instance, philosophy can look at organizations and technology in a way that is more likely to be interdisciplinary, and perhaps can help to see dimensions of the problem that are less apparent from a purely disciplinary perspective.

There is also a rationale based on the terrain of the philosophy of science. Philosophers of biology have usually attempted to learn as much about the science of biology as they can manage, but they lack the level of expertise of a research biologist, and it is rare for a philosopher to make an original contribution to the scientific biological literature. Nonetheless it is clear that philosophers have a great deal to add to scientific research in biology. They can contribute to better reasoning about the implications of various theories, they can probe the assumptions about confirmation and explanation that are in use, and they can help to clarify important conceptual disagreements. Biology is in a better state because of the work of philosophers like David Hull and Elliott Sober.

Philosophers have also made valuable contributions to science and technology studies, bringing a viewpoint that incorporates insights from the philosophy of science and a sensitivity to the social groundedness of technology. STS studies have proven to be a fruitful place for interaction between historians, sociologists, and philosophers. Here again, the concrete study of the causes and context of large technology failure may be assisted by a philosophical perspective.

There is also a normative dimension to these questions about technology failure for which philosophy is well prepared. Accidents hurt people, and sometimes the causes of accidents involve culpable behavior by individuals and corporations. Philosophers have a long history of contribution to these kinds of problems of fault, law, and just management of risks and harms.

Finally, it is realistic to say that philosophy has an ability to contribute to social theory. Philosophers can offer imagination and critical attention to the problem of creating new conceptual schemes for understanding the social world. This capacity seems relevant to the problem of describing, analyzing, and explaining largescale failures and disasters.

The study of organizations and accidents is in some ways more hospitable to contributions by a philosopher than other “wicked problems” in the world around us. An accident is complicated and complex but not particularly obscure. The field is unlike quantum mechanics or climate dynamics, which are inherently difficult for non-specialists to understand. The challenge with accidents is to develop a multi-layered analysis of the causes of the accident that permits observers to have a balanced and operative understanding of the event. And this is a situation where the philosopher’s perspective is most useful. We can offer higher-level descriptions of the relative importance of different kinds of causal factors. Perhaps the role here is analogous to messenger RNA, providing a cross-disciplinary kind of communications flow. Or it is analogous to the role of philosophers of history who have offered gentle critique of the cliometrics school for its over-dependence on a purely statistical approach to economic history.

So it seems reasonable enough for a philosopher to attempt to contribute to this set of topics, even if the disciplinary expertise a philosopher brings is more weighted towards conceptual and theoretical discussions than undertaking original empirical research in the domain.

What I expect to be the central finding of this research is the idea that a pervasive and often unrecognized cause of accidents is a systemic organizational defect of some sort, and that it is enormously important to have a better understanding of common forms of these deficiencies. This is a bit analogous to a paradigm shift in the study of accidents. And this view has important policy implications. We can make disasters less frequent by improving the organizations through which technology processes are designed and managed.

System safety

An ongoing thread of posts here is concerned with organizational causes of large technology failures. The driving idea is that failures, accidents, and disasters usually have a dimension of organizational causation behind them. The corporation, research office, shop floor, supervisory system, intra-organizational information flow, and other social elements often play a key role in the occurrence of a gas plant fire, a nuclear power plant malfunction, or a military disaster. There is a tendency to look first and foremost for one or more individuals who made a mistake in order to explain the occurrence of an accident or technology failure; but researchers such as Perrow, Vaughan, Tierney, and Hopkins have demonstrated in detail the importance of broadening the lens to seek out the social and organizational background of an accident.

It seems important to distinguish between system flaws and organizational dysfunction in considering all of the kinds of accidents mentioned here. We might specify system safety along these lines. Any complex process has the potential for malfunction. Good system design means creating a flow of events and processes that make accidents inherently less likely. Part of the task of the designer and engineer is to identify the chief sources of harm inherent in the process — release of energy, contamination of food or drugs, unplanned fission in a nuclear plant — and design fail-safe processes so that these events are as unlikely as possible. Further, given the complexity of contemporary technology systems, it is critical to attempt to anticipate unintended interactions among subsystems, each of which may be functioning correctly on its own but which can combine to produce disaster in unusual but possible scenarios.

In a nuclear processing plant, for example, there is the hazard of radioactive materials being brought into proximity with each other in a way that creates unintended critical mass. Jim Mahaffey’s Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima offers numerous examples of such unintended events, from the careless handling of plutonium scrap in a machining process to the transfer of a fissionable liquid from a vessel of one shape to another. We might try to handle these risks as an organizational problem: more and better training for operatives about the importance of handling nuclear materials according to established protocols, and effective supervision and oversight to ensure that the protocols are observed on a regular basis. But it is also possible to design the material processes within a nuclear plant in a way that makes unintended criticality virtually impossible — for example, by storing radioactive solutions in containers that simply cannot be brought into close proximity with each other.

Nancy Leveson is a national expert on defining and applying principles of system safety. Her book Engineering a Safer World: Systems Thinking Applied to Safety is a thorough treatment of her thinking about this subject. She offers a number of compelling reasons for believing that safety is a system-level characteristic that requires a systems approach: the fast pace of technological change, reduced ability to learn from experience, the changing nature of accidents, new types of hazards, increasing complexity and coupling, decreasing tolerance for single accidents, difficulty in selecting priorities and making tradeoffs, more complex relationships between humans and automation, and changing regulatory and public views of safety (kl 130 ff.). Particularly important in this list is the comment about complexity and coupling: “The operation of some systems is so complex that it defies the understanding of all but a few experts, and sometimes even they have incomplete information about the system’s potential behavior” (kl 137).

Given the fact that safety and accidents are products of whole systems, she is critical of the accident methodology generally applied to serious industrial, aerospace, and chemical accidents. This methodology involves tracing the series of events that led to the outcome, and identifying one or more events as the critical cause of the accident. However, she writes:

In general, event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry. An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events. A narrow focus on technological components and pure engineering activities or a similar narrow focus on operator errors may lead to ignoring some of the most important factors in terms of preventing future accidents. (kl 452)

Here is a definition of system safety offered later in ESW in her discussion of the emergence of the concept within the defense and aerospace fields in the 1960s:

System Safety … is a subdiscipline of system engineering. It was created at the same time and for the same reasons. The defense community tried using the standard safety engineering techniques on their complex new systems, but the limitations became clear when interface and component interaction problems went unnoticed until it was too late, resulting in many losses and near misses. When these early aerospace accidents were investigated, the causes of a large percentage of them were traced to deficiencies in design, operations, and management. Clearly, big changes were needed. System engineering along with its subdiscipline, System Safety, were developed to tackle these problems. (kl 1007)

Here Leveson mixes system design and organizational dysfunctions as system-level causes of accidents. But much of her work in this book and her earlier Safeware: System Safety and Computers gives extensive attention to the design faults and component interactions that lead to accidents — what we might call system safety in the narrow or technical sense.

A systems engineering approach to safety starts with the basic assumption that some properties of systems, in this case safety, can only be treated adequately in the context of the social and technical system as a whole. A basic assumption of systems engineering is that optimization of individual components or subsystems will not in general lead to a system optimum; in fact, improvement of a particular subsystem may actually worsen the overall system performance because of complex, nonlinear interactions among the components. (kl 1007) 

Overall, then, it seems clear that Leveson believes that both organizational features and technical system characteristics are part of the systems that created the possibility for accidents like Bhopal, Fukushima, and Three Mile Island. Her own accident model, STAMP (Systems-Theoretic Accident Model and Processes), designed to help identify the causes of accidents, emphasizes both kinds of system properties.

Using this new causality model … changes the emphasis in system safety from preventing failures to enforcing behavioral safety constraints. Component failure accidents are still included, but our conception of causality is extended to include component interaction accidents. Safety is reformulated as a control problem rather than a reliability problem. (kl 1062)

In this framework, understanding why an accident occurred requires determining why the control was ineffective. Preventing future accidents requires shifting from a focus on preventing failures to the broader goal of designing and implementing controls that will enforce the necessary constraints. (kl 1084)
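
To make the contrast concrete, here is a minimal sketch in Python of what "enforcing a safety constraint" might look like, as opposed to relying on the reliability of individual components. This is not Leveson's STAMP model itself; the tank scenario, names, and limits are illustrative assumptions rather than anything drawn from her book.

```python
# A minimal sketch of "safety as a control problem": a controller enforces a
# safety constraint on system behavior, rather than trusting that no individual
# component will fail. The scenario and numbers are illustrative assumptions.

SAFE_MAX_LEVEL = 0.8  # safety constraint: tank level must never exceed 80%

def commanded_valve_state(tank_level: float, requested_open: bool) -> bool:
    """Return the inflow-valve command after the safety constraint is applied.

    The controller does not ask "has a component failed?"; it asks whether the
    requested behavior would violate the constraint, and overrides it if so.
    """
    if tank_level >= SAFE_MAX_LEVEL:
        return False          # enforce the constraint: close the inflow valve
    return requested_open     # otherwise defer to the process's own request

if __name__ == "__main__":
    # Even if every component is "working", an interaction that would push the
    # level past the limit is blocked at the control level.
    print(commanded_valve_state(tank_level=0.85, requested_open=True))  # False
    print(commanded_valve_state(tank_level=0.40, requested_open=True))  # True
```

The point of the sketch is only the shift of emphasis: the question asked is not whether a sensor or valve is reliable, but whether the behavior of the system as a whole respects the constraint.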

Leveson’s brief analysis of the Bhopal disaster in 1984 (kl 384 ff.) emphasizes the organizational dysfunctions that led to the accident — and that were completely ignored by the Indian state’s investigation of the accident: out-of-service gauges, alarm deficiencies, inadequate response to prior safety audits, a shortage of oxygen masks, failure to inform the police or surrounding community of the accident, and an environment of cost cutting that impaired maintenance and staffing. “When all the factors, including indirect and systemic ones, are considered, it becomes clear that the maintenance worker was, in fact, only a minor and somewhat irrelevant player in the loss. Instead, degradation in the safety margin occurred over time and without any particular single decision to do so but simply as a series of decisions that moved the plant slowly toward a situation where any slight error would lead to a major accident” (kl 447).

Patient safety

An issue which is of concern to anyone who receives treatment in a hospital is the topic of patient safety. How likely is it that there will be a serious mistake in treatment — wrong-site surgery, incorrect medication or radiation dose, exposure to a hospital-acquired infection? The current evidence is alarming. (Martin Makary et al estimate that over 250,000 deaths per year result from medical mistakes — making medical error now the third leading cause of mortality in the United States (link).) And when these events occur, where should we look for assigning responsibility — at the individual providers, at the systems that have been implemented for patient care, at the regulatory agencies responsible for overseeing patient safety?

Medical accidents commonly demonstrate a complex interaction of factors, from the individual provider to the technologies in use to failures of regulation and oversight. We can look at a hospital as a place where caring professionals do their best to improve the health of their patients while scrupulously avoiding errors. Or we can look at it as an intricate system for recording and disseminating information about patients and for administering procedures to them (surgery, medication, radiation therapy); in this sense a hospital is similar to a factory with multiple intersecting locations of activity. Finally, we can look at it as an organization — a system of division of labor, cooperation, and supervision by large numbers of staff whose joint efforts lead to health and accidents alike. Obviously each of these perspectives is partially correct. Doctors, nurses, and technicians are carefully and extensively trained to diagnose and treat their patients. The technology of the hospital — the digital patient record system, the devices that administer drugs, the surgical robots — can be designed better or worse from a safety point of view. And the social organization of the hospital can be effective and safe, or it can be dysfunctional and unsafe. So all three aspects are relevant both to safe operations and to the possibility of a chronic lack of safety.

So how should we analyze the phenomenon of patient safety? What factors distinguish high-safety hospitals from low-safety ones? What lessons can be learned from the study of the accidents and mistakes that cumulatively make up a hospital’s patient safety record?

The view that primarily emphasizes expertise and training of individual practitioners is very common in the healthcare industry, and yet this approach is not particularly useful as a basis for improving the safety of healthcare systems. Skill and expertise are necessary conditions for effective medical treatment; but the other two zones of accident space are probably more important for reducing accidents — the design of treatment systems and the organizational features that coordinate the activities of the various individuals within the system.

Dr. James Bagian is a strong advocate for treating healthcare institutions as systems. Bagian considers both the technical characteristics of care processes and the organizational forms through which these processes are carried out and monitored. And he is very skilled at teasing out the ways in which features of both system and organization lead to avoidable accidents and failures. I recall his description of a safety walkthrough he had done in a major hospital. During the tour he noticed a number of nurses’ stations covered with yellow sticky notes, and he observed that this was both a symptom and a cause of an accident-prone organization: it meant that individual caregivers were obliged to remind themselves of tasks and exceptions that needed to be observed. Far better, he argued, would be a set of systems and protocols that made the sticky notes unnecessary. Here is the abstract from a short summary article by Bagian on the current state of patient safety:

Abstract

The traditional approach to patient safety in health care has ranged from reticence to outward denial of serious flaws. This undermines the otherwise remarkable advances in technology and information that have characterized the specialty of medical practice. In addition, lessons learned in industries outside health care, such as in aviation, provide opportunities for improvements that successfully reduce mishaps and errors while maintaining a standard of excellence. This is precisely the call in medicine prompted by the 1999 Institute of Medicine report “To Err Is Human: Building a Safer Health System.” However, to effect these changes, key components of a successful safety system must include: (1) communication, (2) a shift from a posture of reliance on human infallibility (hence “shame and blame”) to checklists that recognize the contribution of the system and account for human limitations, and (3) a cultivation of non-punitive open and/or de-identified/anonymous reporting of safety concerns, including close calls, in addition to adverse events.

(Here is the Institute of Medicine study to which Bagian refers; link.)

Nancy Leveson is an aeronautical and software engineer who has spent most of her career devoted to designing safe systems. Her book Engineering a Safer World: Systems Thinking Applied to Safety is a recent presentation of her theories of systems safety. She applies these approaches to problems of patient safety with several co-authors in “A Systems Approach to Analyzing and Preventing Hospital Adverse Events” (link). Here is the abstract and summary of findings for that article:

Objective:

This study aimed to demonstrate the use of a systems theory-based accident analysis technique in health care applications as a more powerful alternative to the chain-of-event accident models currently underpinning root cause analysis methods.

Method:

A new accident analysis technique, CAST [Causal Analysis based on Systems Theory], is described and illustrated on a set of adverse cardiovascular surgery events at a large medical center. The lessons that can be learned from the analysis are compared with those that can be derived from the typical root cause analysis techniques used today.

Results:

The analysis of the 30 cardiovascular surgery adverse events using CAST revealed the reasons behind unsafe individual behavior, which were related to the design of the system involved and not negligence or incompetence on the part of individuals. With the use of the system-theoretic analysis results, recommendations can be generated to change the context in which decisions are made and thus improve decision making and reduce the risk of an accident.

Conclusions:

The use of a systems-theoretic accident analysis technique can assist in identifying causal factors at all levels of the system without simply assigning blame to either the frontline clinicians or technicians involved. Identification of these causal factors in accidents will help health care systems learn from mistakes and design system-level changes to prevent them in the future.

Crucial in this article is this research group’s effort to identify causes “at all levels of the system without simply assigning blame to either the frontline clinicians or technicians involved”. The key result is this: “The analysis of the 30 cardiovascular surgery adverse events using CAST revealed the reasons behind unsafe individual behavior, which were related to the design of the system involved and not negligence or incompetence on the part of individuals.”

Bagian, Leveson, and others make a crucial point: in order to substantially increase the performance of hospitals and the healthcare system more generally when it comes to patient safety, it will be necessary to extend the focus of safety analysis from individual incidents and agents to the systems and organizations through which these accidents were possible. In other words, attention to systems and organizations is crucial if we are to significantly reduce the frequency of medical and hospital mistakes.

(The Makary et al estimate of 250,000 deaths caused by medical error has been questioned on methodological grounds. See Aaron Carroll’s thoughtful rebuttal (NYT 8/15/16; link).)

Safety culture or safety behavior?

Andrew Hopkins is a much-published expert on industrial safety who has an important set of insights into the causes of industrial accidents. Much of his career has focused on the oil and gas industry, but he has written on other sectors as well. Particularly interesting are several of his books: Failure to Learn: The BP Texas City Refinery Disaster; Disastrous Decisions: The Human and Organisational Causes of the Gulf of Mexico Blowout; and Lessons from Longford: The ESSO Gas Plant Explosion. He also provides a number of interesting working papers here.

One of his interesting working papers is on the topic of safety culture in the drilling industry, “Why safety cultures don’t work” (link).

Companies that set out to create a “safety culture” often expend huge amounts of resource trying to change the way operatives, foremen and supervisory staff think and feel about safety. The results are often disappointing. (1)

Changing the way people think is nigh impossible, but setting up organizational structures that monitor compliance with procedure, even if that procedure is seen as redundant or unnecessary, is doable. (3)

Hopkins’ central point is that safety requires a change in routine behavior, not in the first instance a change of culture or thought. This means that management and regulatory agencies need to establish safe practices and then enforce compliance through internal and external measures. He uses the example of seat belt usage: campaigns to encourage the use of seat belts had little effect, but behavior changed when fines were imposed on drivers who failed to wear them.

His central focus here, as in most of his books, is on the processes involved in the drilling industry. He makes the point that the incentives that are established in oil and gas drilling are almost entirely oriented towards maximizing speed and production. Exhortations towards “safe practices” are ineffectual in this context.

Much of his argument here comes down to the contrast between high-likelihood, low-harm accidents and low-likelihood, high-harm accidents. The steps required to prevent low-likelihood, high-harm accidents are generally not visible in the workplace, precisely because the sequences that lead to them are highly uncommon. Routine safety procedures will not reduce the likelihood of occurrence of the high-harm accident.

Hopkins offers the example of the air traffic control industry. The ultimate disaster in air traffic control is a mid-air collision, and very few such incidents have occurred; the incident Hopkins refers to is the mid-air collision over Überlingen, Germany in 2002. Procedures in air traffic control give absolute priority to preventing such disasters, and the solution is to identify a key precursor event to a mid-air collision and to ensure that these precursor events are recorded, investigated, and reacted to when they occur. The relevant precursor event in air traffic control is a loss of separation: two aircraft coming within 1.5 miles of each other, where the required separation is 2 miles. Air traffic control regulations and processes require a full investigation and response for every incident in which separation falls to 1.5 miles or less. Air traffic control is a high-reliability industry precisely because it gives priority and resources to the prevention not only of the disastrous incidents themselves but also of the precursors that may lead to them. “This is a clear example of the way a high-reliability organization operates. It works out what the most catastrophic event is likely to be, regardless of how rare such events are in recent experience, and devises good indicators of how well the prevention of that catastrophe is being managed. It is a way of thinking that is highly unusual in the oil and gas industry” (2).
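
The logic Hopkins describes — pick the catastrophic event, define a measurable precursor, and require a response to every occurrence of the precursor — is simple enough to express as a short sketch. The Python below is only an illustration of that logic; the class and function names are hypothetical, while the 1.5-mile trigger and 2-mile standard are the figures cited in the text.

```python
# A minimal sketch of precursor-event monitoring: every loss of separation at or
# below the 1.5-mile trigger is flagged for mandatory investigation, regardless
# of how rarely mid-air collisions themselves occur. Names are illustrative.

from dataclasses import dataclass

REQUIRED_SEPARATION_MILES = 2.0    # normal separation standard
INVESTIGATION_TRIGGER_MILES = 1.5  # precursor threshold cited above

@dataclass
class SeparationEvent:
    aircraft_a: str
    aircraft_b: str
    miles_apart: float

def events_requiring_investigation(events: list[SeparationEvent]) -> list[SeparationEvent]:
    """Return every precursor event that must be investigated and reacted to."""
    return [e for e in events if e.miles_apart <= INVESTIGATION_TRIGGER_MILES]

if __name__ == "__main__":
    observed = [
        SeparationEvent("AC101", "AC202", 1.9),  # below standard, above trigger
        SeparationEvent("AC303", "AC404", 1.4),  # precursor: mandatory investigation
    ]
    for incident in events_requiring_investigation(observed):
        print(f"Investigate: {incident.aircraft_a} / {incident.aircraft_b} "
              f"at {incident.miles_apart} miles separation")
```

The same logic carries over to the drilling example discussed next: treat well kicks and cementing failures as the events that trigger mandatory tracking and response.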

The drilling industry does not commonly follow similar high-level safety management. A drilling blowout is the incident of greatest concern in the drilling industry. There are, according to Hopkins, several obvious precursor events to a well blowout: well kicks and cementing failures. It is Hopkins’ contention that safety in the drilling industry would be greatly enhanced (with respect to the catastrophic events that are both low-probability and high-harm) if procedures were reoriented so that priority attention and tracking were given to these kinds of precursor events. By reducing or eliminating the occurrence of the precursor events, major accidents would be prevented.

Another organizational factor that Hopkins highlights is the role that safety officers play within the organization. In high-reliability organizations, safety officers have an organizationally privileged role; in low-reliability organizations their voices seem to disappear in the competition among many managerial voices with other interests (speed, production, public relations). (This point is explored in an earlier post; link.)

Prior to Macondo [the Deepwater Horizon oil spill], BP’s process safety structure was decentralized. The safety experts had very little power. They lacked strong reporting lines to the centre and answered to commercial managers who tended to put production ahead of engineering excellence. After Macondo, BP reversed this. Now, what I call the “voices of safety” are powerful and heard loud and clear in the boardroom. (3)

Ominously, Hopkins makes a prescient point about the crucial role played by regulatory agencies in enhancing safety in high-risk industries.

Many regulatory regimes, however, particularly that of the US, are not functioning as they ought to. Regulators need to be highly skilled and resourced and must be able to match the best minds in industry in order to have competent discussions about the risk-management strategies of the corporations. In the US they’re not doing that yet. The best practice recognized worldwide is the safety case regime, in use in UK and Norway. (4)

Given the militantly anti-regulatory stance of the current US federal administration and the aggressive lack of attention its administrators pay to scientific and technical expertise, this is a very sobering source of worry about the future of industrial, chemical, and nuclear safety in the US.

Empowering the safety officer?

How can industries involving processes that create large risks of harm for individuals or populations be modified so they are more capable of detecting and eliminating the precursors of harmful accidents? How can nuclear accidents, aviation crashes, chemical plant explosions, and medical errors be reduced, given that each of these activities involves large bureaucratic organizations conducting complex operations and with substantial inter-system linkages? How can organizations be reformed to enhance safety and to minimize the likelihood of harmful accidents?

One of the lessons learned from the Challenger space shuttle disaster is the importance of a strongly empowered safety officer in organizations that deal in high-risk activities. This means the creation of a position dedicated to ensuring safe operations that falls outside the normal chain of command. The idea is that the normal decision-making hierarchy of a large organization has a built-in tendency to maintain production schedules and avoid costly delays. In other words, there is a built-in incentive to treat safety issues with lower priority than most people would expect.

If there had been an empowered safety officer in the launch hierarchy for the Challenger launch in 1986, there is a good chance this officer would have listened more carefully to the Morton-Thiokol engineering team’s concerns about low-temperature damage to the O-rings and would have halted the launch sequence until temperatures in Florida rose to the critical value. The Rogers Commission faulted the decision-making process leading to the launch in its final report on the accident (The Report of the Presidential Commission on the Space Shuttle Challenger Accident – The Tragedy of Mission 51-L in 1986 – Volume One, Volume Two, Volume Three).

This approach is productive because empowering a safety officer creates a different set of interests in the management of a risky process. The safety officer’s interest is in safety, whereas other decision makers are concerned about revenues and costs, public relations, reputation, and other instrumental goods. So a dedicated safety officer is empowered to raise safety concerns that other officers might be hesitant to raise. Ordinary bureaucratic incentives may lead to underestimating risks or concealing faults; so lowering the accident rate requires giving some individuals the incentive and power to act effectively to reduce risks.

Similar findings have emerged in the study of medical and hospital errors. It has been recognized that high-risk activities are made less risky by empowering all members of the team to call a halt in an activity when they perceive a safety issue. When all members of the surgical team are empowered to halt a procedure when they note an apparent error, serious operating-room errors are reduced. (Here is a report from the American College of Obstetricians and Gynecologists on surgical patient safety; link. And here is a 1999 National Academy report on medical error; link.)

The effectiveness of a team-based approach to safety depends on one central fact: there is a high level of expertise embodied in the staff operating a surgical suite, an engineering laboratory, or a drug manufacturing facility. Empowering these individuals to stop a procedure when they judge there is an unrecognized error in play greatly extends the amount of embodied knowledge brought to bear on the process. The surgeon, the commanding officer, or the lab director is no longer the sole expert whose judgments count.

But it also seems clear that these innovations don’t work equally well in all circumstances. Take nuclear power plant operations. In Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima James Mahaffey documents multiple examples of nuclear accidents that resulted from the efforts of mid-level workers to address an emerging problem in an improvised way. In the case of nuclear power plant safety, it appears that the best prescription for safety is to insist on rigid adherence to pre-established protocols. In this case the function of a safety officer is to monitor operations to ensure protocol conformance — not to exercise independent judgment about the best way to respond to an unfavorable reactor event.

It is in fact an interesting exercise to try to identify the kinds of operations in which these innovations are likely to be effective.

Here is a fascinating interview in Slate with Jim Bagian, a former astronaut, one-time director of the Veterans Administration’s National Center for Patient Safety, and distinguished safety expert; link. Bagian emphasizes the importance of taking a system-based approach to safety. Rather than focusing on finding blame for the specific individuals whose actions led to an accident, Bagian emphasizes the importance of tracing back to the institutional, organizational, or logistic background of the accident. What can be changed in the process — of delivering medications to patients, of fueling a rocket, or of moving nuclear solutions around in a laboratory — to make the likelihood of an accident substantially lower?

The safety principles involved here seem fairly simple: cultivate a culture in which errors and near-misses are reported and investigated without blame; empower individuals within risky processes to halt the process when their expertise and experience indicate the possibility of a significant error; create roles within organizations whose occupants’ interests are defined in terms of identifying and resolving unsafe practices or conditions; and share information about safety within the industry and with the public.
