Empowering the safety officer?

How can industries involving processes that create large risks of harm for individuals or populations be modified so they are more capable of detecting and eliminating the precursors of harmful accidents? How can nuclear accidents, aviation crashes, chemical plant explosions, and medical errors be reduced, given that each of these activities involves large bureaucratic organizations conducting complex operations with substantial inter-system linkages? How can organizations be reformed to enhance safety and to minimize the likelihood of harmful accidents?

One of the lessons learned from the Challenger space shuttle disaster is the importance of a strongly empowered safety officer in organizations that deal in high-risk activities. This means the creation of a position dedicated to ensuring safe operations that falls outside the normal chain of command. The idea is that the normal decision-making hierarchy of a large organization has a built-in tendency to maintain production schedules and avoid costly delays. In other words, there is a standing incentive to treat safety issues with lower priority than most people would expect.

If there had been an empowered safety officer in the launch hierarchy for the Challenger launch in 1986, there is a good chance this officer would have listened more carefully to the Morton-Thiokol engineering team’s concerns about low-temperature damage to the O-rings and would have ordered a halt to the launch sequence until temperatures in Florida rose above the critical value. The Rogers Commission faulted the decision-making process leading to the launch decision in its final report on the accident (The Report of the Presidential Commission on the Space Shuttle Challenger Accident – The Tragedy of Mission 51-L in 1986 – Volume One, Volume Two, Volume Three).

This approach is productive because empowering a safety officer creates a different set of interests in the management of a risky process. The safety officer’s interest is in safety, whereas other decision makers are concerned about revenues and costs, public relations, reputation, and other instrumental goods. So a dedicated safety officer is empowered to raise safety concerns that other officers might be hesitant to raise. Ordinary bureaucratic incentives may lead to underestimating risks or concealing faults; so lowering the accident rate requires giving some individuals the incentive and power to act effectively to reduce risks.

Similar findings have emerged in the study of medical and hospital errors. It has been recognized that high-risk activities are made less risky by empowering all members of the team to call a halt to an activity when they perceive a safety issue. When all members of the surgical team are empowered to halt a procedure when they note an apparent error, serious operating-room errors are reduced. (Here is a report from the American College of Obstetricians and Gynecologists on surgical patient safety; link. And here is a 1999 National Academy report on medical error; link.)

The effectiveness of a team-based approach to safety depends on one central fact. There is a high level of expertise embodied in the staff operating a surgical suite, an engineering laboratory, or a drug manufacturing facility. Empowering these individuals to stop a procedure when they judge there is an unrecognized error in play greatly extends the amount of embodied knowledge brought to bear on the process. The surgeon, the commanding officer, or the lab director is no longer the sole expert whose judgments count.

But it also seems clear that these innovations don’t work equally well in all circumstances. Take nuclear power plant operations. In Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima, James Mahaffey documents multiple examples of nuclear accidents that resulted from the efforts of mid-level workers to address an emerging problem in an improvised way. In the case of nuclear power plant safety, it appears that the best prescription for safety is to insist on rigid adherence to pre-established protocols. In this case the function of a safety officer is to monitor operations to ensure protocol conformance — not to exercise independent judgment about the best way to respond to an unfavorable reactor event.

It is in fact an interesting exercise to try to identify the kinds of operations in which these innovations are likely to be effective.

Here is a fascinating interview in Slate with Jim Bagian, a former astronaut, one-time director of the Veterans Administration’s National Center for Patient Safety, and distinguished safety expert; link. Bagian emphasizes the importance of taking a system-based approach to safety. Rather than focusing on assigning blame to specific individuals whose actions led to an accident, Bagian emphasizes the importance of tracing back to the institutional, organizational, or logistic background of the accident. What can be changed in the process — of delivering medications to patients, of fueling a rocket, or of moving nuclear solutions around in a laboratory — to make the likelihood of an accident substantially lower?

The safety principles involved here seem fairly simple: cultivate a culture in which errors and near-misses are reported and investigated without blame; empower individuals within risky processes to halt the process if their expertise and experience indicates the possibility of a significant risky error; create individuals within organizations whose interests are defined in terms of the identification and resolution of unsafe practices or conditions; and share information about safety within the industry and with the public.

Nuclear accidents

[diagrams: Chernobyl reactor before and after]

Nuclear fission is one of the world-changing discoveries of the mid-twentieth century. The atomic bomb projects of the United States led to the atomic bombing of Japan in August 1945, and the hope for limitless electricity brought about the proliferation of a variety of nuclear reactors around the world in the decades following World War II. And, of course, nuclear weapons proliferated to other countries beyond the original circle of atomic powers.

Given the enormous energies associated with fission and the dangerous and toxic properties of radioactive components of fission processes, the possibility of a nuclear accident is a particularly frightening one for the modern public. The world has seen the results of several massive nuclear accidents — Chernobyl and Fukushima in particular — and the devastating results they have had on human populations and the social and economic wellbeing of the regions in which they occurred.

Safety is therefore a paramount priority in the nuclear industry, in research labs and in military and civilian applications alike. So what is the state of safety in the nuclear sector? Jim Mahaffey’s Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima is a detailed and carefully researched attempt to answer this question. And the information he provides is not reassuring. Beyond the celebrated and well-known disasters at nuclear power plants (Three Mile Island, Chernobyl, Fukushima), Mahaffey describes hundreds of accidents involving reactors, research laboratories, weapons plants, and deployed nuclear weapons that have received much less public attention. These accidents resulted in a very low number of lives lost, but their frequency is alarming. They are indeed “normal accidents” (Perrow, Normal Accidents: Living with High-Risk Technologies). For example:

  • a Japanese fishing boat is contaminated by fallout from the Castle Bravo hydrogen bomb test, and radioactive fish turn up in markets across Japan (March 1, 1954) (kl 1706)
  • one MK-6 atomic bomb is dropped on Mars Bluff, South Carolina, after a crew member accidentally pulled the emergency bomb release handle (February 5, 1958) (kl 5774)
  • Fermi 1 liquid sodium plutonium breeder reactor experiences fuel meltdown during startup trials near Detroit (October 4, 1966) (kl 4127)

Mahaffey also provides detailed accounts of the most serious nuclear accidents and meltdowns of the past forty years: Three Mile Island, Chernobyl, and Fukushima.

The safety and control of nuclear weapons is of particular interest. Here is Mahaffey’s summary of “Broken Arrow” events — the loss of atomic and fusion weapons:

Did the Air Force ever lose an A-bomb, or did they just misplace a few of them for a short time? Did they ever drop anything that could be picked up by someone else and used against us? Is humanity going to perish because of poisonous plutonium spread that was snapped up by the wrong people after being somehow misplaced? Several examples will follow. You be the judge. 

Chuck Hansen [U.S. Nuclear Weapons – The Secret History] was wrong about one thing. He counted thirty-two “Broken Arrow” accidents. There are now sixty-five documented incidents in which nuclear weapons owned by the United States were lost, destroyed, or damaged between 1945 and 1989. These bombs and warheads, which contain hundreds of pounds of high explosive, have been abused in a wide range of unfortunate events. They have been accidentally dropped from high altitude, dropped from low altitude, crashed through the bomb bay doors while standing on the runway, tumbled off a fork lift, escaped from a chain hoist, and rolled off an aircraft carrier into the ocean. Bombs have been abandoned at the bottom of a test shaft, left buried in a crater, and lost in the mud off the coast of Georgia. Nuclear devices have been pounded with artillery of a foreign nature, struck by lightning, smashed to pieces, scorched, toasted, and burned beyond recognition. Incredibly, in all this mayhem, not a single nuclear weapon has gone off accidentally, anywhere in the world. If it had, the public would know about it. That type of accident would be almost impossible to conceal. (kl 5527)

There are a few common threads in the stories of accident and malfunction that Mahaffey provides. First, there are failures of training and knowledge on the part of front-line workers. The physics of nuclear fission is often counter-intuitive, and the idea of critical mass does not fully capture the danger of a quantity of fissionable material. The geometry in which the material is stored makes a critical difference to whether it goes critical. Fissionable material is often transported and manipulated in liquid solution; and the shape and configuration of the vessel in which the solution is held makes a difference to the probability of exponential growth of neutron emission — leading to runaway fission of the material. Mahaffey documents accidents that occurred in nuclear materials processing plants that resulted from plant workers applying what they knew from industrial plumbing to their efforts to solve basic shop-floor problems. All too often the result was a flash of blue light and the release of a great deal of heat and radioactive material.
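
To make the geometry point concrete, here is a minimal sketch with illustrative dimensions of my own choosing (nothing here comes from Mahaffey). For a fixed volume of fissile solution, a compact sphere has much less surface area per unit volume than a tall, narrow column; since neutron leakage scales roughly with surface area while neutron production scales with volume, the compact shape is the one that sits dangerously close to criticality.

```python
# Hedged, illustrative sketch (not from the source): why vessel geometry matters.
# Neutron leakage scales roughly with surface area, while neutron production scales
# with volume, so for the same volume of solution a compact sphere is far closer to
# criticality than a tall, narrow "favorable geometry" column. Numbers are invented.
from math import pi

V = 0.010  # 10 liters of fissile solution, expressed in cubic meters

# Sphere of volume V
r_sphere = (3 * V / (4 * pi)) ** (1 / 3)
area_sphere = 4 * pi * r_sphere ** 2

# Tall, narrow cylinder of the same volume (radius assumed to be 5 cm)
r_cyl = 0.05
h_cyl = V / (pi * r_cyl ** 2)
area_cyl = 2 * pi * r_cyl ** 2 + 2 * pi * r_cyl * h_cyl

print(f"sphere:   surface/volume = {area_sphere / V:5.1f} per meter")
print(f"cylinder: surface/volume = {area_cyl / V:5.1f} per meter (height {h_cyl:.2f} m)")
# The narrow column leaks far more neutrons per unit volume, which is why vessels
# for handling fissile solutions are deliberately designed tall and thin.
```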

Second, there is a fault at the opposite end of the knowledge spectrum — the tendency of expert engineers and scientists to believe that they can solve complicated reactor problems on the fly. This turned out to be a critical problem at Chernobyl (kl 6859).

The most difficult problem to handle is that the reactor operator, highly trained and educated with an active and disciplined mind, is liable to think beyond the rote procedures and carefully scheduled tasks. The operator is not a computer, and he or she cannot think like a machine. When the operator at NRX saw some untidy valve handles in the basement, he stepped outside the procedures and straightened them out, so that they were all facing the same way. (kl 2057)

There are also clear examples of inappropriate supervision in the accounts shared by Mahaffey. Here is an example from Chernobyl.

[Deputy chief engineer] Dyatlov was enraged. He paced up and down the control panel, berating the operators, cursing, spitting, threatening, and waving his arms. He demanded that the power be brought back up to 1,500 megawatts, where it was supposed to be for the test. The operators, Toptunov and Akimov, refused on grounds that it was against the rules to do so, even if they were not sure why. 

Dyatlov turned on Toptunov. “You lying idiot! If you don’t increase power, Tregub will!”  

Tregub, the Shift Foreman from the previous shift, was officially off the clock, but he had stayed around just to see the test. He tried to stay out of it. 

Toptunov, in fear of losing his job, started pulling rods. By the time he had wrestled it back to 200 megawatts, 205 of the 211 control rods were all the way out. In this unusual condition, there was danger of an emergency shutdown causing prompt supercriticality and a resulting steam explosion. At 1:22:30 a.m., a read-out from the operations computer advised that the reserve reactivity was too low for controlling the reactor, and it should be shut down immediately. Dyatlov was not worried. “Another two or three minutes, and it will be all over. Get moving, boys!” (kl 6887)

This was the turning point in the disaster.

A related fault is the intrusion of political and business interests into the design and conduct of high-risk nuclear actions. Leaders want a given outcome without understanding the technical details of the processes they are demanding; subordinates like Toptunov are eventually cajoled or coerced into taking the problematic actions. The persistence of advocates for liquid sodium breeder reactors represents a higher-level example of the same fault. Associated with this role of political and business interests is an impulse towards secrecy and concealment when accidents occur and deliberate understatement of the public dangers created by an accident — a fault amply demonstrated in the Fukushima disaster.

Atomic Accidents provides a fascinating history of events of which most of us are unaware. The book is not primarily intended to offer an account of the causes of these accidents, but rather of the ways in which they unfolded and the consequences they had for human welfare. (Generally speaking, his view is that nuclear accidents in North America and Western Europe have had remarkably few human casualties.) And many of the accidents he describes are exactly the sorts of failures that are common in all large-scale industrial and military processes.

(Large-scale technology failure has come up frequently here. See these posts for analysis of some of the organizational causes of technology failure (link, link, link).)

Thinking about disaster


Charles Perrow is a very talented sociologist who has put his finger on some of the central weaknesses of the American social-economic-political system.  He has written about corporations (Organizing America: Wealth, Power, and the Origins of Corporate Capitalism), technology failure (Normal Accidents: Living with High-Risk Technologies), and organizations (Complex Organizations: A Critical Essay).  (Here is an earlier post on his historical account of the corporation in America; link.) These sound like very different topics — but they’re not, really.  Organizations, power, the conflict between private interests and the public good, and the social and technical causes of great public harms have been the organizing themes of his research for a very long time.

His current book is truly scary.  In The Next Catastrophe: Reducing Our Vulnerabilities to Natural, Industrial, and Terrorist Disasters he carefully surveys the conjunction of factors that make 21st-century America almost uniquely vulnerable to major disasters — actual and possible.  Hurricane Katrina is one place to start — a concentration of habitation, dangerous infrastructure, vulnerable toxic storage, and wholly inadequate policies of water and land use led to a horrific loss of life and a permanent crippling of a great American city.  The disaster was foreseeable and foreseen, and yet few effective steps were taken to protect the city and river system from catastrophic flooding.  And even more alarming — government and the private sector have taken almost none of the prudent steps after the disaster that would mitigate future flooding.

Perrow’s analysis includes natural disasters (floods, hurricanes, earthquakes), nuclear power plants, chemical plants, the electric power transmission infrastructure, and the Internet — as well as the threat of deliberate attacks by terrorists against high-risk targets.   In each case he documents the extreme risks that our society faces from a combination of factors: concentration of industry and population, lax regulation, ineffective organizations of management and oversight, and an inability on the part of Congress to enact legislation that seriously interferes with the business interests of major corporations even for the purpose of protecting the public.

His point is a simple one: we can’t change the weather, the physics of nuclear power, or the destructive energy contained in an LNG farm; but we can take precautions today that significantly reduce the possible effects of accidents caused by these factors in the future. His general conclusion is a very worrisome one: our society is essentially unprotected from major natural disasters and industrial accidents, and we have only very slightly increased our safety when it comes to preventing deliberate terrorist attacks.

This book has been about the inevitable inadequacy of our efforts to protect us from major disasters. It locates the inevitable inadequacy in the limitations of formal organizations. We cannot expect them to do an adequate job in protecting us from mounting natural, industrial, and terrorist disasters.  It locates the avoidable inadequacy of our efforts in our failure to reduce the size of the targets, and thus minimize the extent of harm these disasters can do. (chapter 9)

A specific failure in our current political system is the failure to construct an adequate and safety-enhancing system of regulation:

Stepping outside of the organization itself, we come to a third source of organizational failure, that of regulation. Every chapter on disasters in this book has ended with a call for better regulation and re-regulation, since we need both new regulations in the face of new technologies and threats and the restoration of past regulations that had disappeared or been weakened since the 1960s and 1970s. (chapter 9)

The central vulnerabilities that Perrow points to are systemic and virtually ubiquitous across the United States — concentration and centralization.  He is very concerned about the concentration of people in high-risk areas (flood and earthquake zones, for example); he is concerned about the centralized power wielded by mega-organizations and corporations in our society; and he is concerned about the concentration of highly dangerous infrastructure in places where it puts large populations at risk.  He refers repeatedly to the risk posed by the transport by rail of huge quantities of chlorine gas through densely populated areas — 90 tons at a time; the risk presented by LNG and propane storage farms in areas vulnerable to flooding and consequent release or explosion; the lethal consequences that would ensue from a winter-time massive failure of the electric power grid.

Perrow is an organizational expert; and he recognizes the deep implications that follow from the inherent obstacles that confront large organizations, both public and private.  Co-optation by powerful private interests, failure of coordination among agencies, lack of effective communication in the preparation of policies and emergency responses — these organizational tendencies can reduce organizations like FEMA or the NRC to almost complete inability to perform their public functions.

Organizations, as I have often noted, are tools that can be used by those within and without them for purposes that have little to do with their announced goals. (Kindle loc 1686)

Throughout the book Perrow offers careful, detailed reviews of the effectiveness and consistency of the government agencies and the regulatory legislation that have been deployed to contain these risks.  Why was FEMA such an organizational failure?  What’s wrong with the Department of Homeland Security?  Why are chronic issues of system safety in nuclear power plants and chemical plants not adequately addressed by the corresponding regulatory agencies?  Perrow goes through these examples in great detail and demonstrates the very ordinary social mechanisms through which organizations lose effectiveness.  The book serves as a case-study review of organizational failures.

Perrow’s central point is stark: the American political system lacks the strength to take the long-term steps it needs to in order to mitigate the worst effects of natural (or intentional) disasters that are inevitable in our future.  We need consistent investment for long-term benefits; we need effective regulation of powerful actors; and we need long-term policies that mitigate future disasters.  But so far we have failed in each of these areas.  Private interests are too strong, an ideology of free choice and virtually unrestrained use of property leads to dangerous residential and business development, and Federal and state agencies lack the political will to enact the effective regulations that would be necessary to raise the safety threshold in dangerous industries and developments. And, of course, the determined attack on “government regulations” that has been underway from the right since the Reagan years just further worsens the ability of agencies to regulate these powerful businesses — the nuclear power industry, the chemical industry, the oil and gas industry, …

One might think that the risks that Perrow describes are fairly universal across modern societies.  But Perrow notes that these problems seem more difficult and fundamental in the United States than in Europe.  The Netherlands has centuries of experience in investing in and regulating developments having to do with the control of water; European countries have managed to cooperate on the management of rivers and flood plains; and most have much stronger regulatory regimes for the high risk technologies and infrastructure sectors.

The book is scary, and we need to pay attention to the social and natural risks that Perrow documents so vividly.  And we need collectively to take steps to realistically address these risks.  We need to improve the organizations we create, both public and private, aimed at mitigating large risks.  And we need to substantially improve upon the reach and effectiveness of the regulatory systems that govern these activities.  But Perrow insists that improving organizations and leadership, and creating better regulations, can only take us so far.  So we also need to reduce the scope of damage that will occur when disaster strikes.  We need to design our social system for “soft landings” when disasters occur.  Fundamentally, his advice is to decentralize dangerous infrastructure and to be much more cautious about development in high-risk zones.

Given the limited success we can expect from organizational, executive, and regulatory reform, we should attend to reducing the damage that organizations can do by reducing their size.  Smaller organizations have a smaller potential for harm, just as smaller concentrations of populations in areas vulnerable to natural, industrial, and terrorist disasters present smaller targets. (chapter 9)

If owners assume more responsibility for decisions about design and location — for example, by being required to purchase realistically priced flood or earthquake insurance — then there would be less new construction in hurricane alleyways or high-risk earthquake areas.  Rather than integrated mega-organizations and corporations providing goods and services, Perrow argues for the effectiveness of networks of small firms.  And he argues that regulations and law can be designed that give the right incentives to developers and home buyers about where to locate their businesses and homes, reflecting the true costs associated with risky locations. Realistically priced mandatory flood insurance would significantly alter the population density in hurricane alleys.  And our policies and regulations should make a systematic effort to disperse dangerous concentrations of industrial and nuclear materials wherever possible.


System safety engineering and the Deepwater Horizon

The Deepwater Horizon oil rig explosion, fire, and uncontrolled release of oil into the Gulf is a disaster of unprecedented magnitude.  This disaster in the Gulf of Mexico appears to be more serious in objective terms than the Challenger space shuttle disaster in 1986, both in immediate loss of life and in overall harm created. And sadly, it appears likely that the accident will reveal equally severe failures in the management of enormously hazardous processes, defects in the associated safety engineering analysis, and inadequacies of the regulatory environment within which the activity took place.  The Challenger disaster fundamentally changed the ways that we thought about safety in the aerospace field.  It is likely that this disaster too will force radical new thinking and new procedures concerning how to deal with the inherently dangerous processes associated with deep-ocean drilling.

Nancy Leveson is an important expert in the area of systems safety engineering, and her book, Safeware: System Safety and Computers, is a genuinely important contribution.  Leveson led the investigation of the role that software design might have played in the Challenger disaster (link).  Here is a short, readable white paper of hers on system safety engineering (link) that is highly relevant to the discussions that will need to occur about deep-ocean drilling.  The paper does a great job of laying out how safety has been analyzed in several high-hazard industries, and presents a set of basic principles for systems safety design.  She discusses aviation, the nuclear industry, military aerospace, and the chemical industry; and she points out some important differences across industries when it comes to safety engineering.  Here is an instructive description of the safety situation in military aerospace in the 1950s and 1960s:

Within 18 months after the fleet of 71 Atlas F missiles became operational, four blew up in their silos during operational testing. The missiles also had an extremely low launch success rate.  An Air Force manual describes several of these accidents: 

     An ICBM silo was destroyed because the counterweights, used to balance the silo elevator on the way up and down in the silo, were designed with consideration only to raising a fueled missile to the surface for firing. There was no consideration that, when you were not firing in anger, you had to bring the fueled missile back down to defuel. 

     The first operation with a fueled missile was nearly successful. The drive mechanism held it for all but the last five feet when gravity took over and the missile dropped back. Very suddenly, the 40-foot diameter silo was altered to about 100-foot diameter. 

     During operational tests on another silo, the decision was made to continue a test against the safety engineer’s advice when all indications were that, because of high oxygen concentrations in the silo, a catastrophe was imminent. The resulting fire destroyed a missile and caused extensive silo damage. In another accident, five people were killed when a single-point failure in a hydraulic system caused a 120-ton door to fall. 

     Launch failures were caused by reversed gyros, reversed electrical plugs, bypass of procedural steps, and by management decisions to continue, in spite of contrary indications, because of schedule pressures. (from the Air Force System Safety Handbook for Acquisition Managers, Air Force Space Division, January 1984)

Leveson’s illustrations from the history of these industries are fascinating.  But even more valuable are the principles of safety engineering that she recapitulates.  These principles seem to have many implications for deep-ocean drilling and associated technologies and systems.  Here is her definition of systems safety:

System safety uses systems theory and systems engineering approaches to prevent foreseeable accidents and to minimize the result of unforeseen ones.  Losses in general, not just human death or injury, are considered. Such losses may include destruction of property, loss of mission, and environmental harm. The primary concern of system safety is the management of hazards: their identification, evaluation, elimination, and control through analysis, design and management procedures.

Here are several fundamental principles of designing safe systems that she discusses:
  • System safety emphasizes building in safety, not adding it on to a completed design.
  • System safety deals with systems as a whole rather than with subsystems or components.
  • System safety takes a larger view of hazards than just failures.
  • System safety emphasizes analysis rather than past experience and standards.
  • System safety emphasizes qualitative rather than quantitative approaches.
  • Recognition of tradeoffs and conflicts.
  • System safety is more than just system engineering.

And here is an important summary observation about the complexity of safe systems:

Safety is an emergent property that arises at the system level when components are operating together. The events leading to an accident may be a complex combination of equipment failure, faulty maintenance, instrumentation and control problems, human actions, and design errors. Reliability analysis considers only the possibility of accidents related to failures; it does not investigate potential damage that could result from successful operation of the individual components.

How do these principles apply to the engineering problem of deep-ocean drilling?  Perhaps the most important implications are these: a safe system needs to be based on careful and comprehensive analysis of the hazards that are inherently involved in the process; it needs to be designed with an eye to handling those hazards safely; and it can’t be done in a piecemeal, “fly-fix-fly” fashion.

It would appear that deep-ocean drilling is characterized by too little analysis and too much confidence in the ability of engineers to “correct” inadvertent outcomes (“fly-fix-fly”).  The accident that occurred in the Gulf last month can be analyzed into two parts. The first is the explosion and fire that destroyed the drilling rig and led to the tragic loss of life of 11 rig workers. The second is the still-uncalculated harm caused by the uncontrolled venting of perhaps a hundred thousand barrels of crude oil to date into the Gulf of Mexico, now threatening the coasts and ecologies of several states.  Shockingly, there is at present no high-reliability method for capping the well at a depth of over 5,000 feet; so the harm can continue to worsen for a very extended period of time.

The safety systems on the platform itself will need to be examined in detail. But the bottom line will probably look something like this: the platform is a complex system vulnerable to explosion and fire, and there was always a calculable (though presumably small) probability of catastrophic fire and loss of the ship. This is pretty analogous to the problem of safety in aircraft and other complex electro-mechanical systems. The loss of life in the incident is terrible but confined.  Planes crash and ships sink.

What elevates this accident to a globally important catastrophe is what happened next: destruction of the pipeline leading from the wellhead 5,000 feet below sea level to containers on the surface, and the failure of the shutoff valve system on the ocean floor. These two failures have resulted in the unconstrained release of a massive and uncontrollable flow of crude oil into the Gulf, and environmental harms that are likely to be greater than those of the Exxon Valdez spill.

Oil wells fail on the surface, and they are difficult to control. But there is a well-developed technology that teams of oil fire specialists like Red Adair employ to cap the flow and end the damage. We don’t have anything like this for wells drilled under water at the depth of this incident; this accident is less accessible than objects in space for corrective intervention. So surface well failures conform to a sort of epsilon-delta relationship: an epsilon accident leads to a limited delta harm. This deep-ocean well failure in the Gulf is catastrophically different: the relatively small incident on the surface is resulting in an unbounded and spiraling harm.

So was this a foreseeable hazard? Of course it was. There was always a finite probability of total loss of the platform, leading to destruction of the pipeline. There was also a finite probability of failure of the massive sea-floor emergency shutoff valve. And, critically, it was certainly known that there is no high-reliability fix in the event of failure of the shutoff valve. The effort to use the dome currently being tried by BP is untested and unproven at this great depth. The alternative of drilling a second well to relieve pressure may work; but it will take weeks or months. So essentially, when we reach the end of this failure pathway, we arrive at this conclusion: catastrophic, unbounded failure. If you reach this point in the fault tree, there is almost nothing to be done. And this is a totally irrational outcome to tolerate; how could any engineer or regulatory agency have accepted the circumstances of this activity, given that one possible failure pathway would lead predictably to unbounded harms?

There is one line of thought that might have led to the conclusion that deep-ocean drilling is acceptably safe: engineers and policy makers might have optimistically overestimated the reliability of the critical components. If we estimate that the probability of failure of the platform is 1/1,000, failure of the pipeline is 1/100, and failure of the emergency shutoff valve is 1/10,000, then one might say that the probability of the nightmare scenario is vanishingly small: one in a billion. Perhaps one might reason that we can disregard scenarios with this level of likelihood. Reasoning very much like this was involved in the original safety designs of the shuttle (Safeware: System Safety and Computers). But two things are now clear. First, this disaster was not virtually impossible; in fact, it actually occurred. And second, it seems likely that the estimates of the component failure probabilities were badly understated.
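
To make the arithmetic explicit, here is a minimal sketch of this fault-tree style calculation. The baseline numbers are the hypothetical estimates quoted above; the alternative conditional probabilities are my own invented illustrations of why the “one in a billion” figure collapses once the component failures are treated as dependent.

```python
# Illustrative sketch only: the "one in a billion" reasoning multiplies independent
# component failure probabilities. The baseline numbers are the hypothetical estimates
# quoted in the text; the conditional probabilities further down are invented.

p_platform = 1 / 1_000    # catastrophic loss of the platform
p_pipeline = 1 / 100      # destruction of the riser/pipeline
p_valve    = 1 / 10_000   # failure of the sea-floor emergency shutoff valve

p_independent = p_platform * p_pipeline * p_valve
print(f"assuming independence: {p_independent:.1e}")   # 1.0e-09, "one in a billion"

# The failures are not independent: an event violent enough to destroy the platform
# is very likely to destroy the pipeline, and it is exactly the situation in which
# the shutoff valve is most stressed. Hypothetical conditional probabilities:
p_pipeline_given_loss = 0.5
p_valve_given_blowout = 0.01

p_dependent = p_platform * p_pipeline_given_loss * p_valve_given_blowout
print(f"allowing for dependence: {p_dependent:.1e}")   # 5.0e-06, thousands of times larger
```

The point of the sketch is not the particular numbers but the structure of the error: multiplying optimistic, independent estimates makes an unbounded-harm pathway look negligible.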

What does this imply about deep ocean drilling? It seems inescapable that the current state of technology does not permit us to take the risk of this kind of total systems failure. Until there is a reliable and reasonably quick technology for capping a deep-ocean well, the small probability of this kind of failure makes the use of the technology entirely unjustifiable. It makes no sense at all to play Russian roulette when the cost of failure is massive and unconstrained ecological damage.

There is another aspect of this disaster that needs to be called out, and that is the issue of regulation. Just as the nuclear industry requires close, rigorous regulation and inspection, so deep-ocean drilling must be rigorously regulated. The stakes are too high to allow the oil industry to regulate itself. And unfortunately there are clear indications of weak regulation in this industry (link).

(Here are links to a couple of earlier posts on safety and technology failure (link, link).)

Patient safety — Canada and France


Patient safety is a key issue in managing and assessing a regional or national health system. There are very sizable variations in patient safety statistics across hospitals, with significantly higher rates of infection and mortality in some institutions than others. Why is this? And what can be done in order to improve the safety performance of low-safety institutions, and to improve the overall safety performance of the hospital environment nationally?

Previous posts have made the point that safety is the net effect of a complex system within a hospital or chemical plant, including institutions, rules, practices, training, supervision, and day-to-day behavior by staff and supervisors (post, post). And experts on hospital safety agree that improvements in safety require careful analysis of patient processes in order to redesign processes so as to make infections, falls, improper medications, and unnecessary mortality less likely. Institutional design and workplace culture have to change if safety performance is to improve consistently and sustainably. (Here is a posting providing a bit more discussion of the institutions of a hospital; post.)

But here is an important question: what are the features of the social and legal environment that will make it most likely that hospital administrators will commit themselves to a thorough-going culture and management of safety? What incentives or constraints need to exist to offset the impulses of cost-cutting and status quo management that threaten to undermine patient safety? What will drive the institutional change in a health system that improving patient safety requires?

Several measures seem clear. One is state regulation of hospitals. This exists in every state; but the effectiveness of regulatory regimes varies widely across contexts. So understanding the dynamics of regulation and enforcement is a crucial step toward improving hospital quality and patient safety. The oversight of rigorous hospital accreditation agencies is another important factor for improvement. For example, the Joint Commission accredits thousands of hospitals in the United States (web page) through dozens of accreditation and certification programs. Patient safety is the highest priority underlying Joint Commission standards of accreditation. So regulation and the formulation of standards are part of the answer. But a particularly important policy tool for improving safety performance is the mandatory collection and publication of safety statistics, so that potential patients can decide between hospitals on the basis of their safety performance. Publicity and transparency are crucial parts of good management behavior; and secrecy is a refuge of poor performance in areas of public concern such as safety, corruption, or rule-setting. (See an earlier post on the relationship between publicity and corruption.)

But here we have a bit of a conundrum: achieving mandatory publication of safety statistics is politically difficult, because hospitals have a business interest in keeping these data private. So there has been a lot of resistance to mandatory reporting of basic patient safety data in the US over the past twenty years. Fortunately, the public interest in having these data readily available has largely prevailed, and hospitals are now required to publish a broader and broader range of data on patient safety, including hospital-acquired infection rates, ventilator-induced pneumonias, patient falls, and mortality rates. Here is a useful tool from USA Today that lets patients gather information about their hospital options and how these compare with other hospitals regionally and nationally. This is an effective accountability mechanism that inevitably drives hospitals towards better performance.

Canada has been very active in this area. Here is a website published by the Ontario Ministry of Health and Long-Term Care. The province requires hospitals to report a number of indicators of patient safety: several kinds of hospital-acquired infections; central-line primary bloodstream infection and ventilator-associated pneumonia; surgical-site infection prevention activity; and the hospital standardized mortality ratio. The user can explore the site and find that there are in fact wide variations across hospitals in the province. This is likely to influence patient choice; but it also serves as an instant guide for regulatory agencies and local hospital administrators as they attempt to focus attention on poor management practices and institutional arrangements. (It would be helpful for purposes of comparison if the data could be easily downloaded into a spreadsheet.)
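
As a point of reference, the hospital standardized mortality ratio reported on such sites is, in its simplest form, observed deaths divided by risk-adjusted expected deaths, multiplied by 100. Here is a minimal sketch with invented counts; real HSMRs rest on statistical risk-adjustment models that are not shown here.

```python
# Hedged sketch: the hospital standardized mortality ratio (HSMR) in its simplest form.
# All counts are invented; real expected-death figures come from risk-adjustment models.

def hsmr(observed_deaths: int, expected_deaths: float) -> float:
    """Observed deaths as a percentage of expected deaths (100 means 'as expected')."""
    return 100.0 * observed_deaths / expected_deaths

hospitals = {
    "Hospital A": hsmr(observed_deaths=240, expected_deaths=300.0),  # 80: fewer deaths than expected
    "Hospital B": hsmr(observed_deaths=330, expected_deaths=300.0),  # 110: more deaths than expected
}

for name, ratio in sorted(hospitals.items(), key=lambda item: item[1]):
    print(f"{name}: HSMR = {ratio:.0f}")
```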

On first principles, it seems likely that any country that has a hospital system in which the safety performance of each hospital is kept secret will also show a wide distribution of patient safety outcomes across institutions, and will have an overall safety record that is much lower than it could be. This is because secrecy gives hospital administrators the ability to conceal the risks their institutions impose on patients through bad practices. So publicity and regular publication of patient safety information seems to be a necessary precondition to maintaining a high-safety hospital system.

But here is the crucial point: many countries continue to permit secrecy when it comes to hospital safety. This appears to be particularly true in France, where the medical and hospital system continues to display a very high degree of secrecy and opacity about patient safety. In fact, anecdotal information about French hospitals suggests a wide range of levels of hospital-acquired infections in different hospitals. Hospital-acquired infections (infections nosocomiales) are an important and rising cause of patient morbidity. And there are well-known practices and technologies that substantially reduce the incidence of these infections. But the implementation of these practices requires strong commitment and dedication at the unit level; and this degree of commitment is unlikely to occur in an environment of secrecy.

In fact, I have not been able to discover any of the tools now available for measuring patient safety in North American hospitals applied to hospitals in France. But without this regular reporting, there is no mechanism through which institutions with bad safety performance can be “ratcheted” up into better practices and better safety outcomes. The impression given by the French medical system is that doctors and medical authorities are sacrosanct; patients are not expected to question their judgment, and the state appears not to require institutions to report and publish fundamental safety information. Patients have very little power, and the media so far seem to have paid little attention to the issues of patient safety in French hospitals. This 2007 article in Le Point seems to be a first for France in that it provides quantitative rankings of a large number of hospitals in their treatment of a number of diseases. But it does not provide the kinds of safety information — infections, falls, pneumonias — that are core measures of patient safety.

There is a French state agency, the Office National d’Indemnisation des Accidents Médicaux (ONIAM), that provides compensation to patients who can demonstrate that their injuries are the result of hospital-induced causes, including especially hospital-associated infections. But it appears that this agency is restricted to after-the-fact recognition of hospital errors rather than pro-active programs designed to reduce hospital errors. And here is a French government web site devoted to the issue of hospital infections. It announces a multi-pronged strategy for controlling the problem of infections nosocomiales, including the establishment of a national program of surveillance of the rates of these infections. So far, however, I have not been able to locate web resources that would provide hospital-level data about infection rates.

So I am offering a hypothesis that I would be very happy to find to be refuted: that the French medical establishment continues to be bureaucratically administered with very little public exposure of actual performance when it comes to patient safety. And without this system of publicity, it seems very likely that there are wide and tragic variations across French hospitals with regard to patient safety.

Are there French medical sociologists and public health researchers who are working on the issue of patient safety in French hospitals? Can good contemporary French sociologists like Céline Béraud, Baptiste Coulmont, and Philippe Masson offer some guidance on this topic (post)? If readers are aware of databases and patient safety research programs in France that are relevant to these topics, I would be very happy to hear about them.

Update: Baptiste Coulmont (blog) passes on this link to the Réseau d’alerte d’investigations et de surveillance des infections nosocomiales (RAISIN) within the Institut de veille sanitaire. The site provides research reports and regional assessments of nosocomial infection incidence. It does not appear to provide data at the level of specific hospitals and medical centers. Baptiste refers also to work by Jean Peneff, a French medical sociologist and author of La France malade de ses médecins. Here is a link to a subsequent research report by Peneff. Thanks, Baptiste.

Institutions, procedures, norms


One of the noteworthy aspects of the framing offered by Victor Nee and Mary Brinton of the assumptions of the new institutionalism is the very close connection they postulate between institutions and norms. (See the prior posting on this subject). So what is the connection between institutions and norms?

The idea that an institution is nothing more than a collection of norms, formal and informal, seems incomplete on its face. Institutions also depend on rules, procedures, protocols, sanctions, and habits and practices. These other social behavioral factors may intersect in various ways with the workings of social norms, but they are not reducible to a set of norms. And this is to say that institutions are not reducible to a collection of norms.

Consider for example the institutions that embody the patient safety regime in a hospital. What are the constituents of the institutions through which hospitals provide for patient safety? Certainly there are norms, both formal and informal, that are deliberately inculcated and reinforced and that influence the behavior of nurses, pharmacists, technicians, and doctors. But there are also procedures — checklists in operating rooms; training programs — rehearsals of complex crisis activities; routinized behaviors — “always confirm the patient’s birthday before initiating a procedure”; and rules — “physicians must disclose financial relationships with suppliers”. So the institutions defining the management of patient safety are a heterogeneous mix of social factors and processes.

A key feature of an institution, then, is the set of procedures and protocols that it embodies. In fact, we might consider a short-hand way of specifying an institution in terms of the procedures it prescribes for behavior in stereotyped circumstances of crisis, conflict, cooperation, and mundane interactions with stakeholders. Organizations have usually created specific ways of handling typical situations: handling an intoxicated customer in a restaurant, making sure that no “wrong site” surgeries occur in an operating room, handling the flow of emergency supplies into a region when a large disaster occurs. The idea here is that the performance of the organization, and the individuals within it, will be more effective at achieving the desired goals of the organization if plans and procedures have been developed to coordinate actions in the most effective way possible. This is the purpose of an airline pilot’s checklist before takeoff; it forces the pilot to go through a complete procedure that has been developed for the purpose of avoiding mistakes. Spontaneous, improvised action is sometimes unavoidable; but organizations have learned that they are more effective when they thoughtfully develop procedures for handling their high-risk activities.
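
As a purely illustrative sketch (the checklist items below are hypothetical, not drawn from any actual protocol), a procedure of this kind can be thought of as an explicit list of checks, every one of which must be confirmed before the high-risk step is allowed to proceed:

```python
# Hypothetical sketch: a procedure encoded as an explicit, enforced checklist.
# The items are invented, in the spirit of a surgical or pre-flight checklist.

PRE_PROCEDURE_CHECKLIST = [
    "patient identity confirmed (name and birth date)",
    "surgical site marked and verified against the consent form",
    "required instruments and implants present and sterile",
    "team briefing completed; every member knows they may call a halt",
]

def ready_to_proceed(confirmed_items: set) -> bool:
    """The procedure may start only when every checklist item has been confirmed."""
    missing = [item for item in PRE_PROCEDURE_CHECKLIST if item not in confirmed_items]
    for item in missing:
        print(f"HOLD: not confirmed -> {item}")
    return not missing

# Usage: a single unconfirmed item is enough to stop the process.
confirmed = {
    "patient identity confirmed (name and birth date)",
    "required instruments and implants present and sterile",
}
print("proceed" if ready_to_proceed(confirmed) else "halt")
```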

This is the point at which the categories of management oversight and staff training come into play. It is one thing to have designed an effective set of procedures for handling a given complex task; but this achievement is only genuinely effective if agents within the organization in fact follow the procedures and protocols. Training is the umbrella activity that describes the processes through which the organization attempts to achieve a high level of shared knowledge about the organization’s procedures. And management oversight is the umbrella activity that describes the processes of supervision and motivation through which the organization attempts to ensure that its agents follow the procedures and protocols.

In fact, one of the central findings in the area of safety research is that the specific content of the procedures of an organization that engages in high-risk activities is crucially important to the overall safety performance of the organization. Apparently small differences in procedure can have an important effect on safety. To take a fairly trivial example, the construction of a stylized vocabulary and syntax for air traffic controllers and pilots increases safety by reducing the possibility of ambiguous communications; so two air traffic systems that were identical except for their standardized communications protocols would be expected to have different safety records. Another key finding falls more on the “norms and culture” side of the equation; it is frequently observed that high-risk organizations need to embody a culture of safety that permeates the whole organization.

We might postulate that norms come into the story when we ask what motivates a person to conform to the prescribed procedure or rule — though several other social-behavioral mechanisms work at this level as well (trained habits and well-enforced sanctions, for example). But more fundamentally, the explanatory value of the micro-institutional analysis may come in at the level of the details of the procedures and rules in contrast to other possible embodiments — rather than at the level of the question, what makes these procedures effective in most participants’ conduct?

We might say, then, that an institution can be fully specified when we provide information about:

  • the procedures, policies, and protocols it imposes on its participants
  • the training and educational processes the institution relies on for instilling appropriate knowledge about its procedures and rules in its participants
  • the management, supervision, enforcement, and incentive mechanisms it embodies to assure a sufficient level of compliance among its participants
  • the norms of behavior that typical participants have internalized with respect to action within the institution

And the distinctive performance characteristics of the institution may derive from the specific nature of the arrangements that are described at each of these levels.
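
To make the four levels just listed a little more concrete, here is a minimal, purely hypothetical data-structure sketch; the field names and example entries are my own illustrations, not a framework drawn from the new institutionalist literature.

```python
# Hypothetical sketch: an institution specified along the four dimensions listed above.
from dataclasses import dataclass

@dataclass
class InstitutionSpec:
    procedures: list      # procedures, policies, and protocols imposed on participants
    training: list        # how knowledge of the procedures is instilled
    oversight: list       # supervision, enforcement, and incentive mechanisms
    norms: list           # norms that typical participants have internalized

patient_safety_regime = InstitutionSpec(
    procedures=["pre-operative checklist", "two-identifier patient verification"],
    training=["new-hire safety orientation", "annual crisis-simulation drills"],
    oversight=["unannounced audits", "non-punitive incident reporting with review"],
    norms=["any team member may halt a procedure", "near-misses are reported, not hidden"],
)

# Comparing two institutions then means comparing them level by level, since a
# difference at any one level can produce different performance characteristics.
print(patient_safety_regime.procedures)
```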

System safety is a good example to consider from the point of view of the new institutionalism. Two airlines may have significantly different safety records. And the explanation may be at any of these levels: they may have differences in formalized procedures, they may have differences in training regimes, they may have differences in management oversight effectiveness, or they may have different normative cultures at the rank-and-file level. It is a central insight of the new institutionalism that the first level may be the most important for explaining the overall safety records of the two companies, even though mechanisms may fail at any of the other levels as well. Procedural differences generally lead to significant and measurable differences in the quality of organizational results. (Nancy Leveson’s Safeware: System Safety and Computers provides a great discussion of many of these issues.)

Safety as a social effect


Some organizations pose large safety issues for the public because of the technologies and processes they encompass. Industrial factories, chemical and nuclear plants, farms, mines, and aviation all represent sectors where safety issues are critically important because of the inherent risks of the processes they involve. However, “safety” is not primarily a technological characteristic; instead, it is an aggregate outcome that depends as much on the social organization and management of the processes involved as it does on the technologies they employ. (See an earlier posting on technology failure.)

We can define safety by relating it to the concept of “harmful incident”. A harmful incident is an occurrence that leads to injury or death of one or more persons. Safety is a relative concept, in that it involves analysis and comparison of the frequencies of harmful incidents relative to some measure of the volume of activity. If the claim is made that interstate highways are safer than county roads, this amounts to the assertion that there are fewer accidents per vehicle-mile on the former than the latter. If it is held that commercial aviation is safer than automobile transportation, this amounts to the claim that there are fewer harms per passenger-mile in air travel than auto travel. And if it is observed that the computer assembly industry is safer than the mining industry, this can be understood to mean that there are fewer harms per person-day in the one sector than the other. (We might give a parallel analysis of the concept of a healthy workplace.)
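
A minimal sketch of this rate-based notion of safety follows; the incident counts and exposure figures are invented solely to show the form of the comparison.

```python
# Illustrative only: safety as the rate of harmful incidents per unit of exposure
# (vehicle-miles, passenger-miles, person-days). All numbers below are invented.

def incident_rate(harmful_incidents: int, exposure: float) -> float:
    """Harmful incidents per unit of exposure."""
    return harmful_incidents / exposure

# "Interstate highways are safer than county roads" amounts to a claim of this form:
interstate = incident_rate(harmful_incidents=120, exposure=1.5e9)  # per vehicle-mile
county = incident_rate(harmful_incidents=300, exposure=0.9e9)

print(f"interstate: {interstate:.2e} incidents per vehicle-mile")
print(f"county:     {county:.2e} incidents per vehicle-mile")
print("interstates safer" if interstate < county else "county roads safer")
```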

This analysis highlights two dimensions of industrial safety: the inherent capacity for creating harm associated with the technology and processes in use (heavy machinery, blasting, and uncertain tunnel stability in mining, in contrast to a computer and a red pencil in the editorial offices of a newspaper), and the processes and systems that are in place to guard against harm. The first set of factors is roughly “technological,” while the second set is social and organizational.

Variations in safety records across industries and across sites within a given industry provide an excellent tool for analyzing the effects of various institutional arrangements. It is often possible to pinpoint a crucial difference in organization — supervision, training, internal procedures, inspection protocols, etc. — that can account for a high accident rate in one factory and a low rate in an otherwise similar factory in a different state.

One of the most important findings of safety engineering is that organization and culture play critical roles in enhancing the safety characteristics of a given activity — that is to say, safety is strongly influenced by social factors that define and organize the behaviors of workers, users, or managers. (See Charles Perrow, Normal Accidents: Living with High-Risk Technologies and Nancy Leveson, Safeware: System Safety and Computers, for a couple of excellent treatments of the sociological dimensions of safety.)

This isn’t to say that only social factors can influence safety performance within an activity or industry. In fact, a central effort by safety engineers involves modifying the technology or process so as to remove the source of harm completely — what we might call “passive” safety. So, for example, if it is possible to design a nuclear reactor in such a way that a loss of coolant leads automatically to shutdown of the fission reaction, then we have designed out of the system the possibility of catastrophic meltdown and escape of radioactive material. This might be called “design for soft landings”.

However, most safety experts agree that the social and organizational characteristics of the dangerous activity are the most common causes of bad safety performance. Poor supervision and inspection of maintenance operations leads to mechanical failures, potentially harming workers or the public. A workplace culture that discourages disclosure of unsafe conditions makes the likelihood of accidental harm much greater. A communications system that permits ambiguous or unclear messages to occur can lead to air crashes and wrong-site surgeries.

This brings us at last to the point of this posting: the observation that safety data in a variety of industries and locations permit us to probe organizational features and their effects with quite a bit of precision. This is a place where institutions and organizations make a big difference in observable outcomes; safety is a consequence of a specific combination of technology, behaviors, and organizational practices. This is a good opportunity for combining comparative and statistical research methods in support of causal inquiry, and it invites us to probe for the social mechanisms that underlie the patterns of high or low safety performance that we discover.

Consider one example. Suppose we are interested in discovering some of the determinants of safety records in deep mining operations. We might approach the question from several points of view.

  • We might select five mines with “best in class” safety records and compare them in detail with five “worst in class” mines. Are there organizational or technology features that distinguish the cases?
  • We might do the large-N version of this study: examine a sample of mines from “best in class” and “worst in class” and test whether there are observed features that explain the differences in safety records. (For example, we may find that 75% of the former group but only 10% of the latter group are subject to frequent unannounced safety inspection. This supports the notion that inspections enhance safety; a sketch of such a test appears after this list.)
  • We might compare national records for mine safety in, say, Poland and Britain. We might then attempt to identify the general characteristics that describe mines in the two countries and attempt to explain observed differences in safety records on the basis of these characteristics. Possible candidates might include degree of regulatory authority, capital investment per mine, workers per mine, …
  • We might form a hypothesis about a factor that should be expected to enhance safety — a company-endorsed safety education program, let’s say — and then randomly assign a group of mines to “treated” and “untreated” groups and compare safety records. (This is a quasi-experiment; see an earlier posting for a discussion of this mode of reasoning.) If we find that the treated group differs significantly in average safety performance, this supports the claim that the treatment is causally relevant to the safety outcome.
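
Here is a minimal sketch of the large-N comparison suggested above. The counts are invented to match the hypothetical 75% versus 10% inspection figures, and the test shown (a two-proportion z-test) is just one reasonable choice among several.

```python
# Hypothetical sketch: do "best in class" and "worst in class" mines differ in how
# often they face unannounced safety inspections? Counts are invented to match the
# 75% vs. 10% figures used in the example above.
from math import sqrt, erfc

best_inspected, best_n = 30, 40     # 75% of a hypothetical best-in-class sample
worst_inspected, worst_n = 4, 40    # 10% of a hypothetical worst-in-class sample

p1 = best_inspected / best_n
p2 = worst_inspected / worst_n
pooled = (best_inspected + worst_inspected) / (best_n + worst_n)

# Two-proportion z-test (normal approximation), two-sided p-value
z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / best_n + 1 / worst_n))
p_value = erfc(abs(z) / sqrt(2))

print(f"difference in inspection rates: {p1 - p2:.2f}, z = {z:.2f}, p = {p_value:.2g}")
# A very small p-value is consistent with (though it does not by itself establish)
# the claim that frequent unannounced inspections are associated with better safety.
```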

Investigations along these lines can establish an empirical basis for judging that one or more organizational features A, B, C have consequences for safety performance. In order to be confident in these judgments, however, we need to supplement the empirical analysis with a theory of the mechanisms through which features like A, B, C influence behavior in such a way as to make accidents more or less likely.

Safety, then, seems to be a good area of investigation for researchers within the general framework of the new institutionalism, because the effects of institutional and organizational differences emerge as observable differences in the rates of accidents in comparable industrial settings. (See Mary Brinton and Victor Nee, The New Institutionalism in Sociology, for a collection of essays on this approach.)
