Organizations and dysfunction

A recurring theme in recent months in Understanding Society is organizational dysfunction and the organizational causes of technology failure. Helmut Anheier’s volume When Things Go Wrong: Organizational Failures and Breakdowns is highly relevant to this topic, and it makes for very interesting reading. The volume includes contributions by a number of leading scholars in the sociology of organizations.

And yet the volume seems to miss the mark in some important ways. For one thing, it is unduly focused on the question of “mortality” of firms and other organizations. Bankruptcy and organizational death are frequent synonyms for “failure” here. This frame is evident in the summary the introduction offers of existing approaches in the field: organizational aspects, political aspects, cognitive aspects, and structural aspects. All bring us back to the causes of extinction and bankruptcy in a business organization. Further, the approach highlights the importance of internal conflict within an organization as a source of eventual failure. But it gives no insight into the internal structure and workings of the organization itself, the ways in which behavior and internal structure function to systematically produce certain kinds of outcomes that we can identify as dysfunctional.

Significantly, however, dysfunction does not routinely lead to the death of a firm. (Seibel’s contribution to the volume raises this possibility, which he refers to as “successful failures”.) This is a familiar observation from political science: what looks dysfunctional from the outside may be perfectly well tuned to a different set of interests (for example, in Robert Bates’s account of pricing boards in Africa in Markets and States in Tropical Africa: The Political Basis of Agricultural Policies). In their introduction to the volume Anheier and Moulton refer to this possibility as a direction for future research: “successful for whom, a failure for whom?” (14).

The volume tends to look at success and failure in terms of profitability and the satisfaction of stakeholders. But we can define dysfunction in a more granular way by linking characteristics of performance to the perceived “purposes and goals” of the organization. A regulatory agency exists in order to effectively protect the health and safety of the public. In this kind of case, failure is any outcome in which the agency flagrantly and avoidably fails to prevent a serious harm — release of radioactive material, contamination of food, a building fire resulting from defects that should have been detected by inspection. If the agency performs this function less well than it could, then it is dysfunctional.

Why do dysfunctions persist in organizations? It is possible to identify several possible causes. The first is that a dysfunction from one point of view may well be a desirable feature from another point of view. The lack of an authoritative safety officer in a chemical plant may be thought to be dysfunctional if we are thinking about the safety of workers and the public as a primary goal of the plant (link). But if profitability and cost-savings are the primary goals from the point of view of the stakeholders, then the cost-benefit analysis may favor the lack of the safety officer.

Second, there may be internal failures within an organization that are beyond the reach of any executive or manager who might want to correct them. The complexity and loose-coupling of large organizations militate against house cleaning on a large scale.

Third, there may be powerful factions within an organization for whom the “dysfunctional” feature is an important component of their own set of purposes and goals. Fligstein and McAdam argue for this kind of disaggregation with their theory of strategic action fields (link). By disaggregating purposes and goals to the various actors who figure in the life cycle of the organization – founders, stakeholders, executives, managers, experts, frontline workers, labor organizers – it is possible to see the organization as a whole as simply the aggregation of the multiple actions and purposes of the actors within and adjacent to the organization. This aggregation does not imply that the organization is carefully adjusted to serve the public good or to maximize efficiency or to protect the health and safety of the public. Rather, it suggests that the resultant organizational structure serves the interests of the various actors to the fullest extent each actor is able to manage.

Consider the account Thomas Misa offers of the decline of the steel industry in the United States in the first part of the twentieth century in A Nation of Steel: The Making of Modern America, 1865-1925. Misa’s account seems to point to a massive dysfunction in the steel corporations of the inter-war period: a deliberate and sustained failure to invest in research on new steel technologies in metallurgy and production. Misa argues that the great steel corporations — US Steel in particular — failed to remain competitive in the early years of the twentieth century because management persistently pursued short-term profits and financial advantage through domination of the market, relying on that dominance rather than research and development as the company’s source of revenue and profits.

In short, U.S. Steel was big but not illegal. Its price leadership resulted from its complete dominance in the core markets for steel…. Indeed, many steelmakers had grown comfortable with U.S. Steel’s overriding policy of price and technical stability, which permitted them to create or develop markets where the combine chose not to compete, and they testified to the court in favor of the combine. The real price of stability … was the stifling of technological innovation. (255)

The result was that the modernized steel industries of Europe leap-frogged the earlier US advantage, eventually leaving the United States with unviable production technology.

At the periphery of the newest and most promising alloy steels, dismissive of continuous-sheet rolling, actively hostile to new structural shapes, a price leader but not a technical leader: this was U.S. Steel. What was the company doing with technological innovation? (257)

Misa is interested in arriving at a better way of understanding the imperatives leading to technical change — better than neoclassical economics and labor history. His solution highlights the changing relationships that developed between industrial consumers and producers in the steel industry.

We now possess a series of powerful insights into the dynamics of technology and social change. Together, these insights offer the realistic promise of being better able, if we choose, to modulate the complex process of technical change. We can now locate the range of sites for technical decision making, including private companies, trade organizations, engineering societies, and government agencies. We can suggest a typology of user-producer interactions, including centralized, multicentered, decentralized, and direct-consumer interactions, that will enable certain kinds of actions while constraining others. We can even suggest a range of activities that are likely to effect technical change, including standards setting, building and zoning codes, and government procurement. Furthermore, we can also suggest a range of strategies by which citizens supposedly on the “outside” may be able to influence decisions supposedly made on the “inside” about technical change, including credibility pressure, forced technology choice, and regulatory issues. (277-278)

In fact Misa places the dynamics of the relationship between producers and large industrial consumers at the center of the imperatives towards technological innovation:

In retrospect, what was wrong with U.S. Steel was not its size or even its market power but its policy of isolating itself from the new demands from users that might have spurred technical change. The resulting technological torpidity that doomed the industry was not primarily a matter of industrial concentration, outrageous behavior on the part of white- and blue-collar employees, or even dysfunctional relations among management, labor, and government. What went wrong was the industry’s relations with its consumers. (278)

This “callous treatment of consumers” was profoundly harmful when international competition gave large industrial users of steel a choice. When US Steel had market dominance, large industrial users had little choice; but this situation changed after WWII. “This favorable balance of trade eroded during the 1950s as German and Japanese steelmakers rebuilt their bombed-out plants with a new production technology, the basic oxygen furnace (BOF), which American steelmakers had dismissed as unproven and unworkable” (279). Misa quotes a president of a small steel producer: “The Big Steel companies tend to resist new technologies as long as they can … They only accept a new technology when they need it to survive” (280).

*****

Here is an interesting table from Misa’s book that sheds light on some of the economic and political history in the United States since the post-war period, leading right up to the populist politics of 2016 in the Midwest. This chart provides mute testimony to the decline of the rustbelt industrial cities. Michigan, Illinois, Ohio, Pennsylvania, and western New York account for 83% of the steel production on this table. When American producers lost the competitive battle for steel production in the 1980s, the Rustbelt suffered disproportionately, and eventually blue collar workers lost their places in the affluent economy.

Is corruption a social thing?

When we discuss the ontology of various aspects of the social world, we are often thinking of such things as institutions, organizations, social networks, value systems, and the like. These examples pick out features of the world that are relatively stable and functional. Where does an imperfection or dysfunction of social life like corruption fit into our social ontology?

We might say that “corruption” is a descriptive category that is aimed at capturing a particular range of behavior, like stealing, gossiping, or asceticism. This makes corruption a kind of individual behavior, or even a characteristic of some individuals. “Mayor X is corrupt.”

This initial effort does not seem satisfactory, however. The idea of corruption is tied to institutions, roles, and rules in a very direct way, and therefore we cannot really present the concept accurately without articulating these institutional features of the concept of corruption. Corruption might be paraphrased in these terms:

  • Individual X plays a role Y in institution Z; role Y prescribes honest and impersonal performance of duties; individual X accepts private benefits to take actions that are contrary to the prescriptions of Y. In virtue of these facts X behaves corruptly.

Corruption, then, involves actions taken by officials that deviate from the rules governing their role, in order to receive private benefits from the subjects of those actions. Absent the rules and role, corruption cannot exist. So corruption is a feature that presupposes certain social facts about institutions. (Perhaps there is a link to Searle’s social ontology here; link.)

We might consider that corruption is analogous to friction in physical systems. Friction is a factor that affects the performance of virtually all mechanical systems, but that is a second-order factor within classical mechanics. And it is possible to give mechanical explanations of the ubiquity of friction, in terms of the geometry of adjoining physical surfaces, the strength of inter-molecular attractions, and the like. Analogously, we can offer theories of the frequency with which corruption occurs in organizations, public and private, in terms of the interests and decision-making frameworks of variously situated actors (e.g. real estate developers, land value assessors, tax assessors, zoning authorities …). Developers have a business interest in favorable rulings from assessors and zoning authorities; some officials have an interest in accepting gifts and favors to increase personal income and wealth; each makes an estimate of the likelihood of detection and punishment; and a certain rate of corrupt exchanges is the result.
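
To make this line of thought concrete, here is a minimal sketch in Python of the decision calculus just described: each official weighs the private benefit of a corrupt exchange against the expected penalty (probability of detection times sanction, discounted by the official's attitude toward risk). The function and every parameter value are invented for illustration; this is not an empirical model of corruption.

```python
# Minimal sketch of the decision calculus described above; all numbers are
# illustrative assumptions, not estimates of any real corruption process.
import random

def corruption_rate(n_officials, bribe, detection_prob, penalty, seed=0):
    """Return the fraction of officials who accept a corrupt exchange."""
    rng = random.Random(seed)
    accept = 0
    for _ in range(n_officials):
        # Officials differ in how heavily they weigh the penalty (risk attitude).
        risk_weight = rng.uniform(0.5, 1.5)
        expected_cost = detection_prob * penalty * risk_weight
        if bribe > expected_cost:
            accept += 1
    return accept / n_officials

# Strengthening oversight (a higher detection probability) lowers the rate of
# corrupt exchanges even though the actors' interests are unchanged.
print(corruption_rate(10_000, bribe=12.0, detection_prob=0.1, penalty=100.0))
print(corruption_rate(10_000, bribe=12.0, detection_prob=0.2, penalty=100.0))
```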

This line of thought once again makes corruption a feature of the actors and their calculations. But it is important to note that organizations themselves have features that make corrupt exchanges either more likely or less likely (link, link). Some organizations are corruption-resistant, while others are corruption-neutral or corruption-enhancing. These features include internal accounting and auditing procedures; whistle-blowing practices; executive and supervisor vigilance; and other organizational features. Further, governments and systems of law can make arrangements that discourage corruption; the incidence of corruption is influenced by public policy. For example, legal requirements on transparency in financial practices by firms, investment in investigatory resources in oversight agencies, and weighty penalties for companies found guilty of corrupt practices can all affect the incidence of corruption. (Robert Klitgaard’s treatment of corruption is relevant here; he provides careful analysis of some of the institutional and governmental measures that can be taken to discourage corrupt practices; link, link. And there are cross-country indices of corruption (e.g. Transparency International) that demonstrate the causal effectiveness of anti-corruption measures at the state level. Finland, Norway, and Switzerland rank well on the Transparency International index.)

So — is corruption a thing? Does corruption need to be included in a social ontology? Does a realist ontology of government and business organization have a place for corruption? Yes, yes, and yes. Corruption is a real property of individual actors’ behavior, observable in social life. It is a consequence of strategic rationality by various actors. Corruption is a social practice with its own supporting or inhibiting culture. Some organizations effectively espouse a core set of values of honesty and correct performance that make corruption less frequent. And vulnerability to corruption is a feature of the design of an organization or bureau, analogous to “mean-time-between-failure” as a feature of a mechanical design. Organizations can adopt institutional protections and cultural commitments that minimize corrupt behavior, while other organizations fail to do so and thereby encourage corrupt behavior. So “corruption-vulnerability” is a real feature of organizations and corruption has a social reality.

System effects

Quite a few posts here have focused on the question of emergence in social ontology, the idea that there are causal processes and powers at work at the level of social entities that do not correspond to similar properties at the individual level. Here I want to raise a related question, the notion that an important aspect of the workings of the social world derives from “system effects” of the organizations and institutions through which social life transpires. A system accident or effect is one that derives importantly from the organization and configuration of the system itself, rather than the specific properties of the units.

What are some examples of system effects? Consider these phenomena:

  • Flash crashes in stock markets as a result of automated trading
  • Under-reporting of land values in agrarian fiscal regimes 
  • Grade inflation in elite universities 
  • Increase in product defect frequency following a reduction in inspections 
  • Rising frequency of industrial errors at the end of work shifts 

Here is how Nancy Leveson describes systems causation in Engineering a Safer World: Systems Thinking Applied to Safety:

Safety approaches based on systems theory consider accidents as arising from the interactions among system components and usually do not specify single causal variables or factors. Whereas industrial (occupational) safety models and event chain models focus on unsafe acts or conditions, classic system safety models instead look at what went wrong with the system’s operation or organization to allow the accident to take place. (KL 977)

Charles Perrow offers a taxonomy of systems as a hierarchy of composition in Normal Accidents: Living with High-Risk Technologies:

Consider a nuclear plant as the system. A part will be the first level — say a valve. This is the smallest component of the system that is likely to be identified in analyzing an accident. A functionally related collection of parts, as, for example, those that make up the steam generator, will be called a unit, the second level. An array of units, such as the steam generator and the water return system that includes the condensate polishers and associated motors, pumps, and piping, will make up a subsystem, in this case the secondary cooling system. This is the third level. A nuclear plant has around two dozen subsystems under this rough scheme. They all come together in the fourth level, the nuclear plant or system. Beyond this is the environment. (65)

Large socioeconomic systems like capitalism and collectivized socialism have system effects — chronic patterns of low productivity and corruption in the latter case, a tendency to inequality and immiseration in the former case. In each case the observed effect is the result of embedded features of property and labor in the two systems that result in specific kinds of outcomes. And an important dimension of social analysis is to uncover the ways in which ordinary actors, pursuing ordinary goals within the context of the two systems, lead to quite different outcomes at the level of the “mode of production”. And these effects do not depend on there being a distinctive kind of actor in each system; in fact, one could interchange the actors and still find the same macro-level outcomes.

Here is a preliminary effort at a definition for this concept in application to social organizations:

A system effect is an outcome that derives from the embedded characteristics of incentive and opportunity within a social arrangement, which lead normal actors to behave in ways that produce the aggregate effect in question.

Once we see what the incentive and opportunity structures are, we can readily see why some fraction of actors modify their behavior in ways that lead to the outcome. In this respect the system is the salient causal factor rather than the specific properties of the actors — change the system properties and you will change the social outcome.
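
Here is a toy simulation in Python of one of the examples listed above, the rise in product-defect frequency when inspections are cut back. Every number in it (worker error rates, inspection coverage, detection probability) is an invented assumption; the point is only that the aggregate outcome tracks the system parameter rather than the identity or character of the individual workers.

```python
# Toy illustration of a system effect: the shipped-defect rate responds to the
# system parameter (inspection coverage), not to which workers happen to be
# present. All parameter values are invented for illustration.
import random

def shipped_defect_rate(inspection_coverage, n_items=100_000, seed=1):
    """Fraction of shipped items that are defective."""
    rng = random.Random(seed)
    shipped_defects = 0
    for _ in range(n_items):
        # Workers are interchangeable: each has a similar, modest error rate.
        worker_error_rate = rng.uniform(0.01, 0.03)
        defective = rng.random() < worker_error_rate
        inspected = rng.random() < inspection_coverage
        caught = defective and inspected and rng.random() < 0.9  # 90% detection
        if defective and not caught:
            shipped_defects += 1
    return shipped_defects / n_items

# Same workforce, different system parameter: the aggregate outcome shifts.
print(shipped_defect_rate(inspection_coverage=0.9))  # roughly 0.4% shipped defects
print(shipped_defect_rate(inspection_coverage=0.3))  # roughly 1.5% shipped defects
```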

 

When we refer to system effects we often have unintended consequences in mind — unintended both by the individual actors and the architects of the organization or practice. But this is not essential; we can also think of examples of organizational arrangements that were deliberately chosen or designed to bring about the given outcome. In particular, a given system effect may be intended by the designer and unintended by the individual actors. But when the outcomes in question are clearly dysfunctional or “catastrophic”, it is natural to assume that they are unintended. (This, however, is one of the specific areas of insight that comes out of the new institutionalism: the dysfunctional outcome may be favorable for some sets of actors even as it is unfavorable for the workings of the system as a whole.)

 
Another common assumption about system effects is that they are remarkably stable through changes of actors and efforts to reverse the given outcome. In this sense they are thought to be somewhat beyond the control of the individuals who make up the system. The only promising way of undoing the effect is to change the incentives and opportunities that bring it about. But to the extent that a given configuration has emerged along with supporting mechanisms protecting it from deformation, changing the configuration may be frustratingly difficult.

Safety and its converse are often described as system effects. Two things are usually meant by this. First, there is the important insight that traditional accident analysis favors “unit failure” at the expense of more systemic factors. And second, there is the idea that accidents and failures often result from “tightly coupled” features of systems, both social and technical, in which variation in one component of a system can have unexpected consequences for the operation of other components. (Charles Perrow describes loose and tight coupling in social systems in Normal Accidents, 89 ff.)

Philosophy and the study of technology failure

image: Adolf von Menzel, The Iron Rolling Mill (Modern Cyclopes)
 

Readers may have noticed that my current research interests have to do with organizational dysfunction and largescale technology failures. I am interested in probing the ways in which organizational failures and dysfunctions have contributed to large accidents like Bhopal, Fukushima, and the Deepwater Horizon disaster. I’ve had to confront an important question in taking on this research interest: what can philosophy bring to the topic that would not be better handled by engineers, organizational specialists, or public policy experts?

One answer is the diversity of viewpoint that a philosopher can bring to the discussion. It is evident that technology failures invite analysis from all of these specialized experts, and more. But there is room for productive contribution from reflective observers who are not committed to any of these disciplines. Philosophers have a long history of taking on big topics outside the defined canon of “philosophical problems”, and often those engagements have proven fruitful. In this particular instance, philosophy can look at organizations and technology in a way that is more likely to be interdisciplinary, and perhaps can help to see dimensions of the problem that are less apparent from a purely disciplinary perspective.

There is also a rationale based on the terrain of the philosophy of science. Philosophers of biology have usually attempted to learn as much about the science of biology as they can manage, but they lack the level of expertise of a research biologist, and it is rare for a philosopher to make an original contribution to the scientific biological literature. Nonetheless it is clear that philosophers have a great deal to add to scientific research in biology. They can contribute to better reasoning about the implications of various theories, they can probe the assumptions about confirmation and explanation that are in use, and they can help to clarify important conceptual disagreements. Biology is in a better state because of the work of philosophers like David Hull and Elliott Sober.

Philosophers have also made valuable contributions to science and technology studies, bringing a viewpoint that incorporates insights from the philosophy of science and a sensitivity to the social groundedness of technology. STS studies have proven to be a fruitful place for interaction between historians, sociologists, and philosophers. Here again, the concrete study of the causes and context of large technology failure may be assisted by a philosophical perspective.

There is also a normative dimension to these questions about technology failure for which philosophy is well prepared. Accidents hurt people, and sometimes the causes of accidents involve culpable behavior by individuals and corporations. Philosophers have a long history of contribution to these kinds of problems of fault, law, and just management of risks and harms.

Finally, it is realistic to say that philosophy has an ability to contribute to social theory. Philosophers can offer imagination and critical attention to the problem of creating new conceptual schemes for understanding the social world. This capacity seems relevant to the problem of describing, analyzing, and explaining largescale failures and disasters.

The situation of organizational studies and accidents is in some ways more hospitable for contributions by a philosopher than other “wicked problems” in the world around us. An accident is complicated and complex but not particularly obscure. The field is unlike quantum mechanics or climate dynamics, which are inherently difficult for non-specialists to understand. The challenge with accidents is to identify a multi-layered analysis of the causes of the accident that permits observers to have a balanced and operative understanding of the event. And this is a situation where the philosopher’s perspective is most useful. We can offer higher-level descriptions of the relative importance of different kinds of causal factors. Perhaps the role here is analogous to messenger RNA, providing a cross-disciplinary kind of communications flow. Or it is analogous to the role of philosophers of history who have offered gentle critique of the cliometrics school for its over-dependence on a purely statistical approach to economic history.

So it seems reasonable enough for a philosopher to attempt to contribute to this set of topics, even if the disciplinary expertise a philosopher brings is more weighted towards conceptual and theoretical discussions than undertaking original empirical research in the domain.

What I expect to be the central finding of this research is the idea that a pervasive and often unrecognized cause of accidents is a systemic organizational defect of some sort, and that it is enormously important to have a better understanding of common forms of these deficiencies. This is a bit analogous to a paradigm shift in the study of accidents. And this view has important policy implications. We can make disasters less frequent by improving the organizations through which technology processes are designed and managed.

System safety

An ongoing thread of posts here is concerned with organizational causes of large technology failures. The driving idea is that failures, accidents, and disasters usually have a dimension of organizational causation behind them. The corporation, research office, shop floor, supervisory system, intra-organizational information flow, and other social elements often play a key role in the occurrence of a gas plant fire, a nuclear power plant malfunction, or a military disaster. There is a tendency to look first and foremost for one or more individuals who made a mistake in order to explain the occurrence of an accident or technology failure; but researchers such as Perrow, Vaughan, Tierney, and Hopkins have demonstrated in detail the importance of broadening the lens to seek out the social and organizational background of an accident.

It seems important to distinguish between system flaws and organizational dysfunction in considering all of the kinds of accidents mentioned here. We might specify system safety along these lines. Any complex process has the potential for malfunction. Good system design means creating a flow of events and processes that make accidents inherently less likely. Part of the task of the designer and engineer is to identify chief sources of harm inherent in the process — release of energy, contamination of food or drugs, unplanned fission in a nuclear plant — and design fail-safe processes so that these events are as unlikely as possible. Further, given the complexity of contemporary technology systems, it is critical to anticipate unintended interactions among subsystems — each of which may be functioning correctly even as their interaction leads to disaster in unusual but possible scenarios.

In a nuclear processing plant, for example, there is the hazard of radioactive materials being brought into proximity with each other in a way that creates unintended critical mass. Jim Mahaffey’s Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima offers numerous examples of such unintended events, from the careless handling of plutonium scrap in a machining process to the transfer of a fissionable liquid from a vessel of one shape to another. We might try to handle these risks as an organizational problem: more and better training for operatives about the importance of handling nuclear materials according to established protocols, and effective supervision and oversight to ensure that the protocols are observed on a regular basis. But it is also possible to design the material processes within a nuclear plant in a way that makes unintended criticality virtually impossible — for example, by storing radioactive solutions in containers that simply cannot be brought into close proximity with each other.
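
The difference between the two strategies can be made concrete with a schematic example. The sketch below is my own illustration, with invented limits that are not real criticality-safety values: it encodes the constraint in the design itself, so that a proposed storage layout is rejected if any vessel exceeds a “favorable geometry” volume or if two vessels sit closer than a minimum separation, rather than relying on training and supervision to keep operators from bringing them together.

```python
# Schematic sketch of an "inherently safe design" check: the safety constraint
# is validated against the physical layout rather than enforced procedurally.
# The limits and container data are invented placeholders, not real
# criticality-safety values.
from dataclasses import dataclass
from math import dist

MAX_CONTAINER_LITERS = 5.0    # hypothetical favorable-geometry volume limit
MIN_SEPARATION_METERS = 3.0   # hypothetical spacing rule between containers

@dataclass
class Container:
    name: str
    x: float
    y: float
    liters: float

def layout_violations(containers):
    """Return a list of design-rule violations for a proposed storage layout."""
    problems = []
    for c in containers:
        if c.liters > MAX_CONTAINER_LITERS:
            problems.append(f"{c.name}: volume {c.liters} L exceeds the limit")
    for i, a in enumerate(containers):
        for b in containers[i + 1:]:
            if dist((a.x, a.y), (b.x, b.y)) < MIN_SEPARATION_METERS:
                problems.append(f"{a.name} and {b.name} are too close together")
    return problems

layout = [Container("tank-A", 0.0, 0.0, 4.0),
          Container("tank-B", 1.5, 0.0, 4.0)]  # violates the spacing rule
print(layout_violations(layout))
```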

Nancy Leveson is a national expert on defining and applying principles of system safety. Her book Engineering a Safer World: Systems Thinking Applied to Safety is a thorough treatment of her thinking about this subject. She offers a handful of compelling reasons for believing that safety is a system-level characteristic that requires a systems approach: the fast pace of technological change, reduced ability to learn from experience, the changing nature of accidents, new types of hazards, increasing complexity and coupling, decreasing tolerance for single accidents, difficulty in selecting priorities and making tradeoffs, more complex relationships between humans and automation, and changing regulatory and public view of safety (kl 130 ff.). Particularly important in this list is the comment about complexity and coupling: “The operation of some systems is so complex that it defies the understanding of all but a few experts, and sometimes even they have incomplete information about the system’s potential behavior” (kl 137).

Given the fact that safety and accidents are products of whole systems, she is critical of the accident methodology generally applied to serious industrial, aerospace, and chemical accidents. This methodology involves tracing the series of events that led to the outcome, and identifying one or more events as the critical cause of the accident. However, she writes:

In general, event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry. An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events. A narrow focus on technological components and pure engineering activities or a similar narrow focus on operator errors may lead to ignoring some of the most important factors in terms of preventing future accidents. (kl 452)

Here is a definition of system safety offered later in ESW in her discussion of the emergence of the concept within the defense and aerospace fields in the 1960s:

System Safety … is a subdiscipline of system engineering. It was created at the same time and for the same reasons. The defense community tried using the standard safety engineering techniques on their complex new systems, but the limitations became clear when interface and component interaction problems went unnoticed until it was too late, resulting in many losses and near misses. When these early aerospace accidents were investigated, the causes of a large percentage of them were traced to deficiencies in design, operations, and management. Clearly, big changes were needed. System engineering along with its subdiscipline, System Safety, were developed to tackle these problems. (kl 1007)

Here Leveson mixes system design and organizational dysfunctions as system-level causes of accidents. But much of her work in this book and her earlier Safeware: System Safety and Computers gives extensive attention to the design faults and component interactions that lead to accidents — what we might call system safety in the narrow or technical sense.

A systems engineering approach to safety starts with the basic assumption that some properties of systems, in this case safety, can only be treated adequately in the context of the social and technical system as a whole. A basic assumption of systems engineering is that optimization of individual components or subsystems will not in general lead to a system optimum; in fact, improvement of a particular subsystem may actually worsen the overall system performance because of complex, nonlinear interactions among the components. (kl 1007) 

Overall, then, it seems clear that Leveson believes that both organizational features and technical system characteristics are part of the systems that created the possibility for accidents like Bhopal, Fukushima, and Three Mile Island. Her own accident model, STAMP (Systems-Theoretic Accident Model and Processes), designed to help identify the causes of accidents, emphasizes both kinds of system properties.

Using this new causality model … changes the emphasis in system safety from preventing failures to enforcing behavioral safety constraints. Component failure accidents are still included, but our conception of causality is extended to include component interaction accidents. Safety is reformulated as a control problem rather than a reliability problem. (kl 1062)

In this framework, understanding why an accident occurred requires determining why the control was ineffective. Preventing future accidents requires shifting from a focus on preventing failures to the broader goal of designing and implementing controls that will enforce the necessary constraints. (kl 1084)
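
Here is a toy sketch in the spirit of this reformulation (not Leveson’s STAMP formalism itself) of what it means to treat safety as a control problem: a controller is assigned responsibility for enforcing a safety constraint on a process variable, and the loss event is defined as a violation of that constraint rather than as the failure of any single component. The names and numbers are invented.

```python
# Toy sketch of safety as a control problem (not Leveson's STAMP notation):
# the accident is a violated constraint, which can occur through degraded
# feedback even when every component "works". All values are invented.

SAFE_PRESSURE_LIMIT = 100.0  # the safety constraint to be enforced

def run_process(steps, sensor_bias=0.0, relief_rate=8.0):
    """Simulate a pressurized process under feedback control.

    sensor_bias models degraded feedback (say, an out-of-calibration gauge);
    relief_rate models the authority of the control action.
    """
    pressure = 80.0
    for _ in range(steps):
        pressure += 5.0                    # routine process disturbance
        measured = pressure + sensor_bias  # what the controller actually sees
        if measured > 90.0:                # controller's action rule
            pressure -= relief_rate        # control action: vent pressure
        if pressure > SAFE_PRESSURE_LIMIT:
            return "constraint violated (loss event)"
    return "constraint enforced"

print(run_process(200))                     # adequate control: constraint holds
print(run_process(200, sensor_bias=-20.0))  # degraded feedback defeats control
```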

Leveson’s brief analysis of the Bhopal disaster in 1984 (kl 384 ff.) emphasizes the organizational dysfunctions that led to the accident — and that were completely ignored by the Indian state’s accident investigation: out-of-service gauges, alarm deficiencies, inadequate response to prior safety audits, shortage of oxygen masks, failure to inform the police or surrounding community of the accident, and an environment of cost cutting that impaired maintenance and staffing. “When all the factors, including indirect and systemic ones, are considered, it becomes clear that the maintenance worker was, in fact, only a minor and somewhat irrelevant player in the loss. Instead, degradation in the safety margin occurred over time and without any particular single decision to do so but simply as a series of decisions that moved the plant slowly toward a situation where any slight error would lead to a major accident” (kl 447).

What the boss wants to hear …

According to David Halberstam in his outstanding history of the war in Vietnam, The Best and the Brightest, a prime cause of disastrous decision-making by Presidents Kennedy and Johnson was an institutional imperative in the Defense Department to come up with a set of facts that conformed to what the President wanted to hear. Robert McNamara and McGeorge Bundy were among the highest-level miscreants in Halberstam’s account; they were determined to craft an assessment of the situation on the ground in Vietnam that conformed best with their strategic advice to the President.

Ironically, a very similar dynamic led to one of modern China’s greatest disasters, the Great Leap Forward famine in 1959. The Great Helmsman was certain that collective agriculture would be vastly more productive than private agriculture; and following the collectivization of agriculture, party officials in many provinces obliged this assumption by reporting inflated grain statistics throughout 1958 and 1959. The result was a famine that led to at least twenty million excess deaths during a two-year period as the central state shifted resources away from agriculture (Frank Dikötter, Mao’s Great Famine: The History of China’s Most Devastating Catastrophe, 1958-62).

More mundane examples are available as well. When information about possible sexual harassment in a given department is suppressed because “it won’t look good for the organization” and “the boss will be unhappy”, the organization is on a collision course with serious problems. When concerns about product safety or reliability are suppressed within the organization for similar reasons, the results can be equally damaging, to consumers and to the corporation itself. General Motors, Volkswagen, and Michigan State University all seem to have suffered from these deficiencies of organizational behavior. This is a serious cause of organizational mistakes and failures. It is impossible to make wise decisions — individual or collective — without accurate and truthful information from the field. And yet the knowledge of higher-level executives depends upon the truthful and full reporting of subordinates, who sometimes have career incentives that work against honesty.

So how can this unhappy situation be avoided? Part of the answer has to do with the behavior of the leaders themselves. It is important for leaders to explicitly and implicitly invite the truth — whether it is good news or bad news. Subordinates must be encouraged to be forthcoming and truthful; and bearers of bad news must not be subject to retaliation. Boards of directors, both private and public, need to make clear their own expectations on this score as well: that they expect leading executives to invite and welcome truthful reporting, and that they expect individuals throughout the organization to provide truthful reporting. A culture of honesty and transparency is a powerful antidote to the disease of fabrications to please the boss.

Anonymous hotlines and formal protection of whistle-blowers are other institutional arrangements that lead to greater honesty and transparency within an organization. These avenues have the advantage of being largely outside the control of the upper executives, and therefore can serve as a somewhat independent check on dishonest reporting.

A reliable practice of accountability is also a deterrent to dishonest or partial reporting within an organization. The truth eventually comes out — whether about sexual harassment, about hidden defects in a product, or about workplace safety failures. When boards of directors and organizational policies make it clear that there will be negative consequences for dishonest behavior, this gives an ongoing incentive of prudence for individuals to honor their duties of honesty within the organization.

This topic falls within the broader question of how individual behavior throughout an organization has the potential for giving rise to important failures that harm the public and harm the organization itself. 


Empowering the safety officer?

How can industries involving processes that create large risks of harm for individuals or populations be modified so they are more capable of detecting and eliminating the precursors of harmful accidents? How can nuclear accidents, aviation crashes, chemical plant explosions, and medical errors be reduced, given that each of these activities involves large bureaucratic organizations conducting complex operations and with substantial inter-system linkages? How can organizations be reformed to enhance safety and to minimize the likelihood of harmful accidents?

One of the lessons learned from the Challenger space shuttle disaster is the importance of a strongly empowered safety officer in organizations that deal in high-risk activities. This means the creation of a position dedicated to ensuring safe operations that falls outside the normal chain of command. The idea is that the normal decision-making hierarchy of a large organization has a built-in tendency to maintain production schedules and avoid costly delays. In other words, there is a built-in incentive to treat safety issues with lower priority than most people would expect.

If there had been an empowered safety officer in the launch hierarchy for the Challenger launch in 1986, there is a good chance this officer would have listened more carefully to the Morton-Thiokol engineering team’s concerns about low temperature damage to O-rings and would have ordered a halt to the launch sequence until temperatures in Florida rose to the critical value. The Rogers Commission faulted the decision-making process leading to the launch decision in its final report on the accident (The Report of the Presidential Commission on the Space Shuttle Challenger Accident – The Tragedy of Mission 51-L in 1986 – Volume One, Volume Two, Volume Three).

This approach is productive because empowering a safety officer creates a different set of interests in the management of a risky process. The safety officer’s interest is in safety, whereas other decision makers are concerned about revenues and costs, public relations, reputation, and other instrumental goods. So a dedicated safety officer is empowered to raise safety concerns that other officers might be hesitant to raise. Ordinary bureaucratic incentives may lead to underestimating risks or concealing faults; so lowering the accident rate requires giving some individuals the incentive and power to act effectively to reduce risks.

Similar findings have emerged in the study of medical and hospital errors. It has been recognized that high-risk activities are made less risky by empowering all members of the team to call a halt in an activity when they perceive a safety issue. When all members of the surgical team are empowered to halt a procedure when they note an apparent error, serious operating-room errors are reduced. (Here is a report from the American College of Obstetricians and Gynecologists on surgical patient safety; link. And here is a 1999 National Academy report on medical error; link.)

The effectiveness of a team-based approach to safety depends on one central fact. There is a high level of expertise embodied in the staff operating a surgical suite, an engineering laboratory, or a drug manufacturing facility. Empowering these individuals to stop a procedure when they judge that an unrecognized error is in play greatly extends the amount of embodied knowledge brought to bear on the process. The surgeon, the commanding officer, or the lab director is no longer the sole expert whose judgments count.

But it also seems clear that these innovations don’t work equally well in all circumstances. Take nuclear power plant operations. In Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima James Mahaffey documents multiple examples of nuclear accidents that resulted from the efforts of mid-level workers to address an emerging problem in an improvised way. In the case of nuclear power plant safety, it appears that the best prescription for safety is to insist on rigid adherence to pre-established protocols. In this case the function of a safety officer is to monitor operations to ensure protocol conformance — not to exercise independent judgment about the best way to respond to an unfavorable reactor event.

It is in fact an interesting exercise to try to identify the kinds of operations in which these innovations are likely to be effective.

Here is a fascinating interview in Slate with Jim Bagian, a former astronaut, one-time director of the Veterans Administration’s National Center for Patient Safety, and distinguished safety expert; link. Bagian emphasizes the importance of taking a system-based approach to safety. Rather than focusing on finding blame for specific individuals whose actions led to an accident, he emphasizes tracing back to the institutional, organizational, or logistic background of the accident. What can be changed in the process — of delivering medications to patients, of fueling a rocket, or of moving nuclear solutions around in a laboratory — that makes the likelihood of an accident substantially lower?

The safety principles involved here seem fairly simple: cultivate a culture in which errors and near-misses are reported and investigated without blame; empower individuals within risky processes to halt the process if their expertise and experience indicate the possibility of a significant risky error; create roles within organizations whose occupants’ interests are defined in terms of the identification and resolution of unsafe practices or conditions; and share information about safety within the industry and with the public.

Mechanisms, singular and general

Let’s think again about the semantics of causal ascriptions. Suppose that we want to know what caused a building crane to collapse during a windstorm. We might arrive at an account something like this:

  • An unusually heavy gust of wind at 3:20 pm, in the presence of this crane’s specific material and structural properties, with the occurrence of the operator’s effort to adjust the crane’s extension at 3:21 pm, brought about cascading failures of structural elements of the crane, leading to collapse at 3:25 pm.

The process described here proceeds from the “gust of wind striking the crane” through an account of the material and structural properties of the device, incorporating the untimely effort by the operator to readjust the device’s extension, leading to a cascade from small failures to a large failure. And we can identify the features of causal necessity that were operative at the several links of the chain.

Notice that there are few causal regularities or necessary and constant conjunctions in this account. Wind does not usually bring about the collapse of cranes; if the operator’s intervention had occurred a few minutes earlier or later, perhaps the failure would not have occurred; and small failures do not always lead to large failures. Nonetheless, in the circumstances described here there is causal necessity extending from the antecedent situation at 3:15 pm to the full catastrophic collapse at 3:25 pm.

Does this narrative identify a causal mechanism? Are we better off describing this as a series of cause-effect sequences, none of which represents a causal mechanism per se? Or, on the contrary, can we look at the whole sequence as a single causal mechanism — though one that is never to be repeated? Does a causal mechanism need to be a recurring and robust chain of events, or can it be a highly unique and contingent chain?

Most mechanisms theorists insist on a degree of repeatability in the sequences that they describe as “mechanisms”. A causal mechanism is the triggering pathway through which one event leads to the production of another event in a range of circumstances in an environment. Fundamentally a causal mechanism is a “molecule” of causal process which can recur in a range of different social settings.

For example:

  • X typically brings about O.

Whenever an event of type X occurs with the appropriate timing and circumstances, the outcome O is produced. The ensemble of events {X, O} is a single mechanism.

And here is the crucial point: to call this a mechanism requires that this sequence recurs in multiple instances across a range of background conditions.

This suggests an answer to the question about the collapsing crane: the sequence from gust to operator error to crane collapse is not a mechanism, but is rather a unique causal sequence. Each part of the sequence has a causal explanation available; each conveys a form of causal necessity in the circumstances. But the aggregation of these cause-effect connections falls short of constituting a causal mechanism because the circumstances in which it works are all but unique. A satisfactory causal explanation of the internal cause-effect pairs will refer to real repeatable mechanisms — for example, “twisting a steel frame leads to a loss of support strength”. But the concatenation does not add up to another, more complex, mechanism.

Contrast this with “stuck valve” accidents in nuclear power reactors. Valves control the flow of cooling fluids around the critical fuel. If the fuel is deprived of coolant it rapidly overheats and melts. A “stuck valve-loss of fluid-critical overheating” sequence is a recognized mechanism of nuclear meltdown, and has been observed in a range of nuclear-plant crises. It is therefore appropriate to describe this sequence as a genuine causal mechanism in the creation of a nuclear plant failure.
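
The contrast can be put schematically. In the sketch below (an illustration of the distinction, not a tool of accident analysis, with invented event labels), a recognized mechanism is represented as a repeatable template of event types, and a concrete accident narrative either instantiates that template or remains a singular, unrepeated sequence.

```python
# Sketch of the contrast between a repeatable mechanism (a template of event
# types) and a singular causal sequence. Event labels are invented.

STUCK_VALVE_MECHANISM = ["valve stuck", "loss of coolant flow", "fuel overheating"]

def instantiates(mechanism, event_trace):
    """True if the mechanism's event types occur, in order, within the trace."""
    position = 0
    for event in event_trace:
        if position < len(mechanism) and event == mechanism[position]:
            position += 1
    return position == len(mechanism)

reactor_trace = ["maintenance error", "valve stuck",
                 "loss of coolant flow", "fuel overheating"]
crane_trace = ["wind gust", "operator adjusts extension",
               "member buckles", "cascading structural failure"]

print(instantiates(STUCK_VALVE_MECHANISM, reactor_trace))  # True: template recurs
print(instantiates(STUCK_VALVE_MECHANISM, crane_trace))    # False: singular chain
```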

(Stuart Glennan takes up a similar question in “Singular and General Causal Relations: A Mechanist Perspective”; link.)

Nuclear accidents

 
diagrams: Chernobyl reactor before and after
 

Nuclear fission is one of the world-changing discoveries of the mid-twentieth century. The atomic bomb projects of the United States led to the atomic bombing of Japan in August 1945, and the hope for limitless electricity brought about the proliferation of a variety of nuclear reactors around the world in the decades following World War II. And, of course, nuclear weapons proliferated to other countries beyond the original circle of atomic powers.

Given the enormous energies associated with fission and the dangerous and toxic properties of radioactive components of fission processes, the possibility of a nuclear accident is a particularly frightening one for the modern public. The world has seen the results of several massive nuclear accidents — Chernobyl and Fukushima in particular — and the devastating results they have had on human populations and the social and economic wellbeing of the regions in which they occurred.

Safety is therefore a paramount priority in the nuclear industry, in research labs as well as in military and civilian applications. So what is the situation of safety in the nuclear sector? Jim Mahaffey’s Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima is a detailed and carefully researched attempt to answer this question. And the information he provides is not reassuring. Beyond the celebrated and well-known disasters at nuclear power plants (Three Mile Island, Chernobyl, Fukushima), Mahaffey refers to hundreds of accidents involving reactors, research laboratories, weapons plants, and deployed nuclear weapons that have received much less public attention. These accidents resulted in a very low number of lives lost, but their frequency is alarming. They are indeed “normal accidents” (Perrow, Normal Accidents: Living with High-Risk Technologies). For example:

  • a Japanese fishing boat is contaminated by fallout from Castle Bravo test of hydrogen bomb; lots of radioactive fish at the markets in Japan (March 1, 1954) (kl 1706)
  • one MK-6 atomic bomb is dropped on Mars Bluff, South Carolina, after a crew member accidentally pulled the emergency bomb release handle (February 5, 1958) (kl 5774)
  • Fermi 1 liquid sodium plutonium breeder reactor experiences fuel meltdown during startup trials near Detroit (October 4, 1966) (kl 4127)

Mahaffey also provides detailed accounts of the most serious nuclear accidents and meltdowns of the past forty years: Three Mile Island, Chernobyl, and Fukushima.

The safety and control of nuclear weapons is of particular interest. Here is Mahaffey’s summary of “Broken Arrow” events — the loss of atomic and fusion weapons:

Did the Air Force ever lose an A-bomb, or did they just misplace a few of them for a short time? Did they ever drop anything that could be picked up by someone else and used against us? Is humanity going to perish because of poisonous plutonium spread that was snapped up by the wrong people after being somehow misplaced? Several examples will follow. You be the judge. 

Chuck Hansen [U.S. Nuclear Weapons – The Secret History] was wrong about one thing. He counted thirty-two “Broken Arrow” accidents. There are now sixty-five documented incidents in which nuclear weapons owned by the United States were lost, destroyed, or damaged between 1945 and 1989. These bombs and warheads, which contain hundreds of pounds of high explosive, have been abused in a wide range of unfortunate events. They have been accidentally dropped from high altitude, dropped from low altitude, crashed through the bomb bay doors while standing on the runway, tumbled off a fork lift, escaped from a chain hoist, and rolled off an aircraft carrier into the ocean. Bombs have been abandoned at the bottom of a test shaft, left buried in a crater, and lost in the mud off the coast of Georgia. Nuclear devices have been pounded with artillery of a foreign nature, struck by lightning, smashed to pieces, scorched, toasted, and burned beyond recognition. Incredibly, in all this mayhem, not a single nuclear weapon has gone off accidentally, anywhere in the world. If it had, the public would know about it. That type of accident would be almost impossible to conceal. (kl 5527)

There are a few common threads in the stories of accident and malfunction that Mahaffey provides. First, there are failures of training and knowledge on the part of front-line workers. The physics of nuclear fission are often counter-intuitive, and the idea of critical mass does not fully capture the danger of a quantity of fissionable material. The geometry in which the material is stored makes a decisive difference to whether it goes critical. Fissionable material is often transported and manipulated in liquid solution; and the shape and configuration of the vessel in which the solution is held makes a difference to the probability of exponential growth of neutron emission — leading to runaway fission of the material. Mahaffey documents accidents that occurred in nuclear materials processing plants that resulted from plant workers applying what they knew from industrial plumbing to their efforts to solve basic shop-floor problems. All too often the result was a flash of blue light and the release of a great deal of heat and radioactive material.
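
A rough calculation shows why vessel geometry matters so much. For a fixed volume of fissile solution, a shape with more surface area per unit volume leaks more neutrons, which is why criticality-safe storage favors tall, narrow, or slab-like vessels. The sketch below compares surface-to-volume ratios only; it is emphatically not a criticality calculation, and the dimensions are invented.

```python
# Rough illustration of "favorable geometry": for the same volume of solution,
# a thin cylinder exposes far more surface (hence more neutron leakage) than a
# compact sphere. Surface-to-volume comparison only; not a criticality model.
from math import pi

VOLUME_LITERS = 50.0
volume_m3 = VOLUME_LITERS / 1000.0

# Compact sphere holding the whole volume.
r_sphere = (3.0 * volume_m3 / (4.0 * pi)) ** (1.0 / 3.0)
sphere_area = 4.0 * pi * r_sphere ** 2

# Tall, narrow cylinder (10 cm diameter) holding the same volume.
r_cyl = 0.05
height = volume_m3 / (pi * r_cyl ** 2)
cylinder_area = 2.0 * pi * r_cyl * height + 2.0 * pi * r_cyl ** 2

print(f"sphere:   {sphere_area / volume_m3:5.1f} m^2 of surface per m^3")
print(f"cylinder: {cylinder_area / volume_m3:5.1f} m^2 of surface per m^3")
```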

Second, there is a fault at the opposite end of the knowledge spectrum — the tendency of expert engineers and scientists to believe that they can solve complicated reactor problems on the fly. This turned out to be a critical problem at Chernobyl (kl 6859).

The most difficult problem to handle is that the reactor operator, highly trained and educated with an active and disciplined mind, is liable to think beyond the rote procedures and carefully scheduled tasks. The operator is not a computer, and he or she cannot think like a machine. When the operator at NRX saw some untidy valve handles in the basement, he stepped outside the procedures and straightened them out, so that they were all facing the same way. (kl 2057)

There are also clear examples of inappropriate supervision in the accounts shared by Mahaffey. Here is an example from Chernobyl.

[Deputy chief engineer] Dyatlov was enraged. He paced up and down the control panel, berating the operators, cursing, spitting, threatening, and waving his arms. He demanded that the power be brought back up to 1,500 megawatts, where it was supposed to be for the test. The operators, Toptunov and Akimov, refused on grounds that it was against the rules to do so, even if they were not sure why. 

Dyatlov turned on Toptunov. “You lying idiot! If you don’t increase power, Tregub will!”  

Tregub, the Shift Foreman from the previous shift, was officially off the clock, but he had stayed around just to see the test. He tried to stay out of it. 

Toptunov, in fear of losing his job, started pulling rods. By the time he had wrestled it back to 200 megawatts, 205 of the 211 control rods were all the way out. In this unusual condition, there was danger of an emergency shutdown causing prompt supercriticality and a resulting steam explosion. At 1:22:30 a.m., a read-out from the operations computer advised that the reserve reactivity was too low for controlling the reactor, and it should be shut down immediately. Dyatlov was not worried. “Another two or three minutes, and it will be all over. Get moving, boys!” (kl 6887)

This was the turning point in the disaster.

A related fault is the intrusion of political and business interests into the design and conduct of high-risk nuclear actions. Leaders want a given outcome without understanding the technical details of the processes they are demanding; subordinates like Toptunov are eventually cajoled or coerced into taking the problematic actions. The persistence of advocates for liquid sodium breeder reactors represents a higher-level example of the same fault. Associated with this role of political and business interests is an impulse towards secrecy and concealment when accidents occur and deliberate understatement of the public dangers created by an accident — a fault amply demonstrated in the Fukushima disaster.

Atomic Accidents provides a fascinating history of events of which most of us are unaware. The book is not primarily intended to offer an account of the causes of these accidents, but rather the ways in which they unfolded and the consequences they had for human welfare. (Generally speaking his view is that nuclear accidents in North America and Western Europe have had remarkably few human casualties.) And many of the accidents he describes are exactly the sorts of failures that are common in all largescale industrial and military processes.

(Large-scale technology failure has come up frequently here. See these posts for analysis of some of the organizational causes of technology failure (link, link, link).)

Accident analysis and systems thinking

Complex socio-technical systems fail; that is, accidents occur. And it is enormously important for engineers and policy makers to have a better way of thinking about accidents than the standard investigative protocols that follow an air crash, a chemical plant fire, or the release of a contaminated drug. We need to understand better what the systems and organizational causes of an accident are; even more importantly, we need a basis for improving the safe functioning of complex socio-technical systems by identifying better processes and better warning indicators of impending failure.

A long-term leader in the field of systems-safety thinking is Nancy Leveson, a professor of aeronautics and astronautics at MIT and the author of Safeware: System Safety and Computers (1995) and Engineering a Safer World: Systems Thinking Applied to Safety (2012). Leveson has been a particular advocate for two insights: looking at safety as a systems characteristic, and looking for the organizational and social components of safety and accidents as well as the technical event histories that are more often the focus of accident analysis. Her approach to safety and accidents involves looking at a technology system in terms of the set of controls and constraints that have been designed into the process to prevent accidents. “Accidents are seen as resulting from inadequate control or enforcement of constraints on safety-related behavior at each level of the system development and system operations control structures.” (25)

The abstract for her essay “A New Accident Model for Engineering Safety” (link) captures both points.

New technology is making fundamental changes in the etiology of accidents and is creating a need for changes in the explanatory mechanisms used. We need better and less subjective understanding of why accidents occur and how to prevent future ones. The most effective models will go beyond assigning blame and instead help engineers to learn as much as possible about all the factors involved, including those related to social and organizational structures. This paper presents a new accident model founded on basic systems theory concepts. The use of such a model provides a theoretical foundation for the introduction of unique new types of accident analysis, hazard analysis, accident prevention strategies including new approaches to designing for safety, risk assessment techniques, and approaches to designing performance monitoring and safety metrics.

The accident model she describes in this article and elsewhere is STAMP (Systems-Theoretic Accident Model and Processes). Here is a short description of the approach.

In STAMP, systems are viewed as interrelated components that are kept in a state of dynamic equilibrium by feedback loops of information and control. A system in this conceptualization is not a static design—it is a dynamic process that is continually adapting to achieve its ends and to react to changes in itself and its environment. The original design must not only enforce appropriate constraints on behavior to ensure safe operation, but the system must continue to operate safely as changes occur. The process leading up to an accident (loss event) can be described in terms of an adaptive feedback function that fails to maintain safety as performance changes over time to meet a complex set of goals and values…. 

The basic concepts in STAMP are constraints, control loops and process models, and levels of control. (12)
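To make this vocabulary concrete, here is a deliberately simple toy model, written in Python, of a controller, a process model, a feedback channel, and a safety constraint. The scenario (a tank whose level must stay below a limit) and all of its numbers are invented for illustration; this is not Leveson's formal notation, only a sketch of the systems-theoretic point that an accident can occur when feedback fails and the controller's process model drifts away from the actual state of the process.

```python
# Toy illustration of STAMP vocabulary: a controller enforces a safety
# constraint on a physical process via a feedback loop, acting on its own
# process model of the plant rather than on the plant itself.
# (Illustrative sketch only; the scenario and numbers are invented.)

MAX_SAFE_LEVEL = 100.0   # the safety constraint: tank level must stay below this

class TankProcess:
    """The physical process being controlled."""
    def __init__(self):
        self.level = 50.0

    def step(self, inflow_valve_open):
        if inflow_valve_open:
            self.level += 5.0        # inflow while the valve is open
        self.level -= 2.0            # constant outflow
        return self.level

class Controller:
    """Issues control actions based on its process model of the plant."""
    def __init__(self):
        self.believed_level = 50.0   # the controller's process model

    def update_model(self, sensor_reading):
        if sensor_reading is not None:   # feedback may be missing or stale
            self.believed_level = sensor_reading

    def control_action(self):
        # open the inflow valve whenever the *believed* level is below setpoint
        return self.believed_level < 60.0

def run(sensor_works=True, steps=30):
    plant, controller = TankProcess(), Controller()
    for t in range(steps):
        reading = plant.level if sensor_works else None
        controller.update_model(reading)
        actual = plant.step(controller.control_action())
        if actual > MAX_SAFE_LEVEL:
            return f"constraint violated at step {t}: level = {actual:.1f}"
    return "constraint enforced"

print(run(sensor_works=True))    # intact feedback loop: constraint enforced
print(run(sensor_works=False))   # broken feedback: process model drifts, constraint violated
```

The point of the toy is only that the hazard arises in the relationship among controller, feedback, and process, not in the failure of any single component.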

The other point of emphasis in Leveson’s treatment of safety is her consistent effort to include the social and organizational forms of control that are a part of the safe functioning of a complex technological system.

Event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management deficiencies, and flaws in the safety culture of the company or industry. An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events. (6)

She treats the organizational backdrop of the technology process in question as being a crucial component of the safe functioning of the process.

Social and organizational factors, such as structural deficiencies in the organization, flaws in the safety culture, and inadequate management decision making and control are directly represented in the model and treated as complex processes rather than simply modeling their reflection in an event chain. (26)

And she treats organizational features as another form of control system (along the lines of Jay Forrester's early definitions of systems in Industrial Dynamics).

Modeling complex organizations or industries using system theory involves dividing them into hierarchical levels with control processes operating at the interfaces between levels (Rasmussen, 1997). Figure 4 shows a generic socio-technical control model. Each system, of course, must be modeled to reflect its specific features, but all will have a structure that is a variant on this one. (17)

[Figure 4 of Leveson's paper shows the generic socio-technical control model referred to in this passage, with hierarchical control structures for system development and system operations.]

The approach embodied in the STAMP framework is that safety is a systems effect, dynamically produced by the control structures built into the total process in question. As Leveson puts it:

In systems theory, systems are viewed as hierarchical structures where each level imposes constraints on the activity of the level beneath it—that is, constraints or lack of constraints at a higher level allow or control lower-level behavior (Checkland, 1981). Control laws are constraints on the relationships between the values of system variables. Safety-related control laws or constraints therefore specify those relationships between system variables that constitute the nonhazardous system states, for example, the power must never be on when the access door is open. The control processes (including the physical design) that enforce these constraints will limit system behavior to safe changes and adaptations. (17)
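Leveson's example of a safety-related control law ("the power must never be on when the access door is open") can be written down directly as a predicate over system state variables, with the control process refusing any transition that would violate it. The following minimal sketch is only illustrative; the state variables and the interlock function are invented, not taken from Leveson.

```python
# Minimal sketch of a safety constraint expressed as a control law over
# system variables, using the example "the power must never be on when the
# access door is open." State variables and interlock logic are hypothetical.

from dataclasses import dataclass

@dataclass
class SystemState:
    power_on: bool
    door_open: bool

def constraint_satisfied(state: SystemState) -> bool:
    """Safety control law: the power must never be on when the door is open."""
    return not (state.power_on and state.door_open)

def request_power_on(state: SystemState) -> SystemState:
    """An interlock: the control process refuses any transition that would
    violate the constraint, rather than relying on operator vigilance."""
    candidate = SystemState(power_on=True, door_open=state.door_open)
    return candidate if constraint_satisfied(candidate) else state

state = SystemState(power_on=False, door_open=True)
state = request_power_on(state)      # refused: the door is open
print(state.power_on)                # False

state.door_open = False
state = request_power_on(state)      # allowed: the constraint still holds
print(state.power_on)                # True
```

The design choice mirrors the quoted passage: the constraint limits which changes of state the system will accept, so safe behavior is enforced by the control process and the physical design rather than left to procedure.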

Leveson’s understanding of systems theory brings along with it a strong conception of “emergence”. She argues that higher levels of systems possess properties that cannot be reduced to the properties of the components, and that safety is one such property:

In systems theory, complex systems are modeled as a hierarchy of levels of organization, each more complex than the one below, where a level is characterized by having emergent or irreducible properties. Hierarchy theory deals with the fundamental differences between one level of complexity and another. Its ultimate aim is to explain the relationships between different levels: what generates the levels, what separates them, and what links them. Emergent properties associated with a set of components at one level in a hierarchy are related to constraints upon the degree of freedom of those components. (11)

But her understanding of “irreducible” seems to be different from that commonly used in the philosophy of science. She does in fact believe that these higher-level properties can be explained by the system of properties at the lower levels — for example, in this passage she asks “… what generates the levels” and how the emergent properties are “related to constraints” imposed on the lower levels. In other words, her position seems to be similar to that advanced by Dave Elder-Vass (link): emergent properties are properties at a higher level that are not possessed by the components, but which depend upon the interactions and composition of the lower-level components.

The domain of safety engineering and accident analysis seems like a particularly suitable place for Bayesian analysis. It seems unavoidable that accident analysis involves both frequency-based probabilities (e.g. the frequency of pump failure) and expert-based estimates of the likelihood of a particular kind of failure (e.g. the likelihood that a train operator will slacken attention to track warnings under schedule pressure from the company). Bayesian techniques are well suited to combining these different kinds of estimates of risk into a unified calculation.
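A small sketch may help show what such a combination looks like in practice. In the hypothetical calculation below, a frequency-based component (a count of pump failures over observed demands) updates a Beta prior, while an expert-elicited Beta distribution stands in for the judgment about operator inattention; Monte Carlo sampling then combines them into a posterior for the joint accident scenario. Every number here is invented, and treating the two events as independent is itself a modeling assumption.

```python
# Hedged sketch of combining a frequency-based failure estimate with an
# expert-elicited probability in a single Bayesian calculation.
# All counts and distribution parameters are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Frequency-based component: 3 pump failures observed in 200 demands,
# updating a weakly informative Beta(1, 1) prior.
pump_failure = rng.beta(1 + 3, 1 + 200 - 3, size=N)

# Expert-based component: an elicited Beta(2, 18) distribution (mean 0.10)
# for the chance the operator misses the warning when the failure occurs.
operator_miss = rng.beta(2, 18, size=N)

# If the accident scenario requires both events, and they are treated as
# independent, the scenario probability is the product of the two draws.
scenario = pump_failure * operator_miss

print(f"posterior mean P(scenario) = {scenario.mean():.4f}")
print(f"95% credible interval      = ({np.quantile(scenario, 0.025):.4f}, "
      f"{np.quantile(scenario, 0.975):.4f})")
```

The attraction of the Bayesian idiom is precisely that hard failure-rate data and softer expert judgments enter the calculation in the same currency.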

The topic of safety and accidents is particularly relevant to Understanding Society because it expresses very clearly the causal complexity of the social world in which we live. And rather than simply ignoring that complexity, the systematic study of accidents gives us an avenue for arriving at better ways of representing, modeling, and intervening in parts of that complex world.

 