Are randomized controlled trials the “gold standard” for establishing causation?

The method of randomized controlled trials (RCTs) is often thought to be the best possible way of establishing causation, whether in biology, medicine, or social science. An experiment based on randomized controlled trials can be described simply. It is hypothesized that

  • (H) X causes Y in a population of units P.

An experiment testing H is designed by randomly selecting a number of individuals from P and randomly assigning them to Gtest (the test group) or Gcontrol (the control group). Gtest is exposed to X (the treatment) while Gcontrol is not, under carefully controlled conditions designed to ensure that the ambient conditions surrounding both groups are approximately the same. The status of each group is then measured with regard to Y, and the difference in the value of Y between the two groups is said to be the “average treatment effect” (ATE).
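As a toy illustration of this design, the following sketch simulates an RCT on a synthetic population and estimates the ATE. The population, the outcome model, and the true effect size of 2.0 are all hypothetical:

```python
import random
import statistics

random.seed(42)

def run_rct(population, treatment_effect=2.0):
    """Randomly assign units to test and control groups, expose only the
    test group to the hypothetical treatment X, and estimate the ATE."""
    random.shuffle(population)
    half = len(population) // 2
    g_test, g_control = population[:half], population[half:]

    # Outcome Y = unit baseline + treatment effect (test group only) + noise
    y_test = [u + treatment_effect + random.gauss(0, 1) for u in g_test]
    y_control = [u + random.gauss(0, 1) for u in g_control]

    return statistics.mean(y_test) - statistics.mean(y_control)

# A population P of 10,000 units with heterogeneous baselines
P = [random.gauss(10, 3) for _ in range(10_000)]
ate = run_rct(P)
print(f"estimated ATE: {ate:.2f}")  # should land close to the true effect of 2.0
```

Because randomization balances the heterogeneous baselines across the two arms, the difference in group means recovers the treatment effect in expectation.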

This research methodology is often thought to capture the logical core of experimentation, and to constitute the strongest evidence possible for establishing or refuting a causal relationship between X and Y. It is thought to represent a theory-neutral, purely empirical way of establishing causal relations among factors. This is so because of the random assignment of individuals to the two groups (so potentially causally relevant individual differences are averaged out in each group) and because of the strong efforts to isolate the administration of the test so that each group is exposed to the same unknown factors that may themselves influence the outcome to be measured. As Handley et al. put the point in their review article “Selecting and Improving Quasi-Experimental Designs in Effectiveness and Implementation Research” (2018): “Random allocation minimizes selection bias and maximizes the likelihood that measured and unmeasured confounding variables are distributed equally, enabling any differences in outcomes between the intervention and control arms to be attributed to the intervention under study” (Handley et al. 2018: 6). Sociology is interested in discovering and measuring the causal effects of large social conditions and interventions – “treatments”, as they are often called in medicine and policy studies. It might seem plausible, then, that empirical social science should make use of randomized controlled trials whenever possible, in efforts to discover or validate causal connections.

The supposed “gold standard” status of random controlled trials has been especially controversial in the last several years. Serious methodological and inferential criticisms have been raised of common uses of RCT experiments, and philosopher of science Nancy Cartwright has played a key role in advancing these criticisms. Cartwright and Hardie’s Evidence-Based Policy: A Practical Guide to Doing It Better (link) provided a strong critique of the use of RCT methodology in areas of public policy, and Cartwright and others have offered strong arguments to show that inferences about causation based on RCT experiments are substantially more limited and conditional than generally believed.

A pivotal debate about RCT methodology among experts in several fields took place in a special issue of Social Science & Medicine in 2018. This volume is essential reading for anyone interested in causal reasoning. Especially important is Deaton and Cartwright’s article “Understanding and misunderstanding randomized controlled trials” (link). Here is the abstract to the Deaton and Cartwright article:

ABSTRACT Randomized Controlled Trials (RCTs) are increasingly popular in the social sciences, not only in medicine. We argue that the lay public, and sometimes researchers, put too much trust in RCTs over other methods of investigation. Contrary to frequent claims in the applied literature, randomization does not equalize everything other than the treatment in the treatment and control groups, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) covariates. Finding out whether an estimate was generated by chance is more difficult than commonly believed. At best, an RCT yields an unbiased estimate, but this property is of limited practical value. Even then, estimates apply only to the sample selected for the trial, often no more than a convenience sample, and justification is required to extend the results to other groups, including any population to which the trial sample belongs, or to any individual, including an individual in the trial. Demanding ‘external validity’ is unhelpful because it expects too much of an RCT while undervaluing its potential contribution. RCTs do indeed require minimal assumptions and can operate with little prior knowledge. This is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon, not discarded. RCTs can play a role in building scientific knowledge and useful predictions but they can only do so as part of a cumulative program, combining with other methods, including conceptual and theoretical development, to discover not ‘what works’, but ‘why things work’.

Deaton and Cartwright put their central critique of RCT methodology in these terms:

We argue that the lay public, and sometimes researchers, put too much trust in RCTs over other methods of investigation. Contrary to frequent claims in the applied literature, randomization does not equalize everything other than the treatment in the treatment and control groups, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) covariates…. We argue that any special status for RCTs is unwarranted. (Deaton and Cartwright 2018: 2).

They provide an interpretation of RCT methodology that places it within a range of strategies of empirical and theoretical investigation, and they argue that researchers need to choose methods that are suitable to the problems that they study.

One of the key concerns they express has to do with extrapolating and generalizing from RCT studies (3). A given RCT study is carried out in a specific and limited set of cases, and the question arises whether the effects documented for the intervention in this study can be extrapolated to a broader population. Do the results of a drug study, a policy study, or a behavioral study give a basis for believing that these results will obtain in the larger population? Their general answer is that extrapolation must be done very carefully. “The ‘gold standard or truth’ view does harm when it undermines the obligation of science to reconcile RCTs results with other evidence in a process of cumulative understanding” (5). And even more emphatically, “we strongly contest the often-expressed idea that the ATE calculated from an RCT is automatically reliable, that randomization automatically controls for unobservables, or worst of all, that the calculated ATE is true [of the whole population]” (10).

In his contribution to the SSM volume, Robert Sampson (link) shares this last concern about the limits of extending RCT results to new contexts and settings:

For example, will a program that was evaluated in New York work in Chicago? To translate an RCT into future actions, we must ask hard questions about the potential mechanisms through which a treatment influences an outcome, heterogeneous treatment effects, contextual variations, unintended consequences or policies that change incentive and opportunity structures, and the scale at which implementing policies changes their anticipated effects. (Sampson 2018: 67)

The general perspective from which Deaton and Cartwright proceed is that empirical research about causal relationships — including experimentation—requires a broad swath of knowledge about the processes, mechanisms, and causal powers at work in the given domain. This background knowledge is needed in order to interpret the results of empirical research and to assess the degree to which the findings of a specific study can plausibly be extrapolated to other populations.

These methodological and logical concerns about the design and interpretation of experiments based on randomized controlled trials make it clear that it is crucial for social scientists to treat RCT methodology carefully and critically. Is RCT experimentation a valuable component of the toolkit of sociological investigation? Yes, of course. But as Cartwright demonstrates, it is important to keep several philosophical points in mind. First, there is no “gold-standard” method for research in any field; rather, it is necessary to adapt methods to the nature of the data and causal patterns in a given field. Second, she (like most philosophers of science) is insistent that empirical research, whether experimental, observational, statistical, or Millian, always requires theoretical inquiry into the underlying mechanisms that can be hypothesized to be at work in the field. Only in the context of a range of theoretical knowledge is it possible to arrive at reasonable interpretations of (and generalizations from) a set of empirical findings.

So, what about it? Should we imagine that randomized controlled trials constitute the aspirational gold standard for research in sociology, medicine, or public policy? The answer seems clear: RCT methodology is a legitimate and important tool for sociological research, but it is not fundamentally superior to the many other methods of empirical investigation and inference in use in the social sciences.

Experimental methods in sociology

An earlier post noted the increasing importance of experimentation in some areas of economics (link), and posed the question of whether there is a place for experimentation in sociology as well. Here I’d like to examine that question a bit further.

Let’s begin by asking the simple question: what is an experiment? An experiment is an intervention through which a scientist seeks to identify the possible effects of a given factor or “treatment”. The effect may be thought to be deterministic (whenever X occurs, Y occurs); or it may be probabilistic (the occurrence of X influences the probability of the occurrence of Y). Plainly, the experimental evaluation of probabilistic causal hypotheses requires repeating the experiment a number of times and evaluating the results statistically; whereas a deterministic causal hypothesis can in principle be refuted by a single trial.

In “The Principles of Experimental Design and Their Application in Sociology” (link) Michelle Jackson and D.R. Cox provide a simple and logical specification of experimentation:

We deal here with investigations in which the effects of a number of alternative conditions or treatments are to be compared. Broadly, the investigation is an experiment if the investigator controls the allocation of treatments to the individuals in the study and the other main features of the work, whereas it is observational if, in particular, the allocation of treatments has already been determined by some process outside the investigator’s control and detailed knowledge. The allocation of treatments to individuals is commonly labeled manipulation in the social science context. (Jackson and Cox 2013: 28)

There are several relevant kinds of causal claims in sociology that might admit of experimental investigation, corresponding to all four causal linkages implied by the model of Coleman’s boat (Foundations of Social Theory)—micro-macro, macro-micro, micro-micro, and macro-macro (link). Sociologists generally pay close attention to the relationships that exist between structures and social actors, extending in both directions. Hypotheses about causation in the social world require testing or other forms of empirical evaluation through the collection of evidence. It is plausible to ask whether the methods associated with experimentation are available to sociology. In many instances, the answer is, yes.

There appear to be five different kinds of experiments that might make sense in sociology, corresponding to the discussion that follows.

  1. Experiments evaluating hypotheses about features of human motivation and behavior
  2. Experiments evaluating hypotheses about the effects of social relationships on social behavior
  3. Experiments evaluating hypotheses about the effects of micro-context on individual actors and their behavior
  4. Experiments evaluating hypotheses about the effects of macro-factors (ideologies, normative systems, social structures) on individual social action
  5. Experiments evaluating hypotheses about macro-to-macro causation

First, sociological theories generally make use of more or less explicit theories of agents and their behavior. These theories could be evaluated using laboratory-based designs that place experimental subjects in specified social arrangements, parallel to existing methods in experimental economics. For example, Durkheim, Goffman, Coleman, and Hedström all provide different accounts of the actors who constitute social phenomena. It is feasible to design experiments along the lines of experimental economics to evaluate the behavioral hypotheses advanced by various sociologists.

Second, sociology is often concerned with the effects of social relationships on social behavior—for example, friendships, authority relations, or social networks. It would appear that these effects can be probed through direct experimentation, where the researcher creates artificial social relationships and observes behavior. Matthew Salganik et al.’s internet-based experiments (2006, 2009) on “culture markets” fall in this category. Hedström (2006) describes the research by Salganik, Dodds, and Watts (2006) in these terms:

Salganik et al. (2) circumvent many of these problems [of survey-based methodology] by using experimental rather than observational data. They created a Web-based world where more than 14,000 individuals listened to previously unknown songs, rated them, and freely downloaded them if they so desired. Subjects were randomly assigned to different groups. Individuals in only some groups were informed about how many times others in their group had downloaded each song. The experiment assessed whether this social influence had any effects on the songs the individuals seemed to prefer. 

As expected, the authors found that individuals’ music preferences were altered when they were exposed to information about the preferences of others. Furthermore, and more importantly, they found that the extent of social influence had important consequences for the collective outcomes that emerged. The greater the social influence, the more unequal and unpredictable the collective outcomes became. Popular songs became more popular and unpopular songs became less popular when individuals influenced one another, and it became more difficult to predict which songs were to emerge as the most popular ones the more the individuals influenced one another. (787)

Third, some sociologists are especially interested in the effects of micro-context on individual actors and their behavior. Erving Goffman and Harold Garfinkel offer detailed interpretations of the causal dynamics of social interactions at the micro level, and their work appears to be amenable to experimental treatment. Garfinkel (Studies in Ethnomethodology), in particular, made use of research methods that are especially suggestive of controlled experimental designs.

Fourth, sociologists are interested in macro-causes of individual social action. For example, sociologists would like to understand the effects of ideologies and normative systems on individual actors, and others would like to understand the effects of differences in large social structures on individual social actors. Weber hypothesized that the Protestant ethic caused a certain kind of behavior. Theoretically it should be possible to establish hypotheses about the kind of influence a broad cultural factor is thought to exercise over individual actors, and then design experiments to evaluate those hypotheses. Given the scope and pervasiveness of these kinds of macro-social factors, it is difficult to see how their effects could be assessed within a laboratory context. However, there is a range of other experimental designs that could be used, including quasi-experiments (link), field experiments, and natural experiments (link), in which the investigator assembles appropriate comparison groups of individuals in observably different ideological, normative, or social-structural arrangements and observes the differences that can be discerned at the level of social behavior. Does one set of normative arrangements result in greater altruism? Does a culture of nationalism promote citizens’ propensity for aggression against outsiders? Does greater ethnic homogeneity result in higher willingness to comply with taxation, conscription, and other collective duties?

Finally, sociologists are often interested in macro- to macro-causation. For example, consider the claims that “defeat in war leads to weak state capacity in the subsequent peace” or “economic depression leads to xenophobia”. Of course it is not possible to design an experiment in which “defeat in war” is a treatment; but it is possible to develop quasi-experiments or natural experiments that are designed to evaluate this hypothesis. (This is essentially the logic of Theda Skocpol’s (1979) analysis of the causes of social revolution in States and Social Revolutions: A Comparative Analysis of France, Russia, and China.) Or consider a research question in contentious politics: does widespread crop failure give rise to rebellions? Here again, the direct logic of experimentation is generally not available; but the methods articulated in the fields of quasi-experimentation, natural experiments, and field experiments offer an avenue for research designs that have a great deal in common with experimentation. A researcher could compile a dataset for historical China that records weather, crop failure, crop prices, and incidents of rebellion and protest. This dataset could support a “natural experiment” in which each year is assigned to either the “control group” or the “intervention group”; the control group consists of years in which crop harvests were normal, while the intervention group consists of years in which crop harvests were below normal (or below subsistence). The experiment is then a simple one: what is the average incidence of rebellious incidents in control years and in intervention years?
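The logic of this simple natural experiment can be sketched in a few lines. The yearly records below are invented for illustration, not drawn from any actual historical dataset:

```python
import statistics

# Hypothetical yearly records: (crop_harvest_index, rebellion_incidents);
# a harvest index below 1.0 counts as a crop-failure ("intervention") year.
records = [
    (1.10, 2), (0.95, 5), (1.02, 1), (0.80, 9), (1.05, 3),
    (0.70, 12), (1.00, 2), (0.88, 7), (1.15, 1), (0.92, 6),
]

control = [r for h, r in records if h >= 1.0]       # normal-harvest years
intervention = [r for h, r in records if h < 1.0]   # below-normal years

print("mean incidents, control years:     ", statistics.mean(control))       # 1.8
print("mean incidents, intervention years:", statistics.mean(intervention))  # 7.8
```

Nature, rather than the researcher, has done the "assignment" of years to groups here, which is exactly why the background knowledge about confounding conditions that Deaton and Cartwright stress remains indispensable.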

So it is clear that causal reasoning that is very similar to the logic of experimentation is common throughout many areas of sociology. That said, the zone of sociological theorizing that is amenable to laboratory experimentation under random selection and a controlled environment is largely in the area of theories of social action and behavior: the reasons actors behave as they do, hypotheses about how their choices would differ under varying circumstances, and (with some ingenuity) how changing background social conditions might alter the behavior of actors. Here there are very direct parallels between sociological investigation and the research done by experimental and behavioral economists like Richard Thaler (Misbehaving: The Making of Behavioral Economics). And in this way, sociological experiments have much in common with experimental research in social psychology and other areas of the behavioral sciences.

Debates about field experiments in the social sciences


Questions about the empirical validation of hypotheses about social causation have been of interest in the past several weeks here. Relevant to that question is Dawn Langan Teele’s recent volume, Field Experiments and Their Critics: Essays on the Uses and Abuses of Experimentation in the Social Sciences. The essays in the book make for interesting reading for philosophers of the social sciences. But the overall impression that I take away is that the assumptions this research community makes about social causation are excessively empiricist and under-theorized. These are essentially the assumptions that come along with an econometrician’s view of social reality. The researchers approach causation consistently as “empirical social arrangement,” “intervention,” and “net effect”. But this is not a satisfactory way of capturing the workings of social causation. Instead, we need to attempt to construct adequate theories of the institutions, norms, and patterns of action through which various social arrangements work, and the causal mechanisms and processes to which these social realities give rise.

The debates considered here surround the relative effectiveness of controlled observation and RCT-style experiments, with Gerber, Green, and Kaplan arguing on Bayesian statistical grounds that the epistemic weight of observation-based research is close to zero.

We find that unless researchers have prior information about the biases associated with observational research, observational findings are accorded zero weight regardless of sample size, and researchers learn about causality exclusively through experimental results. (kl 211)

A field experiment is defined as “randomized controlled trials carried out in a real-world setting” (kl 92). Observational data relevant to causation often derive from what researchers call “natural experiments”, in which otherwise similar groups of subjects are exposed to different influences thought to have causal effect. If we believe that trauma affects students’ learning, we might compare a group of first-grade classrooms in a city that experienced a serious tornado with a comparable group of first-grade classrooms in a city without an abrupt and disruptive crisis. If the tornado classrooms showed lower achievement scores than the no-tornado classrooms, we might regard this as a degree of support for the causal hypothesis.

The radical skeptics about observational data draw strong conclusions; if we accept this line of thought, then it would appear that observational evidence about causation is rarely useful. The qualification at the start of the GGK quote (“unless researchers have prior information about the biases associated with observational research”) is crucial, however, since researchers generally do have prior information about the factors influencing outcomes and the selection of cases in the studies they undertake, as Susan Stokes argues in her response essay:

Do observational researchers “know nothing” about the processes that generate independent variables and are they hence “entirely uncertain” about bias? Is the “strong possibility” of unobserved confounding factors “always omnipresent” in observational research? Are rival hypotheses ‘always plausible”? Can one do nothing more than “assume nonconfoundedness”? To the extent that the answers to these questions are no, radical skepticism is undermined. (kl 751)

Stokes provides a clear exposition of how the influence of unrelated other causes X_ij and confounders Z_ik figures in the linear causal equation for outcome Y depending on variable X (kl 693):
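The equation itself has dropped out of this text. A standard linear specification consistent with the notation just described (my reconstruction, not necessarily Stokes's exact formula) is:

```latex
% Reconstruction of the standard linear causal model with observed
% and unobserved covariates (not Stokes's exact formula):
\[
  Y_i = \alpha + \beta X_i + \sum_j \gamma_j X_{ij} + \sum_k \delta_k Z_{ik} + \varepsilon_i
\]
% X_i is the causal variable of interest, the X_{ij} are other
% (partially observed) causes, the Z_{ik} are unobserved confounders,
% and \varepsilon_i is an error term.
```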

This model is offered as a representation of the “true” causation of Y, including both observed and unobserved factors. We might imagine that we have full observational data on Y and X, observations for some but not all of the X_ij, and no observations for the Z_ik.

The logical advantage of a randomized field experiment is that random assignment of individuals to the treatment and control groups ensures that, in expectation, there is no bias in the two groups with respect to a hidden characteristic that may be relevant to the causal workings of the treatment. In the hypothetical tornado-and-learning study mentioned above, there will be a spatial difference between the treatment and control groups; but regional and spatial differences among children may be relevant to learning. So the observed difference in learning may be the effect of the trauma of the tornado, or it may be the coincidental effect of the regional difference between midwestern and northeastern students.
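A small simulation can make this contrast concrete. Here a hidden characteristic (say, a regional learning advantage) affects the outcome; random assignment balances it across arms in expectation, while a selection-based comparison (analogous to comparing tornado and no-tornado cities) does not. All numbers are hypothetical:

```python
import random
import statistics

random.seed(7)

# Each unit carries a hidden characteristic that affects the outcome
# but is invisible to the researcher.
units = [{"hidden": random.gauss(0, 1)} for _ in range(10_000)]

def outcome(u, treated):
    true_effect = 1.0  # hypothetical true treatment effect
    return u["hidden"] + (true_effect if treated else 0) + random.gauss(0, 0.5)

# Randomized assignment: the hidden characteristic is balanced across arms.
random.shuffle(units)
test, control = units[:5000], units[5000:]
ate_randomized = (statistics.mean(outcome(u, True) for u in test)
                  - statistics.mean(outcome(u, False) for u in control))

# Selection-based comparison: "treatment" goes to units with high hidden
# values, as when the treated city differs regionally from the comparison city.
units.sort(key=lambda u: u["hidden"])
low, high = units[:5000], units[5000:]
ate_selected = (statistics.mean(outcome(u, True) for u in high)
                - statistics.mean(outcome(u, False) for u in low))

print(f"randomized estimate: {ate_randomized:.2f}")  # near the true effect 1.0
print(f"selected estimate:   {ate_selected:.2f}")    # inflated by hidden bias
```

The selected estimate folds the hidden difference between the groups into the apparent treatment effect, which is exactly the confounding worry in the tornado example.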

Andrew Gelman takes a step back and assesses the larger significance of this debate for social-science research. Here is his general characterization of the policy and epistemic interests that motivate social scientists (along the lines of an earlier post on policy and experiment; link):

Policy analysis (and, more generally, social science) proceeds in two ways. From one direction, there are questions whose answers we seek—how can we reduce poverty, fight crime, help people live happier and healthier lives, increase the efficiency of government, better translate public preferences into policy, and so forth? From another direction, we can gather discrete bits of understanding about pieces of the puzzle: estimates of the effects of particular programs as implemented in particular places. (kl 3440)

Gelman concisely captures the assumptions about causality that underlie this paradigm of social-science research: that causal factors can take the form of pretty much any configuration of social intervention and structure, and we can always ask what the effects of a given configuration are. But this is a view of causation that most realists would reject, because it represents causes in a highly untheorized way. On this ontological mindset, anything can be a cause, and its causal significance is simply the net difference it makes in the world in contrast to its absence. But this is a faulty understanding of real social causation.

Consider an example. Some American school systems have K-8 and 9-12 systems of elementary school and high school; other systems have K-6, 7-8, and 9-12 systems. These configurations might be thought of as “causal factors”, and we might ask, “what is the net effect of system A or system B on educational performance of students by grade 12” (or “juvenile delinquency rates by grade 10”)? But a realist would argue that this is too coarse-grained a perspective on causation for a complex social system like education. Instead, we need to identify more granular and more pervasive causes at a more fundamental individual and institutional level, which can then perhaps be aggregated into larger system-level effects. For example, if we thought that the socialization process of children between 11 and 14 is particularly sensitive to bullying and if we thought that high schools create a more welcoming environment for bullying, then we might have reason to expect that the middle school model would be more conducive to the educational socialization of children of these ages. But these two hypotheses can be separately investigated. And the argument that System A produces better educational outcomes than System B will now rest on reasoning about more fundamental causal processes rather than on empirical and experimental findings based on examination of the outcomes associated with the two systems. Moreover, it is possible that the causal-mechanism reasoning that I’ve just described is valid and a good guide to policy choice, even though the observations and experiments at the level of full educational systems do not demonstrate a statistical difference between them.

More generally, arbitrary descriptions of “social factors” do not serve as causal factors whose effects we can investigate purely through experimentation and observation. Rather, as the realists argue, we need to have a theory of the workings of the social factors in which we are interested, and we then need to empirically study the causal characteristics of those underlying features of actors, norms, institutions, and structures. Only then can we have a basis for judging that this or that macro-level empirical arrangement will have specific consequences. Bhaskar is right in this key ontological prescription for the social sciences: we need to work towards finding theories of the underlying mechanisms and structures that give rise to the observable workings of the social world. And crude untheorized empirical descriptions of “factors” do not contribute to a better understanding of the social world. The framework here is “empiricist,” not because it gives primacy to empirical validation, but because it elides the necessity of offering realistic accounts of underlying social mechanisms, processes, and structures.

Quasi-experimental data?

Stan Lieberson is one of a group of sociologists for whom I have great respect when it comes to intelligent thinking about social science methodology. His 1985 book, Making It Count: The Improvement of Social Research and Theory, is a good example of some of this thinking about the foundations of social science knowledge, and I also admire A Matter of Taste: How Names, Fashions, and Culture Change in the way it offers a genuinely novel topic and method of approach.

Lieberson urges us to consider “a different way of thinking about the rigorous study of society implied by the phrase ‘science of society’” instead of simply assuming that social science should resemble natural science (3-4). His particular object of criticism in this book is the tendency of quantitative social scientists to use the logic of experiments to characterize the data they study.

An experiment is an attempt to measure the causal effects of one factor X on another factor Z by isolating a domain of phenomena — holding constant all other causal factors — and systematically varying one causal factor to observe the effect this factor has on an outcome of interest. The basic assumption is that an outcome is the joint effect of a set of (as yet unknown) causal conditions:

C1 & C2 & … & Cn cause Z,

where we do not yet know the contents of the list Ci. We consider the hypothesis that Cm is one of the causes of Z. We design an experimental environment in which we are able to hold constant all the potentially relevant causal conditions we can think of (thereby holding fixed the remaining conditions Ci, i ≠ m), and we systematically vary the presence or absence of Cm and observe the state of the outcome Z. If Z varies appropriately with the presence or absence of Cm, we tentatively conclude that Cm is one of the causes of Z.

In cases where individual differences among samples or subjects may affect the outcome, or where the causal processes in question are probabilistic rather than deterministic, experimentation requires treating populations rather than individuals and assuring randomization of subjects across “treatment” and “no-treatment” groups. This involves selecting a number of subjects, randomly assigning them to controlled conditions in which all other potential causal factors are held constant, exposing one set of subjects to the treatment X while withholding the treatment from the other group, and measuring the outcome variable in the two groups. If there is a significant difference in the mean value of the outcome variable between the treatment group and the control group, then we can tentatively conclude that X causes Z and perhaps estimate the magnitude of the effect. Take tomato yields per square meter (Z) as affected by fertilizer X: plants in the control group are subjected to a standard set of growing conditions, while the treatment group receives these conditions plus the measured dose of X. We then measure the quantity produced by the two plots and estimate the effect of X. The key ideas here are causal powers, random assignment, control, and single-factor treatment.

However, Lieberson insists that most social data are not collected under experimental conditions. It is normally not possible to randomly assign individuals to groups and then observe the effects of interventions. Likewise, it is not possible to systematically control the factors that are present or absent for different groups of subjects. If we want to know whether “presence of hate speech on radio broadcasts” causes “situations of ethnic conflict” to progress to “situations of ethnic violence”, we don’t have the option of identifying a treatment group and a control group among current situations of ethnic conflict and then examining whether the treatment with “hate speech on radio broadcasts” increases the incidence of ethnic violence in the treatment group relative to the control group. And it is fallacious to reason about non-experimental data using the assumptions developed for analysis of experiments. This fallacy involves making “assumptions that appear to be matters of convenience but in reality generate analyses that are completely off the mark” (6).

Suppose we want to investigate whether being a student athlete affects academic performance in college. In order to treat this topic experimentally we would need to select a random group of newly admitted students; randomly assign one group of individuals to athletic programs and the other group to a non-athletic regime; and measure the academic performance of each individual after a period of time. Let’s say that GPA is the performance measure and that we find that the athlete group has a mean GPA of 3.1 while the non-athlete group has an average of 2.8. This would be an experimental confirmation of the hypothesis that “participation in athletics improves academic performance.”

However, this thought experiment demonstrates the common problem with social data: it is not possible to perform this experiment. Rather, students decide for themselves whether they want to compete in athletics, and their individual characteristics will determine whether they will succeed. Instead, we have to work with the social realities that exist; and this means identifying a group of students who have chosen to participate in athletics; comparing them with a “comparable” group of students who have chosen not to participate in athletics; and measuring the academic performance of the two groups. But here we have to confront two crucial problems: selectivity and the logic of “controlling” for extraneous factors.

Selectivity comes in when we consider that the same factors that lead a college student to participate in athletics may also influence his/her academic performance; so measuring the difference between the two groups may only measure the effects of this selective difference between membership in the groups — not the effect of the experience of participating in athletics on academic performance. In order to correct for selectivity, the researcher may attempt to control for potentially influential differences between the two groups; so he/she may attempt to control for family factors, socio-economic status, performance in secondary school, and a set of psycho-social variables. “Controlling” in this context means selecting sub-groups within the two populations that are statistically similar with respect to the variables to be controlled for. Group A and Group B have approximately the same distribution of family characteristics, parental income, and high school GPA; the individuals in the two groups are “substantially similar”. We have “controlled” for these potentially relevant causal factors — so any observed differences between academic performance across the two groups can be attributed to the treatment, “participation in athletics.”
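This notion of "controlling" — comparing athletes and non-athletes only within sub-groups that are similar on the control variables — can be sketched as a stratified comparison. The records, variable names, and numbers below are all hypothetical, invented purely to illustrate the procedure.

```python
from statistics import mean

# Hypothetical student records:
# (athlete?, parental income band, high-school GPA band, college GPA)
students = [
    (True,  "mid", "high", 3.2), (False, "mid", "high", 3.0),
    (True,  "low", "mid",  2.9), (False, "low", "mid",  2.7),
    (True,  "mid", "high", 3.1), (False, "mid", "high", 2.9),
]

def controlled_difference(records):
    """'Controlling' as described above: group students into strata that
    share the control variables, compare athlete vs. non-athlete mean GPA
    within each stratum, and average the within-stratum differences."""
    strata = {}
    for athlete, income, hs_band, college_gpa in records:
        key = (income, hs_band)
        strata.setdefault(key, {True: [], False: []})[athlete].append(college_gpa)
    diffs = [mean(groups[True]) - mean(groups[False])
             for groups in strata.values()
             if groups[True] and groups[False]]   # need both groups in a stratum
    return mean(diffs)
```

Lieberson's point is that even this comparison can mislead: within a stratum, the students who chose athletics may still differ from those who did not in unmeasured ways.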

But Lieberson makes a critical point about this approach: there is commonly unmeasured selectivity within the control variables themselves — crudely, students with the same family characteristics, parental income, and high school GPA who have selected athletics may nonetheless be different from those who have not selected athletics, in ways that influence academic performance. As Lieberson puts the point, “quasi-experimental research almost inevitably runs into a profound selectivity issue” (41).

There is lots more careful, rigorous analysis of social-science reasoning in the book. Lieberson crosses over between statistical methodology and philosophy of social science in a very useful way, and what is most fundamental is his insistence that we need to substantially rethink the assumptions we make in assigning causal influence on the basis of social variation.

Piecemeal empirical assessment of social theories

The philosophy of science devotes a large fraction of its wattage to this question: what is the logic of empirical confirmation for scientific beliefs? (A good short introduction is Samir Okasha, Philosophy of Science: A Very Short Introduction.) In the natural sciences this question became entangled with a parochial fact about those sciences: their theories postulate unobservable entities and processes, and the individual statements or axioms of a theory cannot be separately confirmed or tested. So a logic of confirmation was developed according to which theories are empirically evaluated as wholes; we need to draw out a set of deductive or probabilistic consequences of the theory; observe the truth or falsity of these consequences based on experiment or observation; and then assign a degree of empirical credibility to the theory based on the success of the observational consequences. This could be put as a slogan: “No piecemeal confirmation of scientific beliefs!”

This is the familiar hypothetico-deductive model of confirmation (H-D), articulated most rigorously by Carl Hempel and criticized and amended by philosophers such as Karl Popper, Nelson Goodman, Norwood Hanson, and Imre Lakatos. These debates constituted most of the content of the evolution of positivist philosophy of science into post-positivist philosophy of science throughout the 1960s and 1970s.

I don’t want to dive into this set of debates, because I am interested in knowledge in the social sciences; and I don’t think that the theory-holism that this train of thought depends upon actually has much relevance for the social sciences. The H-D model of confirmation is well suited to only a certain range of scientific fields (mathematical physics, mostly). But the social sciences are not theoretical in the relevant sense. Social science “theories” are mid-level formulations about social mechanisms and structures; they are “theories of the middle range” (Robert Merton, On Theoretical Sociology). They often depend on formulations of ideal types of social entities or organizations of interest — and then concrete empirical investigation of specific organizations to determine the degree to which they conform or diverge from the ideal-typical features specified by the theory. And these mid-level theories and hypotheses can usually be empirically investigated fairly directly through chains of observations and inferences.

This is not a trivial task, of course, and there are all sorts of challenging methodological and conceptual issues that must be addressed as the researcher undertakes to consider whether the world actually conforms to the statements he/she makes about it. But it is logically very different from the holistic empirical evaluation that is required of the special theory of relativity or the string theory of fundamental physics. The language of hypothesis-testing is not quite right for most of the social sciences. Instead, the slogan for social science epistemology ought to be, “Hurrah, piecemeal empirical evaluation!”

I want to argue, further, that this epistemological feature of social knowledge is a derivative of some basic facts about social ontology: social processes, entities, and structures lack the rigidity and law-governedness that is characteristic of natural processes, entities, and structures. So general, universal theories of social entities that cover all instances are unlikely. But second, it is a feature of the accessibility of social things: we interact with social entities in a fairly direct manner, and these interactions permit us to engage in scientific observation of these entities in a way that permits the piecemeal empirical investigation that is highlighted here. And we can construct chains of observations and inferences from primary observations (entries in an archival source) to empirical estimates of a more abstract fact (the level of crop productivity in the Lower Yangzi in 1800).

Let’s say that we were considering a theory that social unrest was gradually rising in a region of China in the nineteenth century because of a gradual shift in the sex ratios found in rural society. The connection between sex ratios and social unrest isn’t directly visible; but we can observe features of both ends of the equation. So we can gather population and family data from registries and family histories; we can gather information about social unrest from gazettes and other local sources; and we can formulate subsidiary theories about the social mechanisms that might connect a rising male-female ratio to the incidence of social unrest. In other words — we can directly investigate each aspect of the hypothesis (cause, effect, mechanism), and we can put forward an empirical argument in favor of the hypothesis (or critical of the hypothesis).

This is an example of what I mean by “piecemeal empirical investigation”. And the specific methodologies of the various social and historical sciences are largely devoted to the concrete tasks of formulating and gathering empirical data in the particular domain. Every discipline is concerned to develop methods of empirical inquiry and evaluation; but, I hold, the basic logic of inquiry and evaluation is similar across all disciplines. The common logic is piecemeal inquiry and evaluation.

(I find Tom Kelly’s article on “Evidence” in the Stanford Encyclopedia of Philosophy to be a better approach to justification in the social sciences than the hypothetico-deductive model of confirmation, and one that is consistent with this piecemeal approach to justification. Kelly also reviews the essentials of H-D confirmation theory.)

Paired comparisons

Sidney Tarrow is a gifted and prolific student of comparative politics. (Listen to my interview with Professor Tarrow.) He has spent much of his career trying to understand social movements, contentious politics, and the causes of differences in political behavior across national settings. And one of his special contributions is his ability to think clearly about the methods that social scientists use.

Tarrow attaches a lot of weight to the idea of “paired comparisons” as a method of research and discovery: Locate a few cases that are broadly similar in many respects but different in a way that is important, interesting, or surprising. Then examine the cases in greater detail to attempt to discover what explains the difference between the two cases. (One of his early books that employs this method is From center to periphery: Alternative models of national-local policy impact and an application to France and Italy.)

Nothing special turns on “pairs” here; what Tarrow is describing is really the logic of small-N comparative research. The point about the broad similarity that is the basis for choosing the cases follows from the logic of causation: if we presuppose that the outcome P is caused by some set of antecedent social and political conditions and we know that C1 and C2 have different outcomes — then the more factors we can “control” for by finding cases in which these factors are constant, the better. This is so, because it demonstrates that none of the constant factors in the two cases are the cause of variation in outcome. And this limits our investigation of possible causes to the factors in which the cases differ.

If this sounds like Mill’s methods of similarity and difference, that’s appropriate — the logic is the same, so far as I can see. Here is Mill’s method of difference:

A B C D -E => P
A B -C D -E => -P

And in this case — making the heroic assumption that A,B,C,D,E exhaust all possible causes of P, and that the cause of P is deterministic rather than probabilistic — then we can infer that the presence of C causes P.

This reasoning doesn’t take us to a valid conclusion to the effect that C is the only factor that is causally relevant to the occurrence of P; it is possible, for example, that there is a third case along these lines:

-A B -C D -E => -P

This would demonstrate that A is a necessary condition for the occurrence of P; withhold A and P disappears. And each of the other factors might also play a role as a necessary condition. So it would be necessary to observe as many as 32 cases (2^5) in order to sort out the status of A through E as either necessary or sufficient conditions for the occurrence of P. (The logic of this kind of causal reasoning is explored more fully in my essay, “An Experiment in Causal Reasoning,” which is also published in Microfoundations, Methods, and Causation.)
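The eliminative core of Mill's method of difference can be stated as a small function: if two cases agree on every candidate factor but one and disagree in outcome, that one differing factor is inferred to cause P — under the heroic assumptions just noted (the factors are exhaustive and the causation deterministic). The encoding below is a minimal sketch of that inference rule.

```python
def method_of_difference(case1, case2):
    """Mill's method of difference as an inference rule: given two cases
    that agree on all factors except one and differ in outcome, return the
    differing factor as the inferred cause; otherwise return None.
    Valid only under the assumptions that the listed factors exhaust the
    possible causes and that causation is deterministic."""
    factors1, outcome1 = case1
    factors2, outcome2 = case2
    differing = [f for f in factors1 if factors1[f] != factors2[f]]
    if len(differing) == 1 and outcome1 != outcome2:
        return differing[0]
    return None  # the method does not license any conclusion

# The two cases from the schema above: A B C D -E => P, A B -C D -E => -P
case_present = ({"A": True, "B": True, "C": True,  "D": True, "E": False}, True)
case_absent  = ({"A": True, "B": True, "C": False, "D": True, "E": False}, False)
inferred = method_of_difference(case_present, case_absent)   # "C"
```

Note that the function returns None when the cases differ in more than one factor, which is exactly why sorting out all five factors as necessary or sufficient conditions can require observing up to 2^5 = 32 cases.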

But I don’t think that Tarrow is intending to advance the method of paired comparison as a formal method of causal inference, along the lines of inductive or deductive logic. Instead, I think he is making the plausible point that this method should be understood as a part of an intelligent research strategy. Social processes are complex. We are interested in explaining variation across cases. And we probably have the best likelihood of discovering important causal relationships if we can reduce the number of moving parts (the other kinds of variation that occur across the cases).

Tarrow gives an example of the application of the method of paired comparisons in the context of his early study of the political fortunes of the Italian Communist Party (PCI) in the south of Italy. In this case the paired comparison involves northern Italy and southern Italy. Both are subject to the same national political structures; both populations speak Italian; both populations have an Italian national identity. However, the PCI was fairly successful in mobilizing support and winning elections based on a militant political program in the north, and was persistently unsuccessful in doing these things in the south. What explains the difference?

As Tarrow explains his reasoning, his expectation in conducting the research was a “structural” one. He expected that there would be large structural factors in post-war Italy — features of economic and political institutions — that would explain the difference in outcome for PCI political activism. And there were indeed large structural differences in social organization in the two regions. Northern Italy possessed an economy in which industrial labor played a key role and constituted a substantial part of the population. Southern Italy was agrarian and backward, with a large percentage of exploited peasants and only a small percentage of industrial workers.

But, very significantly, Tarrow now believes that these “structural” expectations are probably too “macro” to serve as the basis of social explanation. Instead, he favors the importance of looking at the dynamics of social processes and the specific causal mechanisms that can be discovered in particular social-historical settings. This means looking for causal factors that work at a more strategic and meso level. In terms of the southern Italian PCI outcome that he was interested in explaining thirty years ago — he now believes that the causal mechanism of “brokerage” would have shed much more light on the political outcomes that were of interest in Italy. (This is the heart of the approach that he takes, along with Doug McAdam and Chuck Tilly, in Dynamics of Contention.)

This finding doesn’t invalidate the heuristic of paired comparisons. But it probably does invalidate the expectation that we might discover large “structure-structure” patterns of causation through such comparisons. Instead, what the method facilitates is a more focused research effort on the part of the comparativist, in the context of which he/she can search out the lower-level causal mechanisms and processes that are at work in the various settings under study.

Coverage of the social sciences

Suppose we took the view that the social sciences ought to provide sufficient conceptual and methodological tools to analyze and explain any kind of social behavior. This would be a certain kind of completeness: not theoretical or explanatory completeness, in the sense of having a finished set of theories that can explain everything, but conceptual completeness, in the sense that there are sufficient conceptual resources to give a basis for describing every form of social behavior, and methodological completeness, in the sense that for every possible research question there are starting points for inquiry in the social sciences. And, finally, suppose we stipulate that there are always new hypotheses to be discovered and new theories to be invented.

If this is one of the ultimate aspirations for the social sciences, then we can ask — how close is the current corpus of social science research and knowledge to this goal?

One possible answer is that we have already reached this goal. The conceptual resources of anthropology, economics, political science, and sociology serve as a “fish-scale” system of conceptual coverage that gives us a vocabulary for describing any possible configuration of social behavior. And the most basic ideas about empirical research, causal reasoning, hypothetical thinking, and interpretation of meaning give us a preliminary basis for probing and investigating any of the “new” phenomena we might discover.

Another possible answer goes in the opposite direction. The concepts of the social science disciplines are parochial and example-based. When new forms of social interaction emerge we will need new concepts on the basis of which to describe and represent these social behaviors. So concepts and empirical knowledge must go hand in hand, and new discoveries will stimulate new concepts as well.

Consider this thought experiment. Suppose the social sciences had developed to this point minus micro-economics. The reduced scheme would involve many aspects of behavior and thought, but it would have omitted the category of “rational self-interest.” Is this a possible scenario? Would the reduced set be complete in the sense described above? And what kind of discovery would be required in order for these alternative-world social scientists to progress?

The incompleteness of alternative-world social science is fairly evident. There would be important ranges of behavior that would be inscrutable without the concept of rational self-interest (market equilibria, free-rider problems). And the solution would appear fairly evident as well. These gaps in explanatory scope would lead investigators to ask, what is the hidden factor we are not considering? And they would be led to discover the concept of rational self-interest.

The moral seems to be this: it is always possible that new discoveries of anomalous phenomena will demonstrate the insufficiency of the current conceptual scheme. And therefore there is never a point at which we can declare that science is now complete, and no new concepts will be needed.

At the same time, we do in fact have a rough-and-ready pragmatic confidence that the social sciences as an extended body of theories, concepts, and results have pretty well covered the primary scope of human behavior. And this suggests a vision of the way the social sciences cover the domain of the social as well: not as a comprehensive deductive theory but rather as an irregular, overlapping collection of concepts, methods, and theories — a set of fish-scales rather than an architect’s blueprint for all social phenomena.