Debates about field experiments in the social sciences

Questions about the empirical validation of hypotheses about social causation have been of interest in the past several weeks here. Relevant to that question is Dawn Langan Teele’s recent volume, Field Experiments and Their Critics: Essays on the Uses and Abuses of Experimentation in the Social Sciences. The essays in the book make for interesting reading for philosophers of the social sciences. But the overall impression that I take away is that the assumptions this research community makes about social causation are excessively empiricist and under-theorized. These are essentially the assumptions that come along with an econometrician’s view of social reality. The researchers approach causation consistently as “empirical social arrangement,” “intervention,” and “net effect”. But this is not a satisfactory way of capturing the workings of social causation. Instead, we need to attempt to construct adequate theories of the institutions, norms, and patterns of action through which various social arrangements work, and the causal mechanisms and processes to which these social realities give rise.

The debates considered here concern the relative evidential value of controlled observation and RCT-style experiments, with Gerber, Green, and Kaplan arguing on Bayesian statistical grounds that the epistemic weight of observation-based research is close to zero.

We find that unless researchers have prior information about the biases associated with observational research, observational findings are accorded zero weight regardless of sample size, and researchers learn about causality exclusively through experimental results. (kl 211)
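The logic of this claim can be captured in a toy Bayesian model (a schematic sketch of my own, not GGK’s actual formalization): an observational estimate D equals the true effect tau plus an unknown bias plus sampling noise, and the weight the posterior places on D collapses toward zero as prior uncertainty about the bias grows, no matter how large the sample.

```python
# Toy normal-normal model (my illustration, not GGK's): an observational
# estimate D = tau + bias + noise, with priors tau ~ N(0, s_tau2),
# bias ~ N(0, s_bias2), and sampling noise ~ N(0, s_noise2).
# The posterior mean of tau given D is w * D; w is the weight on the data.

def posterior_weight(s_tau2: float, s_bias2: float, s_noise2: float) -> float:
    """Weight w in E[tau | D] = w * D for jointly normal (tau, bias, noise)."""
    return s_tau2 / (s_tau2 + s_bias2 + s_noise2)

# A randomized experiment eliminates the bias term entirely (s_bias2 = 0).
print(posterior_weight(s_tau2=1.0, s_bias2=0.0, s_noise2=0.1))   # ~0.91

# An observational study with a confident prior about the bias: still informative.
print(posterior_weight(s_tau2=1.0, s_bias2=0.5, s_noise2=0.1))   # ~0.63

# An observational study with near-total ignorance about the bias: the weight
# is ~0 even with an infinite sample (s_noise2 = 0), echoing the GGK claim.
print(posterior_weight(s_tau2=1.0, s_bias2=1e6, s_noise2=0.0))   # ~1e-6
```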

Field experiments are defined as “randomized controlled trials carried out in a real-world setting” (kl 92). Observational data relevant to causation often derive from what researchers call “natural experiments”, in which otherwise similar groups of subjects are exposed to different influences thought to have causal effect. If we believe that trauma affects students’ learning, we might compare a group of first-grade classrooms in a city that experienced a serious tornado with a comparable group of first-grade classrooms in a city without an abrupt and disruptive crisis. If the tornado classrooms showed lower achievement scores than the no-tornado classrooms, we might regard this as a degree of support for the causal hypothesis.

The radical skeptics about observational data draw strong conclusions; if we accept this line of thought, then it would appear that observational evidence about causation is rarely useful. The opening qualification in the GGK quote (“unless researchers have prior information about the biases associated with observational research”) is crucial, however, since researchers generally do have prior information about the factors influencing outcomes and the selection of cases in the studies they undertake, as Susan Stokes argues in her response essay:

Do observational researchers “know nothing” about the processes that generate independent variables and are they hence “entirely uncertain” about bias? Is the “strong possibility” of unobserved confounding factors “always omnipresent” in observational research? Are rival hypotheses “always plausible”? Can one do nothing more than “assume nonconfoundedness”? To the extent that the answers to these questions are no, radical skepticism is undermined. (kl 751)

Stokes provides a clear exposition of how unrelated other causes X_ij and confounders Z_ik figure in the linear causal equation for an outcome Y that depends on the variable X (kl 693), a model of roughly this form:

Y_i = α + β·X_i + Σ_j γ_j·X_ij + Σ_k δ_k·Z_ik + ε_i

This model is offered as a representation of the “true” causation of Y, including both observed and unobserved factors. We might imagine that we have full observational data on Y and X, observations for some but not all of the X_ij, and no observations for the Z_ik.
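A quick simulation shows what the unobserved Z_ik do to causal inference (a hypothetical illustration of my own, not Stokes’s example): when a confounder Z drives both X and Y and is omitted from the regression, the estimated coefficient on X absorbs part of Z’s effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True structure: Z is a confounder that drives both X and Y.
Z = rng.normal(size=n)
X = 0.8 * Z + rng.normal(size=n)
Y = 1.0 * X + 2.0 * Z + rng.normal(size=n)   # true effect of X on Y is 1.0

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Regressing Y on X alone: the slope absorbs Z's effect (comes out near 2.0).
print(ols_slope(X, Y))

# Observing Z and "controlling" for it (residualizing X and Y on Z)
# recovers the true coefficient, near 1.0.
X_res = X - ols_slope(Z, X) * Z
Y_res = Y - ols_slope(Z, Y) * Z
print(ols_slope(X_res, Y_res))
```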

The logical advantage of a randomized field experiment is that random assignment of individuals to the treatment and non-treatment groups guarantees that, in expectation, the two populations do not differ systematically with respect to hidden characteristics that may be relevant to the causal workings of the treatment. In the hypothetical tornado-and-learning study mentioned above, there will be a spatial difference between the treatment and control groups; and regional and spatial differences among children may themselves be relevant to learning. So the observed difference in learning may be the effect of the trauma of the tornado, or it may be the coincidental effect of the regional difference (say, between midwestern and northeastern students).

Andrew Gelman takes a step back and assesses the larger significance of this debate for social-science research. Here is his general characterization of the policy and epistemic interests that motivate social scientists (along the lines of an earlier post on policy and experiment; link):

Policy analysis (and, more generally, social science) proceeds in two ways. From one direction, there are questions whose answers we seek—how can we reduce poverty, fight crime, help people live happier and healthier lives, increase the efficiency of government, better translate public preferences into policy, and so forth? From another direction, we can gather discrete bits of understanding about pieces of the puzzle: estimates of the effects of particular programs as implemented in particular places. (kl 3440)

Gelman concisely captures the assumptions about causality that underlie this paradigm of social-science research: that causal factors can take the form of pretty much any configuration of social intervention and structure, and that we can always ask what the effects of a given configuration are. But this is a view of causation that most realists would reject, because it represents causes in a highly untheorized way. On this view, anything can be a cause, and its causal significance is simply the net difference it makes in the world in contrast to its absence. But this is a faulty understanding of real social causation.

Consider an example. Some American school systems have K-8 and 9-12 systems of elementary school and high school; other systems have K-6, 7-8, and 9-12 systems. These configurations might be thought of as “causal factors”, and we might ask, “what is the net effect of system A or system B on educational performance of students by grade 12” (or “juvenile delinquency rates by grade 10”)? But a realist would argue that this is too coarse-grained a perspective on causation for a complex social system like education. Instead, we need to identify more granular and more pervasive causes at a more fundamental individual and institutional level, which can then perhaps be aggregated into larger system-level effects. For example, if we thought that the socialization process of children between 11 and 14 is particularly sensitive to bullying, and if we thought that high schools create a more welcoming environment for bullying, then we might have reason to expect that the middle school model would be more conducive to the educational socialization of children in this age range. But these two hypotheses can be separately investigated. And the argument that System A produces better educational outcomes than System B will now rest on reasoning about more fundamental causal processes rather than on empirical and experimental findings based on examination of the outcomes associated with the two systems. Moreover, it is possible that the causal-mechanism reasoning I’ve just described is valid and a good guide to policy choice, even though observations and experiments at the level of full educational systems do not demonstrate a statistical difference between them.

More generally, arbitrary descriptions of “social factors” do not serve as causal factors whose effects we can investigate purely through experimentation and observation. Rather, as the realists argue, we need to have a theory of the workings of the social factors in which we are interested, and we then need to empirically study the causal characteristics of those underlying features of actors, norms, institutions, and structures. Only then can we have a basis for judging that this or that macro-level empirical arrangement will have specific consequences. Bhaskar is right in this key ontological prescription for the social sciences: we need to work towards finding theories of the underlying mechanisms and structures that give rise to the observable workings of the social world. And crude untheorized empirical descriptions of “factors” do not contribute to a better understanding of the social world. The framework here is “empiricist,” not because it gives primacy to empirical validation, but because it elides the necessity of offering realistic accounts of underlying social mechanisms, processes, and structures.

Quasi-experimental data?

Stan Lieberson is one of a group of sociologists for whom I have great respect when it comes to intelligent thinking about social science methodology. His 1985 book, Making It Count: The Improvement of Social Research and Theory, is a good example of this thinking about the foundations of social science knowledge, and I also admire A Matter of Taste: How Names, Fashions, and Culture Change for the genuinely novel topic and approach it offers.

Lieberson urges us to consider “a different way of thinking about the rigorous study of society implied by the phrase ‘science of society’” instead of simply assuming that social science should resemble natural science (3-4). His particular object of criticism in this book is the tendency of quantitative social scientists to use the logic of experiments to characterize the data they study.

An experiment is an attempt to measure the causal effects of one factor X on another factor Z by isolating a domain of phenomena — holding constant all other causal factors — and systematically varying one causal factor to observe the effect this factor has on an outcome of interest. The basic assumption is that an outcome is the joint effect of a set of (as yet unknown) causal conditions:

C1 & C2 & … & Cn cause Z,

where we do not yet know the contents of the list Ci. We consider the hypothesis that Cm is one of the causes of Z. We design an experimental environment in which we are able to hold constant all the potentially relevant causal conditions we can think of (thereby holding fixed the other Ci), and we systematically vary the presence or absence of Cm and observe the state of the outcome Z. If Z varies appropriately with the presence or absence of Cm, we tentatively conclude that Cm is one of the causes of Z.

In cases where individual differences among samples or subjects may affect the outcome, or where the causal processes in question are probabilistic rather than deterministic, experimentation requires treating populations rather than individuals and ensuring randomization of subjects across “treatment” and “no-treatment” groups. This involves selecting a number of subjects, randomly assigning them to controlled conditions in which all other potential causal factors are held constant, exposing one set of subjects to the treatment X while withholding the treatment from the other group, and measuring the outcome variable in the two groups. If there is a significant difference in the mean value of the outcome variable between the treatment group and the control group, then we can tentatively conclude that X causes Z and perhaps estimate the magnitude of the effect. Take tomato yields per square meter (Z) as affected by fertilizer X: plants in the control group are subjected to a standard set of growing conditions, while the treatment group receives these conditions plus the measured dose of X. We then measure the quantity produced by the two plots and estimate the effect of X. The key ideas here are causal powers, random assignment, control, and single-factor treatment.
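A short simulation of this design (hypothetical numbers of my own, not Lieberson’s) makes the logic concrete: random assignment balances the unmeasured growing conditions across the two groups, so the difference in mean yields recovers the effect of the fertilizer.

```python
import numpy as np

rng = np.random.default_rng(42)
n_plots = 200
true_effect = 0.5   # hypothetical yield gain (kg per square meter) from X

# Each plot has its own baseline yield from unmeasured conditions (soil, light...).
baseline = rng.normal(loc=3.0, scale=0.4, size=n_plots)

# Random assignment: hidden plot characteristics are balanced across groups
# in expectation, which is the logical advantage of the experiment.
treated = rng.random(n_plots) < 0.5

yields = baseline + true_effect * treated + rng.normal(scale=0.2, size=n_plots)

# Estimate of the effect of X: difference in mean yield between the groups.
estimate = yields[treated].mean() - yields[~treated].mean()
print(f"estimated effect: {estimate:.2f} (true effect {true_effect})")
```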

However, Lieberson insists that most social data are not collected under experimental conditions. It is normally not possible to randomly assign individuals to groups and then observe the effects of interventions. Likewise, it is not possible to systematically control the factors that are present or absent for different groups of subjects. If we want to know whether “presence of hate speech on radio broadcasts” causes “situations of ethnic conflict” to progress to “situations of ethnic violence”, we don’t have the option of identifying a treatment group and a control group of current situations of ethnic conflict and then examining whether the treatment with “hate speech on radio broadcasts” increases the incidence of ethnic violence in the treatment group relative to the control group. And it is fallacious to reason about non-experimental data using the assumptions developed for the analysis of experiments. This fallacy involves making “assumptions that appear to be matters of convenience but in reality generate analyses that are completely off the mark” (6).

Suppose we want to investigate whether being a student athlete affects academic performance in college. In order to treat this topic experimentally we would need to select a random group of newly admitted students; randomly assign one group of individuals to athletic programs and the other group to a non-athletic regime; and measure the academic performance of each individual after a period of time. Let’s say that GPA is the performance measure and that we find that the athlete group has a mean GPA of 3.1 while the non-athlete group has an average of 2.8. This would be an experimental confirmation of the hypothesis that “participation in athletics improves academic performance.”

However, this thought experiment illustrates a pervasive problem with social data: it is not possible to perform this experiment. Students decide for themselves whether they want to compete in athletics, and their individual characteristics influence whether they will succeed. Instead, we have to work with the social realities that exist; and this means identifying a group of students who have chosen to participate in athletics, comparing them with a “comparable” group of students who have chosen not to participate, and measuring the academic performance of the two groups. But here we have to confront two crucial problems: selectivity and the logic of “controlling” for extraneous factors.

Selectivity comes in when we consider that the same factors that lead a college student to participate in athletics may also influence his/her academic performance; so measuring the difference between the two groups may only measure the effects of this selective difference in group membership, not the effect of the experience of participating in athletics on academic performance. In order to correct for selectivity, the researcher may attempt to control for potentially influential differences between the two groups; so he/she may attempt to control for family factors, socio-economic status, performance in secondary school, and a set of psycho-social variables. “Controlling” in this context means selecting sub-groups within the two populations that are statistically similar with respect to the variables to be controlled for. Group A and Group B have approximately the same distribution of family characteristics, parental income, and high school GPA; the individuals in the two groups are “substantially similar”. We have “controlled” for these potentially relevant causal factors, so any observed differences in academic performance across the two groups can be attributed to the treatment, “participation in athletics.”

But Lieberson makes a critical point about this approach: there is commonly unmeasured selectivity within the control variables themselves — crudely, students with the same family characteristics, parental income, and high school GPA who have selected athletics may nonetheless be different from those who have not selected athletics, in ways that influence academic performance. As Lieberson puts the point, “quasi-experimental research almost inevitably runs into a profound selectivity issue” (41).
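A simulation makes Lieberson’s point vivid (the variables and numbers are my own hypothetical illustration, not his): even after “controlling” by comparing athletes and non-athletes with substantially similar observed characteristics, an unobserved trait that drives both the choice to participate and academic performance leaves a spurious “effect” in the comparison.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Observed covariate (high school GPA, standardized) and an UNOBSERVED trait
# ("drive") that influences both selection into athletics and college GPA.
hs_gpa = rng.normal(size=n)
drive = rng.normal(size=n)

# Self-selection: students with more drive (and better grades) choose athletics.
athlete = (0.5 * hs_gpa + 1.0 * drive + rng.normal(size=n)) > 0

# True model: participation in athletics has ZERO effect on college GPA.
college_gpa = 2.9 + 0.3 * hs_gpa + 0.3 * drive + 0.2 * rng.normal(size=n)

# "Control" for the observed covariate by comparing within a narrow band
# of high school GPA, mimicking the matched sub-groups described above.
band = np.abs(hs_gpa) < 0.1
diff = college_gpa[band & athlete].mean() - college_gpa[band & ~athlete].mean()
print(f"matched difference: {diff:.2f}")   # positive, despite a true effect of 0
```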

There is lots more careful, rigorous analysis of social-science reasoning in the book. Lieberson crosses over between statistical methodology and philosophy of social science in a very useful way, and what is most fundamental is his insistence that we need to substantially rethink the assumptions we make in assigning causal influence on the basis of social variation.


Piecemeal empirical assessment of social theories

The philosophy of science devotes a large fraction of its wattage to this question: what is the logic of empirical confirmation for scientific beliefs? (A good short introduction is Samir Okasha, Philosophy of Science: A Very Short Introduction.) In the natural sciences this question became entangled with a parochial fact about those sciences: their theories postulate unobservable entities and processes, and the individual statements or axioms of a theory cannot be separately confirmed or tested. So a logic of confirmation was developed according to which theories are empirically evaluated as wholes: we draw out a set of deductive or probabilistic consequences of the theory, observe the truth or falsity of these consequences based on experiment or observation, and then assign a degree of empirical credibility to the theory based on the success of the observational consequences. This could be put as a slogan: “No piecemeal confirmation of scientific beliefs!”

This is the familiar hypothetico-deductive model of confirmation (H-D), articulated most rigorously by Carl Hempel and criticized and amended by philosophers such as Karl Popper, Nelson Goodman, Norwood Hanson, and Imre Lakatos. These debates constituted most of the content of the evolution of positivist philosophy of science into post-positivist philosophy of science throughout the 1960s and 1970s.

I don’t want to dive into this set of debates, because I am interested in knowledge in the social sciences; and I don’t think that the theory-holism that this train of thought depends upon actually has much relevance for the social sciences. The H-D model of confirmation is reasonably well suited to a certain range of scientific fields (mathematical physics, mostly), but the social sciences are not theoretical in the relevant sense. Social science “theories” are mid-level formulations about social mechanisms and structures; they are “theories of the middle range” (Robert Merton, On Theoretical Sociology). They often depend on formulations of ideal types of social entities or organizations of interest, followed by concrete empirical investigation of specific organizations to determine the degree to which they conform to or diverge from the ideal-typical features specified by the theory. And these mid-level theories and hypotheses can usually be empirically investigated fairly directly through chains of observations and inferences.

This is not a trivial task, of course, and there are all sorts of challenging methodological and conceptual issues that must be addressed as the researcher undertakes to consider whether the world actually conforms to the statements he/she makes about it. But it is logically very different from the holistic empirical evaluation that is required of the special theory of relativity or of string theory in fundamental physics. The language of hypothesis-testing is not quite right for most of the social sciences. Instead, the slogan for social science epistemology ought to be, “Hurrah, piecemeal empirical evaluation!”

I want to argue, further, that this epistemological feature of social knowledge derives from some basic facts about social ontology and accessibility. First, social processes, entities, and structures lack the rigidity and law-governedness characteristic of natural processes, entities, and structures, so general, universal theories of social entities that cover all instances are unlikely. Second, social things are epistemically accessible: we interact with social entities in a fairly direct manner, and these interactions permit us to engage in the kind of piecemeal empirical investigation highlighted here. And we can construct chains of observations and inferences from primary observations (entries in an archival source) to empirical estimates of a more abstract fact (the level of crop productivity in the Lower Yangzi in 1800).

Let’s say that we were considering a theory that social unrest was gradually rising in a region of China in the nineteenth century because of a gradual shift in the sex ratios found in rural society. The connection between sex ratios and social unrest isn’t directly visible; but we can observe features of both ends of the equation. So we can gather population and family data from registries and family histories; we can gather information about social unrest from gazettes and other local sources; and we can formulate subsidiary theories about the social mechanisms that might connect a rising male-female ratio to the incidence of social unrest. In other words — we can directly investigate each aspect of the hypothesis (cause, effect, mechanism), and we can put forward an empirical argument in favor of the hypothesis (or critical of the hypothesis).

This is an example of what I mean by “piecemeal empirical investigation”. And the specific methodologies of the various social and historical sciences are largely devoted to the concrete tasks of formulating and gathering empirical data in the particular domain. Every discipline is concerned to develop methods of empirical inquiry and evaluation; but, I hold, the basic logic of inquiry and evaluation is similar across all disciplines. The common logic is piecemeal inquiry and evaluation.

(I find Tom Kelly’s article on “Evidence” in the Stanford Encyclopedia of Philosophy to be a better approach to justification in the social sciences than the hypothetico-deductive model of confirmation, and one that is consistent with this piecemeal approach to justification. Kelly also reviews the essentials of H-D confirmation theory.)

Paired comparisons


Sidney Tarrow is a gifted and prolific student of comparative politics. (Listen to my interview with Professor Tarrow.) He has spent much of his career trying to understand social movements, contentious politics, and the causes of differences in political behavior across national settings. And one of his special contributions is his ability to think clearly about the methods that social scientists use.

Tarrow attaches a lot of weight to the idea of “paired comparisons” as a method of research and discovery: Locate a few cases that are broadly similar in many respects but different in a way that is important, interesting, or surprising. Then examine the cases in greater detail to attempt to discover what explains the difference between the two cases. (One of his early books that employs this method is From center to periphery: Alternative models of national-local policy impact and an application to France and Italy.)

Nothing special turns on “pairs” here; what Tarrow is describing is really the logic of small-N comparative research. The point about the broad similarity that is the basis for choosing the cases follows from the logic of causation: if we presuppose that the outcome P is caused by some set of antecedent social and political conditions, and we know that two cases C1 and C2 have different outcomes, then the more factors we can “control” for by finding cases in which these factors are constant, the better. This is so because it demonstrates that none of the factors constant across the two cases is the cause of the variation in outcome. And this limits our investigation of possible causes to the factors in which the cases differ.

If this sounds like Mill’s methods of similarity and difference, that’s appropriate — the logic is the same, so far as I can see. Here is Mill’s method of difference:

A B C D -E => P
A B -C D -E => -P

And in this case — making the heroic assumption that A, B, C, D, E exhaust all possible causes of P, and that the cause of P is deterministic rather than probabilistic — we can infer that the presence of C causes P.

This reasoning doesn’t take us to a valid conclusion to the effect that C is the only factor that is causally relevant to the occurrence of P; it is possible, for example, that there is a third case along these lines:

-A B -C D -E => -P

This would suggest that A is a necessary condition for the occurrence of P: withhold A and P disappears. And each of the other factors might also play a role as a necessary condition. So it would be necessary to observe as many as 32 cases (2^5) in order to sort out the status of A through E as necessary or sufficient conditions for the occurrence of P. (The logic of this kind of causal reasoning is explored more fully in my essay, “An Experiment in Causal Reasoning,” which is also published in Microfoundations, Methods, and Causation.)
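The combinatorics are easy to check in code (a schematic sketch, assuming deterministic causation, an exhaustive list of candidate factors, and a stipulated rule for P of my own invention): enumerate all 2^5 configurations of A through E and read off which factors are necessary or sufficient for P.

```python
from itertools import product

FACTORS = "ABCDE"

def outcome(config):
    """Stipulated deterministic rule (hypothetical): P occurs iff A and C are present."""
    return config["A"] and config["C"]

# All 2^5 = 32 possible configurations of the five candidate causes.
cases = [dict(zip(FACTORS, values)) for values in product([False, True], repeat=5)]

for f in FACTORS:
    # f is necessary if it is present in every case where P occurs.
    necessary = all(config[f] for config in cases if outcome(config))
    # f is sufficient if P occurs in every case where f is present.
    sufficient = all(outcome(config) for config in cases if config[f])
    print(f"{f}: necessary={necessary}, sufficient={sufficient}")
```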

But I don’t think that Tarrow intends to advance the method of paired comparison as a formal method of causal inference, along the lines of inductive or deductive logic. Instead, I think he is making the plausible point that this method should be understood as part of an intelligent research strategy. Social processes are complex. We are interested in explaining variation across cases. And we probably have the best chance of discovering important causal relationships if we can reduce the number of moving parts (the other kinds of variation that occur across the cases).

Tarrow gives an example of the application of the method of paired comparisons in the context of his early study of the political fortunes of the Italian Communist Party (PCI) in the south of Italy. In this case the paired comparison involves northern Italy and southern Italy. Both are subject to the same national political structures; both populations speak Italian; both populations have an Italian national identity. However, the PCI was fairly successful in mobilizing support and winning elections based on a militant political program in the north, and was persistently unsuccessful in doing these things in the south. What explains the difference?

As Tarrow explains his reasoning, his expectation in conducting the research was a “structural” one. He expected that there would be large structural factors in post-war Italy — features of economic and political institutions — that would explain the difference in outcome for PCI political activism. And there were indeed large structural differences in social organization in the two regions. Northern Italy possessed an economy in which industrial labor played a key role and constituted a substantial part of the population. Southern Italy was agrarian and backward, with a large percentage of exploited peasants and only a small percentage of industrial workers.

But, very significantly, Tarrow now believes that these “structural” expectations are probably too “macro” to serve as the basis of social explanation. Instead, he emphasizes the importance of looking at the dynamics of social processes and the specific causal mechanisms that can be discovered in particular social-historical settings. This means looking for causal factors that work at a more strategic and meso level. In terms of the southern Italian PCI outcome that he was interested in explaining thirty years ago, he now believes that the causal mechanism of “brokerage” would have shed much more light on the political outcomes that were of interest in Italy. (This is the heart of the approach that he takes, along with Doug McAdam and Chuck Tilly, in Dynamics of Contention.)

This finding doesn’t invalidate the heuristic of paired comparisons. But it probably does invalidate the expectation that we might discover large “structure-structure” patterns of causation through such comparisons. Instead, what the method facilitates is a more focused research effort on the part of the comparativist, in the context of which he/she can search out the lower-level causal mechanisms and processes that are at work in the various settings under study.

Coverage of the social sciences

Suppose we took the view that the social sciences ought to provide sufficient conceptual and methodological tools to analyze and explain any kind of social behavior. This would be a certain kind of completeness: not theoretical or explanatory completeness, in the sense of having a finished set of theories that can explain everything, but conceptual completeness, in the sense that there are sufficient conceptual resources to give a basis for describing every form of social behavior, and methodological completeness, in the sense that for every possible research question there are starting points for inquiry in the social sciences. And, finally, suppose we stipulate that there are always new hypotheses to be discovered and new theories to be invented.

If this is one of the ultimate aspirations for the social sciences, then we can ask — how close is the current corpus of social science research and knowledge to this goal?

One possible answer is that we have already reached this goal. The conceptual resources of anthropology, economics, political science, and sociology serve as a “fish-scale” system of conceptual coverage that gives us a vocabulary for describing any possible configuration of social behavior. And the most basic ideas about empirical research, causal reasoning, hypothetical thinking, and interpretation of meaning give us a preliminary basis for probing and investigating any of the “new” phenomena we might discover.

Another possible answer goes in the opposite direction. The concepts of the social science disciplines are parochial and example-based. When new forms of social interaction emerge we will need new concepts on the basis of which to describe and represent these social behaviors. So concepts and empirical knowledge must go hand in hand, and new discoveries will stimulate new concepts as well.

Consider this thought experiment. Suppose the social sciences had developed to this point minus micro-economics. The reduced scheme would involve many aspects of behavior and thought, but it would have omitted the category of “rational self-interest.” Is this a possible scenario? Would the reduced set be complete in the sense described above? And what kind of discovery would be required in order for these alternative-world social scientists to progress?

The incompleteness of alternative-world social science is fairly evident. There would be important ranges of behavior that would be inscrutable without the concept of rational self-interest (market equilibria, free-rider problems). And the solution would appear fairly evident as well. These gaps in explanatory scope would lead investigators to ask, what is the hidden factor we are not considering? And they would be led to discover the concept of rational self-interest.

The moral seems to be this: it is always possible that new discoveries of anomalous phenomena will demonstrate the insufficiency of the current conceptual scheme. And therefore there is never a point at which we can declare that science is now complete, and no new concepts will be needed.

At the same time, we do in fact have a rough-and-ready pragmatic confidence that the social sciences as an extended body of theories, concepts, and results have pretty well covered the primary scope of human behavior. And this suggests a vision of the way the social sciences cover the domain of the social as well: not as a comprehensive deductive theory but rather as an irregular, overlapping collection of concepts, methods, and theories — a set of fish-scales rather than an architect’s blueprint for all social phenomena.
