by a literal banana
As a banana who lives among humans, I am naturally interested in humans, and in the social sciences they use to study themselves. This essay is my current response to the Thiel question: “What important truth do very few people agree with you on?” And my answer is that surveys are bullshit.
In the abstract, I think a lot of people would agree with me that surveys are bullshit. What I don’t think is widely known is how much “knowledge” is based on survey evidence, and what poor evidence it makes in the contexts in which it is used. The nutrition study that claims that eating hot chili peppers makes you live longer is based on surveys. The twin study about the heritability of joining a gang or carrying a gun is based on surveys of young people. The economics study claiming that long commutes reduce happiness is based on surveys, as are all studies of happiness, like the one that claims that people without a college degree are much less happy than they were in the 1970s. The study that claims that pornography is a substitute for marriage is based on surveys. That criminology statistic about domestic violence or sexual assault or drug use or the association of crime with personality factors is almost certainly based on surveys. (Violent crime studies and statistics are particularly likely to be based on extremely cursed instruments, especially the Conflict Tactics Scale, the Sexual Experiences Survey, and their descendants.) Medical studies of pain and fatigue rely on surveys. Almost every study of a psychiatric condition is based on surveys, even if an expert interviewer is taking the survey on the subject’s behalf (e.g. the Hamilton Depression Rating Scale). Many studies that purport to be about suicide are actually based on surveys of suicidal thoughts or behaviors. In the field of political science, election polls and elections themselves are surveys.
What I mean by “surveys” is standard written (or spoken) instruments, composed mostly of language, that are administered to subjects, who give responses, and whose responses are treated as quantitative information, which may then be subjected to statistical analysis. It is not the case that knowledge can never be obtained in this manner. But the idea that there exists some survey, and some survey conditions, that might plausibly produce the knowledge claimed, tends to lead to a mental process of filling in the blanks, of giving the benefit of the doubt to surveys in the ordinary case. But, I think, the ordinary survey, in its ordinary conditions, is of no evidentiary value for any important claim. Just because there exist rare conditions where survey responses tightly map to some condition measurable in other ways does not mean that the vast majority of surveys have any value.
Survey evidence seems to be a new phenomenon. Robert Groves (2011) argues that it is a 20th century phenomenon, arising in the 1930s, achieving a golden age from the 1960s to the 1990s, and then falling off in prestige and reliability after that.
Why is it important that surveys are new? I think it is important to remember that there is no ancestral practice equivalent to surveys. That is to say, there is no ancient human practice or context in which people anonymously tell the pure, innocent truth with language, in response to questioning, with no thought for the motives of the questioner or the effect of their answers. However, in the new, wholly invented ethnomethod of [doing a survey], it is imagined that subjects do tell the innocent truth, comprehending the underlying sense of the question but not answering with any motive or particularity of context. The anonymity of survey takers is given as proof that they feel free to tell the truth, rather than being perceived as a bar to asking them what they might have meant by their responses.
To get at the philosophical weirdness of the survey, it is necessary to dissect the phenomenon of survey-taking in detail. First I will consider the legal perspective, since that is an ancient domain of getting at truth through language in particular contexts.
Surveys and Legal Evidence
The vast majority of evidence in the legal context is and has always been testimonial. That is, a witness testifies in language to communicate some fact to the judge or jury, and the judge or jury then decides how much to believe it. This is true even for modern DNA evidence: the expert witness testifies about the alleged meaning of the laboratory findings, even if documents (also testimony of the writer or preparer) are also given to the trier of fact to examine. Good evidence and garbage evidence are both usually testimony.
In the English common law tradition, the biggest rule about testimonial evidence is hearsay. To put it colloquially, the hearsay rule says that testimony has to come from the horse’s mouth. If you saw someone run a red light and crash into a fire truck, you could undergo the ritual of being put under oath (agreeing that negative legal consequences could befall you for untrue speech), and testify about what you saw in court. Then you would be subject to the further ritual of cross-examination, in which all the aspects of your testimony could be questioned: your vision, whether you had your glasses on, where you were positioned, what specifically you mean by “crashed,” your relationship with the driver, and perhaps even whether you had a past conviction for forgery, which might make you seem like a liar in general. Those responsible for deciding whether to believe your testimony would have the chance to look at your face and mannerisms while testifying, to look at your clothes and hair and grooming, to see whether your eyes are bloodshot, in order to judge your responses to questioning. This may seem superficial and unfortunate, but in conversation we make these judgments all the time. And a witness may lose credibility for appearing too slick as often as for appearing too tattered, as in ordinary life.
However, if you wanted to testify about what you heard someone else say, someone who isn’t present for ritual oath-taking and questioning, that would be hearsay, admissible only under a list of exceptions. The exceptions to the hearsay rule are generally contexts in which language evidence is considered particularly likely to be accurate and truthful, such as a record kept in the ordinary course of business, or an emotional shout just after the crash (the idea being that you wouldn’t have time to think up a lie).
Survey evidence, then, is plainly hearsay, since when we hear claims based on survey evidence, we get no opportunity to judge the credibility of the statements as we might in conversation (much less under oath). However, survey evidence is often admissible in legal proceedings, particularly under the much-abused “state of mind” exception. But I think the more common reasoning underlying admission of survey evidence is as stated by a legal scholar at the beginning of the golden age of surveys (Zeisel 1959): “[S]ince surveys provide the best, if not the only, evidence on certain issues, and since expert knowledge in the field has advanced sufficiently to protect the trier of the facts from error, the law may well lower its heavy guard” (bolded emphasis mine). In other words, survey evidence is admissible because there’s no other way to get at the underlying facts. Consider trademark confusion: how would one measure whether consumers confuse one mark with another except by asking them in some clever way? The phenomenon of confusion is hidden in the minds of consumers, and can’t be measured with calipers or rulers.
Even when surveys are the only way to get at some particular knowledge, they may be done well or poorly. Zeisel (1959), citing Coca-Cola v. Nehi Corp., 27 Del. Ch. 318, 326, 36 A.2d 156 (1944), says:
Other aspects of an interview can also become grounds for criticism. Word association tests given to students in a classroom were rejected because their reactions were “bound to differ from that of the buyer in the market place when confronted with the ... beverage ....” As another court remarked, “the issue is not whether the goods would be confused by a casual observer, but [rather] ... by a prospective purchaser at the time he considered making the purchase. If the interviewee is not in a buying mood but is just in a friendly mood answering a pollster, his degree of attention is quite different.”
That is to say, even though a survey might be the only way to judge the phenomenon of confusion, a college classroom was judged to be sufficiently different from shopping in a store (e.g.) to render the survey meaningless. I find this standard touchingly exacting compared to the present lax standard for taking survey evidence seriously. The present standard seems to be that the more math you do to survey data, the more reliable it is.
The meaning of my title is from a joke told at the end of Annie Hall:
I thought of that old joke, you know, this guy goes to a psychiatrist and says, doc, my brother’s crazy! He thinks he’s a chicken. And the doc says, why don’t you turn him in? Then the guy says, I would, but I need the eggs. I guess that’s pretty much now how I feel about relationships. They’re totally crazy, irrational, and absurd, but I guess we keep going through it because most of us need the eggs.
Surveys are perhaps the only way to get certain information, information about the most important and pressing phenomena, about happiness and suffering in all its forms. These are eggs that most of us need. So even though surveys are bullshit, they are not “turned in” like the unfortunate brother in Woody Allen’s joke, but embraced in a plausibility structure whose maintenance is widespread and in which we are all complicit.
The Phenomenon of the Survey
To understand what’s wrong with surveys, we must alternate between critically examining survey instruments themselves, and inferring what we can about the context in which surveys are taken.
I want to go into some detail about the assumptions that the meaningfulness of surveys rests upon. When we say that some conclusion “rests upon” assumptions, the metaphor might be one of a house resting on a foundation, where if the foundation is not sound, a collapse may occur. But in this case it is more like the parts of a machine, which all must function. In the case of a steam engine, the water container must be sound and hold pressure without leaking, the piston must pop up and down, the wheel must spin freely, etc. If any part of the machine does not function, the machine does not function.
But in the case of surveys, even if all assumptions fail, if all the pieces of the machine fail to function, data is still produced. There is no collapse or apparent failure of the machinery. But the data produced are meaningless—perhaps unbeknownst to the audience, or even to the investigators. What follows is my attempt to identify the moving parts of survey meaningfulness, with some attention to how they interact. Keep in mind that all of these are based on an underlying assumption that there is no outright fraud—that data are gathered in the way stated, and not made up or altered, either by the researchers or by any of their subcontractors or employees.
Given innocent scientists, the first of these interacting pieces is attention. In order for the survey to be meaningful, the subjects must be paying attention to the items. This is the first indication that the relationship between survey giver and survey taker is an adversarial one. The vast majority of surveys are conducted online, often by respondents who are paid for their time. (Just to give you an idea, I happened to see figures in recent publications of fifty cents for a ten-minute task and $1.65 for fifteen minutes.) A best practices document (Peifer & Garrett, 2014) identifies the practice of “satisficing” — when survey respondents exert what they perceive to be the minimum of cognitive effort that will not result in a rejected contribution. The survey taker wishes to complete as many tasks as possible, and cannot be supposed to have any motivation to bring his full self to the task. There are several strategies to detect careless responses, such as including “trap questions” and asking the same question in multiple ways, on the assumption that the change in wording will not change the meaning in any way to a sincere survey taker. However, it is likely that savvy survey takers quickly detect these strategies and find ways to minimize cognitive effort despite their presence.
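The two screening strategies just described can be sketched in a few lines of Python. This is only an illustration, not any panel vendor's actual method: the field names (`trap`, `q_a`, `q_b`), the expected trap answer, and the tolerance of one scale point are all assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    respondent_id: str
    trap: int  # hypothetical item: "Select 2 to show you are reading"
    q_a: int   # an item answered on a 1-5 scale
    q_b: int   # the same item, reworded (assumed to mean the same thing)


TRAP_EXPECTED = 2     # assumed correct answer to the trap item
MAX_DISAGREEMENT = 1  # assumed tolerance between the reworded pair


def is_flagged(r: Response) -> bool:
    """Flag a response that misses the trap item or contradicts itself."""
    failed_trap = r.trap != TRAP_EXPECTED
    inconsistent = abs(r.q_a - r.q_b) > MAX_DISAGREEMENT
    return failed_trap or inconsistent


responses = [
    Response("r1", trap=2, q_a=4, q_b=4),  # passes both checks
    Response("r2", trap=5, q_a=3, q_b=3),  # misses the trap item
    Response("r3", trap=2, q_a=1, q_b=5),  # contradicts itself
]
flagged = [r.respondent_id for r in responses if is_flagged(r)]
print(flagged)  # ['r2', 'r3']
```

Note what the check cannot see: a respondent who has learned to spot the trap item and to answer the reworded pair consistently passes both tests while attending to nothing else.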
The second piece might be termed sincerity. It is this, I think, that is meant by the “lizardman constant” – a (facetiously) constant percentage of people who, asked if the world is run by lizardmen, will respond in the affirmative. In addition to a money motivation, survey takers may be motivated by having fun with the survey by purposely reporting wrong or absurd answers. To some degree, this is on a continuum with attention: answering a survey absurdly may be more fun, and hence require less cognitive effort, than answering it sincerely. The survey that found that gay teens are much more likely to be pregnant, or to get somebody pregnant, most likely reflects this phenomenon. Especially since surveys are generally anonymous, with the only identifying information supplied by the survey taker, it is generally impossible to check the degree of sincerity of survey respondents. Compare this to your personal life, in which you have a good idea about which of your friends is trolling at any given time, and who is unlikely to troll.
A related adversarial motivation is making a point. In the normal course of conversation, in ordinary language use, one forms opinions about why the speaker is saying what she is saying, and prepares a reply based as much on this as on the words actually said. In surveys, a survey taker may form an opinion about the hypothesis the instrument is investigating, and conform his answers to what he thinks is the right answer. It’s a bit subtle, but it’s easy to see in the communicative form of Twitter polls. When you see a poll, and your true answer doesn’t make your point as well as another answer, do you answer truthfully or try to make a point? What do you think all the other survey respondents are doing? This is not cheating except in a very narrow sense. This is ordinary language use: making guesses about the reasons underlying a communication, and communicating back with that information in mind. It’s the survey form that’s artificial, offered as if it can preclude this kind of communication. And even when a survey manages to hide its true hypothesis, survey takers still may be guessing other hypotheses, and responding based on factors other than their own innocent truths.
So we have attention, sincerity, and motivation as moving pieces, although they overlap messily with each other. The next piece can be summed up with the word comprehension, but this simple word hides a complex phenomenon that itself has many moving pieces.
Here are a couple of questions from a recent study. These purport to measure “Paranoid ideation” (quantified on a nine-point Likert scale, and yes this is the whole instrument):
Every day, our society becomes more lawless and bestial, a person’s chances of being robbed, assaulted and even murdered go up and up.
Although it may appear that things are constantly getting more dangerous and chaotic, it really isn’t so. Every era has its problems, and a person’s chances of living a safe, untroubled life are better today than ever before (reverse coded).
If you’re reading this, you probably slid right over the words “bestial,” “chaotic,” and “era.” But these are rather difficult words, and not everyone reads essays by bananas with words like “epistemology” and “indexicality” in them. You probably also slid right over the fact that the first item is not a grammatical sentence, but two sentences jammed together with a comma splice. Difficult words and baffling grammar are possible barriers to comprehension. But it is more complex than merely remembering to use simple language. Apparently simple words often mask complex structure and associations.
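For readers unfamiliar with the term, the “reverse coded” note on the second item above means that its score is flipped before the two items are combined. Here is a minimal sketch in Python, assuming the nine points are numbered 1 to 9 and that the two items are simply averaged. The study's actual scoring procedure is not given here, so both are assumptions.

```python
SCALE_MIN, SCALE_MAX = 1, 9  # assumed numbering of the nine-point scale


def reverse_code(score: int) -> int:
    """Flip a score so that 1 becomes 9, 2 becomes 8, and so on."""
    if not SCALE_MIN <= score <= SCALE_MAX:
        raise ValueError(f"score {score} is outside the scale")
    return SCALE_MIN + SCALE_MAX - score


def composite(item1: int, item2_reverse_keyed: int) -> float:
    """Average item 1 with the flipped item 2 (assumed scoring rule)."""
    return (item1 + reverse_code(item2_reverse_keyed)) / 2


# Strong agreement with item 1 plus strong disagreement with item 2
# produces the maximum score.
print(composite(9, 1))  # 9.0

# Strong agreement with both items lands exactly on the midpoint.
print(composite(9, 9))  # 5.0
```

Under these assumptions, a respondent who agrees with everything put in front of him is indistinguishable from one who holds a considered middling view.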
Consider another survey question, from the General Social Survey:
Taken all together, how would you say things are these days—would you say that you are very happy, pretty happy, or not too happy?
No one can accuse this of using big words (although my friend points out that “not too happy,” despite its colloquial meaning, is exactly as happy as you’d want to be). But in its simplicity, it exemplifies the complexity of the phenomenon of comprehension.
Consider what comprehension means here. It presumes first that the authors of the survey have encoded a meaning in the words, a meaning that the words will convey to the survey takers. More importantly, it presumes that this corresponds to “the real meaning” of the words—a meaning shared by the audience of the survey’s claims. What would the “real meaning” be in this very simple case? How are things these days? Are you very happy, pretty happy, or not too happy? What informs your choice? Would you have answered the same a month or a year ago? Fifteen minutes ago? How does your “pretty happy” compare with another person’s “pretty happy”? Happy compared to what? How would you predict that your family members would answer? Do they put a good face on things, or do they enjoy complaining? Would their answers correspond to how happy you think they really are? What about people from cultures you’re not familiar with? This is a three-point scale. Would you be able to notice a quarter of a point difference? What would that mean?
What underlying construct can we presume that the answer to this question gets at? If you ask enough people, will an underlying construct emerge where none existed before, due to the magic of the wisdom of crowds? This seems to me the mindset of divination. Divination (such as reading animal entrails) may actually be the ancient precursor to surveys that, in an earlier section, I complained didn’t exist. Divination practices are probably also a “most of us need the eggs” situation. National vibe checks on the General Social Survey might serve a similar purpose. Like traditional divination practices, they are guaranteed to produce an answer.
Comprehension is difficult enough in actual conversation, when mutual comprehension is a shared goal. Often people think they are talking about the same thing, and then find out that they meant two completely different things. A failure of comprehension can be discovered and repaired in conversation, can even be repaired as the reading of a text progresses, but it cannot be repaired in survey-taking. Data will be produced, whether they reflect the comprehension of a shared reality or not.
Here is another survey instrument:
It is not clear to me how much the content of these rather vague items reflects some underlying construct, existing in the world, of “conspiracy mentality,” versus other constructs. (I wonder if political historians score particularly high on the CMQ, the Conspiracy Mentality Questionnaire, regardless of their personal tendency to conspiratorial thinking.) But a particularly interesting aspect is the stacking of abstractions. For instance, question 2 uses the word “usually,” a word with a probabilistic meaning, and question 4 uses the word “often.” Stacked on top of these estimates is the one hundred point rating scale, which helpfully supplies word meanings for the percentages. Other surveys have examined how people map probability words onto a scale of percent likelihood, and they find varying spreads; it is unclear whether most survey respondents would interpret “somewhat unlikely” as “40%.” But survey respondents must complete the very difficult task of estimating the likelihood of whether something is true “often” or “usually.” Is it somewhat unlikely that something is usually the case?
What sense do the subjects make of the survey items? We cannot know. All we can know is the data they produce, regardless of whether any sense has been made.
Assuming that survey subjects are paying attention and being sincere (which we have no reason to assume), and assuming that they comprehend the true sense of each survey item as it corresponds to some part of reality (again dubious), they must also be accurate in their judgments and reporting for the data to be meaningful. The possibility of accuracy rests on there being some underlying reality against which to measure, which is debatable in the above cases. But for things like nutrition studies, in which food consumption is measured by a survey, there is some underlying reality (the food eaten), and respondents must be accurate in their recall and complete in their reporting for these surveys to be meaningful. If you have ever tried to keep track of your food intake, and if you can imagine people at all levels of competence and conscientiousness attempting the same task, you have some idea of the degree to which nutrition study responses correspond to reality. I searched for studies telling you that red meat is bad for you, and the first one I encountered was using a survey that asked participants to recall their diet over the past twelve months. The authors of that study used the portion sizes and frequencies that people could recall over the past year to calculate how many grams of meat they ate per day.
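The arithmetic behind such a grams-per-day figure is simple, which is part of why the result looks authoritative. Here is a rough sketch in Python, where the portion size and the frequency categories are invented for illustration and are not taken from the study in question.

```python
DAYS_PER_YEAR = 365

# Hypothetical frequency categories, converted to times per year.
TIMES_PER_YEAR = {
    "never": 0,
    "once a month": 12,
    "once a week": 52,
    "three times a week": 156,
    "daily": 365,
}


def grams_per_day(portion_grams: float, frequency: str) -> float:
    """Convert a recalled portion size and frequency into grams per day."""
    return portion_grams * TIMES_PER_YEAR[frequency] / DAYS_PER_YEAR


# The same assumed 150 g portion, recalled under two neighboring categories.
low = grams_per_day(150, "once a week")
high = grams_per_day(150, "three times a week")
print(round(low, 1), round(high, 1))  # 21.4 64.1
```

A respondent whose memory is off by one category shifts the computed intake by a factor of three, and nothing in the resulting number records that uncertainty.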
Of course, this is not the only problem with nutrition surveys. The more important issue is that answers to food survey questions are treated as quantitative data, and compared to various outcomes (many of which are also determined by surveys), with the assumption that differences in reported food consumption are causally related to health outcomes, while assiduously avoiding to the extent possible any consideration of factors that might confound the purported causal association. But it is important to keep in mind that even though there are many problems with nutritional analyses, they are generally based on survey data at their core.
I have presented the assumptions for survey meaningfulness in the order that seems most natural for understanding them, and finally come to what I regard as the most important factor, alluded to but not named in italics above. From comprehension and accuracy we can deduce the need for the existence of an underlying phenomenon that is being measured. In the case of nutrition surveys, there is a phenomenon (a fact of the matter) that could be gotten at by, say, constant surveillance. People really do eat certain foods in certain quantities, whether or not it has much effect on their health or mortality. But for the really important issues, things like happiness and depression and political beliefs, there may really be no phenomenon underlying the attempt at measurement. This is not to say that happiness is not real, in the sense that it is subjectively felt that things are going well or poorly. Emotion and feeling and belief are real, as we can all see by experiencing them from time to time. We may even communicate about them in conversation, and use the communication and all the context surrounding it to judge the emotions and beliefs of others. My claim is that there is no basis for believing that a shared phenomenon underlies the use of words like these across contexts, especially the anti-context of survey taking.
Biofluorescence is a hot phenomenon as I write. Museums and zoos and wildlife enthusiasts are shining ultraviolet light on their preserved pelts and living animals in the dark, and finding that such things as flying squirrels and rats and platypus and Tasmanian devils, not to mention frogs and spiders, absorb invisible (to human eyes) UV light and re-emit it at visible wavelengths. The fluorescence often forms attractive (to my eyes) patterns on the animals. These patterns were always there, in some sense, whether they are typically observed by some observer (a mate or predator, perhaps) or not. But now humans can see them, because they have a machine (darkness plus UV light source plus eyes and/or visible-light-sensitive camera) through which they can look.
I like the image of the science machine as a dark room with an electric hum, illuminating colorful crystalline patterns that exist but were previously invisible to the onlookers, who peer through a dark screen at the radiant geometries on display.
Now consider the happiness survey. There is a phenomenon that we can’t directly observe: how other people feel. The survey shines its light, in the form of questions in language, and there is a response, and there are any number of patterns to be seen in the data obtained. Does this mean that the patterns in the data are illuminating a pattern that exists separately in the world, like biofluorescent fur? Is there any such pattern at all to be illuminated? Or are they, perhaps, patterns of language use (happiness-survey-talk by real, rather than generic, people) rather than clues to the structure of the referents of language (happiness itself)?
It is one thing to insist that words mean things while arguing pedantically about some dubious word usage, but it would be exhibiting a high degree of temerity to go so far as to believe that words mean things, much less to rely on this belief in making deductions.
Here is, I think, the most fundamental reason that surveys are of no evidentiary value: either the phenomenon that the survey purports to measure has no existence independent of the survey, or the words used in the survey instrument do not correspond to the real underlying phenomenon they purport to measure. No amount of attention and accuracy and innocence of motivation can make up for that. And I do not think the problem will be solved by improving the survey instruments, any more than priming studies will be rendered meaningful by better primes.
Peifer, J., & Garrett, K. (2014). Best practices for working with opt-in online panels. Ohio State University: Columbus.
Groves, R. M. (2011). Three eras of survey research. Public Opinion Quarterly, 75(5), 861-871.
Zeisel, H. (1959). Uniqueness of survey evidence. Cornell Law Quarterly, 45, 322.