by a literal banana
As a banana who lives among humans, I am naturally interested in humans, and in the social sciences they use to study themselves. This essay is my current response to the Thiel question: “What important truth do very few people agree with you on?” And my answer is that surveys are bullshit.
In the abstract, I think a lot of people would agree with me that surveys are bullshit. What I don’t think is widely known is how much “knowledge” is based on survey evidence, and what poor evidence it makes in the contexts in which it is used. The nutrition study that claims that eating hot chili peppers makes you live longer is based on surveys. The twin study about the heritability of joining a gang or carrying a gun is based on surveys of young people. The economics study claiming that long commutes reduce happiness is based on surveys, as are all studies of happiness, like the one that claims that people without a college degree are much less happy than they were in the 1970s. The study that claims that pornography is a substitute for marriage is based on surveys. That criminology statistic about domestic violence or sexual assault or drug use or the association of crime with personality factors is almost certainly based on surveys. (Violent crime studies and statistics are particularly likely to be based on extremely cursed instruments, especially the Conflict Tactics Scale, the Sexual Experiences Survey, and their descendants.) Medical studies of pain and fatigue rely on surveys. Almost every study of a psychiatric condition is based on surveys, even if an expert interviewer is taking the survey on the subject’s behalf (e.g. the Hamilton Depression Rating Scale). Many studies that purport to be about suicide are actually based on surveys of suicidal thoughts or behaviors. In the field of political science, election polls and elections themselves are surveys.
What I mean by “surveys” is standard written (or spoken) instruments, composed mostly of language, that are administered to subjects, who give responses, and whose responses are treated as quantitative information, which may then be subjected to statistical analysis. It is not the case that knowledge can never be obtained in this manner. But the idea that there exists some survey, and some survey conditions, that might plausibly produce the knowledge claimed, tends to lead to a mental process of filling in the blanks, of giving the benefit of the doubt to surveys in the ordinary case. But, I think, the ordinary survey, in its ordinary conditions, is of no evidentiary value for any important claim. Just because there exist rare conditions where survey responses tightly map to some condition measurable in other ways does not mean that the vast majority of surveys have any value.
Survey evidence seems to be a new phenomenon. Robert Groves (2011) argues that it is a 20th century phenomenon, arising in the 1930s, achieving a golden age from the 1960s to the 1990s, and then falling off in prestige and reliability after that.
Why is it important that surveys are new? I think it is important to remember that there is no ancestral practice equivalent to surveys. That is to say, there is no ancient human practice or context in which people anonymously tell the pure, innocent truth with language, in response to questioning, with no thought for the motives of the questioner or the effect of their answers. However, in the new, wholly invented ethnomethod of [doing a survey], it is imagined that subjects do tell the innocent truth, comprehending the underlying sense of the question but not answering with any motive or particularity of context. The anonymity of survey takers is given as proof that they feel free to tell the truth, rather than being perceived as a bar to asking them what they might have meant by their responses.
To get at the philosophical weirdness of the survey, it is necessary to dissect the phenomenon of survey-taking in detail. First I will consider the legal perspective, since that is an ancient domain of getting at truth through language in particular contexts.
Surveys and Legal Evidence
The vast majority of evidence in the legal context is and has always been testimonial. That is, a witness testifies in language to communicate some fact to the judge or jury, and the judge or jury then decides how much to believe it. This is true even for modern DNA evidence: the expert witness testifies about the alleged meaning of the laboratory findings, even if documents (also testimony of the writer or preparer) are also given to the trier of fact to examine. Good evidence and garbage evidence are both usually testimony.
In the English common law tradition, the biggest rule about testimonial evidence is hearsay. To put it colloquially, the hearsay rule says that testimony has to come from the horse’s mouth. If you saw someone run a red light and crash into a fire truck, you could undergo the ritual of being put under oath (agreeing that negative legal consequences could befall you for untrue speech), and testify about what you saw in court. Then you would be subject to the further ritual of cross-examination, in which all the aspects of your testimony could be questioned: your vision, whether you had your glasses on, where you were positioned, what specifically you mean by “crashed,” your relationship with the driver, and perhaps even whether you had a past conviction for forgery, which might make you seem like a liar in general. Those responsible for deciding whether to believe your testimony would have the chance to look at your face and mannerisms while testifying, to look at your clothes and hair and grooming, to see whether your eyes are bloodshot, in order to judge your responses to questioning. This may seem superficial and unfortunate, but in conversation we make these judgments all the time. And a witness may lose credibility for appearing too slick as often as for appearing too tattered, as in ordinary life.
However, if you wanted to testify about what you heard someone else say, someone who isn’t present for ritual oath-taking and questioning, that would be hearsay, admissible only under a list of exceptions. The exceptions to the hearsay rule are generally contexts in which language evidence is considered particularly likely to be accurate and truthful, such as a record kept in the ordinary course of business, or an emotional shout just after the crash (the idea being that you wouldn’t have time to think up a lie).
Survey evidence, then, is plainly hearsay, since when we hear claims based on survey evidence, we get no opportunity to judge the credibility of the statements as we might in conversation (much less under oath). However, survey evidence is often admissible in legal proceedings, particularly under the much-abused “state of mind” exception. But I think the more common reasoning underlying admission of survey evidence is as stated by a legal scholar at the beginning of the golden age of surveys (Zeisel 1959): “[S]ince surveys provide the best, if not the only, evidence on certain issues, and since expert knowledge in the field has advanced sufficiently to protect the trier of the facts from error, the law may well lower its heavy guard” (bolded emphasis mine). In other words, survey evidence is admissible because there’s no other way to get at the underlying facts. Consider trademark confusion: how would one measure whether consumers confuse one mark with another except by asking them in some clever way? The phenomenon of confusion is hidden in the minds of consumers, and can’t be measured with calipers or rulers.
Even when surveys are the only way to get at some particular knowledge, they may be done well or poorly. Zeisel (1959), citing Coca-Cola v. Nehi Corp., 27 Del. Ch. 318, 326, 36 A.2d 156 (1944), says:
Other aspects of an interview can also become grounds for criticism. Word association tests given to students in a classroom were rejected because their reactions were “bound to differ from that of the buyer in the market place when confronted with the … beverage.” As another court remarked, “the issue is not whether the goods would be confused by a casual observer, but [rather] … by a prospective purchaser at the time he considered making the purchase. If the interviewee is not in a buying mood but is just in a friendly mood answering a pollster, his degree of attention is quite different.”
That is to say, even though a survey might be the only way to judge the phenomenon of confusion, a college classroom was judged to be sufficiently different from shopping in a store (e.g.) to render the survey meaningless. I find this standard touchingly exacting compared to the present lax standard for taking survey evidence seriously. The present standard seems to be that the more math you do to survey data, the more reliable it is.
The meaning of my title is from a joke told at the end of Annie Hall:
I thought of that old joke—you know, this guy goes to a psychiatrist and says, doc, my brother’s crazy! He thinks he’s a chicken. And the doc says, why don’t you turn him in? Then the guy says, I would, but I need the eggs. I guess that’s pretty much now how I feel about relationships. They’re totally crazy, irrational, and absurd, but I guess we keep going through it because most of us need the eggs.
Surveys are perhaps the only way to get certain information, information about the most important and pressing phenomena, about happiness and suffering in all its forms. These are eggs that most of us need. So even though surveys are bullshit, they are not “turned in” like the unfortunate brother in Woody Allen’s joke, but embraced in a plausibility structure whose maintenance is widespread and in which we are all complicit.
The Phenomenon of the Survey
To understand what’s wrong with surveys, we must alternate between critically examining survey instruments themselves, and inferring what we can about the context in which surveys are taken.
I want to go into some detail about the assumptions that the meaningfulness of surveys rests upon. When we say that some conclusion “rests upon” assumptions, the metaphor might be one of a house resting on a foundation, where if the foundation is not sound, a collapse may occur. But in this case it is more like the parts of a machine, which all must function. In the case of a steam engine, the water container must be sound and hold pressure without leaking, the piston must pop up and down, the wheel must spin freely, etc. If any part of the machine does not function, the machine does not function.
But in the case of surveys, even if all assumptions fail, if all the pieces of the machine fail to function, data is still produced. There is no collapse or apparent failure of the machinery. But the data produced are meaningless—perhaps unbeknownst to the audience, or even to the investigators. What follows is my attempt to identify the moving parts of survey meaningfulness, with some attention to how they interact. Keep in mind that all of these are based on an underlying assumption that there is no outright fraud—that data are gathered in the way stated, and not made up or altered, either by the researchers or by any of their subcontractors or employees.
Given innocent scientists, the first of these interacting pieces is attention. In order for the survey to be meaningful, the subjects must be paying attention to the items. This is the first indication that the relationship between survey giver and survey taker is an adversarial one. The vast majority of surveys are conducted online, often by respondents who are paid for their time. (Just to give you an idea, I happened to see figures in recent publications of fifty cents for a ten-minute task and $1.65 for fifteen minutes.) A best practices document (Peifer & Garrett, 2014) identifies the practice of “satisficing” — when survey respondents exert what they perceive to be the minimum of cognitive effort that will not result in a rejected contribution. The survey taker wishes to complete as many tasks as possible, and cannot be supposed to have any motivation to bring his full self to the task. There are several strategies to detect careless responses, such as including “trap questions” and asking the same question in multiple ways, on the assumption that the change in wording will not change the meaning in any way to a sincere survey taker. However, it is likely that savvy survey takers quickly detect these strategies and find ways to minimize cognitive effort despite their presence.
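To illustrate the repeated-item strategy, here is a minimal Python sketch of how an inconsistency check between a question and its reworded duplicate might be implemented. The item names, scale, and response data are all invented for illustration:

```python
# Hypothetical responses (1-7 agreement scale) to an item and a
# reworded duplicate that a sincere respondent should answer similarly.
responses = [
    {"id": 1, "item": 6, "item_reworded": 6},  # consistent
    {"id": 2, "item": 2, "item_reworded": 7},  # large swing: flagged
    {"id": 3, "item": 4, "item_reworded": 5},  # close enough
]

def flag_careless(rows, max_gap=2):
    """Return ids of respondents whose paired answers diverge too much."""
    return [r["id"] for r in rows
            if abs(r["item"] - r["item_reworded"]) > max_gap]

print(flag_careless(responses))  # [2]
```

Of course, as noted above, a savvy satisficer who notices the duplicated item can pass this check while still exerting minimal effort.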
The second piece might be termed sincerity. It is this, I think, that is meant by the “lizardman constant” – a (facetiously) constant percentage of people who, asked if the world is run by lizardmen, will respond in the affirmative. In addition to a money motivation, survey takers may be motivated by having fun with the survey by purposely reporting wrong or absurd answers. To some degree, this is on a continuum with attention: answering a survey absurdly may be more fun, and hence require less cognitive effort, than answering it sincerely. The survey that found that gay teens are much more likely to be pregnant, or to get somebody pregnant, most likely reflects this phenomenon. Especially since surveys are generally anonymous, with the only identifying information supplied by the survey taker, it is generally impossible to check the degree of sincerity of survey respondents. Compare this to your personal life, in which you have a good idea about which of your friends is trolling at any given time, and who is unlikely to troll.
A related adversarial motivation is making a point. In the normal course of conversation, in ordinary language use, one forms opinions about why the speaker is saying what she is saying, and prepares a reply based as much on this as on the words actually said. In surveys, survey takers may form an opinion about the hypothesis the instrument is investigating, and conform their answers to what they think is the right answer. It’s a bit subtle, but it’s easy to see in the communicative form of twitter polls. When you see a poll, and your true answer doesn’t make your point as well as another answer, do you answer truthfully or try to make a point? What do you think all the other survey respondents are doing? This is not cheating except in a very narrow sense. This is ordinary language use—making guesses about the reasons underlying a communication, and communicating back with that information in mind. It’s the survey form that’s artificial, offered as if it can preclude this kind of communication. And even when a survey manages to hide its true hypothesis, survey takers still may be guessing other hypotheses, and responding based on factors other than their own innocent truths.
So we have attention, sincerity, and motivation as moving pieces, although they overlap messily with each other. The next piece can be summed up with the word comprehension, but this simple word hides a complex phenomenon that itself has many moving pieces.
Here are a couple of questions from a recent study. These purport to measure “Paranoid ideation” (quantified on a nine-point Likert scale, and yes this is the whole instrument):
Every day, our society becomes more lawless and bestial, a person’s chances of being robbed, assaulted and even murdered go up and up.
Although it may appear that things are constantly getting more dangerous and chaotic, it really isn’t so. Every era has its problems, and a person’s chances of living a safe, untroubled life are better today than ever before (reverse coded).
If you’re reading this, you probably slid right over the words “bestial,” “chaotic,” and “era.” But these are rather difficult words, and not everyone reads essays by bananas with words like “epistemology” and “indexicality” in them. You probably also slid right over the fact that the first item is not a grammatical sentence, but two sentences jammed together with a comma splice. Difficult words and baffling grammar are possible barriers to comprehension. But it is more complex than merely remembering to use simple language. Apparently simple words often mask complex structure and associations.
Consider another survey question, from the General Social Survey:
Taken all together, how would you say things are these days—would you say that you are very happy, pretty happy, or not too happy?
No one can accuse this of using big words (although my friend points out that “not too happy,” despite its colloquial meaning, is exactly as happy as you’d want to be). But in its simplicity, it exemplifies the complexity of the phenomenon of comprehension.
Consider what comprehension means here. It presumes first that the authors of the survey have encoded a meaning in the words, a meaning that the words will convey to the survey takers. More importantly, it presumes that this corresponds to “the real meaning” of the words—a meaning shared by the audience of the survey’s claims. What would the “real meaning” be in this very simple case? How are things these days? Are you very happy, pretty happy, or not too happy? What informs your choice? Would you have answered the same a month or a year ago? Fifteen minutes ago? How does your “pretty happy” compare with another person’s “pretty happy”? Happy compared to what? How would you predict that your family members would answer? Do they put a good face on things, or do they enjoy complaining? Would their answers correspond to how happy you think they really are? What about people from cultures you’re not familiar with? This is a three-point scale. Would you be able to notice a quarter of a point difference? What would that mean?
What underlying construct can we presume that the answer to this question gets at? If you ask enough people, will an underlying construct emerge where none existed before due to the magic of the wisdom of crowds? This seems to me the mindset of divination. Divination (such as reading animal entrails) may actually be the ancient precursor to surveys that I was complaining didn’t exist in an earlier section. Divination practices are probably also a “most of us need the eggs” situation. National vibe checks on the General Social Survey might serve a similar purpose. Like traditional divination practices, the survey is guaranteed to produce an answer.
Comprehension is difficult enough in actual conversation, when mutual comprehension is a shared goal. Often people think they are talking about the same thing, and then find out that they meant two completely different things. A failure of comprehension can be discovered and repaired in conversation, can even be repaired as the reading of a text progresses, but it cannot be repaired in survey-taking. Data will be produced, whether they reflect the comprehension of a shared reality or not.
Here is another survey instrument, the Conspiracy Mentality Questionnaire (CMQ):
It is not clear to me how much the content of these rather vague items reflects some underlying construct, existing in the world, of “conspiracy mentality,” versus other constructs. (I wonder if political historians score particularly high on the CMQ regardless of their personal tendency to conspiratorial thinking.) But a particularly interesting aspect is the stacking of abstractions. For instance, question 2 uses the word “usually,” a word with a probabilistic meaning, and question 4 uses the word “often.” Stacked on top of these estimates is the one hundred point rating scale, which helpfully supplies word meanings for the percentages. Other surveys have asked people to place probability words on a scale of percent likelihood, and the spreads they find vary widely; it is unclear whether most survey respondents would interpret “somewhat unlikely” as “40%.” Yet survey respondents must complete the very difficult task of estimating whether something is true “often” or “usually.” Is it somewhat unlikely that something is usually the case?
What sense do the subjects make of the survey items? We cannot know. All we can know is the data they produce, regardless of whether any sense has been made.
Assuming that survey subjects are paying attention and being sincere (which we have no reason to assume), and assuming that they comprehend the true sense of each survey item as it corresponds to some part of reality (again dubious), they must also be accurate in their judgments and reporting for the data to be meaningful. The possibility of accuracy rests on there being some underlying reality against which to measure, which is debatable in the above cases. But for things like nutrition studies, in which food consumption is measured by a survey, there is some underlying reality (the food eaten), and respondents must be accurate in their recall and complete in their reporting for these surveys to be meaningful. If you have ever tried to keep track of your food intake, and if you can imagine people at all levels of competence and conscientiousness attempting the same task, you have some idea of the degree to which nutrition study responses correspond to reality. I searched for studies telling you that red meat is bad for you, and the first one I encountered used a survey that asked participants to recall their diet over the past twelve months. The authors of that study used the portion sizes and frequencies that people could recall over the past year to calculate how many grams of meat they ate per day.
Of course, this is not the only problem with nutrition surveys. The more important issue is that answers to food survey questions are treated as quantitative data, and compared to various outcomes (many of which are also determined by surveys), with the assumption that differences in reported food consumption are causally related to health outcomes, while assiduously avoiding, to the extent possible, any consideration of factors that might confound the purported causal association. But it is important to keep in mind that even though there are many problems with nutritional analyses, they are generally based on survey data at their core.
I have presented the assumptions for survey meaningfulness in the order that seems most natural for understanding them, and finally come to what I regard as the most important factor, alluded to but not named in italics above. From comprehension and accuracy we can deduce the need for the existence of an underlying phenomenon that is being measured. In the case of nutrition surveys, there is a phenomenon—a fact of the matter—that could be gotten at by, say, constant surveillance. People really do eat certain foods in certain quantities, whether or not it has much effect on their health or mortality. But for the really important issues, things like happiness and depression and political beliefs, there may really be no phenomenon underlying the attempt at measurement. This is not to say that happiness is not real, in the sense that it is subjectively felt that things are going well or poorly. Emotion and feeling and belief are real, as we can all see by experiencing them from time to time. We may even communicate about them in conversation, and use the communication and all the context surrounding it to judge the emotions and beliefs of others. My claim is that there is no basis for believing that a shared phenomenon underlies the use of words like these across contexts, especially the anti-context of survey taking.
Biofluorescence is a hot phenomenon as I write. Museums and zoos and wildlife enthusiasts are shining ultraviolet light on their preserved pelts and living animals in the dark, and finding that such things as flying squirrels and rats and platypus and Tasmanian devils, not to mention frogs and spiders, absorb invisible (to human eyes) UV light and re-emit it in the visible wavelength. The fluorescence often forms attractive (to my eyes) patterns on the animals. These patterns were always there, in some sense, whether they are typically observed by some observer (a mate or predator, perhaps) or not. But now humans can see them, because they have a machine (darkness plus UV light source plus eyes and/or visible-light-sensitive camera) through which they can look.
I like the image of the science machine as a dark room with an electric hum, illuminating colorful crystalline patterns that exist but were previously invisible to the onlookers, who peer through a dark screen at the radiant geometries on display.
Now consider the happiness survey. There is a phenomenon that we can’t directly observe: how other people feel. The survey shines its light, in the form of questions in language, and there is a response, and there are any number of patterns to be seen in the data obtained. Does this mean that the patterns in the data are illuminating a pattern that exists separately in the world, like biofluorescent fur? Is there any such pattern at all to be illuminated? Or are they, perhaps, patterns of language use (happiness-survey-talk by real, rather than generic, people) rather than clues to the structure of the referents of language (happiness itself)?
It is one thing to insist that words mean things while arguing pedantically about some dubious word usage, but it would be exhibiting a high degree of temerity to go so far as to believe that words mean things, much less to rely on this belief in making deductions.
Here is, I think, the most fundamental reason that surveys are of no evidentiary value: the phenomenon that the survey purports to measure either has no existence independent of the survey, or the words used in the survey instrument do not correspond to the real underlying phenomenon they purport to measure. No amount of attention and accuracy and innocence of motivation can make up for that. And I do not think the problem will be solved by improving the survey instruments, any more than priming studies will be rendered meaningful by better primes.
Peifer, J., & Garrett, K. (2014). Best practices for working with opt-in online panels. Ohio State University: Columbus.
Groves, R. M. (2011). Three eras of survey research. Public Opinion Quarterly, 75(5), 861–871.
Zeisel, H. (1959). The uniqueness of survey evidence. Cornell Law Quarterly, 45, 322.
14 thoughts on “Survey Chicken”
I was sympathetic to your point until you got to the happiness survey. Because the answer to “is there an underlying construct revealed with lots of data” is “yes, obviously”. All the factors you mention which make the data unreliable for one individual are noise, not bias; there is no reason to expect systematic directionality to the deviations from honesty they produce. So averaging large datasets will remove noise and get back to honest answers.
That the one detailed example is, to be blunt, nonsense, makes me doubt the rest as well.
“All the factors you mention which make the data unreliable for one individual are noise, not bias; there is no reason to expect systematic directionality to the deviations from honesty they produce. So averaging large datasets will remove noise and get back to honest answers.”
Not if most people are more alike than not. I’m not convinced one way or the other here, but I think your argument requires the belief that most people will have a core, average response that in some way represents the “truth” (which the article asserts cannot actually be empirically expressed anyway).
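The disagreement here turns on whether survey distortions are zero-mean noise or systematic bias. A small Python simulation (all numbers invented for illustration) shows why the distinction matters: averaging shrinks noise, but preserves any shared directional shading of answers:

```python
import random

random.seed(0)

true_value = 3.0   # hypothetical "true" average happiness
n = 100_000

# Zero-mean noise: deviations cancel, so the sample mean recovers 3.0.
noisy = [true_value + random.gauss(0, 1) for _ in range(n)]

# Systematic bias: everyone shades upward by 0.5, and no amount of
# averaging removes it; the sample mean converges to 3.5, not 3.0.
biased = [true_value + 0.5 + random.gauss(0, 1) for _ in range(n)]

print(round(sum(noisy) / n, 1))   # ≈ 3.0
print(round(sum(biased) / n, 1))  # ≈ 3.5
```

Averaging helps only under the first model; whether real respondents behave like the first or the second is exactly what is in dispute.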
What does it mean to say there is a real phenomenon of SOCIETAL happiness?
We can agree that people say things about their happiness, that’s uncontroversial.
We can agree, for the sake of argument, that people have beliefs about their happiness.
Let’s even agree, for the sake of argument, that people “have” individual happiness.
What does all of that mean for any sort of AGGREGATION of happiness levels? Is happiness level a number? Does my level 3 mean the same thing as your level 3? How would either of us know? Hell, does your happiness level of 3 five years ago mean a happiness level of 3 today? How would you know?
It gets worse. The very act of trying to collapse a set of survey data down to a single number is usually meaningless.
Suppose the survey qualified happiness levels as High, Medium, Low. OK, we could then aggregate those to a histogram.
BUT (and this is important) there is no way to extract from that histogram a single number of “mean happiness” because high, medium, and low are not numbers, they are not the sorts of things that can be averaged. Pretending that they are numbers by assigning a value of 5 to high, 3 to medium, 1 to low, does not magically change this fact; all it shows is that you are innumerate. Why were those values (5, 3, 1) the correct numbers to assign to each happiness level? Why not 15, 3 and 1? What does it mean to say that I (high happiness) am five times as happy as you (low happiness)?
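To make the arbitrariness concrete, here is a small Python sketch (with invented response counts) in which the choice of numeric coding reverses which of two groups appears “happier on average”:

```python
# Two hypothetical groups of ten respondents on a High/Medium/Low scale.
group_a = ["med"] * 10                # everyone in the middle
group_b = ["high"] * 4 + ["low"] * 6  # a polarized group

def mean_score(responses, coding):
    """Average the responses under a given label-to-number assignment."""
    return sum(coding[r] for r in responses) / len(responses)

coding_1 = {"high": 5, "med": 3, "low": 1}   # one arbitrary coding
coding_2 = {"high": 15, "med": 3, "low": 1}  # another, equally arbitrary

print(mean_score(group_a, coding_1), mean_score(group_b, coding_1))  # 3.0 2.6
print(mean_score(group_a, coding_2), mean_score(group_b, coding_2))  # 3.0 6.6
```

Under the first coding, group A looks happier; under the second, group B does. Nothing in the data privileges either assignment.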
Numbers have meanings. If you can’t quantify something, you can’t quantify it. That doesn’t mean you can’t study it; but it does mean that anything you claim to see in your data as a consequence of that quantification is meaningless.
If there’s any difference between mathematicians (and math-adjacent folks like physicists) and the rest of society, it’s that the mathematicians understand exactly where the limits of numbers lie; the rest of society treats numbers as some sort of magic and fantasizes that once you’ve invoked numbers (regardless of the details) the spell has been cast, the magic will be invoked.
Condorcet (an actual mathematician) understood that even something as simple as “the will of the majority” is a meaningless concept, the result you get depends on the rules of your election, with the Condorcet Paradox being a stark reminder of this point. And yet something like voting, where the underlying value (do I prefer X to Y) is well-defined, is a vastly simpler aggregation problem than aggregating an undefined underlying value (what’s my happiness level)…
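The Condorcet Paradox is easy to exhibit directly. Here is a minimal Python sketch using the classic three-voter profile, in which every pairwise question has a clear majority answer, yet the majorities form a cycle:

```python
# Three voters, each listing candidates from most to least preferred.
voters = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def majority_prefers(x, y):
    """True if a majority of voters rank x above y."""
    wins = sum(v.index(x) < v.index(y) for v in voters)
    return wins > len(voters) / 2

# The pairwise majorities are cyclic: A beats B, B beats C, C beats A,
# so there is no candidate who is "the will of the majority".
print(majority_prefers("A", "B"))  # True
print(majority_prefers("B", "C"))  # True
print(majority_prefers("C", "A"))  # True
```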
You can’t just average large datasets to remove the noise in happiness.
“We show that none of the findings can be obtained relying only on nonparametric identification. The findings in the literature are highly dependent on one’s beliefs about the underlying distribution of happiness in society, or the social welfare function one chooses to adopt. Furthermore, any conclusions reached from these parametric approaches rely on the assumption that all individuals report their happiness in the same way. When the data permit, we test for equal reporting functions, conditional on the existence of a common cardinalization from the normal family. We reject this assumption in all cases in which we test it.”
PDV objects that the underlying construct measured by the happiness survey is obvious. If it’s so obvious, what is it?
Not the claim. It is obvious that some construct exists. What that construct is need not be obvious.
There’s a book called How Emotions Are Made which starts off with the discovery that people mean very different things when they’re using the same words for emotions– “anger” is a different experience from one person to another.
I am afraid you are missing an entire, huge, decades old literature on survey methodology here. Survey methodologists are acutely aware of the limitations, measurement issues, and potential biases. The main reason why even scientific (“good”) surveys occasionally ask bad questions is that the “substantive” research behind those questions does not listen to the survey experts.
@hjuerges – that is a fair point better made in response to a different essay. Our beloved banana is saying “surveys are bullshit,” not “it is impossible to construct a survey that isn’t bullshit.” Actual real surveys exist in the world, and they are constructed, as you say, without listening to survey experts. So they are bullshit.
What is your opinion on psychophysical surveys? I believe pitch, loudness and other perceptual scales are created via surveys. Those scales proved very useful, for example, for audio encoding. Would you say those surveys are bullshit as well?
“Here is, I think, the most fundamental reason that surveys are of no evidentiary value: the phenomenon that the survey purports to measure either has no existence independent of the survey”
I find this very confused.
Psychology surveys measure behaviours, because psychology is a study of behaviours.
“How often do you cry? Once a day, once a week, etc.” is a measure of a behaviour. It can be compared with convergent measures of that behaviour, such as installation of a tear-duct measuring apparatus. It can be an imperfect measurement tool, but it does have evidentiary value.
Instead the “bullshit” seems to be really about the ontological status of constructs underlying the behaviours (which would include thoughts and feelings etc…) that the survey measures.
Banana appears to hold the view that non-physical things, things that aren’t physical properties like bioluminescent fur, aren’t “real”. So this seems like a positivist epistemology of science. But there is not much of an argument here – just a statement of a particular belief of the banana.
Philosophers of science have been thinking about this for a while though, so it might be worth checking in on them to see what they say about what is real and what is not real and how we can tell the difference.
Because a lot of our private work is under NDA and because I want to maintain a thin veneer of info-sec, I’ll be replacing some details with plausible parallel fiction.
Pertaining to the article: https://carcinisation.com/2020/12/11/survey-chicken/
On surveys being used legally: I’ve twice prepared to testify in court as an expert about our survey methods and what the stats mean, but in both cases they didn’t end up calling me or swearing me in, even though I wore my good tie. In one case this was annoying because they asked my (non-statistician) boss questions about, for example, what a p-value was, and the answer was adjacent to correct but not technically the definition, so I was jonesing (as I always am!) to explain it properly.
On trademark confusion surveys: We did one of those once. Because it was going to be used in an adversarial process, our goal was to set it up in such a way that nothing could be pointed at as biased in favor of our client, but as you mentioned we were hamstrung by factors relating to the realism of the scenarios we’d be replicating, because in real life the brand and the one it was supposedly infringing on would basically never show up in the same context. Imagine the original suing brand sold luxury watches, and the defendant had a similar name but sold tree-stands for hunters. As the person creating the instrument to measure brand confusion, I found that the extremely different contexts in which each product was sold and advertised, while logically beneficial to the defendant’s case, were an impediment to coming up with a good way to measure confusion parallel to a real-world experience. The absurdity of the claim made it harder to defend against! Ultimately, the project fell through when the client got caught up in an enormous legal scandal on a different issue entirely and settled.
On attention and paid online surveys: It is generally understood in the industry (although minimized by some) that online surveys, and panels in particular (where people are paid to participate), tend to give worse data. The worst is when a survey is open to the public for a monetary incentive, which is when you draw the highest proportion of people who are straight-lining answers (just quickly selecting the first answer available, repeatedly) and not thinking about things. Some straight-liners can be eliminated in the data-cleaning step, but that’s a one-sided arms race where paying a bit more attention puts you over the edge into plausible deniability. The best way to eliminate the profit-motive people is not to put a survey up to be taken, but instead to reach out to relevant people to take it. Of course this has its own ethical issues (“don’t be annoying”), so it’s typically only employed when we can get a list of people who could plausibly be interested in sharing their opinion. When we do a survey of parents for a school district, we get a this-project-only list of parents to email. In the private sphere, it might be a list of people who run businesses using the client’s software, and so on. Telephone is the tool of choice for general-public stuff, and is basically always preferred aside from cost concerns.
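For what it’s worth, the straight-liner cut in the data-cleaning step can be as crude as flagging respondents whose answers barely vary across a block of items. A minimal sketch in Python, with invented column names and data (an illustration, not anyone’s production cleaning code):

```python
# Hypothetical straight-lining filter: flag respondents whose answers to a
# block of Likert items show (almost) no variation. Columns are invented.
import pandas as pd

LIKERT_ITEMS = ["q1", "q2", "q3", "q4", "q5", "q6", "q7"]

def flag_straightliners(df: pd.DataFrame, max_distinct: int = 1) -> pd.Series:
    """True for rows that used at most `max_distinct` distinct answer
    values across the whole item block."""
    return df[LIKERT_ITEMS].nunique(axis=1) <= max_distinct

responses = pd.DataFrame({
    "q1": [5, 1, 4], "q2": [5, 2, 4], "q3": [5, 1, 3],
    "q4": [5, 3, 4], "q5": [5, 1, 5], "q6": [5, 2, 4], "q7": [5, 1, 4],
})
print(flag_straightliners(responses).tolist())  # → [True, False, False]
```

The one-sided arms race is visible right in the threshold: loosen `max_distinct` and the sincere-but-consistent respondents get swept up; tighten it and the slightly-more-attentive straight-liners slip through.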
On the subject of “traps,” things get a bit tricky. It is my general finding that a person’s answers within a survey are not necessarily consistent even with themselves, though expected connections tend to correlate very strongly. If a person is asked to rate seven different factors in why they love public libraries, and then asked to pick which is most important, it will be MORE common that the person rated the most-important factor highest of the seven, but it is far from certain even among people paying a reasonable amount of attention. Unless it’s an online or on-paper survey where, on one page, their ratings are set up immediately next to each other and they pick the most important, there isn’t a consistent level of comparison between the variables held in the average respondent’s mind. Inconsistency can be used in combination with other factors to make a judgment that a person’s data lacks sincerity, but by itself it’s unfortunately not enough.
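To make the library example concrete, here is a toy version of that consistency check, with invented factor names and ratings; a real cleaning pass would combine this signal with others before judging anyone insincere:

```python
# Did the respondent's "most important" pick also receive their highest
# rating among the seven factors? Factor names and scores are invented.
FACTORS = ["hours", "staff", "collection", "programs",
           "location", "quiet", "computers"]

def pick_matches_ratings(ratings: dict, most_important: str) -> bool:
    """True if the chosen factor is (tied for) the highest-rated one."""
    top = max(ratings.values())
    return ratings[most_important] == top

ratings = {"hours": 9, "staff": 7, "collection": 9, "programs": 5,
           "location": 8, "quiet": 4, "computers": 6}
print(pick_matches_ratings(ratings, "staff"))  # → False
print(pick_matches_ratings(ratings, "hours"))  # → True
```

Note that ties (here, “hours” and “collection” both at 9) count as consistent, which matches the spirit of the check: you only want to flag picks the respondent themselves rated clearly lower.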
On sincerity: The Lizardman’s Constant is real; some people take a survey as a consequence-free opportunity to be goofy. Sometimes people will start a survey goofy, then hit a question that genuinely interests them and give a seemingly sincere and detailed open-end response. And any time we run an open-to-the-public online survey, we also, almost without exception, have at least one person (perhaps the same set of people) use the opportunity to shout into the void, typing racial slurs into every field that allows a typed response.
On comprehension: One thing that helps to minimize this problem is the inclusion of open-end questions, where after asking for a rating or a bounded answer, we then ask them to explain why. The open ends are typically not used in the quantitative process itself afterwards (and are thus a bit out of the scope of your article), but when there’s a disconnect between the two types of answers it offers an opportunity to correct confusion, either by the interviewer in the moment for phone surveys, or by the data cleaners later on. A lot of the time the hope is that the natural noise of confusion is evenly split and non-biasing, but that’s really just hope.
On interpretation of scales: Fun fact! For a lot of surveys where we want to look at man/woman differences, we have to normalize each group’s scores, because, for instance, women will report being more “worried” about every single issue compared to men, or say that every issue is more important, etc. While this overall difference is always reported, it can mask relative differences in how each group responds to the issues themselves.
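The normalization itself can be as simple as z-scoring within each group, so that only relative differences survive the comparison. A minimal sketch with made-up data (the group labels and the “worry” item are invented):

```python
# Within-group normalization: z-score each group's ratings separately so a
# group that rates everything higher can still be compared on relative terms.
import pandas as pd

df = pd.DataFrame({
    "group": ["m", "m", "m", "w", "w", "w"],
    "worry": [3.0, 4.0, 5.0, 6.0, 7.0, 8.0],  # invented raw scores
})

def zscore_within_group(df: pd.DataFrame, value_col: str,
                        group_col: str = "group") -> pd.Series:
    """Subtract each group's mean and divide by its (sample) std."""
    g = df.groupby(group_col)[value_col]
    return (df[value_col] - g.transform("mean")) / g.transform("std")

df["worry_z"] = zscore_within_group(df, "worry")
print(df["worry_z"].tolist())  # → [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

After normalization the two groups are identical, which is exactly the point: the uniform “women are more worried about everything” offset is gone, and what remains is each group’s internal ordering of the issues.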
General thoughts on the industry / things to share:
Issues of attention / sincerity also apply on the other side of the survey-taking equation! For telephone surveys, an important cog in the machine is CHEAP LABOR, with all of the normally attendant issues of cheap labor. It’s not my department, but we have had to retrain people pretty often for not fully writing out what respondents say, or for hanging up on people thinking they could get more completes that way (not how it works in our system, actually), and people have been fired for just making up information after a hang-up. Even when you reach out to people, there is noise in the selection effects of who will take the survey. Some surveyors are better than others at getting people to take surveys, even though they follow scripts; high-school girls with clear and friendly speaking voices are our high performers in the call center in this respect. Then, when the data is cleaned, it’s a bit different depending on who cleaned it, and when there are “other” responses there’s a level of arbitrariness in the categorization.
The customer is always right, and the customer doesn’t know math: As the statistician, I have resigned myself to using a relatively limited set of tools for 95% of my work, because most clients are not statistically literate and have only heard of / know the big hits, and most clients who are statistically literate will do their own math in-house. One time we lost a client who wanted a particular form of analysis which involved, as the bedrock of the entire process, selection of key product factors relating intimately to the business (let’s say they were available car features, to be mixed and matched in a way that allows us to estimate the appeal of each without going full combinatorial madness). Given that we weren’t subject-area experts, we talked with them about cooperating on selecting these factors so we could proceed with the analysis on the best footing possible. They got mad at us for asking these questions, saying we should just figure it out ourselves because it’s our job, not theirs.
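The car-features setup sounds like a conjoint-style analysis. As a toy illustration (invented features and ratings, not the client’s actual design): rate a handful of feature bundles rather than every possible combination, then recover each feature’s contribution (“part-worth”) by least squares.

```python
# Conjoint-style sketch: estimate per-feature appeal from ratings of a few
# mixed-and-matched bundles, avoiding the full combinatorial design.
import numpy as np

features = ["sunroof", "awd", "heated_seats"]  # invented
# Each row: which features a rated car profile included (1 = present).
profiles = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
])
ratings = np.array([3.0, 5.0, 2.0, 8.0, 5.0, 7.0])  # invented

# Fit: rating ≈ intercept + sum of per-feature part-worths.
X = np.column_stack([np.ones(len(profiles)), profiles])
coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)
for name, w in zip(features, coef[1:]):
    print(f"{name}: {w:+.2f}")
# → sunroof: +3.00, awd: +5.00, heated_seats: +2.00
```

Which factors go into `features` is exactly the bedrock question the commenter describes: the regression can only rank the factors the subject-area experts chose to put in front of respondents.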
The customer will not read: If the information isn’t in the top-level summary, it is often the case that no one will read it at all. It’s a bit depressing to spend hours and hours on information which will ultimately be chucked straight into the void, but the people paying us often have no interest in diving into the details of the project beyond the easiest, most accessible surface level. Which leads into a broader point:
Most surveys are not about learning things: We do surveys for three broad categories of clients: minor government agencies and departments, charities and NGOs, and private companies (or often, the marketing agencies serving private companies). For a lot of these clients, their goals are not about learning something. Goals include:
Fulfilling legal obligations to do a survey in the community about a given subject, a mechanistic process for sloshing money between parts of the government and/or NGO complexes.
Seeking third party confirmation for an internal decision that’s already been made, or where an arbitrary decision has to be made, consulting style.
Signaling due diligence. We checked / we tried, we are thus beyond reproach.
Getting soundbites / text bites for use in advertising / campaigning.
Budget constraint: When a client has non-truth-based goals for the project, the fact that they get data at all is of significantly more value to them than the quality of that data, which leads them to budget in such a way that the quality of what they get is suspect. On the analysis side we do our best to explain the limits and constraints of what we can actually get on a given budget, what they’re giving up and what they’re keeping (I do power analysis for some clients ahead of time, etc.). The simplest cut is less sample, and after that, worse sample. Going from telephone to mixed mode (online and telephone), and using a panel for online data, both make things worse but cheaper. There are some kludgy partial fixes on the analysis side (weighting, etc.), but you typically get what you pay for.
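On the “less sample” cut: the back-of-the-envelope number in that budget conversation is usually just the margin of error for a proportion. A minimal sketch under simple-random-sampling assumptions at the worst case p = 0.5 (real designs with panels and weighting only do worse):

```python
# 95% margin of error for a sample proportion, at sample sizes a client
# might buy. Pure textbook formula; design effects would widen all of these.
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the 95% confidence interval for a simple random
    sample proportion of size n."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 400, 1000):
    print(n, round(margin_of_error(n), 3))
# → 100 0.098 / 400 0.049 / 1000 0.031
```

The square root is what makes the budget talk painful: halving the margin of error means quadrupling the sample, so the cheap cuts always come out of precision first.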
Overall I didn’t find much to disagree with in your article. I do my best to minimize noise and to contribute as little of it as possible myself, but unless I can see the methodology and know how a survey was done, I don’t trust the results: working in surveys has made me far more survey-skeptical than the average man on the street.