The Ongoing Accomplishment of the Big Five


by a literal banana

I have been trying to understand the “lexical hypothesis” of personality, and its modern descendant, the Five Factor Model of personality, for several months. In that time, I have said some provocative things about the Big Five, and even some unkind things that I admit were unbecoming to a banana. Here, I wish to situate the Five Factor Model in the context of its historical development and modern use, and to demonstrate to the reader the surprising accomplishment that it represents for the field of psychology.

In personality research, the “lexical hypothesis” refers to a hypothesis attributed to Francis Galton (1884). Galton supposed that each human language would reflect important realities of human character within that language and culture. In particular, he noted that the words used to evaluate character and personality are very numerous (he estimated over a thousand, using a thesaurus), and often overlap in meaning.

But Galton immediately left his thesaurus behind, readily admitting the impossibility of defining any aspect of character. Instead, he turned to experimental means of testing character in various ways, and insisted that no particular map or model of personality is needed as a starting point.

Nowhere in his essay does Galton propose surveys as a means for studying character. He would probably regard such methods as unscientific, as indicated in his final paragraph:

[C]haracter ought to be measured by carefully recorded acts, representative of the usual conduct. An ordinary generalisation is nothing more than a muddle of vague memories of inexact observations. It is an easy vice to generalise. We want lists of facts, every one of which may be separately verified, valued and revalued, and the whole accurately summed. It is the statistics of each man’s conduct in small every-day affairs, that will probably be found to give the simplest and most precise measure of his character.

The methods that Galton proposed are exclusively non-linguistic. For instance, he commented that observing children involved in play quickly gives one an idea of each child’s emotional expression. Galton’s proposed methods prefigure both hidden camera prank shows and Garfinkel’s “breaching experiments”:

I will not attempt to describe particular games of children or of others, nor to suggest experiments, more or less comic, that might be secretly made to elicit the manifestations we seek, as many such will occur to ingenious persons. They exist in abundance, and I feel sure that if two or three experimenters were to act zealously and judiciously together as secret accomplices, they would soon collect abundant statistics of conduct. They would gradually simplify their test conditions and extend their scope, learning to probe character more quickly and from more of its sides.

Other methods Galton expressed enthusiasm for include heart rate measurement (he wore a home-brew heart-rate-measuring apparatus while he delivered the lecture that makes up the text) and methods discoverable from personal context (giving an example from Benjamin Franklin, of a man with one attractive and one deformed leg, who kept track of which leg his interlocutors paid attention to, as a gauge of their optimism or pessimism). Galton would be surprised, I think, to find that the most promising and scientific theory of personality in the twenty-first century is premised entirely on survey responses as its “facts.”

Early in the study of personality, there was a major shift of meaning in the lexical hypothesis. At first, the thesaurus and the word list were its tools of study (e.g., Allport & Odbert, 1936); the idea was to find common factors of meaning in the words themselves. Of course, there is no particularly scientific way to decide how much the word “annoying” is the same as “obnoxious,” or how much either is the same as “low-status.” The major shift was to begin to measure the correlations of an entirely different construct: the correlations of the words when used to describe a particular person. That is, rather than trying to measure the underlying meaning of words, researchers began to measure the degree to which different words were applied to the same person. “Sameness” and “correlation” were no longer distinguishable concepts for the methods.

Initially, lists of adjectives, and eventually, short survey questions, were administered to subjects, who described either a person they knew or themselves. When the responses were subjected to factor analysis—a mathematical analysis to reveal the structure of correlations between responses—a varying number of factors emerged, depending on the methods and the researchers and the questions and the subjects, and these factors were given varying names. Since the early 1990s, the Five Factor Model has been dominant, although the names of the factors vary somewhat even today. The acronym OCEAN is used for the traits: Openness to experience (sometimes called “intellect” or “imagination” or “open-mindedness”), Conscientiousness, Extraversion (sometimes called “surgency”), Agreeableness, and Neuroticism (sometimes called “negative emotionality” or “emotional stability” reversed). 
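
To make the factoring step concrete, here is a minimal sketch in Python. It is not the procedure used in any of the studies cited here: it substitutes principal components (found by power iteration) for a full factor analysis, and the six items, their loadings, the two hidden traits, and the sample size are all invented for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_items(n_people=2000, seed=0):
    """Hypothetical survey: items 0-2 follow one hidden trait, items 3-5 another."""
    rng = random.Random(seed)
    loadings = [0.9, 0.9, 0.9, 0.6, 0.6, 0.6]  # made-up item loadings
    columns = [[] for _ in loadings]
    for _ in range(n_people):
        traits = (rng.gauss(0, 1), rng.gauss(0, 1))
        for j, lam in enumerate(loadings):
            trait = traits[0] if j < 3 else traits[1]
            noise = rng.gauss(0, 1)
            columns[j].append(lam * trait + math.sqrt(1 - lam ** 2) * noise)
    return columns


def correlation_matrix(columns):
    k = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]


def leading_component(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)
    )
    return eigenvalue, v


def deflate(matrix, eigenvalue, v):
    """Remove a found component so the next power iteration finds the next one."""
    k = len(matrix)
    return [
        [matrix[i][j] - eigenvalue * v[i] * v[j] for j in range(k)] for i in range(k)
    ]


R = correlation_matrix(simulate_items())
ev1, v1 = leading_component(R)
ev2, v2 = leading_component(deflate(R, ev1, v1))
print("component 1:", round(ev1, 2), [round(x, 2) for x in v1])
print("component 2:", round(ev2, 2), [round(x, 2) for x in v2])
```

Under these assumptions, the first component loads almost entirely on the first three items and the second on the last three. The analysis recovers two clusters of correlated responses; what to call those clusters is left entirely to the researcher.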

Today, the five traits are measured with various survey instruments, with five questions on the shortest version (one for each trait) and sixty questions on a common long-form version (the one used by Soto, 2019). Survey instruments are validated in a number of ways: how much their responses correlate between testings (test-retest reliability, with astrological sign as the gold standard), how much different raters agree when using the same criteria (inter-rater reliability), and a nebulous concept of construct validity, which sometimes includes scientific gestures designed to ensure that the instrument measures what it purports to measure. Many papers present elaborate numerical artifacts of validation, and I have found that some characterize the validity of their instruments as “good” without providing an indication of what would be “not good enough.” From a brief review of dozens of validated instruments in social psychology, it seems to me that it is relatively easy to “validate” meaningless instruments. As long as the mathematical bona fides are present, the construct need not be meaningful in other ways. (The reader who is rightly suspicious of my broad and unsourced claims may wish to search Google Scholar with variations of “scale,” “inventory,” and “survey instrument,” and examine the results critically. The naming of factors is often a particularly interesting step.)
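
The astrological sign remark can be spelled out with a toy simulation. This is only a sketch under invented assumptions: the sign is coded as an arbitrary index from 0 to 11, the "outcome" is random noise unrelated to the sign by construction, and the sample size is made up. Applying a Pearson correlation to an arbitrary category index is done purely for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)
n_people = 5000  # hypothetical sample size

# Hypothetical "instrument": astrological sign coded as an index 0-11.
sign_at_test = [rng.randrange(12) for _ in range(n_people)]
# A birth date does not change, so the retest returns the same answers.
sign_at_retest = list(sign_at_test)
# An outcome that is unrelated to the sign by construction.
outcome = [rng.gauss(0, 1) for _ in range(n_people)]

retest_agreement = sum(
    a == b for a, b in zip(sign_at_test, sign_at_retest)
) / n_people
outcome_correlation = pearson(sign_at_test, outcome)

print("test-retest agreement:", retest_agreement)
print("correlation with outcome:", round(outcome_correlation, 3))
```

In this sketch the test-retest agreement is perfect while the correlation with the outcome hovers near zero: reliability alone says nothing about whether an instrument measures anything of interest.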

The strong claim made by advocates of the Five Factor Model is that any set of questions describing a human being, administered to subjects and the responses subjected to factor analysis, will reveal the same five factors (paraphrasing Jordan Peterson in this video, around 11:10-16:40). This strong claim, though dubious in a number of respects, is a major part of the basis for the scientific legitimacy of the Big Five. It is interesting to see which aspects of the strong claim are admitted to be false by advocates of the Big Five, and how much is excused on the grounds that at least it’s something. The Five Factor Model is not perfect, advocates grant, but it is better than nothing. It is not clear how they measure “better than nothing”; this is a potentially interesting hypothesis in need of precisification, perhaps.

The Big Five exist as a special, scientifically validated property of language and survey methods, and that is one basis for their legitimacy. The other basis for the legitimacy of the Five Factor Model is its replicable correlation with consequential life outcomes. We know that the Big Five are not merely phantoms that fall out of a certain analysis of a certain use of language in WEIRD college students, because these traits are reliably correlated with things we care about. 

One of the most interesting features of the Big Five is the nature of its scientific evidence. Observe what is held out as a “replication” of the theory, and you will discover the theory’s true nature. The most impressive aspect of the ongoing accomplishment of the Five Factor Model is the degree to which it deflects curiosity about its underlying meaning with rituals of scientific validation, regardless of the rituals’ appropriateness in context. Since “replication” is the scientific ritual most recently shown to detect poor science in psychology, being shown to reliably “replicate” is a huge boost to the credibility of a theory. 

The interesting thing about the Five Factor Model is what it gets away with, in terms of being considered a theory, even though it is not causal and makes no predictions. What counts as a “replication” of the Five Factor Model, as in Soto (2019), is the following: a correlation is found between one or more factors of the Five Factor Model and some other construct, and that correlation is found again in another sample, regardless of the size of the correlation. In almost all cases, and in 100% of the measures in Soto (2019), the construct compared to a Big Five factor is derived from an online survey instrument.

What counts as a “consequential life outcome” is also fascinating. In most cases, the life outcome constructs are vague abstractions measured with survey instruments, much like the Big Five themselves. For instance, the life outcome “Inspiration” is measured with the Inspiration Scale, which asks the subject in four ways how often and how deeply inspired they are. Amazingly, this scale correlates a little bit with Extraversion and with Open-mindedness. Do these personality traits “predict” the life outcome of inspiration? Is “Inspiration” as instrumentalized here meaningfully different from the Big Five constructs, such that this correlation is meaningful? 

Compare the items for the construct “Inspiration” with the items for Extraversion and Open-mindedness used in Soto (2019):

[Figure: item lists for the Inspiration Scale, Extraversion+, Extraversion-, Open-mindedness+, and Open-mindedness-]

How surprised should we be that items like “I am inspired to do things” correlate with items like “Is full of energy,” or that “I experience inspiration” correlates negatively with “Has few artistic interests”? “Replication” seems like a dignified term for asking the same questions in different ways.
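
The point about overlapping items can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the actual data: it supposes that a single shared self-description drives the answers to both scales, and the loadings (0.7 and 0.5), the number of items per scale, and the sample size are all invented.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_scale(latent, loading, n_items, rng):
    """Scale total per person: each item is loading * latent + independent noise."""
    noise_sd = math.sqrt(1 - loading ** 2)
    return [
        sum(loading * z + noise_sd * rng.gauss(0, 1) for _ in range(n_items))
        for z in latent
    ]


rng = random.Random(42)
# One hypothetical self-description shared by both questionnaires.
shared = [rng.gauss(0, 1) for _ in range(2000)]
inspiration = simulate_scale(shared, loading=0.7, n_items=4, rng=rng)
extraversion = simulate_scale(shared, loading=0.5, n_items=4, rng=rng)

r = pearson(inspiration, extraversion)
print(f"correlation between the two scale totals: {r:.2f}")
```

With these made-up loadings the two scale totals correlate in the neighborhood of 0.67, even though no trait "predicts" any outcome in the simulation: the same quantity is simply measured twice with different wording.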

One of the more surprising correlations replicated by Soto (2019) is between “Agreeableness” (Big Five) and “Heart Disease” (measured by a questionnaire about chest pain). The correlation is only .04, compared to the original .15. Even so, since the most important questions on the 1977 questionnaire ask the subject to agree with various statements about chest pain, rather than phrasing them in the negative, a tiny correlation with agreeableness would not be surprising. It is not clear what scientific hypothesis—particularly a causal hypothesis—might be advanced here. But it counts as a replication.
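
For a sense of scale, a correlation can be squared to give the share of variance that two measures have in common. The short sketch below applies that standard arithmetic to the two figures quoted above; the function name is just an illustrative label.

```python
def shared_variance(r):
    """Proportion of variance shared by two measures with correlation r."""
    return r * r


for label, r in [("original", 0.15), ("replication", 0.04)]:
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r):.2%}")
# original: r = 0.15, shared variance = 2.25%
# replication: r = 0.04, shared variance = 0.16%
```

On this reading, the replicated correlation corresponds to well under one percent of shared variance.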

Soto (2019) reports that Negative Emotionality (sometimes called Neuroticism) correlates strongly with DSM-III Depression and Anxiety. It is hardly surprising that a subject answering questions about negative emotions would answer questions related to depression and anxiety similarly, since many of them are functionally the same questions. It is also interesting that a measure of supposedly stable personality traits correlates with a measure of supposedly pathological disease symptoms. Nonetheless, the correlation is treated as an important replication of the predictive power of the Big Five. 

Many of the “consequential life outcomes” appear to be rephrasings, or aspects, of the personality factors they correlate with: “Existential or phenomenological concerns” correlates with Open-mindedness, “Existential well-being” and “Subjective well-being” correlate (negatively) with Negative emotionality, and “Dating variety” correlates with Extraversion. Interestingly, “Occupational performance” correlates only .03 with Conscientiousness (Soto 2019), even though the questions on the two surveys seem to overlap more than that. Survey statements like “I do a thorough job” or “I tend to be lazy” seem like they should predict job performance, and this prediction is at the core of the claims of proponents of the Big Five. But despite the inanity of the purported correlation, it is not in fact always easy to detect. Salgado’s (2002) meta-analyses, for instance, found that Conscientiousness modestly correlates with “deviant behavior” (such as theft), but does not correlate with absenteeism or accidents. 

On the other hand, Big Five Conscientiousness was not found to correlate with mask wearing in a sample of thousands in Spain during the coronavirus epidemic (Barceló & Sheen, 2020). This was not treated by the authors as any kind of falsification of the Big Five, or even evidence against it. The abstract noun “conscientiousness” has a rich meaning, only part of which is captured by the Big Five, and an even tinier part of which is captured by the two-question methodology used here (“does a thorough job” and “tends to be lazy”). But Conscientiousness is often correlated with health behaviors, and is often said to predict them with various strengths, even though the questions in the survey focus on job performance and tidiness. For instance, Tiainen et al. (2013) found that Conscientiousness correlated with higher fruit intake, at least in women, in a food questionnaire study. (Extraversion was associated with a higher meat intake in women, and each personality trait was found to have some effect on some gender and food type.) Conscientiousness has been linked to the intention to use sunscreen by beachgoers in the UK, although not among beachgoers in New Zealand (Kouzes et al., 2017). Given the loose standards at play, it wouldn’t have been too surprising to discover a relationship between Conscientiousness and mask wearing, or Conscientiousness and any other allegedly health-related behavior.

The methodology of the Spain mask study was poor (asking about mask use without asking about likelihood of leaving the house at all, for example), but the quality of almost all studies in social science is poor, and it is not worse than most studies I have encountered in support of the Big Five. From a naive perspective, it is surprising that poor-quality studies with no practical power to falsify anything are still performed. But this is exactly the genius of the thing. What is interesting is the degree to which the Five Factor Model is insulated from falsification. It makes no causal predictions; others may make predictions about correlations, and these may replicate, or not. The only content of the Five Factor Model, in the sense of making claims, seems to be that the five factors are important somehow, and that their importance is specially validated by mathematical methods using survey instrument responses. That they correlate with responses on surveys ostensibly measuring other things is taken as evidence for their importance, but rarely is failure to find such evidence counted against their validity. There is an old saying that a stopped clock is right twice a day; it is not clear how often the Big Five “clock” is “right.”

As a novice to the study of the Big Five, I found that I had many misconceptions. For example, I thought that the Big Five were considered to be universal, in some sense, and not just descriptive of WEIRD college students. But in fact, it is not simply that the Big Five factors fail to fall out of the analysis of large surveys in other languages and cultures. They don’t even fall out of Big Five-specific questions administered in non-WEIRD populations (Gurven et al., 2013). Many of the concepts mentioned in the Big Five survey instruments do not even exist in non-WEIRD languages as such, particularly abstract nouns (Gurven et al., 2013). Of course a concept that does not exist across cultures could not form the basis for a universally important personality trait, as measured by language.

Second, I imagined that the Big Five were based on large and extremely inclusive sets of survey items. I admit that I hadn’t really thought about the abstract noun “personality.” What counts as personality, and what doesn’t? Apparently, the exact nature of the questions in the survey seems to matter a great deal. Saucier & Srivastava (2015) say that the appearance of the Big Five factors “is clearly contingent on one’s variable-selection procedure,” noting that multiple particularly broad question databases have failed to produce the Big Five as the top five traits. The Big Five are not a property that emerges from all sets of questions describing humans, as the strong claim would have it, but rather from a subset of questions whose nature and selection procedure are not always clear. The consequences of this are unknown.

Third, I’d thought that the Big Five were relatively stable across the life course; however, longitudinal studies of several cohorts of adults, born between 1914 and 1960, revealed that most traits changed over the life course in distinct patterns (Roberts et al., 2006), with social dominance (an aspect of extraversion), agreeableness, conscientiousness, and emotional stability increasing in adulthood and/or later life. One interpretation is that the positive traits identified by the Big Five are traits of adulthood, or reflections of having control over one’s environment, or something like that. 

Finally, I’d thought that the traits themselves were orthogonal, in that they were genuinely separate traits that didn’t covary much. I’d thought that this was a major focus of the factoring process, and an aspect of the traits’ specialness and validity. However, the traits covary a great deal. Lukaszewski et al. (2017) found positive mean inter-factor correlations for every country in a large international sample, and an especially high value for Tanzania, the outlier apparently driving their result (that complex societies decrease trait covariance through specialization). Nor is it unusual to find big correlations between the factors; in a large sample of twins, Shane et al. (2010) found correlations of .39 between Extraversion and Openness to Experience, and .30 between Emotional Stability and Agreeableness. Chang et al. (2012) address the problem of the non-orthogonality of the Big Five, and report that even with a methodological correction to eliminate “methods bias,” they could not eliminate correlations between many factors, suggesting some may be “redundant.”

If the Big Five are not universal, stable, or orthogonal, what good are they? They have a perfectly clear use. They replicate: the answers to many other survey instruments can be found to correlate with the Big Five survey responses, in multiple samples of survey-takers. To complain that the Big Five are meaningless is somewhat unscientific. They have a very specific meaning within the language game they belong to, and they are popular and memetically successful tools within that sphere. 

The Big Five are, in a sense, protected from falsification. They make no predictions; there is no underlying causal model. As I understand it, no study could be devised to prove that the Big Five aren’t real, because they make no formal pretense to reality. They are innocent mathematical constructs that fall out of particular survey instruments administered to particular populations.

Critics of the Big Five are almost always proponents of personality factor models with other numbers attached. Few criticize the connection between the survey instruments and the underlying reality. Consider each survey question. What does it ask the respondent for, other than “a muddle of vague memories of inexact observations,” as Galton put it? Note how every single question may be responded to with “compared to what?” Context, and not just what we might narrowly think of as personality, is relevant in 100% of the questions. Certainly the questions measure something; it is not at all clear what that something is, and the nature of that something is rarely investigated. 

Heene (2013) puts the matter rather strongly, in regard to psychological methods that allow the researcher to slip out of the reach of falsification:

[T]he tools of mainstream psychology such as [Structural Equation Modeling] and [Item Response Theory] make exactly these strong assumptions about the quantitative structure of psychological attributes. But avoiding any tests of quantitative measurement but applying methods making the assumption of quantity appears to be nothing more than a self-delusion that one bears something valuable instead of being in fact empty-handed. This all too strong tendency to avoid falsification is probably deeply rooted in the scientifically unhealthy political/economical aspiration of psychology which keeps the machine for paper-producing and grant-funding well-oiled but also leading to a severe publication bias….[T]he possibly best evidence of my claims comes from a logical argument: has anyone ever seen articles using SEM, IRT, or Rasch models in which the author admitted the falsification of his/her hypotheses? On the contrary, it appears that stringent model tests are mostly carefully avoided in favor of insensitive “goodness-of-fit indices.” (Citations omitted.)

Holtz & Monnerjahn (2017), reviewing psychology textbooks, conclude that Karl Popper’s ideas of falsification have had “little to no traceable influence on the epistemology and practice of social psychology.” Given social psychology’s newfound conversion to the religion of replication, a theory devoid of causal claims, and therefore predictions, is the perfect theory: it can never be falsified. 

But the true power of the Five Factor Model is not just its power to replicate, but its power to bless the measurement of any correlation (or lack thereof) between itself and any construct as valid science. For example, Asselmann & Specht (2020) measured the Big Five traits of subjects multiple times in a longitudinal study, some of whom had children during the study period, and some of whom remained childless. They predicted, based on previous studies, that three traits would change with parenthood, or predict parenthood: conscientiousness, agreeableness, and emotional stability. In fact, they found that exactly the other two traits (extraversion and openness) correlated with parenthood: more extraverted, less open people selected into parenting by a tiny amount, and people became less open and less extraverted after becoming parents. But the authors conclude their abstract triumphantly:

Taken together, our findings suggest that the Big Five personality traits differ before and across the transition to parenthood and that these differences especially apply to openness and extraversion. 

The Big Five survey instrument doesn’t seem to measure the same changes from study to study, but this is taken as support for the results of the latest study. The underlying construct is not questioned. Simply measure the Big Five and report how they correlate with literally anything else, or even with themselves at different points in time, and you’ve performed socially valid science, regardless of your hypothesis and your results. Such is the power of the Big Five.

Is this “better than nothing”? For now, the “better than nothing” of the Big Five is its ongoing use in publishing psychology papers. It is certainly better than nothing for psychology researchers who want to publish research using cheap methods that no one will question (except rude bananas on the internet). But from an outside perspective, the perspective of one who might hope to get knowledge about the world from science, it is hardly “better than nothing” for an unfalsifiable theory, using methods that evade falsification, to become the dominant paradigm.

Those who are skeptical of the Enneagram are usually Type 6, and those who are skeptical of astrology are usually Tauruses. Similarly, those who criticize the Big Five are typically low on extraversion, high on conscientiousness, low on agreeableness, and high on neuroticism (openness to experience can go either way, I suppose). From the perspective of the social psychology of personality, this essay is a glowing partial replication of the Big Five!

I predict that the Five Factor Model of personality has many years of life in it yet. Perhaps it will endure for decades, producing replicable findings of dubious significance, and meanwhile crowding out creative and meaningful research into how people work. Perhaps the most important and surprising accomplishment of the Five Factor Model is hiding the fact that such research is not taking place within the field of social psychology.



Allport, G. W., & Odbert, H. S. (1936). Trait-names: A psycho-lexical study. Psychological Monographs, 47(1), i.

Asselmann, E., & Specht, J. (2020). Testing the Social Investment Principle Around Childbirth: Little Evidence for Personality Maturation Before and After Becoming a Parent. European Journal of Personality.

Barceló, J., & Sheen, G. (2020). Voluntary adoption of social welfare-enhancing behavior: Mask-wearing in Spain during the COVID-19 outbreak. Preprint at OSF.

Chang, L., Connelly, B. S., & Geeza, A. A. (2012). Separating method factors and higher order traits of the Big Five: A meta-analytic multitrait–multimethod approach. Journal of Personality and Social Psychology, 102(2), 408.

Galton, F. (1884). Measurement of character. Fortnightly, 36(212), 179-185.

Gurven, M., Von Rueden, C., Massenkoff, M., Kaplan, H., & Lero Vie, M. (2013). How universal is the Big Five? Testing the five-factor model of personality variation among forager–farmers in the Bolivian Amazon. Journal of Personality and Social Psychology, 104(2), 354.

Heene, M. (2013). Additive conjoint measurement and the resistance toward falsifiability in psychology. Frontiers in Psychology, 4, 246.

Holtz, P., & Monnerjahn, P. (2017). Falsificationism is not just ‘potential’ falsifiability, but requires ‘actual’ falsification: Social psychology, critical rationalism, and progress in science. Journal for the Theory of Social Behaviour, 47(3), 348-362.

Kouzes, E., Thompson, C., Herington, C., & Helzer, L. (2017). Sun Smart Schools Nevada: Increasing Knowledge Among School Children About Ultraviolet Radiation. Preventing Chronic Disease, 14.

Lukaszewski, A. W., Gurven, M., von Rueden, C. R., & Schmitt, D. P. (2017). What explains personality covariation? A test of the socioecological complexity hypothesis. Social Psychological and Personality Science, 8(8), 943-952.

Roberts, B. W., Walton, K. E., & Viechtbauer, W. (2006). Patterns of mean-level change in personality traits across the life course: A meta-analysis of longitudinal studies. Psychological Bulletin, 132(1), 1.

Salgado, J. F. (2002). The Big Five personality dimensions and counterproductive behaviors. International Journal of Selection and Assessment, 10(1-2), 117-125.

Saucier, G., & Srivastava, S. (2015). What makes a good structural model of personality? Evaluating the big five and alternatives. In APA handbook of personality and social psychology, Volume 4: Personality processes and individual differences (pp. 283-305). American Psychological Association.

Shane, S., Nicolaou, N., Cherkas, L., & Spector, T. D. (2010). Genetics, the Big Five, and the tendency to be self-employed. Journal of Applied Psychology, 95(6), 1154.

Soto, C. J. (2019). How replicable are links between personality traits and consequential life outcomes? The Life Outcomes of Personality Replication Project. Psychological Science, 30(5), 711-727.

Tiainen, A. M. K., Männistö, S., Lahti, M., Blomstedt, P. A., Lahti, J., Perälä, M. M., … & Eriksson, J. G. (2013). Personality and dietary intake–findings in the Helsinki Birth Cohort Study. PLoS ONE, 8(7), e68284.