the indo-european controversy

14 august 2018

Big Data is as big in the humanities as anywhere else. This seems counterintuitive. Humanities are supposed to be about the slow accumulation of expertise about unique, contingent things, not about crunching massive datasets to get generalizable results. But in some of the humanities, you can see the appeal of big data. Linguists love their corpuses, or whatever the plural of "corpus" may be. Historians depend on demography. Even literary scholars can appreciate the ability to scan vast ranges of text for verbal patterns that would take several lifetimes to get a feel for the old-fashioned way.

But Big Data has its limits. Asya Pereltsvaig and Martin Lewis chart some of the absurdities of the Big Data approach to historical linguistics in their recent book The Indo-European Controversy. At least I read the book recently; it came out in 2015. But heck, when you're talking about millennia of linguistic history, what's three little years?

The Indo-European Controversy was written in response to a single article, "Mapping the origins and expansion of the Indo-European language family," by Remco Bouckaert et al., which appeared in the journal Science in 2012. Pereltsvaig and Lewis admit that writing a 235-page book to refute a 3-page article looks like overkill. But they have a meta-purpose. "Mapping the origins" marked a rare appearance of high-profile linguistics in one of the world's leading scientific journals. Not just linguistics, but big-data computational linguistics at that, and trained on a problem (the geographical origins of Indo-European languages) with long-standing political and cultural implications. For a lot of observers, particularly those who read only quick mainstream-press digests of articles in the big-gun science journals, "Mapping the origins" may be all they will hear for the next decade about a major issue in the history of language. Not only does the specific article deserve a thorough critique, but the whole rhetoric of glitzy sound-bite science writing should come under scrutiny.

Not that a heck of a lot of people are going to read a technical 235-page academic-press book on historical linguistics and its methodology. But I will! And by posting this review, however belated, I can add my mite to the authors' counterstatement.

So what's the controversy? There are two main plausible theories about the Indo-European homeland. Indo-European languages, now spoken anywhere planes fly or ships sail, were spread at the dawn of history from Ireland to India. Two areas in the center of that range hold sway as possible homelands: the steppes north of the Black Sea (the "Steppe hypothesis") and Asia Minor (the "Anatolian hypothesis"). Both have some support from pencil-and-paper linguistic reconstruction and from archeology. The Steppe hypothesis has a good deal more, and is thus the orthodox view among linguists, the Anatolian being distinctly contrarian.

The authors of "Mapping the origins" fed word lists from Indo-European languages into Bayesian algorithms: computer programs that develop indications of how probable certain scenarios may be. They came down decisively on the side of the Anatolian hypothesis. In the process, they opined in favor of a very early date for the splitting of Indo-European into its many daughter languages, closer to 8,000 or 9,000 than 5,000 years ago (the Steppe hypothesis usually opts for the much later date). Case closed, for many in the media and many following at home.

Except … of course it isn't that simple. Algorithms like the ones that Bouckaert et al. use are like those used to predict presidential elections and baseball seasons – Pereltsvaig and Lewis even invoke Nate Silver, a current guru of both kinds of prediction, in explaining the methodology. As we saw in the 2016 election, and see in baseball annually, those predictions are only as good as the data that goes into them. Pereltsvaig and Lewis don't think that the "Mapping" data is very good.

And there's also the difference between prediction and retrospection. If I want to get a good idea of who's going to win the 2018 congressional elections, or the 2018 World Series, I can feed data into programs that simulate thousands and thousands of "preplays" of the events. I obviously still don't know what's going to happen – it hasn't happened yet. But if a certain outcome shows up more often than others, that's at least the way to bet. The thing about the past, though, is that it has already happened. The speakers of Proto-Indo-European lived somewhere, no matter how likely or unlikely. You can extrapolate backwards and find a greater statistical chance that they came from Anatolia, but if they came from the steppes, they came from the steppes. I can replay the 1978 AL East playoff game 10,000,000 times and find a high likelihood that Reggie Jackson hit the home run that won it; but still, Bucky Dent did.
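For the curious, the "preplay" idea fits in a few lines of Python. This is only a toy sketch: the function name `simulate_series` and the per-game win probability `p_win` are my own made-up assumptions, not anything taken from Bouckaert et al., from Nate Silver, or from the book.

```python
import random

def simulate_series(p_win, games_needed=4, trials=10_000, seed=0):
    """Estimate how often a team wins a first-to-`games_needed` series.

    p_win is an assumed, invented chance of winning any single game.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same estimate
    series_wins = 0
    for _ in range(trials):
        wins = losses = 0
        # Play games until one side reaches games_needed victories.
        while wins < games_needed and losses < games_needed:
            if rng.random() < p_win:
                wins += 1
            else:
                losses += 1
        if wins == games_needed:
            series_wins += 1
    # Share of simulated series that the team won.
    return series_wins / trials

if __name__ == "__main__":
    print(simulate_series(0.55))
```

The number that comes back is a frequency across imagined series, which is "the way to bet" and nothing more. Run the same machinery backwards over 1978 and it will still tell you only what usually would have happened, not what did.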

In any case, whether I'm projecting forwards or backwards, my projection is only as good as my data, and Pereltsvaig and Lewis argue that the "Mapping the origins" data is inadequate. Crucially, the "Mapping" team uses only word lists. These lists, called Swadesh lists after the mid-20th-century statistical linguist Morris Swadesh, include common vocabulary items that are arguably somewhat resistant to change over time. The degree to which those words have actually changed in daughter languages can help establish the lineages of language families.

Words are certainly important in linguistic history. But Pereltsvaig and Lewis see dangers in relying on words alone. The mere content of a language's lexicon, taken uncritically, can be deceptive. For instance, one reason we know that the Indo-European languages stem from a common ancestor is the presence of cognates for words like "otter," "adder," "ewe," "mouse," and several other animals. But most Indo-European languages (and many non-Indo-European languages) also share a common word for "dinosaur." This is not because early Indo-Europeans knew about dinosaurs, still less because they lived among dinosaurs, like the Flintstones. It is because the word "dinosaur" was coined (in English from Greek roots) less than 200 years ago, and then borrowed into a host of other languages. This is a stupid example, but the principle obtains: when looking at word lists, you have to carefully sort out cognates from borrowings, something that simple Swadesh lists fail to do.
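The dinosaur point is easy to see with a toy word-list comparison. Everything here is invented for illustration: the two "languages," the letter codes standing for cognate classes, and the function name `shared_fraction` are all my assumptions. ("Dinosaur" wouldn't appear on a real Swadesh list, either.) Two words get the same letter when they look alike enough to be counted as a match.

```python
def shared_fraction(lang_a, lang_b, borrowed=frozenset()):
    """Fraction of meanings on which two word lists fall in the same class.

    Meanings named in `borrowed` are treated as known loans and skipped.
    """
    meanings = [m for m in lang_a if m in lang_b and m not in borrowed]
    if not meanings:
        return 0.0
    same = sum(1 for m in meanings if lang_a[m] == lang_b[m])
    return same / len(meanings)

# Made-up data: same letter = the two words would be counted as a match.
lang_x = {"mouse": "A", "ewe": "B", "water": "C", "dog": "D", "dinosaur": "Z"}
lang_y = {"mouse": "A", "ewe": "B", "water": "E", "dog": "F", "dinosaur": "Z"}

naive = shared_fraction(lang_x, lang_y)                            # 3 of 5 match
filtered = shared_fraction(lang_x, lang_y, borrowed={"dinosaur"})  # 2 of 4 match

if __name__ == "__main__":
    print(naive, filtered)
```

Counting the loanword makes the two lists look more closely related than the inherited vocabulary alone would suggest. The hard part, of course, is the step the sketch takes for granted: knowing which words are borrowings in the first place.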

And in any case, thinking of a language as merely a set of words is hugely reductive. Languages are way more than flashcard vocabularies. They consist of sound systems, rules for forming and inflecting words, and grammars (which in turn are determined by a language's settings for a system of parameters – as Mark Baker discusses in The Atoms of Language, a frequent reference point for Pereltsvaig and Lewis). By ignoring all those aspects of language, the "Mapping" team appears to have run into numerous absurdities, splitting languages into impossible lineages and getting the chronology of "branching" of these lineages out of order.

Of course, proponents of Big Data might argue that all these pesky details are beside the point. By invoking computers' ability to run millions of alternative histories, Big Data analysts deliberately blur the picture, so that they don't miss the forest for the trees. (Often these are literal trees, as names for tree species are central to traditional linguists' reconstructions of Indo-European origins.) Big Data has a point here, but it just brings us back to the Bucky Dent problem: if your retro-analyses conflict with actual history, the whole exercise is fairly useless.

The "Mapping" algorithms produce best-fit tree lineages for various Indo-European language groups, relative chronologies for them based on assumptions about the rate of language change (what used to be called "glottochronology"), and various historical anchorage points that establish an absolute framework onto which their relative findings can be overlaid. (This work is similar to that done by statistical evolutionary biologists in trying to construct species lineages, another analogy that Pereltsvaig and Lewis critique as misleading.) In the process, the Bouckaert team comes up with assertions that, according to Pereltsvaig and Lewis, are real howlers if you're a historical linguist: an ancient date for the separation of Romani from the Indian languages (59), a grouping of Polish with Eastern Slavic tongues (84), and weird chronologies for the Romance languages (106). The last of these problems is bizarre because the Romance languages developed entirely within recorded history. It's as if we didn't know anything about Latin, and used the Big-Data approach to reconstruct the history of the relationships among French, Italian, Spanish, and Romanian. In such a case, the researchers might be way off. In the actual case, we know they're way off, yet the "Mapping" team continues to show confidence in their methods.
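Since I mentioned glottochronology: the old-school version boils down to one formula, t = ln(c) / (2 ln(r)), where c is the fraction of list items two languages still share and r is the fraction a language is assumed to retain per millennium. Here is a minimal sketch of that classic constant-rate calculation. It is not the model used in "Mapping the origins"; the function name is my own, and the retention rate is whatever you choose to assume (a figure around 0.86 is often quoted for the 100-item list, but treat that as an assumption too).

```python
import math

def divergence_millennia(shared, retention):
    """Classic constant-rate estimate: t = ln(c) / (2 * ln(r)).

    shared    -- fraction c of list items the two languages still share
    retention -- assumed fraction r of items one language keeps per millennium
    """
    if not (0 < shared <= 1) or not (0 < retention < 1):
        raise ValueError("shared must be in (0, 1], retention in (0, 1)")
    # Each language independently keeps r**t of its list after t millennia,
    # so the expected overlap is r**(2*t); solve that for t.
    return math.log(shared) / (2 * math.log(retention))

if __name__ == "__main__":
    # Illustrative inputs only, not measurements from any real language pair.
    print(divergence_millennia(shared=0.6, retention=0.86))
```

The whole estimate hangs on the assumed rate: nudge r up or down a little and the date moves by centuries, which is one reason anchoring against recorded history, as with Latin and the Romance languages, matters so much.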

But let's say that Big Data is correct, and the Indo-European linguistic community last existed as a contiguous whole more than 8,000 years ago in Anatolia. How, in that scenario, did the Indo-European languages spread across Eurasia? Champions of Anatolian origins tend to follow Colin Renfrew, the earlier pioneer of Anatolian thinking, in proposing "demic diffusion" as the mechanism for the propagation of Indo-European. Demic diffusion is a little hard to envision, and its very vagueness is in fact an argument against it. The idea is that the Indo-Europeans were early agriculturalists. As their way of life caught on, they'd start farming a few acres further on in every new generation. Hunter-gatherers at the edge of their range would pick up their languages along with their farming techniques. Several really boring millennia later, everybody was speaking an Indo-European language as they plowed their fields.

Languages that we know from historical records haven't spread by demic diffusion, of course. Migration, colonization, and conquest have been way more effective. Caesar shows up in Gaul with a few legions, and ecce, a little while later all the Celts there are speaking Latin.

(Or rather, of course, they're speaking a strange Gallic Latin that would eventually become French after a bunch of Germans descended on them speaking a Frankish dialect that would also have a major impact on the way they spoke. Another critique that Pereltsvaig and Lewis make of the "Mapping" hypothesis is that it seems to envision languages as traveling in bubbles along with their speakers and developing without external influences, but that also never happens.)

What we know of linguistic history, then, suggests that languages move across continents (and skip over oceans) in the mouths of highly mobile speakers: cavalry, seaborne traders, hordes of barbarians. By contrast, the Anatolian hypothesis sees everybody as just plodding along in a literally bovine manner till they'd seeped into the nooks and crannies of arable Europe and Asia. Such a mechanism doesn't appear to fit the archeological or linguistic facts of the situation.

Demic diffusion does, however, have a political virtue. Too many early Indo-Europeanists, back in the bad old racist days c. 1900, saw the "Aryan" tongues, as they called them, conquering Europe and Asia along with their indomitable warrior speakers. This is also nonsense, as people of any ethnicity can learn any human language, and frequently have. You don't need to think of the Indo-Europeans as a master race to envision their languages winning out; but you don't need to swing to the other extreme, an Edenic vision of agricultural placidity, either.

What of the authors' own critical ethos? Many of the "Mapping" team aren't traditional historical linguists, and Pereltsvaig and Lewis admit that they're not experts on the reconstruction of Indo-European, either. They were best known, going into this project, as bloggers. I used to read Asya Pereltsvaig's incisive commentaries on linguistics, and Martin Lewis used to run what looks like a very sharp blog on geography. (Both blogs seem to have gone dormant at about the time The Indo-European Controversy was published, as far as I can tell.) Bloggers might know a thing or two (says me, posting Internet entries about books for the past 15 years). And of course The Indo-European Controversy carries the imprimatur of Cambridge University Press; though since one of Pereltsvaig and Lewis' themes is that you can't trust everything you read from prestigious outlets, that imprimatur is somewhat undercut.

Still, I find the logic that Pereltsvaig and Lewis deploy to be convincing, and their distrust of the distancing solutions of data analysis to be compelling. Their conclusions in a technical sphere mirror Jerry Muller's more generalized critiques of data-driven thinking in The Tyranny of Metrics. In so many realms of life, direct experience of weird, specific, contingent, customized stuff is losing ground to distanced, one-size-fits-all number crunching. But such number crunching is not just speciously "objective"; it can also be considerably detached from reality.

Pereltsvaig, Asya, and Martin W. Lewis. The Indo-European Controversy: Facts and fallacies in historical linguistics. Cambridge: Cambridge University Press, 2015.