To train Google’s artificial Q&A brain, Orr and company also use old news stories, from which machines start to see how headlines serve as short summaries of the longer articles that follow. But for now, the company still needs its team of PhD linguists. They not only demonstrate sentence compression but also label parts of speech in ways that help neural nets understand how human language works. Made up of about 100 PhD linguists across the globe, the Pygmalion team produces what Orr calls “the gold data,” w

A new paper published in PLoS ONE outlines some of the major problems with the corpus of scanned books that powers Google Ngram. “It’s so beguiling, so powerful,” says Peter Sheridan Dodds, an applied mathematician at the University of Vermont who co-authored the paper. “But I think there’s a misrepresentation of what people should expect from this corpus right now.” Here are some of the problems.

