The Pitfalls of DATA Science
JUNE 17, 2021
Now, we will continue the discussion on the undermining of scientific publication using two examples: SCIgen and citation counts.
In 2005, three MIT computer science graduate students created a prank program called SCIgen that strings together randomly selected words to generate bogus computer science papers, complete with realistic-looking graphs of random numbers. Their goal was “maximum amusement rather than coherence,” and to demonstrate that some academic conferences will accept almost anything.
They submitted a hoax paper titled, “Rooter: A Methodology for the Typical Unification of Access Points and Redundancy,” to the World Multiconference on Systemics, Cybernetics and Informatics (WMSCI) that was being held that July in Orlando, Florida.
The abstract read:
Many physicists would agree that, had it not been for congestion control, the evaluation of web browsers might never have occurred. In fact, few hackers worldwide would disagree with the essential unification of voice-over-IP and public-private key pair. To solve this riddle, we confirm that SMPs can be made stochastic, cacheable, and interposable.
Among the bogus references was:
Aguayo, D., Aguayo, D., Krohn, M., Stribling, J., Corbato, F., Harris, U., Schroedinger, E., Aguayo, D., Wilkinson, J., Yao, A., Patterson, D., Welsh, M., Hawking, S., and Schroedinger, E. A case for 802.11b. Journal of Automated Reasoning, 904 (Sept. 2003), 89-106.
Notice that D. Aguayo appears three times in the list of authors and that E. Schroedinger appears twice. Aguayo is one of the MIT pranksters and Schroedinger is an apparent reference to Erwin Schroedinger, the Nobel Laureate physicist who died in 1961, 42 years before the referenced paper was said to have been written. Nonetheless, WMSCI quickly accepted the fake paper.
The students have now gone on to bigger and better things, but SCIgen lives on. Believe it or not, some resourceful (desperate?) researchers bolster their CVs by using SCIgen to create papers that they submit to conferences and journals.
Cyril Labbé, an energetic and enterprising computer scientist at the University of Grenoble, wrote a program to detect hoax papers published in real journals. Working with Guillaume Cabanac at the University of Toulouse, he found 243 bogus papers written entirely or in part by SCIgen. A total of 19 publishers were involved, all reputable and all claiming that they only publish papers that pass rigorous peer review.
Gaming Citation Counts
In 2010 Labbé showed how citation counts could be inflated. In a few short months, he elevated an imaginary computer scientist (Ike Antkare, pronounced, “I can’t care”) to “one of the greatest stars in the scientific firmament.” Because Google Scholar only indexes papers that reference a paper already in Google Scholar, Labbé first used SCIgen to create a fake paper, purportedly authored by Antkare, that referenced real papers. He then used SCIgen to generate 100 additional bogus papers supposedly authored by Antkare, each of which cited itself, the other 99 papers, and the initial fake paper. Finally, Labbé created a web page listing the titles and abstracts of all 101 papers, with links to PDF files, and waited for Google’s web crawler to find the bogus cross-referenced papers.
The Googlebot soon found the papers and Antkare was credited with 101 papers that had been cited by 101 papers, which propelled him to 21st on Google’s list of the most cited scientists of all time, behind Freud but well ahead of Einstein, and first among computer scientists.
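To see how quickly such a scheme piles up citations, here is a minimal Python sketch of the cross-citation structure as described above. The paper labels and function names are my own illustrative inventions, the seed paper's references to real papers are left out, and the totals Google Scholar actually credited may differ from this simplified count.

```python
def build_antkare_citations(n_generated=100):
    """Return {paper: set of papers it cites} for the scheme described in the text.

    Labels such as "seed" and "gen0" are hypothetical, chosen for illustration.
    """
    seed = "seed"  # the initial fake paper; its citations of real papers are omitted
    generated = [f"gen{i}" for i in range(n_generated)]
    cites = {seed: set()}
    for paper in generated:
        # Each generated paper cites itself, the other generated papers, and the seed.
        cites[paper] = set(generated) | {seed}
    return cites


def citation_counts(cites):
    """Count how many papers in the collection cite each paper."""
    counts = {paper: 0 for paper in cites}
    for refs in cites.values():
        for ref in refs:
            counts[ref] += 1
    return counts


cites = build_antkare_citations()
counts = citation_counts(cites)
print("papers:", len(counts))
print("citations per paper:", sorted(set(counts.values())))
print("total citations:", sum(counts.values()))
```

Under these assumptions, every one of the 101 papers collects 100 citations, for more than 10,000 citations in total, without a single reader.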
If imitation is the sincerest form of flattery, Labbé should be flattered. Soon after he reported the Antkare stunt, three Spanish researchers who specialize in bibliometrics and research evaluation reported inflating their Google Scholar Citation profiles by posting six fictitious papers on one of their university websites, with each of the six papers citing 129 papers written by the authors. As expected, “The citation explosion was thrilling, especially in the case of the youngest researchers whose citation rates multiplied by six.” They published an account of their manipulation of citation counts because their intent was not to game the system, but to show how the system cannot be trusted because it can be gamed.
Even if researchers do not do an all-out Ike Antkare, they can still easily game citation metrics: in every paper they write, they can cite as many of their other papers as the editors will let them get away with.
Journals can also game citation counts by publishing lots of papers that cite papers previously published in the journal. On more than one occasion, I have had journal editors ask me to add references to articles published in their journal. I randomly selected an Expert Systems paper and found that it had nine citations to other Expert Systems papers. I previously estimated that Expert Systems publishes an average of 1,200 articles a year. If Expert Systems publishes 1,200 papers a year, each with 9 citations to other papers published in this journal, authors will average 9 citations per paper and the journal will average 10,800 citations per year, even if their articles are never cited by papers published in other journals.
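The arithmetic is simple enough to check in a few lines of Python. This is only a back-of-the-envelope sketch: the function name is hypothetical, and it assumes that every paper makes the same number of within-journal citations and that those citations are spread evenly over a comparable number of papers.

```python
def self_citation_totals(papers_per_year, self_cites_per_paper):
    """Return (total within-journal citations per year, average received per paper).

    Assumes citations are spread evenly across papers, which is a simplification.
    """
    total = papers_per_year * self_cites_per_paper
    average_per_paper = total / papers_per_year
    return total, average_per_paper


# Figures from the text: 1,200 papers a year, 9 within-journal citations each.
total, average = self_citation_totals(1200, 9)
print(total, average)
```

With the figures above, this gives 10,800 citations a year and an average of 9 per paper, all generated inside the journal itself.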
Publishers like Elsevier that have a large portfolio of journals can also boost citation counts by encouraging authors to cross-cite other journals in the portfolio. Expert Systems is, in fact, among the top 5 artificial intelligence journals in terms of total citations but ranks 27th when the citations are weighted by the importance of the journals that the citations come from.
What to do? What to do?
Publication counts and citation indexes are too noisy and too easily manipulated to be reliable. Nor can we evaluate research simply by noting the journals that publish the results. Because there is so much noise in the review process, lots of great papers have been published in lesser journals and many terrible papers have been published in the most respected journals. John P. A. Ioannidis’s paper, “Why Most Published Research Findings Are False,” has been cited nearly 10,000 times despite being published in PLOS Medicine, a good-but-not-great journal. On the other hand, the British Medical Journal, a truly great journal, published a paper making the preposterous claim that Asian-Americans are unusually susceptible to heart attacks on the fourth day of the month because the bad luck associated with the number four is as terrifying as being chased down a dark alley by a savage dog with “flaming jaws and blazing eyes.”
The best solution is to have experts read the research done by applicants for a job, promotion, or grant. Simple counts and indexes simply will not do. Bob Marks reminded me of the old aphorism in academia, “Deans can’t read, but they can count.” A better one is, “Not everything that can be counted counts, and not everything that counts can be counted.”
You may also wish to read the first two articles in this three-part series:
Publish or Perish — Another Example of Goodhart’s Law. In becoming a target, publication has ceased to be a good measure. Researchers game the system to beat the publish-or-perish culture, which undermines the usefulness of publication and citation counts. (Gary Smith)
Gaming the System: The Flaws in Peer Review. Peer review is well-intentioned but flawed in many ways. Predatory journals, dishonest researchers, and escalating costs in academic journals reveal the weaknesses in peer review. (Gary Smith)
GARY N. SMITH
SENIOR FELLOW, WALTER BRADLEY CENTER FOR NATURAL AND ARTIFICIAL INTELLIGENCE
Gary N. Smith is the Fletcher Jones Professor of Economics at Pomona College. His research on financial markets, statistical reasoning, and artificial intelligence, which often involves stock market anomalies, statistical fallacies, and the misuse of data, has been widely cited. He is the author of The AI Delusion (Oxford, 2018) and co-author (with Jay Cordes) of The Phantom Pattern (Oxford, 2020) and The 9 Pitfalls of Data Science (Oxford, 2019). Pitfalls won the Association of American Publishers 2020 PROSE Award for “Popular Science & Popular Mathematics”.