We have some new numbers in the long-running question about the impact LLMs are having on scientific discovery… and they’re pretty surprising (or maybe they shouldn’t be?).
A team of researchers has developed a novel method to detect the presence of AI-generated content in scientific peer reviews. Their findings reveal a huge increase in the use of AI writing tools, such as ChatGPT, in the review process of major machine learning conferences. This has far-reaching implications for the integrity and trustworthiness of scientific research.
The study, titled “Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews,” proposes a new method called “distributional GPT quantification” to estimate the fraction of text in a large corpus that has been substantially modified or generated by AI. The authors apply this technique to peer reviews from major AI conferences, including ICLR, NeurIPS, CoRL, and EMNLP, both before and after the release of ChatGPT in late 2022. The results uncover concerning trends in the use of AI-generated content and raise important questions about the future of scientific research in the era of large language models.
How it works
“Figure 2: An overview of the method. We begin by generating a corpus of documents with known scientist or AI authorship. Using this historical data, we can estimate the scientist-written and AI text distributions P and Q and validate our method’s performance on held-out data. Finally, we can use the estimated P and Q to estimate the fraction of AI-generated text in a target corpus.”
The key technical innovation of the study is the authors’ distributional GPT quantification method, which enables efficient estimation of AI-generated content at the corpus level, without the need to classify individual documents. The main steps of the method are:
Collect a dataset of known human-written and AI-generated texts to serve as reference distributions. In this case, the authors used peer reviews from past conferences (before ChatGPT) as the human-written reference, and generated corresponding AI-written reviews by prompting GPT-4 with the same review instructions.
Estimate token usage distributions for the human-written (P) and AI-generated (Q) references. The authors focused on the occurrence probabilities of adjectives.
Fit a mixture model to the target corpus of unknown composition, assuming each document is drawn from a weighted combination of the human and AI distributions: (1 − α)P + αQ. Use maximum likelihood estimation to infer α, i.e., the fraction of AI-generated content (see the sketch after this list).
Validate the method on held-out data with known mixing proportions. The authors show that their approach accurately recovers the true α within 2.4% absolute error, outperforming baseline classifiers.
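To make the mixture-fitting step more concrete, here’s a minimal sketch of what the maximum likelihood estimation of α could look like in Python. It assumes you already have reference corpora of human-written and AI-generated reviews reduced to their adjectives; the function names, the grid-search optimizer, and the toy data are my own illustrative choices, not the authors’ actual implementation.

```python
# Minimal sketch of the distributional quantification idea, under the
# assumptions described above (not the authors' code).
import numpy as np
from collections import Counter

def adjective_probs(docs):
    """Estimate adjective occurrence probabilities from a reference corpus.

    docs: list of documents, each represented as a list of adjectives.
    (One simple estimator; the paper's exact choice may differ.)
    """
    counts = Counter(adj for doc in docs for adj in doc)
    total = sum(counts.values())
    return {adj: c / total for adj, c in counts.items()}

def log_doc_likelihood(doc, probs, eps=1e-12):
    """Log-probability of a document's adjectives under one distribution."""
    return sum(np.log(probs.get(adj, eps)) for adj in doc)

def estimate_alpha(target_docs, P, Q, grid=np.linspace(0.0, 1.0, 1001)):
    """Grid-search MLE for alpha in the mixture (1 - alpha) * P + alpha * Q."""
    log_p = np.array([log_doc_likelihood(doc, P) for doc in target_docs])
    log_q = np.array([log_doc_likelihood(doc, Q) for doc in target_docs])

    best_alpha, best_ll = 0.0, -np.inf
    for alpha in grid:
        # Per-document mixture likelihood, computed in log space for stability.
        ll = np.logaddexp(np.log(1 - alpha + 1e-12) + log_p,
                          np.log(alpha + 1e-12) + log_q).sum()
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha

# Toy example: P from pre-ChatGPT reviews, Q from GPT-4-written reviews.
P = adjective_probs([["novel", "unclear"], ["significant", "weak"], ["robust"]])
Q = adjective_probs([["commendable", "innovative"], ["meticulous", "comprehensive"]])
target = [["novel", "weak"], ["commendable", "meticulous"], ["significant"]]
print(estimate_alpha(target, P, Q))  # roughly the AI-generated fraction
```

The key property is that α is estimated directly at the corpus level, so no single review ever has to be classified as human-written or AI-generated.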
Results
Applying this method to peer reviews from major AI conferences, the authors find a significant increase in the prevalence of AI-generated content following the release of ChatGPT in late 2022. Specifically, they estimate that 10.6% of ICLR 2024 reviews, 9.1% of NeurIPS 2023 reviews, 6.5% of CoRL 2023 reviews, and 16.9% of EMNLP 2023 reviews contain substantial AI-generated content, compared to ~1–2% in the pre-ChatGPT era.
To verify that the method is detecting more than just superficial AI assistance (e.g., grammar correction), the authors conduct sensitivity analyses by applying the model to human-written reviews that were lightly edited by ChatGPT. They find that the estimated α increases only marginally in this setting, confirming that the model is capturing more substantial AI generation.
“Figure 12: Word cloud of top 100 adjectives in LLM feedback, with font size indicating frequency.”
The authors also investigate factors associated with higher AI content in reviews. Some key findings include:
Reviews submitted close to the deadline (within 3 days) have significantly higher estimated α values.
Reviews containing scholarly citations (indicated by the presence of “et al.”) have lower α, suggesting that AI-generated reviews are less likely to properly cite related work.
Reviewers who engage less in the rebuttal process (i.e., respond to fewer author rebuttals) tend to have higher α estimates.
AI-generated reviews are more semantically homogeneous, as measured by the similarity of their text embeddings (see the short sketch after this list). This suggests that AI assistance may reduce the diversity of perspectives in peer review.
Reviews with lower self-reported confidence scores are associated with higher α values, hinting at a potential “deferral to the machine” effect when reviewers are uncertain.
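For the homogeneity finding above, here’s a hedged sketch of what such a measurement could look like: embed each review and compare the average pairwise cosine similarity within a group. The embedding model, the helper function, and the toy reviews are illustrative assumptions, not the authors’ exact setup.

```python
# Illustrative sketch only; the model choice and toy data are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

def mean_pairwise_cosine(texts, model):
    """Average pairwise cosine similarity among a group of texts."""
    emb = model.encode(texts, normalize_embeddings=True)  # unit-norm embeddings
    sims = emb @ emb.T                                     # cosine similarity matrix
    n = len(texts)
    # Average over all ordered pairs, excluding the diagonal (self-similarity).
    return (sims.sum() - n) / (n * (n - 1))

model = SentenceTransformer("all-MiniLM-L6-v2")
pointed_reviews = [
    "The proof of Theorem 2 omits the boundary case where the step size vanishes.",
    "Section 4's ablation contradicts the claim made in the abstract.",
]
generic_reviews = [
    "The paper is well written and addresses an important and timely problem.",
    "The work is commendable, well organized, and tackles a significant problem.",
]
print("pointed reviews similarity:", mean_pairwise_cosine(pointed_reviews, model))
print("generic reviews similarity:", mean_pairwise_cosine(generic_reviews, model))
```

A higher within-group similarity for the generic, AI-flavored reviews would be consistent with the homogenization effect the authors report.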
In plain English
In simpler terms, this study develops a way to “sniff out” AI-generated text in large collections of documents. It’s a bloodhound for robot writing!
Applying this AI detection technique to peer reviews from top machine learning conferences, the scientists found a surprising amount of “artificial flavoring” slipping into the mix. Before the release of ChatGPT, only about 1–2% of reviews had hallmarks of substantial AI involvement. But after ChatGPT hit the scene in late 2022, those numbers shot up to roughly 6.5% to 17% of reviews, depending on the conference.
The researchers dug deeper to understand the conditions where AI incursion tends to be higher. They found AI-tinged reviews are more likely to:
Come in hot at the last minute, right before the deadline
Skimp on scholarly citations
Come from reviewers who engage less in back-and-forth discussion
Sound suspiciously alike, like they came off a cookie-cutter assembly line
Correlate with reviewers expressing low confidence in their assessments
In the authors’ interpretation, this points to a concerning trend of AI reshaping peer review, the “Supreme Court” of scientific legitimacy, in subtle but significant ways. If AI makes reviews more generic, less grounded in citation, and less robust to author challenge, the whole process of quality control for science could be compromised.
Limitations
I think this study makes a valuable contribution by empirically surfacing the “shadow impact” of AI writing tools in the vital domain of scientific peer review. And I think it provides a foundation for the research community to critically examine the role of AI in shaping knowledge production and validation.
However, as the authors note, the current analysis has limitations that should temper interpretation and spur follow-up work. For example:
The method likely underestimates the full scope of AI influence, as it may not capture more subtle assistance like polishing prose or tweaking a few key sentences. The true “AI footprint” could be even larger.
The observed correlates of higher AI content, such as fewer citations or less rebuttal engagement, are just that — correlations. Establishing whether AI causes lazy reviewing practices, or whether lazy reviewers just gravitate to AI, requires more targeted study designs; the current data alone can’t settle that question.
The technique’s long-term robustness is unclear, as the “arms race” between AI generation and detection may escalate. If future language models learn to evade these statistical signatures, the estimates could become less reliable. Monitoring the AI landscape will be crucial.
More broadly, I think this study should catalyze some reflection and proactive thinking in the scientific community around the responsible use of AI writing aids. Some key questions for all of us to grapple with include:
What are the appropriate norms and disclosure expectations for using AI in peer review? Should it be prohibited entirely, or encouraged with guardrails?
How can the research community incentivize and enforce good practice, such as engagement with author rebuttals and proper citation practices, in the presence of AI temptations?
If AI homogenizes review content and style, how can we ensure a healthy diversity of ideas and perspectives in vetting research claims?
How should the advent of human-AI hybrid knowledge work reshape scientific reward structures, like authorship and credit?
In short, while I think AI writing tools hold immense promise to aid and augment human reasoning, this study highlights the very real potential drawbacks of their adoption in high-stakes arenas of scientific judgment.
Encouragingly, I think this research also demonstrates the power of innovative measurement techniques to track and illuminate AI’s tendrils of influence. As much as we need prophetic foresight to anticipate the impacts of AI, we also need rigorous empiricism to trace its actual percolations through the knowledge ecosystem. Transparent, accountable science will be one of our key tools for navigating the AI future with integrity.
Conclusion
This study delivers a revealing glimpse into how AI language tools are already permeating the foundations of scientific self-governance, with peer review as perhaps the canary in the coal mine.
As the use of AI writing aids explodes in society, it’s inevitable they’ll seep into sensitive domains like academic research. I think the challenge is not to futilely play whack-a-mole against each new tool, but to deliberately architect systems, norms, and incentives that channel them towards good epistemic ends.
None of this will be easy. But I believe the community is up to the challenge, if we can approach it with both optimism and clear-eyed realism.
What do you think? I’d love to hear your ideas and initiatives for ensuring AI writing tools enrich, rather than erode, the quality and trustworthiness of science. Together, we can shape an ecosystem where human and machine intelligence synergize to advance the frontiers of reliable knowledge. Let me know in the comments or on Twitter.
AIModels.fyi is a reader-supported publication. To receive new posts and support my work subscribe and be sure to follow me on Twitter!