
Is It Human, or Is It AI?
A new study identifies differences between human and AI-generated text
A team of Carnegie Mellon University researchers set out to see how accurately large language models (LLMs) can match the style of text written by humans, and their findings were recently published in the Proceedings of the National Academy of Sciences (PNAS).
“We humans, we adapt how we write and how we speak to the situation. Sometimes we're formal or informal, or there are different styles for different contexts,” said Alex Reinhart, lead author and associate teaching professor in the Department of Statistics & Data Science. “What we learned is that LLMs, like ChatGPT and Llama, write a certain way, and they don't necessarily adapt the writing style to the context. Their style is actually very distinctive from how humans normally write or speak in different contexts. Nobody has measured or quantified this in the way we were able to do.”
In this study, Reinhart and his team showed how LLMs write by prompting them with extracts of writing from various genres, such as TV scripts and academic articles. Using code written by David West Brown, associate teaching professor in the Department of English and co-author of the study, they found large differences in grammatical, lexical and stylistic features between text written by LLMs and humans. These differences were largest for instruction-tuned models, such as ChatGPT, which undergo additional training to answer questions and follow instructions.
According to the researchers, LLMs used present participle clauses at 2 to 5 times the rate found in human-written text, as in this sentence written by GPT-4o: “Bryan, leaning on his agility, dances around the ring, evading Show’s heavy blows.”
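For readers curious how this kind of feature counting works, here is a minimal sketch, not the study's actual measurement code. It approximates present participle clauses with spaCy; the heuristic (VBG-tagged tokens heading "advcl" or "acl" dependencies) is an assumption made for illustration.

```python
# A heuristic count of present participle clauses using spaCy
# (requires: pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def participle_clause_rate(text: str) -> float:
    """Present participle clauses per 1,000 tokens (approximation:
    VBG-tagged tokens heading adverbial or adjectival clauses)."""
    doc = nlp(text)
    clauses = sum(
        1 for tok in doc
        if tok.tag_ == "VBG" and tok.dep_ in ("advcl", "acl")
    )
    return 1000 * clauses / max(len(doc), 1)

# The GPT-4o example from the study: "leaning" and "evading" should
# both register as participial clause heads.
sentence = ("Bryan, leaning on his agility, dances around the ring, "
            "evading Show's heavy blows.")
print(participle_clause_rate(sentence))
```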
The models also used nominalizations at 1.5 to 2 times the rate of humans, while GPT-4o used the agentless passive voice at half the rate of humans. This suggests that LLMs are trained to write in an informationally dense, noun-heavy style, which limits their ability to mimic other writing styles. The researchers also found that instruction-tuned LLMs have distinctive vocabularies, using some words much more often than humans writing in the same genre. For example, versions of ChatGPT used “camaraderie” and “tapestry” about 150 times more often than humans do, while Llama variants used “unease” 60 to 100 times more often. Both models had strong preferences for “palpable” and “intricate.”
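The word-frequency comparison behind these vocabulary findings can be sketched in a few lines. The corpus file names below are hypothetical placeholders, and the study's actual counts came from genre-matched corpora rather than this simple tokenizer.

```python
# Comparing per-million-word frequencies between two corpora.
# "human_corpus.txt" and "llm_corpus.txt" are hypothetical placeholders.
from collections import Counter
import re

def freq_per_million(text: str) -> Counter:
    """Per-million-word frequency of each lowercase token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return Counter({w: 1_000_000 * c / total for w, c in counts.items()})

human = freq_per_million(open("human_corpus.txt").read())
llm = freq_per_million(open("llm_corpus.txt").read())

for word in ["camaraderie", "tapestry", "palpable", "intricate"]:
    # Guard against division by zero when a word never appears
    # in the human corpus.
    ratio = llm[word] / max(human[word], 1e-9)
    print(f"{word}: LLM/human frequency ratio = {ratio:.1f}")
```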
“There (has been) a lot of anxiety circulating amongst teachers. And I thought to myself, as someone in an English department who does computational work and works a lot with data science, that this is not really what writers do,” Brown said. “We don't write once. We write over and over and over and over again. So, the question was: can (LLMs) generate a one-off that looks plausible?
“The message that I think we really wanted to communicate was to think very carefully about under what circumstances (using LLMs) might be fine,” Brown said. “I care that my doctor's notes are accurate. I don't really care if they're in the voice of my doctor. But if I'm writing a job application letter where I want to stand out, that matters a great deal. As instructors, writers and communicators, we need to be aware of LLMs’ idiosyncrasies and shortcomings.”
Reinhart also noted growing concerns about what happens if students use LLMs to complete assignments.
“Some people will say it's like when we got calculators for math class. And now you just use the calculator, and it's great. What we learned is, it's not quite like a calculator,” Reinhart said. “You use a calculator, it does the same math you were going to do, but it doesn't screw up and forget to carry the one. But here, you’re getting something different than what a typical human would write.”
The researchers noted that further study and a broader look at more LLMs are needed to understand the importance and impact of instruction tuning on these models. An ongoing project by Ph.D. student Ben Markey involves studying how LLMs can be used to evaluate human writing, such as student essays, and how consistent their evaluations are.
“Can you give a large language model, say, an essay and have it evaluate it?” Brown asked. “What (Markey) is doing is, rather than giving an LLM an essay just once, giving it the criteria and the essay over and over and over and over again. Is it going to give you the same score, or is it going to do different things every time? So, we're also thinking about other kinds of applications with these models to see if we can understand them.”
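As a rough sketch of that kind of consistency check, the snippet below scores the same essay repeatedly and measures the spread. It assumes access to the OpenAI Python SDK; the model name, rubric, file names and prompt are illustrative assumptions, not Markey's actual study design.

```python
# Scoring the same essay repeatedly to measure consistency.
# Assumes the OpenAI Python SDK (pip install openai) and an API key in
# the OPENAI_API_KEY environment variable; the model, prompt and file
# names are illustrative only.
import statistics
from openai import OpenAI

client = OpenAI()

def score_essay(essay: str, rubric: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Score the essay from 1 to 10 against this rubric. "
                        "Reply with only the number.\n" + rubric},
            {"role": "user", "content": essay},
        ],
    )
    # A production version would parse the reply more defensively.
    return float(response.choices[0].message.content.strip())

essay = open("student_essay.txt").read()
rubric = open("rubric.txt").read()
scores = [score_essay(essay, rubric) for _ in range(10)]
print("mean:", statistics.mean(scores), "stdev:", statistics.stdev(scores))
```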
The team of CMU researchers who contributed to this study includes Reinhart; Ben Markey, a Ph.D. student in the Department of English; Kachatad Pantusen, an undergraduate student in the Dietrich College of Humanities and Social Sciences; Ron Yurko, assistant teaching professor in the Department of Statistics & Data Science; Gordon Weinberg, instructor in the Department of Statistics & Data Science; and Brown. Michael Laudenbach, an assistant professor at the New Jersey Institute of Technology and alumnus of CMU’s Department of English, was also part of the research team.