Higher Grades, Less Learning: Your Brain on ChatGPT
New research makes clear what we all know: grading needs an overhaul.
The MIT Media Lab has released a comprehensive and very accessible study of how the use of ChatGPT affects neural connectivity during essay writing. The researchers divided young adults into three groups, attached electrodes to their heads to monitor brain activity, and gave them 20 minutes to write an essay in response to an SAT-style prompt. Group one had to rely on themselves alone (Brain-only), group two could use search engine tools (Search Engine), and group three had access to ChatGPT (LLM).
The researchers found that the LLM group produced more polished work but simultaneously showed less activation of neural networks and poorer (or no) recall of what they had written, while the Brain-only group showed the reverse: rougher prose, but stronger neural activation and better recall. The Search Engine group fell in the middle, with some activation and some recall.
They concluded that:
The LLM undeniably reduced the friction involved in answering participants’ questions compared to the Search Engine. However, this convenience came at a cognitive cost, diminishing users’ inclination to critically evaluate the LLM’s output or “opinions” (probabilistic answers based on the training datasets).
One important observation for AI enthusiasts is that members of the Brain-only group who were invited back to write a fourth and final essay using an LLM “exhibited significant increase in brain connectivity … suggest[ing] that AI-supported re-engagement invoked high levels of cognitive integration, memory reactivation, and top-down control”.
The takeaway from this research is that the unreflective, unfettered use of ChatGPT is a danger to learning. But tucked into this study is an interesting observation about grading.
The researchers scored the quality of the essays with the help of two English teachers and an AI agent. The graders were asked to rate each essay on a scale of 1 to 5 on qualities such as uniqueness, vocabulary, grammar, organization, content, length, and ChatGPT (i.e., whether it appeared to have been written by ChatGPT). According to the authors of the study, the teachers consistently gave the Brain-only group’s essays higher scores than the LLM group’s.
That is surprising! ChatGPT excels at language. It should do better on metrics such as vocabulary, grammar, organization, and length. Even on content it should score higher, on average, than a human working alone: ChatGPT has the web; we have whatever we’ve stored in memory.
What gives? How did the Brain-only group consistently out-score the LLM group?
The answer lies in how the essays were judged: the human teachers privileged authentic voice over all else. Here is how they describe their approach:
These, often lengthy, essays included standard ideas, reoccurring typical formulations and statements, which made the use of AI in the writing process rather obvious. We, as English teachers, perceived these essays as 'soulless', in a way, as many sentences were empty with regard to content and essays lacked personal nuances. While the essays sounded academic and often developed a topic more in-depth than others, we valued individuality and creativity over objective "perfection".
I am sympathetic to these teachers. I don’t want to read ChatGPT’s thoughts on freedom, art, or anything else either. Nevertheless, it appears that the Brain-only papers were graded on a curve — and that’s how they outscored their LLM counterparts.
If true, then a more nuanced reading of the study’s main conclusion is needed. It’s not that LLM use is worse in every way; rather, LLM-produced essays can be objectively better yet pedagogically worse. That tension exposes a growing dilemma in education: how we grade is increasingly out of step with what students are learning.
Essays are a favoured form of assessment because they foster learning. A well-executed essay requires mastery of facts, strong reasoning, and the integration of knowledge into a wider context. No more. Generative AI has severed the connection between the act of writing and the process of learning.
In light of this uncoupling, we need to reconsider where the best proof of learning lies. Here too, the MIT Media Lab study sheds light: it is not a final paper produced with the help of ChatGPT, nor one that integrates research from the internet, but that imperfect, hard-to-read, error-ridden first draft produced by the student alone. The least polished piece of writing is the best evidence we have that a student is actively engaging their brain.
What would this mean in practice? At a minimum, it would require rethinking how we weight the different iterations of an essay. First drafts (produced without tech) should be worth more than second attempts (which incorporate internet research), while final papers honed through generative AI should count for little at all. Each step in the writing process that brings a paper closer to perfection should count less toward the final grade.
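To make the inversion concrete, here is a minimal sketch of such a grading scheme in Python. The weights and scores are hypothetical, chosen purely for illustration; neither the study nor this article prescribes any particular numbers.

```python
# Illustrative only: the weights below are hypothetical, not taken from
# the study. They encode the proposed inversion: the unaided first draft
# dominates the grade, and the AI-polished final version barely counts.

WEIGHTS = {
    "first_draft": 0.6,  # written without any technology
    "revision": 0.3,     # incorporates internet research
    "final_paper": 0.1,  # honed with generative AI
}

def weighted_grade(scores: dict[str, float]) -> float:
    """Combine per-iteration scores (0-100) into a single grade."""
    return sum(WEIGHTS[stage] * scores[stage] for stage in WEIGHTS)

# A rough, self-made draft outweighs a polished AI-assisted final paper:
print(weighted_grade({"first_draft": 70, "revision": 80, "final_paper": 95}))
# -> 75.5, closer to the unaided draft than to the polished final
```

Under this scheme, a student who hands in a strong unaided draft but a mediocre final paper still outscores one whose only strength is an AI-polished end product.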
That kind of inversion is hard to picture, but if we want to measure learning, and not AI mastery, then essays with fewer facts, less structure, and more spelling errors should be more richly rewarded than the beautiful but ultimately hollow creations made possible by LLMs. The former is evidence of deep learning; the latter is the polish of technology.
It is only when we align incentives with learning that we will foster what we want from a student writing an essay: evidence of grappling with a hard subject, absorbing what is important, and forming the deep connections that make learning stick.
