AI matches human graders in ranking macroeconomics exam text responses

How does high population growth affect gross domestic product? Economics students are all too familiar with exam questions like this. Answering such free-text questions requires not only specialist knowledge but also the ability to think and argue economically. Grading the answers, however, is a time-consuming task for university teaching assistants: each one must be checked and assessed individually.

Could artificial intelligence do this work? Researchers from the University of Passau in the fields of economics and computer science have investigated this question. Their study was recently published in Scientific Reports. The results showed that OpenAI’s GPT-4 language model performs similarly to human examiners in ranking open-text answers.

The results at a glance:

  • When the AI model was asked to rank text responses by correctness and completeness—in the sense of best, second-best, or worst answer—GPT's assessments were comparable to those of human examiners (a hypothetical prompt sketch follows this list).
  • Students cannot impress GPT with AI-generated texts: GPT showed no significant preference for AI-generated or longer answers.
  • When evaluating text responses on a points scale, the AI model performed slightly worse. GPT tended to grade more generously than humans, in some cases by almost a full grade.
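
To make the ranking setup in the first bullet concrete, here is a minimal, hypothetical sketch of how a set of answers could be passed to GPT-4 for ordering via OpenAI's chat API. The prompt wording, the placeholder answers, and the reply format are illustrative assumptions, not the prompts used in the study.

```python
# Hypothetical ranking request; not the study's actual prompt or parameters.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "How does high population growth affect gross domestic product?"
answers = ["<answer A>", "<answer B>", "<answer C>"]  # placeholder student texts

prompt = (
    f"Exam question: {question}\n\n"
    + "\n\n".join(f"Answer {i + 1}: {a}" for i, a in enumerate(answers))
    + "\n\nRank these answers from best to worst by correctness and "
      "completeness. Reply with the answer numbers only, e.g. '2, 1, 3'."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```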

The researchers conclude that AI cannot yet replace human markers. “Writing good sample solutions and re-checking must remain human tasks,” explains Professor Johann Graf Lambsdorff, Chair of Economic Theory at the University of Passau, who was responsible for the experimental design of the study together with Deborah Voß and Stephan Geschwind.

Computer scientist Abdullah Al Zubaer handled the technical implementation and evaluation under the supervision of Professor Michael Granitzer (Data Science). The researchers argue that exam grading should continue to be closely supervised by humans. However, AI is well suited to the role of a critical second examiner.

New method for comparing AI and human assessment

There are already several studies that assess AI as an examinee. Studies on AI as an examiner, however, are rare, and the few that exist treat human assessment as the ground truth. The Passau team goes one step further: it investigated whether AI assessments can compete with those of human examiners—without assuming that humans are always right.

For the experiment, the researchers used students' free-text answers to six questions from a macroeconomics course, selecting 50 answers per question. The resulting 300 answers were evaluated by trained grading assistants. In parallel, GPT was given the same evaluation task.

Since there is no single “correct” answer to an open-ended question, it is unclear whether a disagreement reflects an error by the AI or by the humans. To make a comparison possible nonetheless, the research team used a workaround: it treated the degree of agreement between evaluations as a measure of proximity to a presumed truth. The higher the agreement, the closer to the truth.

The starting point was the level of agreement among the human examiners. One examiner was then replaced by GPT. If the swap raised the level of agreement, this was taken as an indication that the AI's assessment was better than the replaced human's. In fact, GPT slightly increased agreement on individual questions.
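
A minimal numerical sketch of this swap test follows, under assumptions not taken from the paper: three human examiners, rankings encoded as rank vectors, and Kendall's tau as the agreement measure (the study's actual panel size and statistics may differ).

```python
# Agreement-swap sketch: does replacing one human examiner with GPT raise
# the panel's mean pairwise agreement? All data below are made up.
from itertools import combinations
from scipy.stats import kendalltau

def mean_pairwise_agreement(rankings):
    """Average Kendall's tau over all pairs of examiners' rank vectors."""
    pairs = list(combinations(rankings, 2))
    return sum(kendalltau(a, b)[0] for a, b in pairs) / len(pairs)

# human_rankings[i][j] = rank examiner i assigned to answer j (0 = best).
human_rankings = [
    [0, 1, 2, 3, 4],
    [0, 2, 1, 3, 4],
    [1, 0, 2, 4, 3],
]
gpt_ranking = [0, 1, 2, 4, 3]

baseline = mean_pairwise_agreement(human_rankings)

# Swap GPT in for each human in turn and recompute the panel agreement.
for i in range(len(human_rankings)):
    panel = human_rankings[:i] + [gpt_ranking] + human_rankings[i + 1:]
    print(f"replace examiner {i}: {mean_pairwise_agreement(panel):.3f} "
          f"(all-human baseline: {baseline:.3f})")
```

If the swapped panel's mean agreement exceeds the all-human baseline, GPT's ranking lies closer to the panel consensus than the ranking of the examiner it replaced—this is the sense in which the team read higher agreement as proximity to a presumed truth.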

“We were partly surprised ourselves at how well the AI performed in some of the assessments,” says Voß.

Al Zubaer adds, “In our tests, the quality of GPT-4 remained largely stable even with imprecise or incorrect instructions.” According to the team, this shows that the AI is robust and versatile, even if it still performs slightly worse in point-based assessments.

More information:
Abdullah Al Zubaer et al, GPT-4 shows comparable performance to human examiners in ranking open-text answers, Scientific Reports (2025). DOI: 10.1038/s41598-025-21572-8

Provided by
University of Passau

Citation:
AI matches human graders in ranking macroeconomics exam text responses (2025, October 24)
retrieved 25 October 2025
from https://phys.org/news/2025-10-ai-human-graders-macroeconomics-exam.html

