Abstract
The assessment of student assignments is a fundamental component of the educational process, serving both formative and summative functions. Traditional assessment of assignments relies on lecturer expertise to evaluate student performance and provide targeted feedback. However, recent advancements in artificial intelligence (AI) have raised questions about the feasibility of AI-based assessment tools as a complement or alternative to human evaluation. This study explores the extent to which ChatGPT-4o, a multimodal language model, can perform the assessment of essays (grading and feedback) when measured against the principles of quality assessment.
The rationale for this research stems from the growing need to understand the capabilities and limitations of AI in educational assessment. Given the increasing workload on lecturers, AI tools could offer a means to streamline assessment processes by providing quick, consistent, and potentially objective evaluations. However, concerns remain regarding AI’s ability to assess complex academic writing, particularly in capturing critical analysis, argumentation, and contextual relevance. Ethical considerations, including AI bias, transparency, and the impact on student learning, also play a crucial role in evaluating AI’s potential in educational settings.
The primary research question asks whether ChatGPT-4o adheres to the principles of quality assessment when evaluating and grading academic essays. Secondary questions explore key debates surrounding AI-based assessment in higher education, compare AI-generated and lecturer-based assessment results, and examine ethical implications and potential biases in AI evaluation.
The research aims to:
- Determine whether ChatGPT-4o meets quality assessment standards, specifically when assessing academic essays.
- Explore key debates about the use of large language models as an AI-based assessment tool in the higher education context.
- Compare the assessment results of ChatGPT-4o and a human lecturer.
- Identify ethical concerns and biases associated with AI-based assessment tools.
This study employs a mixed-methods research design, integrating quantitative and qualitative analyses. The sample consists of 18 academic essays from second-year education students at a private higher education institution in Gauteng, South Africa. The assessment process utilises an analytical rubric covering five key criteria: (1) identification and discussion of appropriate texts, (2) use of relevant examples and evidence, (3) critical analysis, (4) language and structure, and (5) technical refinement. The lecturer and ChatGPT-4o independently assessed the essays using this rubric. While the lecturer provided context-specific feedback, ChatGPT-4o was given a zero-shot prompt containing only the assignment instructions, the rubric and the essays, without additional training or contextual input, mainly to determine what type of output ChatGPT-4o would yield with the minimum amount of input.
Results reveal a significant discrepancy between the grades assigned by the lecturer and those assigned by ChatGPT-4o. The AI model consistently awards higher grades, with a mean difference of 14,93% in total grades. The lecturer’s grading demonstrates greater variability, indicating differentiation between strong and weak essays, whereas ChatGPT-4o’s grading remains relatively uniform. A paired t-test confirms that these differences are statistically significant (p < 0,05). A Pearson correlation analysis indicates only a weak positive correlation between the two assessors, further demonstrating a lack of alignment in evaluation standards.
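The statistical comparison described above can be sketched as follows. The grade values below are hypothetical placeholders, not the study's data, and the use of scipy's paired t-test and Pearson correlation is an assumption about how such an analysis could be implemented:

```python
# Minimal sketch of the paired comparison between lecturer and ChatGPT-4o grades.
# The grades below are hypothetical placeholders, NOT the study's actual data.
from scipy import stats

lecturer = [62, 55, 70, 48, 66, 58, 74, 52]  # varied: differentiates strong/weak essays
chatgpt = [74, 76, 73, 71, 75, 72, 77, 74]   # relatively uniform and consistently higher

# Paired t-test: are the per-essay grade differences statistically significant?
t_stat, p_value = stats.ttest_rel(chatgpt, lecturer)

# Pearson correlation: do the two assessors rank the essays similarly?
r, _ = stats.pearsonr(lecturer, chatgpt)

# Mean difference in total grades (AI minus lecturer)
mean_diff = sum(c - l for c, l in zip(chatgpt, lecturer)) / len(lecturer)

print(f"mean difference: {mean_diff:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}, r = {r:.2f}")
```

With placeholder data of this shape (uniform AI grades sitting above varied lecturer grades), the paired t-test yields a small p-value while the correlation stays modest, mirroring the pattern the study reports.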
In terms of feedback, both ChatGPT-4o and the lecturer provide comments on key aspects of the essays. However, the lecturer offers more precise, context-driven critiques, identifying structural weaknesses, gaps in argumentation, and issues with language proficiency. In contrast, ChatGPT-4o’s feedback is more generic and lenient, often failing to penalise essays with inadequate argumentation or incorrect referencing. Additionally, the AI model does not fully recognise the contextual appropriateness of selected texts, which is a crucial aspect of the assignment.
These findings highlight several challenges in AI-based assessment. First, ChatGPT-4o lacks contextual awareness and struggles to apply academic standards effectively. While it recognises basic linguistic and structural elements, it fails to assess argument complexity and the depth of critical analysis. Second, its feedback, though detailed, is largely generic and does not provide the personalised guidance necessary for academic development. Third, the AI’s lenient grading suggests a risk of grade inflation if used without human oversight. Moreover, ethical concerns arise regarding the transparency and reliability of AI assessment tools, particularly in ensuring fairness and addressing potential biases.
Despite these limitations, ChatGPT-4o offers potential advantages. Its ability to provide instant feedback can support students’ learning processes by offering preliminary insights into their writing. However, it should be seen as a supplementary tool rather than a replacement for human assessment. The study underscores the importance of maintaining human oversight to ensure fairness, accuracy, and adherence to academic standards.
Several limitations must be acknowledged. The sample size of 18 essays is relatively small, limiting the generalisability of the findings. Additionally, the study focuses exclusively on essay assessments, whereas AI’s performance in evaluating other forms of assessment, such as multiple-choice or short-answer questions, remains unexplored. Furthermore, ChatGPT-4o was not provided with institution-specific referencing guidelines, which could have influenced its assessment ability. Future research should investigate how AI can be trained with specific criteria to improve its alignment with human grading standards. Moreover, comparative studies involving multiple AI models could offer deeper insights into AI’s evolving role in assessment.
The study concludes that while ChatGPT-4o has potential as an assessment aid, it does not yet meet the standards required for independent grading of academic essays. AI lacks the ability to fully capture cultural, contextual, and finer nuances of academic writing. The discrepancies in grading and feedback suggest that AI should be used cautiously in academic assessment, with human moderation remaining essential to maintain fairness and accuracy. Ethical considerations, including transparency, bias, and reliability, must also be addressed before AI can be widely adopted in educational assessment practices. Future research should focus on refining AI assessment methodologies to enhance their reliability and effectiveness within the academic landscape.
Keywords: Afrikaans 2; academic essays; AI-based assessment tool; artificial intelligence (AI); assessment; ChatGPT-4o; quality assessment practices; traditional lecturer-based assessment

