【Summary】Argumentative Essay Assessment with LLMs: A Critical Scoping Review (Favero et al., 2026)

Background

Argumentative writing is a core academic and civic competency, requiring learners to formulate claims, support them with evidence, and articulate coherent reasoning. Automated Essay Scoring (AES) has long been proposed as a scalable alternative to manual scoring, and the emergence of LLMs has dramatically accelerated this field. Early AES systems focused on general writing quality rather than argumentative reasoning, and prior reviews have not addressed LLM-based scoring of argumentative essays specifically. The field has grown rapidly but without consolidated methodological, psychometric, or ethical foundations.

Motivation

Despite rapid growth, LLM-based automated argumentative essay scoring (AAES) remains conceptually unsettled. Scoring rubrics in existing datasets disproportionately emphasize rhetorical and linguistic fluency while neglecting deeper argumentative constructs such as logical cogency, evidential sufficiency, and dialectical engagement. Critical concerns persist around reliability, construct validity, fairness, and responsible deployment — particularly given the high-stakes educational contexts in which such systems may be used. No prior systematic or scoping review had addressed these gaps comprehensively.

Research Questions

  • RQ1: How are LLMs currently employed for automated scoring of argumentative essays and feedback provision in educational settings — what techniques, datasets, and evaluation methodologies are used, and what methodological gaps remain?
  • RQ2: To what extent do LLM-based AAES approaches align with human judgment in terms of psychometric validity and the FATEN principles (Fairness, Augmentation, Transparency, bEneficence, Non-maleficence) for responsible educational assessment?

Methods

The review followed PRISMA 2020 guidelines with a preregistered protocol. From an initial corpus of 3,467 records retrieved across 8 databases (Google Scholar, Springer, ACM, ScienceDirect, arXiv, Web of Science, ERIC, PubMed), 46 studies meeting predefined inclusion criteria were selected. Inclusion required: publication between January 2022 and October 2025; focus on argumentative essays written in English; multi-trait rubric-based scoring; and LLMs as the primary technical approach. Data were extracted across six structured dimensions — datasets, scoring traits, LLM families, technical approaches, evaluation metrics, and analytical frameworks — and essay traits were mapped onto the Romberg et al. Argument Quality (AQ) taxonomy for construct-level analysis.
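As a rough illustration of this extraction step, a per-study record covering the six dimensions might look like the sketch below; the field names and example values are assumptions for illustration, not the review's actual coding scheme.

```python
from dataclasses import dataclass, field

@dataclass
class StudyRecord:
    """Hypothetical per-study extraction record mirroring the six dimensions."""
    study_id: str
    datasets: list[str] = field(default_factory=list)              # e.g. ["ASAP++"]
    scoring_traits: list[str] = field(default_factory=list)        # rubric trait names
    llm_families: list[str] = field(default_factory=list)          # e.g. ["GPT-4", "Llama"]
    technical_approaches: list[str] = field(default_factory=list)  # prompting, fine-tuning, ...
    evaluation_metrics: list[str] = field(default_factory=list)    # e.g. ["QWK"]
    analytical_frameworks: list[str] = field(default_factory=list) # e.g. ["FATEN", "AQ taxonomy"]
```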

Results

Datasets. 29 datasets were identified across the 46 studies. The field is anchored to a small set of benchmarks, with ASAP/ASAP++ used in 30% of studies. Newly introduced datasets are typically small (70% contain ≤120 essays), and 50% are proprietary or otherwise inaccessible. Prompt coverage is narrow: 44% of datasets rely on a single prompt.

Traits. 82 distinct trait names were identified. Mapping onto the AQ taxonomy revealed heavy concentration in Rhetorical effectiveness (~56% of traits on average), while Logical cogency is nearly absent (5%) and Dialectical reasonableness and Deliberative norms are sparsely and inconsistently represented (both ~25%). Current benchmarks thus evaluate stylistic fluency far more than substantive reasoning.
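To make the construct-level analysis concrete, the sketch below shows how trait names might be mapped onto the four AQ dimensions; the specific assignments are hypothetical illustrations, not the review's actual mapping.

```python
from collections import Counter

# Hypothetical mapping of dataset trait names onto the four AQ dimensions;
# the assignments below are illustrative only, not the review's actual coding.
TRAIT_TO_AQ_DIMENSION = {
    "Organization":         "Rhetorical effectiveness",
    "Word Choice":          "Rhetorical effectiveness",
    "Sentence Fluency":     "Rhetorical effectiveness",
    "Evidence Sufficiency": "Logical cogency",
    "Counterargument":      "Dialectical reasonableness",
    "Audience Respect":     "Deliberative norms",
}

# Counting traits per dimension makes the skew toward rhetoric visible at a glance.
print(Counter(TRAIT_TO_AQ_DIMENSION.values()))
```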

LLMs. 87% of studies used proprietary GPT-family models. Open-weight models (predominantly Llama) appeared in 35% of studies, often matching GPT performance when paired with robust prompting strategies. Only 20% of studies used reasoning-optimized LLMs, though these consistently yielded stronger outcomes.

Technical Approaches. All studies used rubric-based prompting. Fine-tuning was adopted in only 22% of studies, multi-agent architectures in 15%, and reinforcement learning in 4%. The field is dominated by prompt engineering with limited methodological diversification.
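As a minimal sketch of what rubric-based prompting typically looks like, the example below assumes an OpenAI-style chat completion API; the rubric wording, trait names, and model name are placeholders rather than details taken from any specific reviewed study.

```python
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC_PROMPT = """You are an essay rater. Score the essay below on each trait
from 1 (poor) to 6 (excellent), then briefly justify each score.

Traits:
- Claim clarity
- Evidence use
- Organization
- Language conventions

Return one line per trait in the form "<trait>: <score> - <justification>".

Essay:
{essay}
"""

def score_essay(essay_text: str, model: str = "gpt-4o-mini") -> str:
    """Single-call, zero-shot rubric-based scoring: the dominant pattern in the reviewed studies."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce sampling variance in the assigned scores
        messages=[{"role": "user", "content": RUBRIC_PROMPT.format(essay=essay_text)}],
    )
    return response.choices[0].message.content
```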

Evaluation. Quadratic Weighted Kappa (QWK) was the dominant metric (52% of studies). Among studies reporting QWK, 38% achieved substantial agreement (0.61–0.80) and 21% near-perfect agreement (0.81–1.00), but performance varied widely across prompts, datasets, and model configurations. Robustness analyses revealed sensitivity to sampling randomness, score distributions, and English proficiency levels.
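For reference, QWK between human and LLM scores can be computed with scikit-learn's cohen_kappa_score using quadratic weights; the scores in this sketch are invented purely to show the call, not data from the review.

```python
from sklearn.metrics import cohen_kappa_score

# Invented human and LLM scores on a 1-6 scale, for illustration only.
human = [3, 4, 4, 5, 2, 3, 4, 5, 3, 4]
llm   = [3, 4, 5, 5, 2, 2, 4, 4, 3, 4]

qwk = cohen_kappa_score(human, llm, weights="quadratic")
print(f"QWK = {qwk:.2f}")  # 0.61-0.80 is conventionally read as "substantial" agreement
```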

FATEN Analysis. Fairness: no consistent direction of score bias emerged, but style bias, L1 sensitivity, and compressed score distributions were recurrent issues. Pedagogical alignment: feedback was often judged useful but sometimes too complex or misaligned with argumentative dimensions. Transparency: trait-level decomposition was common, but feature-based explainability remained limited. Beneficence and Non-maleficence: privacy, data security, environmental impact, and safety were almost entirely unaddressed empirically.

Discussion

The review identifies five structural problems. First, dataset fragmentation and benchmark dependence limit generalizability and invite data contamination risks. Second, the systematic underrepresentation of Logical cogency and Deliberative norms means current systems assess surface fluency rather than argumentative competence, raising construct validity concerns. Third, the dominance of proprietary GPT models raises reproducibility, transparency, and privacy concerns — especially when essays are processed through consumer-facing platforms without institutional governance. Fourth, evaluation practices are uneven: QWK is frequently misinterpreted, adjacent-agreement metrics are not comparable across scales, and uncertainty is rarely quantified. Fifth, responsible AI dimensions — privacy-by-design, environmental sustainability, human oversight, and pedagogical alignment — remain largely unexamined. Notably, open and smaller LLMs can achieve performance comparable to GPT variants when supported by strong prompting design, suggesting that scoring quality depends more on methodology than on raw model scale.

Conclusion

LLMs have demonstrably advanced AAES, but current systems do not yet provide a psychometrically robust, equitable, or pedagogically grounded alternative to human evaluation. The field requires: (1) construct-valid, publicly available datasets with broad prompt coverage and fine-grained argumentative traits; (2) theoretically grounded rubrics aligned with argumentation research; (3) standardized psychometric evaluation protocols with uncertainty quantification; (4) methodological expansion beyond prompt engineering toward fine-tuning, multi-agent systems, and hybrid neuro-symbolic approaches; and (5) embedding FATEN principles throughout the design and deployment pipeline. The central question is no longer whether LLMs can score essays, but whether they can do so in ways that reflect the complexity of human reasoning and the norms of responsible educational assessment.