IEA Insider 2025


The AI Horizon in TIMSS and PIRLS: Research Today, Transformation Tomorrow

BY MATTHIAS VON DAVIER

The integration of AI and process data isn’t just a futuristic dream for TIMSS and PIRLS; it’s an active frontier of research that will fundamentally reshape how we measure global learning. While we’re still firmly grounded in exploration and validation, the trajectory is clear. AI isn’t coming to international large-scale assessments (ILSAs) at some point; it’s rapidly becoming an indispensable engine within them, particularly in two critical areas: automated scoring and automated test assembly. And looking just over the next cycle, we see a fascinating new role emerging—AI as a collaborative assistant for our test development experts. Research conducted at Boston College in the TIMSS & PIRLS International Study Center paves the way for these applications.

AUTOMATED TEST ASSEMBLY

The science of building the tests themselves is being transformed by advances in AI. Automated Test Assembly (ATA) is moving beyond theory into practical research application. The TIMSS & PIRLS International Study Center is testing sophisticated algorithms capable of simultaneously balancing an array of requirements: content coverage across domains, psychometric targets (like difficulty and discrimination), diverse response formats, and even constraints on test length and item exposure. The goal is to assemble multiple high-quality, psychometrically equivalent test forms far more efficiently than ever before. This isn’t just about saving time; it allows content experts to focus on the essential fine-tuning. It also enables more adaptive, targeted assessment designs that can provide more precise insights while maintaining the comparability that earns TIMSS and PIRLS their reputation as gold standards for monitoring trends. At the same time, it is important to stress that humans remain at the center of the process, steering decisions and ensuring quality—AI serves only as a powerful tool that supports, not replaces, expert judgment.

AUTOMATED SCORING

Automated scoring, like test assembly, is undergoing significant changes due to the evolution of AI. Historically, the sheer volume and complexity of responses, especially graphical or written responses to open-ended questions, presented immense logistical and consistency challenges, particularly across multiple languages. Human scoring, while essential, is resource-intensive and susceptible to subtle drift. Our research, however, is demonstrating how powerful AI solutions can be. Our work has shown we can successfully train sophisticated neural networks—Convolutional Neural Networks (CNNs)—to automatically score complex TIMSS graphical responses, like drag-and-drop diagrams (Tyack et al., 2024). Achieving over 92 percent accuracy compared to expert human raters isn’t just impressive; it’s transformative. It means we can score these items with unprecedented speed and consistency, flagging potential issues for human review and ensuring truly comparable cross-country evaluations of skills previously harder to capture reliably.

Administering ILSAs such as TIMSS and PIRLS in more than 100 language versions poses further challenges to scoring consistency. Jung et al. (2024) tackled this head-on, integrating machine translation with advanced natural language processing (NLP) and AI scoring models. Their approach—translating responses into a single, common language, applying AI scoring, and verifying rigorously with human raters—achieved remarkably high agreement with bilingual raters. This breakthrough isn’t just about efficiency; it’s about dismantling language barriers in scoring, a crucial step toward ensuring genuine fairness and comparability in our global assessments. For PIRLS 2026, we’re actively planning comparisons between AI and human scoring to monitor reliability. Current research is advancing the frontier by developing even more robust AI models that can understand the semantic nuances of open text responses directly across multiple languages, further refining accuracy and validity.

FUTURE AI ENDEAVORS FOR TIMSS AND PIRLS

Supporting the tasks of both automated test assembly and scoring, AI-driven item classification and feature extraction lead to further efficiency gains. Research funded by IEA at our center is exploring how machine learning models can predict key item characteristics—primarily content domain alignment and, to a more challenging extent, cognitive domain and difficulty parameters. While predicting content alignment shows high promise, accurately classifying cognitive demand and difficulty purely via AI remains a complex puzzle, mirroring the challenges even human experts sometimes face. This ongoing research is vital, as accurately understanding these features is essential for effective test assembly and, ultimately, for ensuring the validity of the inferences we draw from the assessments.

So, where is this all heading in the very near future? Beyond refining automated scoring and assembly systems for operational use in the next cycles, we see a paradigm shift on the horizon: AI as a collaborative partner for test developers.

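To make the kind of constraint balancing that automated test assembly involves concrete, the sketch below shows a deliberately tiny version of the idea: from an invented item pool, select a fixed-length form that covers every content domain while landing as close as possible to a target mean difficulty. The item pool, domain labels, and difficulty values are all hypothetical, and real ATA systems optimize many more constraints (discrimination, response format, item exposure) with far larger pools; this is an illustration of the principle, not the Study Center’s actual software.

```python
# Toy automated test assembly (ATA) sketch: pick a fixed-length form that
# covers every content domain and lands near a target mean difficulty.
# Item pool, domains, and difficulty values are invented for illustration.
from itertools import combinations

POOL = [  # (item_id, content_domain, IRT-style difficulty b)
    ("M01", "number",   -0.8), ("M02", "number",    0.3),
    ("M03", "algebra",  -0.2), ("M04", "algebra",   1.1),
    ("M05", "geometry",  0.0), ("M06", "geometry", -1.0),
    ("M07", "data",      0.6), ("M08", "data",     -0.4),
]

def assemble(pool, length, target_b):
    """Score every form of the given length; return the one that covers
    all content domains with mean difficulty closest to target_b."""
    domains = {d for _, d, _ in pool}
    best, best_gap = None, float("inf")
    for form in combinations(pool, length):
        if {d for _, d, _ in form} != domains:   # content-coverage constraint
            continue
        gap = abs(sum(b for _, _, b in form) / length - target_b)
        if gap < best_gap:                        # psychometric target
            best, best_gap = form, gap
    return best

form = assemble(POOL, length=4, target_b=0.0)
print([item_id for item_id, _, _ in form])
```

Exhaustive search works only at toy scale; operational ATA typically casts these constraints as a mixed-integer program and assembles many parallel forms at once, but the trade-off being automated is the same.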

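The planned AI-versus-human scoring comparisons come down to agreement statistics on paired scores. The sketch below, using invented partial-credit codes, computes the two measures most often reported in such reliability checks: exact agreement and Cohen’s kappa, which corrects agreement for chance. It is a generic illustration of these standard statistics, not the specific monitoring procedure used for PIRLS 2026.

```python
# Toy reliability check for AI vs. human scoring: exact agreement and
# Cohen's kappa on paired scores (invented 0/1/2 partial-credit codes).
from collections import Counter

def exact_agreement(human, ai):
    """Share of responses where both raters assigned the same score."""
    return sum(h == a for h, a in zip(human, ai)) / len(human)

def cohens_kappa(human, ai):
    """Chance-corrected agreement between two raters."""
    n = len(human)
    p_obs = exact_agreement(human, ai)
    h_counts, a_counts = Counter(human), Counter(ai)
    # Agreement expected if the two raters scored independently
    p_exp = sum(h_counts[c] * a_counts[c] for c in set(human) | set(ai)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

human = [2, 1, 0, 2, 2, 1, 0, 0, 1, 2]
ai    = [2, 1, 0, 2, 1, 1, 0, 0, 1, 2]
print(f"exact agreement: {exact_agreement(human, ai):.2f}")  # 0.90
print(f"Cohen's kappa:   {cohens_kappa(human, ai):.2f}")     # 0.85
```

In practice such monitoring would also track per-item and per-language agreement, and for ordinal scores a weighted kappa is often preferred so near-misses count less heavily than large discrepancies.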