TECHNICAL MANUAL
Reporting on Reliability & Validity
In the context of assessments, reliability refers to the consistency and dependability of the results obtained from a measurement tool or evaluation. It ensures that the outcomes of an assessment are stable and repeatable, which gives us confidence in the accuracy and precision of the results. Reliability also matters because it allows us to make meaningful interpretations of assessment results. Validity refers to whether an assessment measures what it intends to measure. Without reliability, it is difficult to establish validity, because inconsistent or unreliable measurements may not accurately capture the construct being assessed. Below, we discuss the reliability and validity of A2i assessments using several different approaches.

Reliability

Internal Consistency Reliability

To be useful, assessment results should be reliable: stable, accurate, and dependable. A test's accuracy is estimated by a statistic called the standard error of measurement (SEM). Reliability can be evaluated for computer adaptive tests in a number of ways. One straightforward approach is to evaluate the relationship between a student's "true ability" (the student's exact ability level at the time of testing) and their actual score on a given assessment (Kim, 2012). However, it is impossible to measure a real person's exact proficiency level without any error at all, so this approach is only possible when student test-takers can be simulated, since the simulation supplies the "true ability." Using this type of assessment simulator, the Letters2Meaning (L2M) test was found to have a high squared-correlation reliability of 0.904. Reliability values range from 0 to 1: a value of 1.0 represents perfect reliability, values above 0.7 are considered reliable for most practical purposes, and values near 0.0 indicate low reliability, so results from tests with very low reliability should be interpreted with caution.

Another approach to calculating reliability for this type of assessment is to use the observed relationship between student scores and assessment-specific error measurements. This approach yields reliability estimates of 0.93 to 0.94, depending on the specific methodology chosen.

The Word Match Game (WMG) reliability is lower than that of L2M: currently 0.32, based on the same ratio of scores to error measurements. The current reliability estimates are driven primarily by the accuracy of the item-level difficulty estimation. The L2M items were recalibrated in 2020 using a large dataset (n > 5,000), and this update, combined with improved scoring model parameters, raised the L2M reliability to the values reported above. This work has not yet been completed for the WMG, so a lower reliability is currently reflected for that assessment. The current WMG items were originally calibrated on a smaller dataset (n < 1,000) for a study conducted during the 2015–2016 school year. Psychometric analysis of the WMG did reveal that the total test information was greater than 2.0 throughout the range of Rasch theta scores, suggesting that computer adaptive administration of the WMG can produce reliable individual scores across the full range of student abilities. (Information values consistently greater than 2.0 correspond to a reliability greater than 0.7 for the assessment.)
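To make the simulation-based approach above concrete, the following is a minimal sketch in Python. The error model (normally distributed noise with a standard deviation of about 0.33) is an illustrative assumption chosen so that the squared correlation lands near the reported 0.904; the actual L2M simulator and adaptive scoring procedure are not described here.

    # Minimal sketch of the simulation-based reliability check.
    # The "true" abilities are known because we generate them; the
    # estimated scores stand in for what the adaptive test would report.
    # The noise term is an illustrative assumption, not the actual
    # A2i simulation procedure.
    import numpy as np

    rng = np.random.default_rng(0)

    n_students = 5000
    true_theta = rng.normal(0.0, 1.0, size=n_students)       # simulated "true ability"
    noise = rng.normal(0.0, 0.33, size=n_students)            # assumed measurement error
    estimated_theta = true_theta + noise                      # scores the test would report

    # Squared correlation between true and estimated ability is the
    # reliability metric described above (0 = no relationship, 1 = perfect).
    r = np.corrcoef(true_theta, estimated_theta)[0, 1]
    print(f"squared-correlation reliability: {r**2:.3f}")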
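The exact score-versus-error formula used for the 0.93 to 0.94 estimates is not specified above, so the sketch below uses one common convention, marginal reliability: the share of observed score variance not attributable to measurement error. The scores and SEMs here are synthetic and for illustration only.

    # Generic sketch of a score-vs-error reliability estimate, assuming a
    # marginal reliability definition; the specific A2i methodology may differ.
    import numpy as np

    def marginal_reliability(scores: np.ndarray, sems: np.ndarray) -> float:
        """Reliability = 1 - (mean error variance / observed score variance)."""
        error_variance = np.mean(sems ** 2)         # average squared SEM per student
        observed_variance = np.var(scores, ddof=1)  # variance of reported scores
        return 1.0 - error_variance / observed_variance

    # Illustrative data: reported theta scores and their per-student SEMs.
    rng = np.random.default_rng(1)
    scores = rng.normal(0.0, 1.0, size=1000)
    sems = np.full(1000, 0.25)
    print(f"estimated reliability: {marginal_reliability(scores, sems):.3f}")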
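Finally, the stated link between test information and reliability can be made explicit. For a Rasch model, the standard error of measurement at a given ability level is SEM(theta) = 1 / sqrt(I(theta)). Converting that to a reliability additionally requires a population score variance, which is not reported here; the variance used below is an assumption chosen to illustrate the rule of thumb that information above 2.0 corresponds to reliability above 0.7.

    # Sketch of the conventional information-to-reliability link under a
    # Rasch model. The population variance is an assumed value, not a
    # figure reported for A2i assessments.
    import math

    def sem_from_information(information: float) -> float:
        """Standard error of measurement implied by Fisher information."""
        return 1.0 / math.sqrt(information)

    def reliability(information: float, population_variance: float) -> float:
        """Share of score variance not attributable to measurement error."""
        return 1.0 - sem_from_information(information) ** 2 / population_variance

    # With information of 2.0 and an assumed population variance of ~1.67,
    # reliability lands right at the 0.7 threshold cited above.
    print(f"SEM at I = 2.0: {sem_from_information(2.0):.3f}")
    print(f"reliability at I = 2.0: {reliability(2.0, 5/3):.3f}")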