2.5 Scoring and Calibration
Many psychometric CBA applications rest on principles of psychological and educational measurement (see, e.g., Veldkamp and Sluijter 2019), and typical design principles for tasks/items and tests that also apply to non-technology-based assessments remain valid (see, e.g., Downing and Haladyna 2006). For instance, approaches to increasing measurement efficiency use a particular item response theory (IRT) model and (pre-)calibrated item parameters (see also section 2.7):
- Difficulty Parameter or Threshold Parameters,
- Discrimination Parameter (Loading) and
- (Pseudo)Guessing Parameter.
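These three kinds of parameters can be combined, for instance, in the three-parameter logistic (3PL) model. The following is a minimal sketch (the parameter values are illustrative, not taken from any calibrated item pool):

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL IRT model.

    theta: latent ability, a: discrimination, b: difficulty,
    c: pseudo-guessing parameter (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# With c = 0 the model reduces to the 2PL; with a = 1 and c = 0 to the Rasch model.
p = p_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)  # ability equal to difficulty -> 0.6
```

When ability equals difficulty, the probability is halfway between the guessing floor and 1, which is why the example evaluates to 0.6.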
Which item parameters are used to model the probability of a correct response (or a response in a particular category) as a function of a one- or multidimensional latent ability depends on the choice of a concrete IRT model (see, for instance, Embretson and Reise 2013 for a general introduction). Bolt (2016) lists the following IRT models for CBA:
- Models for innovative item types (polytomous IRT models and multidimensional IRT models) and models related to testlet-based administration (see section 2.5.1)
- Models that attend to response times (see section 2.5.6)
- Models related to item and test security (see section 2.10.1)
Further areas of application for (IRT) models in the context of CBA include:
- Models to deal with missing values (see section 2.5.2)
- Models to deal with rapid guessing (see section 2.5.3)
- Models for automated item generation (see section 2.6)
Another class of IRT models that is often used with CBA data are Cognitive Diagnostic Models (CDM; see, e.g., George and Robitzsch 2015 for a tutorial). Additional information available when data are collected with CBA, such as response times and more general process data (see section 2.8), can be used in cognitive diagnostic modeling (e.g., Jiao, Liao, and Zhan 2019).
2.5.1 Scoring of Items
Computer-based assessment consists of a sequence of assessment components, i.e., instructions and between-screen prompts, and purposefully designed (digital) environments in which diagnostic evidence can be collected. The atomic parts for gathering diagnostic evidence are called items, typically consisting of a prompt (or a question) and the opportunity to provide a response. Items can include introductory text, graphics, tables, or other information required to respond to the prompt, or questions can refer to a shared stimulus (creating a hierarchical structure called units). The raw response provided to items is typically translated into a numerical value called a Score.
Dichotomous vs. Polytomous Scoring: For the use of responses to determine a person’s score, a distinction is commonly made between dichotomous scoring (incorrect vs. correct) and polytomous scoring (e.g., no credit, partial credit, full credit). While this distinction is essential, for example, for IRT models, the two kinds of scoring are not mutually exclusive with respect to the use of information collected in computer-based assessment. Identical responses can be scored differently depending on the intended use. For instance, a polytomous score can be used for ability estimation, while multiple dichotomous indicators for specific potential responses can provide additional insight in formative assessment scenarios.
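The point that one raw response can yield several scores can be sketched for a multiple-selection item (the scoring rules below are illustrative, not a standard):

```python
def score_multiple_selection(selected, correct):
    """Score a multiple-selection response two ways (illustrative rules).

    selected: set of options the test-taker checked.
    correct: set of options keyed as correct.
    """
    selected, correct = set(selected), set(correct)
    hits = len(selected & correct)
    false_alarms = len(selected - correct)
    # Dichotomous score: full credit only for the exact correct pattern.
    dichotomous = int(selected == correct)
    # Polytomous score: partial credit per correct selection, reduced
    # by wrongly selected options (floored at zero).
    polytomous = max(hits - false_alarms, 0)
    return dichotomous, polytomous

# The same raw response produces two different scores for different uses.
print(score_multiple_selection({"A", "C"}, {"A", "B", "C"}))  # (0, 2)
```

Here the dichotomous score could feed a simple IRT model, while the polytomous score (or the individual selection indicators) could drive formative feedback.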
Multiple Attempts & Answer-Until-Correct: In computer-based assessments, the process of test-taking and responding can also be included in the scoring. This allows scenarios in which test-takers can answer an item multiple times (e.g., Attali 2011) or until correct (e.g., DiBattista 2013; Slepkov and Godfrey 2019). While this goes beyond simple IRT models, it can be combined with a traditional scoring of the first attempt or of the final response prior to feedback (see section 2.9).
Constructed Response Scoring: Scores can be calculated automatically for closed response formats (i.e., items that require selecting one or multiple presented response options). For response formats beyond that, scoring can require human raters, pattern recognition (e.g., using Regular Expressions, see section 6.1), or Natural Language Processing (NLP) techniques and machine learning (typically using some human-scored training data, see, for instance, Yan, Rupp, and Foltz 2020).
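Pattern-based scoring of short constructed responses can be illustrated with a regular expression; the pattern below is a hypothetical example that accepts expansions of pi with either a decimal point or a decimal comma:

```python
import re

def score_numeric_response(raw, pattern=r"^\s*3[.,]14\d*\s*$"):
    """Score a short constructed response with a regular expression.

    The illustrative pattern accepts "3.14", "3,14" (decimal comma), and
    longer expansions such as "3.14159", with optional surrounding spaces.
    Returns 1 (correct) or 0 (incorrect).
    """
    return int(re.match(pattern, raw) is not None)

print(score_numeric_response("3.14159"))  # 1
print(score_numeric_response("3,14"))     # 1
print(score_numeric_response("pi"))       # 0
</n```

Real operational scoring rules would also handle rounding, tolerance intervals, and unit annotations; this sketch only shows the mechanism.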
Scoring of Complex Tasks: Special scoring requirements may arise for complex tasks that consist of multiple subtasks or for which multiple behavior-based indicators are derived. Possible approaches for scoring multiple responses include aggregating the total number of correct responses, used, for instance, to score C-tests as shown in Figure 2.9 (see, for instance, Harsch and Hartig 2016 for an example). If (complex) items share a common stimulus or if for any other reason responses to individual items influence each other, dependencies may occur that require psychometric treatment (see, for instance, the Testlet Response Theory in Williamson and Bejar 2006).
The scoring of answers is, among other things, the basis for automatic procedures of (adaptive) test assembly (see section 2.7) and for various forms of feedback (see section 2.9), for example regarding the completeness of test processing. For this purpose, a differentiated consideration of missing responses is also necessary, as described in the following subsection.
2.5.2 Missing Values
Scoring can also take the position of items within assessments into account. This is typically done when differentiating between Not Reached items (i.e., responses missing at the end of a typically time-restricted section) versus Omitted responses (i.e., items without response followed by items with responses). Computer-based assessment can be used to further differentiate the Types of Missing Responses, resulting in the following list:
Omitted responses: Questions skipped during the processing of a test lead to missing values typically described as Omitted Responses. How omitted responses should be taken into account when estimating item parameters (see section 2.5.4) and estimating person parameters (see section 2.5.5) depends on the so-called missing-data mechanisms (see, for instance, Rose, von Davier, and Nagengast 2017).
Not reached items: If there is only a limited amount of time provided to test-takers to complete items in a particular test section, answers may be missing because the time limit has been reached. These missing responses are called not reached items. The arrangement of the items and the possibilities of navigation within the test section must be considered for the interpretation of missing values as Not Reached.
Quitting: Missing values at the end of a test section can also indicate that the test-taker quit even though testing time was still available. An example showing that such responses can be incorrectly classified as Not Reached is described in Ulitzsch, von Davier, and Pohl (2020).
Not administered: Missing responses to items that were not intended for individual test-takers, for instance, based on a booklet design (see section 2.7.2). Since these missing values depend on the test design, they are often referred to as Missing by Design.
Filtered: Items may also be missing because previous responses resulted in the exclusion of a question.
Missing Value Coding: In computerized assessment, there is no reason to wait until after data collection to classify missing responses. Features of interactive test-taking can be considered during testing to distinguish missing responses using the described categories as part of scoring (see chapter 5).
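A minimal sketch of such an online classification, using only the position of missing responses and a flag indicating whether the section's time limit was reached (function name and coding rules are illustrative; quitting, as discussed above, would additionally require log data):

```python
def code_missing_responses(responses, time_exceeded):
    """Classify responses by their position in a (timed) test section.

    responses: raw responses in presentation order; None = no response.
    time_exceeded: True if the section's time limit was reached.
    Returns one code per item: 'valid', 'omitted', or 'not_reached'.
    Note: trailing missings with time left may indicate quitting and are
    coded 'omitted' here; log data would allow a finer distinction.
    """
    # Index of the last item that received a response (-1 if none).
    last_answered = max(
        (i for i, r in enumerate(responses) if r is not None), default=-1
    )
    codes = []
    for i, r in enumerate(responses):
        if r is not None:
            codes.append("valid")
        elif i > last_answered and time_exceeded:
            codes.append("not_reached")
        else:
            codes.append("omitted")
    return codes

print(code_missing_responses(["B", None, "A", None, None], time_exceeded=True))
# ['valid', 'omitted', 'valid', 'not_reached', 'not_reached']
```

An embedded missing response is coded as omitted in either case, while the interpretation of the trailing missings changes with the time-limit information.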
Use of Log-Data for Missing Value Coding: An even more differentiated analysis of missing responses is possible by taking log data into account. The incorporation of response times in the coding of omitted responses (e.g., A. Frey et al. 2018) is one example of the use of information extracted from log data (see section 2.8). Response elements that have a default value require special attention. For instance, checkboxes (see section 3.9.3) used for multiple-choice questions have an interpretation regarding the selection without any interaction (typically de-selected). Log data can be used to differentiate whether an item with multiple (un-selected) checkboxes has been answered or whether it should be coded as a missing response.
Missing responses can provide additional information regarding the measured construct, and their occurrence may be related to test-taking strategies. As described in the next section, rapid missing responses may also be part of a more general response process that is informative about test-taking engagement.
2.5.3 Rapid Guessing
Computer-based assessment can make different types of test-taking behaviors visible. A simple differentiation between solution behavior and rapid guessing was found to be beneficial (Schnipke and Scrams 1997); it can be applied when response times (see section 2.2.2) are available for each item or when an item design is used that allows interpreting individual time components (see section 2.8.2). Rapid guessing is particularly important for low-stakes assessments (Goldhammer, Martens, and Lüdtke 2017).
While solution behavior describes the (intentional) process of deliberate responding, a second process of very fast responding can be observed in many data sets. Since both processes can often be clearly separated when inspecting the response time distribution, a bimodal response time distribution has become a central validity argument (see, for instance, Wise 2017) for treating Rapid Guessing as a distinct response process. Using the bimodal response time distribution, a time threshold can be derived, and various methods exist for threshold identification (e.g., Soland, Kuhfeld, and Rios 2021), using either response times alone or in combination with other information-based criteria (e.g., Wise 2019).
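One simple threshold rule from this family can be sketched as follows: responses faster than a fixed fraction of the item's median response time are flagged as rapid guesses (the fraction of 10% is an illustrative choice; operational thresholds are usually derived from the observed bimodal distribution):

```python
from statistics import median

def flag_rapid_guesses(response_times, fraction=0.10):
    """Flag rapid guesses with a normative per-item time threshold.

    response_times: response times (in seconds) of all test-takers on
    one item. Responses faster than `fraction` of the item's median
    response time are flagged. This is one simple rule among several
    threshold-identification methods discussed in the literature.
    """
    threshold = fraction * median(response_times)
    return [rt < threshold for rt in response_times]

# Bimodal pattern: two very fast responses, five in the solution-behavior mode.
rts = [1.2, 0.4, 25.0, 31.5, 28.3, 30.1, 27.8]
print(flag_rapid_guesses(rts))  # [True, True, False, False, False, False, False]
```

Flagged responses can then be filtered or modeled separately, as discussed next.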
As an alternative to simple time thresholds, mixture modeling (e.g., Schnipke and Scrams 1997; Lu et al. 2019) can be used to differentiate between solution behavior and rapid guessing when post-processing the data. Treatments of rapid responses include response-level or test-taker-level filtering (see Rios et al. 2017 for a comparison). However, similar to missing values (see above), the treatment of responses identified as rapid guessing might require taking the missingness mechanism into account (e.g., Deribo, Kroehne, and Goldhammer 2021). Further research is required regarding the operationalization of rapid guessing for complex items (see, e.g., Sahin and Colvin 2020 for a first step in that direction) and the validation of responses identified as Rapid Guessing (e.g., Ulitzsch, Penk, et al. 2021). Another area of current research is the transfer of response-time-based methods for identifying Rapid Guessing to non-cognitive instruments and the exploration of Rapid Responding as part of Careless and Insufficient Effort Responding (CIER), either using time thresholds or based on mixture modeling (e.g., Ulitzsch, Pohl, et al. 2021).
2.5.4 Calibration of Items
After constructing a set of new assessment tasks (i.e., single items or units), the items are often administered in a pilot study (often called a calibration study). Subsequently, a subset of items is selected that measures the (latent) construct of interest in a comparable way; the selection is typically guided by Item Fit in the context of Item Response Theory (see, e.g., Partchev 2004), and so-called Item Parameters are estimated. Different tools, for instance R packages such as TAM (Alexander Robitzsch, Kiefer, and Wu 2022), can be used to estimate item parameters and to compute (item) fit indices.
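To give a sense of the direction of the difficulty scale, rough Rasch start values can be derived from proportions correct (a sketch only; proper calibration with standard errors and fit indices is left to dedicated software such as TAM):

```python
import math

def rasch_difficulty_start_values(response_matrix):
    """Rough Rasch difficulty start values from proportions correct.

    response_matrix: persons x items matrix of 0/1 scores.
    Uses b_i = -logit(p_i), where p_i is the proportion of correct
    responses to item i: harder items get higher difficulty values.
    Assumes every item has at least one correct and one incorrect
    response (otherwise the logit is undefined).
    """
    n_persons = len(response_matrix)
    n_items = len(response_matrix[0])
    difficulties = []
    for i in range(n_items):
        p = sum(row[i] for row in response_matrix) / n_persons
        difficulties.append(-math.log(p / (1.0 - p)))
    return difficulties

data = [[1, 0], [1, 0], [1, 1], [0, 0]]  # 4 persons x 2 items
print(rasch_difficulty_start_values(data))  # item 2 is harder than item 1
```

Such start values ignore missing responses and rapid guessing entirely, which is exactly why the treatments discussed in this section matter for calibration.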
Missing values can be scored in different ways for item calibration and ability estimation (see Alexander Robitzsch and Lüdtke 2022 for a discussion), depending, for instance, on assumptions regarding the latent missing propensity (see, for instance, Koehler, Pohl, and Carstensen 2014). The treatment of rapid guessing can improve item parameter estimation (e.g., Rios and Soland 2021; Rios 2022).
IRT models exist for dichotomous and polytomous items (see section 2.5.1). When multiple constructs are collected together, multidimensional IRT models can increase measurement efficiency (see, e.g., Kroehne, Goldhammer, and Partchev 2014).
Known item parameters are a prerequisite for increasing measurement efficiency through automatic test assembly and adaptive testing procedures (see section 2.7), and techniques such as the Continuous Calibration Strategy (Fink et al. 2018) can help to create new Item Pools.
Item parameters are only valid as long as the item remains unchanged. This limits the possibilities for customizing items, even if they are shared as Open Educational Resources (OER, see section 8.7.4).
2.5.5 Ability Estimation
While the estimation of item parameters is typically done outside the assessment software as part of test construction, the computation of a raw score (e.g., the number of items solved) or the estimation of a (preliminary) person-ability (using IRT and based on known item parameters) is a prerequisite for the implementation of methods to increase measurement efficiency (multi-stage testing or adaptive testing, see section 2.7). Rapid guessing (see section 2.5.3, e.g., Wise and DeMars 2006) as well as informed guessing can be acknowledged when estimating person parameters (e.g., Sideridis and Alahmadi 2022).
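The kind of preliminary ability estimate that adaptive algorithms update after each response can be sketched as an EAP (expected a posteriori) estimate for the Rasch model with known item difficulties, using a simple quadrature grid under a standard-normal prior (grid size and difficulty values are illustrative):

```python
import math

def eap_ability(responses, difficulties, n_quad=61):
    """EAP ability estimate for the Rasch model with known difficulties.

    responses: 0/1 scores; difficulties: known (calibrated) item
    difficulties. A standard-normal prior is approximated on an equally
    spaced grid from -4 to 4; equal spacing lets the grid weights cancel.
    """
    grid = [-4.0 + 8.0 * k / (n_quad - 1) for k in range(n_quad)]
    posterior = []
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)  # unnormalized prior density
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            weight *= p if x == 1 else 1.0 - p
        posterior.append(weight)
    total = sum(posterior)
    return sum(t * w for t, w in zip(grid, posterior)) / total

theta_hat = eap_ability([1, 1, 0], difficulties=[-1.0, 0.0, 1.0])
```

In operational adaptive testing, this estimate would be recomputed after each response and used to select the next most informative item (see section 2.7).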
2.5.6 Incorporation of Response Times
A long research tradition deals with the incorporation of response times in psychometric models. Based on the hierarchical modeling of responses and response times (van der Linden 2007), response times can be used, for instance, as collateral information for the estimation of item and person parameters (van der Linden, Klein Entink, and Fox 2010). Response times (and, more generally, Process Indicators, see section 2.8) have been used to improve item response theory latent regression models (Reis Costa et al. 2021; Shin, Jewsbury, and van Rijn 2022). In combination with missing responses, response-time-related information (in terms of Not Reached items) can also be included in ability estimation using polytomous scoring (Gorgun and Bulut 2021).
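The core idea of the lognormal response time model underlying the hierarchical approach can be sketched in simplified form: with log T_ij ≈ beta_i - tau_j plus error (beta_i the item's time intensity, tau_j the person's speed), a crude estimate of a person's speed given known time intensities is the mean deviation (assuming equal residual variances; a sketch, not the full hierarchical estimator):

```python
import math

def estimate_speed(log_times, time_intensities):
    """Crude person-speed estimate under a simplified lognormal RT model.

    log_times: log response times of one person on several items.
    time_intensities: known time-intensity parameters of those items.
    Returns the mean of (beta_i - log t_ij); positive values indicate a
    test-taker working faster than the item time intensities imply.
    """
    deviations = [b - lt for b, lt in zip(time_intensities, log_times)]
    return sum(deviations) / len(deviations)

# A test-taker needing 10 s and 20 s on items with time intensities
# corresponding to 15 s and 30 s works faster than average (tau > 0).
log_ts = [math.log(10.0), math.log(20.0)]
betas = [math.log(15.0), math.log(30.0)]
print(estimate_speed(log_ts, betas))
```

In the full hierarchical framework, item-level residual variances weight these deviations, and speed and ability are modeled jointly, which is what makes response times usable as collateral information.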