2.5 Scoring and Calibration
Many psychometric CBA applications rest on principles of psychological and educational measurement (see, e.g., Veldkamp and Sluijter 2019), and typical design principles for tasks/items and tests that also apply to non-technology-based assessments remain valid (see, e.g., Downing and Haladyna 2006). For instance, approaches to increasing measurement efficiency use a particular item response theory (IRT) model and (pre-)calibrated item parameters (see also section 2.7):
- Difficulty Parameter or Threshold Parameters,
- Discrimination Parameter (Loading) and
- (Pseudo)Guessing Parameter.
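These three kinds of parameters can be combined, for instance, in the three-parameter logistic (3PL) model. The following is a minimal sketch (the parameter values are illustrative, not taken from any calibrated item pool):

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL IRT model.

    theta: latent ability, a: discrimination, b: difficulty,
    c: pseudo-guessing parameter (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# With c = 0 the model reduces to the 2PL; with a = 1 and c = 0 to the Rasch model.
p = p_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)  # ability equal to difficulty -> 0.6
```

When ability equals difficulty, the probability is halfway between the guessing floor and 1, which is why the example evaluates to 0.6.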
Which item parameters are used to model the probability of a correct response (or a response in a particular category) as a function of a one- or multidimensional latent ability depends on the choice of a concrete IRT model (see, for instance, Embretson and Reise 2013 for a general introduction). Bolt (2016) lists the following IRT models for CBA:
- Models for innovative item types (polytomous IRT models and multidimensional IRT models) and models related to testlet-based administration (see section 2.5.1)
- Models that attend to response times (see section 2.5.6)
- Models related to item and test security (see section 2.10.1)
Further areas of application for (IRT) models in the context of CBA include:
- Models to deal with missing values (see section 2.5.2)
- Models to deal with rapid guessing (see section 2.5.3)
- Models for automated item generation (see section 2.6)
Another class of IRT models that is often used with CBA data are Cognitive Diagnostic Models (CDM; see, e.g., George and Robitzsch 2015 for a tutorial). Additional information available when data are collected with CBA, such as response times and more general process data (see section 2.8), can be used in cognitive diagnostic modeling (e.g., Jiao, Liao, and Zhan 2019).
2.5.1 Scoring of Items
Computer-based assessment consists of a sequence of assessment components, i.e., instructions and between-screen prompts, and purposefully designed (digital) environments in which diagnostic evidence can be collected. The atomic parts for gathering diagnostic evidence are called items, typically consisting of a prompt (or a question) and the opportunity to provide a response. Items can include introductory text, graphics, tables, or other information required to respond to the prompt, or questions can refer to a shared stimulus (creating a hierarchical structure called units). The raw response provided to items is typically translated into a numerical value called a Score.
Dichotomous vs. Polytomous Scoring: For the use of responses to determine a person’s score, a distinction is commonly made between dichotomous scoring (incorrect vs. correct) and polytomous scoring (e.g., no credit, partial credit, full credit). While this distinction is essential, for example, for IRT models, the two kinds of scoring are not mutually exclusive with respect to the use of information collected in computer-based assessment. Identical responses can be scored differently depending on the intended use. For instance, a polytomous score can be used for ability estimation, while multiple dichotomous indicators for specific potential responses can provide additional insight in formative assessment scenarios.
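The point that one raw response can yield several scores can be sketched for a multiple-selection item (the scoring rules below are illustrative, not a standard):

```python
def score_multiple_selection(selected, correct):
    """Score a multiple-selection response two ways (illustrative rules).

    selected: set of options the test-taker checked.
    correct: set of options keyed as correct.
    """
    selected, correct = set(selected), set(correct)
    hits = len(selected & correct)
    false_alarms = len(selected - correct)
    # Dichotomous score: full credit only for the exact correct pattern.
    dichotomous = int(selected == correct)
    # Polytomous score: partial credit per correct selection, reduced
    # by wrongly selected options (floored at zero).
    polytomous = max(hits - false_alarms, 0)
    return dichotomous, polytomous

# The same raw response produces two different scores for different uses.
print(score_multiple_selection({"A", "C"}, {"A", "B", "C"}))  # (0, 2)
```

Here the dichotomous score could feed a simple IRT model, while the polytomous score (or the individual selection indicators) could drive formative feedback.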
Multiple Attempts & Answer-Until-Correct: In computer-based assessments, the process of test-taking and responding can also be included in the scoring. This allows scenarios in which test-takers can answer an item multiple times (e.g., Attali 2011) or until correct (e.g., DiBattista 2013; Slepkov and Godfrey 2019). While this goes beyond simple IRT models, it can be combined with a traditional scoring of the first attempt or of the final response prior to feedback (see section 2.9).
Constructed Response Scoring: Scores can be calculated automatically for closed response formats (i.e., items that require selecting one or multiple presented response options). For response formats beyond that, scoring can require human raters, pattern recognition (e.g., using Regular Expressions, see section 6.1), or Natural Language Processing (NLP) techniques and machine learning (typically using some human-scored training data, see, for instance, Yan, Rupp, and Foltz 2020).
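Pattern-based scoring of short constructed responses can be illustrated with a regular expression; the pattern below is a hypothetical example that accepts expansions of pi with either a decimal point or a decimal comma:

```python
import re

def score_numeric_response(raw, pattern=r"^\s*3[.,]14\d*\s*$"):
    """Score a short constructed response with a regular expression.

    The illustrative pattern accepts "3.14", "3,14" (decimal comma), and
    longer expansions such as "3.14159", with optional surrounding spaces.
    Returns 1 (correct) or 0 (incorrect).
    """
    return int(re.match(pattern, raw) is not None)

print(score_numeric_response("3.14159"))  # 1
print(score_numeric_response("3,14"))     # 1
print(score_numeric_response("pi"))       # 0
</n```

Real operational scoring rules would also handle rounding, tolerance intervals, and unit annotations; this sketch only shows the mechanism.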
Scoring of Complex Tasks: Special scoring requirements may arise for complex tasks that consist of multiple subtasks or for which multiple behavior-based indicators are derived. Possible approaches for scoring multiple responses include aggregating the total number of correct responses, used, for instance, to score C-tests as shown in Figure 2.9 (see, for instance, Harsch and Hartig 2016 for an example). If (complex) items share a common stimulus or if for any other reason responses to individual items influence each other, dependencies may occur that require psychometric treatment (see, for instance, the Testlet Response Theory in Williamson and Bejar 2006).
The scoring of answers is, among other things, the basis for automatic procedures of (adaptive) test assembly (see section 2.7) and for various forms of feedback (see section 2.9), for example regarding the completeness of test processing. For this purpose, a differentiated consideration of missing responses is also necessary, as described in the following subsection.
2.5.2 Missing Values
Scoring can also take the position of items within assessments into account. This is typically done when differentiating between Not Reached items (i.e., responses missing at the end of a typically time-restricted section) versus Omitted responses (i.e., items without response followed by items with responses). Computer-based assessment can be used to further differentiate the Types of Missing Responses, resulting in the following list:
Omitted responses: Questions skipped during the processing of a test lead to missing values typically described as Omitted Responses. How omitted responses should be taken into account when estimating item parameters (see section 2.5.4) and estimating person parameters (see section 2.5.5) depends on the so-called missing-data mechanisms (see, for instance, Rose, von Davier, and Nagengast 2017).
Not reached items: If there is only a limited amount of time provided to test-takers to complete items in a particular test section, answers may be missing because the time limit has been reached. These missing responses are called not reached items. The arrangement of the items and the possibilities of navigation within the test section must be considered for the interpretation of missing values as Not Reached.
Quitting: Missing values at the end of a test section can also indicate that the test-taker quit even though testing time was still available. An example showing that such responses can be incorrectly classified as Not Reached is described in Ulitzsch, von Davier, and Pohl (2020).
Not administered: Missing responses to items that were not intended for individual test-takers, for instance, based on a booklet design (see section 2.7.2). Since these missing values depend on the test design, they are often referred to as Missing by Design.
Filtered: Items may also be missing because previous responses resulted in the exclusion of a question.
Missing Value Coding: In computerized assessment, there is no reason to wait until after data collection to classify missing responses. Features of interactive test-taking can be considered during testing to distinguish missing responses using the described categories as part of scoring (see chapter 5).
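A minimal sketch of such an online classification, using only the position of missing responses and a flag indicating whether the section's time limit was reached (function name and coding rules are illustrative; quitting, as discussed above, would additionally require log data):

```python
def code_missing_responses(responses, time_exceeded):
    """Classify responses by their position in a (timed) test section.

    responses: raw responses in presentation order; None = no response.
    time_exceeded: True if the section's time limit was reached.
    Returns one code per item: 'valid', 'omitted', or 'not_reached'.
    Note: trailing missings with time left may indicate quitting and are
    coded 'omitted' here; log data would allow a finer distinction.
    """
    # Index of the last item that received a response (-1 if none).
    last_answered = max(
        (i for i, r in enumerate(responses) if r is not None), default=-1
    )
    codes = []
    for i, r in enumerate(responses):
        if r is not None:
            codes.append("valid")
        elif i > last_answered and time_exceeded:
            codes.append("not_reached")
        else:
            codes.append("omitted")
    return codes

print(code_missing_responses(["B", None, "A", None, None], time_exceeded=True))
# ['valid', 'omitted', 'valid', 'not_reached', 'not_reached']
```

An embedded missing response is coded as omitted in either case, while the interpretation of the trailing missings changes with the time-limit information.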
Use of Log-Data for Missing Value Coding: An even more differentiated analysis of missing responses is possible by taking log data into account. The incorporation of response times in the coding of omitted responses (e.g., A. Frey et al. 2018) is one example of the use of information extracted from log data (see section 2.8). Response elements that have a default value require special attention. For instance, checkboxes (see section 3.9.3) used for multiple-choice questions have an interpretation regarding the selection without any interaction (typically de-selected). Log data can be used to differentiate whether an item with multiple (un-selected) checkboxes has been answered or whether it should be coded as a missing response.
Missing responses can provide additional information regarding the measured construct, and their occurrence may be related to test-taking strategies. As described in the next section, rapid missing responses may also be part of a more general response process that is informative about test-taking engagement.
2.5.3 Rapid Guessing
Computer-based assessment can make different types of test-taking behaviors visible. A simple differentiation between solution behavior and rapid guessing was found to be beneficial (Schnipke and Scrams 1997); it can be applied when response times (see section 2.2.2) are available for each item or when an item design is used that allows interpreting individual time components (see section 2.8.2). Rapid guessing is particularly important for low-stakes assessments (Goldhammer, Martens, and Lüdtke 2017).
While solution behavior describes the (intentional) process of deliberate responding, a second process of very fast responding can be observed in many data sets. Since both processes can often be clearly separated when inspecting the response time distribution, a bimodal response time distribution has become a central validity argument (see, for instance, Wise 2017) for treating Rapid Guessing as a distinct response process. Using the bimodal response time distribution, a time threshold can be derived, and various methods exist for threshold identification (e.g., Soland, Kuhfeld, and Rios 2021), using either response times alone or in combination with other information-based criteria (e.g., Wise 2019).
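One simple threshold rule from this family can be sketched as follows: responses faster than a fixed fraction of the item's median response time are flagged as rapid guesses (the fraction of 10% is an illustrative choice; operational thresholds are usually derived from the observed bimodal distribution):

```python
from statistics import median

def flag_rapid_guesses(response_times, fraction=0.10):
    """Flag rapid guesses with a normative per-item time threshold.

    response_times: response times (in seconds) of all test-takers on
    one item. Responses faster than `fraction` of the item's median
    response time are flagged. This is one simple rule among several
    threshold-identification methods discussed in the literature.
    """
    threshold = fraction * median(response_times)
    return [rt < threshold for rt in response_times]

# Bimodal pattern: two very fast responses, five in the solution-behavior mode.
rts = [1.2, 0.4, 25.0, 31.5, 28.3, 30.1, 27.8]
print(flag_rapid_guesses(rts))  # [True, True, False, False, False, False, False]
```

Flagged responses can then be filtered or modeled separately, as discussed next.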
As an alternative to simple time thresholds, mixture modeling (e.g., Schnipke and Scrams 1997; Lu et al. 2019) can be used to differentiate between solution behavior and rapid guessing when post-processing the data. Treatments of rapid responses include response-level or test-taker-level filtering (see Rios et al. 2017 for a comparison). However, similar to missing values (see above), the treatment of responses identified as rapid guessing might require taking the missingness mechanism into account (e.g., Deribo, Kroehne, and Goldhammer 2021). Further research is required regarding the operationalization of rapid guessing for complex items (see, e.g., Sahin and Colvin 2020 for a first step in that direction) and the validation of responses identified as Rapid Guessing (e.g., Ulitzsch, Penk, et al. 2021). Another area of current research is the transfer of response-time-based methods for identifying Rapid Guessing to non-cognitive instruments and the exploration of Rapid Responding as part of Careless and Insufficient Effort Responding (CIER), either using time thresholds or based on mixture modeling (e.g., Ulitzsch, Pohl, et al. 2021).
2.5.4 Calibration of Items
After constructing a set of new assessment tasks (i.e., single items or units), the items are often administered in a pilot study (often called a calibration study). Subsequently, a subset of items is selected that measures the (latent) construct of interest in a comparable way; the selection is typically guided by Item Fit in the context of Item Response Theory (see, e.g., Partchev 2004), and so-called Item Parameters are estimated. Different tools, for instance R packages such as TAM (Alexander Robitzsch, Kiefer, and Wu 2022), can be used to estimate item parameters and to compute (item) fit indices.
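To give a sense of the direction of the difficulty scale, rough Rasch start values can be derived from proportions correct (a sketch only; proper calibration with standard errors and fit indices is left to dedicated software such as TAM):

```python
import math

def rasch_difficulty_start_values(response_matrix):
    """Rough Rasch difficulty start values from proportions correct.

    response_matrix: persons x items matrix of 0/1 scores.
    Uses b_i = -logit(p_i), where p_i is the proportion of correct
    responses to item i: harder items get higher difficulty values.
    Assumes every item has at least one correct and one incorrect
    response (otherwise the logit is undefined).
    """
    n_persons = len(response_matrix)
    n_items = len(response_matrix[0])
    difficulties = []
    for i in range(n_items):
        p = sum(row[i] for row in response_matrix) / n_persons
        difficulties.append(-math.log(p / (1.0 - p)))
    return difficulties

data = [[1, 0], [1, 0], [1, 1], [0, 0]]  # 4 persons x 2 items
print(rasch_difficulty_start_values(data))  # item 2 is harder than item 1
```

Such start values ignore missing responses and rapid guessing entirely, which is exactly why the treatments discussed in this section matter for calibration.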
Missing values can be scored in different ways for item calibration and ability estimation (see Alexander Robitzsch and Lüdtke 2022 for a discussion), depending, for instance, on assumptions regarding the latent missing propensity (see, for instance, Koehler, Pohl, and Carstensen 2014). The treatment of rapid guessing can improve item parameter estimation (e.g., Rios and Soland 2021; Rios 2022).
IRT models exist for dichotomous and polytomous items (see section 2.5.1). When multiple constructs are collected together, multidimensional IRT models can increase measurement efficiency (see, e.g., Kroehne, Goldhammer, and Partchev 2014).
Known item parameters are a prerequisite for increasing measurement efficiency through automatic test assembly and adaptive testing procedures (see section 2.7), and techniques such as the Continuous Calibration Strategy (Fink et al. 2018) can help to create new Item Pools.
Item parameters are only valid as long as the item remains unchanged. This limits the possibilities for customizing items, even if they are shared as Open Educational Resources (OER, see section 8.7.4).
2.5.5 Ability Estimation
While the estimation of item parameters is typically done outside the assessment software as part of test construction, the computation of a raw score (e.g., the number of items solved) or the estimation of a (preliminary) person-ability (using IRT and based on known item parameters) is a prerequisite for the implementation of methods to increase measurement efficiency (multi-stage testing or adaptive testing, see section 2.7). Rapid guessing (see section 2.5.3, e.g., Wise and DeMars 2006) as well as informed guessing can be acknowledged when estimating person parameters (e.g., Sideridis and Alahmadi 2022).
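The kind of preliminary ability estimate that adaptive algorithms update after each response can be sketched as an EAP (expected a posteriori) estimate for the Rasch model with known item difficulties, using a simple quadrature grid under a standard-normal prior (grid size and difficulty values are illustrative):

```python
import math

def eap_ability(responses, difficulties, n_quad=61):
    """EAP ability estimate for the Rasch model with known difficulties.

    responses: 0/1 scores; difficulties: known (calibrated) item
    difficulties. A standard-normal prior is approximated on an equally
    spaced grid from -4 to 4; equal spacing lets the grid weights cancel.
    """
    grid = [-4.0 + 8.0 * k / (n_quad - 1) for k in range(n_quad)]
    posterior = []
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)  # unnormalized prior density
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            weight *= p if x == 1 else 1.0 - p
        posterior.append(weight)
    total = sum(posterior)
    return sum(t * w for t, w in zip(grid, posterior)) / total

theta_hat = eap_ability([1, 1, 0], difficulties=[-1.0, 0.0, 1.0])
```

In operational adaptive testing, this estimate would be recomputed after each response and used to select the next most informative item (see section 2.7).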
2.5.6 Incorporation of Response Times
A long research tradition deals with the incorporation of response times in psychometric models. Based on the hierarchical modeling of responses and response times (van der Linden 2007), response times can be used, for instance, as collateral information for the estimation of item and person parameters (van der Linden, Klein Entink, and Fox 2010). Response times (and, more generally, Process Indicators, see section 2.8) have been used to improve item response theory latent regression models (Reis Costa et al. 2021; Shin, Jewsbury, and van Rijn 2022). In combination with missing responses, response-time-related information (in terms of Not Reached items) can also be included in ability estimation using polytomous scoring (Gorgun and Bulut 2021).
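The core idea of the lognormal response time model underlying the hierarchical approach can be sketched in simplified form: with log T_ij ≈ beta_i - tau_j plus error (beta_i the item's time intensity, tau_j the person's speed), a crude estimate of a person's speed given known time intensities is the mean deviation (assuming equal residual variances; a sketch, not the full hierarchical estimator):

```python
import math

def estimate_speed(log_times, time_intensities):
    """Crude person-speed estimate under a simplified lognormal RT model.

    log_times: log response times of one person on several items.
    time_intensities: known time-intensity parameters of those items.
    Returns the mean of (beta_i - log t_ij); positive values indicate a
    test-taker working faster than the item time intensities imply.
    """
    deviations = [b - lt for b, lt in zip(time_intensities, log_times)]
    return sum(deviations) / len(deviations)

# A test-taker needing 10 s and 20 s on items with time intensities
# corresponding to 15 s and 30 s works faster than average (tau > 0).
log_ts = [math.log(10.0), math.log(20.0)]
betas = [math.log(15.0), math.log(30.0)]
print(estimate_speed(log_ts, betas))
```

In the full hierarchical framework, item-level residual variances weight these deviations, and speed and ability are modeled jointly, which is what makes response times usable as collateral information.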