8.6 Data Processing after Assessments

The data preparation process should already have been tested as part of the Integration Testing (see section 8.4.2). For this purpose, the required routines (e.g., R scripts) should have been created prior to data collection and tested with the help of synthetic data. Testing is complete once it has been verified that the central information required for identifying evidence about the test-taker's knowledge, skills, and abilities can be derived from the completed tasks.

TABLE 8.8: Steps for Data Preparation / Reporting / Feedback
1. Data Preparation
2. Coding of Open-Ended Responses
3. Final Scoring / Cut Scores / Test Reports
4. Data Set Generation / Data Dissemination
How to do this with the CBA ItemBuilder? The simplest case of Data Preparation for assessment projects using the CBA ItemBuilder is a study in which the Codebook and the Missing Value Coding are already included in the deployment software. If all instruments are implemented using CBA ItemBuilder content and no ExternalPageFrame content is used, the required preparation of the result data boils down to combining the data stored in case-wise individual Raw Data Archives into a data set in the desired file format.

Data Preparation: Data preparation can begin during data collection if intermediate data are provided or made available. Typically, the data are generated in smaller units (i.e., sessions) in which a test-taker processes a set of tasks compiled for or assigned to them via pre-load information. The data on a test-taker, as provided by the assessment software, can be understood as a Raw Data Archive. Analogous to scans of paper test booklets, these Raw Data Archives (for example, combined into a ZIP archive) are the starting point for data preparation. If raw data from computer-based assessments must be archived in line with good scientific practice, this can be understood as a requirement for long-term storage of the Raw Data Archives.
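Combining case-wise Raw Data Archives into one data set can be sketched as follows. This is a minimal illustration, not the format of any particular delivery software: it assumes each test-taker's archive has already been extracted to a folder containing a file `results.csv` with one row of result data (for real ZIP archives, `utils::unzip()` would be an additional first step).

```r
# Minimal sketch: combine case-wise result data into one data set.
# Assumption (illustrative only): each Raw Data Archive has been extracted
# to a folder <person-id>/results.csv; names and layout are invented here.

base_dir <- file.path(tempdir(), "raw_archives")
for (id in c("P001", "P002")) {                     # synthetic example data
  case_dir <- file.path(base_dir, id)
  dir.create(case_dir, recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(person = id, item01 = 1),
            file.path(case_dir, "results.csv"), row.names = FALSE)
}

# Data preparation: read every case-wise file and row-bind into one data set.
files <- list.files(base_dir, pattern = "results\\.csv$",
                    recursive = TRUE, full.names = TRUE)
result_data <- do.call(rbind, lapply(files, read.csv))
result_data <- result_data[order(result_data$person), ]
print(result_data)
```

The same pattern scales to log data files stored alongside the result data in each archive.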

A first step often required before the collected data can be described as pseudonymized or anonymized is the exchange of the person identifiers used during data collection (ID change). Person identifiers might be used as the file names of the Raw Data Archives and might be included in several places. Since the Raw Data Archives should not be changed after data collection, data processing means extracting the relevant information from the Raw Data Archives and changing the person identifier in the extracted result data and the extracted log data.
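An ID change on the extracted data can be sketched like this. The variable names and pseudonyms are invented for illustration; the key point is that only the extracted result (and log) data are modified, while the Raw Data Archives stay untouched.

```r
# Minimal sketch of an ID change: replace the person identifiers used during
# data collection with new pseudonyms in the *extracted* result data.
# All identifiers and variable names below are invented for illustration.

result_data <- data.frame(person = c("P001", "P002", "P001"),
                          item   = c("task1", "task1", "task2"),
                          score  = c(1, 0, 1))

# Mapping from collection IDs to new pseudonyms (to be stored separately
# and securely, so the link can be destroyed for full anonymization).
id_map <- c(P001 = "X9f3", P002 = "X7a1")

result_data$person <- unname(id_map[result_data$person])
print(result_data)
```

The same `id_map` would be applied to the extracted log data so that result and log data remain linkable after the ID change.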

Approaches known from Open Science and reproducible research (Gandrud 2020) should be used (i.e., scripts maintained under version control) to allow re-running the complete data preparation, from the Raw Data Archives to the final data sets. If the data preparation is carried out entirely with scripts (e.g., using R), later adjustments are more straightforward. Possible adjustments include deletion requests for the data of individual test-takers, which might otherwise be cumbersome if, for example, a large number of data sets are created from the collected log data (see section 2.8).

How to do this with the CBA ItemBuilder? For automatic and script-based processing of data collected with the CBA ItemBuilder and selected delivery software, the readout of raw data archives can be automated with the R package LogFSM (see section 2.8.5).

Coding of Open-Ended Responses: The operators described in chapter 5 for evaluating so-called Open-Ended Responses with the CBA ItemBuilder are currently limited. Open-ended responses (such as text answers) can be scored automatically only to a minimal extent (in the CBA ItemBuilder, only with the help of regular expressions). More modern methods of evaluating open-text responses using natural language processing [NLP; see, for instance, Zehner, Sälzer, and Goldhammer (2016)] might require a two-step procedure: Training data are collected in the first step and not evaluated live during test-taking. Afterward, classifiers based on NLP language models are trained or adapted in the form of fine-tuning. Once such classifiers are available, test-takers' responses can be evaluated automatically and transferred to the data set.
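The kind of regular-expression scoring that is feasible for short text responses can be illustrated in R. The item, the pattern, and the accepted spellings below are invented; the sketch only shows the principle of pattern-based matching, not the CBA ItemBuilder's own operator syntax.

```r
# Minimal sketch of regular-expression scoring of short text responses.
# Item content and the scoring pattern are invented for illustration.

responses <- c("Berlin", "berlin ", "The capital is Berlin.", "Paris")

# Accept any response that contains 'berlin', ignoring case.
score_response <- function(x) {
  as.integer(grepl("berlin", x, ignore.case = TRUE))
}

scores <- vapply(responses, score_response, integer(1), USE.NAMES = FALSE)
print(scores)  # 1 1 1 0
```

The example also shows the limits of this approach: misspellings or paraphrases would require additional patterns, which is where NLP-based classifiers become attractive.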

A similar procedure applies to graphical answer formats (e.g., when an ExternalPageFrame allows test-takers to create a drawing as their answer). To create training data in preparation for automatic coding, or if answers are to be evaluated exclusively by humans, the open response must be extracted from the Raw Data Archives for an evaluation process (Human Scoring).

How to do this with the CBA ItemBuilder? Using the data collected during runtime (in particular, the so-called Snapshot), the TaskPlayer API can be used to restore an item in exactly the state in which the test-taker left it. This makes it possible to build solutions for human coding of open responses. Note that if content is embedded using ExternalPageFrames, the JavaScript/HTML5 content embedded into CBA ItemBuilder items must implement the getState()/setState() interface to collect the state of the ExternalPageFrames on exit and to allow restoring the content for scoring purposes (rating).

Final Scoring: Whether items already score the responses at runtime (scoring) or whether only the raw responses (i.e., the selected options, entered texts, etc.) are collected is decided differently for different assessments. As long as the responses are not needed for filtering or ability estimation (see section 2.7), there is no critical reason why scoring should not be performed as part of post-processing. Only if created assessment content is shared (see section 8.7.3) is it helpful to define the scoring directly within the CBA ItemBuilder Project Files (i.e., the files to be shared), because this way items are automatically shared with the appropriate scoring.
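Scoring as part of post-processing can be sketched as applying a scoring key to the collected raw responses. The item names, options, and key below are invented for illustration.

```r
# Minimal sketch of post-processing scoring: raw responses collected at
# runtime are mapped to scores afterwards using a scoring key.
# Item names, options, and the key are invented for illustration.

raw <- data.frame(person = c("X9f3", "X7a1"),
                  item01 = c("B", "C"),    # selected options
                  item02 = c("A", "A"))

key <- c(item01 = "B", item02 = "A")       # correct option per item

scored <- raw
for (item in names(key)) {
  scored[[item]] <- as.integer(raw[[item]] == key[[item]])
}
print(scored)
```

Keeping the key in one place (here, the vector `key`) makes it easy to re-run the scoring if, for example, a key error is discovered after data collection.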

Cut Scores and Item Parameters: Even if the scoring (i.e., for example, the mapping of a selection to a score of correct, wrong, or partially correct) can be part of the item (implemented, for instance, using the scoring operators described in chapter 5), the Item Parameters and potential Cut Scores (i.e., threshold values for estimated latent abilities) are not considered part of the assessment content. These parameters might not be known when a newly developed instrument is used for the first time, or their values might depend on the intended target population.

How to do this with the CBA ItemBuilder? To implement adaptive testing (i.e., a dynamic selection of Tasks during testing, see section 6.7.2), Item Parameters are needed. Depending on the deployment software used, the Item Parameters can be stored, for example, as an Item Pool (see section 7.5.5) or used in an R function (see section 7.3.3).
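How Item Parameters drive a dynamic selection of Tasks can be sketched with a standard maximum-information rule under the 2PL model. The item pool and its parameter values are invented; this is the textbook selection criterion, not the implementation of any particular deployment software.

```r
# Minimal sketch of adaptive item selection: given a provisional ability
# estimate theta, pick the unused item with maximal Fisher information
# under the 2PL model. All parameter values are invented for illustration.

item_pool <- data.frame(item = c("I1", "I2", "I3"),
                        a = c(1.2, 0.8, 1.5),   # discrimination
                        b = c(-1.0, 0.0, 1.0))  # difficulty

# Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p).
info_2pl <- function(theta, a, b) {
  p <- 1 / (1 + exp(-a * (theta - b)))
  a^2 * p * (1 - p)
}

next_item <- function(theta, pool, administered = character(0)) {
  candidates <- pool[!pool$item %in% administered, ]
  info <- info_2pl(theta, candidates$a, candidates$b)
  candidates$item[which.max(info)]
}

print(next_item(theta = 0, item_pool))
```

An R function of this form could serve as the selection logic referenced in section 7.3.3, while the data frame `item_pool` plays the role of the stored Item Pool.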

Test Reports: Different parts of an assessment software might be responsible for feedback either during the assessment (see section 2.9.1), or after data processing and scoring of open-ended responses (see section 2.9.2). Hence, reports can be generated either online (as part of the assessment software) or offline (as part of the data processing).

How to do this with the CBA ItemBuilder? You can implement basic examples of using CBA ItemBuilder items with automatic, out-of-the-box feedback using the R package ShinyItemBuilder (see section 7.3.5).

Data Dissemination: The provision and distribution (i.e., dissemination) of data from computer-based assessments, for example in research data centers, can be done for Result Data and Process Indicators in the typical form of data sets (one row per person, one column per variable). Since the different assessment platforms and software tools provide log data in different ways, log data can be transformed into one of the data formats described in section 2.8.4 as part of the data processing after an assessment.
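Transforming log data into a flat, disseminable table can be sketched as flattening nested events into one row per event. The event structure below is invented for illustration and does not reproduce any specific format from section 2.8.4.

```r
# Minimal sketch of flattening nested log events into a table with one row
# per event (person, event type, timestamp, plus an event-specific detail).
# The event structure is invented for illustration.

log_events <- list(
  list(person = "X9f3", type = "TaskStart",   time = 0.0),
  list(person = "X9f3", type = "ButtonClick", time = 2.5, button = "next"),
  list(person = "X9f3", type = "TaskEnd",     time = 3.1)
)

flat <- do.call(rbind, lapply(log_events, function(e) {
  data.frame(person = e$person,
             event  = e$type,
             time   = e$time,
             detail = if (is.null(e$button)) NA_character_ else e$button,
             stringsAsFactors = FALSE)
}))
print(flat)
```

A table of this shape (one row per event) is straightforward to archive and to share alongside the person-by-variable result data sets.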

References

Gandrud, Christopher. 2020. Reproducible Research with R and RStudio. Third edition. The R Series. Boca Raton, FL: CRC Press.
Zehner, Fabian, Christine Sälzer, and Frank Goldhammer. 2016. “Automatic Coding of Short Text Responses via Clustering in Educational Assessment.” Educational and Psychological Measurement 76 (2): 280–303. https://doi.org/10.1177/0013164415590022.