2.2 Standardized Response Formats

The existing standard Question and Test Interoperability (QTI)15 defines simple items with one point of interaction. These simple items can be understood as the standardized form of classical response formats (see Figure 2.1 for an illustration).

FIGURE 2.1: Item illustrating QTI interactions for simple items (html|ib).

Choice Interaction: The QTI Choice Interaction presents a collection of choices to a test-taker. The test-taker's response is to select one or more of the choices, up to a maximum of max-choices. The choice interaction is always initialized with no choices selected. The number of selectable choices is described by the attributes max-choices and min-choices.

How to do this with the CBA ItemBuilder? Instead of max-choices and min-choices, the CBA ItemBuilder differentiates between RadioButtons (single-choice, see section 3.9.2) and Checkboxes (multiple-choice, see section 3.9.3), and the more general concept of FrameSelectGroups (see section 3.5.1).
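The min-choices/max-choices selection rule described above can be sketched as a simple validation check. This is a generic illustration; the function name and structure are mine and not part of QTI or the CBA ItemBuilder. It follows the QTI convention that a max-choices value of 0 means the number of selections is unlimited.

```python
def selection_valid(selected, min_choices=0, max_choices=1):
    """Check whether the number of selected choices satisfies the
    min-choices/max-choices constraint of a QTI Choice Interaction.
    A max_choices of 0 conventionally means 'unlimited' in QTI."""
    n = len(selected)
    if max_choices and n > max_choices:
        return False
    return n >= min_choices

# Single choice (RadioButton-like): exactly one selection required
print(selection_valid(["A"], min_choices=1, max_choices=1))   # True
# Multiple choice (Checkbox-like): at most three selections allowed
print(selection_valid(["A", "B", "C", "D"], max_choices=3))   # False
```

In the CBA ItemBuilder, the single-choice case is enforced implicitly by the component type (RadioButtons within a group), while the validation above corresponds to what a delivery engine checks for a QTI Choice Interaction.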

The QTI standard only differentiates between the orientations horizontal and vertical, while the CBA ItemBuilder allows you to create any visual layout (including tables) with RadioButtons and/or Checkboxes (see section 3.5.3).16 Moreover, QTI allows the option shuffle to be defined to randomize the order of choices.17

(Extended) Text Entry Interaction: The QTI Text Entry Interaction is defined as an inline interaction that obtains a simple piece of text from the test-taker. The delivery engine (i.e., the assessment platform) must allow the test-taker to review their choice within the context of the surrounding text. An example illustrating an item from Striewe and Kramer (2018) is shown in Figure 2.1. QTI uses specific so-called class attributes to define the width of the text entry component. For tasks where an extended amount of text needs to be entered as the response, QTI defines the Extended Text Interaction.

How to do this with the CBA ItemBuilder? Text responses can be collected with input fields (either single line or multiple lines, see section 3.9.1)18, while size and position can be defined visually in the Page Editor (see section 3.7). Regular expressions can be used to restrict possible characters (see section 6.1.3).
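Restricting the characters that may be entered with a regular expression can be sketched as follows. This is a generic illustration of the principle; the pattern and function are mine, while the CBA ItemBuilder applies such patterns via input field properties (see section 6.1.3).

```python
import re

# Example pattern: allow only digits (e.g., for a numeric text entry field)
DIGITS_ONLY = re.compile(r"^[0-9]*$")

def is_allowed(value: str) -> bool:
    """Return True if the entered text consists only of permitted characters."""
    return DIGITS_ONLY.match(value) is not None

print(is_allowed("1984"))   # True
print(is_allowed("19a4"))   # False
```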

Gap Match Interaction / Associate Interaction / Order Interaction: QTI defines various interactions that can be realized using drag and drop, such as the Gap Match Interaction, Associate Interaction, and Order Interaction (see Figure 2.1). QTI defines these interactions for textual and graphical elements, depending on the nature of the content.

How to do this with the CBA ItemBuilder? A concept that allows drag and drop of text, images and videos is implemented in the CBA ItemBuilder (see section 4.2.6). Using this approach, the interactions described by QTI can be implemented by a) visually placing the source and target elements on a page and b) configuring the drag and drop to either switch, copy or move values.

Match Interaction: Another interaction that does not necessarily need to be realized using drag and drop is the Match Interaction, which can also be computerized using components of type Checkbox, as used for multiple-choice response formats (see Figure 2.1).

HotText / Inline Choice Interaction: For choice interactions embedded in context (e.g., text, image, etc.), the QTI standard defines two different interactions. Two possible implementations, with buttons and ComboBoxes, are shown in Figure 2.1. However, these response formats can also be implemented with other components, for example, Checkboxes (multiple-choice) or RadioButtons (single-choice) for hot text interactions.

Slider Interaction: Response formats in which the possible answers are restricted to a predefined value range contribute to the standardization of assessments. QTI defines such a response format as the Slider Interaction (see Figure 2.1).

How to do this with the CBA ItemBuilder? The CBA ItemBuilder has specific components that can be linked directly to variables (see section 4.2.2).

Additional Interactions Defined by QTI: The QTI standard defines additional interactions not illustrated in Figure 2.1. The Media Interaction allows adding audio and video components to items, including a measure of how often the media object was experienced.

How to do this with the CBA ItemBuilder? The use of components to play audio and video files is described in section 3.10.3.

The Hotspot Interaction and Position Object Interaction are graphical interactions that allow test-takers to select or position parts of images, while the Drawing Interaction describes items in which test-takers can construct graphical responses.

How to do this with the CBA ItemBuilder? Simple graphical interactions can be implemented using ImageMaps (see section 3.9.10). For more advanced graphical response formats, the CBA ItemBuilder provides the concept of ExternalPageFrames, to embed HTML5/JavaScript content (see section 3.14 and section 6.6 for examples).

QTI also defines the Upload Interaction, which is more suited for learning environments than for standardized computer-based assessment, since uploading material from the computer might violate typical requirements regarding test security (see section 2.10).

PCI Interaction: As shown in Figure 2.1, the CBA ItemBuilder can be used to create items that correspond to the interactions defined by QTI. Figure 2.1 shows a single CBA ItemBuilder task, illustrating two different ways of navigating between interactions. The Innovative Item Types described in section 2.3 below show additional examples beyond the possibilities defined in the QTI standard. To use such innovative items, e.g., tasks created with the CBA ItemBuilder, in QTI-based assessment platforms, QTI describes the Portable Custom Interaction (PCI).

How to do this with the CBA ItemBuilder? CBA ItemBuilder tasks can be packaged in a way that allows using them in QTI-based assessment platforms, such as TAO (see section 7.4).

Combination of Interactions: Not all of the interactions standardized by QTI were available in paper-based assessment (PBA) mode. However, single- and multiple-choice interactions (for closed question types) and text entry interactions (for constructed written responses) in particular were used extensively in PBA, i.e., printed and distributed across multiple pages.

How to do this with the CBA ItemBuilder? The CBA ItemBuilder uses the concept of pages on which components are placed to implement individual interactions to structure assessments into items, tasks and units (see section 3.4). This makes it possible to implement Choice Interactions and Text Entry Interactions, among others, in any order, without the QTI interactions themselves having to be provided by the CBA ItemBuilder (see Figure 2.1).

Beyond simple items, items in general are defined by QTI as a set of interactions:19

For the purposes of QTI, an item is a set of interactions (possibly empty) collected together with any supporting material and an optional set of rules for converting the candidate’s response(s) into assessment outcomes.

Distribution of Items on Pages: The QTI standard also provides some guidance on how to split content with multiple interactions into items:20

To help determine whether or not a piece of assessment content that comprises multiple interactions should be represented as a single assessmentItem (known as a composite item in QTI) the strength of the relationship between the interactions should be examined. If they can stand alone then they may best be implemented as separate items, perhaps sharing a piece of stimulus material like a picture or a passage of text included as an object. If several interactions are closely related then they may belong in a composite item, but always consider the question of how easy it is for the candidate to keep track of the state of the item when it contains multiple related interactions. If the question requires the user to scroll a window on their computer screen just to see all the interactions then the item may be better re-written as several smaller related items.

How to do this with the CBA ItemBuilder? The CBA ItemBuilder provides a high degree of design freedom. Items do not have to be broken down into individual interactions. Components to capture responses can be freely placed on pages, and item authors can define how test-takers can switch between pages. However, the division of assessment content into individual components (referred to as Tasks in the context of the CBA ItemBuilder) is still necessary (see section 3.6 for details).

Two key points for the computerization of assessments can be derived from the QTI standard:

  1. The QTI standard defines basic interaction types. However, the combination of multiple items requires either scrolling (if all items are computerized using one page) or paging (i.e., additional interactive elements are required to allow the navigation between items, see section 2.4). The button View as single page with scrolling ... in Figure 2.1 illustrates the two possibilities.
How to do this with the CBA ItemBuilder? As shown in Figure 2.1, the CBA ItemBuilder technically supports paging and scrolling. However, paging is preferred as it allows item designs in which only one task is visible at a time. Moreover, the structure created by item content, such as Units, is important to consider when distributing items across pages and deciding on the required navigation.
  2. The standardization of computerized items goes beyond the definition of individual interactions. For instance, typical QTI items provide a submit button at the end of the page that contains one or multiple interactions (and the submit button is essential to consider, for instance, for the precise definition of response times, see section 2.2.2).21
How to do this with the CBA ItemBuilder? Assessment components created with the CBA ItemBuilder cannot be described with the QTI standard. However, the project files can be shared and archived, either with the actual task content or only as functional mockups with replaced content (referred to as Mock Item).

2.2.1 Mode Effects and Test-Equivalence

Items designed for collecting diagnostic information used to be printed on paper, and simple response formats such as choice interactions and text entry interactions were used to capture the products of test-takers answering items in paper-based assessments. As described in section 2.1, there are a number of advantages of computer-based assessment that distinguish this form of measurement from paper-based assessment. Until these advantages are exploited to a large extent, and in order to confirm that existing knowledge about constructs and items holds, research into the comparability of paper- and computer-based assessment is essential (e.g., Bugbee 1996; Clariana and Wallace 2002).

Properties of Test Administration: As described by Kroehne and Martens (2011), different sources of potential mode effects can be distinguished, and a detailed description of the properties of test administrations is required, since different computerizations are not necessarily identical either. Hence, instead of comparing new (CBA) versus old (PBA), the investigation of the different properties that are combined in any concrete assessment is required, for instance, to achieve reproducibility.

Item Difficulty and Construct Equivalence: A typical finding for mode effects in large-scale educational assessments is that items become more difficult when answered on a computer (e.g., for PISA, A. Robitzsch et al. 2020). From a conceptual point of view, a separation between the concept of mode effects and Differential Item Functioning (Feskens, Fox, and Zwitser 2019) might be possible, since properties of the test administration can be created by researchers and conditions with different properties can be randomly assigned to test-takers. Consequently, mode effects can be analyzed using random equivalent groups, and assessments can be made comparable, even if all items change with respect to their item parameters (Buerger et al. 2019). When advantages available only in computer-based assessment are utilized, the issue of mode effects fades into the background in favor of the issue of construct equivalence (e.g., Buerger, Kroehne, and Goldhammer 2016; Kroehne et al. 2019).

Mode effects might affect not only response correctness, but also the speed with which test-takers read texts or, more generally, work in computer-based tests (Kroehne, Hahnel, and Goldhammer 2019). Mode effects can also affect rapid guessing (e.g., Kroehne, Deribo, and Goldhammer 2020) and might occur more subtly, for instance, concerning short text responses due to the difference between writing and typing (Zehner et al. 2018).

Further research and a systematic review regarding mode effects should cover typing vs. writing (for instance, with respect to capitalization, text production, etc., Jung et al. 2019), different text input methods such as hardware keyboard vs. touch keyboard, different pointing devices such as mouse vs. touch, and scrolling vs. paging (e.g., Haverkamp et al. 2022).

2.2.2 Response Times and Time on Task

Various terms are used in the literature to describe how fast responses are provided to questions, tasks, or items. Dating back to Galton and Spearman (see references in Kyllonen and Zu 2016), Reaction Time measures have a long research tradition in the context of cognitive ability measurement. Prior to computer-based assessment, response times were either self-reported time measures or time measures taken by proctors or test administrators (e.g., Ebel 1953).

In recent years and in the context of computer-based assessment, Response Time is used to refer to time measures that can be understood as the time difference between the answer selection or answer submission and the onset of the item presentation (see Figure 2.2). However, a clear definition of how response times were operationalized in computer-based assessments is missing in many publications (e.g., Schnipke and Scrams 1997; Hornke 2005). If log data are used to measure the time, the term Time on Task is used (see, for instance, Goldhammer et al. 2014; Naumann and Goldhammer 2017; Reis Costa et al. 2021).

FIGURE 2.2: Example illustrating different operationalizations of Response Time (html|ib).
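The operationalization above — response time as the difference between the timestamp of the item onset and the timestamp of answer selection or submission — can be sketched from log events. The event names and log format are hypothetical and serve only to illustrate the definition:

```python
# Hypothetical log events: (timestamp in seconds, event name)
log = [
    (10.0, "itemOnset"),
    (14.5, "answerSelected"),
    (16.5, "answerSubmitted"),
]

def response_time(events, start="itemOnset", end="answerSubmitted"):
    """Time difference between the first 'end' event and the 'start' event."""
    t_start = next(t for t, e in events if e == start)
    t_end = next(t for t, e in events if e == end)
    return t_end - t_start

print(response_time(log))                         # 6.5
print(response_time(log, end="answerSelected"))   # 4.5 (alternative definition)
```

The two calls illustrate why the precise definition matters: depending on whether the answer selection or the answer submission is taken as the end point, different response times result for the same test-taker.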

Extreme Response Times and Idle Times: Response times have a natural lower limit. Response times shorter than expected for serious test-taking can be flagged using time thresholds if necessary (see section 2.5.3). Very long response times occur, for instance, if the test-taking process is (temporarily) interrupted or if the test-taking process takes place outside the assessment platform (in terms of thinking or note-taking on scratch paper). In particular, for unsupervised online test delivery (see section 7.2.1), long response times can also be caused by test-takers exiting the browser window or parallel interactions outside the test system. In order to gather information to allow informed trimming of long response times, it may be useful to record all interactions of test-takers that imply activity (such as mouse movements or keystrokes). Log events can then be used to detect idle times that occur when, for whatever reason, the test-taking is interrupted (see section 2.8).
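Detecting idle times from such activity events can be sketched as scanning the gaps between consecutive timestamps against a threshold. The threshold value and the event data below are hypothetical:

```python
def idle_periods(timestamps, threshold=30.0):
    """Return (start, end) pairs of gaps between consecutive activity
    events that exceed the threshold (in seconds), indicating possible
    idle time (interruptions or activity outside the test system)."""
    pairs = zip(timestamps, timestamps[1:])
    return [(a, b) for a, b in pairs if b - a > threshold]

# Timestamps of any activity events (mouse movements, keystrokes, clicks)
events = [0.0, 2.5, 3.0, 95.0, 96.0, 200.0]
print(idle_periods(events))   # [(3.0, 95.0), (96.0, 200.0)]
```

Detected idle periods can then be subtracted from raw response times before trimming, rather than discarding long response times wholesale.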

Response Times in Questionnaires: As described in Kroehne et al. (2019), response times cannot be compared without a precise definition, and alternative operationalizations, for instance the time difference between subsequent answer changes, are possible. In survey research, the term Response Latency is used (e.g., Mayerl 2013), both for time measures taken by interviewers and by the assessment software. However, as described by Reips (2010), the interpretation of time measures requires knowing which task, question, or item a test-taker or respondent is working on, and additional assumptions are required if test-takers can freely navigate between tasks or see multiple questions per screen. With additional assumptions, item-level response times can, however, be extracted from log data, as illustrated for the example of item batteries with multiple questions per screen in Figure 2.3.

FIGURE 2.3: Item illustrating Average Answering Time for item batteries (html|ib).
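For item batteries with multiple questions per screen, per-question times can be approximated from the log as differences between subsequent answer-change events, under the assumption that the test-taker works on exactly one question between consecutive answer changes. The log format and question identifiers below are hypothetical:

```python
# Log of answer-change events on one screen: (timestamp, question id)
changes = [(5.0, "q1"), (9.5, "q2"), (12.0, "q3"), (20.0, "q4")]
screen_onset = 1.0

def answering_times(onset, events):
    """Time attributed to each question: the difference between its
    answer change and the previous answer change (or the screen onset)."""
    times = {}
    previous = onset
    for t, qid in events:
        times[qid] = t - previous
        previous = t
    return times

times = answering_times(screen_onset, changes)
print(times)                              # {'q1': 4.0, 'q2': 4.5, 'q3': 2.5, 'q4': 8.0}
print(sum(times.values()) / len(times))   # 4.75 (average answering time)
```

Note that this attribution breaks down if a respondent revises answers out of order, which is exactly why the additional assumptions discussed above are required.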

Time Components from Log Data: Since there are now countless different computer-based tests, many software tools, and assessment implementations, the concept of response times requires a more precise operationalization. One possible definition of time measures, suggested as Time on Task, uses log events. Various response time measures can be extracted from log data using methods developed for log file analyses (see section 2.8). Depending on the item design, all response time measures may require assumptions for interpretation. For example, if items can be visited multiple times, the times must be cumulated over visits. However, this rests on the assumption that the test-taker thinks about answering the task each time it is visited. If multiple items are presented per screen and questions can be answered in any sequence, the assumption is necessary that each time a test-taker thinks about a question, this is followed by an answer change.
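Cumulating time on task over multiple visits can be sketched from paired enter/leave events per item. The log format and item identifiers are hypothetical; the sketch encodes the assumption that every visit counts as engagement with the item:

```python
# Log events: (timestamp, event, item id); items can be revisited
log = [
    (0.0,  "enter", "item1"),
    (30.0, "leave", "item1"),
    (30.0, "enter", "item2"),
    (50.0, "leave", "item2"),
    (50.0, "enter", "item1"),   # second visit to item1
    (65.0, "leave", "item1"),
]

def time_on_task(events):
    """Cumulate visit durations per item, assuming the test-taker
    engages with the item on every visit."""
    totals, open_visits = {}, {}
    for t, event, item in events:
        if event == "enter":
            open_visits[item] = t
        elif event == "leave":
            totals[item] = totals.get(item, 0.0) + (t - open_visits.pop(item))
    return totals

print(time_on_task(log))   # {'item1': 45.0, 'item2': 20.0}
```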

2.2.3 Time Limits (Restrict Maximum Time)

Traditionally, in paper-based large-scale assessments, time limits were mainly used for tests or booklets. Restricting the time for a particular test or booklet has the practical advantage that this type of time limit can also be controlled and implemented in group-based test administrations. A similar procedure can also be implemented in computer-based assessment: a time limit is defined for the processing of several items (e.g., for the complete test, a sub-test, or a section), and after the time limit has expired, the tasks for that test part can no longer be answered.

In contrast, computer-based testing also allows the implementation of time limits for individual tasks or small groups of items (e.g., units). The advantage is obvious: While time limits at the test or booklet level can, for various reasons, result in large inter-individual differences in the number of visited items (for instance, due to individual test-taking strategies or individual items that a test person is stuck on), time limits at the item level can be implemented in such a way that all persons can see each item for at least a certain amount of time. Computer-based assessment thus allows differentiating between time limits at the item level and time limits at the test level. In between, time limits for item bundles, e.g., units, can be created. If the comparability of psychometric properties of an assessment to an earlier paper-based form is not necessary, or if this comparability can be established, for example, on the basis of a linking study, then time limits can be used purposefully in computer-based assessment to design the data collection. For example, if a test is administered in only one predetermined order (i.e., no booklet design), time limits at the test level will result in not-reached items depending on the item position.
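The effect of a test-level time limit on not-reached items, given a fixed item order, can be illustrated with a small sketch (the timing data are hypothetical):

```python
def not_reached(item_times, time_limit):
    """Given the time (in seconds) a test-taker needs per item in a fixed
    order, return the positions of items that are not reached before the
    test-level time limit expires."""
    elapsed = 0.0
    for position, t in enumerate(item_times):
        elapsed += t
        if elapsed > time_limit:
            # All items after the one on which the limit expires are not reached
            return list(range(position + 1, len(item_times)))
    return []

# A slow test-taker exceeds the 240 s limit on the third item,
# so the fourth and fifth items (positions 3 and 4) are not reached
print(not_reached([60, 90, 120, 45, 45], time_limit=240))   # [3, 4]
```

With item-level limits instead, the per-item time is capped individually, so every position is reached by every test-taker, which is the contrast described above.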

How to do this with the CBA ItemBuilder? Time limits for tasks (i.e., single items or groups of items) can be implemented with the CBA ItemBuilder (see section 6.4.5 for an example). Time limits across multiple items can be defined at the level of the test delivery (see section 7.2.8).

Time limits do not only restrict test-taking; they also give test-takers feedback on their individually chosen pace of working on the task (e.g., Goldhammer 2015). Different possibilities to give feedback during the assessment are described in section 2.9.1. The item design of Blocked Item Response (see section 2.4.1) can be used to enforce a minimum time for individual items.

References

Buerger, Sarah, Ulf Kroehne, and Frank Goldhammer. 2016. “The Transition to Computer-Based Testing in Large-Scale Assessments: Investigating (Partial) Measurement Invariance Between Modes.”
Buerger, Sarah, Ulf Kroehne, Carmen Koehler, and Frank Goldhammer. 2019. “What Makes the Difference? The Impact of Item Properties on Mode Effects in Reading Assessments.” Studies in Educational Evaluation 62: 1–9. https://doi.org/10.1016/j.stueduc.2019.04.005.
Bugbee, Alan C. 1996. “The Equivalence of Paper-and-Pencil and Computer-Based Testing.” Journal of Research on Computing in Education 28 (3): 282.
Clariana, Roy, and Patricia Wallace. 2002. “Paperbased Versus Computerbased Assessment: Key Factors Associated with the Test Mode Effect.” British Journal of Educational Technology 33 (5): 593–602. https://doi.org/10.1111/1467-8535.00294.
Ebel, Robert L. 1953. “The Use of Item Response Time Measurements in the Construction of Educational Achievement Tests.” Educational and Psychological Measurement 13 (3): 391–401. https://doi.org/10.1177/001316445301300303.
Feskens, Remco, Jean-Paul Fox, and Robert Zwitser. 2019. “Differential Item Functioning in PISA Due to Mode Effects.” In Theoretical and Practical Advances in Computer-Based Educational Measurement, edited by Bernard P. Veldkamp and Cor Sluijter, 231–47. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-18480-3_12.
Goldhammer, Frank. 2015. “Measuring Ability, Speed, or Both? Challenges, Psychometric Solutions, and What Can Be Gained From Experimental Control.” Measurement: Interdisciplinary Research and Perspectives 13 (3-4): 133–64. https://doi.org/10.1080/15366367.2015.1100020.
Goldhammer, Frank, Johannes Naumann, Annette Stelter, Krisztina Tóth, Heiko Rölke, and Eckhard Klieme. 2014. “The Time on Task Effect in Reading and Problem Solving Is Moderated by Task Difficulty and Skill: Insights from a Computer-Based Large-Scale Assessment.” Journal of Educational Psychology 106 (3): 608–26. https://doi.org/10.1037/a0034716.
Haverkamp, Ymkje E, Ivar Bråten, Natalia Latini, and Ladislao Salmerón. 2022. “Is It the Size, the Movement, or Both? Investigating Effects of Screen Size and Text Movement on Processing, Understanding, and Motivation When Students Read Informational Text,” 20.
Hornke, Lutz F. 2005. “Response Time in Computer-Aided Testing: A Verbal Memory Test for Routes and Maps,” 14.
Jung, Stefanie, Juergen Heller, Korbinian Moeller, and Elise Klein. 2019. “Mode Effect: An Issue of Perspective? Writing Mode Differences in a Spelling Assessment in German Children with and Without Developmental Dyslexia,” 38.
Kroehne, Ulf, Sarah Buerger, Carolin Hahnel, and Frank Goldhammer. 2019. “Construct Equivalence of PISA Reading Comprehension Measured With Paper-Based and Computer-Based Assessments.” Educational Measurement: Issues and Practice, July, emip.12280. https://doi.org/10.1111/emip.12280.
Kroehne, Ulf, Tobias Deribo, and Frank Goldhammer. 2020. “Rapid Guessing Rates Across Administration Mode and Test Setting.” Psychological Test and Assessment Modeling 62 (2): 147–77.
Kroehne, Ulf, Carolin Hahnel, and Frank Goldhammer. 2019. “Invariance of the Response Processes Between Gender and Modes in an Assessment of Reading.” Frontiers in Applied Mathematics and Statistics 5: 2. https://doi.org/10.3389/fams.2019.00002.
Kroehne, Ulf, and Thomas Martens. 2011. “Computer-Based Competence Tests in the National Educational Panel Study: The Challenge of Mode Effects.” Zeitschrift Für Erziehungswissenschaft 14 (S2): 169–86. https://doi.org/10.1007/s11618-011-0185-4.
Kyllonen, Patrick, and Jiyun Zu. 2016. “Use of Response Time for Measuring Cognitive Ability.” Journal of Intelligence 4 (4): 14. https://doi.org/10.3390/jintelligence4040014.
Mayerl, Jochen. 2013. “Response Latency Measurement in Surveys. Detecting Strong Attitudes and Response Effects.” Survey Methods: Insights from the Field (SMIF).
Naumann, Johannes, and Frank Goldhammer. 2017. “Time-on-Task Effects in Digital Reading Are Non-Linear and Moderated by Persons’ Skills and Tasks’ Demands.” Learning and Individual Differences 53: 1–16.
“Question and Test Interoperability (QTI): Implementation Guide.” 2022. http://www.imsglobal.org/question/qtiv2p2/imsqti_v2p2_impl.html.
Reips, Ulf-Dietrich. 2010. “Design and Formatting in Internet-based Research.” In Advanced Methods for Conducting Online Behavioral Research, edited by S. Gosling and J. Johnson, 29–43. Washington, DC: American Psychological Association.
Reis Costa, Denise, Maria Bolsinova, Jesper Tijmstra, and Björn Andersson. 2021. “Improving the Precision of Ability Estimates Using Time-On-Task Variables: Insights From the PISA 2012 Computer-Based Assessment of Mathematics.” Frontiers in Psychology 12 (March): 579128. https://doi.org/10.3389/fpsyg.2021.579128.
Robitzsch, A., O. Lüdtke, F. Goldhammer, Ulf Kroehne, and O. Köller. 2020. “Reanalysis of the German PISA Data: A Comparison of Different Approaches for Trend Estimation with a Particular Emphasis on Mode Effects.” Frontiers in Psychology 11 (884). https://doi.org/10.3389/fpsyg.2020.00884.
Schnipke, Deborah L., and David J. Scrams. 1997. “Modeling Item Response Times With a Two-State Mixture Model: A New Method of Measuring Speededness.” Journal of Educational Measurement 34 (3): 213–32.
Striewe, Michael, and Matthias Kramer. 2018. “Empirische Untersuchungen von Lückentext-Items Zur Beherrschung Der Syntax Einer Programmiersprache.” Commentarii Informaticae Didacticae, no. 12: 101–15.
Zehner, Fabian, Frank Goldhammer, Emily Lubaway, and Christine Sälzer. 2018. “Unattended Consequences: How Text Responses Alter Alongside PISA’s Mode Change from 2012 to 2015.” Education Inquiry, October, 1–22. https://doi.org/10.1080/20004508.2018.1518080.

  1. “Question and Test Interoperability (QTI): Implementation Guide” (2022)↩︎

  2. Note that other components can also be used to implement choice interactions; see section 3.9.7 and section 3.9.10 for examples.↩︎

  3. The shuffle-option is, if required, possible only with the help of dynamic components of the CBA ItemBuilder, as described in chapter 4 (see section 6.4.10 for an example).↩︎

  4. The CBA ItemBuilder allows including advanced text editors as an ExternalPageFrame, if required (see section 6.6.2).↩︎

  5. https://www.imsglobal.org/question/qtiv2p2/imsqti_v2p2_impl.html#h.lkeh7elhvt2n↩︎

  6. http://www.imsglobal.org/question/qtiv2p2/imsqti_v2p2_impl.html#3.1↩︎

  7. See the QTI standard for the End Attempt interaction.↩︎