Paper reproduction status: what irtsim can and cannot reproduce

Scope

One of irtsim’s two stated goals is paper reproducibility: users should be able to run the Monte Carlo examples from Schroeders and Gnambs (2025), “Sample Size Planning for Item Response Models: A Tutorial for the Quantitative Researcher” (https://ulrich-schroeders.github.io/IRT-sample-size/), and compare irtsim output against the published reference.

This vignette is the honest scorecard for that goal as of the current release. It documents which of the paper’s three examples irtsim can reproduce end-to-end today, which it cannot, and what architectural changes would close the remaining gaps.

Summary table

| Example | Paper scenario | irtsim status | Where to find it |
|---|---|---|---|
| Example 1 | 1PL linked two-form design, 30 items, 438 MC iterations, MSE criterion | Reproducible | vignette("paper-example-1-linked-design") |
| Example 2 | 2PL with bivariate θ + external criterion, MCAR, SE of cor(θ, criterion) via TAM latent-regression β | Not reproducible — architectural gaps | See “Example 2 gap” below |
| Example 3 | GRM with leave-one-measure-out, RMSE of testinfo(mod, θ = 2.0) / (1 + testinfo(mod, θ = 2.0)) | Not reproducible — fitted-model access gap | See “Example 3 gap” below |

Example 2 gap (personality test / external criterion)

The paper’s Example 2 constructs a bivariate latent trait (θ, external criterion ξ) with population correlation ρ = 0.5 and generates responses to 30 2PL items. It then applies MCAR missingness at 0%/33%/67%, fits the model with TAM::tam.mml() (or tam.mml.2pl()) using the external criterion as a latent regressor, and extracts SE(cor(θ, ξ)) from tam.se(mod)$beta[2, 2].
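For concreteness, that recipe can be sketched outside irtsim roughly as follows. This is a stand-in, not the paper’s published script: the sample size, item parameters, and seed are placeholder assumptions, the MCAR step is omitted, and only the TAM calls mirror the paper’s approach.

```r
library(MASS)  # mvrnorm
library(TAM)

set.seed(2024)
n_persons <- 500
n_items   <- 30

# Bivariate (theta, criterion) with population correlation rho = 0.5
Sigma  <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
latent <- mvrnorm(n_persons, mu = c(0, 0), Sigma = Sigma)
theta     <- latent[, 1]
criterion <- latent[, 2]

# 2PL responses: P(x = 1) = plogis(a * (theta - b))
a <- runif(n_items, 0.8, 2.0)
b <- rnorm(n_items)
p <- plogis(outer(theta, seq_len(n_items), function(th, j) a[j] * (th - b[j])))
resp <- matrix(rbinom(length(p), 1, p), n_persons, n_items)

# 2PL fit with the external criterion as a latent regressor (Y),
# then the SE table the paper reads beta[2, 2] from
mod <- TAM::tam.mml.2pl(resp, Y = data.frame(criterion))
se  <- TAM::tam.se(mod)
se$beta
```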

What irtsim is missing:

  1. Bivariate θ generation. irt_design() accepts a single theta_dist (a distribution name or a function producing a univariate θ vector). There is no way to declare a bivariate (θ, external criterion) generating distribution.
  2. External-covariate plumbing. fit_model() wraps mirt::mirt() directly and has no provision for passing an external criterion column through to a latent-regression estimator.
  3. Latent-regression backend. The paper uses TAM, which irtsim does not depend on. mirt supports latent regression through mirt(..., covdata = ...), but irtsim does not expose that surface.
  4. Criterion-callback signature. criterion_fn in summary.irt_results() receives only (estimates, true_value, ci_lower, ci_upper, converged) — all item-scoped. There is no way to hand the callback a fitted model, a θ estimate vector, or an external covariate vector, so even an ad hoc post-hoc reconstruction of the paper’s β SE is not expressible today.
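For reference, the mirt latent-regression surface mentioned in point 3 — the one irtsim does not expose — looks roughly like this. The simulated data here is a minimal stand-in (10 items, 300 persons), not the paper’s design:

```r
library(mirt)
library(MASS)

set.seed(2024)
# Minimal stand-in data: bivariate (theta, criterion), 10 dichotomous items
latent <- mvrnorm(300, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2))
a <- matrix(rlnorm(10, 0.2, 0.2))
d <- matrix(rnorm(10))
resp <- simdata(a, d, 300, itemtype = "dich", Theta = latent[, 1, drop = FALSE])

# Latent regression of theta on the external criterion via covdata/formula
mod <- mirt(resp, 1, itemtype = "2PL",
            covdata = data.frame(crit = latent[, 2]),
            formula = ~ crit, verbose = FALSE)
coef(mod, simplify = TRUE)  # the latent-regression slope appears in the output
```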

Obj 30 in the project backlog tracks closing this gap.

Example 3 gap (GRM conditional reliability at θ = 2.0)

The paper’s Example 3 calibrates a 50-item GRM composed of three clinical symptom scales and computes, per iteration, testinfo(mod, Theta = 2.0) / (testinfo(mod, Theta = 2.0) + 1) — the conditional reliability at a target θ value. The criterion reported is the RMSE of that quantity across iterations.
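A minimal mirt sketch of that per-iteration quantity — the small simulated dataset is only a stand-in for the paper’s 50-item, three-scale calibration:

```r
library(mirt)

set.seed(2024)
# Stand-in data: 5 graded-response items with 4 ordered categories each
a <- matrix(rlnorm(5, 0.2, 0.2))
d <- t(apply(matrix(rnorm(5 * 3), 5), 1, sort, decreasing = TRUE))
dat <- simdata(a, d, 500, itemtype = "graded")

mod  <- mirt(dat, 1, itemtype = "graded", verbose = FALSE)
info <- testinfo(mod, Theta = matrix(2.0))  # test information at theta = 2
info / (info + 1)                           # conditional reliability at theta = 2
```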

What irtsim is missing:

  1. Access to the fitted model. run_one_iteration() in irt_simulate.R calls extract_params() immediately after fit_model() returns and then discards the fitted mirt object. Nothing downstream has access to mod for test-info queries.
  2. Custom quantity extraction per iteration. There is no hook analogous to extract_fn(mod, data) that would let a user pull arbitrary quantities out of the fitted model (test info at θ, reliability at θ, discriminant validity, etc.) and attach them to the per-iteration result store.
  3. Post-hoc criterion on a non-parameter quantity. Even with criterion_fn in summary.irt_results(), the callback sees only item-level parameter estimates — not per-iteration scalar extracts that depend on the fitted model.

Obj 31 in the project backlog tracks closing this gap.

Why the gaps exist

irtsim’s pipeline was originally scoped tightly around item-parameter recovery — estimands like bias, MSE, RMSE, coverage, and empirical SE on a, b, b1..bk. That scope is well served by the current (estimates, true_value, ci_lower, ci_upper) interface. The paper’s Examples 2 and 3, however, target estimands that are not item-scoped parameters at all:

  1. Example 2’s estimand is the standard error of a latent-regression coefficient — a population-level quantity that presupposes a bivariate generating distribution, an external covariate, and access to the fitted model’s regression output.
  2. Example 3’s estimand is a scalar function of the whole fitted model (test information, and hence conditional reliability, at θ = 2.0), not any single item’s parameter estimate.

Closing both gaps with a narrow, one-off hook for each estimand would entangle irtsim with TAM and mirt::testinfo and would keep adding scope for every new paper example that showed up.

Preferred architectural direction

The project is considering a single pluggable-hook addition that subsumes both gaps without hard-coding any specific backend:

irt_simulate(
  study,
  iterations = 438,
  seed       = 2024,
  parallel   = TRUE,
  fit_fn     = my_fit_fn,     # user-supplied: fits a model their way
  extract_fn = my_extract_fn  # user-supplied: returns a named list of
                              # per-iteration scalars / vectors
)

The returned irt_results object would gain a third store alongside item_results and theta_results — call it extracted_results — in which each named output of extract_fn becomes a column. Users would then write their own summary/recommended_n logic on that slot, or irtsim would provide a thin convenience that computes Morris-style criteria on each extracted column.
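Under that proposal, the user-supplied pair for Example 3 might look like the following sketch. The fit_fn/extract_fn names and their (data, design) / (mod, data) signatures are assumptions about an API that does not ship yet:

```r
# Hypothetical contract — not current irtsim API
my_fit_fn <- function(data, design) {
  mirt::mirt(data, 1, itemtype = "graded", verbose = FALSE)
}

my_extract_fn <- function(mod, data) {
  info <- mirt::testinfo(mod, Theta = matrix(2.0))
  # Each named element becomes one column in extracted_results
  list(cond_reliability_at_2 = info / (info + 1))
}
```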

This approach:

  1. Subsumes both Example 2 and Example 3 with one mechanism — any quantity a user can pull from a fitted model becomes a first-class simulation output.
  2. Keeps irtsim free of hard dependencies on TAM, mirt::testinfo, or any other backend-specific surface.
  3. Stops the scope creep of adding a bespoke hook for every new paper example.

The tradeoff is that the fit/extract contract becomes a new public surface that has to be documented, versioned, and tested.

Alternative: compose irtsim with external tools

Even without the pluggable hook, the paper’s examples can be partially reproduced by using irtsim for the pieces it handles well and stitching in external code for the rest:

  1. Let irtsim handle what it already does well: design declaration, response generation for a univariate θ, seeding, replication bookkeeping, and item-parameter criteria.
  2. Handle the remainder in a short external script — the bivariate (θ, ξ) draw and TAM latent-regression fit for Example 2, or the mirt::testinfo query on a refitted model for Example 3 — and aggregate the criterion by hand.

This is a reasonable pattern while Obj 30 / Obj 31 are deferred. It also argues against inflating irtsim’s dependency surface.

What Example 1 buys users today

If your goal is a published-reference comparison for irtsim’s core workflow — linked-test design, Rasch fitting, MSE of item difficulty vs. sample size — Example 1 gives you that comparison as a rendered vignette today. See vignette("paper-example-1-linked-design") for the faithful reproduction and a table of the design decisions mapped from the paper onto irtsim’s API.

References

Schroeders, U., & Gnambs, T. (2025). Sample size planning for item response models: A tutorial for the quantitative researcher. Companion code: https://ulrich-schroeders.github.io/IRT-sample-size/.