Measuring the Impact of Human–AI Collaborative Personalized Interventions through Temporal Causal Inference

Main Article Content

Ahmed Bahurmuz

Abstract

Adaptive learning platforms frequently report performance improvements, yet many evaluations remain vulnerable to time-varying confounding because interventions are triggered by evolving learner states. This study evaluates three intervention families (adaptive sequencing, targeted hints, and remediation triggers) using a longitudinal causal framework with horizon-locked outcomes and learner-level cross-fitting. The analytic cohort includes 2,480 learners and 118,640 decision points observed across 12 instructional weeks, with a median of 41 decisions per learner. Intervention exposure rates per 100 decisions are 38.6 for sequencing, 24.1 for hints, and 8.9 for remediation, with higher targeting intensity in low-mastery strata. Causal estimates show distinct temporal signatures by intervention mechanism. Targeted hints yield the largest same-session improvement, increasing mastery by 2.4 points, but their effects attenuate to 1.3 points at 7 days and 0.9 points at 14 days. Adaptive sequencing provides more stable medium-horizon benefits, improving mastery by 1.6 points same-session, 2.8 points at 7 days, and 2.2 points at 14 days. Remediation triggers demonstrate delayed consolidation, increasing mastery by 1.1 points same-session, 3.4 points at 7 days, and 4.1 points at 14 days, albeit with wider uncertainty consistent with lower overlap and late-course concentration. Heterogeneity analyses at the 7-day horizon indicate that sequencing effects peak for mid-mastery learners, reaching 3.9 points under high engagement versus 3.4 points under low engagement, while hints are most effective for low-mastery learners with low engagement (1.6 points) and decline sharply for high-mastery learners with high engagement (0.4 points). Remediation remains meaningful across strata, reaching 3.6 points for mid-mastery learners with high engagement and 2.3 points for high-mastery learners with high engagement, supporting a diagnostic targeting interpretation rather than uniform escalation. Robustness and diagnostic checks support internal validity. After weighting, standardized mean differences for key confounders fall to 0.05–0.09, and placebo effects on pre-decision outcome change remain near zero (absolute value ≤ 0.05) across all intervention types. Overlap trimming of the lowest 5% of support preserves the ranking of interventions, with only modest attenuation for remediation, and effective sample size remains adequate for sequencing and hints while declining for remediation in late decision indices. These findings justify a tiered deployment strategy in which sequencing is the default optimization lever, hints are constrained to high-instability episodes and paired with post-hint practice allocation, and remediation is gated by high-confidence misconception signals with overlap and effective-sample-size monitoring.
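The two weighting diagnostics named in the abstract, standardized mean differences after weighting and effective sample size, can be illustrated with a minimal sketch. This is not the study's code: the function names `weighted_smd` and `kish_ess` are hypothetical, the pooled standard deviation from the unweighted arms is one common convention rather than the one confirmed by the article, and the effective sample size uses Kish's approximation, which the abstract does not specify.

```python
import numpy as np

def weighted_smd(x, treated, weights):
    """Standardized mean difference of covariate x between arms after weighting.

    Assumption: the denominator is the pooled SD of the unweighted arms.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    t = np.asarray(treated).astype(bool)
    mean_treated = np.average(x[t], weights=w[t])
    mean_control = np.average(x[~t], weights=w[~t])
    pooled_sd = np.sqrt((x[t].var(ddof=1) + x[~t].var(ddof=1)) / 2.0)
    return (mean_treated - mean_control) / pooled_sd

def kish_ess(weights):
    """Kish effective sample size: (sum of weights)^2 / sum of squared weights."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Toy usage with made-up values (not data from the study)
mastery = [0.2, 0.4, 0.6, 0.8]
got_hint = [1, 1, 0, 0]
ipw = [1.0, 2.0, 2.0, 1.0]
print(weighted_smd(mastery, got_hint, ipw), kish_ess(ipw))
```

Under this convention, an absolute standardized mean difference below roughly 0.1 is often read as adequate balance, and an effective sample size far below the nominal count signals that a few large weights dominate the estimate.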

Article Details

How to Cite
[1]
A. Bahurmuz, “Measuring the Impact of Human–AI Collaborative Personalized Interventions through Temporal Causal Inference”, Int. J. Appl. Inf. Manag., vol. 6, no. 2, pp. 325–342, Jun. 2026.
Section
Articles