Decision Policy Optimization for Human–AI Collaboration Using Off-Policy Reinforcement Learning from Logged Interaction Data

Hery Hery
Ariel Christopher Wawolangi

Abstract

This paper investigates offline policy optimization for adaptive learning using logged student interaction traces, targeting reliable improvement without online exploration. A conservative offline reinforcement learning pipeline is implemented with calibrated behavior-policy propensities and doubly robust off-policy evaluation. Using 128,640 student trajectories (2.94 million events) with a 32-dimensional state representation and 12 pedagogical actions, the optimized policy achieved a return improvement of +0.042 over a supervised next-item baseline under doubly robust estimation, with a bootstrap confidence width of ±0.021. Self-normalized estimators produced consistent rankings, reporting a +0.041 improvement with comparable uncertainty. Performance gains were horizon-stable and concentrated in medium horizons: the improvement increased from +0.012 at 1 step to +0.055 at 5 steps and remained positive through 10 steps. Safety analysis showed a shift toward better-supported actions, increasing mean action support from 0.31 to 0.44 and reducing the low-support decision rate from 0.18 to 0.06. Uncertainty pruning was activated on 11% of decisions, decreasing the high-uncertainty rate from 0.22 to 0.08 and reducing the maximum importance weight from 14.7 to 9.3, while effective sample size increased by 908. Student-level stratification indicated the strongest gains for mid-mastery and mid-engagement learners (mean improvement 0.046, median 0.044), with smaller but consistent benefits for high-mastery learners driven by reduced repetition rather than correctness shifts. Ablation results confirmed that conservatism and pruning are complementary: removing conservatism increased tail risk and widened confidence intervals, while removing pruning increased evaluation variance despite similar mean return.
These findings demonstrate that evidence-constrained offline reinforcement learning can produce deployable adaptive policies with measurable improvements and quantifiable safety guarantees under logged-data constraints.
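
The estimators named in the abstract can be illustrated with a minimal sketch. It is a one-step (bandit-style) simplification, not the authors' multi-step pipeline. The function names and all inputs are hypothetical: importance weights (target-policy probability divided by behavior-policy propensity), logged rewards, and reward-model predictions are assumed to be precomputed arrays.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def snips_value(rewards, weights):
    """Self-normalized importance sampling estimate of the target policy value."""
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.sum(w * r) / np.sum(w)

def doubly_robust_value(rewards, weights, q_logged, v_target):
    """One-step doubly robust estimate (illustrative simplification).

    q_logged: model-predicted reward for the logged action (assumed input)
    v_target: model-predicted expected reward under the target policy (assumed input)
    The importance-weighted residual corrects the model-based term.
    """
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(weights, dtype=float)
    q = np.asarray(q_logged, dtype=float)
    v = np.asarray(v_target, dtype=float)
    return np.mean(v + w * (r - q))
```

In this form, a smaller maximum importance weight tends to raise the effective sample size, which is the kind of relationship the pruning results describe. When the reward model is exact on the logged actions, the correction term vanishes and the doubly robust estimate reduces to the model-based value.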

Article Details

How to Cite
[1]
H. Hery and A. C. Wawolangi, “Decision Policy Optimization for Human–AI Collaboration Using Off-Policy Reinforcement Learning from Logged Interaction Data”, Int. J. Appl. Inf. Manag., vol. 6, no. 2, pp. 272–289, Jun. 2026.
Section
Articles