Decision Policy Optimization for Human–AI Collaboration Using Off-Policy Reinforcement Learning from Logged Interaction Data
Abstract
This paper investigates offline policy optimization for adaptive learning from logged student interaction traces, targeting reliable improvement without online exploration. A conservative offline reinforcement learning pipeline is implemented with calibrated behavior-policy propensities and doubly robust off-policy evaluation. Using 128,640 student trajectories (2.94 million events) with a 32-dimensional state representation and 12 pedagogical actions, the optimized policy achieved a return improvement of +0.042 over a supervised next-item baseline under doubly robust estimation, with a bootstrap confidence width of ±0.021. Self-normalized estimators produced consistent rankings, reporting a +0.041 improvement with comparable uncertainty. Gains were stable across horizons and concentrated at medium horizons: improvement rose from +0.012 at 1 step to +0.055 at 5 steps and remained positive through 10 steps. Safety analysis showed a shift toward better-supported actions, increasing mean action support from 0.31 to 0.44 and reducing the low-support decision rate from 0.18 to 0.06. Uncertainty pruning was triggered on 11% of decisions, decreasing the high-uncertainty rate from 0.22 to 0.08 and reducing the maximum importance weight from 14.7 to 9.3, while effective sample size increased by 908. Student-level stratification indicated the strongest gains for mid-mastery and mid-engagement learners (mean improvement 0.046, median 0.044), with smaller but consistent benefits for high-mastery learners driven by reduced repetition rather than correctness shifts. Ablation results confirmed that conservatism and pruning are complementary: removing conservatism increased tail risk and widened confidence intervals, while removing pruning increased evaluation variance despite a similar mean return.
These findings demonstrate that evidence-constrained offline reinforcement learning can produce deployable adaptive policies with measurable improvements and quantifiable safety guarantees under logged-data constraints.
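For readers unfamiliar with the estimators named in the abstract, the following is a minimal sketch of a self-normalized importance-sampling estimate, a doubly robust estimate, and the effective sample size diagnostic. It is not the authors' implementation. It assumes the simplest single-step (contextual-bandit) form rather than the multi-step trajectories used in the paper, all function and variable names are hypothetical, and the toy numbers are illustrative and not taken from the study.

```python
from typing import Sequence


def effective_sample_size(weights: Sequence[float]) -> float:
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / total_sq if total_sq > 0 else 0.0


def snips_estimate(weights: Sequence[float], rewards: Sequence[float]) -> float:
    """Self-normalized importance sampling: weighted rewards divided by total weight."""
    total = sum(weights)
    if total == 0:
        raise ValueError("importance weights sum to zero")
    return sum(w * r for w, r in zip(weights, rewards)) / total


def doubly_robust_estimate(
    weights: Sequence[float],
    rewards: Sequence[float],
    q_logged: Sequence[float],
    v_target: Sequence[float],
) -> float:
    """Single-step doubly robust estimate.

    weights  : pi_target(a|s) / pi_behavior(a|s) for each logged action
    rewards  : observed reward for each logged action
    q_logged : model-predicted reward for the logged (state, action) pair
    v_target : model-predicted reward averaged over the target policy's actions
    """
    n = len(rewards)
    if n == 0:
        raise ValueError("no logged samples")
    # Model baseline plus an importance-weighted correction of the model's error.
    terms = (v + w * (r - q) for w, r, q, v in zip(weights, rewards, q_logged, v_target))
    return sum(terms) / n


# Toy data (assumed, for illustration only).
toy_weights = [0.5, 1.0, 2.0, 0.5]
toy_rewards = [1.0, 0.0, 1.0, 0.0]
toy_q_logged = [0.5, 0.5, 0.5, 0.5]
toy_v_target = [0.6, 0.6, 0.6, 0.6]

if __name__ == "__main__":
    print("max weight:", max(toy_weights))
    print("ESS:", effective_sample_size(toy_weights))
    print("SNIPS:", snips_estimate(toy_weights, toy_rewards))
    print("DR:", doubly_robust_estimate(toy_weights, toy_rewards, toy_q_logged, toy_v_target))
```

In this form, a few very large importance weights dominate the sum of squared weights and pull the effective sample size down, which is why a lower maximum weight and a higher effective sample size are reported together as evaluation-stability diagnostics.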
Article Details

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with the International Journal for Applied Information Management agree to the following terms: Authors retain copyright and grant the International Journal for Applied Information Management right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) that allows others to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) the work for any purpose, even commercially, with an acknowledgement of the work's authorship and initial publication in the International Journal for Applied Information Management. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in the International Journal for Applied Information Management. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access).