: A Multi-Domain Reinforcement Learning Environment for Training LLM-Based Project Planners
Effective project planning requires reasoning about task dependencies, resource constraints, actor availability, and temporal dynamics; large language models (LLMs) currently exhibit these capabilities only in limited and unreliable ways. We introduce , a reinforcement learning environment designed to train and evaluate LLM-based agents on resource-aware planning across four structurally distinct domains: software release management, hardware production scaling, compute program scheduling, and scientific research coordination. Our environment features a unified interface over domain-specific semantics, realistic stochasticity models (log-normal actor variance, Gaussian resource SLAs), dynamic re-planning triggers, and dual mutex contention on both actors and resources. We formalize the planning problem as a semi-Markov decision process with structured action spaces (JSON plans with diff-based revisions) and propose a GRPO-based training procedure for 32B-parameter LLMs. We construct a benchmark of 50 problems (25 medium-scale, 25 large-scale) and evaluate against classical solvers (CP-SAT, ILP), heuristic schedulers, and learned baselines. Our results demonstrate that [expected: LLM planners match solver performance on tractable instances while scaling to problem sizes where solvers fail, with significant improvements in resource utilization and contention reduction compared to heuristic baselines]. Code and benchmarks are available at [URL].
Unknown Authors · Dec 25, 2025
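The abstract names two stochasticity models, log-normal actor variance and Gaussian resource SLAs, without showing how draws from them might look. The sketch below illustrates one plausible sampling scheme under those distributional assumptions; all function and parameter names (sample_task_duration, actor_sigma, sla_hours, and so on) are hypothetical and are not taken from the environment's actual API.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_task_duration(base_hours: float, actor_sigma: float = 0.25) -> float:
    """Sample a realized task duration from a log-normal distribution.

    The distribution is scaled so its median equals `base_hours`;
    `actor_sigma` controls per-actor execution variance.
    (Hypothetical names and parameterization, for illustration only.)
    """
    return base_hours * rng.lognormal(mean=0.0, sigma=actor_sigma)

def sample_resource_delay(sla_hours: float, sla_std: float = 2.0) -> float:
    """Sample a resource delivery time from a Gaussian SLA model,
    clipped at zero so delays are never negative."""
    return max(0.0, rng.normal(loc=sla_hours, scale=sla_std))

# Example: a task nominally estimated at 8 hours, waiting on a resource
# with a 24-hour delivery SLA.
duration = sample_task_duration(8.0)
delay = sample_resource_delay(24.0)
print(f"realized duration: {duration:.1f} h, resource ready after {delay:.1f} h")
```

A log-normal keeps durations strictly positive and right-skewed (occasional large overruns), while the Gaussian SLA captures symmetric jitter around a promised delivery time; whether the environment parameterizes them exactly this way is an assumption here.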