Most LLM agents are trained to maximize task success. They can resolve GitHub issues or answer deep-research questions, but they do not reason carefully about when to ask the user questions or how to respect different interaction preferences. How can we design LLM agents that know when to ask better questions and how to tailor their behavior to each individual user?
A team of researchers from Carnegie Mellon University (CMU) and OpenHands formalizes these missing behaviors as 3 joint objectives, Productivity, Proactivity, and Personalization, and optimizes them with a multi-objective reinforcement learning framework called PPP inside a new environment named UserVille.

From task success to interaction-aware agents
The research team defines:
- Productivity: the quality of task completion, for example F1 for function localization on SWE-Bench Verified or exact match on BrowseComp-Plus.
- Proactivity: asking necessary clarifying questions when the initial prompt is vague, while avoiding unnecessary questions.
- Personalization: adhering to the user's specific interaction preferences, such as brevity, format, or language (a scoring sketch follows this list).
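To make the three scores concrete, here is a minimal scoring sketch for a single session. The `Session` fields and the handling of zero-question sessions are illustrative assumptions, not the paper's actual evaluation code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Session:
    task_metric: float                    # e.g. F1 (SWE-Func-Loc) or exact match (BrowseComp-Plus)
    question_efforts: List[str]           # simulator labels: "low" | "medium" | "high"
    followed_preference: Optional[bool]   # None if the agent never asked a question


def productivity(s: Session) -> float:
    # Task-completion quality, taken directly from the benchmark metric.
    return s.task_metric


def proactivity(s: Session) -> float:
    # 1 if every question the agent asked was low effort for the user, else 0.
    return 1.0 if all(e == "low" for e in s.question_efforts) else 0.0


def personalization(s: Session) -> Optional[float]:
    # Only defined for sessions where the agent asked at least one question.
    if s.followed_preference is None:
        return None
    return 1.0 if s.followed_preference else 0.0
```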
UserVille, an interactive environment with preference-aware simulators
UserVille transforms existing agent benchmarks into interaction-centric RL environments powered by LLM-based user simulators.
It has 3 stages:
- Vague prompt creation: precise task prompts are rewritten into vague prompts that keep the same intent but drop details. This creates information asymmetry: the simulator still sees the precise prompt, while the agent only sees the vague version.
- Preference-aware user simulation: each user simulator is instantiated with a preference drawn from a pool of 20 types. Preferences include requirements on brevity, the number of questions per turn, answer format, timing, language constraints, or JSON-formatted questions. Twelve preferences are used in training and 8 are held out for generalization tests.
- User-centric evaluation: after the task, the simulator labels each question as low, medium, or high effort, based on whether it can be answered from the precise prompt and how difficult it is to answer. The proactivity score is 1 if the whole session is low effort, otherwise 0. The personalization score is 1 if the agent follows the preference, otherwise 0, averaged over sessions where the agent asked at least 1 question (a simulator prompt sketch follows this list).
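A rough sketch of what a preference-aware simulator prompt with an effort rubric might look like is shown below; the actual UserVille prompts, the 20-preference pool, and the exact effort criteria are defined in the paper and repo, so treat every string here as an assumption.

```python
# Hypothetical system-prompt builder for an LLM-based user simulator.
# The simulator sees the precise prompt (information asymmetry); the agent only sees the vague one.
EFFORT_RUBRIC = (
    "After the session, label each agent question as 'low' effort if it can be answered "
    "directly from the precise prompt, 'medium' if answering requires some inference, "
    "and 'high' if it is difficult or impossible to answer from the precise prompt."
)


def build_simulator_system_prompt(precise_prompt: str, preference: str) -> str:
    return (
        "You are simulating the user who issued this task.\n"
        f"Precise task prompt (hidden from the agent): {precise_prompt}\n"
        f"Your interaction preference: {preference}\n"
        "Answer the agent's clarifying questions consistently with the precise prompt, "
        "do not solve the task yourself, and enforce your preference.\n"
        + EFFORT_RUBRIC
    )


# Example preference of the kind described above (illustrative only).
print(build_simulator_system_prompt(
    precise_prompt="Fix the off-by-one bug in pagination in utils/pager.py",
    preference="Expect questions in JSON format, at most one question per turn",
))
```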
UserVille is instantiated in 2 domains: software engineering, with SWE-Gym for training and SWE-Bench Verified and SWE-Bench Full for evaluation, and deep research, with BrowseComp-Plus and a search plus open_pages tool scaffold.

PPP, multi-objective RL for productive, proactive and personalized agents
Agents are implemented as ReAct-style tool-use policies based on SEED-OSS-36B-Instruct. They can call domain tools and an ask_user tool that queries the user simulator.
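As an illustration of the scaffold, an ask_user tool could be declared roughly as follows in OpenAI-style function-calling JSON; the exact schema used in the OpenHands scaffold is an assumption here.

```python
# Hypothetical tool schema; names and fields are illustrative, not taken from the repo.
ASK_USER_TOOL = {
    "type": "function",
    "function": {
        "name": "ask_user",
        "description": "Ask the (simulated) user a clarifying question about the task.",
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "One specific clarifying question, phrased to be easy for the user to answer.",
                },
            },
            "required": ["question"],
        },
    },
}
```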
PPP defines a trajectory-level reward:
r = r_prod + r_proact + r_pers
- Productivity reward r_prod: the task metric, F1 on SWE-Func-Loc or exact match on BrowseComp-Plus.
- Proactivity reward r_proact: adds a bonus of +0.05 if all questions in the session are low effort, and applies a penalty of -0.1 for each medium-effort question and -0.5 for each high-effort question.
- Personalization reward r_pers: adds +0.05 when the agent follows the preference, and adds a non-positive penalty defined by a preference-specific rule for each violation (see the reward sketch below).
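A minimal sketch of this trajectory-level reward, plugging in the bonus and penalty values quoted above; the per-violation penalty value and the treatment of zero-question sessions are assumptions, since the paper defines preference-specific rules.

```python
from typing import List


def trajectory_reward(
    task_metric: float,              # r_prod: F1 or exact match, in [0, 1]
    question_efforts: List[str],     # simulator labels: "low" | "medium" | "high"
    preference_violations: int,
    per_violation_penalty: float = -0.05,  # assumed value; the paper uses preference-specific rules
) -> float:
    r_prod = task_metric

    r_proact = 0.0
    if all(e == "low" for e in question_efforts):         # zero-question sessions treated as low effort (assumption)
        r_proact += 0.05                                   # bonus: every question was low effort
    r_proact -= 0.1 * question_efforts.count("medium")     # penalty per medium-effort question
    r_proact -= 0.5 * question_efforts.count("high")       # penalty per high-effort question

    r_pers = 0.05 if preference_violations == 0 else per_violation_penalty * preference_violations

    return r_prod + r_proact + r_pers


# Example: good localization, one low-effort question, preference respected -> 0.8 + 0.05 + 0.05 = 0.9
print(trajectory_reward(task_metric=0.8, question_efforts=["low"], preference_violations=0))
```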
Training uses a GRPO-based RL algorithm with the Clip-Higher strategy and the token-level policy gradient loss from DAPO, and optimizes only the LLM-generated tokens. The training environment is implemented with Verl. SEED-OSS-36B-Instruct is trained for 200 steps with batch size 64 and group size 8. The maximum output length is 32k tokens for SWE-Func-Loc, 65k for SWE-Full, and 41k for Deep Research. GPT-5 Nano is used as the user simulator. SWE scaffolds are based on OpenHands, and deep research uses a search tool and an open_pages tool with Qwen3-Embed-8B as the retriever.
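For reference, GRPO computes advantages by standardizing trajectory rewards within each group of rollouts sampled for the same prompt (group size 8 here). The snippet below sketches only that group-relative step, not the full Verl training loop or the Clip-Higher and DAPO loss details.

```python
import numpy as np


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Each trajectory's advantage is its reward standardized within its own group.
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: one group of 8 rollouts generated from the same vague prompt.
group_rewards = np.array([0.90, 0.40, 1.05, 0.20, 0.55, 0.70, -0.10, 0.95])
print(group_relative_advantages(group_rewards))
```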

Experimental results
Table 2 of the paper evaluates Productivity, Proactivity, and Personalization on SWE-Bench Verified Func-Loc and BrowseComp-Plus under vague prompts, averaging over the 20 preferences.

For the base model, SEED-OSS-36B-Instruct:
- On SWE-Func-Loc, Productivity 38.59, Proactivity 43.70, Personalization 69.07
- On BrowseComp-Plus, Productivity 18.20, Proactivity 37.60, Personalization 64.76.
After PPP RL training, the PPP model reaches:
- On SWE-Func-Loc, Productivity 56.26, Proactivity 75.55, Personalization 89.26
- On BrowseComp-Plus, Productivity 26.63, Proactivity 47.69, Personalization 76.85.
The average gain across all 3 dimensions and both datasets is 16.72 points relative to SEED-OSS-36B-Instruct, and the PPP model also outperforms GPT-5 and other GPT-series baselines on the combined PPP metric.
Interaction matters for vague prompts. On SWE-Func-Loc, F1 with precise prompts and no interaction is 64.50. With vague prompts and no interaction it drops to 44.11. Adding interaction without RL does not close this gap. With PPP training and interaction, F1 improves by 21.66 points under vague prompts.
PPP also changes interaction behavior. The ask ratio on SWE-Func-Loc increases from 50 percent to 100 percent under vague prompts, and from 51 percent to 85 percent on Deep Research, while it stays low for precise prompts. The number of questions per session rises early in training, then stabilizes with a high proportion of low-effort questions and very few medium- or high-effort questions.
Key takeaways
- PPP frames agent training as a multi-objective RL problem that jointly optimizes productivity, proactivity, and personalization instead of focusing only on task success.
- UserVille creates vague-prompt versions of existing benchmarks and pairs them with preference-aware user simulators that enforce 20 distinct interaction preferences and label user effort levels.
- The total reward combines the task metric, user effort, and preference following, implemented with a GRPO-based RL algorithm, using bonuses for low-effort questions and penalties for medium- or high-effort questions and preference violations.
- On SWE-Bench Func-Loc and BrowseComp-Plus with vague prompts, PPP-trained SEED-OSS-36B improves on all 3 metrics over the base model and GPT-5 baselines, with an average gain of approximately 16.72 points across dimensions and datasets.
- PPP agents generalize to unseen preferences, alternative simulators, and harder tasks such as SWE-Bench Full, and they learn to ask fewer but more targeted low-effort questions, especially when prompts are vague.
PPP and UserVille mark an important step toward interaction-aware LLM agents: they explicitly encode productivity, proactivity, and personalization in the reward design, use preference-aware user simulators that enforce up to 20 interaction preferences, and implement GRPO with a DAPO-style token-level loss inside the Verl and OpenHands scaffolds. The improvements on SWE-Bench Func-Loc, SWE-Bench Full, and BrowseComp-Plus show that interaction modeling is now a core capability, not a supporting feature.
Check out the paper and repo for more details.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of Marktechpost, an Artificial Intelligence media platform known for its in-depth coverage of Machine Learning and Deep Learning news that is both technically sound and easily understood by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.