Researchers from Stanford, EPFL and UNC introduce Weak-for-Strong (W4S), a new reinforcement learning (RL) framework that trains a small meta-agent to design and refine code workflows that call a strong executor model. Rather than fine-tuning the strong model, the meta-agent learns to orchestrate it. W4S formalizes workflow design as a multi-turn Markov decision process and trains the meta-agent with a method called Reinforcement Learning for Agentic Workflow Optimization (RLAO). The research team reports consistent gains across 11 benchmarks with a 7B meta-agent trained for approximately 1 GPU hour.

W4S operates turn by turn. The state includes the task instructions, the current workflow program, and feedback from prior executions. An action has two components: an analysis of what to change, and the new Python workflow code that applies those changes. The environment executes the code on validation items, returns accuracy and failure cases, and provides the state for the next turn. The meta-agent can run a quick self-check on a sample; if errors occur, it attempts up to 3 repairs and abandons the action if errors persist. This loop drives learning without touching the weights of the strong model.
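A minimal sketch of how this state, action, and self-check loop might look in code. The class and function names, the `exec_workflow` runner, and the `meta_agent.repair` call are illustrative assumptions, not the paper's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class State:
    task_instructions: str   # what the task asks for
    workflow_code: str       # current Python workflow program
    feedback: str            # accuracy and failure cases from the last execution

@dataclass
class Action:
    analysis: str            # free-text analysis of what to change
    new_workflow_code: str   # revised Python workflow that calls the strong executor

MAX_REPAIRS = 3              # repair budget described in the article

def exec_workflow(code: str, item):
    """Naive stand-in runner: execute the workflow code with the sample item in scope."""
    exec(code, {"item": item})

def self_check_and_repair(meta_agent, action: Action, sample_item) -> Action | None:
    """Try the candidate workflow on one sample; repair up to MAX_REPAIRS times on error."""
    for attempt in range(MAX_REPAIRS + 1):      # initial attempt plus up to 3 repairs
        try:
            exec_workflow(action.new_workflow_code, sample_item)
            return action                        # runs cleanly, keep the action
        except Exception as err:
            if attempt == MAX_REPAIRS:
                return None                      # errors persist, abandon the action
            action = meta_agent.repair(action, error=str(err))  # assumed repair call
```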

W4S runs as an iterative loop (a code sketch follows the list):
- Workflow generation: the weak meta-agent writes a new workflow that leverages the strong model, expressed as executable Python code.
- Execution and feedback: the workflow is run with the strong model on validation samples, and accuracy and error cases are returned as feedback.
- Refinement: the meta-agent uses the feedback to analyze and update the workflow, then repeats the loop.
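Continuing the sketch above, this is one plausible shape for the outer loop. `meta_agent.propose` and `run_on_validation` are assumed placeholders standing in for the paper's actual generation and evaluation steps.

```python
def optimize_workflow(meta_agent, strong_executor, task, val_set, turns=10):
    """Sketch of the W4S loop: generate a workflow, execute it on validation, refine it."""
    state = State(task_instructions=task, workflow_code="", feedback="")
    best_code, best_acc = None, -1.0
    for _ in range(turns):
        # 1. Workflow generation: the weak meta-agent writes a new Python workflow
        #    that orchestrates calls to the strong executor.
        action = meta_agent.propose(state)                        # assumed interface
        # 2. Execution and feedback: run the workflow with the strong model on the
        #    validation samples, collecting accuracy and the failure cases.
        accuracy, failures = run_on_validation(
            action.new_workflow_code, strong_executor, val_set)   # assumed helper
        if accuracy > best_acc:
            best_code, best_acc = action.new_workflow_code, accuracy
        # 3. Refinement: fold the feedback into the next state so the meta-agent can
        #    analyze the errors and update the workflow on the next turn.
        state = State(task_instructions=task,
                      workflow_code=action.new_workflow_code,
                      feedback=f"accuracy={accuracy:.3f}; failures={failures}")
    return best_code, best_acc
```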
Reinforcement learning for agentic workflow optimization (RLAO)
RLAO is an offline reinforcement learning process over multi-turn trajectories. At each iteration, the system samples several candidate actions, keeps the best-performing one to advance the state, and stores the others for training. The policy is optimized with reward-weighted regression. The reward is sparse and compares the current validation accuracy to history: more weight is given when the new result beats the previous best, less when it only beats the last iteration. This objective favors steady progress while controlling exploration costs.
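To make the reward weighting concrete, here is a minimal sketch of the sparse reward and a reward-weighted regression objective. The specific weight values and the `policy.log_prob` interface are assumptions for illustration, not the paper's exact settings.

```python
import torch

def rlao_reward(acc, best_so_far, last_acc, w_best=1.0, w_last=0.5):
    """Sparse reward comparing current validation accuracy to history.
    The weights (1.0 for beating the best, 0.5 for beating the last turn)
    are illustrative assumptions, not the paper's exact values."""
    if acc > best_so_far:
        return w_best
    if acc > last_acc:
        return w_last
    return 0.0

def rwr_loss(policy, batch):
    """Reward-weighted regression on stored trajectories: weight each action's
    log-likelihood by its reward, so higher-reward workflow edits are imitated
    more strongly."""
    losses = []
    for state, action, reward in batch:
        logp = policy.log_prob(action, state)    # assumed policy interface
        losses.append(-reward * logp)
    return torch.stack(losses).mean()
```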

Understanding the results
On HumanEval with GPT-4o-mini as the executor, W4S achieves a Pass@1 of 95.4, with about 33 minutes of workflow optimization, zero meta-agent API cost, an optimization execution cost of about $0.4, and about 2.7 minutes to run the test set for about $0.5, totaling about $0.9. Under the same executor, AFlow and ADAS do not reach this number. The average gains reported against the strongest automated baseline range from 2.9% to 24.6% across the 11 benchmarks.
For mathematics transfer, the meta-agent is trained with GPT-3.5-Turbo as the executor on GSM Plus and MGSM, then evaluated on GSM8K, GSM-Hard and SVAMP. The paper reports 86.5 on GSM8K and 61.8 on GSM-Hard, both above the automated baselines. This indicates that the learned orchestration transfers to related tasks without retraining the executor.
On seen tasks with GPT-4o-mini as the executor, W4S outperforms training-free automated methods that do not learn a planner. The study also runs an ablation in which the meta-agent is trained by supervised fine-tuning instead of RLAO; the RLAO-trained agent delivers better accuracy under the same compute budget. The research team also includes a GRPO baseline that trains the 7B weak model directly on GSM-Hard; W4S outperforms it under limited compute.
Turn budget matters. The research team runs W4S with about 10 optimization turns in the main tables, while AFlow runs about 20 turns and ADAS runs about 30 turns. Despite fewer turns, W4S achieves higher accuracy. This suggests that the learned planner over code, combined with validation feedback, makes the search more sample efficient.

Key takeaways
- W4S trains a 7B weak meta-agent with RLAO to write Python workflows that call strong executors, modeling workflow design as a multi-turn MDP.
- On HumanEval with GPT-4o-mini as the executor, W4S reaches a Pass@1 of 95.4, with about 33 minutes of optimization and a total cost of about $0.9, beating automated baselines under the same executor.
- Across 11 benchmarks, W4S improves on the strongest baseline by 2.9% to 24.6%, while avoiding fine-tuning of the stronger model.
- The method runs an iterative loop: generate a workflow, execute it on validation data, then refine it using feedback.
- ADAS and AFlow also program or search over code workflows; W4S differs by training the planner with offline reinforcement learning.
W4S targets orchestration, not model weights, training a 7B meta-agent to program workflows that invoke strong executors. It formalizes workflow design as a multi-turn MDP and optimizes the planner with RLAO, using offline trajectories and reward-weighted regression. Reported results show a Pass@1 of 95.4 on HumanEval with GPT-4o-mini as the executor, average gains of 2.9% to 24.6% over the strongest baseline across 11 benchmarks, and about 1 GPU hour of training for the meta-agent. The framing is compared to ADAS and AFlow, which search over agent designs or build code graphs; W4S does not fine-tune the executor and instead learns the planner.
Check out the technical paper and the GitHub repo.

Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.