Stop Freezing Your Sub-Agents: A New Way to Train Multi-Agent Systems

Training a planner and a tool-user simultaneously has always been a nightmare of gradient synchronization. A new decoupled method changes that.
Introduction
If you've ever tried to build a vertical multi-agent system where a "Planner" model directs a "Tool User" model, you've likely hit the Training Wall.
You usually have two bad options:
- The Monolith: Train one giant model to do everything. It becomes a jack-of-all-trades but master of none, often forgetting how to plan while learning how to use a web browser.
- The Frozen Worker: Train the Planner but freeze the Tool User (or vice versa). Your Planner gets smarter, but it's shouting instructions at a dumber, static sub-agent that never learns to adapt to the Planner's style.
Why not train both? Because gradients. Backpropagating a reward signal through a Planner, across a network call to a Sub-Agent, and back again is an engineering nightmare. It breaks computation graphs and requires massive, synchronized GPU clusters.
A new technique (let's call it M-GRPO) has just popped up on arXiv that solves this by decoupling the training. It allows you to co-train distinct, specialized models on separate servers without ever passing gradients between them.
Here is the engineering breakdown of how it works.
The Core Innovation
Decoupled Group Relative Policy Optimization
The method is a hierarchical extension of GRPO (Group Relative Policy Optimization). If you recall, GRPO saves memory by removing the value function (Critic) and instead normalizing rewards against the "group" of outputs generated by the same prompt.
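(For reference, the group-relative advantage for the $i$-th rollout in a group of $G$ sampled from the same prompt is just the reward z-scored against its siblings:)

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$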
M-GRPO applies this to a Planner-Executor hierarchy.
1. The Vertical Architecture
- Main Agent (Planner): Decomposes the user query and delegates tasks.
- Sub-Agent (Executor): Receives the sub-task, uses tools (search, code execution), and reports back.
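To make the data flow concrete, here is a minimal sketch of a single hierarchical rollout; `plan`, `execute`, and `finalize` are illustrative method names, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    instruction: str  # natural-language sub-task written by the Planner

def rollout(planner, executor, user_query: str) -> str:
    """One hierarchical rollout: the Planner decomposes, the Executor handles the tools."""
    sub_tasks = planner.plan(user_query)                          # e.g. ["search NOAA sunspot data"]
    reports = [executor.execute(SubTask(t)) for t in sub_tasks]   # tool calls happen here
    return planner.finalize(user_query, reports)                  # Planner writes the final answer
```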
2. The "Spreadsheet" Synchronization
Instead of a complex end-to-end differentiable pipeline, the agents run on completely separate servers.
- Server A runs the Planner.
- Server B runs the Executor.
They don't share gradients. They only share outcomes. The Planner gets a reward based on the final answer. The Executor gets a "composite" reward based on:
- Did I follow the Planner's format?
- Did the Planner get the final answer right? (Alignment)
- Did an external expert system rate my tool usage as high-quality? (Local Quality)
Because GRPO relies on relative advantages (is this sample better than the average of the group?), you only need to sync these scalar reward values. No massive tensors flow across the network.
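In code terms, the only thing that has to cross the network per trajectory is a handful of scalars. Here is a minimal sketch of how the composite reward and the group-relative advantage might be computed; the weights and signal names (`format_ok`, `planner_correct`, `expert_score`) are my own placeholders, not the paper's notation:

```python
def executor_reward(format_ok: bool, planner_correct: bool, expert_score: float,
                    w_fmt: float = 0.2, w_align: float = 0.4, w_local: float = 0.4) -> float:
    """Composite reward for one Sub-Agent trajectory: format + alignment + local quality."""
    return w_fmt * float(format_ok) + w_align * float(planner_correct) + w_local * expert_score

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: how much better is each sample than its group's average?"""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```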
3. Trajectory Alignment (The "Padding" Trick)
This is the cleverest engineering bit. In a standard training batch, one Planner rollout might trigger the Sub-Agent 0 times (pure reasoning), while another triggers it 5 times (heavy research). This creates jagged tensors that GPUs hate.
M-GRPO introduces Trajectory Alignment:
- Define a target number of calls, say $D_{max}$.
- If usage < $D_{max}$: Randomly duplicate existing sub-agent trajectories to fill the gap.
- If usage > $D_{max}$: Randomly drop the excess trajectories.
This forces the data into a fixed shape ($Batch \times D_{max}$), allowing efficient, stable batch updates on the Sub-Agent's GPU.
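Here is a minimal sketch of that alignment step, assuming the duplicate-or-drop rule described above; the function and variable names are illustrative:

```python
import random

def align_trajectories(sub_trajs: list, d_max: int) -> list:
    """Pad or truncate one rollout's Sub-Agent trajectories to exactly d_max entries."""
    if len(sub_trajs) == 0:
        return []                                      # pure-reasoning rollout: nothing to train on
    if len(sub_trajs) < d_max:
        # duplicate random existing trajectories to fill the gap
        padding = random.choices(sub_trajs, k=d_max - len(sub_trajs))
        return sub_trajs + padding
    # too many calls: randomly keep d_max of them
    return random.sample(sub_trajs, k=d_max)
```

After alignment, the Sub-Agent's batch has a fixed shape, so its update step looks exactly like a standard GRPO step on a single model.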
Comparison & Benchmarks
The researchers tested this on heavy reasoning benchmarks like GAIA, XBench-DeepSearch, and WebWalkerQA.
- Baseline 1 (Unified Model): A single model doing both roles.
- Baseline 2 (Frozen Sub-Agent): Training the Planner while keeping the Tool User static.
Result: M-GRPO consistently beat both. The most interesting finding was role specialization. The Planner learned to stop delegating simple logic puzzles to the Sub-Agent (which is slow) and handled them internally. Conversely, for obscure knowledge retrieval, it learned to write highly specific queries that the Sub-Agent was now trained to understand.
Examples
How Behavior Shifts
To visualize what "co-training" actually achieves, let's look at how the interaction evolves from "Untrained" to "M-GRPO Trained."
Scenario: Complex Verification
User Query: "Verify if the 2024 solar maximum predictions match the observed sunspot data from last month."
Before (Frozen Sub-Agent)
The Planner writes a generic query, and the Sub-Agent gets confused by the ambiguity.
PLANNER (Output):
Action: Delegate to SearchAgent
Input: "solar maximum 2024 sunspot data"
SUB-AGENT (Frozen):
> Searches generic terms.
> Returns: "NASA predicts solar max in 2025. Sunspots are dark areas." (Too vague)
PLANNER (Final):
"The data is inconclusive based on the search."
After (M-GRPO Co-Trained)
The Planner learns to be specific, and the Sub-Agent learns to prioritize recent data tables over general articles.
PLANNER (Output):
Action: Delegate to SearchAgent
Input: "NOAA monthly sunspot number October 2024 vs solar cycle 25 prediction prediction"
SUB-AGENT (Trained):
> Recognizes 'vs' implies comparison.
> Searches specific NOAA databases.
> Returns: "Observed SSN: 112.4. Predicted: 115.6. Difference is within 3% margin."
PLANNER (Final):
"Yes, the observations are aligning closely. October 2024 data shows a sunspot number of 112.4, which matches the prediction of 115.6."
Practical Takeaways
- Format is the First Victim: When training starts, the first thing agents learn is strict formatting (JSON/XML). Before they get "smart," they get "structured."
- Shared Fate Rewards: Giving the Sub-Agent a partial reward based on the Planner's final success is crucial. Without it, the Sub-Agent just maximizes its own tool-usage score (hacking the metric) without actually helping the main goal.
- Hardware Decoupling: You can now train your Planner on H100s and your smaller Sub-Agent on A10s (or even older hardware), located in different physical racks. As long as they share a database for trajectory logs, it works.
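For that shared trajectory log, anything that can store a rollout ID plus a few scalars will do. A minimal sketch using SQLite as a stand-in for whatever shared store you already run (the schema is my assumption, not the paper's):

```python
import sqlite3

# Both servers point at the same trajectory log; only IDs and scalar outcomes are shared.
db = sqlite3.connect("trajectory_log.db")
db.execute("""CREATE TABLE IF NOT EXISTS rollouts (
    rollout_id TEXT, agent TEXT, reward REAL, final_correct INTEGER)""")

# Server A (Planner) logs the outcome of a finished rollout.
db.execute("INSERT INTO rollouts VALUES (?, ?, ?, ?)", ("r-042", "planner", 1.0, 1))
db.commit()

# Server B (Executor) later reads the Planner's outcome to build its composite reward.
row = db.execute(
    "SELECT final_correct FROM rollouts WHERE rollout_id = ? AND agent = 'planner'",
    ("r-042",),
).fetchone()
```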
Conclusion
The era of the "God Model" that does everything is fading. The future is specialized, vertical swarms of agents. Until now, training them was an infrastructure headache. By combining GRPO's critic-less simplicity with trajectory padding and decoupled servers, we can finally fine-tune an entire team of agents as easily as we fine-tune a single model. If you are building agentic workflows, stop trying to cram everything into one context window. Split the roles, decouple the servers, and let the rewards align the team.


