
Research note · Research

Multi-Agent Cooperative Teaming Strategies

Merging Model-Based Control with Multi-Agent Reinforcement Learning for Cooperative Teams

Christian Llanes, Spencer W. Jensen, and Samuel Coogan

Read the paper on arXiv

This paper is about a familiar tension in drone autonomy: reinforcement learning can discover cooperative behavior from rewards that are awkward to write as clean control objectives, while model predictive control (MPC) can turn robot dynamics and constraints into feasible actions. The hard part is getting both strengths into the same policy without ending up with either a brittle hand-built controller or a black-box neural network that looks good in simulation and becomes hard to trust on hardware.

The paper, "Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies," proposes a direct version of that combination. Each agent still learns through multi-agent reinforcement learning, but the actor is not just a multilayer perceptron (MLP) that outputs motor commands. It first outputs the parameters of a model predictive control problem, and the MPC solver then computes the feasible action that the robot will actually execute.

That design makes the learned policy behave more like a high-level planner. It decides what the controller should care about, how hard it should care, and where the short-horizon plan should point, while the MPC layer handles the lower-level work of turning that intent into a dynamically feasible command.

Paper details

Quick read

The work extends actor-critic model predictive control to multi-agent reinforcement learning. The policy is trained with MAPPO, a multi-agent version of proximal policy optimization. Centralized value information is used during training to make the cooperative reward easier to learn, and at execution time each agent runs its own MPC layer.

The authors test the method in two settings. The first is a 2-versus-2 drone pursuit-evasion task where two evader drones learn to make two pursuer drones collide with each other. The evaders use MA-AC-MPC and the pursuers use a proportional-navigation-style adversarial controller. The policy is trained in Crazyflow, checked with CrazySim, and demonstrated on hardware through ROS 2 and Crazyswarm2.

The second setting is a heterogeneous landing task in which a drone and a mecanum-wheeled rover each learn policies so the drone can land on the moving rover. On hardware, MA-AC-MPC lands successfully in 5 out of 5 trials with 0.055 m mean final error, while the MA-AC-MLP baseline lands in 3 out of 5 trials with 0.240 m mean final error.

The result is not that MPC makes learning free. In the pursuit-evasion experiment, MA-AC-MPC takes 162.2 hours to train for 2 million steps, compared with 18.0 to 30.0 hours for the MLP baselines over 4 million steps. Inference is also slower, at about 1.052 ms per agent for MA-AC-MPC compared with 0.120 to 0.197 ms for the MLP actors.

The stronger claim is narrower and more useful: the MPC actor is more expensive, but in these experiments it transfers better to real robots and stays more robust when the physical system shifts away from the nominal training setup.

Why this problem matters

Multi-agent drone tasks are rarely just control problems or just learning problems. If the task is written as a control problem, the designer has to define the objective, which works well enough for tracking a route, holding formation, or landing at a known target. It becomes harder when the right behavior depends on strategy: baiting a pursuer, coordinating with a teammate, timing a landing on a moving rover, or deciding when one agent should make room for another.

If the task is written as a reinforcement learning problem, the reward can be much more flexible. The team can be rewarded for a discrete event, a delayed outcome, or a cooperative behavior that would be hard to describe as a smooth cost function, and the policy can discover strategies the designer did not explicitly write down.

Pure learned control has its own problem. A neural actor can output commands that are plausible in training but fragile on hardware, especially if it learns the simulator's quirks, hides unstable behavior inside a large network, or fails when mass, actuation, sensing, or latency changes.

MA-AC-MPC is aimed at the middle ground, where reinforcement learning shapes the strategy and a model-based controller keeps the final command closer to the robot's actual dynamics.

The core idea

The actor in MA-AC-MPC has two layers. The first is a neural network that looks at the agent's observation and outputs parameters for an MPC problem. In the paper, those parameters include stage references, terminal references, and weighting matrices.

The second layer is the MPC solver, which receives the current state and the learned parameters before solving a short-horizon optimal control problem. The first action from that solution becomes the agent's command.

In plain terms:

$$\text{observation} \rightarrow \text{learned MPC cost parameters} \rightarrow \text{constrained MPC solve} \rightarrow \text{action}$$

This changes what the neural network has to learn. A plain MLP actor must learn both the strategic behavior and the low-level action map, while MA-AC-MPC asks the neural network to learn the local objective and lets the solver handle the dynamics. That separation is the main engineering idea in the paper.
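To make the two-layer structure concrete, here is a minimal sketch under loud assumptions: a toy 1D double integrator stands in for the drone, the "network" is a single linear map, the cost has only a terminal reference, and the solver is plain projected gradient descent on a horizon-2 box-constrained problem. None of these names or values come from the paper, which uses leap-c and acados for the real solve.

```python
import numpy as np

DT = 0.02      # assumed 50 Hz control step
U_MAX = 1.0    # assumed acceleration limit (toy value)

# Toy double integrator: state = [position, velocity], input = acceleration.
A = np.array([[1.0, DT], [0.0, 1.0]])
B = np.array([[0.5 * DT**2], [DT]])


def cost_network(obs, W, b):
    """Stand-in for the neural layer: observation -> MPC cost parameters."""
    raw = W @ obs + b
    terminal_ref = raw[:2]            # where the short-horizon plan should end
    state_weights = np.exp(raw[2:])   # exp keeps the weights positive
    return terminal_ref, state_weights


def mpc_solve(x0, terminal_ref, state_weights, r=1e-3, iters=500):
    """Horizon-2 MPC with a box constraint on the input.

    Minimizes 0.5*(x2-ref)^T Q (x2-ref) + 0.5*r*||u||^2 over u = [u0, u1]
    subject to |u_k| <= U_MAX, using projected gradient descent.
    """
    G = np.hstack([A @ B, B])         # x2 = A^2 x0 + G u
    c = A @ A @ x0
    Q = np.diag(state_weights)
    H = G.T @ Q @ G + r * np.eye(2)   # Hessian of the cost in u
    g = G.T @ Q @ (c - terminal_ref)  # linear term of the cost in u
    step = 1.0 / np.linalg.eigvalsh(H).max()
    u = np.zeros(2)
    for _ in range(iters):
        u = np.clip(u - step * (H @ u + g), -U_MAX, U_MAX)
    return u


def actor(obs, x0, W, b):
    """Two-layer actor: learned cost parameters, then the MPC solve."""
    terminal_ref, state_weights = cost_network(obs, W, b)
    return mpc_solve(x0, terminal_ref, state_weights)[0]  # apply the first input only
```

The point of the sketch is the division of labor: training would adjust only `W` and `b`, while the input limit is enforced by the solver no matter what the learned parameters request.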

Training is centralized, execution is distributed

The method follows the common centralized-training, decentralized-execution pattern. During training, a centralized critic can see the broader environment state, which helps the value function understand team rewards, teammate behavior, inactive agents, and global outcomes. The paper uses MAPPO, a multi-agent version of PPO, with a shared reward setup for the cooperative tasks.

During execution, each agent runs its own actor and MPC solver, so the action does not come from a centralized computer solving one giant optimization problem for the whole team. That matters because centralized MPC scales poorly as the number of agents grows and depends heavily on reliable communication.

The deployed policy still uses observations that include partial state information from other agents, so it is not communication-free autonomy. The command generation, however, is local to each agent, and the MPC problem remains an individual agent problem rather than one large centralized solve.
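The split between what runs on each robot and what exists only during training can be sketched in a few lines. This is a toy illustration with linear maps and made-up dimensions; the plain `tanh` actor stands in for the full actor (network plus MPC solve), and feeding the critic the stacked observations is an assumption, since the post only says the critic sees a broader shared state.

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, OBS_DIM, ACT_DIM = 2, 6, 4   # toy sizes, chosen for illustration

actor_W = 0.1 * rng.normal(size=(ACT_DIM, OBS_DIM))    # one actor shared by all agents
critic_w = 0.1 * rng.normal(size=N_AGENTS * OBS_DIM)   # critic input covers every agent


def act(local_obs):
    """Decentralized execution: each agent needs only its own observation vector."""
    return np.tanh(actor_W @ local_obs)


def centralized_value(all_obs):
    """Training-time critic: scores the joint situation of the whole team."""
    return float(critic_w @ all_obs.reshape(-1))


all_obs = rng.normal(size=(N_AGENTS, OBS_DIM))
actions = np.stack([act(o) for o in all_obs])   # what would run on each robot
value = centralized_value(all_obs)              # used only to compute advantages
```

At deployment only `act` is needed, so nothing in the command path depends on a central computer.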

The paper also handles inactive agents with masking. In the pursuit-evasion task, agents can be captured, collide, or leave the boundary, and updating the critic as if a dead agent still had normal observations can create bad training signals. The implementation uses zero masks, active flags, one-hot agent IDs, and bootstrap cutoffs for inactive agents.

That kind of detail is not glamorous, but it often decides whether a multi-agent RL experiment actually trains.
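One plausible form of the active-flag masking and bootstrap cutoff is a masked version of generalized advantage estimation. The sketch below is an assumption about how such masking can be written, not the paper's code: inactive steps get zero advantage, and the value of the next step is not bootstrapped once the agent is no longer active.

```python
import numpy as np


def masked_gae(rewards, values, active, last_value, last_active=1.0,
               gamma=0.99, lam=0.95):
    """GAE for one agent with inactive steps masked out.

    rewards, values, active: arrays of shape (T,). active[t] is 1.0 while the
    agent is alive at step t and 0.0 afterwards.
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_value, next_adv, next_active = last_value, 0.0, last_active
    for t in reversed(range(T)):
        boot = next_active   # cut the bootstrap if the agent is inactive at t+1
        delta = rewards[t] + gamma * next_value * boot - values[t]
        adv[t] = (delta + gamma * lam * next_adv * boot) * active[t]
        next_value, next_adv, next_active = values[t], adv[t], active[t]
    return adv


# An agent that is active for two steps and then eliminated.
adv = masked_gae(
    rewards=np.array([1.0, 1.0, 0.0, 0.0]),
    values=np.zeros(4),
    active=np.array([1.0, 1.0, 0.0, 0.0]),
    last_value=5.0,
)
```

In this example the large `last_value` never leaks into the advantages of the steps where the agent was alive, which is the failure the masking is meant to prevent.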

Why leap-c and acados matter

The MPC layer has to be differentiable because the policy is trained end to end. Gradients need to flow through the solver so the cost network can learn which MPC parameters produce better team behavior.
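What "gradients flow through the solver" means can be shown on the simplest possible case. The sketch below assumes an unconstrained quadratic problem whose linear cost term is the learned parameter, differentiates the solution analytically, and checks the result against finite differences. Real nonlinear MPC with constraints needs the sensitivity machinery that leap-c and acados provide; this only shows the principle.

```python
import numpy as np

H = np.array([[2.0, 0.5], [0.5, 1.0]])   # fixed positive definite Hessian (toy values)


def solve(theta):
    """u*(theta) = argmin_u 0.5 u^T H u - theta^T u, which gives H u* = theta."""
    return np.linalg.solve(H, theta)


def loss(theta, target):
    """Downstream loss on the solver output."""
    u = solve(theta)
    return 0.5 * np.sum((u - target) ** 2)


def loss_grad(theta, target):
    """Implicit differentiation: du*/dtheta = H^{-1}, so dL/dtheta = H^{-T} (u* - target)."""
    u = solve(theta)
    return np.linalg.solve(H.T, u - target)


theta = np.array([0.3, -0.7])
target = np.array([1.0, 2.0])
eps = 1e-6
finite_diff = np.array([
    (loss(theta + eps * e, target) - loss(theta - eps * e, target)) / (2 * eps)
    for e in np.eye(2)
])
```

Here `finite_diff` and `loss_grad(theta, target)` agree to numerical precision, which is the property a differentiable MPC layer has to preserve for far harder problems.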

Older differentiable MPC implementations are often limited to simpler problem classes, such as quadratic costs or box-constrained iterative LQR. This paper uses leap-c, which provides differentiable interfaces around acados, and that matters because acados supports fast nonlinear optimal control with nonlinear dynamics, costs, and inequality constraints.

The paper's implementation uses this stack to make the MPC layer more general than a toy optimizer inside a neural network. It can encode useful robot constraints while still participating in the actor-critic training loop.

There is a cost. The authors note that acados code generation is CPU-oriented and cannot be run on the GPU in the same way as a fully neural policy, so MA-AC-MPC is slower to train and slower to evaluate than the MLP baseline. The question is whether the added structure buys enough robustness to justify the cost.

The hardware results suggest that, at least in these tasks, it can.

Experiment 1: making pursuers collide

The first benchmark is a multi-agent pursuit-evasion task with two evaders and two pursuers in 3D. The evaders are the learning agents, and their job is not simply to run away. In the modified task used by this paper, the evaders win only if all pursuers are dead and at least one evader remains alive. A pursuer dies when it collides with a teammate or captures an evader, while an evader dies if it is captured, collides with a teammate, hits the ground, or exits the boundary.

That setup forces cooperation because a single evasive maneuver may survive briefly, but the team objective is to lure the pursuers into a bad interaction. The learned strategy has to exploit the pursuer policy, not just maximize distance.

The pursuers use a fixed proportional-navigation-style strategy. The paper says the controller captures evaders with near 100% success in ordinary conditions, so the evaders are learning against a strong scripted adversary rather than a weak chase rule.

The training environment uses Crazyflow. The simulated dynamics use Crazyflow's `first_principles` Crazyflie model, which includes rigid body dynamics, rotor dynamics, aerodynamic drag, and gyroscopic effects, while the MPC actor uses the faster `so_rpy` model. That mismatch is intentional: the environment can be more detailed than the model used inside the real-time controller.

What the evader policy sees

The MA-AC-MPC evader policy is shared by both evaders. Each evader receives its own position, velocity, rotation matrix, body rates, and one-hot agent ID, along with teammate and pursuer state information such as position and velocity. Those shared observations are masked based on whether agents are alive.

The critic receives a broader shared state during training. This lets the critic evaluate the whole team situation even though each actor runs from its own observation at execution time.

The reward is built from sparse events and shaping terms. The sparse rewards include capture penalties, evader collision penalties, pursuer collision rewards, and boundary penalties, while the shaping terms encourage pursuer proximity, closing velocity before pursuer-pursuer collisions, reasonable evader attitude, low velocity, thrust near hover, and smooth controls.

This is not a reward-free demonstration of emergent strategy. The environment is carefully shaped. That is normal for this class of robotics RL, and the paper is fairly clear about it.

Curriculum learning does real work

The authors first tried training with fully random initial positions and found that the evaders struggled to learn cooperative collision-inducing behavior, so they introduced a 10-level curriculum.

The curriculum gradually increases difficulty by changing spawn distances, collision tolerances, mass and inertia randomization, and disturbance terms. Advancement requires a 70% evader win rate. At the last level, the evader spawn is randomized by 0.1 m, the evader collision tolerance is 0.20 m, the pursuer-pursuer and pursuer-evader collision tolerances are 0.20 m, and the disturbance and randomization terms are active.

That progression is important because the policy is not born with sophisticated deception. It learns the cooperative behavior only after the task is staged in a way that makes the early versions solvable.

For builders, this is one of the more practical lessons in the paper. If a multi-agent behavior is too sparse or too strategic, a good architecture may still fail without a curriculum that exposes useful intermediate signals.
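The advancement rule described above is simple to sketch. The 10 levels and the 70% threshold come from the post; the 100-episode rolling window and the reset after each promotion are assumptions made for illustration, and the per-level environment settings are left out.

```python
from collections import deque


class Curriculum:
    """Advance one level whenever the recent evader win rate reaches the threshold."""

    def __init__(self, n_levels=10, threshold=0.70, window=100):
        self.level = 1
        self.n_levels = n_levels
        self.threshold = threshold
        self.results = deque(maxlen=window)   # window size is an assumption

    def record(self, evaders_won):
        """Log one episode outcome and return the level to use next."""
        self.results.append(bool(evaders_won))
        window_full = len(self.results) == self.results.maxlen
        if window_full and self.level < self.n_levels:
            if sum(self.results) / len(self.results) >= self.threshold:
                self.level += 1
                self.results.clear()          # re-measure at the new difficulty
        return self.level
```

A training loop would call `record` after every episode and use the returned level to pick spawn distances, tolerances, and randomization ranges for the next reset.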

Pursuit-evasion numbers

The pursuit-evasion results show a clear tradeoff.

| Method | Network size | Training steps | Training time | Inference time per agent |
| --- | --- | --- | --- | --- |
| MA-AC-MPC | 256 x 2 actor / 256 x 2 critic | 2M | 162.2 h | 1.052 ± 0.474 ms |
| MA-AC-MLP | 256 x 2 actor / 256 x 2 critic | 4M | 18.0 h | 0.120 ± 0.063 ms |
| MA-AC-MLP | 512 x 2 actor / 512 x 2 critic | 4M | 26.8 h | 0.140 ± 0.077 ms |
| MA-AC-MLP | 512 x 3 actor / 512 x 2 critic | 4M | 30.0 h | 0.197 ± 0.175 ms |

The MPC actor is much more expensive because it trains on CPU-heavy solver calls and pays an optimization cost at inference. Still, the paper reports that MA-AC-MPC reaches stronger evader win rates in fewer environment steps than the tested MLP baselines.
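Because the runs use different step counts, total hours understate the gap. A quick normalization of the table's numbers to wall-clock seconds per environment step makes the comparison direct. This is plain division of the reported figures and ignores hardware and parallelism details that the post does not give.

```python
runs = {
    "MA-AC-MPC 256x2": {"steps": 2_000_000, "hours": 162.2},
    "MA-AC-MLP 256x2": {"steps": 4_000_000, "hours": 18.0},
    "MA-AC-MLP 512x2": {"steps": 4_000_000, "hours": 26.8},
    "MA-AC-MLP 512x3": {"steps": 4_000_000, "hours": 30.0},
}


def seconds_per_step(run):
    """Wall-clock training seconds spent per environment step."""
    return run["hours"] * 3600.0 / run["steps"]


cost = {name: seconds_per_step(run) for name, run in runs.items()}
slowdown = {
    name: cost["MA-AC-MPC 256x2"] / c
    for name, c in cost.items()
    if name != "MA-AC-MPC 256x2"
}

for name, c in cost.items():
    print(f"{name}: {c:.4f} s per step")
```

On these numbers the MPC actor costs about 0.29 s per step against 0.016 to 0.027 s for the MLP baselines, so each step is roughly 11 to 18 times more expensive.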

The most useful robustness test changes the evader masses after training. Both models are trained around a nominal 40.6 g evader mass, but at evaluation the paper sweeps each evader independently from 30.6 g to 50.6 g in 2 g intervals. The MPC dynamics remain at the nominal mass, and the sweep is run for 1000 episodes at each mass variation.

In the figure, the MA-AC-MPC model outperforms the best MLP baseline across the entire sweep. The paper notes that the difference plot has no blue region, meaning there is no tested mass combination where the MLP has the higher win rate.

That is the strongest pursuit-evasion evidence. The learned MPC actor is not just better at the exact training condition. It handles a modest physical shift better than the neural actor, even when the internal MPC model itself has not been updated to the new mass.
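The size of that sweep is easy to reconstruct. The sketch below assumes the two evader masses are varied over a full 2D grid, which is what "mass combination" suggests, and that 1000 episodes are run at every grid point. The sign convention of the difference is also an assumption.

```python
from itertools import product

import numpy as np

NOMINAL_MASS_G = 40.6
EPISODES_PER_POINT = 1000

masses_g = np.linspace(30.6, 50.6, 11)            # 30.6, 32.6, ..., 50.6 g
mass_pairs = list(product(masses_g, masses_g))    # (evader 1 mass, evader 2 mass)
total_episodes = len(mass_pairs) * EPISODES_PER_POINT


def win_rate_difference(win_rate_mpc, win_rate_mlp):
    """Per-grid-point difference; positive values mean the MPC actor wins more often."""
    return np.asarray(win_rate_mpc) - np.asarray(win_rate_mlp)
```

Under those assumptions the sweep covers 121 mass pairs and 121,000 episodes per model, with the nominal 40.6 g case at the center of the grid.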

Hardware bridge for pursuit-evasion

After training, the authors deploy the MA-AC-MPC model through ROS 2 and Crazyswarm2. Before going to hardware, they run the ROS 2 nodes through CrazySim, which uses the real Crazyflie firmware in the loop with MuJoCo.

The paper compares trajectories from Crazyflow, CrazySim, and hardware. The curves are not identical, but they follow a similar trend despite sensor noise, disturbances, estimator uncertainty, and model mismatch, and one hardware trajectory shows the evaders cooperating to lead the pursuers into a head-on collision.

The paper does not provide a large statistical hardware table for the pursuit-evasion task, so this result should be read as a transfer demonstration rather than a full hardware benchmark. The important point is the integration path: train in Crazyflow, verify with firmware-in-the-loop, and then deploy with Crazyswarm2.

That path is relevant because learned multi-agent strategies often fail between simulation and hardware for reasons that are not visible in the reward curve. The paper gives a concrete example of pushing the same learned structure through multiple realism layers.

Experiment 2: landing a drone on a moving rover

The second benchmark is more grounded: a drone must land on a moving ground robot, and both agents are part of the learned system.

The drone is modeled with Crazyflow's detailed dynamics in the environment and a faster `so_rpy` model inside the MPC. The rover is a Yahboom ROSMASTER X3, a mecanum-wheeled platform that can move forward, sideways, and rotate. The rover MPC uses a body-velocity model with wheel-speed constraints taken from the X3 configuration.
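A body-velocity model with wheel-speed constraints can be illustrated with the standard mecanum inverse kinematics. The geometry and speed limit below are hypothetical placeholders, not the ROSMASTER X3's real parameters, and the signs assume one common roller arrangement with x forward, y to the left, and counter-clockwise yaw positive.

```python
import numpy as np

# Hypothetical geometry and limits, NOT the ROSMASTER X3's real parameters.
WHEEL_RADIUS = 0.04      # m
HALF_LENGTH = 0.08       # m, wheel axle to center along x
HALF_WIDTH = 0.085       # m, wheel to center along y
MAX_WHEEL_SPEED = 25.0   # rad/s


def wheel_speeds(vx, vy, wz):
    """Body velocity (m/s, m/s, rad/s) -> wheel speeds in rad/s.

    Order: front-left, front-right, rear-left, rear-right.
    """
    k = HALF_LENGTH + HALF_WIDTH
    return np.array([
        vx - vy - k * wz,
        vx + vy + k * wz,
        vx + vy - k * wz,
        vx - vy + k * wz,
    ]) / WHEEL_RADIUS


def is_feasible(vx, vy, wz):
    """True if every wheel stays within the speed limit for this body velocity."""
    return bool(np.all(np.abs(wheel_speeds(vx, vy, wz)) <= MAX_WHEEL_SPEED))
```

In an MPC formulation the same mapping would appear as linear inequality constraints on the body-velocity inputs, so a command like fast diagonal motion is rejected by the solver rather than by the wheels.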

This is a heterogeneous task because the drone and rover have different dynamics, different observation vectors, and different roles. The drone must approach, descend, and touch down, while the rover must move in a way that makes the landing possible.

The task reward is split into shared and agent-specific pieces. The shared reward encourages horizontal and vertical progress toward the landing condition and gives a large landing bonus, while the drone-specific terms penalize crashes, boundary violations, excessive descent speed, bad altitude behavior, too much velocity, attitude error, and rough control. The rover-specific terms discourage unnecessary motion, yaw rate, lateral or backward behavior, rough commands, and boundary contact.

This reward is more like an engineering task description than a pure game. It gives the team a goal, but it also encodes the failure modes that matter on a real platform.

Landing setup

Both MA-AC-MPC and MA-AC-MLP are trained with 256 x 256 actor and critic networks using MAPPO. Training runs for 1.2 million environment steps with 128 parallel environments. The PPO rollout length is 256, with 4 learning epochs and 4 mini-batches per update.

The training hyperparameters are conventional: learning rate 3e-4, discount factor 0.99, GAE parameter 0.95, PPO clipping ratio 0.2, entropy coefficient 0.01, value loss coefficient 1.0, and gradient norm clipping at 0.5. The MPC horizon for both the drone and rover is N = 2.
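Collected in one place, the reported settings look like the config below. The class and field names are illustrative, not taken from the paper's code, and the derived batch sizes assume that each update uses one full rollout from every parallel environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LandingPPOConfig:
    """Landing-task training settings as reported in the post (names are illustrative)."""
    total_env_steps: int = 1_200_000
    n_envs: int = 128
    rollout_length: int = 256
    epochs: int = 4
    minibatches: int = 4
    learning_rate: float = 3e-4
    gamma: float = 0.99
    gae_lambda: float = 0.95
    clip_ratio: float = 0.2
    entropy_coef: float = 0.01
    value_coef: float = 1.0
    max_grad_norm: float = 0.5
    mpc_horizon: int = 2

    @property
    def samples_per_update(self):
        """Transitions collected per update, assuming one rollout per environment."""
        return self.n_envs * self.rollout_length

    @property
    def minibatch_size(self):
        return self.samples_per_update // self.minibatches

    @property
    def gradient_steps_per_update(self):
        return self.epochs * self.minibatches


cfg = LandingPPOConfig()
```

Under that assumption each update collects 32,768 transitions per agent, splits them into minibatches of 8,192, and takes 16 gradient steps.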

The landing curriculum has six levels. It starts with the rover stationary and the drone spawned within 0.50 m horizontally and 0.5 to 1.0 m vertically, then ends with the drone spawned up to 4.50 m away, 0.5 to 2.0 m vertically, rover speed up to 1.0 m/s, and mass/inertia randomization active.

In simulation training, the paper says MA-AC-MLP is more sample efficient on this task, which makes the hardware result more interesting. The easier-to-train neural policy does not transfer as reliably.

Landing hardware numbers

The hardware landing table is the cleanest result in the paper.

| Method | Success rate | Mean final error | Trial errors |
| --- | --- | --- | --- |
| MA-AC-MPC | 5/5, 100% | 0.055 m | 0.060 m, 0.075 m, 0.049 m, 0.053 m, 0.040 m |
| MA-AC-MLP | 3/5, 60% | 0.240 m | 0.581 m, 0.038 m, 0.074 m, 0.046 m, 0.460 m |

The MLP failures are not small misses: the failed trials end with errors of 0.581 m and 0.460 m, while the MA-AC-MPC trials stay tightly clustered around the landing pad.
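The table's summary numbers can be reproduced from the listed trial errors, which also shows where the gap comes from. The sketch uses only the values above; treating 0.581 m and 0.460 m as the two failed MLP trials follows the text, since the post does not give the success threshold.

```python
from statistics import mean

trial_errors_m = {
    "MA-AC-MPC": [0.060, 0.075, 0.049, 0.053, 0.040],
    "MA-AC-MLP": [0.581, 0.038, 0.074, 0.046, 0.460],
}

summary = {
    name: {"mean_m": round(mean(errors), 3), "worst_m": max(errors)}
    for name, errors in trial_errors_m.items()
}

# Mean error of the three MLP trials that the text counts as successful landings.
mlp_failed = {0.581, 0.460}
mlp_success_mean_m = round(
    mean(e for e in trial_errors_m["MA-AC-MLP"] if e not in mlp_failed), 3
)
```

The means match the table at 0.055 m and 0.240 m. The three successful MLP landings average about 0.053 m, so when the MLP lands it is about as accurate as the MPC actor, and the difference in the table is driven by the two failures.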

That is the paper's most concrete argument for putting MPC inside the actor. The MLP can learn the simulation task and even appear strong during training, but the MPC-structured policy is more repeatable when the learned behavior meets a real drone and rover.

For practical autonomy work, that matters more than a smoother training curve. A policy that is slower to train but lands five times in a row may be the better engineering choice.

What the method is really buying

The paper's results point to three benefits. The first is feasibility: the action comes from an MPC solve that knows the agent's dynamics and constraints. That does not guarantee safety, but it reduces the chance that the neural actor asks for behavior the platform cannot produce.

The second is representational efficiency. In the pursuit-evasion task, the authors argue that MA-AC-MPC lets the network operate as a high-level task planner rather than forcing it to memorize as much low-level dynamics inside the neural weights. That helps explain why a smaller MA-AC-MPC network can outperform larger MLP baselines in fewer environment steps, even though each step is much slower.

The third is hardware robustness. The mass-sweep test and rover landing trials both show the same pattern: the MPC-structured actor handles physical variation better than the plain neural actor in the reported experiments.

Those are not universal guarantees, but they are evidence that the architectural bias is useful. The learned policy is still trained from rewards, while the final action is filtered through a model-based planner.

What builders should not overread

MA-AC-MPC is not a shortcut around robotics difficulty. The MPC model still has to be good enough, because if the model is wrong in the wrong way, the solver will produce confident but misleading actions. The paper's strongest experiments use careful drone models, system identification parameters, firmware-in-the-loop checks, and real hardware validation.

The reward still matters as well. Both tasks use shaped rewards and curriculum learning: the pursuit-evasion behavior depends on event rewards, proximity shaping, control penalties, and staged difficulty, while the rover landing behavior depends on a long list of task-specific terms. MA-AC-MPC gives structure to the actor, but it does not remove the need to design the task carefully.

The computation is not free either. Inference around 1 ms per agent is still fast enough for the paper's 50 Hz control loop, but it is roughly five to nine times slower than the MLP baselines, and training time is much higher in the pursuit-evasion task. The CPU-bound solver path also changes how the method scales.
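As a rough budget check using the mean inference times from the pursuit-evasion table, and assuming one solve per agent per control step:

$$
T_{\text{loop}} = \frac{1}{50\ \text{Hz}} = 20\ \text{ms}, \qquad \frac{1.052\ \text{ms}}{20\ \text{ms}} \approx 5.3\%
$$

$$
\frac{1.052}{0.197} \approx 5.3, \qquad \frac{1.052}{0.120} \approx 8.8
$$

The mean solve therefore uses about 5% of the loop period. The reported spread of ±0.474 ms means individual solves can take longer, and these timings say nothing about the onboard or offboard computer used in a different deployment.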

The experiments are also small. Two evaders and two pursuers are enough to show cooperative strategy, but not enough to prove large-swarm behavior. The landing task is a convincing hardware demonstration, but it still rests on only five listed trials per method rather than a field-scale reliability study.

Those limits keep the claim in the right category. MA-AC-MPC is a promising control-learning architecture for cooperative robot teams. It is not a finished answer for every multi-agent drone system.

Where this fits with Nimbus and Droneforge

For Droneforge builders, the useful lesson is not only "use MPC." The more specific lesson is that AI drone autonomy can be split into intent and execution.

Nimbus workflows already separate pieces of the stack: telemetry, route planning, video, object tracking, command generation, and real flight replay. MA-AC-MPC suggests a clean place for learned coordination to live in a developer drone workflow, where the policy chooses local references and weights from observations while a model-based layer converts that into commands that respect the drone's dynamics.

A practical version might look like this:

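```python
# Illustrative sketch only: nimbus.autonomy.ma_ac_mpc, mission, and team are
# hypothetical names used to show where such a policy could sit.
team_policy = nimbus.autonomy.ma_ac_mpc(
    agents=["df1_a", "df1_b", "rover"],
    objective="cooperative_landing",
    model="identified_dynamics",  # dynamics model used inside each agent's MPC
    horizon=2,                    # short MPC horizon, as in the landing task
)

while mission.active:
    observations = team.observe()              # per-agent observations
    commands = team_policy.step(observations)  # learned MPC parameters, then the solve
    team.send(commands)
```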

The important boundary is that the policy is not replacing every safety mechanism. It would still sit behind geofencing, operator override, command limits, health checks, and real telemetry review.

For inspection, moving-platform landing, coordinated search, escort behavior, cooperative tracking, and adversarial testing, that architecture is attractive because it lets a learned policy reason about team behavior without asking a neural network to own every motor-level detail.

What to watch next

The next question is scale. MA-AC-MPC is most compelling when the team behavior is strategic and the robots have real constraints, but that is also what makes the method harder to scale. Every additional agent adds policy complexity, communication assumptions, and solver work. The paper avoids a centralized multi-agent MPC solve, which helps, but the per-agent optimization cost still matters.

Broader hardware trials would also clarify the transfer story. The rover landing table is strong, but repeated tests across lighting, floor conditions, rover speeds, payloads, battery states, and localization noise would show where the MPC structure actually pays off and where it just adds computation.

Another useful direction would be better tooling around model updates. If real flight logs can identify the MPC dynamics quickly, then MA-AC-MPC becomes more than a training architecture and starts to look like part of a live engineering loop: train, deploy, compare telemetry, update model, retrain or retune, and repeat.

That loop is where this kind of method could become useful outside the lab.

Bottom line

This paper is worth reading because it treats multi-agent robot learning as a control problem with learned objectives, not as a choice between hand-written control and black-box policies.

The paper's best result is the hardware landing comparison. A plain neural actor trains efficiently in simulation, but the MPC actor lands more reliably on the real rover, and the pursuit-evasion mass sweep tells a similar story: the MPC actor is slower and heavier, but it handles physical variation better in the tested range.

For drone autonomy, that tradeoff is familiar. The fastest policy to train is not always the policy you want near hardware, and MA-AC-MPC is a serious attempt to make learned cooperation behave more like an engineered control system without giving up the strategic flexibility of reinforcement learning.
