Poisoning Attacks on Multi-Agent Reinforcement Learning Systems
Abstract
Multi-agent reinforcement learning (MARL) is increasingly used to train teams of legged robots, but its reliance on a shared reward signal makes it an attractive target for reward-poisoning attacks. We study a simple, physically realizable attack against a team of Unity ML-Agents Crawler robots learning to walk: an attacker in the environment places a "tempting" poison cube in the arena that, on contact, injects a large reward broadcast to every agent at once. Because the injected reward (+100) dwarfs the legitimate per-step locomotion reward (in [0, 1]) and the goal-touch bonus (+1), the poison dominates the learning signal: agents abandon proper gait and goal-reaching and rush the cube. Crucially, because the reward is broadcast to the whole team, a single agent's contact contaminates every agent's objective, yielding a one-touch, system-wide poisoning of the cooperative reward. We implement the attack in Unity ML-Agents and evaluate its effect across learning algorithms (PPO and SAC) and team sizes (single vs. multi-crawler), framing MARL locomotion as a system that must survive a poisoning attack.
Method
We implement the attack in Unity ML-Agents on the Crawler benchmark — a team of four-legged robots learning to walk toward a target.
Agent & clean objective. The Crawler is a four-legged ragdoll driven by
20 continuous actions (per-leg joint target rotations and joint strengths). Its
observations include velocity relative to a stabilized orientation cube, the velocity goal, body-direction
alignment, the target's relative position, a downward ground raycast, and per-body-part ground contact. The
legitimate per-step reward is matchSpeedReward × lookAtTargetReward — approaching 1 when
the crawler both matches its randomized target walking speed (0.1–15) and faces the target — plus
+1 when it touches its legitimate target.
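As a rough illustration of the clean objective, the product form above can be sketched in Python. The source states only that the per-step reward is matchSpeedReward × lookAtTargetReward and lies in [0, 1]; the specific shaping functions below (a clamped quartic falloff for speed matching and a rescaled cosine for heading) are assumptions chosen to satisfy that range, and all function and parameter names are hypothetical.

```python
import math

GOAL_TOUCH_BONUS = 1.0  # +1 when the crawler touches its legitimate target


def match_speed_reward(velocity_goal, actual_velocity, target_speed):
    """Assumed shaping: 1 when the velocity equals the goal, falling to 0
    once the mismatch reaches target_speed."""
    delta = math.dist(velocity_goal, actual_velocity)
    delta = min(max(delta, 0.0), target_speed)
    return (1.0 - (delta / target_speed) ** 2) ** 2


def look_at_target_reward(forward, to_target):
    """Assumed shaping: cosine between heading and target direction,
    rescaled from [-1, 1] to [0, 1]."""
    dot = sum(f * t for f, t in zip(forward, to_target))
    norm = math.hypot(*forward) * math.hypot(*to_target)
    return (dot / norm + 1.0) * 0.5


def clean_step_reward(velocity_goal, actual_velocity, target_speed,
                      forward, to_target):
    """Per-step clean reward: the product of the two terms, so both must
    be high for the reward to approach 1."""
    return (match_speed_reward(velocity_goal, actual_velocity, target_speed)
            * look_at_target_reward(forward, to_target))
```

Because the two terms are multiplied rather than added, a crawler that walks at the right speed while facing away from the target still earns nothing under this sketch.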
The attack — a tempting poison cube. An attacker-placed cube teleports among
five fixed positions (arena center and the four corners). When any agent
collides with it, it broadcasts AddReward(+100) to all agents
(agent0…agent9) simultaneously and then relocates. At roughly 100× the
scale of the clean signal, the poison reshapes the optimal policy from "walk to the goal" into "reach the
cube"; broadcasting the reward turns a single contact into team-wide contamination of the
cooperative objective — the core multi-agent vulnerability.
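The broadcast mechanism can be mimicked outside Unity with a few lines of Python. This is a simplified stand-in for the collision handler, not the project's actual code: the class name, position labels, and the choice to always relocate to a different position are assumptions for illustration.

```python
import random

POISON_REWARD = 100.0
# Five fixed positions: arena center and the four corners (labels are hypothetical).
CUBE_POSITIONS = ["center", "corner_0", "corner_1", "corner_2", "corner_3"]


class PoisonCube:
    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random.Random()
        self.position = self.rng.choice(CUBE_POSITIONS)

    def on_contact(self, team_rewards):
        """Called when any one agent touches the cube: every agent on the
        team receives the poison reward, then the cube relocates."""
        for agent_id in team_rewards:
            team_rewards[agent_id] += POISON_REWARD
        # Assumption: the cube never respawns at the position it just left.
        others = [p for p in CUBE_POSITIONS if p != self.position]
        self.position = self.rng.choice(others)


team_rewards = {f"agent{i}": 0.0 for i in range(10)}
cube = PoisonCube(random.Random(0))
cube.on_contact(team_rewards)  # one agent's contact pays the whole team
```

With ten agents, a single contact adds 1000 to the team's summed reward. Since the clean per-step reward is at most 1, each agent would need at least 100 steps of near-perfect locomotion to earn what one contact by any teammate provides.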
Setup. Experiments run in Unity ML-Agents using the scenes Crawler, MAPA, and
MARLAA. The learners are PPO (on-policy) and SAC (off-policy), and teams are either a
single crawler or multiple crawlers (the poison reward is broadcast to up to 10 agents). Each
configuration is trained both clean (no cube) and poisoned (tempting attack active).
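The setup amounts to a 2 × 2 × 2 grid of training runs. A minimal sketch of enumerating it follows; the run identifiers and dictionary keys are hypothetical naming choices, not taken from the project.

```python
from itertools import product

ALGORITHMS = ["PPO", "SAC"]
TEAM_SIZES = ["single", "multi"]
CONDITIONS = ["clean", "poisoned"]


def experiment_grid():
    """Enumerate every algorithm x team size x condition combination."""
    return [
        {
            "algorithm": algo,
            "team": team,
            "condition": cond,
            # Hypothetical run identifier, e.g. "ppo_single_clean".
            "run_id": f"{algo.lower()}_{team}_{cond}",
        }
        for algo, team, cond in product(ALGORITHMS, TEAM_SIZES, CONDITIONS)
    ]


runs = experiment_grid()
```

Each clean run serves as the baseline for the poisoned run with the same algorithm and team size, so the eight runs form four clean/poisoned pairs.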
The poison channel. The clean reward keeps each crawler walking toward its target; the tempting cube injects +100 on a single contact and broadcasts it to the whole team, overwhelming the locomotion signal. (placeholder figure)
Results
We compare clean training against training under the tempting attack across the PPO/SAC × single/multi grid. Once the cube is discovered, its +100 contact reward dominates the ≤1 locomotion signal and collapses gait into cube-seeking; broadcasting the reward spreads this across the team. Table 1 lays out the evaluation grid; its quantitative entries are pending the original Humanoids 2025 run logs.
| Setting | Clean locomotion reward | Clean goal success | Poisoned cube-contact rate | Poisoned locomotion reward |
|---|---|---|---|---|
| PPO · single | — | — | — | — |
| PPO · multi | — | — | — | — |
| SAC · single | — | — | — | — |
| SAC · multi | — | — | — | — |
Table 1. Evaluation grid: clean vs. tempting-attack training across PPO/SAC and single/multi-crawler. Entries are to be populated from the training logs (higher locomotion reward and goal success are better; a higher cube-contact rate means a stronger lure).
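One way the table's columns could be computed from per-episode records is sketched below. The record format is hypothetical (the field names `locomotion_reward`, `reached_goal`, and `cube_contacts` are not from the original logs), and defining the cube-contact rate as the fraction of episodes with at least one contact is an assumption, since the source does not define the metric.

```python
from statistics import mean


def summarize(episodes):
    """Aggregate hypothetical per-episode records into table metrics.

    Each record is assumed to hold:
      locomotion_reward: mean clean per-step reward, excluding any poison
      reached_goal:      True if the legitimate target was touched
      cube_contacts:     number of poison-cube contacts in the episode
    """
    return {
        "locomotion_reward": mean(e["locomotion_reward"] for e in episodes),
        "goal_success": mean(1.0 if e["reached_goal"] else 0.0
                             for e in episodes),
        # Assumed definition: share of episodes with at least one contact.
        "cube_contact_rate": mean(1.0 if e["cube_contacts"] > 0 else 0.0
                                  for e in episodes),
    }


# Toy records for illustration only; these are not experimental results.
toy = [
    {"locomotion_reward": 0.5, "reached_goal": True, "cube_contacts": 0},
    {"locomotion_reward": 0.25, "reached_goal": False, "cube_contacts": 2},
]
summary = summarize(toy)
```

Keeping the locomotion reward separate from the injected +100 matters here: if the poison were folded into the logged reward, a poisoned run could look better than its clean counterpart even as gait collapses.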
Effect of the attack. Locomotion reward over training, clean vs. under the tempting attack, contrasting PPO/SAC and single/multi-crawler. (placeholder figure)
BibTeX
@inproceedings{choi2025poisoning,
title = {Poisoning Attacks on Multi-Agent Reinforcement Learning Systems},
author = {Chanyeok Choi and Cho and Youngmoon Lee},
booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids),
Late-Breaking Report},
year = {2025}
}