Poisoning Attacks on Multi-Agent Reinforcement Learning Systems

Hanyang University
IEEE-RAS Humanoids 2025 · Late-Breaking Report
A team of ML-Agents crawlers in an arena with an attacker-placed tempting reward cube.

An attacker drops a "tempting" cube into the arena. Touching it pays a huge reward to every agent at once — hijacking the shared reward channel and luring a team of crawlers away from walking.

Abstract

Multi-agent reinforcement learning (MARL) is increasingly used to train teams of legged robots, but its reliance on a shared reward signal makes it an attractive target for reward-poisoning attacks. We study a simple, physically-realizable attack against a team of Unity ML-Agents Crawler robots learning to walk: an attacker-in-the-environment places a "tempting" poison cube in the arena that, on contact, injects a large reward broadcast to every agent at once. Because the injected reward (+100) dwarfs the legitimate per-step locomotion reward ([0, 1]) and the goal-touch bonus (+1), the poison dominates the learning signal: agents abandon proper gait and goal-reaching and rush the cube. Crucially, because the reward is broadcast to the whole team, a single agent's contact contaminates every agent's objective — a one-touch, system-wide poisoning of the cooperative reward. We implement the attack in Unity ML-Agents and evaluate its effect across learning algorithms (PPO and SAC) and team sizes (single vs. multi-crawler), framing MARL locomotion as a system that must survive a poisoning attack.

Method

We implement the attack in Unity ML-Agents on the Crawler benchmark — a team of four-legged robots learning to walk toward a target.

Agent & clean objective. The Crawler is a four-legged ragdoll driven by 20 continuous actions (per-leg joint target rotations and joint strengths). Its observations include velocity relative to a stabilized orientation cube, the velocity goal, body-direction alignment, the target's relative position, a downward ground raycast, and per-body-part ground contact. The legitimate per-step reward is matchSpeedReward × lookAtTargetReward — approaching 1 when the crawler both matches its randomized target walking speed (0.1–15) and faces the target — plus +1 when it touches its legitimate target.
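The clean per-step reward can be sketched in a few lines of Python. Only the product structure (matchSpeedReward × lookAtTargetReward) comes from the description above. The two shaping functions are assumptions modeled on the stock ML-Agents Crawler example, and the function and parameter names are hypothetical, so the authors' build may differ.

```python
import math


def match_speed_reward(avg_velocity, goal_direction, target_speed):
    """Assumed shaping: 1 when body velocity equals the goal velocity, 0 when far off."""
    goal_velocity = [c * target_speed for c in goal_direction]
    delta = math.dist(avg_velocity, goal_velocity)
    delta = min(delta, target_speed)  # clamp so the result stays in [0, 1]
    return (1.0 - (delta / target_speed) ** 2) ** 2


def look_at_target_reward(head_forward, goal_direction):
    """Assumed shaping: map the facing dot product from [-1, 1] onto [0, 1]."""
    dot = sum(h * g for h, g in zip(head_forward, goal_direction))
    return (dot + 1.0) * 0.5


def clean_step_reward(avg_velocity, head_forward, goal_direction, target_speed):
    """Product of the two terms, as described in the text; lies in [0, 1]."""
    return (match_speed_reward(avg_velocity, goal_direction, target_speed)
            * look_at_target_reward(head_forward, goal_direction))
```

Because the two terms are multiplied, the reward is near 1 only when both conditions hold: a crawler moving at the right speed while facing away from the target, or facing the target while standing still, earns close to zero.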

The attack: a tempting poison cube. An attacker-placed cube teleports among five fixed positions (the arena center and the four corners). When any agent collides with the cube, the contact broadcasts AddReward(+100) to all agents (agent0…agent9) simultaneously, and the cube then relocates. At roughly 100× the scale of the clean signal, the poison reshapes the optimal policy from "walk to the goal" into "reach the cube". Because the reward is broadcast, a single contact contaminates the cooperative objective of the whole team, which is the core multi-agent vulnerability.
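The contact logic described above can be illustrated with a small, engine-free Python sketch. The +100 reward, the five spawn points (center plus four corners), and the broadcast to every agent are taken from the text. The class names, the arena half-width, and the uniformly random choice of the next position are assumptions made for illustration; this is not the Unity implementation.

```python
import random
from dataclasses import dataclass, field

POISON_REWARD = 100.0  # contact reward stated in the text


@dataclass
class Agent:
    name: str
    reward: float = 0.0

    def add_reward(self, amount):
        # Stand-in for the ML-Agents AddReward call.
        self.reward += amount


@dataclass
class PoisonCube:
    half_extent: float = 10.0  # hypothetical arena half-width
    rng: random.Random = field(default_factory=random.Random)
    position: tuple = (0.0, 0.0)

    def spawn_points(self):
        # Arena center and the four corners.
        e = self.half_extent
        return [(0.0, 0.0), (e, e), (e, -e), (-e, e), (-e, -e)]

    def on_contact(self, team):
        # One contact pays every agent, not only the one that touched the cube.
        for agent in team:
            agent.add_reward(POISON_REWARD)
        # Assumption: the next position is drawn uniformly at random.
        self.position = self.rng.choice(self.spawn_points())
```

With a ten-agent team built as `[Agent(f"agent{i}") for i in range(10)]`, a single call to `on_contact` raises every agent's reward by 100, regardless of which agent touched the cube.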

Setup. We use Unity ML-Agents (scenes Crawler, MAPA, MARLAA) with two learners, PPO (on-policy) and SAC (off-policy), and two team sizes, single and multi-crawler (the poison reward is broadcast to up to 10 agents). Each configuration is trained both clean (no cube) and poisoned (tempting attack active).
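The setup amounts to a 2 × 2 × 2 grid of training runs, which can be enumerated as in the sketch below. The three factors come from the text; the dictionary keys and the `run_id` naming scheme are hypothetical and do not correspond to any ML-Agents configuration format.

```python
from itertools import product

ALGORITHMS = ("PPO", "SAC")
TEAM_SIZES = ("single", "multi")
CONDITIONS = ("clean", "poisoned")


def experiment_grid():
    """Return one entry per (algorithm, team size, condition) combination."""
    return [
        {
            "algorithm": algo,
            "team": team,
            "condition": cond,
            # Hypothetical naming scheme, used only to tell runs apart.
            "run_id": f"{algo.lower()}_{team}_{cond}",
        }
        for algo, team, cond in product(ALGORITHMS, TEAM_SIZES, CONDITIONS)
    ]
```

Enumerating the grid explicitly makes it easy to check that every poisoned run has a clean counterpart with the same learner and team size.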

Schematic: the tempting cube injects +100 into the shared reward, broadcast to every agent.

The poison channel. The clean reward keeps each crawler walking toward its target; the tempting cube injects +100 on a single contact and broadcasts it to the whole team, overwhelming the locomotion signal. (placeholder figure)

Results

We compare clean training against training under the tempting attack across the PPO/SAC × single/multi grid. Once the cube is discovered, its +100 contact reward dominates the ≤1 locomotion signal and collapses gait into cube-seeking; broadcasting the reward spreads this across the team. The grid below is the reported evaluation — quantitative entries are pending the original Humanoids 2025 run logs.
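The scale gap can be made concrete with a little arithmetic. The sketch below uses only the reward magnitudes stated in the text (+100 per contact, at most 1 per locomotion step); the function names are hypothetical, and it deliberately ignores discounting and episode length, which would be needed for a full return comparison.

```python
import math

POISON_REWARD = 100.0   # per contact, from the text
MAX_STEP_REWARD = 1.0   # upper bound of the clean per-step reward


def steps_to_match_poison(step_reward=MAX_STEP_REWARD):
    """Locomotion steps at `step_reward` per step needed to equal one cube contact."""
    return math.ceil(POISON_REWARD / step_reward)


def team_injection(num_agents):
    """Total reward injected across the team by a single broadcast contact."""
    return POISON_REWARD * num_agents
```

Under these numbers, one contact is worth at least 100 steps of perfect locomotion to a single agent (`steps_to_match_poison()` returns 100), and a broadcast to a ten-agent team injects 1000 units of reward in total.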

| Setting | Clean locomotion reward | Clean goal success | Poisoned cube-contact rate | Poisoned locomotion reward |
| --- | --- | --- | --- | --- |
| PPO · single | | | | |
| PPO · multi | | | | |
| SAC · single | | | | |
| SAC · multi | | | | |

Table 1. Evaluation grid — clean vs. tempting-attack across PPO/SAC and single/multi-crawler. Populate from the training logs (higher locomotion reward / goal success is better; a higher cube-contact rate means a stronger lure).
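One way the table columns could be computed from per-episode records is sketched below. The record keys are hypothetical, and defining the cube-contact rate as the fraction of episodes with at least one contact is an assumption; the source does not specify the log format or the exact metric definitions.

```python
from statistics import mean


def summarize(episodes):
    """Aggregate per-episode records into the table's columns.

    Each record is a dict with hypothetical keys:
    'locomotion_reward' (float), 'reached_goal' (bool), 'cube_contacts' (int).
    """
    return {
        "locomotion_reward": mean(e["locomotion_reward"] for e in episodes),
        "goal_success": mean(1.0 if e["reached_goal"] else 0.0 for e in episodes),
        # Assumed definition: share of episodes with at least one cube contact.
        "cube_contact_rate": mean(1.0 if e["cube_contacts"] > 0 else 0.0
                                  for e in episodes),
    }
```

Keeping the locomotion reward separate from the injected poison reward matters here: the poisoned locomotion column is meant to show how much walking quality is lost, which would be masked if the +100 contacts were included in the same total.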

Locomotion reward over training, clean versus under the tempting attack.

Effect of the attack. Locomotion reward over training, clean vs. under the tempting attack, contrasting PPO/SAC and single/multi-crawler. (placeholder figure)

BibTeX

@inproceedings{choi2025poisoning,
  title     = {Poisoning Attacks on Multi-Agent Reinforcement Learning Systems},
  author    = {Chanyeok Choi and Cho and Youngmoon Lee},
  booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids),
               Late-Breaking Report},
  year      = {2025}
}