Poisoning Attacks on Multi-Agent Reinforcement Learning Systems

Chanyeok Choi¹, Jaehwan Cho¹, Youngmoon Lee¹
¹Hanyang University
IEEE-RAS Humanoids 2025 · Late-Breaking Report
Two top-down arena views: crawler agents and green target points (left, baseline) and the same arena with red attacker-placed lure points scattered among them (right, under attack).

Crawler agents walk to green target points to collect reward. An attacker scatters red lure points that pay a huge reward on contact — a reward trap planted in the team's shared reward structure that pulls every agent off its path. Left: baseline environment. Right: under attack.

Abstract

Humanoid robots increasingly rely on reinforcement learning with reward models aligned to humans, much as language models are trained to follow human instructions. However, multi-agent reinforcement learning systems such as robot teaming suffer large performance loss from reward-model anomalies, and their low observability makes anomaly detection challenging. This paper investigates the impact of poisoning attacks that exploit shared reward structures in multi-agent reinforcement learning, luring agents into reward traps. Specifically, we present a poisoning attack tailored for deep reinforcement learning in a multi-agent setup, and evaluate its impact on two representative reinforcement learning algorithms: PPO and SAC. Results demonstrate performance degradation of 18.7% (PPO) and 20.9% (SAC). While SAC showed a marginal decline in performance compared to PPO, PPO experienced a severe reward collapse under attack. This suggests that PPO is vulnerable to poisoning attacks, especially in multi-agent environments. These findings call for robust defense mechanisms against reward-based attacks in multi-agent reinforcement learning systems.

Poisoning Attack on Multi-Agent Learning

Environment — crawler vs. attacker. We study policy and reward poisoning in environments with limited data, where an attacker strategically disrupts multi-agent learning dynamics. Within a 50×50 m arena, each crawler agent walks to a green target point to maximize its cumulative reward, while the attacker interferes by generating lure points (marked in red) at arbitrary locations. The reward rules are predefined at environment-design time and are not altered for individual agents: when a crawler reaches a lure point it receives +100 and the attacker receives +1; if a crawler removes a lure point it is penalized with −1. Roles, rewards, and objectives of both sides are fixed in the environment, and each agent learns within this setting.
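The lure-contact rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Unity code: the +100 and +1 values come from the text, while the `Point` type, the `touches` helper, and the 1 m contact radius are hypothetical choices made only for this example. The −1 removal penalty is left out because the sketch does not model lure removal.

```python
from dataclasses import dataclass

# Values taken from the text.
LURE_REWARD_CRAWLER = 100.0   # crawler reward on reaching a lure point
LURE_REWARD_ATTACKER = 1.0    # attacker reward for the same contact


@dataclass
class Point:
    """2D position in the arena, in metres (hypothetical helper type)."""
    x: float
    y: float


def touches(a: Point, b: Point, radius: float = 1.0) -> bool:
    """Return True if a and b are within `radius` metres.

    The 1 m contact radius is an assumption for illustration only.
    """
    return (a.x - b.x) ** 2 + (a.y - b.y) ** 2 <= radius ** 2


def lure_contact_rewards(crawler: Point, lure: Point, radius: float = 1.0):
    """Return (crawler_reward, attacker_reward) for one time step."""
    if touches(crawler, lure, radius):
        return LURE_REWARD_CRAWLER, LURE_REWARD_ATTACKER
    return 0.0, 0.0
```

Because the rule is part of the environment rather than a per-agent modification, every crawler that learns in this arena sees the same inflated payoff.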

Reward manipulation & random behavior. One core component is reward manipulation: injecting premature rewards and subtly altering reward timing to induce suboptimal behavior while evading detection by conventional monitoring systems. The other is random behavior: the attacker acts unpredictably or disrupts key task elements — e.g., relocating the agent's goal object to random positions or unreachable heights — which hinders the agent's ability to learn a stable policy and exposes its vulnerability under non-deterministic conditions.

The tempting reward attack. The attack is an adversarial strategy based on reward addiction: artificially high-reward locations are inserted away from the optimal path, so agents are lured by inflated rewards placed earlier in the environment instead of following the intended trajectory toward the goal. In multi-agent settings this effect is amplified as multiple agents converge on the misleading reward spots, disrupting cooperation and goal achievement. To maximize impact, high-reward points are placed randomly, increasing distraction and destabilizing policy learning — significantly reducing learning efficiency and overall performance.

Implementation. We build the attack in Unity ML-Agents on the Crawler benchmark — a four-legged ragdoll driven by 20 continuous joint actions whose legitimate per-step reward, matchSpeedReward × lookAtTargetReward (≤ 1), plus a +1 target-touch bonus, keeps it walking toward its target. The attacker's lure (TemptingAttack.cs) teleports among five fixed positions — the arena center and the four corners — and relocates after each contact. In the multi-agent scene, a single contact broadcasts the +100 lure reward to all ten crawlers (agent0…agent9) at once, so one touch contaminates the entire team's learning signal through the shared reward structure. We train PPO (on-policy) and SAC (off-policy) in single-agent and multi-agent configurations, each with and without the attack, for 1M steps.
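The broadcast and relocation behavior described above can be sketched as follows. This is a hedged Python illustration, not a port of TemptingAttack.cs: the team size, the +100 reward, and the five lure positions (center plus four corners) come from the text, while the arena being centered on the origin, the lure sitting exactly at the corners, and the uniform choice of the next position are assumptions. The names `Lure`, `lure_positions`, and `broadcast_contact` are hypothetical.

```python
import random

ARENA_SIZE_M = 50.0   # from the text: 50 x 50 m arena
NUM_CRAWLERS = 10     # from the text: agent0 ... agent9
LURE_REWARD = 100.0   # from the text: reward broadcast on contact


def lure_positions(size: float = ARENA_SIZE_M):
    """Arena center plus four corners (assumes an origin-centered arena)."""
    half = size / 2
    return [(0.0, 0.0), (-half, -half), (-half, half), (half, -half), (half, half)]


class Lure:
    """Lure that jumps among the five fixed positions."""

    def __init__(self, rng: random.Random):
        self.rng = rng
        self.positions = lure_positions()
        self.position = self.rng.choice(self.positions)

    def relocate(self) -> None:
        # Assumption: the next position is drawn uniformly from the other four.
        others = [p for p in self.positions if p != self.position]
        self.position = self.rng.choice(others)


def broadcast_contact(team_rewards: dict, lure: Lure) -> None:
    """One contact by any crawler rewards the whole team, then the lure moves."""
    for name in team_rewards:
        team_rewards[name] += LURE_REWARD
    lure.relocate()


rng = random.Random(0)
lure = Lure(rng)
team = {f"agent{i}": 0.0 for i in range(NUM_CRAWLERS)}
before = lure.position
broadcast_contact(team, lure)  # a single touch by one crawler
```

After the call, all ten entries in `team` hold 100.0 even though only one crawler touched the lure. Since the legitimate per-step reward is at most 1, that single contact is worth at least 100 steps of ideal walking for every agent on the team.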

Results

We evaluate the attack across multi-agent and single-agent settings with PPO and SAC, comparing cumulative rewards of crawler and attacker agents with and without the attack after 1M training steps.

Scenario                         Agent     Steps  Cumulative reward
Multi-Agent PPO                  Crawler   1M     528.4
Multi-Agent PPO under Attack     Crawler   1M     429.4
Multi-Agent PPO under Attack     Attacker  1M     −2.903
Multi-Agent SAC                  Crawler   1M     971.3
Multi-Agent SAC under Attack     Crawler   1M     769.9
Multi-Agent SAC under Attack     Attacker  1M     −2.449
Single-Agent PPO                 Crawler   1M     647.5
Single-Agent PPO under Attack    Crawler   1M     302.5
Single-Agent PPO under Attack    Attacker  1M     1
Single-Agent SAC                 Crawler   1M     1,276
Single-Agent SAC under Attack    Crawler   1M     23.93
Single-Agent SAC under Attack    Attacker  1M     −31.43

Table I. Cumulative reward with and without attack. Highlighted rows show crawler performance under the poisoning attack.

In the multi-agent setting, the attack drops PPO from 528.4 to 429.4 (−18.7%) and SAC from 971.3 to 769.9 (−20.9%). In single-agent scenarios the impact is more severe: PPO falls from 647.5 to 302.5 (−53.3%), and SAC shows the largest drop, from 1,276 to 23.93 (−98.1%), while the attacker itself collects little or negative reward. The attack effectively disrupts learning for both algorithms: by final cumulative reward, SAC loses the larger share in both settings, and multi-agent systems, while more robust than single-agent ones, are still clearly affected.
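The relative drops can be recomputed from the crawler rows of Table I with a few lines of Python. The sketch below uses only the rounded table values, and the names `TABLE_I` and `relative_drop` are illustrative. With these inputs the multi-agent SAC drop comes out at about 20.7%, slightly below the 20.9% reported in the text; the source of that small gap is not stated.

```python
# (setting, algorithm) -> (baseline reward, reward under attack), from Table I
TABLE_I = {
    ("multi", "PPO"): (528.4, 429.4),
    ("multi", "SAC"): (971.3, 769.9),
    ("single", "PPO"): (647.5, 302.5),
    ("single", "SAC"): (1276.0, 23.93),
}


def relative_drop(baseline: float, attacked: float) -> float:
    """Percentage of the baseline cumulative reward lost under attack."""
    return 100.0 * (baseline - attacked) / baseline


drops = {key: round(relative_drop(*vals), 1) for key, vals in TABLE_I.items()}

for (setting, algo), pct in drops.items():
    print(f"{setting:6s} {algo}: -{pct}%")
# multi  PPO: -18.7%
# multi  SAC: -20.7%
# single PPO: -53.3%
# single SAC: -98.1%
```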

Reward curves over training steps: single-agent PPO and SAC (top) and multi-agent PPO and SAC (bottom), each with and without the poisoning attack.

Reward variations of PPO and SAC in single-agent (top) and multi-agent (bottom) environments under baseline and attack conditions — attack conditions cause greater reward fluctuations and performance degradation, especially in multi-agent settings.

Takeaway. Multi-agent reinforcement learning systems are at the core of human-interactive robotics, but they are highly vulnerable to reward poisoning attacks during training. This risk is particularly severe in environments with strong inter-agent dependencies, where manipulated rewards can compromise the entire system. Mitigating this threat requires not only robust reward functions but also early detection and defense mechanisms — future research must prioritize not just performance, but also security and trustworthiness.

BibTeX

@inproceedings{choi2025poisoning,
  title     = {Poisoning Attacks on Multi-Agent Reinforcement Learning Systems},
  author    = {Chanyeok Choi and Jaehwan Cho and Youngmoon Lee},
  booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids),
               Late-Breaking Report},
  year      = {2025}
}