01 — Talk
Presentation Video
ICML talk recording hosted on SlidesLive.
02 — Materials
Project Materials
Download the ICML presentation slides and poster.
03 — Abstract
Abstract
Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporal-difference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is that collapse is not only a property of the backup rule or network architecture: optimizer dynamics themselves can directly trigger or suppress instability.
From a control-theoretic viewpoint, we model offline TD learning as a feedback system and analyze Adam-based critic updates. This yields a necessary and sufficient condition for stability of the induced local update dynamics: within the regime we analyze, these dynamics are stable if and only if the spectral radius of the corresponding update operator is strictly below one.
Key Counter-Intuitive Insights
- ✓ Collapse is Dynamic: Value collapse is directly triggered by the interplay of optimization dynamics and bootstrapping.
- ✓ Moment Contamination: Standard loss-based penalties inject noise into Adam's exponential moving averages, distorting the optimization trajectory.
- ✓ Decoupled Safety: Decoupling the orthogonality step from the loss guarantees worst-case safety for task optimization while preserving Adam's dissipative properties.
04 — Theory
Theoretical Framework
We model offline TD learning as a feedback system. In standard supervised learning, targets are fixed. In TD, updates propagate through a bootstrapped next-action target set $X'$:
$$e_{t+1} = e_t + \eta S \bar{e}_t + o(\eta)$$
$$S \triangleq \gamma K(X', X) - K(X, X)$$
Here $K(X_1, X_2) = Z(X_1)^\top D Z(X_2)$ is the preconditioned Gram operator, which captures pairwise feature similarity.
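To make the definition of $S$ concrete, here is a minimal sketch that builds it for two samples and inspects the real parts of its eigenvalues. The feature matrices `Z_X`, `Z_XP_MILD`, `Z_XP_LARGE`, the identity preconditioner `D`, and the value $\gamma = 0.9$ are made-up toy values, not quantities from the paper. The closed-form 2x2 eigenvalue formula is used only to keep the example free of dependencies.

```python
import cmath


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def gram(z1, z2, d):
    """K(X1, X2) = Z(X1)^T D Z(X2); each column of z1/z2 is one sample's features."""
    return matmul(matmul(transpose(z1), d), z2)


def feedback_operator(z_x, z_xp, d, gamma):
    """S = gamma * K(X', X) - K(X, X)."""
    k_next = gram(z_xp, z_x, d)
    k_self = gram(z_x, z_x, d)
    n = len(k_self)
    return [[gamma * k_next[i][j] - k_self[i][j] for j in range(n)] for i in range(n)]


def eig2(m):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = cmath.sqrt(tr * tr - 4 * det)
    return (tr + root) / 2, (tr - root) / 2


# Hypothetical toy inputs (assumptions for illustration only).
GAMMA = 0.9
D = [[1.0, 0.0], [0.0, 1.0]]
Z_X = [[1.0, 0.2], [0.0, 1.0]]          # features of the dataset inputs X
Z_XP_MILD = [[0.5, 0.1], [0.0, 0.5]]    # next-action features, half the scale of Z_X
Z_XP_LARGE = [[2.0, 0.4], [0.0, 2.0]]   # next-action features, twice the scale of Z_X

eigs_mild = eig2(feedback_operator(Z_X, Z_XP_MILD, D, GAMMA))
eigs_large = eig2(feedback_operator(Z_X, Z_XP_LARGE, D, GAMMA))

print("mild  next-action features:", [round(ev.real, 3) for ev in eigs_mild])
print("large next-action features:", [round(ev.real, 3) for ev in eigs_large])
```

In this toy setup the bootstrapped term $\gamma K(X', X)$ stays below $K(X, X)$ in the first case, so every eigenvalue of $S$ has a negative real part. In the second case the bootstrapped term dominates and the real parts turn positive.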
The Hurwitz Condition for Stability
The discrete-time linear recurrence converges exponentially to $0$ if and only if the spectral radius of the companion matrix satisfies
$$\rho(A(\eta)) < 1$$
This holds for sufficiently small $\eta$ if and only if $S$ is Hurwitz, i.e., all of its eigenvalues have negative real parts:
$$\text{Re}(\lambda) < 0 \quad \forall \lambda \in \text{spec}(S)$$
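The link between the spectral radius and the Hurwitz property can be checked numerically in a simplified setting. The sketch below assumes $\bar{e}_t = e_t$ and drops both the $o(\eta)$ term and Adam's moment states, so the update operator reduces to $I + \eta S$. This is a simplification made for illustration, not the companion matrix $A(\eta)$ analyzed in the paper, and the eigenvalue lists are hypothetical.

```python
def spectral_radius(eigs_s, eta):
    """rho(I + eta*S): each eigenvalue lam of S maps to 1 + eta*lam."""
    return max(abs(1 + eta * lam) for lam in eigs_s)


def step_size_bound(eigs_s):
    """Upper bound on eta such that rho(I + eta*S) < 1 for all 0 < eta < bound.

    |1 + eta*lam|^2 = 1 + 2*eta*Re(lam) + eta^2*|lam|^2, which is below 1
    exactly when eta < -2*Re(lam) / |lam|^2. Returns 0.0 if S is not Hurwitz,
    since then no positive step size is stable.
    """
    if any(lam.real >= 0 for lam in eigs_s):
        return 0.0
    return min(-2 * lam.real / abs(lam) ** 2 for lam in eigs_s)


# Hypothetical spectra (assumptions for illustration only).
HURWITZ = [complex(-0.5, 1.0), complex(-0.5, -1.0)]
NOT_HURWITZ = [complex(0.2, 0.0), complex(-1.0, 0.0)]

for name, eigs in [("Hurwitz", HURWITZ), ("not Hurwitz", NOT_HURWITZ)]:
    print(name, "step-size bound:", step_size_bound(eigs))
    for eta in (0.1, 1.0):
        print(f"  eta={eta}: rho={spectral_radius(eigs, eta):.4f}")
```

For the Hurwitz spectrum, a small step size gives $\rho < 1$ while a step size above the bound does not, which is why the equivalence is stated only for sufficiently small $\eta$. For the non-Hurwitz spectrum, every positive step size leaves $\rho > 1$.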
Sufficient Condition: $\gamma \|\Phi\|_2 \|\Phi_*\|_2 + \|\Phi^\top \Phi - I\|_2 < 1$. Input normalization limits the scale term, while AdamO dynamically reduces the geometric distortion $\|\Phi^\top \Phi - I\|_2$.
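The sufficient condition can be evaluated directly on small matrices. In the sketch below, treating $\Phi$ and $\Phi_*$ as dense feature matrices for the current and next-action inputs is an assumption, as are the example matrices and $\gamma = 0.9$. The spectral norm is estimated by power iteration to avoid external dependencies.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def spectral_norm(m, iters=500):
    """Largest singular value of m, via power iteration on m^T m.

    The fixed start vector is assumed not to be orthogonal to the top
    singular direction, which holds for the examples below.
    """
    gram = matmul(transpose(m), m)
    v = [1.0 / (i + 1) for i in range(len(gram))]
    top = 0.0
    for _ in range(iters):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        top = norm  # approaches the largest eigenvalue of m^T m
    return math.sqrt(top)


def sufficient_lhs(phi, phi_next, gamma):
    """gamma * ||Phi||_2 * ||Phi_*||_2 + ||Phi^T Phi - I||_2."""
    g = matmul(transpose(phi), phi)
    n = len(g)
    distortion = [[g[i][j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    scale = gamma * spectral_norm(phi) * spectral_norm(phi_next)
    return scale + spectral_norm(distortion)


# Hypothetical feature matrices (assumptions for illustration only).
GAMMA = 0.9
PHI_NEXT = [[1.0, 0.0], [0.0, 1.0]]
PHI_ORTHONORMAL = [[1.0, 0.0], [0.0, 1.0]]
PHI_DISTORTED = [[1.3, 0.0], [0.0, 0.6]]

lhs_orthonormal = sufficient_lhs(PHI_ORTHONORMAL, PHI_NEXT, GAMMA)
lhs_distorted = sufficient_lhs(PHI_DISTORTED, PHI_NEXT, GAMMA)
print("orthonormal features:", round(lhs_orthonormal, 4), lhs_orthonormal < 1)
print("distorted features:  ", round(lhs_distorted, 4), lhs_distorted < 1)
```

With orthonormal, unit-scale features the distortion term vanishes and the condition reduces to $\gamma \|\Phi_*\|_2 < 1$. The distorted example fails the condition through both the scale term and the distortion term.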
05 — Simulator
Interactive TD Feedback Simulator
Interactively simulate standard Adam (expansive regime) vs. AdamO (contractive regime).
📈 Graph Interpretation:
Critic Loss (y-axis): Simulated bootstrapped TD error. The loss blows up when the spectrum is expansive ($\rho \ge 1$; growth is exponential once $\rho > 1$) but decays smoothly when the Hurwitz condition holds ($\rho < 1$).
💡 Tip:
Try dragging the Orthogonality ($\kappa$) slider up to emulate enabling AdamO, and watch the loss stabilize in real time.
06 — Playground
Geometry Projection Playground
When a regularizer is added to the loss blindly, standard optimizers can degrade task performance, because the weight-constraint gradients pollute the running moment estimates.
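The contamination effect shows up even in a single scalar Adam step. The sketch below is a generic illustration in the spirit of decoupled weight decay, not AdamO's actual update rule. The task gradient `g`, the regularizer gradient `r`, and the hyperparameters are hypothetical values chosen so that the regularizer dominates.

```python
from dataclasses import dataclass


@dataclass
class AdamState:
    m: float = 0.0  # first-moment estimate
    v: float = 0.0  # second-moment estimate
    t: int = 0      # step counter


def adam_direction(state, grad, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update the moments with `grad` and return the bias-corrected direction."""
    state.t += 1
    state.m = beta1 * state.m + (1 - beta1) * grad
    state.v = beta2 * state.v + (1 - beta2) * grad * grad
    m_hat = state.m / (1 - beta1 ** state.t)
    v_hat = state.v / (1 - beta2 ** state.t)
    return m_hat / (v_hat ** 0.5 + eps)


g = 1.0    # hypothetical task gradient
r = -3.0   # hypothetical regularizer gradient, opposing and larger than g

# Coupled: the regularizer gradient is folded into the loss gradient,
# so it enters both moment estimates.
coupled = AdamState()
coupled_dir = adam_direction(coupled, g + r)

# Decoupled: the moments only ever see the task gradient; the regularizer
# would be applied as a separate correction outside this call.
decoupled = AdamState()
decoupled_dir = adam_direction(decoupled, g)

print("coupled   m:", coupled.m, "direction:", round(coupled_dir, 4))
print("decoupled m:", decoupled.m, "direction:", round(decoupled_dir, 4))
```

In the coupled case the first moment takes the sign of the regularizer, so the normalized step moves against task descent. In the decoupled case the moments track the task gradient alone.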
AdamO's budget mechanism ($\tau$) prevents the orthogonality gradient $r_t$ from cancelling the first-order task descent $g_t$. The corrective drift is calculated, then projected if it exceeds the task budget.
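One plausible reading of the budget mechanism is a half-space projection, sketched below. It assumes the parameter step is taken along $-(g_t + r_t)$ and that the budget requires $\langle g_t, r_t \rangle \ge -\tau \|g_t\|^2$, so the drift can cancel at most a fraction $\tau$ of the first-order task descent. Both the constraint and the function name `project_to_budget` are assumptions for illustration; the paper's exact rule may differ.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_to_budget(r, g, tau):
    """Return r unchanged if <g, r> >= -tau * ||g||^2, else its projection.

    The projection is onto the half-space {r : <g, r> >= -tau * ||g||^2},
    which only changes the component of r along g.
    """
    g_sq = dot(g, g)
    if g_sq == 0.0:
        return list(r)  # no task gradient, nothing to protect
    overlap = dot(g, r)
    floor = -tau * g_sq
    if overlap >= floor:
        return list(r)  # drift is within budget
    shift = (floor - overlap) / g_sq  # positive amount added back along g
    return [ri + shift * gi for ri, gi in zip(r, g)]


# Hypothetical vectors (assumptions for illustration only).
g_t = [1.0, 0.0]
tau = 0.25

r_conflicting = [-0.8, 0.5]  # would cancel 80% of the task descent
r_benign = [0.3, 0.5]        # does not oppose the task gradient

r_projected = project_to_budget(r_conflicting, g_t, tau)
r_untouched = project_to_budget(r_benign, g_t, tau)

print("projected:", [round(x, 4) for x in r_projected])
print("untouched:", r_untouched)
print("descent kept:", round(dot(g_t, g_t) + dot(g_t, r_projected), 4))
```

Under these assumptions the component of the drift orthogonal to $g_t$ is preserved, and the first-order task descent stays at or above $(1 - \tau)\|g_t\|^2$.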
Gradient Projection Space
07 — Results
Empirical Performance (D4RL)
| Domain | Task Name | Baseline Adam | AdamO (Ours) | Improvement (relative % or absolute score) |
|---|---|---|---|---|
| AntMaze | AntMaze-umaze-diverse | 47.0 | 82.2 | 📈 +74.9% |
| AntMaze | AntMaze-medium-play | 0.3 | 28.5 | 🚀 +28.2 (Score) |
| AntMaze | AntMaze-large-diverse | 0.0 | 16.5 | 🚀 +16.5 (Score) |
| Locomotion | Hopper-medium-expert | 32.6 | 84.2 | 📈 +158.2% |
| Locomotion | Walker2d-medium-expert | 22.4 | 98.5 | 📈 +339.7% |
| Adroit | Pen-human | -4.1 | 83.1 | 🚀 Fully Recovered |
| Adroit | Pen-cloned | 5.6 | 82.4 | 📈 +1371% |
08 — Cite
BibTeX Citation
@inproceedings{qiao2026adamo,
title={AdamO: A Collapse-Suppressed Optimizer for Offline RL},
author={Qiao, Nan and Yue, Sheng and Wang, Shuning and Ren, Ju},
booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year={2026},
url={https://arxiv.org/abs/2605.01968}
}