AdamO: A Collapse-Suppressed Optimizer for Offline RL

01 — Talk

Presentation Video

ICML talk recording hosted on SlidesLive.


02 — Materials

Project Materials

Download the ICML presentation slides and poster.

Presentation Slides (PDF)

Full ICML talk deck for AdamO, including motivation, method details, and experimental results.

Conference Poster (PDF)

A0 poster summary of the stability analysis, AdamO update rule, and benchmark performance.

03 — Abstract

Abstract

Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporal-difference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is that collapse is not only a property of the backup rule or network architecture: optimizer dynamics themselves can directly trigger or suppress instability.

From a control-theoretic viewpoint, we model offline TD learning as a feedback system and analyze Adam-based critic updates. This yields a necessary and sufficient condition for stability of the induced local update dynamics: within the regime we analyze, these dynamics are stable if and only if the spectral radius of the corresponding update operator is strictly below one.

Key Counter-Intuitive Insights

✓ Collapse is Dynamic: Value collapse is directly triggered by the interplay of optimization dynamics and bootstrapping.
✓ Moment Contamination: Standard loss-based penalties inject noise into Adam's exponential moving averages, distorting the optimization trajectory.
✓ Decoupled Safety: Applying the orthogonality step separately from the Adam update guarantees worst-case safety for task optimization while preserving Adam's dissipative behavior.
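
The contrast between a loss-based penalty and a decoupled correction can be sketched as follows. This is a minimal NumPy illustration in the spirit of decoupled weight decay, not AdamO's actual update rule: the function names, the use of `kappa` as the correction scale, and the omission of bias correction are all assumptions made for brevity.

```python
import numpy as np

def adam_moments(m, v, grad, beta1=0.9, beta2=0.999):
    """One exponential-moving-average update of Adam's first and second moments."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    return m, v

def coupled_step(w, m, v, g, r, lr=1e-3, eps=1e-8):
    """Loss-based penalty: the penalty gradient r is summed into the gradient
    that feeds the moments, so both m and v depend on r."""
    m, v = adam_moments(m, v, g + r)
    return w - lr * m / (np.sqrt(v) + eps), m, v

def decoupled_step(w, m, v, g, r, lr=1e-3, kappa=1e-2, eps=1e-8):
    """Decoupled correction (assumed form): the moments see only the task
    gradient g; the correction r is applied as a separate step scaled by kappa."""
    m, v = adam_moments(m, v, g)
    return w - lr * m / (np.sqrt(v) + eps) - kappa * r, m, v

w = np.zeros(3)
zeros = np.zeros(3)
g = np.array([1.0, -2.0, 0.5])    # task (TD) gradient
r = np.array([10.0, 10.0, 10.0])  # large penalty gradient

_, m_coupled, _ = coupled_step(w, zeros, zeros, g, r)
_, m_decoupled, _ = decoupled_step(w, zeros, zeros, g, r)
```

In the coupled variant the first moment tracks `g + r`, so a large penalty gradient dominates the moving average; in the decoupled variant it tracks `g` alone.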

04 — Theory

Theoretical Framework

We model offline TD learning as a feedback system. In standard supervised learning the targets are fixed. In TD learning the targets are themselves produced by the critic on a bootstrapped next-action set $X'$, so every update feeds back into its own targets:

// Linearized TD-error dynamics
$$e_{t+1} = e_t + \eta S \bar{e}_t + o(\eta)$$

// TD Feedback Operator
$$S \triangleq \gamma K(X', X) - K(X, X)$$

Here $K(X_1, X_2) = Z(X_1)^\top D Z(X_2)$ is the preconditioned Gram operator, which captures pairwise feature similarity between the two input sets.
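
A minimal NumPy sketch of the two definitions above, under stated assumptions: `Z` and `Z_next` are hypothetical arrays holding $Z(X)$ and $Z(X')$ with shape `(p, n)` (one column per sample), and `D` is a `(p, p)` preconditioner. The helper names are illustrative and do not come from the paper.

```python
import numpy as np

def gram(Z1, D, Z2):
    """K(X1, X2) = Z(X1)^T D Z(X2), with Z(.) stored as (p, n) arrays."""
    return Z1.T @ D @ Z2

def td_feedback_operator(Z, Z_next, D, gamma):
    """S = gamma * K(X', X) - K(X, X)."""
    return gamma * gram(Z_next, D, Z) - gram(Z, D, Z)

def is_hurwitz(S):
    """True when every eigenvalue of S has a strictly negative real part."""
    return bool(np.all(np.linalg.eigvals(S).real < 0.0))

gamma = 0.99
Z = np.eye(2)   # toy features for the dataset inputs X
D = np.eye(2)   # identity preconditioner

# Next-action features smaller than the current ones: S = -0.505 * I.
S_stable = td_feedback_operator(Z, 0.5 * np.eye(2), D, gamma)

# Next-action features larger than the current ones: S = +0.98 * I.
S_unstable = td_feedback_operator(Z, 2.0 * np.eye(2), D, gamma)

print(is_hurwitz(S_stable), is_hurwitz(S_unstable))
```

In this toy setup the bootstrapped term $\gamma K(X', X)$ outweighs the self-similarity term $K(X, X)$ only in the second case, which is what pushes the spectrum of $S$ into the right half-plane.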

The Hurwitz Condition for Stability

The discrete-time linear recurrence converges exponentially to $0$ if and only if the spectral radius of the companion matrix satisfies

$$\rho(A(\eta)) < 1.$$

This holds for sufficiently small $\eta$ if and only if $S$ is Hurwitz, i.e., all of its eigenvalues have negative real parts:

$$\text{Re}(\lambda) < 0 \quad \forall \lambda \in \text{spec}(S).$$
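
As a simplified illustration (an assumption, not the full companion-matrix analysis), take the momentum-free case $\bar{e}_t = e_t$ and drop the $o(\eta)$ term, so that $A(\eta) = I + \eta S$. Each eigenvalue $\lambda \neq 0$ of $S$ then maps to an eigenvalue $1 + \eta\lambda$ of $A(\eta)$, and for $\eta > 0$

$$|1 + \eta\lambda|^2 = 1 + 2\eta\,\text{Re}(\lambda) + \eta^2 |\lambda|^2 < 1 \iff \eta < \frac{-2\,\text{Re}(\lambda)}{|\lambda|^2}.$$

A positive stepsize satisfying this bound exists exactly when $\text{Re}(\lambda) < 0$. In this simplified case, that is why $\rho(A(\eta)) < 1$ for small enough $\eta$ is equivalent to $S$ being Hurwitz.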

Sufficient Condition: $\gamma \|\Phi\|_2 \|\Phi_*\|_2 + \|\Phi^\top \Phi - I\|_2 < 1$. Input normalization limits the scale term, while AdamO dynamically reduces the geometric distortion $\|\Phi^\top \Phi - I\|_2$.
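
The left-hand side of the sufficient condition can be evaluated numerically. The sketch below assumes that `Phi` holds the features $\Phi$ on the dataset inputs and `Phi_next` holds $\Phi_*$ on the bootstrapped next-action inputs; the page does not define these matrices, so this reading, the array layout, and the function name are assumptions.

```python
import numpy as np

def sufficient_stability_margin(Phi, Phi_next, gamma):
    """Return gamma * ||Phi||_2 * ||Phi_*||_2 + ||Phi^T Phi - I||_2.

    The sufficient condition holds when the returned value is below 1.
    """
    # Scale term: product of spectral norms, weighted by the discount factor.
    scale = gamma * np.linalg.norm(Phi, 2) * np.linalg.norm(Phi_next, 2)
    # Geometric distortion: how far Phi is from having orthonormal columns.
    distortion = np.linalg.norm(Phi.T @ Phi - np.eye(Phi.shape[1]), 2)
    return scale + distortion

gamma = 0.99
Phi_next = 0.5 * np.eye(3)

# Orthonormal features: zero distortion, only the scale term remains.
margin_orthonormal = sufficient_stability_margin(np.eye(3), Phi_next, gamma)

# Rescaled features: Phi^T Phi = 2.25 * I, so the distortion term is 1.25.
margin_distorted = sufficient_stability_margin(1.5 * np.eye(3), Phi_next, gamma)

print(margin_orthonormal < 1.0, margin_distorted < 1.0)
```

The two cases isolate the two terms: normalization controls the product of norms, while orthogonality of the features controls $\|\Phi^\top \Phi - I\|_2$.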

05 — Simulator

Interactive TD Feedback Simulator

Interactively simulate standard Adam (expansive regime) vs. AdamO (contractive regime).

Parameters (current regime: Expansive, Unstable)

Stepsize ($\eta$) 0.020

Discount Factor ($\gamma$) 0.99

Orthogonality Correction ($\kappa$) [AdamO] 0.00

Geometric Distortion ($\epsilon_0$) 0.80

Spectral Radius $\rho(A(\eta))$: 1.03

Hurwitz Status: Violated (Collapse)

📈 Graph Interpretation:

Critic Loss (y-axis): Simulated bootstrapped TD error. It grows exponentially when the update operator is expansive ($\rho > 1$) and decays smoothly when the stability condition holds ($\rho < 1$).

💡 Tip:

Drag the Orthogonality Correction ($\kappa$) slider up to activate AdamO and watch the dynamics stabilize in real time.
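
The qualitative behavior of the plot can be reproduced offline with a toy linear recurrence. This sketch is not the page's simulator: it assumes the momentum-free update $e_{t+1} = (I + \eta S)e_t$, and the two $2 \times 2$ matrices are hypothetical. The eigenvalue $1.5$ is chosen only so that $\rho = 1.03$ at $\eta = 0.020$, matching the readout above; how $\kappa$ and $\epsilon_0$ map to the spectrum is not specified on the page and is not modeled here.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue magnitude of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

def simulate(S, eta, e0, steps):
    """Iterate e_{t+1} = (I + eta * S) e_t and record the error norm per step."""
    A = np.eye(S.shape[0]) + eta * S
    e = np.asarray(e0, dtype=float)
    norms = [np.linalg.norm(e)]
    for _ in range(steps):
        e = A @ e
        norms.append(np.linalg.norm(e))
    return spectral_radius(A), np.array(norms)

eta = 0.02
e0 = [1.0, 1.0]

# One eigenvalue with positive real part: S is not Hurwitz.
S_expansive = np.array([[1.5, 0.0], [0.0, -1.0]])
# All eigenvalues negative: S is Hurwitz.
S_contractive = np.array([[-1.5, 0.0], [0.0, -1.0]])

rho_exp, norms_exp = simulate(S_expansive, eta, e0, steps=500)
rho_con, norms_con = simulate(S_contractive, eta, e0, steps=500)

print(f"expansive:   rho={rho_exp:.2f}, final error norm={norms_exp[-1]:.3e}")
print(f"contractive: rho={rho_con:.2f}, final error norm={norms_con[-1]:.3e}")
```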

06 — Playground

Geometry Projection Playground

Adding a regularizer directly to the loss can degrade task performance under a standard optimizer, because the regularizer's gradient pollutes the running moment estimates.

AdamO's budget mechanism ($\tau$) prevents the orthogonality gradient $r_t$ from cancelling the first-order task descent along $g_t$. The corrective drift is computed first and is projected only if its conflict with the task gradient exceeds the budget.

Conflict Budget ($\tau$) 0.10

$\tau = 0.0$ is a strict conflict-free mode, guaranteeing no first-order task degradation. $\tau > 0$ allows a controlled amount of conflict in exchange for geometric alignment.
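
One way to realize such a budget is a half-space projection. The sketch below is an assumed form, not the exact AdamO rule, which this page does not spell out: the correction $r_t$ is left unchanged when $\langle r_t, g_t \rangle \ge -\tau \|g_t\|^2$, and is otherwise projected onto the boundary of that half-space. The function name and the specific budget expression are hypothetical.

```python
import numpy as np

def project_correction(g, r, tau):
    """Project r so that <delta, g> >= -tau * ||g||^2 (assumed budget rule)."""
    g = np.asarray(g, dtype=float)
    r = np.asarray(r, dtype=float)
    g_sq = float(g @ g)
    if g_sq == 0.0:
        return r.copy()          # no task gradient, nothing to conflict with
    budget = -tau * g_sq         # most negative inner product allowed
    inner = float(r @ g)
    if inner >= budget:
        return r.copy()          # within budget: keep the raw correction
    # Remove just enough of the component along g to land on the budget.
    return r - ((inner - budget) / g_sq) * g

g = np.array([1.0, 0.0])         # task gradient g_t
r = np.array([-1.0, 1.0])        # raw correction r_t, conflicts with g_t

delta_budget = project_correction(g, r, tau=0.10)
delta_strict = project_correction(g, r, tau=0.0)
delta_free = project_correction(g, np.array([0.5, 1.0]), tau=0.10)

print(delta_budget, delta_strict, delta_free)
```

Under this rule the combined direction satisfies $\langle g_t + \delta_t, g_t \rangle \ge (1 - \tau)\|g_t\|^2$, so for $\tau < 1$ the correction can never fully cancel the first-order task descent.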

Gradient Projection Space (plot legend: task gradient $g_t$, raw correction $r_t$, projected correction $\delta_t$)

07 — Results

Empirical Performance (D4RL)

Domain	Task Name	Baseline Adam	AdamO (Ours)	Delta (relative % or absolute score)
AntMaze	AntMaze-umaze-diverse	47.0	82.2	📈 +74.9%
AntMaze	AntMaze-medium-play	0.3	28.5	🚀 +28.2 (Score)
AntMaze	AntMaze-large-diverse	0.0	16.5	🚀 +16.5 (Score)
Locomotion	Hopper-medium-expert	32.6	84.2	📈 +158.2%
Locomotion	Walker2d-medium-expert	22.4	98.5	📈 +339.7%
Adroit	Pen-human	-4.1	83.1	🚀 Fully Recovered
Adroit	Pen-cloned	5.6	82.4	📈 +1371%

Domain	Task Name	Baseline IQL (Adam)	IQL + AdamO (Ours)	Relative Delta
AntMaze	AntMaze-umaze-diverse	55.8	68.2	📈 +22.2%
AntMaze	AntMaze-medium-diverse	66.9	75.5	📈 +12.8%
AntMaze	AntMaze-large-play	38.5	41.8	📈 +8.5%
Adroit	Pen-human	75.1	99.8	📈 +32.8%
Adroit	Pen-cloned	46.5	82.2	📈 +76.7%
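
The percentage entries in the delta columns are consistent with the usual relative-improvement formula (an inference from the table values, not something the page states). For AntMaze-umaze-diverse in the first table:

$$\frac{82.2 - 47.0}{47.0} = \frac{35.2}{47.0} \approx 0.749 \;\Rightarrow\; +74.9\%.$$

Rows whose baseline is near zero report the absolute score difference instead (for example $28.5 - 0.3 = +28.2$ for AntMaze-medium-play), presumably because a relative change against a near-zero baseline is not meaningful.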

08 — Cite

BibTeX Citation

@inproceedings{qiao2026adamo,
  title={AdamO: A Collapse-Suppressed Optimizer for Offline RL},
  author={Qiao, Nan and Yue, Sheng and Wang, Shuning and Ren, Ju},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026},
  url={https://arxiv.org/abs/2605.01968}
}