ICML 2026 · Accepted Regular Paper

AdamO: A Collapse-Suppressed Optimizer for Offline RL

Suppressing value collapse in temporal-difference updates through a decoupled orthogonality correction regulated by a strict task-alignment budget.

Nan Qiao, Sheng Yue, Shuning Wang, Ju Ren
Peak Gain (Pen-cloned): +1371%
Hurwitz Stability: ρ < 1
Backbones: TD3+BC / IQL
Benchmark Suite: D4RL

Presentation Video

ICML talk recording hosted on SlidesLive.


Project Materials

Download the ICML presentation slides and poster.

Abstract

Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporal-difference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is that collapse is not only a property of the backup rule or network architecture: optimizer dynamics themselves can directly trigger or suppress instability.

From a control-theoretic viewpoint, we model offline TD learning as a feedback system and analyze Adam-based critic updates. This yields a necessary and sufficient condition for stability of the induced local update dynamics: within the regime we analyze, these dynamics are stable if and only if the spectral radius of the corresponding update operator is strictly below one.

  • Collapse is Dynamic: Value collapse is directly triggered by the interplay of optimization dynamics and bootstrapping.
  • Moment Contamination: Standard loss-based penalties inject noise into Adam's exponential moving averages, distorting the optimization trajectory.
  • Decoupled Safety: Decoupling the orthogonality step ensures worst-case task optimization safety while preserving Adam's dissipative features.
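
The difference between a loss-coupled penalty and a decoupled correction can be illustrated with a small NumPy sketch. This is not the paper's update rule: the penalty $\tfrac{1}{4}\|W^\top W - I\|_F^2$, the helper names, and the hyperparameters are all assumptions chosen for illustration. The sketch only shows that a penalty added to the loss changes Adam's first moment, whereas a correction applied outside the moment update leaves it untouched.

```python
import numpy as np

def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam moment update; returns the bias-corrected direction and new moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def orth_grad(W):
    """Gradient of the assumed penalty 0.25 * ||W^T W - I||_F^2 with respect to W."""
    return W @ (W.T @ W - np.eye(W.shape[1]))

def coupled_step(W, g_task, m, v, t, lr, kappa):
    # Penalty gradient is mixed into the task gradient, so it enters m and v.
    g_total = g_task + kappa * orth_grad(W)
    d, m, v = adam_direction(g_total, m, v, t)
    return W - lr * d, m, v

def decoupled_step(W, g_task, m, v, t, lr, kappa):
    # Moments see only the task gradient; the correction is applied separately.
    d, m, v = adam_direction(g_task, m, v, t)
    return W - lr * d - lr * kappa * orth_grad(W), m, v

# Hypothetical toy weights whose columns are orthogonal but not unit-norm.
W = 2.0 * np.eye(3)[:, :2]
g_task = np.full_like(W, 0.1)
zeros = np.zeros_like(W)

_, m_coupled, _ = coupled_step(W, g_task, zeros, zeros, t=1, lr=1e-3, kappa=5.0)
_, m_decoupled, _ = decoupled_step(W, g_task, zeros, zeros, t=1, lr=1e-3, kappa=5.0)
_, m_task_only, _ = decoupled_step(W, g_task, zeros, zeros, t=1, lr=1e-3, kappa=0.0)

print("coupled moment equals task-only moment:  ", np.allclose(m_coupled, m_task_only))
print("decoupled moment equals task-only moment:", np.allclose(m_decoupled, m_task_only))
```

In this toy setting the decoupled first moment is identical to the one obtained with no correction at all, which is the sense in which the moments stay uncontaminated.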

Theoretical Framework

We model offline TD learning as a feedback system. In standard supervised learning, targets are fixed. In TD, updates propagate through a bootstrapped next-action target set $X'$:

// Linearized TD-error dynamics
$$e_{t+1} = e_t + \eta S \bar{e}_t + o(\eta)$$
// TD Feedback Operator
$$S \triangleq \gamma K(X', X) - K(X, X)$$

Here $K(X_1, X_2) = Z(X_1)^\top D Z(X_2)$ is the preconditioned Gram operator, which captures pairwise feature similarity between the two input sets.
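
As a rough illustration of the operator defined above, the following NumPy sketch builds $S$ from small feature matrices and inspects its eigenvalues. The shapes are an assumption (columns of `Z` index samples, and `D` is a diagonal preconditioner), and the toy matrices are hypothetical values, not quantities from the paper.

```python
import numpy as np

def gram(Z1, Z2, D):
    """Preconditioned Gram operator K(X1, X2) = Z(X1)^T D Z(X2)."""
    return Z1.T @ D @ Z2

def td_feedback_operator(Z_next, Z, D, gamma):
    """S = gamma * K(X', X) - K(X, X)."""
    return gamma * gram(Z_next, Z, D) - gram(Z, Z, D)

def is_hurwitz(S):
    """True when every eigenvalue of S has a strictly negative real part."""
    return bool(np.all(np.linalg.eigvals(S).real < 0))

gamma = 0.99
D = np.eye(2)          # assumed identity preconditioner
Z = np.eye(2)          # features at the dataset inputs X

# Next-action features that are smaller than the current ones: damping dominates.
S_mild = td_feedback_operator(0.5 * Z, Z, D, gamma)
# Next-action features that are larger: the bootstrapped term dominates.
S_large = td_feedback_operator(2.0 * Z, Z, D, gamma)

print("mild  next-features -> Hurwitz:", is_hurwitz(S_mild))
print("large next-features -> Hurwitz:", is_hurwitz(S_large))
```

With these toy values $S$ equals $-0.505\,I$ in the first case and $0.98\,I$ in the second, so only the first is Hurwitz.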

The Hurwitz Condition for Stability

The discrete-time linear recurrence converges exponentially to $0$ if and only if the spectral radius of the companion matrix satisfies
$$\rho(A(\eta)) < 1,$$
which holds for sufficiently small $\eta$ if and only if $S$ is Hurwitz, i.e., all of its eigenvalues have negative real parts:
$$\text{Re}(\lambda) < 0 \quad \forall \lambda \in \text{spec}(S)$$
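
The role of "sufficiently small $\eta$" can be checked numerically. The sketch below makes a simplifying assumption that is not taken from the paper: it replaces the companion matrix by the first-order map $A(\eta) = I + \eta S$ (that is, it treats $\bar{e}_t$ as $e_t$). The example matrices are hypothetical.

```python
import numpy as np

def spectral_radius(M):
    """Largest eigenvalue magnitude of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def first_order_update_matrix(S, eta):
    """Simplified stand-in for A(eta): e_{t+1} = (I + eta * S) e_t."""
    return np.eye(S.shape[0]) + eta * S

# Hurwitz but strongly oscillatory: eigenvalues are -0.1 +/- 5i.
S_hurwitz = np.array([[-0.1, 5.0],
                      [-5.0, -0.1]])
# Not Hurwitz: one eigenvalue (0.98) has a positive real part.
S_unstable = np.array([[0.98, 0.0],
                       [0.0, -1.0]])

for eta in (0.02, 0.002):
    rho_h = spectral_radius(first_order_update_matrix(S_hurwitz, eta))
    rho_u = spectral_radius(first_order_update_matrix(S_unstable, eta))
    print(f"eta={eta}: Hurwitz S stable={rho_h < 1}, non-Hurwitz S stable={rho_u < 1}")
```

In this simplified model the Hurwitz matrix is stable only at the smaller step size, while the non-Hurwitz matrix is unstable at both, which matches the "sufficiently small $\eta$" qualifier in the statement above.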

Sufficient Condition: $\gamma \|\Phi\|_2 \|\Phi_*\|_2 + \|\Phi^\top \Phi - I\|_2 < 1$. Input normalization limits the scale term, while AdamO dynamically reduces the geometric distortion $\|\Phi^\top \Phi - I\|_2$.
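
The two terms of the sufficient condition can be evaluated directly. In the sketch below, `Phi` and `Phi_next` are assumed to be feature matrices for the current and bootstrapped inputs (the page does not define $\Phi$ and $\Phi_*$ explicitly), and the toy values are hypothetical.

```python
import numpy as np

def sufficient_condition_terms(Phi, Phi_next, gamma):
    """Return (scale term, distortion term) of the sufficient condition."""
    scale = gamma * np.linalg.norm(Phi, 2) * np.linalg.norm(Phi_next, 2)
    distortion = np.linalg.norm(Phi.T @ Phi - np.eye(Phi.shape[1]), 2)
    return float(scale), float(distortion)

def satisfies_sufficient_condition(Phi, Phi_next, gamma):
    scale, distortion = sufficient_condition_terms(Phi, Phi_next, gamma)
    return scale + distortion < 1.0

gamma = 0.99
Phi_orth = np.eye(4)[:, :3]                  # orthonormal columns, zero distortion
Phi_next = 0.5 * Phi_orth                    # normalized, smaller next-step features
Phi_skew = Phi_orth * np.array([1.0, 1.0, np.sqrt(1.8)])  # one column stretched

print("orthonormal:", sufficient_condition_terms(Phi_orth, Phi_next, gamma),
      satisfies_sufficient_condition(Phi_orth, Phi_next, gamma))
print("distorted:  ", sufficient_condition_terms(Phi_skew, Phi_next, gamma),
      satisfies_sufficient_condition(Phi_skew, Phi_next, gamma))
```

The stretched column raises both the scale term and the distortion term, so the second configuration no longer satisfies the bound.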

Interactive TD Feedback Simulator

Interactively simulate standard Adam (expansive regime) vs. AdamO (contractive regime).

Stepsize ($\eta$): 0.020
Discount Factor ($\gamma$): 0.99
Orthogonality Correction ($\kappa$) [AdamO]: 0.00
Geometric Distortion ($\epsilon_0$): 0.80
Spectral Radius $\rho(A(\eta))$: 1.03
Hurwitz Status: Violated (Collapse)

📈 Graph Interpretation:

Critic Loss (y-axis): Simulated bootstrapped TD error. It grows exponentially when the update operator is expansive ($\rho \ge 1$) and decays smoothly when the stability condition holds ($\rho < 1$).

💡 Tip:

Try dragging the Orthogonality ($\kappa$) slider up to represent activating AdamO, and watch the real-time stabilization in effect.

Geometry Projection Playground

When a regularizer is added directly to the loss, its gradient enters Adam's running moment estimates together with the task gradient. The contaminated moments distort the update direction and can degrade task performance.

AdamO's budget mechanism ($\tau$) prevents the orthogonality gradient $r_t$ from cancelling the first-order task descent $g_t$. The raw correction $r_t$ is computed first and is projected to $\delta_t$ only when its conflict with $g_t$ exceeds the task budget.

Conflict Budget ($\tau$): 0.10
$\tau = 0.0$ represents a strict Conflict-Free mode, guaranteeing no task degradation. $\tau > 0$ allows controlled geometric alignment.
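
One way to picture the budget is as a floor on the inner product between the correction and the task gradient. The sketch below is an assumption, not the paper's projection rule: it measures conflict as $\langle r_t, g_t \rangle$ and requires $\langle \delta_t, g_t \rangle \ge -\tau \|g_t\|^2$, removing only the excess component along $g_t$ (similar in spirit to gradient-surgery projections). The function name and toy vectors are hypothetical.

```python
import numpy as np

def project_to_budget(g, r, tau):
    """Return delta: r with its conflict against g clipped to the budget tau.

    Assumed constraint: <delta, g> >= -tau * ||g||^2.
    """
    g = np.asarray(g, dtype=float)
    r = np.asarray(r, dtype=float)
    g_sq = float(g @ g)
    if g_sq == 0.0:
        return r.copy()            # no task direction to protect
    floor = -tau * g_sq
    overlap = float(r @ g)
    if overlap >= floor:
        return r.copy()            # within budget: leave the correction untouched
    # Remove just enough of the component along g to land on the floor.
    return r - ((overlap - floor) / g_sq) * g

g_t = np.array([1.0, 0.0])
r_t = np.array([-1.0, 1.0])        # opposes the task gradient

delta_strict = project_to_budget(g_t, r_t, tau=0.0)
delta_budget = project_to_budget(g_t, r_t, tau=0.1)
delta_aligned = project_to_budget(g_t, np.array([1.0, 1.0]), tau=0.0)

print("tau=0.0 :", delta_strict, "overlap with g:", delta_strict @ g_t)
print("tau=0.1 :", delta_budget, "overlap with g:", delta_budget @ g_t)
print("aligned :", delta_aligned)
```

Under this assumed rule, $\tau = 0$ removes the entire opposing component, a positive $\tau$ leaves a bounded amount of it, and a correction that already agrees with $g_t$ passes through unchanged.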
Plot legend: task gradient $g_t$; raw correction $r_t$; projected correction $\delta_t$.

Empirical Performance (D4RL)

| Domain | Task Name | Baseline Adam | AdamO (Ours) | Relative Delta |
|---|---|---|---|---|
| AntMaze | AntMaze-umaze-diverse | 47.0 | 82.2 | 📈 +74.9% |
| AntMaze | AntMaze-medium-play | 0.3 | 28.5 | 🚀 +28.2 (Score) |
| AntMaze | AntMaze-large-diverse | 0.0 | 16.5 | 🚀 +16.5 (Score) |
| Locomotion | Hopper-medium-expert | 32.6 | 84.2 | 📈 +158.2% |
| Locomotion | Walker2d-medium-expert | 22.4 | 98.5 | 📈 +339.7% |
| Adroit | Pen-human | -4.1 | 83.1 | 🚀 Fully Recovered |
| Adroit | Pen-cloned | 5.6 | 82.4 | 📈 +1371% |
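
The percentage entries in the last column appear to be relative gains over the baseline, computed as (AdamO − baseline) / baseline. The short sketch below reproduces three of them from the rounded scores listed above. The `min_baseline` threshold is a hypothetical stand-in for whatever rule the authors used to switch to absolute score differences when the baseline is near zero or negative; the page does not state that rule.

```python
def relative_gain_percent(baseline, ours, min_baseline=1.0):
    """Relative gain in percent, or None when the baseline is too small to divide by.

    min_baseline is an assumed cutoff, not a value taken from the paper.
    """
    if baseline < min_baseline:
        return None
    return 100.0 * (ours - baseline) / baseline

rows = {
    "AntMaze-umaze-diverse": (47.0, 82.2),
    "Walker2d-medium-expert": (22.4, 98.5),
    "Pen-cloned": (5.6, 82.4),
    "AntMaze-medium-play": (0.3, 28.5),
    "Pen-human": (-4.1, 83.1),
}

for task, (baseline, ours) in rows.items():
    gain = relative_gain_percent(baseline, ours)
    if gain is None:
        print(f"{task}: score delta {ours - baseline:+.1f}")
    else:
        print(f"{task}: {gain:+.1f}%")
```

Relative gains become uninformative as the baseline approaches zero, which is presumably why the near-zero and negative baselines are reported as score differences or as recovered runs instead.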

BibTeX Citation

@inproceedings{qiao2026adamo,
  title={AdamO: A Collapse-Suppressed Optimizer for Offline RL},
  author={Qiao, Nan and Yue, Sheng and Wang, Shuning and Ren, Ju},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026},
  url={https://arxiv.org/abs/2605.01968}
}