Why Does Cosine Annealing With Warmup Stabilize Training?#
Motivation#
When training deep neural networks, the learning rate is one of the most important hyperparameters to tune. Optimization algorithms like Adam and SGD tell us how the weights \(\boldsymbol{\theta} \in \boldsymbol{\Theta}\) should be updated, while the learning rate \(\eta\) sets the size of each update.
Theoretically and empirically, the magnitude of the learning rate \(\eta\) can have a significant impact on the training process. If the learning rate is too large, training may diverge; if it is too small, the model may take longer to converge or get stuck in a poor local minimum. The condition number of the problem also affects optimization efficiency, as discussed in the momentum section. It can be understood as the ratio between the smallest and largest changes possible in response to adjustments in different directions of the parameter space, reflecting how much sensitivity varies across these directions[1] [Zhang et al., 2023]. As training progresses, it is equally important to apply a learning rate scheduler that adjusts the learning rate over time (not necessarily as a monotonic decay).
In the paper SGDR: Stochastic Gradient Descent with Restarts, Loshchilov and Hutter introduced a heuristic based on the empirical observation that convergence can improve (usually in ill-conditioned situations) if the learning rate follows an annealing process. This means that at the beginning of training, we do not want to decrease the learning rate too drastically. My (potentially wrong) intuition is that this lets the model explore a larger region of the parameter space with fewer constraints than if we were to rapidly decrease the learning rate. The authors further claim that towards the end of training, we would want to “fine-tune” the model parameters with a very small learning rate, as it could potentially help “refine” the solution space to find a “more optimal” set of parameters [Loshchilov and Hutter, 2016]. This idea leads naturally to the cosine function: the cosine curve starts with a gentle slope, which matches a gradual decrease in learning rate at the beginning, and it flattens as it approaches its minimum at the end of the half-cycle, which matches fine-tuning the model parameters with a very small learning rate.
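Consequently, a cosine decaying scheduler has the following functional form for learning rates in the range \(t \in [0, T]\):
\[ \eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2}\left(1 + \cos\left(\frac{\pi t}{T}\right)\right) \]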
Here \(\eta_0\) is the initial learning rate, \(\eta_T\) is the target rate at time \(T\). Furthermore, for \(t>T\) we simply pin the value to \(\eta_T\) without increasing it again. \(T\) represents the end of the learning rate annealing phase rather than the absolute end of training. It’s the point in time when the learning rate reaches \(\eta_T\), the target rate, and beyond which the learning rate is maintained constant at \(\eta_T\).
During \(0 \leq t < T\): The learning rate \(\eta_t\) is actively adjusted according to the cosine annealing formula. It transitions from the initial learning rate \(\eta_0\) towards the target rate \(\eta_T\), following a half-cosine wave.
For \(t \geq T\): The learning rate is set to \(\eta_T\) and no longer changes. This doesn’t necessarily mean that training must stop at \(t = T\). Training can continue beyond \(T\) with the learning rate fixed at \(\eta_T\).
In code, we can observe the behavior of the cosine annealing scheduler as follows:
from __future__ import annotations

from typing import Any, List

import matplotlib.pyplot as plt
import torch
from torch.optim import Optimizer
from torch.optim.lr_scheduler import CosineAnnealingLR, _LRScheduler


def get_learning_rates(optimizer: Optimizer, scheduler: _LRScheduler, steps: int) -> List[float]:
    """Record the learning rate of the first param group at every step."""
    lrs = []
    for _ in range(steps):
        lrs.append(optimizer.param_groups[0]["lr"])
        # No gradients are computed here; we only care about the schedule.
        optimizer.step()
        scheduler.step()
    return lrs


def plot_learning_rates(
    lrs: List[float], title: str, marker: str = "o", ax: plt.Axes | None = None, **kwargs: Any
) -> None:
    """Plot a list of learning rates against the step index."""
    ax = ax or plt.gca()

    ax.plot(lrs, label=title, marker=marker, **kwargs)
    ax.set_title(title)
    ax.set_xlabel("Step")
    ax.set_ylabel("Learning Rate")
    ax.legend()


def main() -> None:
    initial_lr = 0.1
    eta_min = 0
    steps = 100
    model = torch.nn.Linear(2, 1)

    # T_max equals the run length: a single half-cosine descent.
    optimizer = torch.optim.SGD(model.parameters(), lr=initial_lr)
    scheduler_non_cyclic = CosineAnnealingLR(optimizer, T_max=steps, eta_min=eta_min)
    lrs_non_cyclic = get_learning_rates(optimizer, scheduler_non_cyclic, steps)

    # T_max shorter than the run: the schedule keeps following the cosine, so the LR cycles.
    optimizer = torch.optim.SGD(model.parameters(), lr=initial_lr)
    scheduler_cyclic = CosineAnnealingLR(optimizer, T_max=steps // 8, eta_min=eta_min)
    lrs_cyclic = get_learning_rates(optimizer, scheduler_cyclic, steps)

    # Plotting
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    plot_learning_rates(lrs_non_cyclic, 'Non-Cyclic Cosine Annealing', ax=axes[0])
    plot_learning_rates(lrs_cyclic, 'Cyclic Cosine Annealing', ax=axes[1])

    plt.tight_layout()
    plt.show()


main()
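The piecewise behaviour described earlier (anneal on \(0 \leq t < T\), then hold at \(\eta_T\) for \(t \geq T\)) can also be sketched in plain Python. This is only a minimal illustration, assuming the usual half-cosine form \(\eta_T + \frac{1}{2}(\eta_0 - \eta_T)(1 + \cos(\pi t / T))\); the function name `cosine_decay_lr` and the sample values are hypothetical and do not come from any library.

```python
import math


def cosine_decay_lr(t: int, eta_0: float, eta_T: float, T: int) -> float:
    """Half-cosine decay from eta_0 to eta_T over [0, T), held at eta_T afterwards."""
    if t >= T:
        # Pinned: once annealing ends, the rate no longer changes.
        return eta_T
    return eta_T + 0.5 * (eta_0 - eta_T) * (1.0 + math.cos(math.pi * t / T))


# Hypothetical values for illustration: anneal for 100 steps, train for 120.
schedule = [cosine_decay_lr(t, eta_0=0.1, eta_T=0.0, T=100) for t in range(120)]
print(schedule[0], schedule[50], schedule[100], schedule[119])
```

Note that holding the rate at \(\eta_T\) is a property of this sketch. The `CosineAnnealingLR` run with `T_max=steps // 8` above keeps following the cosine instead, which is why that panel cycles.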
Warmup#
Our motivation could have ended here, but in practice the cosine annealing scheduler is often combined with a warmup phase. In Fig. 18, we can see that the loss curve is relatively smooth and converges much better than the ones without warmup.
It might be worth having some intuition on why warmup works so well in practice, and in particular, in language models like Transformers.
Firstly, the RAdam paper suggests that warmup works as a variance reduction technique. It counteracts a problem with the bias correction factors in optimizers like Adam, which lead to larger variance in the adaptive learning rate during the initial training iterations [Lippe, 2023]. More concretely, Adam estimates the first and second moments of the gradient to adjust the learning rate of each individual parameter (hence adaptive), and high variance between adaptive learning rates may destabilize training. If we don’t want to swap out Adam, this calls for a warmup phase to stabilize the learning rate and reduce the variance in the early stages of training.
Secondly, language models like Transformers apply Layer Normalization iteratively across layers, which can lead to very high gradients during the first iterations. This can be addressed by using Pre-Layer Normalization (similar to Pre-Activation ResNet), which applies normalization before the layer’s main operations, stabilizing gradients and reducing the need for a warmup phase, or by replacing Layer Normalization with other techniques (Adaptive Normalization, Power Normalization) [Lippe, 2023].
However, even though these alternatives exist, many setups still use the Adam optimizer, and warmup remains a simple and effective technique to stabilize the learning rate in the early stages of training, addressing the aforementioned problems (i.e. it gives the bias correction factors and the moving averages of gradients and squared gradients time to stabilize).
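To make the bias correction factors a little more concrete, here is a small sketch of how large they are in the first steps. It assumes the standard Adam corrections, which divide the moving averages by \(1 - \beta^t\), and the commonly used defaults \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\) (an assumption, since the text does not fix them). It only illustrates the size of the corrections, not the variance analysis of the RAdam paper.

```python
def bias_correction_scale(beta: float, step: int) -> float:
    """Factor 1 / (1 - beta**step) applied to an exponential moving average."""
    return 1.0 / (1.0 - beta**step)


# Assumed values: the commonly used Adam defaults.
beta1, beta2 = 0.9, 0.999

for step in (1, 10, 100, 1000, 10000):
    first = bias_correction_scale(beta1, step)
    second = bias_correction_scale(beta2, step)
    print(f"step={step:>5}  first-moment scale={first:8.3f}  second-moment scale={second:9.3f}")
```

At the first step the second-moment average is scaled up by roughly a factor of 1000, which reflects that the estimate rests on a single squared gradient. Keeping the learning rate small during those steps is one simple way to limit the damage of such noisy estimates.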
To this end, we end our discussion on the motivation behind 1) using cosine annealing schedulers and 2) using warmup phases, often coupled with cosine annealing schedulers. In what follows, we will provide a more formal definition of the cosine annealing scheduler with warmup, and provide a running example to illustrate the behavior of the scheduler.
Definition#
The CosineAnnealingWithWarmupScheduler decays the learning rate \(\eta\) according to the decreasing part of a cosine curve, with an initial warmup \(t_{\text{warmup}}\).
This scheduler modulates \(\eta\) within defined upper and lower bounds over a predetermined interval, employing a cosine function. The cosine annealing formula follows a half-cosine wave, which decreases from a maximum value to a minimum. A full cosine wave would then increase back to the maximum, and this cycle could repeat multiple times over the training process, depending on how the scheduler is configured. Although this suggests cyclic adjustments (oscillations) within the training duration, for simplicity’s sake our specific implementation, inspired by MosaicML’s Composer’s CosineAnnealingWithWarmupScheduler, explicitly excludes such cycles/oscillations.
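(Cosine Annealing With Warmup)
The CosineAnnealingWithWarmupScheduler modulates the learning rate \(\eta\) according to a two-phase process: a warmup phase followed by a cosine annealing phase. The learning rate multiplier[2] \(\alpha_{t}\) at any given time (step) \(t\) is given by:
\[ \alpha_{t} = \begin{cases} \dfrac{t}{t_{\text{warmup}}}, & \text{if } t < t_{\text{warmup}} \\ \alpha_f + \dfrac{1}{2}(1 - \alpha_f)\left(1 + \cos\left(\pi \times \tau_w\right)\right), & \text{otherwise} \end{cases} \]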
where we denote:
\(t\) is the current training step or epoch.
\(\eta_{\max}\) is the maximum learning rate reached during training, which is often the initial learning rate passed to the optimizer.
\(t_{\text{warmup}}\) is the duration of the warmup period, in steps or epochs, during which the learning rate linearly increases to the maximum learning rate \(\eta_{\max}\).
\(t_{\max}\) is the maximum number of training steps, or maximum number of iterations in an epoch (see here).
\(\tau_w = \frac{t - t_{\text{warmup}}}{t_{\max}}\) is the fraction of post-warmup time elapsed.
\(\alpha_f\) is a fixed scaling factor that determines the final learning rate multiplier to decay to (a value between \(0\) and \(1\)). For example, if \(\alpha_f = 0.1\) and the initial learning rate is \(\eta_{\max} = 3e-4\), then the final learning rate will be \(\eta_{\min} = 3e-4 \times 0.1 = 3e-5\).
The actual learning rate \(\eta_{t}\) at time (step) \(t\) is then computed as:
\[ \eta_{t} = \alpha_{t} \times \eta_{\max} \]
where we emphasize again that \(\eta_{\max}\) is the maximum learning rate reached during training.
A Word on Oscillations
Note that if you set \(t_{\max}\) to the total number of training steps that is needed for the entire dataset \(\mathcal{S}\), the scheduler will only decay the learning rate after the warmup phase and not oscillate further. This configuration means that after completing the linear increase during the warmup, the learning rate will decrease following a cosine curve until it reaches the final learning rate determined by \(\alpha_f\).
Single Cycle (No Oscillation): If \(t_{\max}\) is set to cover exactly one half-cycle of the cosine function from the end of the warmup phase to the conclusion of training, the learning rate will monotonically decrease from its maximum value (at the end of warmup) to its minimum value (as determined by \(\alpha_f\)) without oscillating. This is because the scheduler’s active period only spans a single descent phase of the cosine wave.
Multiple Cycles (Oscillation): If \(t_{\max}\) is set to allow for a longer duration than what is needed for a single half-cycle descent, the cosine annealing function can complete its initial descent and then begin to ascend as part of a new cycle. This leads to oscillations in the learning rate—after decreasing, it will start to increase again, potentially multiple times, depending on the total number of cycles fitted within \(t_{\max}\). This is where the term “oscillation” comes into play; it describes the periodic increase and decrease in the learning rate according to the cosine function over multiple cycles.
True oscillation, where the learning rate decreases and then increases within a training regime, typically requires either a restart mechanism (as seen in Cosine Annealing with Warm Restarts) or an explicit multi-cycle configuration. A standard cosine annealing scheduler, especially with a warmup phase, generally only supports a monotonic decrease within a single cycle, unless it is specifically designed to handle restarts or multiple cycles.
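The difference between a single descent and an oscillating schedule can be sketched by evaluating the half-cosine multiplier \(\alpha_f + \frac{1}{2}(1 - \alpha_f)(1 + \cos(\pi \tau_w))\) with and without capping \(\tau_w\) at \(1\). The `clamp` flag and the function name below are hypothetical and exist only for this illustration.

```python
import math


def decay_multiplier(tau_w: float, alpha_f: float, clamp: bool = True) -> float:
    """Half-cosine multiplier; with clamp=True it stays at alpha_f once tau_w >= 1."""
    if clamp:
        tau_w = min(1.0, tau_w)
    return alpha_f + 0.5 * (1.0 - alpha_f) * (1.0 + math.cos(math.pi * tau_w))


taus = [0.0, 0.5, 1.0, 1.5, 2.0]
clamped = [decay_multiplier(tau, alpha_f=0.1) for tau in taus]
unclamped = [decay_multiplier(tau, alpha_f=0.1, clamp=False) for tau in taus]

print([round(a, 3) for a in clamped])    # stays at the floor after tau_w = 1
print([round(a, 3) for a in unclamped])  # climbs back up after tau_w = 1
```

With the cap, the multiplier settles at \(\alpha_f\) and stays there. Without it, the cosine continues past its minimum and the multiplier rises again, which is the oscillation discussed above.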
Implementation#
from __future__ import annotations

import math
from functools import partial

import matplotlib.pyplot as plt
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
from torch.optim.optimizer import Optimizer

# NOTE: get_learning_rates and plot_learning_rates are the helpers defined in the earlier code block.


def _get_cosine_schedule_with_warmup_lr_lambda(
    current_step: int, *, num_warmup_steps: int, num_training_steps: int, alpha_f: float
) -> float:
    """Return the learning rate multiplier alpha_t for the given step."""
    if current_step < num_warmup_steps:
        # Warmup: the multiplier grows linearly from 0 towards 1.
        alpha = current_step / max(1, num_warmup_steps)
    else:
        # Cosine decay: tau_w is capped at 1 so the multiplier never rises again.
        tau_w = (current_step - num_warmup_steps) / num_training_steps
        tau_w = min(1.0, tau_w)
        alpha = alpha_f + (1 - alpha_f) * (1 + math.cos(math.pi * tau_w)) / 2
    return alpha


def get_cosine_annealing_with_warmup(
    optimizer: Optimizer,
    num_warmup_steps: int,
    num_training_steps: int,
    alpha_f: float = 0.1,
    last_epoch: int = -1,
) -> LambdaLR:
    lr_lambda = partial(
        _get_cosine_schedule_with_warmup_lr_lambda,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
        alpha_f=alpha_f,
    )
    # The deprecated `verbose` argument is no longer forwarded to LambdaLR.
    return LambdaLR(optimizer, lr_lambda, last_epoch)


# Experiment 1
num_warmup_steps = 5
num_training_steps = 10
alpha_f = 0.5
initial_lr = 3e-4

dummy_model = nn.Linear(1, 1)
optimizer = Adam(dummy_model.parameters(), lr=initial_lr)
scheduler = get_cosine_annealing_with_warmup(optimizer, num_warmup_steps, num_training_steps, alpha_f)
assert isinstance(scheduler, LambdaLR)
lrs1 = get_learning_rates(optimizer, scheduler, steps=num_training_steps)

# Experiment 2
num_warmup_steps = 200
num_training_steps = 1000

dummy_model = nn.Linear(1, 1)
optimizer = Adam(dummy_model.parameters(), lr=initial_lr)
scheduler = get_cosine_annealing_with_warmup(optimizer, num_warmup_steps, num_training_steps, alpha_f)
lrs2 = get_learning_rates(optimizer, scheduler, steps=num_training_steps)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_learning_rates(lrs1, 'Cosine Annealing With Warmup (Short)', ax=axes[0])
plot_learning_rates(lrs2, 'Cosine Annealing With Warmup (Long)', ax=axes[1])
An Example Walkthrough#
For simplicity, we assume that there are a total of \(10\) training steps (or epochs, depending on how you define it). We will use the following hyperparameters:
\(\eta_{\max} = 3 \times 10^{-4}\)
\(t_{\text{warmup}} = 5\)
\(t_{\max} = 10\)
\(\alpha_f = 0.5\)
We can use the code to verify the learning rate at each step with our manual computation.
[0.0, 5.9999999999999995e-05, 0.00011999999999999999, 0.00017999999999999998, 0.00023999999999999998, 0.0003, 0.0002963292387221365, 0.00028567627457812104, 0.0002690838939219355, 0.00024817627457812105]
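The same ten values can be reproduced without PyTorch by evaluating the multiplier directly. The sketch below is a plain-Python mirror of the schedule, with the hypothetical helper name `lr_multiplier`, and it compares the result against the values listed above.

```python
import math


def lr_multiplier(t: int, t_warmup: int, t_max: int, alpha_f: float) -> float:
    """Linear warmup followed by half-cosine decay towards alpha_f."""
    if t < t_warmup:
        return t / max(1, t_warmup)
    tau_w = min(1.0, (t - t_warmup) / t_max)
    return alpha_f + 0.5 * (1.0 - alpha_f) * (1.0 + math.cos(math.pi * tau_w))


eta_max, t_warmup, t_max, alpha_f = 3e-4, 5, 10, 0.5
manual_lrs = [eta_max * lr_multiplier(t, t_warmup, t_max, alpha_f) for t in range(t_max)]

# Values copied from the scheduler output listed above.
reported_lrs = [
    0.0, 5.9999999999999995e-05, 0.00011999999999999999, 0.00017999999999999998,
    0.00023999999999999998, 0.0003, 0.0002963292387221365, 0.00028567627457812104,
    0.0002690838939219355, 0.00024817627457812105,
]

# Compare with a tolerance, since floating-point rounding may differ in the last digits.
assert all(
    math.isclose(manual, reported, rel_tol=1e-9, abs_tol=1e-15)
    for manual, reported in zip(manual_lrs, reported_lrs)
)
```

The comparison uses a tolerance rather than exact equality because the order of floating-point operations in this sketch need not match the scheduler's.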
1. Warmup Phase#
During the warmup phase, when the current training step \(t\) is less than the warmup time \(t_{\text{warmup}}\), the learning rate multiplier is linearly increased from \(0\) to \(1\).
Mathematically, the learning rate multiplier \(\alpha_{t}\) at time (step) \(t\) is
\[ \alpha_{t} = \frac{t}{t_{\text{warmup}}} \]
The learning rate at this phase is:
\[ \eta_{t} = \alpha_{t} \times \eta_{\max} \]
During the warmup phase, the learning rate will linearly increase from \(0\) to \(\eta_{\max}\) in the first \(t_{\text{warmup}}\) steps. Since \(\eta_{\max} = 3 \times 10^{-4}\) and \(t_{\text{warmup}} = 5\), the learning rate will be increased as follows:
\(t = 1\):
\[\begin{split} \begin{align*} \alpha_1 &= \frac{t}{t_{\text{warmup}}} = \frac{1}{5} = 0.2 \\ \eta_1 &= \alpha_1 \times \eta_{\max} = 0.2 \times 3 \times 10^{-4} = 6 \times 10^{-5} \end{align*} \end{split}\]
\(t = 2\):
\[\begin{split} \begin{align*} \alpha_2 &= \frac{t}{t_{\text{warmup}}} = \frac{2}{5} = 0.4 \\ \eta_2 &= \alpha_2 \times \eta_{\max} = 0.4 \times 3 \times 10^{-4} = 1.2 \times 10^{-4} \end{align*} \end{split}\]
\(t = 3\):
\[\begin{split} \begin{align*} \alpha_3 &= \frac{t}{t_{\text{warmup}}} = \frac{3}{5} = 0.6 \\ \eta_3 &= \alpha_3 \times \eta_{\max} = 0.6 \times 3 \times 10^{-4} = 1.8 \times 10^{-4} \end{align*} \end{split}\]
\(t = 4\):
\[\begin{split} \begin{align*} \alpha_4 &= \frac{t}{t_{\text{warmup}}} = \frac{4}{5} = 0.8 \\ \eta_4 &= \alpha_4 \times \eta_{\max} = 0.8 \times 3 \times 10^{-4} = 2.4 \times 10^{-4} \end{align*} \end{split}\]
\(t = 5\):
\[\begin{split} \begin{align*} \alpha_5 &= \frac{t}{t_{\text{warmup}}} = \frac{5}{5} = 1 \\ \eta_5 &= \alpha_5 \times \eta_{\max} = 1 \times 3 \times 10^{-4} = 3 \times 10^{-4} \end{align*} \end{split}\]
The linear relationship for the warmup phase can be represented as a function of the current training step \(t\):
\[ \alpha_t = f(t) = \frac{t}{t_{\text{warmup}}} \]
where \(t_{\text{warmup}}\) is the total number of steps in the warmup phase. This function describes how the learning rate multiplier \(\alpha_t\) grows linearly from \(0\) to \(1\) as \(t\) progresses from \(0\) to \(t_{\text{warmup}}\).
2. Cosine Decay Phase#
After the warmup phase, the learning rate multiplier follows a cosine decay pattern. This phase commences once the current training step \(t\) is greater than or equal to the warmup time \(t_{\text{warmup}}\), and it continues until the maximum training step \(t_{\text{max}}\).
2.1. Tau Fraction#
We first define a variable \(\tau_w\) to represent the fraction of post-warmup time elapsed. Mathematically, it is defined as:
\[ \tau_w = \frac{t - t_{\text{warmup}}}{t_{\text{max}}} \]
where:
\(t\): Current training step.
\(t_{\text{warmup}}\): Warmup time in training steps.
\(t_{\text{max}}\): Total duration of the scheduler in training steps.
2.2. Learning Rate Multiplier#
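The learning rate multiplier \(\alpha_t\) during the cosine decay phase is given by:
\[ \alpha_{t} = \alpha_f + \frac{1}{2}(1 - \alpha_f)\left(1 + \cos \left(\pi \times \tau_w\right)\right) \]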
where \(\alpha_f\) is the scaling factor that determines the final learning rate multiplier to decay to.
2.3. Learning Rate#
The actual learning rate \(\eta_t\) during this phase is then computed as:
\[ \eta_{t} = \alpha_{t} \times \eta_{\text{max}} \]
Example#
Using the running example with:
\(\eta_{\text{max}} = 3 \times 10^{-4}\)
\(t_{\text{warmup}} = 5\)
\(t_{\text{max}} = 10\)
\(\alpha_f = 0.5\)
The learning rate will be computed as follows:
\(t = 6\):
\[\begin{split} \begin{align*} \tau_w &= \frac{6 - 5}{10} = 0.1 \\ \alpha_6 &= 0.5 + \frac{1}{2}(1 - 0.5) \left(1 + \cos \left(0.1\pi\right)\right) \approx 0.987764 \\ \eta_6 &= 3 \times 10^{-4} \times 0.987764 \approx 0.000296329 \end{align*} \end{split}\]
\(t = 7\):
\[\begin{split} \begin{align*} \tau_w &= \frac{7 - 5}{10} = 0.2 \\ \alpha_7 &= 0.5 + \frac{1}{2}(1 - 0.5) \left(1 + \cos \left(0.2\pi\right)\right) \approx 0.952254 \\ \eta_7 &= 3 \times 10^{-4} \times 0.952254 \approx 0.000285676 \end{align*} \end{split}\]
… (and so on for the remaining steps)
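For completeness, the two remaining steps of the 10-step run follow the same pattern. The values below are rounded to six significant figures and were computed from the decay formula with \(\alpha_f = 0.5\), \(t_{\text{warmup}} = 5\) and \(t_{\text{max}} = 10\); they are not taken from the original text.

\(t = 8\):
\[\begin{split} \begin{align*} \tau_w &= \frac{8 - 5}{10} = 0.3 \\ \alpha_8 &= 0.5 + \frac{1}{2}(1 - 0.5) \left(1 + \cos \left(0.3\pi\right)\right) \approx 0.896946 \\ \eta_8 &= 3 \times 10^{-4} \times 0.896946 \approx 0.000269084 \end{align*} \end{split}\]
\(t = 9\):
\[\begin{split} \begin{align*} \tau_w &= \frac{9 - 5}{10} = 0.4 \\ \alpha_9 &= 0.5 + \frac{1}{2}(1 - 0.5) \left(1 + \cos \left(0.4\pi\right)\right) \approx 0.827254 \\ \eta_9 &= 3 \times 10^{-4} \times 0.827254 \approx 0.000248176 \end{align*} \end{split}\]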
Mathematical Intuition#
This version of implementation is slightly confusing because there is an \(\alpha_f\) term in the cosine decay formula.
Cosine Function: The cosine function oscillates between -1 and 1. By taking a scaled and shifted version of the cosine function, we can create a curve that starts at its highest point and gradually descends to its lowest point over the interval \([0, t_{\text{max}}]\).
Decay Formula: The formula for the learning rate multiplier during the decay phase is:
\[ \alpha_{t} = \alpha_f + \frac{1}{2}(1 - \alpha_f)\left(1 + \cos \left(\pi \times \tau_w\right)\right) \]
Here, \(\tau_w\) is the fraction of time elapsed since the warmup phase, and it ranges from 0 to 1. The \(\cos(\pi \times \tau_w)\) term creates a curve that starts at 1 (when \(\tau_w = 0\)) and ends at -1 (when \(\tau_w = 1\)). The scaling and shifting ensure that \(\alpha_t\) starts at 1 and decays to \(\alpha_f\).
More concretely, the expression
\[ \alpha_f + \frac{1}{2}(1 - \alpha_f)\left(1 + \cos \left(\pi \times \tau_w\right)\right) \]
describes the learning rate multiplier during the decay phase, where \(\tau_w\) is the fraction of time elapsed since the warmup phase, and it ranges from \(0\) to \(1\).
Let’s zoom into the cosine decay part in more details:
The term \(\cos(\pi \times \tau_w)\) decreases from \(1\) to \(-1\) as \(\tau_w\) varies from \(0\) to \(1\).
When you add \(1\) to this term, the expression \(1 + \cos(\pi \times \tau_w)\) ranges between \(0\) and \(2\).
Multiplying this by \(\frac{1}{2}\) scales it down, so the expression \(\frac{1}{2}\left(1 + \cos(\pi \times \tau_w)\right)\) ranges between 0 and 1.
The factor \((1 - \alpha_f)\) scales this range so that the amplitude is adjusted based on the desired final learning rate multiplier \(\alpha_f\). This means if \(\alpha_f = 0.5\), the expression \(\frac{1}{2}(1 - \alpha_f)\left(1 + \cos(\pi \times \tau_w)\right)\) ranges between \(0\) and \(0.5\).
Adding \(\alpha_f\) shifts the entire expression so that it starts at 1 when \(\tau_w = 0\) and decays to \(\alpha_f\) when \(\tau_w = 1\).
The addition of \(\alpha_f\) in the decay formula serves the purpose of setting the final value of the learning rate multiplier \(\alpha_t\) to \(\alpha_f\) at the end of training. Let’s break down the equation step by step to understand why \(\alpha_f\) is added back.
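Given the formula:
\[ \alpha_{t} = \alpha_f + \frac{1}{2}(1 - \alpha_f)\left(1 + \cos \left(\pi \times \tau_w\right)\right) \]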
First, consider the case where \(\tau_w = 0\) (i.e., the beginning of the decay phase):
The cosine term becomes \(\cos(0) = 1\).
The entire expression inside the parentheses becomes \(1 + 1 = 2\).
The scaling factor \(\frac{1}{2}(1 - \alpha_f)\) then multiplies this, giving \(\frac{1 - \alpha_f}{2} \times 2 = 1 - \alpha_f\).
So the expression becomes \(\alpha_f + \frac{1 - \alpha_f}{2} \times 2 = \alpha_f + (1 - \alpha_f) = 1\).
So at \(\tau_w = 0\), \(\alpha_t\) starts at 1.
Now consider the case where \(\tau_w = 1\) (i.e., the end of training):
The cosine term becomes \(\cos(\pi) = -1\).
The entire expression inside the parentheses becomes \(1 - 1 = 0\).
The scaling factor then multiplies this, giving \(\frac{1 - \alpha_f}{2} \times 0 = 0\).
So the expression becomes \(\alpha_f + 0 = \alpha_f\).
So at \(\tau_w = 1\), \(\alpha_t\) decays to \(\alpha_f\).
By adding \(\alpha_f\), together with the matching \((1 - \alpha_f)\) scaling, we ensure that the learning rate multiplier starts at \(1\) and smoothly decays to the desired final value \(\alpha_f\). The plain half-cosine \(\frac{1}{2}\left(1 + \cos(\pi \times \tau_w)\right)\), without the offset and scaling, would start at \(1\) but decay to \(0\) rather than to the intended final value. The offset and scaling reshape the decay curve so that it aligns with the desired starting and ending values.
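The endpoint argument can be checked numerically for a few choices of \(\alpha_f\). The sketch below is only an illustration; `cosine_multiplier` is a hypothetical helper name and the three \(\alpha_f\) values are arbitrary.

```python
import math


def cosine_multiplier(tau_w: float, alpha_f: float) -> float:
    """Decay-phase multiplier: alpha_f + (1 - alpha_f) * (1 + cos(pi * tau_w)) / 2."""
    return alpha_f + 0.5 * (1.0 - alpha_f) * (1.0 + math.cos(math.pi * tau_w))


# Evaluate the start (tau_w = 0) and end (tau_w = 1) of the decay phase.
endpoints = {
    alpha_f: (cosine_multiplier(0.0, alpha_f), cosine_multiplier(1.0, alpha_f))
    for alpha_f in (0.0, 0.1, 0.5)
}

for alpha_f, (start, end) in endpoints.items():
    print(f"alpha_f={alpha_f}: start={start:.3f}, end={end:.3f}")
```

In every case the multiplier starts at \(1\) and ends at \(\alpha_f\). Choosing \(\alpha_f = 0\) recovers the plain half-cosine that decays all the way to \(0\).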
PyTorch’s CosineAnnealingLR vs. Composer’s CosineAnnealingScheduler#
We know that Composer’s CosineAnnealingWithWarmupScheduler is basically its CosineAnnealingScheduler with warmup, and the latter is also basically a copy of PyTorch’s CosineAnnealingLR. However, there is a slight difference in their formulas. Let’s compare the two to see if they are equivalent.
In PyTorch’s CosineAnnealingLR, they implemented the cosine annealing scheduler without warmup, but the base formula should be similar. Let’s take a look to see how they coincide.
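The formula looks a bit different at first glance (without loss of generality, we can ignore warmup here). After digging a bit deeper, I tried to establish the equivalence by setting eta_min of PyTorch to alpha_f x eta_max. The closed form given in the PyTorch documentation is
\[ \eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{T_{\text{cur}}}{T_{\max}}\pi\right)\right) \]
where there is a new notation \(\eta_{\min}\), which is the minimum learning rate. Furthermore, it does not have the \(\alpha_f\) term, which is the scaling factor that determines the final learning rate multiplier to decay to.
To prove their equivalence, set \(\eta_{\min} = \alpha_f \times \eta_{\max}\) and identify \(\tau_w\) with \(\frac{T_{\text{cur}}}{T_{\max}}\). Then
\[\begin{split} \begin{align*} \eta_t &= \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\pi \times \tau_w\right)\right) \\ &= \alpha_f \eta_{\max} + \frac{1}{2}(\eta_{\max} - \alpha_f \eta_{\max})\left(1 + \cos\left(\pi \times \tau_w\right)\right) \\ &= \eta_{\max} \left[\alpha_f + \frac{1}{2}(1 - \alpha_f)\left(1 + \cos\left(\pi \times \tau_w\right)\right)\right] \\ &= \eta_{\max} \times \alpha_t \end{align*} \end{split}\]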
Setting \(\eta_{\min} = \alpha_f \times \eta_{\max}\) is also not an arbitrary choice. If we interpret \(\eta_{\min}\) as the minimum learning rate, then it makes sense to set it to \(\alpha_f \times \eta_{\max}\), since \(\alpha_f\) is the scaling factor that determines the final learning rate multiplier to decay to from \(\eta_{\max}\). More concretely, if the initial learning rate \(\eta_{\max} = 3e-4\) and \(\alpha_f = 0.1\), then the final learning rate will be \(\eta_{\min} = 3e-4 \times 0.1 = 3e-5\).
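The equivalence can also be checked numerically. The sketch below evaluates both closed forms in plain Python and does not call either library; the function names `pytorch_style_lr` and `composer_style_lr` are hypothetical, and the 100-step horizon is an arbitrary choice for illustration.

```python
import math


def pytorch_style_lr(t_cur: int, t_max: int, eta_max: float, eta_min: float) -> float:
    """Closed form with an explicit minimum learning rate eta_min."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_max))


def composer_style_lr(t_cur: int, t_max: int, eta_max: float, alpha_f: float) -> float:
    """Closed form with a final multiplier alpha_f applied to eta_max."""
    alpha_t = alpha_f + 0.5 * (1.0 - alpha_f) * (1.0 + math.cos(math.pi * t_cur / t_max))
    return eta_max * alpha_t


eta_max, alpha_f, t_max = 3e-4, 0.1, 100
eta_min = alpha_f * eta_max  # the substitution used in the argument above

# Largest absolute difference between the two forms over one annealing phase.
max_gap = max(
    abs(pytorch_style_lr(t, t_max, eta_max, eta_min) - composer_style_lr(t, t_max, eta_max, alpha_f))
    for t in range(t_max + 1)
)
print(max_gap)
```

Any remaining gap is floating-point rounding, which is consistent with the two formulas being algebraically identical under \(\eta_{\min} = \alpha_f \times \eta_{\max}\).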
References and Further Readings#
Citations#
[1] I. Loshchilov and F. Hutter, “SGDR: Stochastic Gradient Descent with Restarts”, CoRR, vol. abs/1608.03983, 2016.
[2] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, “Chapter 12.11. Learning Rate Scheduling” in Dive into Deep Learning, Cambridge University Press, 2023.
[3] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the Variance of the Adaptive Learning Rate and Beyond”, arXiv preprint arXiv:1908.03265, 2019.