AdamW Optimizer
Introduction
The AdamW optimizer adds decoupled weight decay to the Adam optimizer and supports EmbeddingVariable.
It is an implementation of the AdamW optimizer described in Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (https://arxiv.org/abs/1711.05101).
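Conceptually, AdamW applies the weight decay directly to the parameters instead of adding an L2 penalty to the loss. The snippet below is a minimal NumPy sketch of a single update step; the names (theta, m, v, wd) are illustrative only and not part of the tf.train.AdamWOptimizer API, and whether the decay term is additionally scaled by the learning rate depends on the implementation.

import numpy as np

def adamw_step(theta, grad, m, v, t,
               lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the weights directly rather than
    # folding an L2 penalty into the gradient.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - wd * theta
    return theta, m, v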
User interface
During training, use the tf.train.AdamWOptimizer interface in the same way as other native TensorFlow optimizers. It is defined as follows:
class AdamWOptimizer(DecoupledWeightDecayExtension, adam.AdamOptimizer):
  def __init__(self,
               weight_decay,
               learning_rate=0.001,
               beta1=0.9,
               beta2=0.999,
               epsilon=1e-8,
               use_locking=False,
               name="AdamW"):
# Construct the optimizer
optimizer = tf.train.AdamWOptimizer(
    weight_decay=weight_decay_new,
    learning_rate=learning_rate_new,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-8)
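The constructed optimizer is then used like any other TensorFlow optimizer. The call below is a minimal sketch assuming a scalar loss tensor named loss and a global step created with tf.train.get_or_create_global_step():

global_step = tf.train.get_or_create_global_step()
train_op = optimizer.minimize(loss, global_step=global_step)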
Example
import tensorflow as tf
# Build a toy graph: a variable, an embedding lookup and a scalar loss.
var = tf.get_variable("var_0", shape=[10, 16],
                      initializer=tf.ones_initializer(tf.float32))
emb = tf.nn.embedding_lookup(var, tf.cast([0, 1, 2, 5, 6, 7], tf.int64))
fun = tf.multiply(emb, 2.0, name='multiply')
loss = tf.reduce_sum(fun, name='reduce_sum')
gs = tf.train.get_or_create_global_step()

# Optimize the loss with AdamW.
opt = tf.train.AdamWOptimizer(weight_decay=0.01, learning_rate=0.1)
g_v = opt.compute_gradients(loss)
train_op = opt.apply_gradients(g_v)

init = tf.global_variables_initializer()
sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
with tf.Session(config=sess_config) as sess:
  sess.run([init])
  # Run three training steps and print the embedding values as they are updated.
  print(sess.run([emb, train_op, loss]))
  print(sess.run([emb, train_op, loss]))
  print(sess.run([emb, train_op, loss]))
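Since the introduction notes that AdamW supports EmbeddingVariable, the variant below sketches the same example with an EmbeddingVariable in place of the static variable. It assumes the tf.get_embedding_variable API (name, embedding_dim, initializer) from the EmbeddingVariable documentation; the exact signature may differ in your build.

import tensorflow as tf

# Assumed API: tf.get_embedding_variable is only available in builds that
# provide EmbeddingVariable; the parameters shown here are illustrative.
var = tf.get_embedding_variable("ev_0",
                                embedding_dim=16,
                                initializer=tf.ones_initializer(tf.float32))
emb = tf.nn.embedding_lookup(var, tf.cast([0, 1, 2, 5, 6, 7], tf.int64))
loss = tf.reduce_sum(tf.multiply(emb, 2.0))

opt = tf.train.AdamWOptimizer(weight_decay=0.01, learning_rate=0.1)
train_op = opt.apply_gradients(opt.compute_gradients(loss))

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print(sess.run([emb, train_op, loss]))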