# Incremental Checkpoint

## Introduction

In large-scale sparse training, the data is skewed, so most of the data remains unchanged between adjacent full checkpoints. Saving only incremental checkpoints for the sparse parameters therefore greatly reduces the overhead of frequent checkpointing. After a PS failover, the latest trained model parameters on the PS are restored, as far as possible, from a recent full checkpoint plus a series of incremental checkpoints, which reduces repeated computation.

## API

```python
def tf.train.MonitoredTrainingSession(..., save_incremental_checkpoint_secs=None, ...):
  pass
```

Additional parameter: `save_incremental_checkpoint_secs` (default: `None`). It sets the interval, in seconds, at which incremental checkpoints are saved.

## Example

High-level API (`tf.train.MonitoredTrainingSession`):

```python
import time

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

# Two partitioned variables; only `sparse_var` is read through the embedding lookup.
sparse_var = tf.get_variable("a", shape=[30, 4],
                             initializer=tf.ones_initializer(tf.float32),
                             partitioner=tf.fixed_size_partitioner(num_shards=4))
dense_var = tf.get_variable("b", shape=[30, 4],
                            initializer=tf.ones_initializer(tf.float32),
                            partitioner=tf.fixed_size_partitioner(num_shards=4))

ids = tf.placeholder(dtype=tf.int64, name='ids')
emb = tf.nn.embedding_lookup(sparse_var, ids)

fun = tf.multiply(emb, 2.0, name='multiply')
loss = tf.reduce_sum(fun, name='reduce_sum')

gs = tf.train.get_or_create_global_step()

opt = tf.train.AdagradOptimizer(0.1, initial_accumulator_value=1)
g_v = opt.compute_gradients(loss)
train_op = opt.apply_gradients(g_v, global_step=gs)

path = 'test/test4tolerance/ckpt/'
# Full checkpoint every 60 seconds, incremental checkpoint every 20 seconds.
with tf.train.MonitoredTrainingSession(checkpoint_dir=path,
                                       save_checkpoint_secs=60,
                                       save_incremental_checkpoint_secs=20) as sess:
  for i in range(1000):
    print(sess.run([gs, train_op, loss], feed_dict={"ids:0": i % 10}))
    time.sleep(1)
```

Estimator:

Configure the following when constructing the `EstimatorSpec`: `tf.train.Saver` and `tf.train.Scaffold` set
`incremental_save_restore=True`, and `tf.train.CheckpointSaverHook` sets the incremental checkpoint interval through `incremental_save_secs`.

```python
def model_fn(self, features, labels, mode, params):
  ...
  # Enable incremental save/restore on both the Saver and the Scaffold.
  scaffold = tf.train.Scaffold(
      saver=tf.train.Saver(
          sharded=True,
          incremental_save_restore=True),
      incremental_save_restore=True)
  ...
  return tf.estimator.EstimatorSpec(
      mode,
      loss=loss,
      train_op=train_op,
      training_hooks=[logging_hook],
      training_chief_hooks=[
          tf.train.CheckpointSaverHook(
              checkpoint_dir=params['model_dir'],
              save_secs=params['save_checkpoints_secs'],
              save_steps=params['save_checkpoints_steps'],
              scaffold=scaffold,
              # Save an incremental checkpoint every 120 seconds.
              incremental_save_secs=120)],
      scaffold=scaffold)
```

## Model Export

By default, incremental checkpoint subgraphs are not exported to SavedModel. To support second-level updates in Serving through "incremental model update", the incremental checkpoint subgraphs must be exported to SavedModel. This requires the [Estimator](https://github.com/DeepRec-AI/estimator) provided by DeepRec.

Example:

```python
estimator.export_saved_model(
    export_dir_base,
    serving_input_receiver_fn,
    ...
    save_incr_model=True)
```

Attention: If the graph was built without an incremental model, so that it contains only the full model, setting `save_incr_model=True` reports an error, and `save_incr_model` can only be left at `False` (the default). If the graph contains both full and incremental models, setting `save_incr_model=True` produces a SavedModel graph that can load either full or incremental models, while setting `save_incr_model=False` produces a SavedModel graph that can only load the full model.
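
The `save_incr_model` rules in the note above can be summarized as a small decision table. The sketch below is plain Python for illustration only: `loadable_models` and its `graph_has_incremental` argument are hypothetical names, not part of DeepRec or TensorFlow, and the `ValueError` merely stands in for whatever error the real export reports.

```python
def loadable_models(graph_has_incremental, save_incr_model=False):
    """Hypothetical helper: which model kinds the exported SavedModel can load.

    Illustrates the rules stated in the text; it does not call DeepRec.
    """
    if save_incr_model and not graph_has_incremental:
        # Only the full model exists in the graph, so the flag must stay False.
        raise ValueError(
            "save_incr_model=True requires an incremental model in the graph")
    if save_incr_model:
        # Full and incremental subgraphs are both exported.
        return {"full", "incremental"}
    # Default: only the full model can be loaded.
    return {"full"}


if __name__ == "__main__":
    for has_incr in (True, False):
        for flag in (True, False):
            try:
                result = sorted(loadable_models(has_incr, flag))
            except ValueError as err:
                result = "error: {}".format(err)
            print("graph_has_incremental={}, save_incr_model={} -> {}".format(
                has_incr, flag, result))
```

Running the script prints one line per combination, which makes it easy to see that the only invalid combination is `save_incr_model=True` on a graph that has no incremental model.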