Asynchronous Embedding Lookup

Background

In distributed training of sparse models, models grow increasingly complex and the number of input features grows with them. As a result, the worker nodes must perform a large number of embedding lookup operations against the PS nodes in every training step; the share of step time spent on these lookups keeps increasing and becomes the bottleneck of training speed.

DeepRec provides an Asynchronous Embedding Lookup feature that automatically identifies the embedding lookup subgraph and executes it asynchronously with respect to the main graph, thereby hiding the communication bottleneck and improving training performance.

Once the embedding lookup operation is made asynchronous, the gradient update operation and the embedding lookup operation of a worker node live in different graphs. As a result, the worker node may not obtain the latest embedding values from the PS node, which can affect convergence speed and model quality.

Description

Provided that the user's original graph contains an io stage, the Asynchronous Embedding Lookup feature automatically identifies the embedding lookup subgraph and executes it asynchronously with respect to the main graph.

Attention

  1. The prerequisite for enabling the Asynchronous Embedding Lookup feature is that the user's original graph contains an io stage, placed after sample reading and before the embedding lookup operations; see the sketch after this list. Please refer to Pipeline-Stage.

  2. This feature conflicts with Pipeline-SmartStage; the SmartStage feature is turned off when Asynchronous Embedding Lookup is enabled.
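
For orientation, here is a minimal sketch of where such an io stage sits, assuming DeepRec's tf.staged() API from the Pipeline-Stage document; the file name and table sizes are illustrative:

import tensorflow as tf

# --- sample reading ---
dataset = tf.data.TextLineDataset(["train.csv"])  # hypothetical input file
lines = dataset.batch(32).make_one_shot_iterator().get_next()

# --- io stage: after sample reading, before embedding lookup ---
# tf.staged() is the Pipeline-Stage API; see that document for details
staged = tf.staged([lines])

# --- embedding lookup: the subgraph that can be executed asynchronously ---
emb_table = tf.get_variable("emb_table", shape=[1000, 16])
ids = tf.string_to_hash_bucket_fast(staged[0], num_buckets=1000)
emb = tf.nn.embedding_lookup(emb_table, ids)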

User API

Currently, the Asynchronous Embedding Lookup function supports the following Embedding Lookup interfaces in DeepRec.

tf.contrib.feature_column.sequence_input_layer()
tf.contrib.layers.safe_embedding_lookup_sparse()
tf.contrib.layers.input_from_feature_columns()
tf.contrib.layers.sequence_input_from_feature_columns()
tf.feature_column.input_layer()
tf.nn.embedding_lookup()
tf.nn.embedding_lookup_sparse()
tf.nn.safe_embedding_lookup_sparse()
tf.nn.fused_embedding_lookup_sparse()
tf.python.ops.embedding_ops.fused_safe_embedding_lookup_sparse()
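
For example, a lookup built with tf.feature_column.input_layer(), one of the interfaces above, is eligible for asynchronous execution; the feature name and sizes in this sketch are illustrative:

import tensorflow as tf

features = {"item_id": tf.constant([["item_1"], ["item_2"]])}
cat_col = tf.feature_column.categorical_column_with_hash_bucket(
    key="item_id", hash_bucket_size=100000)
emb_col = tf.feature_column.embedding_column(cat_col, dimension=16)
# the embedding lookup inside input_layer() can be moved into the asynchronous subgraph
dense = tf.feature_column.input_layer(features, [emb_col])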

DeepRec provides the following configuration options in ConfigProto:

import tensorflow as tf

sess_config = tf.ConfigProto()
sess_config.graph_options.optimizer_options.do_async_embedding = True
sess_config.graph_options.optimizer_options.async_embedding_options.threads_num = 4
sess_config.graph_options.optimizer_options.async_embedding_options.capacity = 4
sess_config.graph_options.optimizer_options.async_embedding_options.use_stage_subgraph_thread_pool = False  # optional
sess_config.graph_options.optimizer_options.async_embedding_options.stage_subgraph_thread_pool_id = 0  # optional
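
The sess_config is then passed when the training session is created. A minimal sketch, assuming tf.make_prefetch_hook() from the Pipeline-Stage document is registered to start the stage threads, and train_op stands for the user's training step:

with tf.train.MonitoredTrainingSession(
        config=sess_config,
        hooks=[tf.make_prefetch_hook()]) as sess:
    while not sess.should_stop():
        sess.run(train_op)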

| Configuration Option | Description | Default Value |
| --- | --- | --- |
| do_async_embedding | Enable Asynchronous Embedding Lookup or not | False |
| threads_num | Number of threads that asynchronously execute the embedding lookup subgraph | 0 (must be set) |
| capacity | Maximum number of asynchronous embedding lookup results a worker node can cache | 0 (must be set) |
| use_stage_subgraph_thread_pool | Whether to run the embedding lookup subgraph in an independent thread pool; the pool must be created first | False (optional) |
| stage_subgraph_thread_pool_id | Index of the independent thread pool to use | 0 (optional; valid range is [0, number of independent thread pools - 1]) |

Attention

  1. A larger threads_num is not always better: it only needs to be large enough that the main graph never waits for the results of the embedding lookup operation. If the value is too large, the extra threads preempt computing resources from model training and occupy more communication bandwidth. As a rule of thumb, set it to roughly the ratio of the embedding lookup time to the main-graph compute time per step (see the sketch after this list); you can also start at 1 and adjust upwards to find the best value.

  2. A larger capacity consumes more memory and also widens the gap between the cached embedding lookup results and the latest values on the PS node, which slows training convergence. It is recommended to set it to the same value as threads_num, starting at 1 and adjusting upwards.

  3. The independent thread pool option lets different Stage subgraphs run in different thread pools, avoiding contention with the default thread pool used by the main graph and other subgraphs. For how to create an independent thread pool, please refer to Pipeline-Stage.
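
A back-of-the-envelope sketch of the rule of thumb from item 1; the timings are made-up measurements, not numbers from this document:

import math

t_lookup = 30.0  # ms per asynchronous embedding lookup (hypothetical measurement)
t_main = 10.0    # ms of main-graph compute per step (hypothetical measurement)

# keep enough lookups in flight that the main graph never waits for results
threads_num = math.ceil(t_lookup / t_main)  # -> 3
capacity = threads_num                      # recommended starting point (item 2)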

Performance

CPU scenario

The following results show the performance of this feature on the DLRM model from the DeepRec modelzoo.

Each machine is an Aliyun ECS ecs.hfc7.24xlarge instance; the training cluster consists of 10 such instances.

Aliyun ECS Instance configuration information:

| Item | Description |
| --- | --- |
| CPU | Intel Xeon Platinum (Cooper Lake) 8369, 96 cores |
| MEM | 192 GiB |
| NET | 32 Gbps |

Training configuration:

| Item | Size |
| --- | --- |
| PS nodes | 8 |
| Worker nodes | 30 |
| CPU cores per PS | 15 |
| CPU cores per worker | 10 |

Result

| Model | Baseline (global steps/sec) | Async Embedding (global steps/sec) | Speedup |
| --- | --- | --- | --- |
| DLRM | 1008.6968 | 1197.932 | 1.1876 |