# GPU Multi-stream

## Introduction

In training scenarios, GPUs are often used to accelerate computation. Because all compute kernels are otherwise submitted to the same stream, some models suffer from insufficient execution concurrency and low GPU utilization. To address this, we provide the GPU Multi-stream optimization. This feature provides multiple GPU streams and a variety of built-in graph splitting rules, and users can also specify subgraphs manually. Subgraphs that have no data dependency on one another can then be submitted to different GPU streams, achieving concurrent execution at the subgraph level and thereby improving GPU utilization.

## User API

This feature can be enabled by setting the following parameters in `tf.ConfigProto`.

```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

sess_config = tf.ConfigProto()
# Turn on the multi-stream feature
sess_config.graph_options.rewrite_options.use_multi_stream = (
    rewriter_config_pb2.RewriterConfig.ON)
# The number of streams
sess_config.graph_options.rewrite_options.multi_stream_opts.multi_stream_num = 4
```

## Graph Splitting Strategy

### 1. Manual Graph Splitting

For manual graph splitting, we provide the `tf.stream()` API to specify the `stream id`; `tf.stream()` contexts can be nested. In addition, the `tf.colocate_with()` API can associate a newly created operation with a specified existing operation so that both are placed on the same stream.

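To make the nesting rule concrete, here is a minimal pure-Python model of how nested stream contexts might resolve to a stream id. It is only an illustration, not the DeepRec implementation: the names `stream`, `current_stream`, `record`, and `NUM_STREAMS` are hypothetical, and both the `ValueError` for an out-of-range id and the `None` result outside any context are assumptions made for this sketch.

```python
import contextlib

NUM_STREAMS = 4      # stands in for multi_stream_num in tf.ConfigProto
_stream_stack = []   # the innermost active stream id is the last element


@contextlib.contextmanager
def stream(stream_id):
    """Toy stand-in for a stream context (assumption: invalid ids raise)."""
    if not 0 <= stream_id < NUM_STREAMS:
        raise ValueError(
            "stream id %d must be in [0, %d)" % (stream_id, NUM_STREAMS))
    _stream_stack.append(stream_id)
    try:
        yield
    finally:
        # Leaving the context restores the enclosing stream, if any.
        _stream_stack.pop()


def current_stream():
    """Return the innermost active stream id, or None outside any context."""
    return _stream_stack[-1] if _stream_stack else None


placement = {}


def record(op_name):
    """Remember which stream was active when the operation was created."""
    placement[op_name] = current_stream()


with stream(0):
    record("a")          # created under stream 0
    with stream(1):
        record("b")      # the innermost context wins: stream 1
    record("c")          # back in the stream 0 context

print(placement)  # {'a': 0, 'b': 1, 'c': 0}
```

The stack captures the one property the text relies on: an operation takes the id of the innermost enclosing context, and leaving a nested context returns to the outer one.
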
#### Usage

```python
# Every stream id must be less than the number of streams set in tf.ConfigProto.
with tf.stream(0):  # Enter the context of stream 0
    a = tf.placeholder(tf.float32, [None, 1], name='a')  # Placed on stream 0
    with tf.stream(1):  # Enter the nested context of stream 1
        b = tf.placeholder(tf.float32, [None, 1], name='b')  # Placed on stream 1
    # Go back to the context of stream 0
    c = tf.constant([1, 2, 3, 4], tf.float32, [4, 1], name='c')  # Placed on stream 0
    with tf.colocate_with(a):  # Associate new operations with `a`
        # `d` is associated with `a` and is placed on the same GPU stream as `a`
        d = tf.constant([5, 6, 7, 8], tf.float32, [4, 1], name='d')
```

#### Best Practice

```python
import os

import numpy as np
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

learning_rate = 0.01
max_train_steps = 1000
log_step = 100

train_X = np.array([[3.3], [4.4], [5.5], [6.71], [6.93], [4.168], [9.779], [6.182], [7.59],
                    [2.167], [7.042], [10.791], [5.313], [7.997], [5.654], [9.27], [3.1]],
                   dtype=np.float32)
train_Y = np.array([[3.3], [4.4], [5.5], [6.71], [6.93], [4.168], [9.779], [6.182], [7.59],
                    [2.167], [7.042], [10.791], [5.313], [7.997], [5.654], [9.27], [3.1]],
                   dtype=np.float32)
train_Z = np.array([[1.7], [2.76], [2.09], [3.19], [1.694], [1.573], [3.336], [2.596], [2.53],
                    [1.221], [2.827], [3.465], [1.65], [2.904], [2.42], [2.94], [1.3]],
                   dtype=np.float32)
total_samples = train_X.shape[0]

Z_ = tf.placeholder(tf.float32, [None, 1])

# The X branch is built on stream 0.
with tf.stream(0):
    X = tf.placeholder(tf.float32, [None, 1])
    W_X = tf.Variable(tf.random_normal([1, 1]), name='weight_x')
    b = tf.Variable(tf.zeros([1]), name='bias')
    X_Result = tf.matmul(X, W_X)
    X_Result = tf.add(X_Result, b)

# The Y branch has no data dependency on the X branch, so it is built on stream 1.
with tf.stream(1):
    Y = tf.placeholder(tf.float32, [None, 1])
    W_Y = tf.Variable(tf.random_normal([1, 1]), name='weight_y')
    Y_Result = tf.matmul(Y, W_Y)

Z = X_Result + Y_Result
loss = tf.reduce_sum(tf.pow(Z - Z_, 2)) / total_samples
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss)

sess_config = tf.ConfigProto()
sess_config.graph_options.rewrite_options.use_multi_stream = (
    rewriter_config_pb2.RewriterConfig.ON)
sess_config.graph_options.rewrite_options.multi_stream_opts.multi_stream_num = 2

feed = {X: train_X, Y: train_Y, Z_: train_Z}
log_format = "Step:%d, loss==%.4f, W_X==%.4f, b==%.4f, W_Y==%.4f"

with tf.Session(config=sess_config) as sess:
    sess.run(tf.global_variables_initializer())
    print("Start training:")
    for step in range(max_train_steps):
        sess.run(train_op, feed_dict=feed)
        if step % log_step == 0:
            # Fetch everything in one call; .item() turns size-1 arrays into scalars.
            c, w_x, b_val, w_y = sess.run([loss, W_X, b, W_Y], feed_dict=feed)
            print(log_format % (step, c, w_x.item(), b_val.item(), w_y.item()))
    # Use new names for the fetched values so the tensor `b` is not overwritten.
    final_loss, w_x, b_val, w_y = sess.run([loss, W_X, b, W_Y], feed_dict=feed)
    print(log_format % (max_train_steps, final_loss, w_x.item(), b_val.item(), w_y.item()))
    print("Linear Regression Model: Z=%.4f*X + %.4f*Y + %.4f"
          % (w_x.item(), w_y.item(), b_val.item()))
```

## Enabling GPU MPS

This optimization is compatible with GPU MPS (Multi-Process Service). Users can enable GPU MPS with the following steps.

1. Start the GPU MPS control daemon on the host.

```bash
nvidia-cuda-mps-control -d
```

2. Configure the Docker launch (if training inside a container).

The `--ipc=host` option must be added so that processes in the container can communicate with the GPU MPS running on the host. The following is an example; replace `<container_name>` and `<image>` with your own values.

```bash
sudo docker run -itd --name <container_name> --ipc=host --gpus='"device=0"' <image> bash
```

In this example, GPU 0 is bound to the created container, so one GPU is visible inside the container. GPU MPS can then be used by running GPU training tasks directly in the container.
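
When containers are launched from a script, the Docker step above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_mps_docker_cmd` and `to_shell` are hypothetical helper names, `trainer` and `my-image` are placeholder values, and `sudo` is left out on the assumption that the caller adds it if needed. It only builds the argument list and does not start anything.

```python
import shlex


def build_mps_docker_cmd(container_name, image, gpu_device="0", command="bash"):
    """Build a `docker run` argument list that shares the host IPC namespace.

    Hypothetical helper: it mirrors the example command in this section and
    performs no validation of the container name, image, or device id.
    """
    return [
        "docker", "run", "-itd",
        "--name", container_name,
        "--ipc=host",                          # needed to reach the host MPS
        '--gpus="device=%s"' % gpu_device,     # bind a single GPU to the container
        image,
        command,
    ]


def to_shell(args):
    """Render an argument list as a shell-safe command line."""
    return " ".join(shlex.quote(arg) for arg in args)


cmd = build_mps_docker_cmd("trainer", "my-image")  # placeholder values
print(to_shell(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` without invoking a shell, which avoids the nested quoting that the `--gpus` value needs on the command line.
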