CPU Memory Optimization


On the CPU side, general-purpose memory allocators such as ptmalloc or jemalloc can cause frequent page faults when allocating the large memory chunks that are common in DL applications. To address this, DeepRec optimizes memory allocation to reduce memory usage and minor page faults, which improves runtime performance.

When this optimization is enabled, DeepRec starts collecting memory allocation information once the number of executed steps reaches the START_STATISTIC_STEP threshold, and it generates an allocation plan from the information collected at each step. While generating the plan, it checks whether the previously generated plan already satisfies the current step's allocation requirements; if it does, the step is counted as a stable step. DeepRec stops collecting memory information when the number of stable steps reaches the STABLE_STATISTIC_STEP threshold or the total number of collected steps reaches the MAX_STATISTIC_STEP threshold. Because the optimization depends on this collected information, the performance gain becomes visible only after a certain number of steps.
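The stopping rule described above can be illustrated with a small simulation. This is only a sketch, not DeepRec's implementation: the representation of a plan as a per-size block count, the helper names plan_covers and collect, and the choice to count stable steps cumulatively rather than consecutively are all assumptions made for illustration.

```python
from collections import Counter

# Default thresholds as stated in this document.
START_STATISTIC_STEP = 100
STABLE_STATISTIC_STEP = 10
MAX_STATISTIC_STEP = 100


def plan_covers(plan, requests):
    """Hypothetical check: the plan holds enough blocks of every requested size."""
    return all(plan[size] >= count for size, count in requests.items())


def collect(steps, start=START_STATISTIC_STEP, stable=STABLE_STATISTIC_STEP,
            max_steps=MAX_STATISTIC_STEP):
    """Simulate collection; return (plan, step index where collection stopped or None)."""
    plan = Counter()      # assumed plan format: block size -> number of blocks
    stable_steps = 0      # assumption: stable steps are counted cumulatively
    collected = 0
    for step, sizes in enumerate(steps):
        if step < start:
            continue      # steps before the start threshold are not collected
        requests = Counter(sizes)
        collected += 1
        if plan_covers(plan, requests):
            stable_steps += 1          # previous plan already fits this step
        else:
            plan = plan | requests     # grow the plan to the per-size maximum
        if stable_steps >= stable or collected >= max_steps:
            return plan, step
    return plan, None


# A regular workload: every step allocates the same three blocks.
regular = [[64, 64, 1024]] * 200
plan, stopped_at = collect(regular)
```

With the default thresholds, the regular workload above builds its plan at step 100, then sees ten stable steps and stops collecting at step 110. A workload whose requirements keep growing never becomes stable and stops only when the MAX_STATISTIC_STEP limit is reached.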

User API

On the CPU side, the current version of DeepRec supports CPU memory optimization for both stand-alone and distributed training and inference. It is enabled by default and can be turned off with the command export ENABLE_MEMORY_OPTIMIZATION=0. Three environment variables control the collection of memory information. START_STATISTIC_STEP sets the step at which collection starts. STABLE_STATISTIC_STEP sets the number of stable steps after which collection ends. MAX_STATISTIC_STEP sets the maximum number of collected steps after which collection ends. The default values are 100, 10, and 100, respectively. These values generally do not need to be changed. START_STATISTIC_STEP can be increased when there are many initialization graphs, and STABLE_STATISTIC_STEP and MAX_STATISTIC_STEP can be increased when the main computational graph is irregular or when many computational graphs are running.
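As a sketch of how these variables might be set from a launcher script, the helper below builds an environment for a child process. The function name memory_opt_env and the script name train.py are hypothetical; only the variable names and default values come from this document. Because the optimization is enabled by default, the sketch sets ENABLE_MEMORY_OPTIMIZATION only when turning it off.

```python
import os


def memory_opt_env(enabled=True, start=100, stable=10, max_steps=100, base=None):
    """Return a copy of the environment with the memory-optimization variables set."""
    env = dict(os.environ if base is None else base)
    if not enabled:
        # The document only specifies the value 0 (off); the default is on.
        env["ENABLE_MEMORY_OPTIMIZATION"] = "0"
    env["START_STATISTIC_STEP"] = str(start)
    env["STABLE_STATISTIC_STEP"] = str(stable)
    env["MAX_STATISTIC_STEP"] = str(max_steps)
    return env


# Example: delay collection for a job with many initialization graphs.
env = memory_opt_env(start=500, base={})

# To launch a (hypothetical) training script with these settings:
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=memory_opt_env(start=500))
```

Passing the variables through the child process environment ensures they are in place before the training process starts.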

Using jemalloc

The CPU memory optimization can be used together with the jemalloc library. After setting the MALLOC_CONF environment variable, preload the jemalloc dynamic library with LD_PRELOAD in front of the python command, for example:

export MALLOC_CONF="background_thread:true,metadata_thp:auto,dirty_decay_ms:20000,muzzy_decay_ms:20000"
LD_PRELOAD=./libjemalloc.so.2 python ...