Keras with TensorFlow on a GPU machine - some parts are very slow


I'm trying to train a model using Keras / TensorFlow (1.4) on a p3.2xlarge AWS machine (which has an NVIDIA Tesla V100 GPU). Two parts of the initialisation are very slow when using a GPU, but run in a reasonable time on CPU.

The first part is "calling" an embedding layer during model setup:

network = embedding(input)

This embedding layer is used several times, but only the 1st call is slow. It appears that this is the phase where the weights are copied to the GPU, and it takes a few minutes (~5) for a 400000 * 200 weight matrix.

The second slow part is the call to train_on_batch for the first batch (it takes about 20 minutes).

Not sure if it's relevant, but according to this post, it might be related to using a custom layer.

Are there any ways to speed up these parts?

EDIT: These parts aren't slow when running the same code on a p2.xlarge AWS machine (which has a Tesla K80 GPU).

Answer

You're seeing slow initialization during model setup and on the first batch when training on the Tesla V100 (p3.2xlarge), while the same code runs fine on the Tesla K80 (p2.xlarge). That looks surprising at first, since the V100 should comfortably outperform the K80, but there is a likely explanation related to the TensorFlow build (see section 4). Let's go through the potential causes and ways to speed things up:

1. Initial Slow Embedding Layer Setup

The first part of the issue is the embedding layer initialization, where the weight matrix (400,000 x 200) is copied to the GPU for the first time. This is typically a one-time operation, but the initial transfer can be slow due to a few reasons:

Potential Causes:

  • Weight transfer to the GPU: The 400,000 x 200 float32 matrix is roughly 320 MB, and copying it to the device is a one-time cost when the layer is first called; on its own, though, that transfer should take seconds, not minutes.
  • First-time GPU initialization: The first operation that touches the GPU (your embedding call) also pays for creating the CUDA context, loading cuDNN, and loading or compiling GPU kernels. On a GPU that the installed TensorFlow build was not compiled for, this kernel compilation step can take a very long time (see section 4).

Possible Solutions:

  • Pre-load TensorFlow onto the GPU: Before your actual training, run a simple operation (like a matrix multiplication or a small forward pass) that will initialize the GPU and avoid delays during the embedding layer setup.

    Example:

    import tensorflow as tf
    import numpy as np
    
    # Dummy warm-up so the CUDA context and kernels are initialised before real training.
    # Note: the tf.config calls below require TF 2.x; on TF 1.x use
    # tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True)) instead.
    tf.keras.backend.clear_session()  # Clear any prior sessions/graphs
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_memory_growth(gpus[0], True)  # must run before the GPU is first used
    
    # Build a throwaway model with the same embedding shape as the real one
    dummy_input = np.random.randint(0, 400000, size=(1, 100))
    dummy_model = tf.keras.Sequential([
        tf.keras.layers.Embedding(400000, 200, input_length=100)
    ])
    dummy_model.predict(dummy_input)
    

    This ensures the GPU is properly initialized before the actual training begins.

  • Check the embeddings_initializer: If you bake a large pre-trained matrix into a custom initializer, the whole matrix can end up in the graph definition, which slows down graph construction and device placement. Using a standard initializer such as glorot_uniform or he_normal and then copying the pre-trained vectors in with set_weights() after the model is built is usually cheaper; a minimal sketch follows.
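
    For illustration, a minimal sketch of loading pre-trained vectors with set_weights() after the layer is built (the pretrained matrix here is a hypothetical stand-in for your own vectors):

    import numpy as np
    import tensorflow as tf
    
    # Hypothetical pre-trained embedding matrix -- replace with your own vectors
    pretrained = np.random.rand(400000, 200).astype('float32')
    
    embedding = tf.keras.layers.Embedding(400000, 200, input_length=100)
    model = tf.keras.Sequential([embedding])
    model.build(input_shape=(None, 100))  # creates the weight variables
    
    # Copy the vectors into the variable instead of baking them into the graph
    embedding.set_weights([pretrained])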

2. Slow First Batch Training (20 minutes)

The second issue is the slow training during the first batch, which is taking up to 20 minutes. This can happen for several reasons, particularly in environments with large models or datasets. Here are a few suggestions to diagnose and speed up this part:

Potential Causes:

  • Large Data Transfer During First Batch: The first batch might involve the transfer of a lot of data from CPU to GPU, which can cause delays. This is often more noticeable with large datasets or when using the .train_on_batch() method (which may bypass some optimizations).

  • GPU Memory and Kernel Initialization: The first batch finalises the graph, allocates GPU memory, and loads (or, if the build does not ship kernels for your GPU, compiles) the CUDA kernels. This is a one-off cost that can dominate the first train_on_batch call (see section 4).

Possible Solutions:

  • Pre-load Data to GPU: Before training, you can manually move data to the GPU to ensure that no data transfer delays occur when training starts.

    import numpy as np
    import tensorflow as tf
    
    # `model`, `batch_size` and `input_dim` refer to your existing model and input shape;
    # use dummy data with the same dtype/shape your model actually expects
    data = np.random.random((batch_size, input_dim)).astype('float32')
    data = tf.convert_to_tensor(data)  # Explicitly convert to a tensor
    
    # Run a quick dummy prediction so the transfer happens before real training starts
    with tf.device('/GPU:0'):
        _ = model.predict(data)
    
  • Use TensorFlow Data API: If you are not already doing this, use the TensorFlow tf.data.Dataset API, which efficiently handles batch processing and data pipelining for GPU-based models. This ensures that data is efficiently loaded and preprocessed in parallel during training.

    Example:

    import tensorflow as tf
    
    # `data`, `labels` and `batch_size` stand in for your own arrays and settings
    dataset = tf.data.Dataset.from_tensor_slices((data, labels))
    dataset = dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
    
  • Reduce the Batch Size for the First Batch: Sometimes, large batch sizes can cause issues with the initial batch as they put heavy memory pressure on the GPU. Try using a smaller batch size for the first batch, and then increase it for subsequent batches.
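
    A hedged warm-up sketch of this idea (assumes your own model, x_train, y_train and batch_size already exist):

    # Pay the one-time GPU/kernel setup cost on a tiny batch first...
    model.train_on_batch(x_train[:8], y_train[:8])
    
    # ...then continue with the normal batch size
    for start in range(0, len(x_train), batch_size):
        model.train_on_batch(x_train[start:start + batch_size],
                             y_train[start:start + batch_size])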

  • Check TensorFlow GPU Settings: Ensure that TensorFlow is using the GPU correctly, and that there are no bottlenecks in GPU memory allocation. You can check the GPU utilization during training to confirm that it is being used effectively.

    You can monitor the GPU utilization with nvidia-smi:

    watch -n 1 nvidia-smi
    

    This can help you understand whether the GPU is being used effectively or if there are issues with memory allocation.

  • Use Mixed Precision: The V100's Tensor Cores make mixed-precision (float16) training significantly faster, but TensorFlow 1.4 has no mixed-precision support, so this only becomes an option after upgrading. On TensorFlow 2.4+ you can enable it with the Keras mixed-precision API:

    from tensorflow.keras import mixed_precision
    
    policy = mixed_precision.Policy('mixed_float16')
    mixed_precision.set_global_policy(policy)
    

    Mixed precision reduces memory usage and can improve training speed, particularly for large models.
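
    If you stay on TensorFlow 1.x, the closest equivalent is the graph-rewrite API added in TF 1.14 (it does not exist in 1.4, so an upgrade is needed either way). A minimal sketch, assuming TF 1.14/1.15 and your existing model:

    import tensorflow as tf
    
    # Wrap the optimizer; the rewrite casts eligible ops to float16 and adds loss scaling
    opt = tf.keras.optimizers.Adam()
    opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
    model.compile(optimizer=opt, loss='categorical_crossentropy')  # loss here is just an example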

3. Using train_on_batch

If you are using train_on_batch, you give up some of the conveniences that .fit() provides, such as feeding data through a tf.data input pipeline with prefetching and parallel preprocessing, which can make training slower overall.

  • Try Using .fit() Instead of train_on_batch: If possible, switch from using train_on_batch to using .fit(). This will enable TensorFlow's automatic handling of batch training and data pipeline optimizations, which may help reduce the overhead on the first batch.

    Example:

    model.fit(train_dataset, epochs=10)  # batch size is already set on the dataset via .batch()
    

4. TensorFlow Version and Hardware Utilization

Make sure you are using a recent version of TensorFlow and that all hardware acceleration features are enabled. The Tesla V100 supports FP16 operations, which can speed up training, so ensure that you are not using a suboptimal configuration.

  • Ensure proper installation of GPU-optimized libraries (CUDA, cuDNN): Ensure that the version of TensorFlow is compatible with the installed versions of CUDA and cuDNN on the machine. Compatibility issues can lead to inefficient use of the GPU.
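
    For example, a quick check that the installed TensorFlow build was compiled with CUDA and can actually see the GPU (this should work on both TF 1.x and 2.x):

    import tensorflow as tf
    from tensorflow.python.client import device_lib
    
    print(tf.__version__)                   # e.g. 1.4.0
    print(tf.test.is_built_with_cuda())     # True for the GPU build
    print(device_lib.list_local_devices())  # should list the Tesla V100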

  • TensorFlow 1.4 is too old for the V100: TensorFlow 1.4 (late 2017) was built against CUDA 8, which predates the V100's compute capability 7.0. When the prebuilt binaries don't include kernels for your GPU's architecture, the CUDA driver JIT-compiles them from PTX on first use, and for a library as large as TensorFlow this can take many minutes. That would explain your exact symptoms: the first embedding call and the first train_on_batch are slow on the V100, while the K80 (compute capability 3.7, covered by the prebuilt kernels) is unaffected. Upgrading to a build with CUDA 9+ support (TensorFlow 1.5 at a minimum, ideally 1.15 or 2.x) is the most likely fix.

5. Summary of Solutions:

  • Pre-warm the GPU with a dummy operation.
  • Use tf.data for efficient data loading and prefetching.
  • Ensure proper GPU memory management.
  • Try mixed-precision training for faster computation.
  • Ensure TensorFlow is correctly using the GPU and consider updating TensorFlow.
  • Use .fit() instead of train_on_batch if possible.

Let me know if any of these suggestions help improve the performance!