Synchronize CUDA To Time CUDA Operations

# %pip install -q omniverse==0.0.62

Timing CUDA operations is tricky because CUDA operations are asynchronous: a naive wall-clock measurement captures the time taken to queue the operation, not the time taken to execute it. The operation is queued for execution on the GPU while the CPU immediately moves on to the next instruction. Consider the following code that runs on a P100 GPU (you can test it on Kaggle):

from __future__ import annotations

from timeit import default_timer
from typing import Tuple

import torch
import logging
import sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
    force=True,
)
logger = logging.getLogger(__name__)


def perform_matrix_multiplication(x: torch.Tensor, synchronize: bool = False) -> float:
    logger.info("Starting to time.")
    start_time = default_timer()

    # The matmul is queued on the default CUDA stream and returns immediately;
    # the GPU executes it asynchronously.
    _ = torch.matmul(x, x)

    if synchronize:
        # Block the CPU until all queued work on the GPU has finished.
        torch.cuda.synchronize()

    end_time = default_timer()

    time_taken = end_time - start_time
    logger.info("Ending timer. Time taken is %s", time_taken)
    return time_taken


def main(
    tensor_size: Tuple[int, int] = (20000, 20000), dtype: torch.dtype = torch.float32, synchronize: bool = False
) -> float:
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available. This example requires a CUDA-enabled GPU.")

    device: torch.device = torch.device("cuda")

    x: torch.Tensor = torch.randn(*tensor_size, device=device, dtype=dtype)

    time = perform_matrix_multiplication(x, synchronize=synchronize)
    return time

You can see that the matmul torch.matmul(x, x) appears to complete in just \(0.0003\) seconds. Different runs will give different results since the variance can be high here, but the point will become apparent once we run the same code with torch.cuda.synchronize().

_ = main(synchronize=False)
2024-08-11 09:55:52,387 - __main__ - INFO - Starting to time.
2024-08-11 09:55:52,389 - __main__ - INFO - Ending timer. Time taken is 0.0003956739992645453
_ = main(synchronize=True)
2024-08-11 09:56:05,935 - __main__ - INFO - Starting to time.
2024-08-11 09:56:07,824 - __main__ - INFO - Ending timer. Time taken is 1.8874175820001255

Now we see that the operation actually took around \(2\) seconds to complete, roughly \(4700\) times slower than the previous result. Why? Because in the first run you are essentially measuring the time taken to queue the operation, not the time taken to execute it. In other words, and less pedantically since I am not well versed with CUDA at a deep level, once torch.matmul(x, x) is called, the CPU queues the operation on the default CUDA stream for execution on the GPU; it does not wait for the operation to complete and continues to execute the next instruction. In our case, if synchronize is False, execution reaches end_time = default_timer() almost immediately and reports the time taken to queue the operation, which is very fast. This is possible because GPU operations are asynchronous and do not block the CPU here.

However, if we set synchronize to True, torch.cuda.synchronize() forces the CPU to wait until all preceding CUDA operations in the default stream have completed before it continues, so the measured interval now includes the actual execution time of the matmul.
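The same point can be made without an explicit call to torch.cuda.synchronize(): anything that needs the actual values, such as copying the result back to the CPU, will block until the matmul has finished. The snippet below is a minimal sketch (not from the code above, but reusing the same tensor size) illustrating this.

import torch
from timeit import default_timer

x = torch.randn(20000, 20000, device="cuda")

start = default_timer()
y = torch.matmul(x, x)  # queued on the default stream, returns almost immediately
queue_time = default_timer() - start

start = default_timer()
_ = y.cpu()  # the device-to-host copy waits for the matmul to finish, so this call blocks
fetch_time = default_timer() - start

print(f"time to queue: {queue_time:.6f}s, time until result is on CPU: {fetch_time:.6f}s")

This is also why naive timings of pipelines that end with a .item(), .cpu(), or a print of a GPU tensor often look plausible: the synchronization happens implicitly at that point.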

Even then, a wall-clock timer like default_timer is not the best way to measure the time taken by CUDA operations. Better options include torch.cuda.Event and the other native profiling tools in the torch ecosystem.
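As a concrete pointer, here is a minimal sketch (assuming the same (20000, 20000) float32 tensor as above) of how torch.cuda.Event can be used to time the matmul on the GPU itself; events are recorded on the stream, so they measure execution time rather than queueing time.

import torch

x = torch.randn(20000, 20000, device="cuda")

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

# Warm-up so that one-off costs (CUDA context setup, cuBLAS handle creation) are excluded.
_ = torch.matmul(x, x)
torch.cuda.synchronize()

start_event.record()
_ = torch.matmul(x, x)
end_event.record()

# elapsed_time is only valid after both events have completed on the GPU.
torch.cuda.synchronize()
print(f"matmul took {start_event.elapsed_time(end_event):.2f} ms")

For more rigorous measurements, torch.utils.benchmark.Timer and torch.profiler handle warm-up, synchronization, and repeated runs for you.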