Synchronize CUDA To Time CUDA Operations#
# %pip install -q omniverse==0.0.62
Timing with cuda operations is tricky, because cuda operations are asynchronous. This means that the time taken to execute a cuda operation is not the time taken to execute the operation, but the time taken to queue the operation. This is because the operation is queued on the GPU, and the CPU continues to execute the next operation. Consider the following code that runs on a P100 GPU (you can test it on kaggle):
from __future__ import annotations
from timeit import default_timer
from typing import Tuple
import torch
import logging
import sys
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
handlers=[logging.StreamHandler(sys.stdout)],
force=True,
)
logger = logging.getLogger(__name__)
def perform_matrix_multiplication(x: torch.Tensor, synchronize: bool = False) -> float:
logger.info("Starting to time.")
start_time = default_timer()
_ = torch.matmul(x, x)
if synchronize:
torch.cuda.synchronize()
end_time = default_timer()
time_taken = end_time - start_time
logger.info("Ending timer. Time take is %s", time_taken)
return time_taken
def main(
tensor_size: Tuple[int, int] = (20000, 20000), dtype: torch.dtype = torch.float32, synchronize: bool = False
) -> float:
if not torch.cuda.is_available():
raise RuntimeError("CUDA is not available. This example requires a CUDA-enabled GPU.")
device: torch.device = torch.device("cuda")
x: torch.Tensor = torch.randn(*tensor_size, device=device, dtype=dtype)
time = perform_matrix_multiplication(x, synchronize=synchronize)
return time
You see that the time taken to complete the matmul torch.matmul(x, x)
is so
fast at just \(0.0003\) seconds. Different runs will give different results since
variance could be high here, but the point here will be apparent once we run
with the torch.cuda.synchronize()
function.
_ = main(synchronize=False)
2024-08-11 09:55:52,387 - __main__ - INFO - Starting to time.
2024-08-11 09:55:52,389 - __main__ - INFO - Ending timer. Time take is 0.0003956739992645453
_ = main(synchronize=True)
2024-08-11 09:56:05,935 - __main__ - INFO - Starting to time.
2024-08-11 09:56:07,824 - __main__ - INFO - Ending timer. Time take is 1.8874175820001255
Now we see that the operation actually took around \(2\) seconds to complete, that
is more than \(6000\) times slower than the previous result. Why? Because in the
first run, you are essentially just measuring the time taken to queue the
operation, not the time taken to execute the operation. In other words, and less
pedantically since I am not well versed with CUDA on a deep level, once
torch.matmul(x, x)
is called, the CPU queues the operation in the default CUDA
stream for execution on the GPU, the CPU does not wait for the operation to
complete, and continues to execute the next operation. In our case, if
synchronize
is False
, it will reach end_time = default_timer()
and print
the time taken to queue the operation, which is very fast. This is possible
because GPU operations are asynchrnous and does not block the CPU here.
However, if we set synchronize
to True
, the CPU is forced to wait until all
preceding CUDA operations are completed in the default stream before continuing.
Even then, using the default timer is not the best way to measure time taken to
execute CUDA operations. One can read more on it such as using
torch.cuda.Event
and other native profiling tools in the torch ecosystem.