Fundamentals of Accelerated Computing with CUDA Python Training

This workshop teaches you the fundamental tools and techniques for running GPU-accelerated Python applications using CUDA® GPUs and the Numba compiler. You’ll work through dozens of hands-on coding exercises and, at the end of the training, implement a new workflow that accelerates a fully functional linear algebra program originally designed for CPUs, then measure the resulting performance gains. After the workshop ends, you’ll have additional resources to help you create new GPU-accelerated applications on your own.
Course Details


Duration: 1 day

Prerequisites:
  • Basic Python competency, including familiarity with variable types, loops, conditional statements, functions, and array manipulations
  • NumPy competency, including the use of ndarrays and ufuncs
  • No previous knowledge of CUDA programming is required

Skills Gained

  • GPU-accelerate NumPy ufuncs with a few lines of code.
  • Configure code parallelization using the CUDA thread hierarchy.
  • Write custom CUDA device kernels for maximum performance and flexibility.
  • Use memory coalescing and on-device shared memory to increase the memory bandwidth your CUDA kernels achieve.

Course Outline
  • Introduction
  • Introduction to CUDA Python with Numba
    • Begin working with the Numba compiler and CUDA programming in Python.
    • Use Numba decorators to GPU-accelerate numerical Python functions.
    • Optimize host-to-device and device-to-host memory transfers.
  • Custom CUDA Kernels in Python with Numba
    • Learn CUDA’s parallel thread hierarchy and how it broadens the range of programs you can parallelize.
    • Launch massively parallel custom CUDA kernels on the GPU.
    • Utilize CUDA atomic operations to avoid race conditions during parallel execution.
  • Multidimensional Grids and Shared Memory for CUDA Python with Numba
    • Learn multidimensional grid creation and how to work in parallel on 2D matrices.
    • Leverage on-device shared memory to promote memory coalescing while reshaping 2D matrices.
  • Final Review
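
To give a feel for the thread hierarchy covered in the custom kernels module, here is a minimal CPU-only sketch of how a one-dimensional launch maps threads to array elements. It is not workshop material: the function names `blocks_needed` and `simulate_add_one` are hypothetical, and plain Python loops stand in for the GPU's blocks and threads. In a real Numba kernel the same global index is obtained with `numba.cuda.grid(1)`.

```python
# CPU-only illustration of CUDA's 1D thread indexing; no GPU or Numba needed.
# All names here are illustrative assumptions, not part of the course code.

def blocks_needed(n_items, threads_per_block):
    """Ceiling division: smallest block count whose threads cover n_items."""
    return (n_items + threads_per_block - 1) // threads_per_block


def simulate_add_one(data, threads_per_block=4):
    """Mimic a 1D kernel launch that adds 1 to each element of data."""
    out = list(data)
    n_blocks = blocks_needed(len(out), threads_per_block)
    for block_idx in range(n_blocks):                # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):  # plays the role of threadIdx.x
            # Global index: block offset plus position within the block.
            i = block_idx * threads_per_block + thread_idx
            if i < len(out):  # guard: the last block may have surplus threads
                out[i] += 1
    return out


result = simulate_add_one([10, 20, 30, 40, 50, 60])
print(result)
```

With six elements and four threads per block, two blocks are launched and the last two threads of the second block do nothing, which is why the bounds check is needed. On a GPU the two loops disappear: every (block, thread) pair runs concurrently.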