Introduction and Motivation
- The presenter starts by asking the audience about their knowledge of GPUs and then shifts to a tangent about two cars, an MG MGB GT and a modern Volvo, to illustrate the difference in complexity and repairability between old and new technology 19s.
- The presenter notes that despite the advancements in technology, the way humans interact with these vehicles has not changed much, and this idea will be relevant throughout the talk 1m9s.
- The presenter shares their personal experience of starting a new job and being tasked with making a program faster, despite having no prior knowledge of GPUs or CUDA 1m45s.
- The presenter's initial attempt at writing a CUDA program resulted in a simple function that sometimes worked and sometimes didn't, depending on the hardware and software 2m31s.
Initial GPU Programming Challenges
- The presenter is puzzled by the inconsistent behavior of the program and sets out to answer two questions: why the program sometimes works and why the otherwise performant program is slow 3m12s.
- The presenter briefly mentions the differences between CPUs and GPUs, noting that CPUs typically have many threads that run, but does not fully elaborate on this point 3m32s.
Understanding GPUs and Concurrency
- Writing code for GPUs is declarative, where the programmer specifies what they want to achieve, and the hardware decides how to execute it, often implicitly making the code concurrent 3m54s.
- A CPU is like an office where multiple people work on relatively independent tasks, whereas a GPU is like a factory with specialized equipment for specific tasks, doing one thing very well 4m23s.
- In the factory analogy, storage space holds raw materials, much like memory in computing, and the dedicated space for each person or team can be thought of as a shared workspace 4m49s.
- Implicit concurrency in GPUs is achieved by specifying a small program that divides up a range and expects it to run in parallel, with each thread processing a portion of the range 5m41s.
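The range-splitting idea above can be sketched on the CPU. This is a minimal Python model, not CUDA: the function names, the strided assignment of indices, and the doubling "kernel" are all illustrative assumptions, and the simulated threads run one after another rather than in parallel.

```python
def indices_for_thread(thread_id, num_threads, n):
    """Indices handled by one thread when n items are split with a fixed stride."""
    # Thread t handles t, t + num_threads, t + 2 * num_threads, ...
    return range(thread_id, n, num_threads)


def run_all(n, num_threads):
    """Apply a tiny 'kernel' (doubling) to every element, one simulated thread at a time."""
    data = list(range(n))
    out = [0] * n
    for t in range(num_threads):  # on a GPU these would run concurrently
        for i in indices_for_thread(t, num_threads, n):
            out[i] = 2 * data[i]
    return out
```

Every index belongs to exactly one thread, so the order in which the threads run does not change the result.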
Streams and Data Processing
- The concept of a moving assembly line, introduced by Henry Ford, revolutionized manufacturing by having workers stay fixed while the pieces moved, and this idea is preferred in computing when possible 6m37s.
- In computing, the moving assembly line concept is preferred when hardware can perform specific tasks with no dependency between each other, and this is encapsulated via the notion of a stream 7m7s.
- A stream is an ordered sequence, and this concept is important in understanding how GPUs process data 7m9s.
- GPUs can handle tasks in a consistent way through the use of streams, which allow for the execution of logically consistent sets of operations, enabling the hardware to be utilized more efficiently when tasks spend most of their time reading and writing from memory 7m12s.
GPU Performance and Big O Notation
- A high computation to data ratio is ideal for GPUs, as seen in the example of matrix multiplication, where the computation time is O(n^3) and the data is O(n^2), resulting in a significant speedup when using GPUs 8m15s.
- However, the Big O notation can be misleading, as asymptotically better algorithms may not be faster on GPUs due to the importance of constants in these settings 8m57s.
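The computation-to-data ratio can be worked out with a little arithmetic. The sketch below assumes the naive matrix multiplication algorithm on square n x n matrices and counts one multiply and one add per inner-loop step; the function names are illustrative, and the vector-addition case is added only as a contrast.

```python
def matmul_intensity(n):
    """Arithmetic operations per matrix element touched for a naive n x n multiply."""
    flops = 2 * n**3      # n^3 multiplies plus n^3 adds
    elements = 3 * n**2   # read A and B, write C
    return flops / elements  # simplifies to 2n / 3


def vector_add_intensity(n):
    """Same ratio for adding two length-n vectors: one add per three elements."""
    return n / (3 * n)
```

Under these assumptions the ratio for matrix multiplication grows linearly with n, while for vector addition it stays fixed at one third no matter how large the input gets.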
Memory Management in GPUs
- The goal is to write a memcpy-style function that can handle any type of copy efficiently, regardless of whether the memory is on the CPU or GPU 9m17s.
- There are three types of memory to consider on a GPU: memory allocated with cudaMalloc, which is device memory that stays in the same location; memory allocated through a second CUDA allocation call, which is placed in system RAM; and memory that comes from other sources 9m51s.
- Memory from cudaMalloc is special: it always stays resident on the device and never migrates, so moving it to a different location requires an explicit decision by the programmer 10m9s.
- Unified memory models in GPUs allow for direct access to system memory by the graphics card, making it a desirable feature 10m44s.
Unified Memory Model
- Managed memory or unified memory is a type of memory allocation where the physical storage may be either on a hardware device (GPU) or in system memory, and its location can change depending on the program's needs 11m4s.
- This type of memory allocation is achieved through a series of system calls, including opening a file descriptor for a special device provided by Nvidia, allocating a large slab of memory via mmap, and deallocating memory 12m5s.
- The Nvidia-provided device serves as an interface between system memory and device memory, allowing for the allocation and deallocation of memory 12m11s.
- When allocating memory, the program specifies the address where the memory should exist, ensuring that the pointers are the same across both the system and the device 13m10s.
- Deallocating memory involves remapping the memory and freeing what's left over, which can be seen in the Nvidia System Profiler 13m38s.
- The Nvidia System Profiler is a useful tool for understanding what happens when using unified memory models, showing the program's memory allocation and deallocation calls 14m3s.
System Calls and Page Faults
- Using tools like strace can help understand the underlying system calls and memory allocation processes involved in unified memory models 11m49s.
- When a program runs, it immediately encounters a page fault due to the way the driver enforces memory allocation, resulting in the allocation of physical storage and handling of the page fault, which can take a significant amount of time 14m29s.
- The time it takes to handle the page fault can be substantial, and in some cases, it can take up a great deal of the program's running time, especially when allocating more memory 15m12s.
- The hardware tries to help by allocating more memory each time a page fault occurs, resulting in an irregular pattern of page faults with increasing spacing between them 15m32s.
- This pattern repeats, but after a certain point, the hardware seems to forget that the program is running into page faults and stops helping, resulting in a continued pattern of page faults 15m59s.
Data Movement and Coherency
- The physical location of data in unified memory models is invisible to a program and may be changed at any time, even if the program is accessing the data through a shared pointer 16m27s.
- This is a rare occurrence in programming, where memory can move in a way that is entirely opaque and matters significantly, unlike caches where data is just copied and stored 16m57s.
- The access to data through a virtual address will remain valid and coherent from any processor, regardless of locality, which is necessary due to the complexity of the system 17m23s.
- The requirement for accesses to remain valid has significant implications for the performance of programs using unified memory models 17m45s.
Memory Copy Performance Issues
- Existing functions, such as cudaMemcpy, can be used to copy data in CUDA, which may be a solution to the issues encountered with unified memory models 17m57s.
- A memory copy operation is performed on the CPU, which should be fast since the pointers are accessible via the CPU, but it falls into a pitfall, resulting in poor performance 18m5s.
- The performance issue is illustrated in a graph, showing a significant decrease in throughput as the size of the data being copied increases, with a drop from over 600 megabytes per second to around 0.085 megabytes per second 18m25s.
- The poor performance is caused by CPU page faults, which occur when the CPU tries to access a page of memory that is not currently available to it, triggering a request to the GPU to allocate space 19m44s.
- The page faults are caused by the interaction between the device (graphics card) and the CPU, which is limited by the page size of the CPU, resulting in a large number of page faults (around 37,000) 20m32s.
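A rough count shows why the CPU page size matters so much here. The sketch below assumes the worst case in which every page of the buffer faults exactly once, and it uses 4 KiB as the default page size, which is typical for x86-64 CPUs. The 128 MiB buffer in the usage note is a hypothetical size, since the notes do not say how large the copy behind the 37,000 faults was.

```python
def cpu_page_faults(copy_bytes, page_bytes=4096):
    """Upper bound on faults if every page of the buffer faults once."""
    return -(-copy_bytes // page_bytes)  # ceiling division


MIB = 1024 * 1024
small_pages = cpu_page_faults(128 * MIB)            # 4 KiB pages
large_pages = cpu_page_faults(128 * MIB, 2 * MIB)   # 2 MiB pages
```

For the hypothetical 128 MiB buffer this gives 32,768 faults with 4 KiB pages against 64 with 2 MiB pages, which is the kind of gap that makes the per-fault cost dominate.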
Page Faults and Performance Impact
- Each page fault results in a syscall, which is expensive and leads to poor performance; each fault covers only 4 kilobytes, so a larger page size may improve matters 21m15s.
- The page faults occur regularly and repeatedly, resulting in a significant impact on performance, as shown in a zoomed-in view of the profile 22m0s.
- The GPU's hardware attempts to help with memory management by transferring more memory than requested, resulting in a pattern of page faults and speculative prefetches that repeat and are unequally spaced 22m17s.
- Each red line represents a page fault for 4K, while purple lines represent a speculative prefetch, with the width of each block growing as the number of page faults increases 22m53s.
- The hardware handles this process automatically, without the need for user intervention, and the amount of memory transferred can range from 64 kilobytes to 2 MB 23m2s.
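The growing blocks in the profile can be approximated with a toy model. The sketch below assumes that each fault in a run brings in a block twice as large as the previous one; the notes only say the blocks grow, so the doubling rule is an assumption, as is the function name. The starting size of 64 KiB comes from the notes, and the cap is a parameter that defaults to 2 MiB, the GPU page size mentioned elsewhere in these notes.

```python
def faults_with_growing_prefetch(total_bytes, first=64 * 1024, cap=2 * 1024 * 1024):
    """Count faults when each fault moves a block twice the size of the last, up to cap."""
    moved, block, faults = 0, first, 0
    while moved < total_bytes:
        faults += 1
        moved += block
        block = min(block * 2, cap)  # assumed growth rule: double until the cap
    return faults


MIB = 1024 * 1024
with_prefetch = faults_with_growing_prefetch(16 * MIB)
without_prefetch = faults_with_growing_prefetch(16 * MIB, first=4096, cap=4096)
```

Under this model a 16 MiB region needs 13 faults instead of the 4,096 it would take at a fixed 4 KiB per fault.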
CUDA Managed Memory and Prefetching
- Using CUDA's managed memory with specified copying from the device to the CPU can result in slightly faster performance, but still exhibits a stepwise curve due to prefetches 23m41s.
- Copying from the device to the host results in a faster line, reaching up to 10 GB per second, due to the GPU's capacity for dealing with larger pages up to 2 megabytes 24m15s.
Using CUDA Functions for Memory Management
- It is recommended to use CUDA functions instead of standard functions when writing code that needs to deal with memory management generically, as they can provide faster performance 24m40s.
- Using CUDA functions can give roughly a 2x speedup over standard functions, even in the slowest case, with a simple one-line change 24m50s.
Memory Migration and Performance
- The cost of memory management comes from handling page faults and physical memory mapping, leading to the question of whether moving memory to the same location before copying would be more efficient 25m21s.
- CUDA allows for memory migration, which can potentially improve performance by eliminating the need for copying between GPU and CPU 25m41s.
- Unified memory lets a program move memory between devices and specify how many bytes to move, which makes it possible to split an array across multiple devices, including the CPU and GPU, and opens up a different type of parallelism 26m13s.
- This parallelism allows for updates to be made on an array on the GPU and CPU simultaneously, thanks to the guarantees provided by the unified memory model 26m45s.
Prefetching and Performance Bottlenecks
- Prefetching the source pointer to the device can result in a significant improvement in performance, with a flat line of around 16 GB per second, but this performance drops drastically when the destination array can no longer fit in memory 27m1s.
- The performance drop is due to the occurrence of page faults, which happen when the system tries to write to or read something that is not already in memory 27m38s.
- Prefetching both pointers ahead of time can yield a large performance gain, but performance also drops sharply once neither pointer fits in memory anymore 28m3s.
- The system's ability to manage memory proactively is compromised when both pointers are prefetched, leading to poor performance 28m34s.
- The system's decision on what to evict from memory can be flawed, leading to poor performance, as it may evict important data 29m19s.
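The sharp drop once the data no longer fits can be reproduced with a small simulation. The sketch below assumes a least-recently-used eviction policy and a program that sweeps over its pages in order several times. The notes do not say which policy the driver actually uses, so LRU is only a stand-in, and the function name and page counts are illustrative.

```python
from collections import OrderedDict


def miss_rate(num_pages, capacity, passes=4):
    """Fraction of accesses that miss when sweeping num_pages through an LRU of `capacity` pages."""
    resident = OrderedDict()
    misses = accesses = 0
    for _ in range(passes):
        for page in range(num_pages):
            accesses += 1
            if page in resident:
                resident.move_to_end(page)        # mark as most recently used
            else:
                misses += 1
                if len(resident) >= capacity:
                    resident.popitem(last=False)  # evict the least recently used page
                resident[page] = True
    return misses / accesses
```

With 8 pages and room for 8, only the first sweep misses, a rate of 0.25 over four sweeps. With 9 pages and room for 8, every access misses, because the page about to be needed is always the one that was just evicted.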
Managing Memory Accesses
- A significant issue with unified memory models in GPUs is the constant reading of data from memory that is not necessary, resulting in continued page faults for reads and writes, indicating the need for more careful management of memory copies 29m38s.
- A solution to this issue is to manage memory accesses more efficiently, as shown in a profile from a running program where memory accesses are optimized to reduce bandwidth consumption and page faults 30m3s.
- Memory that fills the device through cudaMalloc is never evicted, so allocating more card memory with the cudaMalloc API can shift the graph arbitrarily far to the left 30m45s.
- Moving large amounts of data, such as 512 megabytes, repeatedly can cause slow bandwidth due to constant eviction of necessary data 30m56s.
Optimizing Memory Copies
- To manage memory copies more carefully, it is necessary to define a prefetch size, create multiple streams to queue operations properly, and calculate the number of prefetches needed 31m23s.
- A loop can be used to start the copy process, prefetching the previous data and sending it back to the CPU, and then prefetching the blocks needed from system memory back onto the card 31m49s.
- The CPU needs to be involved in remapping data back into its own memory, and this can be done by explicitly sending the data back to the CPU and using a different stream for prefetching 32m30s.
- Using a different stream for prefetching may not significantly improve performance, but it is still a better approach 33m1s.
- By combining various prefetches, sending data back to its original location, and managing memory, performance can be significantly improved, with the custom code potentially beating standard CUDA functions for certain tasks 33m56s.
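The loop described above can be laid out as a plan of steps. The sketch below is a Python model of the ordering only, not the speaker's code: it computes the number of prefetches from a chosen prefetch size and lists, chunk by chunk, when data is brought onto the card, copied, and sent back to the CPU. The step names and the function name are hypothetical.

```python
def plan_chunked_copy(total_bytes, prefetch_bytes):
    """Return the ordered steps for copying total_bytes in prefetch-sized chunks."""
    num_chunks = -(-total_bytes // prefetch_bytes)  # ceiling division
    steps = []
    for i in range(num_chunks):
        if i > 0:
            # Send the previous chunk back to system memory to free card memory.
            steps.append(("prefetch_to_cpu", i - 1))
        # Bring the blocks needed for this chunk onto the card, then copy them.
        steps.append(("prefetch_to_gpu", i))
        steps.append(("copy", i))
    if num_chunks:
        steps.append(("prefetch_to_cpu", num_chunks - 1))  # return the final chunk
    return steps
```

In a CUDA program the prefetch steps would presumably map to cudaMemPrefetchAsync calls and the copy to cudaMemcpyAsync, issued on separate streams. The notes do not name the calls, so that mapping is an assumption.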
Complexity of Unified Memory
- The use of a single pointer across devices and system memory was initially thought to simplify things, but it has proven to be more complicated than expected 34m43s.
- The complexity of managing memory and pointers is due in part to the fact that computers are still being programmed with concepts and tools designed 50 years ago, such as those used for the PDP-11 35m27s.
- The idea of changing programming languages to express hardware more explicitly was discussed in the 1990s, but it was met with resistance, and the tools used today are still largely the same 35m46s.
- It is impossible for a compiler to statically determine whether a function or code will work due to the inability to succinctly express the properties of a pointer, making it difficult to ensure safe and reliable operation 36m7s.
- The hardware is capable of handling complex memory operations, but the lack of explicit expression of pointer properties makes it difficult to write safe and reliable code 36m30s.
- The resistance to change in programming tools and concepts has not been beneficial, and it is time to reconsider the way computers are programmed to better match the capabilities of modern hardware 36m43s.
Profiling and Code Simplicity
- To understand performance issues in code, it's essential to profile the code using different tools and try to understand why things are happening the way they are, as there is always a reason for it, no matter how complex it may seem 36m50s.
- When writing code, it's crucial to consider simplifying things to make it easier for others to understand, and to prioritize performance when in doubt 37m39s.
Unified Memory: Benefits and Drawbacks
- Unified memory models in GPUs, sold by companies like Nvidia and AMD, can make it easier to port old CPU applications to GPUs, but they may not be worth using in performance-critical environments 38m22s.
- Unified memory models can be useful for migration or compatibility reasons, or when more memory is needed than is available on the GPU, but they may not be better than manually handling memory copies 38m48s.
- In some cases, the overhead of unified memory models may not be related to page faults, but rather to coherency issues, where the device hardware needs to ensure a consistent view of memory across the system 39m40s.
- Even if the same page is requested multiple times on the same device, the hardware may still incur overhead to ensure coherency, and counting page faults may not be enough to overcome these issues 40m11s.








