YouTube video summary

Stanford CS149 I Parallel Computing I 2023 I Lecture 4 - Parallel Programming Basics

Technology15 Sep 20246 min summaryFrom Stanford Online
Stanford CS149 I Parallel Computing I 2023 I Lecture 4 - Parallel Programming Basics
Stanford Online
YouTube

Parallel Programming Concepts

  • There are four concepts to understand: multicore, SIMD, multi-threading, and super scaler. 30s
  • An ispc function differs from a normal C function in that it runs multiple copies of the same function with a different program index (IND) for each copy. 4m35s

ispc Function Execution

  • Each copy of the function runs sequentially, and the return value is given when all copies have completed. 5m35s
  • Gang size is determined at compile time and influences the width of the vector instructions used. 6m21s

ispc Compiler Optimization

  • An implementation of the ispc_sinx function could theoretically use a for loop to iterate through program instances, but the ispc compiler utilizes vector instructions for efficiency. 7m56s
  • The ispc compiler does not spawn threads and the gang size can impact performance. 14m13s

Work Assignment in ispc

  • The speaker presents two different programs that achieve the same outcome but with different mappings of program instances to work, highlighting the programmer's control over work assignment. 12m21s
  • The first program is preferred because it accesses consecutive indices in memory, which are likely to be on the same cache line, making it more efficient. 13m10s

High-Level Programming Constructs

  • The for each construct in ispc is a high-level mechanism that allows the compiler to decide how to map iterations onto program instances for good parallelism, potentially leading to better performance than manually specifying the mapping. 17m29s
  • Programmers should utilize higher-level programming constructs whenever possible, as they often provide efficient solutions. If necessary, lower-level optimizations can be explored. 18m46s

Parallelizing Code with Independent Iterations

  • The ispc compiler excels at parallelizing code with independent iterations, such as applying the absolute value to elements of an array. However, it's crucial to ensure the program's logic remains valid when iterations are executed out of order. 20m49s
  • Attempting to compute a sum using a per-program-instance variable within a for each loop in ispc leads to incorrect results and compilation errors. This is because each program instance would have its own copy of the sum variable, and returning a value from multiple instances is not supported. 23m49s

Race Conditions in Parallel Programming

  • There is a potential race condition in a program when multiple program instances attempt to simultaneously update a single sum variable using the += operator. 24m38s
  • The ispc library provides a specialized cross-instance function to safely compute the sum of values from multiple program instances, addressing the race condition by accumulating partial sums locally and then performing a final reduction. 25m21s

SIMD Vector Instructions

  • The ispc compiler can transform code with cross-lane operations into a program that utilizes SIMD vector instructions, enabling parallel execution on a single core. 26m2s
  • ispc can generate fast code for operations on vectors, such as adding tensors or mapping a lambda onto a collection. 31m12s

Performance Comparison of Parallel Execution Methods

  • A C program with three tests, where 'do work' is a function that simulates a task, demonstrates the performance differences between sequential execution, spawning a thread per task, and using a thread pool. 34m24s
  • Spawning a thread per task is significantly slower than sequential execution or using a thread pool, especially when the task is computationally inexpensive. 37m28s

ispc Thread Pool

  • The ispc programming language creates a thread pool sized appropriately for the machine and divides up tasks amongst the pool. 38m2s
  • There is a point at which parallelizing a function with small tasks is outweighed by the cost of context switching. 38m52s

Parallel Programming Principles

  • Programmers are generally responsible for decomposing a problem into smaller pieces that can be performed in parallel. 42m32s
  • Amdahl's Law, even with infinite processors, the maximum speedup is limited to 1/s. For example, if 1% of a program is sequential, the maximum speedup is limited to 100x, even with infinite processors. 47m42s

Impact of Sequential Code on Large Systems

  • Impact of Sequential Code on Large Systems: On systems with a large number of processors (e.g., supercomputers), it becomes crucial to minimize the sequential portion of code. If the sequential portion is not sufficiently small, the potential benefits of using such a large system diminish significantly. 48m17s
  • Work Assignment in Parallel Programming: A key aspect of parallel programming is assigning work to different processing units (cores, processors, etc.). This assignment can be done manually by the programmer or automatically by the programming language, compiler, or runtime system. The goal is to keep all processing units busy and maximize efficiency. 48m51s

ispc Task Assignment

  • In ispc, tasks are decomposed and assigned to threads for work. The system manages the assignment of tasks to idle threads, optimizing performance. 50m5s
  • Synchronization is crucial in parallel programming to ensure that tasks are assigned to only one worker and workers receive the next task in order. 51m20s

Mapping Threads to Hardware

  • Mapping threads or program instances to hardware execution contexts is a key aspect of parallel programming. This mapping can be handled by the operating system or, in the case of ispc, by the compiler. 51m52s

Parallelizing the Red-Black Gauss-Seidel Algorithm

  • It is not possible to parallelize the code by running separate chunks concurrently because the values in one chunk depend on the values calculated in previous chunks. 56m35s
  • One way to parallelize the code is to use a checkerboard pattern, updating all the red cells in parallel based on the black cells, and then updating all the black cells in parallel based on the red cells. 59m10s

Work Assignment in the Red-Black Algorithm

  • When parallelizing the checkerboard pattern, there are two assignment options: dividing the grid in a blocked way across processors or dividing them in an interleaved way. 1h1m13s
  • If a blocked assignment of array elements is used in the red-black Gauss-Seidel algorithm, less data needs to be exchanged between processors because neighboring processors already have the information they need. 1h2m4s

Parallel Implementation of the Red-Black Algorithm

  • A parallel implementation of the red-black Gauss-Seidel algorithm can be written using a "for all" loop, where each iteration of the loop can be executed in parallel. 1h3m56s
  • In a lower-level implementation of the algorithm, each thread in a shared address space can be responsible for updating a block of rows, and synchronization constructs like locks and barriers can be used to ensure correctness. 1h5m46s

Synchronization in Parallel Programming

  • Updating a shared variable, such as diff in this example, requires using locks to ensure atomicity and prevent race conditions. 1h8m8s
  • Instead of locking within the innermost loop, it is more efficient to create a private copy of the variable (my_diff), update it without locks, and then accumulate the results into the global variable (diff) at the end of each iteration. 1h12m9s

Barriers in Parallel Programming

  • Barriers are synchronization mechanisms that force all threads to wait at a specific point in the code until all threads have reached that point. This code uses three barriers to ensure that all threads have finished their work before accumulating the results and checking for termination. 1h12m54s
  • All threads must complete the update to diff before any thread checks for system convergence. This ensures the diff value is correct. 1h14m34s
  • A barrier is needed to prevent threads from clearing the diff value before other threads have checked it. 1h15m20s

Contention Reduction in Parallel Programming

  • Using a single global diff variable introduces contention. Replicating the diff variable, similar to using local copies (my_diff), can remove this contention and potentially reduce the number of barriers needed. 1h16m49s
Made with Recall · in 3 seconds

Get a summary like this for anything you read, watch or save.

Recall summarizes any link you paste, then keeps it in your personal library so you can search, chat with it, and never lose a key idea again.

YouTube videosArticlesPodcastsPDFsAnything else
Save this summary

Then save anything you watch or read next.

Bookmark this summary, then save any video, article or PDF you read next.

Save to your library

Ready to get started?

Save, summarize & chat with your content.

GET STARTED

IT'S FREE

No credit card required · 30 Day Refund on Premium · 24 Hour Support

Recall web app on laptop