YouTube video summary

Stanford CS149 I 2023 I Lecture 9 - Distributed Data-Parallel Computing Using Spark

Technology20 Sep 20245 min summaryFrom Stanford Online
Stanford CS149 I 2023 I Lecture 9 - Distributed Data-Parallel Computing Using Spark
Stanford Online
YouTube

Warehouse-Scale Computers

  • Warehouse-sized computers, pioneered by Luis Barroso, are essentially clusters of commodity PCs connected together with ethernet, offering a scalable and relatively inexpensive computing solution. 4m59s
  • Warehouse-scale computers were initially built with cheap components, but to improve application development, they now utilize expensive, high-bandwidth networks similar to those in supercomputers. 7m17s
  • A typical rack in such a system consists of 20-40 servers, each with a multi-core CPU, connected via a top-of-rack switch, with the number of servers limited by power capacity. 8m40s
  • A notable bottleneck in these systems is the network bandwidth, which can be significantly lower than local disk access speeds, especially between racks. 12m25s
  • Network bandwidth in warehouse-scale computers has significantly improved, reaching speeds comparable to solid-state drives (around 2 gigabytes per second). This means accessing data from a remote node's disk can be as fast as accessing it locally. 14m18s

Distributed File Systems

  • Distributed file systems like Google File System (GFS) and Hadoop Distributed File System (HDFS) are used to store data persistently in a distributed system. These systems are designed for large files and primarily handle appending data and reading it, with infrequent in-place updates. 19m4s
  • A distributed file system is introduced, which divides large files into chunks or blocks, typically ranging from 64 to 256 megabytes in size. 20m18s
  • These blocks are replicated and distributed across multiple racks to prevent data loss in case of failures. 20m48s
  • A master node manages the metadata and directory of the replicas, providing a global namespace for the distributed file system. 21m1s

Data-Parallel Programming

  • Data parallel programming ideas can be used to program distributed computers, which are computers composed of multiple separate operating system instances. 53s
  • One of the main reasons to use a cluster of machines is to gain IO bandwidth, especially when processing huge amounts of data (hundreds of terabytes). 3m47s
  • Communication between nodes with different operating systems in a distributed system is achieved through message passing, which involves sending data between threads in different address spaces. 14m57s
  • Message Passing Interface (MPI) is discussed as a method for distributed data-parallel computing, but it is noted for its complexity and lack of inherent fault tolerance. 26m43s
  • The map function, used in data-parallel operations, is highlighted for its ease of parallelization due to the lack of dependencies between elements and its side-effect-free nature, which ensures that the input remains unchanged. 28m37s

MapReduce

  • The map-reduce programming model is explained, with the mapper function processing individual log file lines to identify entries from mobile clients and update a result map, while the reducer function aggregates values associated with unique keys to produce a final sum. 31m9s
  • MapReduce jobs can be used to count word occurrences in a file, with one map task per block of the input file. 34m46s
  • The parallelism in reducer tasks comes from associating a certain number of keys to each reducer task. 36m17s
  • To ensure a correct reduction when parallelizing across keys, all keys of the same value must be sent to the same reducer task. 37m36s
  • A hash function based on key value can be used to determine where data should be sent for reduction. 42m36s
  • A job scheduler can exploit data locality by running mapper jobs close to input blocks and reducer jobs near where most of the data is located. 46m22s
  • If a mapper node fails, the system can recover by running the tasks on a different node with access to the replicated data blocks. 47m20s
  • MapReduce can handle slow or failing machines by replicating tasks and using the results from the first completed task, while terminating the others. 49m6s
  • MapReduce is easy to understand and implement, but has limitations in its programming model, only allowing for a linear arrangement of map and reduce functions. 52m8s
  • MapReduce is inefficient for iterative algorithms, like PageRank, and for handling ad hoc queries on large datasets, as both scenarios require frequent access to the distributed file system. 53m17s

Spark

  • A 2011 paper argued that data locality in data centers was becoming irrelevant for two reasons: network bandwidth was increasing, and most big data application working sets could fit in memory. 55m32s
  • A potential problem with using memory instead of storage systems for intermediate data is the loss of computation in case of power failure. 58m41s
  • Spark addresses the problem of fault-tolerant, in-memory, distributed computing by introducing the concept of a resilient distributed dataset (RDD). 1h1m23s
  • Resilient Distributed Datasets (RDDs) are immutable, ordered collections of records, created through transformations from other RDDs or persistent storage. 1h1m46s
  • RDDs can be created from data in a distributed file system and transformed using operations like filter, map, and reduceByKey. 1h2m21s
  • The persist command can be used to keep an RDD in memory for efficient reuse in subsequent operations. 1h6m3s
  • Resilient Distributed Datasets (RDDs) can be implemented more efficiently by considering dependencies and applying optimizations like fusion and tiling. 1h13m20s
  • Narrow dependencies exist when an RDD partition depends on only one other partition, allowing for automatic fusion by the system. 1h16m7s
  • Wide dependencies occur when an RDD partition depends on multiple other partitions, such as in a group by key operation, making automatic fusion more challenging. 1h16m28s

Upcoming Lectures

  • The next lecture will cover efficient implementation of DNN and will be presented by Kavon. 1h17m26s
  • The following Tuesday's lecture will finish the Spark discussion and begin covering cache coherency. 1h17m38s
Made with Recall · in 3 seconds

Get a summary like this for anything you read, watch or save.

Recall summarizes any link you paste, then keeps it in your personal library so you can search, chat with it, and never lose a key idea again.

YouTube videosArticlesPodcastsPDFsAnything else
Save this summary

Then save anything you watch or read next.

Bookmark this summary, then save any video, article or PDF you read next.

Save to your library
Browse all from Stanford Online →

Ready to get started?

Save, summarize & chat with your content.

GET STARTED

IT'S FREE

No credit card required · 30 Day Refund on Premium · 24 Hour Support

Recall web app on laptop