Warehouse-Scale Computers
- Warehouse-sized computers, pioneered by Luis Barroso, are essentially clusters of commodity PCs connected together with ethernet, offering a scalable and relatively inexpensive computing solution. 4m59s
- Warehouse-scale computers were initially built with cheap components, but to improve application development, they now utilize expensive, high-bandwidth networks similar to those in supercomputers. 7m17s
- A typical rack in such a system consists of 20-40 servers, each with a multi-core CPU, connected via a top-of-rack switch, with the number of servers limited by power capacity. 8m40s
- A notable bottleneck in these systems is the network bandwidth, which can be significantly lower than local disk access speeds, especially between racks. 12m25s
- Network bandwidth in warehouse-scale computers has significantly improved, reaching speeds comparable to solid-state drives (around 2 gigabytes per second). This means accessing data from a remote node's disk can be as fast as accessing it locally. 14m18s
Distributed File Systems
- Distributed file systems like Google File System (GFS) and Hadoop Distributed File System (HDFS) are used to store data persistently in a distributed system. These systems are designed for large files and primarily handle appending data and reading it, with infrequent in-place updates. 19m4s
- A distributed file system is introduced, which divides large files into chunks or blocks, typically ranging from 64 to 256 megabytes in size. 20m18s
- These blocks are replicated and distributed across multiple racks to prevent data loss in case of failures. 20m48s
- A master node manages the metadata and directory of the replicas, providing a global namespace for the distributed file system. 21m1s
Data-Parallel Programming
- Data parallel programming ideas can be used to program distributed computers, which are computers composed of multiple separate operating system instances. 53s
- One of the main reasons to use a cluster of machines is to gain IO bandwidth, especially when processing huge amounts of data (hundreds of terabytes). 3m47s
- Communication between nodes with different operating systems in a distributed system is achieved through message passing, which involves sending data between threads in different address spaces. 14m57s
- Message Passing Interface (MPI) is discussed as a method for distributed data-parallel computing, but it is noted for its complexity and lack of inherent fault tolerance. 26m43s
- The map function, used in data-parallel operations, is highlighted for its ease of parallelization due to the lack of dependencies between elements and its side-effect-free nature, which ensures that the input remains unchanged. 28m37s
MapReduce
- The map-reduce programming model is explained, with the mapper function processing individual log file lines to identify entries from mobile clients and update a result map, while the reducer function aggregates values associated with unique keys to produce a final sum. 31m9s
- MapReduce jobs can be used to count word occurrences in a file, with one map task per block of the input file. 34m46s
- The parallelism in reducer tasks comes from associating a certain number of keys to each reducer task. 36m17s
- To ensure a correct reduction when parallelizing across keys, all keys of the same value must be sent to the same reducer task. 37m36s
- A hash function based on key value can be used to determine where data should be sent for reduction. 42m36s
- A job scheduler can exploit data locality by running mapper jobs close to input blocks and reducer jobs near where most of the data is located. 46m22s
- If a mapper node fails, the system can recover by running the tasks on a different node with access to the replicated data blocks. 47m20s
- MapReduce can handle slow or failing machines by replicating tasks and using the results from the first completed task, while terminating the others. 49m6s
- MapReduce is easy to understand and implement, but has limitations in its programming model, only allowing for a linear arrangement of map and reduce functions. 52m8s
- MapReduce is inefficient for iterative algorithms, like PageRank, and for handling ad hoc queries on large datasets, as both scenarios require frequent access to the distributed file system. 53m17s
Spark
- A 2011 paper argued that data locality in data centers was becoming irrelevant for two reasons: network bandwidth was increasing, and most big data application working sets could fit in memory. 55m32s
- A potential problem with using memory instead of storage systems for intermediate data is the loss of computation in case of power failure. 58m41s
- Spark addresses the problem of fault-tolerant, in-memory, distributed computing by introducing the concept of a resilient distributed dataset (RDD). 1h1m23s
- Resilient Distributed Datasets (RDDs) are immutable, ordered collections of records, created through transformations from other RDDs or persistent storage. 1h1m46s
- RDDs can be created from data in a distributed file system and transformed using operations like filter, map, and reduceByKey. 1h2m21s
- The
persistcommand can be used to keep an RDD in memory for efficient reuse in subsequent operations. 1h6m3s - Resilient Distributed Datasets (RDDs) can be implemented more efficiently by considering dependencies and applying optimizations like fusion and tiling. 1h13m20s
- Narrow dependencies exist when an RDD partition depends on only one other partition, allowing for automatic fusion by the system. 1h16m7s
- Wide dependencies occur when an RDD partition depends on multiple other partitions, such as in a group by key operation, making automatic fusion more challenging. 1h16m28s








