Week 1: Introduction to Hadoop, HDFS, and YARN
The Big Data Challenge: Word Counting Example
Let's start with a practical example to understand the challenges of big data processing:
Word Counting in Textual Data
- Input: A data set containing several documents
- Task: Count the frequency of words appearing in the data set
- Simple solution (see the code sketch after this list):
- Initialize a dictionary (or a map structure) to store the results
- For each file, use a file reader to get the texts line by line
- Tokenize each line into words
- For each word, increase its count in the dictionary
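A minimal single-machine sketch of this approach, written here in Java (the input directory, file format, and tokenization rule are illustrative assumptions, not part of the original example):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LocalWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<>();   // dictionary: word -> frequency

        // Collect all regular files under the input directory ("data/" is illustrative)
        List<Path> files;
        try (Stream<Path> stream = Files.walk(Paths.get("data/"))) {
            files = stream.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        for (Path file : files) {
            // Read each file line by line
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                // Tokenize the line into words (simple rule: split on non-letter characters)
                for (String word : line.toLowerCase().split("[^a-z]+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1L, Long::sum);   // increase the word's count
                    }
                }
            }
        }

        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

This works as long as both the input and the dictionary fit on one machine, which is exactly the assumption that breaks down at big data scale.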
Distributed Word Count
To handle large datasets, we need to distribute the processing:
Figure 1: Distributed Word Count Process
Challenges in Distributed Processing:
- Where to store the huge textual data set?
- How to split the data set into different blocks?
- How many blocks?
- What should be the size of each block?
- How to handle node failures or data loss?
- How to manage network connectivity issues?
Complexities of Distributed Processing
- Task assignment: How to efficiently assign tasks to different workers?
- Fault tolerance: What happens if tasks fail?
- Result aggregation: How do workers exchange and combine results? (see the sketch after this list)
- Synchronization: How to coordinate distributed tasks across different workers?
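To make the aggregation question concrete, here is a small, hedged sketch (plain Java, not Hadoop code) of one possible merge step: each worker produces a partial dictionary for the block of documents it processed, and a coordinator combines the partial dictionaries into the global result. The worker data shown is invented for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergePartialCounts {
    // Combine per-worker partial counts into one global dictionary
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two illustrative partial results from two workers
        Map<String, Long> worker1 = Map.of("big", 3L, "data", 5L);
        Map<String, Long> worker2 = Map.of("data", 2L, "hadoop", 4L);
        // Prints big=3, data=7, hadoop=4 (iteration order may vary)
        System.out.println(merge(List.of(worker1, worker2)));
    }
}
```

Even this tiny sketch raises the questions above: who runs the merge, what happens if a worker never reports back, and how the partial results travel over the network.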
Big Data Storage Challenges
- Massive data volumes: The full data set (or even a single large file) might not fit on one machine
- Reliability: Storing petabytes (PBs) of data reliably is challenging
- Failure handling: Need to manage various types of failures (Disk/Hardware/Network)
- Increased failure probability: More machines mean higher chances of component failures
Part 1: Introduction to Hadoop
1. Key Features of Hadoop:
- Massively scalable and automatically parallelizable
- Based on work from Google:
- Google: GFS (Google File System) + MapReduce + BigTable (Not open-source)
- Hadoop: HDFS (Hadoop Distributed File System) + Hadoop MapReduce + HBase (open-source)
- Named by Doug Cutting in 2006 (while working at Yahoo!), after his son's toy elephant
2. What Hadoop Offers:
- Redundant, Fault-tolerant data storage
- Parallel computation framework
- Job coordination
3. Why Use Hadoop?
- Cheaper: Scales easily to petabytes or more using commodity hardware
- Faster: Enables parallel data processing
- Better: Suited for particular types of big data problems
4. Evolution of Hadoop Versions
Hadoop 1.x
- Data storage (HDFS):
- Runs on commodity hardware (usually Linux)
- Horizontally scalable
- Processing (MapReduce):
- Parallelized (scalable) processing
- Fault Tolerant
- Other Tools/Frameworks: HBase, Hive, etc.
Hadoop 2.x
Introduced YARN (Yet Another Resource Negotiator):
- A resource-management platform responsible for managing computing resources in clusters
- Enables a Multi-Purpose Platform supporting Batch, Interactive, Online, and Streaming applications
Hadoop 3.x
Key improvements in Hadoop 3.x:
- Minimum Java version: Java 8 (up from Java 7 in Hadoop 2.x); later 3.x releases also support Java 11
- Storage Scheme: Introduced erasure coding in HDFS
- Fault Tolerance: Erasure coding provides more storage-efficient fault tolerance than plain replication
- Storage Overhead: Reduced from 200% to about 50% (see the worked example after this list)
- Scalability: Improved to support more than 10,000 nodes in a cluster
- NameNodes: Support for multiple standby NameNodes
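To see where the overhead figures come from: with the default 3-way replication, every block is stored three times, so 1 GB of data occupies 3 GB of raw disk, an extra 200%. With a Reed-Solomon RS(6,3) policy (a common HDFS erasure coding scheme), every 6 data blocks are protected by 3 parity blocks, so 6 GB of data occupies 9 GB of raw disk, an extra 50%, while still tolerating the loss of any 3 of the 9 blocks.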
5. Hadoop Ecosystem
The Hadoop ecosystem is a combination of technologies that work together to solve big data problems:
Figure 2: The Hadoop Ecosystem
Part 2: HDFS (Hadoop Distributed File System)
1. Introduction to File Systems
A file system defines the methods and data structures that an operating system uses to keep track of files on a disk or partition.
2. Latency and Throughput
Understanding these concepts is crucial for HDFS design:
- Latency: Time required to perform an action or produce a result
- Measured in time units (e.g., seconds, milliseconds)
- Example: I/O latency is the time to complete a single I/O operation
- Throughput: Number of actions executed or results produced per unit of time
- Measured in units produced per time unit
- Example: Disk throughput is the maximum rate of sequential data transfer (e.g., MB/sec)
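As a rough worked example of the difference: a spinning disk with about 10 ms of seek and rotational latency per random I/O can complete only on the order of 100 random operations per second, yet the same disk can stream sequential data at well over 100 MB/sec. High latency per operation can therefore coexist with high throughput, provided the work is organized as large sequential transfers; this observation drives much of the HDFS design discussed next.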
3. HDFS: The Hadoop Distributed File System
HDFS solves the data movement problem with a simple principle: move the computation to where the data is stored, instead of moving the data to the computation.
Why This Approach?
- Not enough RAM to hold all data in memory
- Disk access is slow (high latency), but disk throughput is reasonable
HDFS Design Goals:
- Handle very large datasets (10K nodes, 100 million files, 10PB)
- Streaming data access (optimized for batch processing, high throughput)
- Simple coherency model (write-once-read-many)
- "Moving computation is cheaper than moving data"
- Portability across heterogeneous hardware and software platforms
- Fault tolerance on commodity hardware
4. HDFS Architecture
Figure 3: HDFS Architecture
Key components:
- NameNode:
- Manages the file system namespace
- Regulates access to files by clients
- Executes file system operations
- DataNodes:
- Store and retrieve blocks
- Report back to the NameNode
- Secondary NameNode:
- Performs periodic checkpoints of the namespace
- Helps NameNode to restart faster
Figure 4: Communication between NameNode and DataNode
5. HDFS Data Replication
- HDFS replicates file blocks for fault tolerance
- The NameNode makes all decisions regarding replication
- Default replication factor is 3 (configurable; see the sketch below)
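As a minimal sketch (the path and factor below are illustrative), the replication factor can be set cluster-wide through the dfs.replication property or changed per file through the FileSystem API; the same change can also be made from the command line with hdfs dfs -setrep.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // default factor applied to newly created files

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (path and factor are illustrative)
        fs.setReplication(new Path("/data/words.txt"), (short) 2);
        fs.close();
    }
}
```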
6. HDFS Read and Write Operations
Figure 5: File Read Data Flow in HDFS
Figure 6: File Write Data Flow in HDFS
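Both data flows can also be driven from client code. Below is a hedged sketch using the Hadoop FileSystem API (the file path is illustrative, and fs.defaultFS is assumed to point at the cluster): the client asks the NameNode for metadata, while the actual block bytes stream directly between the client and the DataNodes.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write: the NameNode chooses DataNodes; bytes flow through the write pipeline
        try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations; the client reads from DataNodes
        try (FSDataInputStream in = fs.open(new Path("/tmp/hello.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```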
7. HDFS Fault Tolerance
HDFS is designed with the assumption that failure is the norm, not the exception:
- DataNode failures are detected through missed heartbeats
- The NameNode initiates block re-replication when necessary
- The system can tolerate and recover from various types of failures
Part 3: YARN (Yet Another Resource Negotiator)
1. Why YARN?
In Hadoop 1.x, MapReduce handled both processing and resource management. This led to limitations in scalability and flexibility.
2. Introduction to YARN
Key motivations for YARN:
- Flexibility: Enable data processing models beyond MapReduce
- Efficiency: Improve performance and Quality of Service
- Resource Sharing: Support multiple workloads in the same cluster
3. YARN Components:
- ResourceManager: Arbitrates resources among all applications
- NodeManager: Per-node agent managing containers and reporting resource usage
- ApplicationMaster: Per-application component negotiating resources and working with NodeManagers
- Container: Unit of allocation for resources (CPU, memory, disk, network)
4. YARN Application Workflow
- Client submits an application
- ResourceManager allocates a container for the ApplicationMaster
- ApplicationMaster registers with ResourceManager
- ApplicationMaster negotiates resources
- ApplicationMaster launches containers on NodeManagers
- Application code executes in containers
- Client communicates with ApplicationMaster for status updates
- ApplicationMaster unregisters and shuts down upon completion
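The client-visible parts of this workflow (submitting the application and polling its status) can be sketched with the YarnClient API. This is a minimal, hedged illustration: the ApplicationMaster command is just a placeholder shell command and the resource numbers are arbitrary; a real application would ship its own ApplicationMaster.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleYarnClient {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient client = YarnClient.createYarnClient();
        client.init(conf);
        client.start();

        // Ask the ResourceManager for a new application
        YarnClientApplication app = client.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");

        // Describe the container that will run the ApplicationMaster
        // (placeholder command; a real ApplicationMaster would be a proper program)
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                   // local resources
                Collections.emptyMap(),                   // environment variables
                Collections.singletonList("sleep 30"),    // launch command(s)
                null, null, null);
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(512, 1));    // 512 MB, 1 vcore (arbitrary)

        // Submit; allocation, negotiation, and container launch happen inside the cluster
        ApplicationId appId = client.submitApplication(ctx);

        // Poll the ResourceManager until the application reaches a terminal state
        YarnApplicationState state;
        do {
            Thread.sleep(1000);
            ApplicationReport report = client.getApplicationReport(appId);
            state = report.getYarnApplicationState();
            System.out.println(appId + " is " + state);
        } while (state != YarnApplicationState.FINISHED
                && state != YarnApplicationState.FAILED
                && state != YarnApplicationState.KILLED);

        client.stop();
    }
}
```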