Week 1: Introduction to Hadoop, HDFS, and YARN
The Big Data Challenge: Word Counting Example
Let's start with a practical example to understand the challenges of big data processing:
Word Counting in Textual Data
- Input: A data set containing several documents
- Task: Count the frequency of words appearing in the data set
- Simple solution (see the code sketch after this list):
- Initialize a dictionary (or a map structure) to store the results
- For each file, use a file reader to get the texts line by line
- Tokenize each line into words
- For each word, increase its count in the dictionary
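A minimal single-machine sketch of this approach, written here in Java (the input directory, file format, and tokenization rule are illustrative assumptions, not part of the original example):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LocalWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<>();   // dictionary: word -> frequency

        // Collect all regular files under the input directory ("data/" is illustrative)
        List<Path> files;
        try (Stream<Path> stream = Files.walk(Paths.get("data/"))) {
            files = stream.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        for (Path file : files) {
            // Read each file line by line
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                // Tokenize the line into words (simple rule: split on non-letter characters)
                for (String word : line.toLowerCase().split("[^a-z]+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1L, Long::sum);   // increase the word's count
                    }
                }
            }
        }

        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

This works as long as both the input and the dictionary fit on one machine, which is exactly the assumption that breaks down at big data scale.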
Distributed Word Count
To handle large datasets, we need to distribute the processing:
Figure 1: Distributed Word Count Process
Challenges in Distributed Processing:
- Where to store the huge textual data set?
- How to split the data set into different blocks?
- How many blocks?
- What should be the size of each block?
- How to handle node failures or data loss?
- How to manage network connectivity issues?
Complexities of Distributed Processing
- Task assignment: How to efficiently assign tasks to different workers?
- Fault tolerance: What happens if tasks fail?
- Result aggregation: How do workers exchange and combine results? (see the sketch after this list)
- Synchronization: How to coordinate distributed tasks across different workers?
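To make the aggregation question concrete, here is a small, hedged sketch (plain Java, not Hadoop code) of one possible merge step: each worker produces a partial dictionary for the block of documents it processed, and a coordinator combines the partial dictionaries into the global result. The worker data shown is invented for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergePartialCounts {
    // Combine per-worker partial counts into one global dictionary
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two illustrative partial results from two workers
        Map<String, Long> worker1 = Map.of("big", 3L, "data", 5L);
        Map<String, Long> worker2 = Map.of("data", 2L, "hadoop", 4L);
        // Prints big=3, data=7, hadoop=4 (iteration order may vary)
        System.out.println(merge(List.of(worker1, worker2)));
    }
}
```

Even this tiny sketch raises the questions above: who runs the merge, what happens if a worker never reports back, and how the partial results travel over the network.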
Big Data Storage Challenges
- Massive data volumes: The full data set (or even a single large file) might not fit on one machine
- Reliability: Storing petabytes (PBs) of data reliably is challenging
- Failure handling: Need to manage various types of failures (Disk/Hardware/Network)
- Increased failure probability: More machines mean higher chances of component failures
Part 1: Introduction to Hadoop
1. Key Features of Hadoop:
- Massively scalable and automatically parallelizable
- Based on work from Google:
- Google: GFS (Google File System) + MapReduce + BigTable (Not open-source)
- Hadoop: HDFS (Hadoop Distributed File System) + Hadoop MapReduce + HBase (open-source)
- Named by Doug Cutting in 2006 (while working at Yahoo!), after his son's toy elephant
2. What Hadoop Offers:
- Redundant, Fault-tolerant data storage
- Parallel computation framework
- Job coordination
3. Why Use Hadoop?
- Cheaper: Scales easily to petabytes or more using commodity hardware
- Faster: Enables parallel data processing
- Better: Suited for particular types of big data problems
4. Evolution of Hadoop Versions
Hadoop 1.x
- Data storage (HDFS):
- Runs on commodity hardware (usually Linux)
- Horizontally scalable
- Processing (MapReduce):
- Parallelized (scalable) processing
- Fault Tolerant
- Other Tools/Frameworks: HBase, Hive, etc.
Hadoop 2.x
Introduced YARN (Yet Another Resource Negotiator):
- A resource-management platform responsible for managing computing resources in clusters
- Enables a Multi-Purpose Platform supporting Batch, Interactive, Online, and Streaming applications
Hadoop 3.x
Key improvements in Hadoop 3.x:
- Minimum Java version: Java 8 (up from Java 7 in Hadoop 2.x); later 3.x releases also support Java 11
- Storage Scheme: Introduced erasure coding in HDFS
- Fault Tolerance: Erasure coding provides more storage-efficient fault tolerance than plain replication
- Storage Overhead: Reduced from 200% to about 50% (see the worked example after this list)
- Scalability: Improved to support more than 10,000 nodes in a cluster
- NameNodes: Support for multiple standby NameNodes
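To see where the overhead figures come from: with the default 3-way replication, every block is stored three times, so 1 GB of data occupies 3 GB of raw disk, an extra 200%. With a Reed-Solomon RS(6,3) policy (a common HDFS erasure coding scheme), every 6 data blocks are protected by 3 parity blocks, so 6 GB of data occupies 9 GB of raw disk, an extra 50%, while still tolerating the loss of any 3 of the 9 blocks.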
5. Hadoop Ecosystem
The Hadoop ecosystem is a combination of technologies that work together to solve big data problems:
Figure 2: The Hadoop Ecosystem
Part 2: HDFS (Hadoop Distributed File System)
1. Introduction to File Systems
A file system defines the methods and data structures that an operating system uses to keep track of files on a disk or partition.
2. Latency and Throughput
Understanding these concepts is crucial for HDFS design:
- Latency: Time required to perform an action or produce a result
- Measured in time units (e.g., seconds, milliseconds)
- Example: I/O latency is the time to complete a single I/O operation
- Throughput: Number of actions executed or results produced per unit of time
- Measured in units produced per time unit
- Example: Disk throughput is the maximum rate of sequential data transfer (e.g., MB/sec)
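As a rough worked example of the difference: a spinning disk with about 10 ms of seek and rotational latency per random I/O can complete only on the order of 100 random operations per second, yet the same disk can stream sequential data at well over 100 MB/sec. High latency per operation can therefore coexist with high throughput, provided the work is organized as large sequential transfers; this observation drives much of the HDFS design discussed next.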
3. HDFS: The Hadoop Distributed File System
HDFS solves the data movement problem with a simple principle: move the computation to where the data is stored, instead of moving the data to the computation.
Why This Approach?
- Not enough RAM to hold all data in memory
- Disk access is slow (high latency), but disk throughput is reasonable
HDFS Design Goals:
- Handle very large datasets (10K nodes, 100 million files, 10PB)
- Streaming data access (optimized for batch processing, high throughput)
- Simple coherency model (write-once-read-many)
- "Moving computation is cheaper than moving data"
- Portability across heterogeneous hardware and software platforms
- Fault tolerance on commodity hardware
4. HDFS Architecture
Figure 3: HDFS Architecture
Key components:
- NameNode:
- Manages the file system namespace
- Regulates access to files by clients
- Executes file system operations
- DataNodes:
- Store and retrieve blocks
- Report back to the NameNode
- Secondary NameNode:
- Performs periodic checkpoints of the namespace
- Helps NameNode to restart faster
Figure 4: Communication between NameNode and DataNode
5. HDFS Data Replication
- HDFS replicates file blocks for fault tolerance
- The NameNode makes all decisions regarding replication
- Default replication factor is 3 (configurable; see the sketch below)
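As a minimal sketch (the path and factor below are illustrative), the replication factor can be set cluster-wide through the dfs.replication property or changed per file through the FileSystem API; the same change can also be made from the command line with hdfs dfs -setrep.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // default factor applied to newly created files

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (path and factor are illustrative)
        fs.setReplication(new Path("/data/words.txt"), (short) 2);
        fs.close();
    }
}
```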
6. HDFS Read and Write Operations
Figure 5: File Read Data Flow in HDFS
Figure 6: File Write Data Flow in HDFS
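Both data flows can also be driven from client code. Below is a hedged sketch using the Hadoop FileSystem API (the file path is illustrative, and fs.defaultFS is assumed to point at the cluster): the client asks the NameNode for metadata, while the actual block bytes stream directly between the client and the DataNodes.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write: the NameNode chooses DataNodes; bytes flow through the write pipeline
        try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations; the client reads from DataNodes
        try (FSDataInputStream in = fs.open(new Path("/tmp/hello.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```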
7. HDFS Fault Tolerance
HDFS is designed with the assumption that failure is the norm, not the exception:
- DataNode failures are detected through missed heartbeats
- The NameNode initiates block re-replication when necessary
- The system can tolerate and recover from various types of failures
Part 3: YARN (Yet Another Resource Negotiator)
1. Why YARN?
In Hadoop 1.x, MapReduce handled both processing and resource management. This led to limitations in scalability and flexibility.
2. Introduction to YARN
Key motivations for YARN:
- Flexibility: Enable data processing models beyond MapReduce
- Efficiency: Improve performance and Quality of Service
- Resource Sharing: Support multiple workloads in the same cluster
3. YARN Components:
- ResourceManager: Arbitrates resources among all applications
- NodeManager: Per-node agent managing containers and reporting resource usage
- ApplicationMaster: Per-application component negotiating resources and working with NodeManagers
- Container: Unit of allocation for resources (CPU, memory, disk, network)
4. YARN Application Workflow
- Client submits an application
- ResourceManager allocates a container for the ApplicationMaster
- ApplicationMaster registers with ResourceManager
- ApplicationMaster negotiates resources
- ApplicationMaster launches containers on NodeManagers
- Application code executes in containers
- Client communicates with ApplicationMaster for status updates
- ApplicationMaster unregisters and shuts down upon completion
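The client-visible parts of this workflow (submitting the application and polling its status) can be sketched with the YarnClient API. This is a minimal, hedged illustration: the ApplicationMaster command is just a placeholder shell command and the resource numbers are arbitrary; a real application would ship its own ApplicationMaster.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleYarnClient {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient client = YarnClient.createYarnClient();
        client.init(conf);
        client.start();

        // Ask the ResourceManager for a new application
        YarnClientApplication app = client.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");

        // Describe the container that will run the ApplicationMaster
        // (placeholder command; a real ApplicationMaster would be a proper program)
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                   // local resources
                Collections.emptyMap(),                   // environment variables
                Collections.singletonList("sleep 30"),    // launch command(s)
                null, null, null);
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(512, 1));    // 512 MB, 1 vcore (arbitrary)

        // Submit; allocation, negotiation, and container launch happen inside the cluster
        ApplicationId appId = client.submitApplication(ctx);

        // Poll the ResourceManager until the application reaches a terminal state
        YarnApplicationState state;
        do {
            Thread.sleep(1000);
            ApplicationReport report = client.getApplicationReport(appId);
            state = report.getYarnApplicationState();
            System.out.println(appId + " is " + state);
        } while (state != YarnApplicationState.FINISHED
                && state != YarnApplicationState.FAILED
                && state != YarnApplicationState.KILLED);

        client.stop();
    }
}
```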