COMP9313 Week 2: MapReduce

1. Introduction to MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large datasets. It was originally developed by Google and has since become a fundamental part of many big data processing frameworks, including Apache Hadoop.

2. The MapReduce Paradigm

The MapReduce paradigm consists of two main phases:

  1. Map: This phase takes input data and produces a set of intermediate key/value pairs.
  2. Reduce: This phase merges all intermediate values associated with the same intermediate key.
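As a minimal sketch, the two phases can be written as plain Python functions, using word count as the canonical example (the function names here are illustrative, not part of any framework):

```python
# Map: for each input record (a line of text), emit intermediate
# (key, value) pairs - here, (word, 1) for every word seen.
def map_word_count(line):
    for word in line.split():
        yield (word.lower(), 1)

# Reduce: merge all intermediate values that share the same key -
# here, summing the 1s to get the total count for a word.
def reduce_word_count(word, counts):
    return (word, sum(counts))
```

A framework such as Hadoop calls the map function once per input record and the reduce function once per distinct intermediate key.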

3. MapReduce Workflow

The MapReduce framework manages the entire process of distributing tasks, handling failures, and aggregating results. Here's a high-level overview of the workflow:

  1. Input data is split into chunks
  2. Map tasks process input chunks in parallel
  3. Map output is partitioned and sorted
  4. Reduce tasks process partitioned data
  5. Final output is written to the distributed file system
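The steps above can be simulated in a few lines of plain Python. This is a toy, single-process sketch of what the framework does, with illustrative names:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # Steps 1-2: map each input record to intermediate (key, value) pairs.
    intermediate = [kv for rec in records for kv in mapper(rec)]
    # Step 3: sort by key (the framework's shuffle/sort step).
    intermediate.sort(key=itemgetter(0))
    # Step 4: one reduce call per distinct key, over all its values.
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

# Example: word count over two input "splits".
counts = run_mapreduce(
    ["big data", "big ideas"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: (word, sum(ones)),
)
# counts == [("big", 2), ("data", 1), ("ideas", 1)]
```

In a real cluster the map and reduce calls run in parallel on different machines, and step 5 writes the result to the distributed file system rather than returning a list.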

4. Implementing MapReduce

There are several ways to implement MapReduce jobs; common approaches include the native Hadoop Java API, Hadoop Streaming (which lets any executable serve as mapper or reducer), and higher-level tools such as Apache Pig and Hive that compile down to MapReduce jobs.

[Figure 1: MapReduce Data Flow]
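With Hadoop Streaming, the mapper and reducer are ordinary programs that read tab-separated key/value lines on standard input and write them to standard output. A simplified Python sketch of streaming-style word count (no secondary-sort handling):

```python
def streaming_mapper(lines):
    # Emit "word<TAB>1" for every word, as Hadoop Streaming expects.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(lines):
    # Input arrives sorted by key; sum consecutive counts per word.
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

In a real job, each function would live in its own script, looping over sys.stdin and printing each emitted line; Hadoop performs the sort between the two stages.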

5. Advanced MapReduce Concepts

5.1 Combiners

Combiners are "mini-reducers" that run on the map output. They can significantly reduce the amount of data transferred between the map and reduce phases.
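For word count, summation is associative and commutative, so the reducer logic can double as a combiner. A toy sketch showing the data reduction (names illustrative):

```python
from collections import Counter

def combine(pairs):
    # Run the "mini-reduce" (summation) over one map task's local output.
    merged = Counter()
    for word, count in pairs:
        merged[word] += count
    return sorted(merged.items())

# One map task's raw output: 5 intermediate pairs...
map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]
combined = combine(map_output)
# ...shrinks to 2 pairs shipped across the network to the reducers:
# [("a", 3), ("b", 2)]
```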

5.2 Partitioners

Partitioners determine how the map output is divided among the reducers. The default partitioner uses a hash of the key modulo the number of reducers.
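A sketch of the default behaviour (Hadoop's HashPartitioner computes the key's hash code modulo the number of reduce tasks; Python's built-in hash stands in here):

```python
def partition(key, num_reducers):
    # Default-style partitioner: hash of the key modulo reducer count.
    return hash(key) % num_reducers

# All pairs with the same key land on the same reducer,
# and every key maps to a valid reducer index:
assert partition("big", 4) == partition("big", 4)
assert 0 <= partition("data", 4) < 4
```

This guarantee - that every occurrence of a key goes to the same reducer - is what makes the reduce phase's per-key merging possible.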

5.3 Design Patterns

Two common design patterns in MapReduce are:

  1. In-Mapper Combining: This pattern preserves state across multiple map calls to perform local aggregation.

A classic worked example is computing the mean of the values associated with each key. Four successive versions of the algorithm highlight the challenges and the improvement made in each.

Explanation

  • Version 1: Simple implementation, but can't use reducer as combiner due to the non-associative nature of mean calculation.
  • Version 2: Attempts to separate sum and count, but fails because combiner output doesn't match mapper output.
  • Version 3: Correctly implements combiner by maintaining consistent input/output types throughout.
  • Version 4: Uses in-mapper combining to perform local aggregation, reducing the amount of intermediate data.
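A hedged sketch of Version 4, in-mapper combining for the mean; the class and method names are illustrative, mirroring Hadoop's setup/map/cleanup task lifecycle:

```python
class MeanMapper:
    def setup(self):
        # State preserved across map() calls: key -> (sum, count).
        self.totals = {}

    def map(self, key, value):
        s, c = self.totals.get(key, (0, 0))
        self.totals[key] = (s + value, c + 1)

    def cleanup(self):
        # Emit one locally aggregated (sum, count) pair per key
        # at the end of the map task.
        return sorted(self.totals.items())

def mean_reducer(key, pairs):
    # Sums and counts are associative, so they survive aggregation;
    # the division happens only once, here in the reducer.
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return (key, total / count)

m = MeanMapper()
m.setup()
for k, v in [("x", 2), ("x", 4), ("y", 6)]:
    m.map(k, v)
# m.cleanup() == [("x", (6, 2)), ("y", (6, 1))]
```

Instead of emitting one pair per input value, the mapper ships one pair per distinct key, which is exactly the data reduction a combiner would provide, without the framework overhead.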
  2. Pairs vs. Stripes: These are two approaches for representing relationships between items, often used in co-occurrence problems.
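A toy sketch of both representations for word co-occurrence, where a word's neighbours are the other words on the same line (function names illustrative):

```python
from collections import Counter

def pairs_map(line):
    # Pairs: emit ((w, u), 1) for every co-occurring pair of words.
    words = line.split()
    return [((w, u), 1) for w in words for u in words if u != w]

def stripes_map(line):
    # Stripes: emit (w, {neighbour: count}) - one associative
    # array ("stripe") per word instead of many tiny pairs.
    words = line.split()
    return [(w, Counter(u for u in words if u != w)) for w in words]

out_pairs = pairs_map("a b")
# [(("a", "b"), 1), (("b", "a"), 1)]
out_stripes = stripes_map("a b")
# [("a", Counter({"b": 1})), ("b", Counter({"a": 1}))]
```

Pairs produces many small records with simple values; stripes produces fewer, larger records whose values can be merged element-wise in a combiner, trading memory per record for less intermediate data.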