Hadoop and MapReduce: Demystifying Big Data Processing for the Enterprise

Distribute the load. We explain the internal mechanics of HDFS, NameNodes, and parallelized data computations.

VP
SHIVAM ITCS
·2 July 2010·10 min read·1 views

The Limits of Relational Analytics

When data volumes scale from gigabytes to petabytes, traditional relational databases cannot process analytics queries efficiently on a single server. Scaling up hardware becomes prohibitively expensive.

Apache Hadoop, an open-source framework inspired by Google's MapReduce paper, solves this problem by distributing data and processing tasks across commodity clusters.

The Hadoop Distributed File System (HDFS)

HDFS is the storage engine of Hadoop. It breaks large files into blocks (typically 64MB or 128MB) and distributes them across a cluster of nodes.

HDFS utilizes a Master/Slave architecture:

  • NameNode (Master): Manages the file system namespace, directories, and mapping of blocks to DataNodes. It is a single point of failure.
  • DataNodes (Slaves): Store and retrieve blocks of data as directed by the NameNode.

To prevent data loss from disk failures, HDFS replicates each block across multiple DataNodes (default replica factor is 3).

The MapReduce Execution Paradigm

MapReduce is the programming framework that processes data stored in HDFS parallelly. It consists of two phases:

1. The Map Phase

The input dataset is divided into splits. Each mapper process reads data and converts it into key-value pairs.

2. The Reduce Phase

The framework shuffles, sorts, and aggregates key-value pairs based on keys, emitting final outputs.

javacode
// Conceptual Java Mapper code in 2010
public class WordCountMapper extends MapReduceBase implements Mapper {
    public void map(Object key, Text value, OutputCollector output) {
        String line = value.toString();
        for (String word : line.split("\s+")) {
            output.collect(new Text(word), new IntWritable(1));
        }
    }
}

When to Adopt Hadoop?

Hadoop is designed for batch processing of unstructured or semi-structured data (like web server logs or clickstream history). It is not suitable for low-latency transaction processing (OLTP).

VP
Vijay Paliwal
Founder, SHIVAM ITCS · 18+ years enterprise & AI engineering
MCA · Ex-HiveGPT USA · Ex-Social27 Seattle
Hadoop and MapReduce: Demystifying Big Data Processing for the Enterprise | SHIVAM ITCS Blog | SHIVAM ITCS