The Limits of Relational Analytics
When data volumes scale from gigabytes to petabytes, traditional relational databases cannot process analytics queries efficiently on a single server. Scaling up hardware becomes prohibitively expensive.
Apache Hadoop, an open-source framework inspired by Google's MapReduce paper, solves this problem by distributing data and processing tasks across commodity clusters.
The Hadoop Distributed File System (HDFS)
HDFS is the storage engine of Hadoop. It breaks large files into blocks (typically 64MB or 128MB) and distributes them across a cluster of nodes.
HDFS utilizes a Master/Slave architecture:
- ◆NameNode (Master): Manages the file system namespace, directories, and mapping of blocks to DataNodes. It is a single point of failure.
- ◆DataNodes (Slaves): Store and retrieve blocks of data as directed by the NameNode.
To prevent data loss from disk failures, HDFS replicates each block across multiple DataNodes (default replica factor is 3).
The MapReduce Execution Paradigm
MapReduce is the programming framework that processes data stored in HDFS parallelly. It consists of two phases:
1. The Map Phase
The input dataset is divided into splits. Each mapper process reads data and converts it into key-value pairs.
2. The Reduce Phase
The framework shuffles, sorts, and aggregates key-value pairs based on keys, emitting final outputs.
// Conceptual Java Mapper code in 2010
public class WordCountMapper extends MapReduceBase implements Mapper {
public void map(Object key, Text value, OutputCollector output) {
String line = value.toString();
for (String word : line.split("\s+")) {
output.collect(new Text(word), new IntWritable(1));
}
}
}When to Adopt Hadoop?
Hadoop is designed for batch processing of unstructured or semi-structured data (like web server logs or clickstream history). It is not suitable for low-latency transaction processing (OLTP).