The Big Data Landscape
For most of the history of data analysis, a single powerful server could store and process all the data an organization needed to analyze. That assumption broke down in the late 2000s as internet-scale companies began generating logs, transactions, social interactions, sensor readings, and click streams at volumes measured in terabytes and petabytes — far beyond what a single machine could handle. The response was a new class of distributed computing frameworks designed to split large datasets across hundreds or thousands of commodity servers, process data in parallel, and aggregate results. Hadoop and Apache Spark are the two most important frameworks that emerged from this period.
For data analysts, understanding big data technologies is increasingly relevant even if you do not build the infrastructure yourself. Many analytical datasets live in Hadoop-based data warehouses (like Hive) or are processed with Spark before reaching the tools analysts use daily. Knowing how these systems work helps you understand query performance, design efficient data pipelines, and collaborate more effectively with data engineers.
The Three Vs of Big Data
Big data is often characterized by the "Three Vs" framework, which describes why traditional data management approaches are insufficient:
| V | Description | Example |
|---|---|---|
| Volume | The sheer size of data, from terabytes to petabytes | A social media platform processing 500 billion user events per day |
| Velocity | The speed at which data is generated and must be processed | Real-time fraud detection processing thousands of transactions per second |
| Variety | The diversity of data formats: structured, semi-structured, unstructured | Combining SQL tables, JSON logs, images, and audio in one pipeline |
Additional Vs have been proposed: Veracity (data quality and trustworthiness), Value (the usefulness of the data), and Variability (inconsistency in data flow). Together, these characteristics define the design requirements for big data systems: horizontal scalability, fault tolerance, support for diverse data formats, and high throughput.
Apache Hadoop: Distributed Storage and Processing
Apache Hadoop, originally developed at Yahoo and released as open source in 2006, became the foundational big data framework. Hadoop provides a distributed file system and a batch processing engine for very large datasets across a cluster of commodity machines.
Hadoop Distributed File System (HDFS)
HDFS is Hadoop's storage layer. It splits large files into blocks (typically 128 MB each) and distributes those blocks across multiple nodes in the cluster. Each block is replicated (by default, 3 copies) across different nodes, providing fault tolerance — if one machine fails, the data remains available from other replicas.
| HDFS Component | Role |
|---|---|
| NameNode | The master node that manages file metadata (directory structure, block locations). Does not store actual data. |
| DataNode | Worker nodes that store actual data blocks. Each DataNode periodically sends heartbeats and block reports to the NameNode. |
| Secondary NameNode | Periodically checkpoints the NameNode's metadata. It is not a true standby; it is more of a backup utility. |
| HDFS Federation | Multiple NameNodes managing different namespaces, allowing horizontal scaling of metadata. |
HDFS is optimized for high-throughput access to large files and write-once-read-many workloads. It is not suited for low-latency access or large numbers of small files (which overload the NameNode's metadata management).
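The block and replication arithmetic above can be made concrete with a small calculation. The following is a minimal sketch in plain Python, not HDFS code: the function name `hdfs_footprint` is hypothetical, and the 128 MB block size and replication factor of 3 are simply the defaults quoted above. It compares one large file with the same volume of data stored as many small files.

```python
import math

BLOCK_SIZE_MB = 128      # typical HDFS block size cited above
REPLICATION = 3          # default replication factor cited above


def hdfs_footprint(file_sizes_mb):
    """Return (block_count, replica_count, raw_storage_mb) for a list of files.

    Every file needs at least one block entry in the NameNode's metadata,
    and a partially filled block only uses as much disk as the data it holds.
    """
    blocks = sum(max(1, math.ceil(size / BLOCK_SIZE_MB)) for size in file_sizes_mb)
    replicas = blocks * REPLICATION
    raw_storage_mb = sum(file_sizes_mb) * REPLICATION
    return blocks, replicas, raw_storage_mb


# One 10 GB file versus the same 10 GB stored as 10,240 files of 1 MB each.
one_large_file = hdfs_footprint([10 * 1024])
many_small_files = hdfs_footprint([1] * (10 * 1024))

print("one large file:  ", one_large_file)
print("many small files:", many_small_files)
```

Both layouts consume the same raw disk space, but the small-file layout creates 10,240 block entries instead of 80, which is the metadata pressure on the NameNode described above.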
MapReduce
MapReduce is Hadoop's original batch processing engine. A MapReduce job consists of two phases: a Map phase that processes input data and produces key-value pairs, and a Reduce phase that aggregates the key-value pairs produced by all Map tasks.
In the Map phase, the input data is split across mappers running in parallel on the nodes where the data lives (a principle called "moving computation to the data"). Each mapper processes its local data block independently. In the shuffle and sort phase, the framework groups all intermediate key-value pairs by key and sends them to reducers. In the Reduce phase, each reducer receives all values for a given key and produces the final output.
MapReduce excels at batch jobs that can be expressed as key-value aggregations: word counts, log analysis, join operations, and aggregations across massive datasets. However, it has significant limitations: it is slow (each job writes intermediate results to disk between Map and Reduce), difficult to program for complex multi-step workflows, and poorly suited to iterative algorithms (like machine learning training) that require many passes over data.
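The map, shuffle-and-sort, and reduce phases can be imitated on a single machine to show how the data flows. This is an illustrative sketch in plain Python, not the Hadoop API: the function names and the three sample "blocks" are assumptions made for the example, and a real job would run the mappers and reducers on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(block):
    """Mapper: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle_and_sort(mapped_pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


# Each string stands in for a data block stored on a different node.
blocks = ["spark hadoop spark", "hadoop hive", "spark"]

mapped = chain.from_iterable(map_phase(b) for b in blocks)
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))
print(word_counts)
```

In a real cluster the grouped pairs would be written to disk and copied over the network before the reducers start, which is the main source of the latency described above.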
The Hadoop Ecosystem
Hadoop itself provides only the storage layer (HDFS) and the execution layer (MapReduce on YARN). A rich ecosystem of tools built on top of Hadoop addresses its limitations and extends its capabilities:
| Tool | Purpose | Notes |
|---|---|---|
| YARN | Yet Another Resource Negotiator; cluster resource management | Took over scheduling from the original MapReduce runtime in Hadoop 2; allows other engines (Spark, Tez) to run on Hadoop clusters |
| Hive | SQL-like query interface (HiveQL) over HDFS data | Translates SQL to MapReduce or Tez jobs; the foundation of many Hadoop-based data warehouses |
| HBase | NoSQL column-family database on top of HDFS | Low-latency random read/write access; used for operational lookup tables |
| Pig | High-level data flow language (Pig Latin) for MapReduce | Largely superseded by Spark and Hive |
| Sqoop | Data transfer between Hadoop and relational databases | Import/export data between HDFS and MySQL, Oracle, etc. |
| Flume | Distributed log and event data ingestion | Collects and streams log data into HDFS |
| Oozie | Workflow scheduler for Hadoop jobs | Orchestrates sequences of MapReduce, Hive, and Pig jobs |
| ZooKeeper | Distributed coordination service | Used internally by HBase, Kafka, and other distributed systems for leader election and configuration |
| Ambari | Cluster management and monitoring UI | Simplifies Hadoop cluster provisioning and management |
Apache Spark: In-Memory Distributed Processing
Apache Spark was developed at UC Berkeley in 2009 and donated to the Apache Software Foundation in 2013. It addressed MapReduce's core limitation, writing intermediate results to disk between stages, by keeping data in memory across computation stages. This can make Spark 10 to 100 times faster than MapReduce for some workloads, especially iterative algorithms that make repeated passes over the same data.
Spark Architecture
A Spark application consists of a driver program that defines the application logic and coordinates the execution, and executor processes that run on worker nodes and perform the actual computation. The cluster manager (YARN, Kubernetes, Mesos, or Spark's built-in standalone mode) allocates resources and launches executors.
| Component | Role |
|---|---|
| Driver | Runs the main function, creates SparkContext/SparkSession, plans and schedules jobs |
| SparkContext / SparkSession | Entry point to Spark; SparkSession (Spark 2.0+) unifies all previous contexts |
| Executor | JVM process on worker nodes; runs tasks and caches data in memory or disk |
| Cluster Manager | Allocates resources; can be YARN, Kubernetes, Mesos, or Spark standalone |
| Task | The smallest unit of work; one task processes one partition of data |
| Stage | A set of tasks that can run without a shuffle; separated by shuffle boundaries |
| Job | A complete computation triggered by an action; consists of one or more stages |
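The relationship between the driver, executors, partitions, and tasks can be imitated with Python's standard thread pool. This is only an analogy under stated assumptions, not Spark code: the names `split_into_partitions`, `task`, and `run_job` are hypothetical, and threads inside one process stand in for executor JVMs running on separate worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver-side step: divide the dataset into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """One task: process exactly one partition (here, a sum of squares)."""
    return sum(x * x for x in partition)


def run_job(data, num_partitions=4, num_executors=2):
    """Action: schedule one task per partition, then combine partial results."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(task, partitions))
    return len(partitions), sum(partial_results)


num_tasks, total = run_job(list(range(1, 11)))
print(num_tasks, total)
```

Four partitions produce four tasks even though only two workers are available, so tasks queue for free workers much as Spark tasks wait for executor slots.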
Resilient Distributed Datasets (RDDs)
The foundational data abstraction in Spark is the Resilient Distributed Dataset (RDD). An RDD is an immutable, distributed collection of objects partitioned across the cluster. RDDs are fault-tolerant because Spark tracks the lineage of transformations applied to create each RDD — if a partition is lost, Spark can recompute it from the original source.
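Lineage-based recovery can be illustrated with a toy class that records transformations instead of running them. This is a sketch in plain Python, not the Spark API: `ToyRDD` and its methods are hypothetical names, and real RDDs track lineage per partition across a cluster rather than in one object.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: source partitions plus a recorded lineage."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions   # immutable original data
        self._lineage = lineage            # transformations recorded, not yet run

    def map(self, fn):
        # Returns a new ToyRDD; nothing is computed at this point.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def compute_partition(self, index):
        """Build (or rebuild) one partition by replaying the lineage on its source data."""
        rows = list(self._source[index])
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def collect(self):
        """Materialise every partition and return all rows."""
        return [row for i in range(len(self._source)) for row in self.compute_partition(i)]


source = ((1, 2, 3), (4, 5, 6))
evens_squared = ToyRDD(source).map(lambda x: x * x).filter(lambda x: x % 2 == 0)

all_rows = evens_squared.collect()
recovered = evens_squared.compute_partition(1)   # pretend partition 1 was lost
print(all_rows, recovered)
```

Because each `ToyRDD` keeps only the source data and the list of functions, a lost partition can be rebuilt on its own without recomputing the others.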
RDD operations are either transformations (which return a new RDD lazily, without immediate execution) or actions (which trigger actual execution and return results). This lazy evaluation model lets Spark optimize the execution plan before running any computation. Common transformations include map, filter, flatMap, groupByKey, and join. Common actions include collect, count, reduce, take, and saveAsTextFile.
DataFrames and Datasets
Modern Spark development primarily uses the DataFrame API (Spark 1.3+) rather than low-level RDDs. DataFrames are distributed collections of rows with named columns, similar to a relational table or a pandas DataFrame. The DataFrame API provides SQL-like operations and benefits from the Catalyst query optimizer, which automatically optimizes execution plans.
The Dataset API (Spark 1.6+) extends DataFrames with compile-time type safety (in Scala and Java). In Python (PySpark), only the DataFrame is available; it is effectively an untyped Dataset.
The Spark SQL module allows DataFrames to be queried using SQL syntax directly from Spark applications, making Spark accessible to analysts who know SQL without requiring knowledge of the Scala or Python Spark APIs.
PySpark: Spark for Python Analysts
PySpark is the Python API for Apache Spark, enabling Python analysts to use Spark's distributed processing power without learning Scala or Java. PySpark exposes most of the Spark API through Python, including DataFrames, Spark SQL, MLlib (machine learning), and Spark Streaming.
A basic PySpark workflow reads data, applies transformations, and writes results:
| Step | PySpark Operation | Description |
|---|---|---|
| Initialize | SparkSession.builder.getOrCreate() | Create the entry point to Spark |
| Read data | spark.read.csv() / .parquet() / .json() | Load data into a DataFrame from various sources |
| Select columns | df.select("col1", "col2") | Choose specific columns |
| Filter rows | df.filter(df.age > 30) | Apply row-level conditions |
| Group and aggregate | df.groupBy("region").agg(sum("sales")) | Aggregate by dimension |
| Join tables | df1.join(df2, "customer_id", "left") | Merge DataFrames on a key |
| Write results | df.write.parquet("output_path") | Save to Parquet, CSV, or other formats |
PySpark DataFrames are compatible with pandas through the toPandas() method (for small results that fit in the driver's memory) and the pandas API on Spark (formerly Koalas, integrated into Spark since 3.2), which allows pandas code to run on distributed Spark DataFrames with minimal changes.
Spark's Component Ecosystem
Spark is more than a batch processing engine. It provides a unified analytics platform through several integrated components:
| Component | Purpose | Use Cases |
|---|---|---|
| Spark Core | Basic distributed processing engine (RDD API) | Low-level distributed transformations and actions |
| Spark SQL | SQL interface and DataFrame/Dataset API | Structured data processing; reading Parquet, JSON, Hive tables |
| Spark Streaming / Structured Streaming | Real-time data processing | Processing Kafka streams; real-time dashboards; event processing |
| MLlib | Scalable machine learning library | Training ML models on large datasets; feature engineering at scale |
| GraphX | Graph processing framework | Social network analysis; PageRank; connected components |
| SparkR | R interface to Spark | Statistical analysis in R on distributed datasets |
Hadoop vs. Spark: Key Differences
| Dimension | Hadoop MapReduce | Apache Spark |
|---|---|---|
| Speed | Slow (reads/writes to disk between stages) | 10 to 100x faster for many workloads (in-memory processing) |
| Processing model | Batch only | Batch, streaming, iterative (ML), interactive SQL |
| Ease of use | Complex Java API; steep learning curve | Python, Scala, R, Java APIs; concise DataFrame API |
| Fault tolerance | Excellent (failed tasks are re-run; intermediate data persisted to disk) | Good (RDD lineage recomputation; optional checkpoints) |
| Memory requirement | Low (streams through disk) | Higher (caches data in memory) |
| Machine learning | Limited (Mahout) | Excellent (MLlib; integrates with deep learning frameworks) |
| Ecosystem maturity | Very mature; extensive tooling | Mature; rapidly evolving |
| Primary storage | HDFS | HDFS, S3, GCS, Azure Blob, Delta Lake, Iceberg |
In modern data architectures, Hadoop and Spark are often used together rather than as alternatives: HDFS (or cloud object storage) provides the underlying storage layer, while Spark handles the computation. MapReduce is largely obsolete for new development, replaced by Spark for most batch workloads.
Modern Big Data Architectures
The data architecture landscape has evolved significantly since Hadoop's introduction. Cloud-native object storage (AWS S3, Google Cloud Storage, Azure Blob Storage) has largely replaced HDFS for many organizations because it decouples storage from computation, enabling elastic scaling and lower cost. Managed Spark services (Databricks, AWS EMR, Google Dataproc, Azure HDInsight) allow teams to spin up Spark clusters on demand without managing infrastructure.
Table formats like Delta Lake, Apache Iceberg, and Apache Hudi add ACID transactions, schema evolution, and time travel to cloud storage, bringing data warehouse reliability to data lake architectures. This has given rise to the lakehouse architecture: a unified platform combining the scalability of data lakes with the query performance and governance of data warehouses.
| Architecture Pattern | Description | Tools |
|---|---|---|
| Data warehouse | Structured, curated data optimized for SQL analytics | Snowflake, BigQuery, Redshift, Synapse |
| Data lake | Raw data at scale in object storage | S3 + Spark, HDFS + Hive |
| Lakehouse | Unifies lake and warehouse with ACID + SQL | Databricks (Delta Lake), Apache Iceberg on S3 |
| Lambda architecture | Combines batch (accuracy) and streaming (speed) layers | Hadoop batch + Kafka streaming |
| Kappa architecture | Reprocesses all data as streaming events | Kafka + Spark Structured Streaming |
What Data Analysts Need to Know
Data analysts working with big data typically interact with these systems through abstraction layers rather than writing low-level Spark or MapReduce code. The most common interaction points are SQL interfaces (Hive, Spark SQL, Presto/Trino, BigQuery) and notebooks (Jupyter with PySpark, Databricks notebooks). Understanding the underlying architecture helps analysts write better queries, interpret query plans, and diagnose performance problems.
Key concepts every analyst should understand include: how partitioning affects query performance (filtering on partition columns avoids full table scans); why small files are problematic in distributed systems; how columnar formats like Parquet and ORC enable predicate pushdown; when to use broadcast joins vs. shuffle joins; and how data skew (uneven distribution of values across partitions) causes performance bottlenecks.
For analysts who want to write PySpark code, the essential skill set includes: creating SparkSessions, reading and writing DataFrames in common formats (Parquet, CSV, JSON, Delta), applying DataFrame transformations and aggregations, writing Spark SQL queries, and understanding how to tune job performance by adjusting partitioning, caching, and join strategies.
Summary
Hadoop and Spark represent the two major generations of big data processing. Hadoop's HDFS storage layer and MapReduce computation model established the distributed data processing paradigm. Spark improved dramatically on MapReduce's speed by keeping intermediate data in memory and offered a much richer API for batch, streaming, SQL, and machine learning workloads. Modern data architectures blend cloud object storage with Spark computation and open table formats to create scalable, cost-effective, and analytically capable data platforms. For data analysts, familiarity with these technologies, even at a conceptual level, is increasingly expected in organizations dealing with data at scale.