Exploring AWS EMR (Elastic MapReduce): Evolution, Analysis, and Real-World Use Cases

In this article, we'll delve deep into the analytics aspect of System Design, particularly focusing on the evolution of data analytics, the role of Amazon EMR (Elastic MapReduce) in processing and analyzing data, and some fascinating use cases that showcase its power.

Amazon EMR is a managed "Big Data" service offered by AWS, designed to help organizations process and analyze vast amounts of data efficiently. But how does it accomplish this? At its core, EMR utilizes the MapReduce framework, a concept initially popularized by Google.

Understanding MapReduce

Back in 2004, Google faced challenges with large-scale processing of crawled web data, particularly in handling retry logic and other repetitive tasks across many hosts. To solve these problems, Jeffrey Dean and Sanjay Ghemawat of Google introduced the "MapReduce" programming model, streamlining the processing of massive datasets.



In the MapReduce paradigm:

  1. Map: Data is divided into smaller chunks, and the "Map" function processes these chunks in parallel. This phase typically involves filtering, sorting, and transforming the data.
  2. Reduce: The output from the "Map" phase is then aggregated or further processed in the "Reduce" phase, ultimately producing the final result.
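The two phases can be sketched in plain Python with the classic word-count example. This is a minimal, single-machine illustration; real frameworks distribute the map and reduce tasks across many hosts and handle the shuffle between them automatically:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) pairs for each word in a chunk of text."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Group values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values into a final count per word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "big data pipelines"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

In a real cluster, each `map_phase` call would run on a different chunk of the input on a different machine, and the framework would route all pairs with the same key to the same reducer.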

The Evolution of Data Architecture

Here's a timeline summarizing the evolution of modern data architecture:

(Image: timeline of the evolution of modern data architecture)

Modern data architecture has evolved significantly, and here’s a brief overview:

  1. Hadoop: Utilizes MapReduce, a disk-based processing framework. Primarily designed for large-scale batch processing where speed isn't the highest priority.
  2. Spark: Supports a variety of workloads, including batch processing, streaming, machine learning, and graph processing. Leverages in-memory processing for faster execution, making it ideal for real-time data processing and interactive analytics.
  3. Presto: A distributed SQL query engine that runs natively over Hadoop file systems and S3, enabling hybrid workloads and federated query execution across sources.
  4. Hudi: A newer player, Hudi (Hadoop Upserts Deletes and Incrementals) enhances Hadoop-based storage with capabilities like upserts and incremental data processing.

The Data Pipeline for EMR

In a typical EMR pipeline, data flows through the following stages:

Data Sources → EMR → Data Lake/Data Warehouse → Machine Learning → BI Tools/Reporting


Big data is first stored in sources such as S3. EMR (running Hadoop or Spark) then processes that data with MapReduce-style jobs. The processed data is queried and analyzed to identify patterns and variables, which in turn feed a machine learning model that is built, trained, and tuned. Finally, reports and dashboards are generated from the results.
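The stages can be mocked end-to-end in a few lines of Python. Every name below (`load_from_source`, `process_with_emr`, and so on) is an illustrative placeholder standing in for the corresponding AWS service, not a real API:

```python
def load_from_source():
    """Stage 1 (data sources): raw events as they might land in S3."""
    return [
        {"user": "a", "clicks": 5},
        {"user": "b", "clicks": 120},
        {"user": "a", "clicks": 7},
    ]

def process_with_emr(events):
    """Stage 2 (EMR): MapReduce-style aggregation — total clicks per user."""
    totals = {}
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0) + event["clicks"]
    return totals

def analyze(totals, threshold=100):
    """Stage 3 (data lake/warehouse queries): find unusually heavy clickers."""
    return {user: count for user, count in totals.items() if count > threshold}

def report(flagged):
    """Stage 5 (BI tools): render a simple dashboard/report line per user."""
    return [f"{user}: {count} clicks" for user, count in sorted(flagged.items())]

events = load_from_source()
totals = process_with_emr(events)
flagged = analyze(totals)
print(report(flagged))  # ['b: 120 clicks']
```

The machine-learning stage (stage 4) would sit between `analyze` and `report`, consuming the flagged patterns as training features.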

Real-World Use Cases of EMR

Now let's look at three different use cases to get a detailed idea of how this pipeline works in practice.

1. Integral Ad Science (IAS): Digital Ad Verification -

Integral Ad Science (IAS), a global leader in digital ad verification, processes trillions of data events monthly to detect anomalies. They ensure your ads are only shown to humans on legitimate publishers, protecting your media budget from bots.

  • Data Collection: IAS collects a vast amount of data from various sources, such as web browsers, mobile devices, and ad interactions. This data includes details like user interactions, browser environment settings, device characteristics, and more. This raw data is often unstructured or semi-structured. The collected data is ingested into a data lake, typically stored on Amazon S3.
  • Data Processing on EMR: EMR can use MapReduce to process this data. For example, the "Map" function could parse the browser environment and device characteristics, while the "Reduce" function aggregates and filters patterns that are characteristic of bots.
  • Feature Engineering: From the processed data, relevant features (variables) are extracted that can be used as inputs for machine learning models. These features might include the frequency of certain browser or device characteristics, the speed and sequence of user interactions, or the network patterns observed during a session.

Here’s an example of how you might extract features using Spark SQL:

SELECT 
    user_id,
    COUNT(DISTINCT browser_type) AS browser_variety,
    AVG(time_between_clicks) AS avg_click_speed,
    COUNT(DISTINCT ip_address) AS distinct_ip_count
FROM 
    interaction_logs
WHERE 
    event_date >= '2024-09-01'
GROUP BY 
    user_id

This query might be used to extract the number of different browsers used by a user (which could indicate bot behavior), the average speed between clicks (bots might click faster than humans), and the number of distinct IP addresses (frequent IP changes could be a sign of botnets).

  • Building Machine Learning Models - Once the features are extracted, they are fed into machine learning models. These models are trained on labeled datasets where the behaviors of known bots and humans are already categorized.
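As a sketch of that last step, here is a tiny nearest-centroid classifier trained on the three features from the query above (`browser_variety`, `avg_click_speed`, `distinct_ip_count`). The labels and numbers are invented for illustration; a production system would use Spark MLlib, SageMaker, or a similar library rather than hand-rolled code:

```python
def centroid(rows):
    """Mean feature vector of a group of samples."""
    n = len(rows)
    return tuple(sum(row[i] for row in rows) / n for i in range(len(rows[0])))

def train(samples):
    """samples: list of (features, label), label is 'bot' or 'human'."""
    by_label = {}
    for feats, label in samples:
        by_label.setdefault(label, []).append(feats)
    return {label: centroid(rows) for label, rows in by_label.items()}

def predict(model, feats):
    """Assign the label whose centroid is closest (squared distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], feats))

# (browser_variety, avg_click_speed in seconds, distinct_ip_count)
# Labels come from a hypothetical pre-categorized dataset.
labeled = [
    ((1, 2.5, 1), "human"),
    ((2, 1.8, 1), "human"),
    ((6, 0.1, 9), "bot"),
    ((8, 0.2, 12), "bot"),
]
model = train(labeled)
print(predict(model, (7, 0.15, 10)))  # bot: many browsers/IPs, rapid clicks
print(predict(model, (1, 2.0, 2)))    # human: one browser, slower clicks
```

The intuition matches the query: bots tend to show high browser variety, very low time between clicks, and many distinct IPs, so their feature vectors cluster away from human traffic.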

2. Warner Bros. Games -

Warner Bros. Games uses a powerful data pipeline to manage and enhance their gaming experiences. Massive amounts of player data are stored in S3, and EMR is used to process this data. Amazon Redshift is used to query and analyze the player data efficiently, and Amazon SageMaker enables them to apply machine learning models, helping them improve their games based on player behavior and other insights.

3. Uber’s Incremental Processing with Apache Hudi

Uber encountered inefficiencies with batch processing, particularly in updating driver earnings. The traditional method involved reprocessing large datasets every 6 to 10 hours, which was both slow and resource-intensive. Uber built and open-sourced Apache Hudi to solve this problem.

Example Scenario:

  • Day 1: A driver earns $10 from a trip.
  • Day 2: The driver completes another trip, earning $8, and also receives a $1 tip for the trip completed on Day 1.

Batch Processing Challenges:

  • Lookback Strategy: Uber used a 90-day lookback strategy, reprocessing all data from the past 90 days to include any updates. This was inefficient and could miss late-arriving data.
  • Data Consistency: Important updates, like the late $1 tip, might be overlooked if they fell outside the lookback window.

Transition to Hudi:

  • Incremental Updates: Hudi allowed Uber to process only new or updated records, including late-arriving tips, without reprocessing the entire dataset.
  • Targeted Updates: Specific records, like the $1 tip, could be updated directly, eliminating the need to recompute the entire 90-day dataset.
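The earnings example can be modeled with a toy keyed table that accepts only incremental deltas. This is purely a sketch of the idea; real Hudi upserts run through Spark writers against files on S3/HDFS, addressed by a record key and partition path:

```python
# Toy model of incremental processing: earnings plays the role of a
# Hudi-style table keyed by (driver_id, day); apply_increments touches
# only the records that arrived, never the full 90-day history.

earnings = {}  # (driver_id, day) -> accumulated amount in dollars

def apply_increments(records):
    """Fold new or late-arriving deltas into the keyed table."""
    for (driver, day), amount in records:
        earnings[(driver, day)] = earnings.get((driver, day), 0) + amount

# Day 1: a trip earns $10
apply_increments([(("driver_1", "day_1"), 10)])

# Day 2: a new $8 trip, plus a late $1 tip that belongs to Day 1 —
# only these two records are processed, not the whole dataset
apply_increments([(("driver_1", "day_2"), 8), (("driver_1", "day_1"), 1)])

print(earnings[("driver_1", "day_1")])  # 11
print(earnings[("driver_1", "day_2")])  # 8
```

Contrast this with the lookback strategy: there, arriving at the same totals would mean rereading and recomputing every record from the past 90 days, and the Day 1 tip would be lost entirely if it arrived outside that window.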

Outcome: Switching to Hudi significantly improved processing efficiency, reduced computation times, and ensured accurate updates to driver earnings with minimal overhead. They also use Apache Hudi to address Uber Eats challenges, such as ensuring that when restaurants update their menus, the changes are promptly reflected in food recommendations to users.


