Exploring AWS EMR (Elastic MapReduce): Evolution, Analysis, and Real-World Use Cases
In this article, we'll delve deep into the analytics aspect of System Design, particularly focusing on the evolution of data analytics, the role of Amazon EMR (Elastic MapReduce) in processing and analyzing data, and some fascinating use cases that showcase its power.
Amazon EMR is a managed "Big Data" service offered by AWS, designed to help organizations process and analyze vast amounts of data efficiently. But how does it accomplish this? At its core, EMR utilizes the MapReduce framework, a concept initially popularized by Google.
Understanding MapReduce
Back in 2004, Google faced challenges processing its web-scale datasets (crawled pages, logs, and indexes), particularly in handling parallelization, machine failures, and retry logic across thousands of hosts. To abstract away these repetitive concerns, Jeffrey Dean and Sanjay Ghemawat of Google introduced the "MapReduce" concept, streamlining the processing of massive datasets.
In the MapReduce paradigm, a Map function transforms each input record into intermediate key-value pairs, and a Reduce function aggregates all values that share the same key. The framework handles partitioning, shuffling, and retries across hosts, so developers only write those two functions.
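To make this concrete, here is a minimal word-count sketch using PySpark's RDD API (the S3 paths are hypothetical, purely for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (
    sc.textFile("s3://my-bucket/input/")      # read raw text (hypothetical path)
      .flatMap(lambda line: line.split())     # Map: emit one record per word
      .map(lambda word: (word, 1))            # Map: emit (key, value) pairs
      .reduceByKey(lambda a, b: a + b)        # Reduce: sum the counts per key
)

counts.saveAsTextFile("s3://my-bucket/output/")

The shuffle between reduceByKey's input and output is exactly the part Google's framework automated: records with the same key are routed to the same reducer, with retries handled transparently.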
The Evolution of Data Architecture
Modern data architecture has evolved significantly. A brief timeline:
Late 1990s–early 2000s: analytics runs on relational databases and on-premises data warehouses, which struggle to scale.
2004: Google publishes the MapReduce paper, making distributed processing on clusters of commodity machines practical.
2006: Apache Hadoop brings an open-source implementation of MapReduce plus distributed storage (HDFS).
2009 onward: Apache Spark speeds things up with in-memory distributed processing.
2009–2010s: managed cloud services like Amazon EMR, combined with S3-backed data lakes, make this infrastructure elastic and pay-as-you-go.
Today: lakehouse technologies such as Apache Hudi blend data-lake flexibility with warehouse-style updates and queries.
The Data Pipeline for EMR
In a typical EMR pipeline, data flows through the following stages:
Data Sources → EMR → Data Lake/Data Warehouse → Machine Learning → BI Tools/Reporting
Big data is first stored in sources such as S3. EMR (running Hadoop or Spark) then processes that data using map and reduce functions. The processed data is queried and analyzed to identify patterns and candidate features, which are used to build, train, and tune a machine learning model. Finally, reports and dashboards are generated from the results.
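Here is a rough sketch of the middle of that pipeline in PySpark; the bucket names, columns, and event types are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("EMRPipeline").getOrCreate()

# 1. Read raw events from the data source (S3)
events = spark.read.json("s3://raw-events-bucket/2024/09/")

# 2. Process: filter and aggregate with Spark (the EMR step)
daily = (events
    .filter(F.col("event_type") == "purchase")
    .groupBy("user_id", F.to_date("timestamp").alias("day"))
    .agg(F.sum("amount").alias("daily_spend")))

# 3. Write curated output to the data lake in a columnar format
daily.write.mode("overwrite").parquet("s3://data-lake-bucket/daily_spend/")
# From here, a warehouse (e.g., Redshift or Athena) queries the curated data,
# ML models train on the derived features, and BI tools build dashboards.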
Real-World Use Cases of EMR
Now let's look at three use cases to get a detailed idea of how this pipeline works in practice.
1. Integral Ad Science (IAS): Digital Ad Verification -
Integral Ad Science (IAS), a global leader in digital ad verification, processes trillions of data events monthly to detect anomalies. They ensure your ads are only shown to humans on legitimate publishers, protecting your media budget from bots.
Here’s an example of how you might extract features using Spark SQL:
SELECT
user_id,
COUNT(DISTINCT browser_type) AS browser_variety,
AVG(time_between_clicks) AS avg_click_speed,
COUNT(DISTINCT ip_address) AS distinct_ip_count
FROM
interaction_logs
WHERE
event_date >= '2024-09-01'
GROUP BY
user_id
This query might be used to extract the number of different browsers used by a user (which could indicate bot behavior), the average speed between clicks (bots might click faster than humans), and the number of distinct IP addresses (frequent IP changes could be a sign of botnets).
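Downstream, such features could feed a classifier. A hedged sketch with Spark MLlib, assuming the query's output was saved as a table (the name and the labeled is_bot column are assumptions for illustration):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Load the feature table produced by the Spark SQL query above
# (table name is hypothetical; assumes an active SparkSession named `spark`)
features = spark.table("user_features_labeled")

assembler = VectorAssembler(
    inputCols=["browser_variety", "avg_click_speed", "distinct_ip_count"],
    outputCol="features")

rf = RandomForestClassifier(labelCol="is_bot", featuresCol="features")
model = rf.fit(assembler.transform(features))  # train a bot-vs-human classifier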
2. Warner Bros. Games -
Warner Bros. Games uses a powerful data pipeline to manage and enhance its gaming experiences. Massive amounts of player data are stored in S3, and EMR is used to process that data. Amazon Redshift is used to query and analyze the player data efficiently, while Amazon SageMaker enables them to apply machine learning models, helping them improve their games based on player behavior and other insights.
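A simplified sketch of one stage of such a pipeline: aggregating player sessions on EMR and staging the result in S3, from which a warehouse can load it. All bucket names and columns here are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PlayerAnalytics").getOrCreate()

# Raw telemetry events written by game clients (hypothetical schema)
sessions = spark.read.parquet("s3://wb-player-events/sessions/")

# Aggregate per-player engagement metrics on the EMR cluster
player_stats = (sessions
    .groupBy("player_id")
    .agg(F.count("session_id").alias("sessions"),
         F.avg("session_minutes").alias("avg_session_minutes")))

# Stage curated output; Redshift can COPY from this prefix, and
# SageMaker can train churn/engagement models on the same data
player_stats.write.mode("overwrite").parquet("s3://wb-curated/player_stats/")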
3. Uber’s Incremental Processing with Apache Hudi
Uber encountered inefficiencies with batch processing, particularly in updating driver earnings. The traditional method involved reprocessing large datasets every 6 to 10 hours, which was both slow and resource-intensive. To solve this, Uber built Hudi and later open-sourced it; it is now a top-level Apache project.
Example Scenario: a driver completes a trip, and the fare needs to be reflected in that driver's earnings.
Batch Processing Challenges: the entire earnings dataset was recomputed every 6 to 10 hours, even though only a small fraction of records had actually changed, wasting compute and delaying updates.
Transition to Hudi: with Hudi's upsert support on the data lake, only the records touched by new trips are rewritten, and downstream jobs consume just the incremental changes (see the sketch below).
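Here is a hedged sketch of what such an upsert might look like from Spark. The option keys are standard Hudi write options; the table, field, and path names are assumptions:

# New trip records since the last run (hypothetical source)
updates = spark.read.json("s3://trips/incremental/")

(updates.write.format("hudi")
    .option("hoodie.table.name", "driver_earnings")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")  # update in place
    .mode("append")
    .save("s3://data-lake/driver_earnings/"))
# Only the affected records are rewritten; downstream jobs can do an
# incremental read of just these commits instead of a full table rescan.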
Outcome: Switching to Hudi significantly improved processing efficiency, reduced computation times, and ensured accurate updates to driver earnings with minimal overhead. They also use Apache Hudi to address Uber Eats challenges, such as ensuring that when restaurants update their menus, the changes are promptly reflected in food recommendations to users.