From real-time streams to perpetual insights: Zalando Case Study 2!


In my previous article, we explored why Kafka is essential for data science, using Zalando as an example of its transformative potential in handling real-time data streams. While Kafka excels in managing real-time events and streaming pipelines, it’s important to address its limitations. In this follow-up, we delve into Hadoop, its integration with Python, and its relevance in the AI-driven data science landscape, all through the lens of Zalando’s data ecosystem.

Modern e-commerce companies like Zalando operate in a data landscape where both speed and depth of analysis are crucial (Davenport & Harris, 2017). Kafka’s power lies in its ability to handle high-velocity data streams such as user clicks, inventory changes, and order placements in real time (Apache Kafka, 2024; Zalando Tech Blog, 2023). However, the road to comprehensive, AI-driven insights does not end with Kafka. Its limitations around batch processing, complex analytics, and long-term storage call for additional technologies that can extend its capabilities.

Here, we examine these limitations and discuss how Hadoop and Python form a complementary ecosystem that supports large-scale analytics, long-term trend analysis, and the deployment of sophisticated AI models. By doing so, we aim to illustrate how Zalando’s data infrastructure can evolve from simply processing real-time events to generating actionable insights at scale.

  • Batch Processing Challenges: Kafka’s stream-first architecture is not designed for heavy batch workloads. While real-time analytics are its strong suit, analyzing historical data spanning weeks, months, or even years is beyond Kafka’s native scope (Narkhede et al., 2017).
  • Complex Analytics Gap: Kafka does not provide built-in tools for complex analytical tasks or model training. Organizations typically rely on external systems—such as Spark, Flink, or data warehouses—to conduct in-depth analyses (Kreps, 2019).
  • Storage Limitations: Kafka’s storage model, optimized for short-term retention, makes it less suitable for long-term data archiving. Retention windows are generally limited, and relying solely on Kafka for data storage can lead to an incomplete historical picture (Apache Kafka, 2024).

These shortcomings highlight the need for an integrated ecosystem—one where Hadoop can pick up where Kafka leaves off, and where Python can bridge the gap to advanced AI workflows.

Hadoop: Powering Large-Scale Batch Processing and Storage

Apache Hadoop is a foundational technology for big data storage and batch analytics (White, 2015). It provides a distributed, fault-tolerant framework that can process petabytes of data using commodity hardware.

For Zalando, two of Hadoop’s core strengths are especially relevant:

  • HDFS (Hadoop Distributed File System): Offers reliable, scalable storage for extensive datasets. Zalando’s massive data—from transaction logs to product catalogs—can be stored long-term, enabling historical analyses that go far beyond real-time snapshots (Apache Hadoop, 2024). A short sketch of accessing HDFS from Python follows this list.
  • MapReduce for Batch Processing: MapReduce, the original Hadoop processing engine, excels at handling large-scale batch computations (Dean & Ghemawat, 2008). Zalando can use it to extract insights from massive historical data, identifying seasonal trends, patterns in user behavior over months or years, and correlations that real-time analysis might miss.
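
To make the HDFS point concrete, here is a minimal sketch of reading event data from HDFS directly in Python with pyarrow. The namenode host, port, and paths are placeholders rather than Zalando’s actual configuration, and the client assumes the Hadoop native libraries are installed on the machine running it.

import pyarrow.fs as fs
import pyarrow.parquet as pq

# Connect to HDFS (placeholder namenode address; requires Hadoop native libraries)
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# List files under a hypothetical events directory
for info in hdfs.get_file_info(fs.FileSelector("/data/zalando_events/", recursive=True)):
    print(info.path, info.size)

# Read a Parquet dataset into an Arrow table and convert to pandas for local inspection
table = pq.read_table("/data/zalando_events/", filesystem=hdfs)
print(table.to_pandas().head())

For anything beyond small ad-hoc inspections, the same data would normally be read through Spark on the cluster rather than pulled onto a single machine.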

By integrating Hadoop, Zalando complements Kafka’s real-time ingestion capabilities with robust, long-term storage and large-scale analytical capacity.

Python: The Bridge to AI-Driven Insights

While Hadoop solves the batch and scale problem, Python enables a smooth integration of advanced analytics and AI models into the data pipeline (Van Rossum & Drake, 2009). Its rich ecosystem of libraries and frameworks makes Python a go-to language for data science teams.

Key Python tools that enhance Hadoop’s capabilities include:

  • PySpark: The Python API for Apache Spark allows data scientists to perform distributed, in-memory computations (Zaharia et al., 2016). By running PySpark on a Hadoop cluster, Zalando can train machine learning models efficiently, iterate quickly, and experiment with complex analytical tasks that surpass what MapReduce alone can accomplish.
  • Hadoop Streaming: This feature enables developers to write MapReduce jobs in Python, allowing direct integration of custom analytics scripts into the Hadoop environment (White, 2015). Zalando’s teams can thus implement domain-specific analyses without being locked into Java-based tooling.
  • Dask: A Python library designed for parallel computing, Dask complements Hadoop’s distributed environment and supports rapid prototyping. Zalando’s data scientists can test their models locally at a smaller scale using Dask, then deploy at full cluster scale with Spark or MapReduce when ready (Rocklin, 2015); a minimal local-prototyping sketch follows this list.
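
As a quick illustration of that local prototyping workflow, the sketch below uses Dask to aggregate a small sample of event data. The file path and column names are hypothetical and simply mirror the event fields used later in this article.

import dask.dataframe as dd

# Load a local sample of event data (placeholder path)
events = dd.read_parquet("sample_data/zalando_events/")

# Prototype the aggregation locally: count view events per product
views = events[events["event_type"] == "view"]
view_counts = views.groupby("product_id")["user_id"].count()

# Trigger the computation on the local machine (or a small Dask cluster)
result = view_counts.compute()
print(result.sort_values(ascending=False).head(10))

Once the logic is validated at this scale, the same aggregation can be re-expressed in PySpark and run against the full dataset on the Hadoop cluster.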

Leveraging AI Within Hadoop’s Infrastructure

Integrating AI into Hadoop’s ecosystem can yield powerful competitive advantages for Zalando:

  1. Personalized Recommendations: By blending real-time customer interactions (captured via Kafka) with long-term purchasing history (stored in HDFS), Zalando’s recommendation engines can factor in both immediate interests and historical preferences. AI models trained via PySpark on Hadoop clusters ensure recommendations stay relevant and continuously improve.
  2. Inventory Management: Historical sales data and product turnover rates processed in Hadoop can inform predictive models. With Python-based ML tools, Zalando’s data scientists can forecast demand more accurately, optimize stock levels, and reduce waste.
  3. Customer Sentiment Analysis: By analyzing large volumes of textual data—customer reviews, social media commentary—stored in Hadoop, Zalando’s NLP models can derive sentiment patterns. Python’s machine learning and NLP libraries, combined with distributed processing through PySpark, allow for timely sentiment insights that guide marketing strategies and product development; a sketch of such a pipeline follows this list.
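
To illustrate the sentiment use case, here is a minimal PySpark ML sketch that trains a simple text classifier on labeled reviews. The HDFS path and the review_text/label columns are hypothetical placeholders; a production system would likely use richer features or a dedicated NLP framework on top of this.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("SentimentSketch").getOrCreate()

# Hypothetical labeled review data with columns "review_text" and "label" (1 = positive, 0 = negative)
reviews = spark.read.parquet("hdfs://namenode:8020/processed_data/labeled_reviews/")

# Simple pipeline: tokenize, hash to term frequencies, weight with IDF, classify
tokenizer = Tokenizer(inputCol="review_text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=1 << 18)
idf = IDF(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)

model = Pipeline(stages=[tokenizer, tf, idf, lr]).fit(reviews)

# Persist the fitted pipeline so new reviews can be scored later
model.write().overwrite().save("hdfs://namenode:8020/models/review_sentiment_pipeline")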

Integrating Kafka and Hadoop for Optimal Results

A well-orchestrated data architecture pairs Kafka’s real-time streams with Hadoop’s batch capabilities:

  • Real-Time Feeds: Kafka ingests user clicks, browsing patterns, inventory updates, and order placements as they occur.
  • Batch Analytics in Hadoop: Over time, these raw streams flow into Hadoop for large-scale aggregation, processing, and historical trend analysis.
  • AI Integration: Python tools running atop Hadoop’s distributed infrastructure then enable the training and deployment of machine learning models that inform recommendations, marketing, and operations.

Below is a detailed, technical deep-dive into how the integrated Hadoop, Kafka, and Python ecosystem can work within Zalando’s data infrastructure. This includes architectural explanations, example workflows, and sample code snippets. Though these examples are illustrative and may need adaptation to Zalando’s real environment, they capture the essence of the technologies in use.

Technical Implementation Details

A typical data pipeline leveraging Kafka, Hadoop, and Python follows these steps:

  1. Data Ingestion from Kafka into HDFS: Kafka captures real-time events such as page views, product clicks, or order confirmations. A common approach is to use frameworks like Apache Spark or Flink to consume messages from Kafka, perform initial transformations, and then store them into the Hadoop Distributed File System (HDFS).
  2. Batch Processing with Hadoop (MapReduce or Spark): Once data is in HDFS, batch analytics can be performed. Zalando’s data engineers might use classic MapReduce jobs for large-scale, non-iterative tasks or rely on Spark for more interactive, iterative workloads—especially those involving machine learning (ML) workflows.
  3. Python Integration for AI and Machine Learning: Python scripts and notebooks, often run in a distributed manner using PySpark on a Hadoop cluster, allow data scientists to build recommendation models, perform sentiment analysis, or forecast inventory needs. This leverages Python’s extensive ML and AI libraries (e.g., scikit-learn, TensorFlow, or PyTorch) integrated with the distributed computing power of Hadoop and Spark.

Example: Consuming Kafka Data with PySpark

The following Python snippet (using PySpark Structured Streaming) demonstrates how to read from Kafka topics and write the data into HDFS. Adjust broker addresses, topics, and file paths as needed.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Create the SparkSession
spark = SparkSession.builder \
    .appName("KafkaToHDFSExample") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

# Define the Kafka brokers and topic (placeholder addresses)
kafka_brokers = "kafka-broker1:9092,kafka-broker2:9092"
kafka_topic = "zalando_events"

# Define the schema of incoming JSON events
event_schema = StructType([
    StructField("event_type", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("user_id", StringType(), True),
    StructField("timestamp", TimestampType(), True)
])

# Read streaming data from Kafka
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_brokers) \
    .option("subscribe", kafka_topic) \
    .option("startingOffsets", "latest") \
    .load()

# Convert the binary value column to a string and parse the JSON payload
events = df.selectExpr("CAST(value AS STRING) as json_str") \
    .select(from_json(col("json_str"), event_schema).alias("data")) \
    .select("data.*")

# Write the stream to HDFS as Parquet files, partitioned by event_type
query = events.writeStream \
    .format("parquet") \
    .option("path", "hdfs://namenode:8020/data/zalando_events/") \
    .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/zalando_events/") \
    .partitionBy("event_type") \
    .outputMode("append") \
    .start()

query.awaitTermination()
        


What this does:

  • Connects to Kafka and reads from the zalando_events topic.
  • Casts the binary message value to a string and parses it as JSON using the defined schema.
  • Writes streaming data into HDFS as Parquet files, partitioned by event_type for efficient querying and batch processing later.
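
Once the Parquet files are in HDFS, downstream batch jobs can read them back and benefit from partition pruning. As a sketch (using the same placeholder paths), a daily aggregation of view counts per product might look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, count

spark = SparkSession.builder.appName("BatchEventAnalysis").getOrCreate()

# Read the Parquet dataset written by the streaming job
events = spark.read.parquet("hdfs://namenode:8020/data/zalando_events/")

# Partition pruning: only the "view" partition needs to be scanned here
daily_product_views = (
    events.filter(events.event_type == "view")
    .withColumn("event_date", to_date("timestamp"))
    .groupBy("event_date", "product_id")
    .agg(count("*").alias("views"))
)

daily_product_views.write.mode("overwrite") \
    .parquet("hdfs://namenode:8020/processed_data/daily_product_views/")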

Batch Processing with Hadoop MapReduce (Python Example)

For legacy batch jobs, Hadoop Streaming allows using Python scripts as mappers and reducers. Below is a simple MapReduce job that counts product views per product_id using Hadoop’s streaming interface.

mapper.py (reads lines of JSON, extracts product_id, and emits intermediate counts)

#!/usr/bin/env python3
import sys
import json

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        event = json.loads(line)
        if event.get("event_type") == "view":
            product_id = event.get("product_id", "unknown")
            print(f"{product_id}\t1")
    except Exception:
        # Malformed line or JSON - skip
        pass
        

reducer.py (aggregates counts by product_id)

#!/usr/bin/env python3
import sys

current_product = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    product_id, count_str = line.split("\t", 1)
    count = int(count_str)

    if current_product == product_id:
        current_count += count
    else:
        if current_product is not None:
            print(f"{current_product}\t{current_count}")
        current_product = product_id
        current_count = count

# Emit the count for the final product
if current_product is not None:
    print(f"{current_product}\t{current_count}")

Running the Job:

hadoop jar /usr/local/hadoop/hadoop-streaming.jar \
    -input hdfs://namenode:8020/data/zalando_events/ \
    -output hdfs://namenode:8020/output/product_view_counts \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
        

What this does:

  • Reads event data from HDFS.
  • Mapper emits (product_id, 1) for each view event.
  • Reducer aggregates counts per product_id and writes the final tally back to HDFS.
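
The job’s output is plain tab-separated text in HDFS, so it can be picked up by any downstream tool. As one option, the sketch below (assuming the same output path) loads it into Spark and lists the most-viewed products:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProductViewCounts").getOrCreate()

# The MapReduce job wrote tab-separated "product_id<TAB>count" lines into part-* files
counts = spark.read.csv(
    "hdfs://namenode:8020/output/product_view_counts/",
    sep="\t",
    schema="product_id STRING, view_count INT"
)

# Show the ten most-viewed products
counts.orderBy(counts.view_count.desc()).show(10)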

Training a Machine Learning Model with PySpark

Once large-scale aggregation or feature engineering is complete, data scientists can use PySpark’s MLlib to train models distributed across the cluster. For example, training a recommendation model (e.g., ALS for collaborative filtering):


from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder \
    .appName("ZalandoRecommendationModel") \
    .getOrCreate()

# Load aggregated interactions (user_id, product_id, rating or implicit feedback) from HDFS
# Note: ALS requires numeric user and item ID columns, so string IDs would need to be indexed first
ratings = spark.read.parquet("hdfs://namenode:8020/processed_data/user_product_interactions/")

# Split data into training and test sets
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# Configure the ALS model (with implicitPrefs=True the rating column is treated as implicit feedback strength)
als = ALS(
    userCol="user_id",
    itemCol="product_id",
    ratingCol="rating",
    implicitPrefs=True,
    coldStartStrategy="drop",
    maxIter=10,
    rank=20,
    regParam=0.1
)

# Train the model
model = als.fit(train)

# Evaluate the model with RMSE on the held-out set
predictions = model.transform(test)
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
print(f"RMSE on test data: {rmse}")

# Save the trained model to HDFS for later deployment
model.save("hdfs://namenode:8020/models/als_recommendation_model")

What this does:

  • Loads preprocessed user-product interactions from HDFS.
  • Uses ALS (Alternating Least Squares) to train a recommendation model in a distributed manner.
  • Evaluates the model on test data and saves it to HDFS for future deployment.
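
For completeness, the saved model can later be reloaded and used to produce batch recommendations with MLlib’s built-in methods. The sketch below assumes the same placeholder paths used above:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALSModel

spark = SparkSession.builder.appName("ServeRecommendations").getOrCreate()

# Reload the trained ALS model saved earlier
model = ALSModel.load("hdfs://namenode:8020/models/als_recommendation_model")

# Generate the top 10 product recommendations for every user
user_recs = model.recommendForAllUsers(10)

# Persist the recommendations for downstream serving layers to pick up
user_recs.write.mode("overwrite") \
    .parquet("hdfs://namenode:8020/processed_data/user_recommendations/")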


Performance and Scaling Tips

  • Cluster Configuration: Tune the number of executors, cores per executor, and memory settings in Spark for efficient cluster utilization. For large-scale jobs, monitoring with tools like Ganglia or Grafana helps keep resource usage balanced. A brief configuration sketch follows this list.
  • Data Partitioning: Partitioning data by frequently queried fields (e.g., event_type, date) reduces query latency and improves batch job performance.
  • Compression and Encoding: Using columnar formats like Parquet and applying compression (Snappy, Gzip) can significantly reduce storage and I/O overhead.
  • Caching Intermediate Results: When iterative operations or repeated queries occur, caching intermediate DataFrames or RDDs in Spark memory can speed up AI workloads.
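
The sketch below pulls several of these tips into one Spark session. The resource values and paths are purely illustrative and would need tuning for a real cluster:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

# Illustrative resource settings only; real values depend on cluster size and workload
spark = SparkSession.builder \
    .appName("TunedBatchJob") \
    .config("spark.executor.instances", "10") \
    .config("spark.executor.cores", "4") \
    .config("spark.executor.memory", "8g") \
    .config("spark.sql.parquet.compression.codec", "snappy") \
    .getOrCreate()

events = spark.read.parquet("hdfs://namenode:8020/data/zalando_events/")

# Cache a frequently reused intermediate result before iterative steps
views = events.filter(events.event_type == "view").cache()
views.count()  # materializes the cache

# Write results partitioned by date, compressed as Snappy Parquet
views.withColumn("event_date", to_date("timestamp")) \
    .write.mode("overwrite") \
    .partitionBy("event_date") \
    .parquet("hdfs://namenode:8020/processed_data/views_by_date/")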

Bringing It All Together

This integrated environment—Kafka for real-time data, Hadoop/HDFS for durable storage and batch analytics, and Python with PySpark for scalable AI/ML modeling—enables Zalando’s data science teams to handle data end-to-end:

  • Real-Time→Batch→AI: Kafka streams immediate user interactions, Hadoop stores historical data, and PySpark powers iterative training of advanced ML models.
  • Seamless Integration: Python ties the ecosystem together, supporting flexible scripting, experimentation, and the adoption of cutting-edge AI frameworks.
  • Continuous Improvement: As new data streams in, models update, driving informed, timely recommendations and decisions that enhance Zalando’s competitive edge.

References
