Efficient and Reliable Data Processing with Hazelcast

In many business scenarios, it is crucial to ensure that the same data is not processed in parallel. For example:

  • A customer message should be sent only once to a given webhook endpoint to avoid duplication.
  • An uploaded file should undergo virus scanning and content validation only once because these are resource-intensive processes. This is especially important when building a cost-effective cloud system.

At the same time, achieving better throughput often requires distributing computations across multiple nodes. While purchasing an external service might be an option, you can achieve this goal with minimal effort and investment by building a solution yourself!


Hazelcast as a Solution

Hazelcast, a distributed in-memory data grid, provides a powerful way to address such challenges.

One example I encountered involved customer-agent messaging, where millions of messages were exchanged through webhook endpoints. These endpoints could be third-party services, and delivery was not always guaranteed. As a result, I needed a retry mechanism that would not block conversations destined for other endpoints.

To support this, I had to:

  • Cache failed conversation information for fast access.
  • Ensure that the same message wasn’t sent multiple times.

I leveraged Hazelcast by storing the conversation ID and webhook endpoint ID in a distributed map, while the actual messages were stored in an external database because of their size and volume. Using a Hazelcast distributed map in the retry process ensured:

  • Message Uniqueness: Messages were sent only once, even in a distributed system, as each key in the map was owned and managed by a specific member of the cluster.
  • High Availability and Fault Tolerance: Data was replicated across nodes, ensuring availability even in the event of a node failure.

Embedding Hazelcast directly into the application made it even more efficient. This approach eliminated the need for additional external services while keeping the setup simple and lightweight.
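As a minimal sketch of this idea (map and key names are hypothetical, not from the original project), an embedded member can register a failed conversation with an atomic putIfAbsent, so even across many instances only one registration wins:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class RetryDedupExample {
    public static void main(String[] args) {
        // Start an embedded Hazelcast member inside the application process.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Key: conversation ID; value: webhook endpoint ID.
        // The message bodies themselves live in an external database.
        IMap<String, String> failedConversations = hz.getMap("failed-conversations");

        // putIfAbsent is atomic cluster-wide: only the first caller
        // registers the retry, so the message is scheduled exactly once.
        String previous = failedConversations.putIfAbsent("conversation-42", "webhook-7");
        if (previous == null) {
            System.out.println("registered for retry");
        } else {
            System.out.println("already pending, skipping");
        }

        // Once delivery finally succeeds, drop the entry.
        failedConversations.remove("conversation-42");

        Hazelcast.shutdownAll();
    }
}
```

Because the key's owner serializes all operations on that key, the same guarantee holds no matter which instance calls putIfAbsent first.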

Distributed Retry Message Processing

Key Features of Hazelcast Distributed Maps

1. Data Redundancy and Fault Tolerance

  • Each entry in the map can have multiple replicas across nodes, ensuring data availability even in case of node failure.
  • Only one node acts as the owner of an entry, managing its lifecycle and propagating updates to replicas.
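The number of replicas is configurable per map. As a sketch (the map name is hypothetical), one synchronous backup means every entry survives a single node failure:

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class BackupConfigExample {
    public static void main(String[] args) {
        Config config = new Config();
        // One synchronous backup: every entry is replicated to one other
        // member, so losing a single node loses no data.
        config.getMapConfig("failed-conversations").setBackupCount(1);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        System.out.println("backups: "
                + hz.getConfig().getMapConfig("failed-conversations").getBackupCount());
        Hazelcast.shutdownAll();
    }
}
```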

2. Local Access and Processing Guarantees

  • If your application runs multiple instances, each instance can query the entries it owns locally; because every key has a single owner, no other instance will process the same data.
  • Once a conversation is restored, its entry can be safely removed from the map.
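The pattern above can be sketched with IMap.localKeySet(), which returns only the keys owned by the current member (names are again hypothetical):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class LocalRetryWorker {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> failedConversations = hz.getMap("failed-conversations");
        failedConversations.put("conversation-1", "webhook-7");

        // localKeySet() returns only the keys this member owns, so when
        // every instance iterates its own local keys, each entry is
        // processed by exactly one node.
        for (String conversationId : failedConversations.localKeySet()) {
            String endpointId = failedConversations.get(conversationId);
            // retryDelivery(conversationId, endpointId) would run here.
            // Once the conversation is restored, remove it from the map.
            failedConversations.remove(conversationId);
        }

        System.out.println("pending after retry pass: " + failedConversations.size());
        Hazelcast.shutdownAll();
    }
}
```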


A More Complex Use Case: File Processing

Consider a scenario where users upload multiple files to a server. Only supported file types should be processed and stored in a designated storage solution like Amazon S3 or Google Cloud Storage. The workflow looks like this:

1. Virus Scanning

Before validation, all files must undergo a virus scan to ensure security. This is a resource-intensive process.

2. Content Validation

After passing the virus scan, files are validated to ensure they meet the required criteria.

3. Storage

Once a file passes both checks, it is transferred to the final storage location.

To handle varying workloads, the system must be:

  • Elastic: Dynamically scaling up or down based on the number of uploaded files.
  • Fault-Tolerant: Preventing data loss and ensuring reliability.

Hazelcast makes it possible to distribute tasks across multiple nodes efficiently, ensuring optimal resource utilization and system reliability.
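One way to sketch this distribution (task and key names are illustrative, not from the original system) is Hazelcast's distributed executor service, routing each file's work to the member that owns its key:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class FilePipelineExample {
    // Tasks must be serializable so Hazelcast can ship them to other members.
    static class ScanAndValidate implements Callable<String>, Serializable {
        private final String fileId;
        ScanAndValidate(String fileId) { this.fileId = fileId; }
        @Override public String call() {
            // Virus scanning and content validation would run here,
            // on whichever cluster member owns the key.
            return fileId + ": clean";
        }
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService executor = hz.getExecutorService("file-tasks");

        // Routing by file ID keeps each file's work on a single member,
        // so the same file is never scanned twice in parallel.
        Future<String> result =
                executor.submitToKeyOwner(new ScanAndValidate("upload-123"), "upload-123");
        System.out.println(result.get());

        Hazelcast.shutdownAll();
    }
}
```

As the cluster scales out, keys (and therefore tasks) rebalance automatically across the new members, which is what gives the pipeline its elasticity.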

File Processing Pipeline

Try It Out

If you would like to try a working example, check out the project on GitHub!

The solution can be implemented and deployed on Kubernetes, AWS, Google Cloud, or other cloud platforms to leverage auto-scaling, load balancing, and other cloud-native features. It can also run on-premises or in hybrid environments. For experimentation or fun, you can even deploy it on a Raspberry Pi 🙂

Hazelcast is an excellent tool for scenarios where data consistency, fault tolerance, and efficient task distribution are critical. Its simplicity and power make it a great choice for solving both common and complex use cases without the need for additional services.

Andras Fejes

Senior Software Development Engineer at LivePerson | Full Stack Engineer | Backend Specialist | Java, JEE, Spring, Quarkus

#Hazelcast #Architecture #DistributedProcessing #FaultTolerant
