Partition of Dataset? Why it is important?
Partition of Dataset? Why it is important?
Problem: A website can become extremely popular in a relatively short time frame. The increased load might swamp and degrade the performance of the website. As volume of data increases application performance starts degrading. So There should be strategy to arrange data in such a way to optimize storage performance for a scalable, reliable high performance system.
Below are few Strategies:
Horizontal Partitioning (Sharding): Divides the dataset into rows, with each partition containing a subset of the rows. This is useful when different subsets of data are accessed or updated independently.
Range Partitioning: Divides the dataset based on a range of values. For example, data can be partitioned by date ranges, such as one partition for each month.
Recommended by LinkedIn
List Partitioning: Divides the dataset based on a predefined list of values. For example, orders might be partitioned by order status, with separate partitions for ‘pending’, ‘shipped’, and ‘delivered’ orders.
Round-Robin Partitioning: Distributes data evenly across all partitions in a cyclic manner. This method is simple and ensures a balanced load.
Static hash partitioning: Uses a hash function to determine the partition for each data item. This ensures an even distribution of data across partitions.
Consistent hashing: Consistent hashing is a distributed systems technique that operates by assigning the data objects and nodes a position on a virtual ring structure (hash ring). Consistent hashing minimizes the number of keys to be remapped when the total number of nodes changes.
Benefits of Partitioning (Why it is important?)