SlideShare a Scribd company logo
A ScyllaDB Community
Overcoming Distributed Databases
Scaling Challenges with Tablets
Dor Laor
Co-founder & CEO
Dor Laor
Co-founder of ScyllaDB
■ Something cool I’ve done: Dyed my hair
■ My perspective on P99s: Less is more
■ Another thing about me: Own a Phd in Snowboard
■ What I do away from work: Work remotely
Low p99 latency algorithm:
#1: write code
#2: Optimize
#3: if (fast) break
goto #2
Algorithm #2: rewrite in assembly
internal::memory_prefaulter::work(...) {
auto fault_in_memory = [] (char* p);
// Touch the page for write, but be sure not to modify anything
// The compilers tend to optimize things, so prefer assembly
asm volatile ("lock orb $0, %0" : "=&m"(*p));
while (start < end) {
fault_in_memory(start);
start += page_size;
}
};
Algorithm #3: Thread per core Architecture
Shards
Threads
Profit!
Most times, |server| > 1
Hmm, so local optimizations matter more!
Sometimes, |server| >> 1
Let’s Shard
Redis Shards
Cassandra Token Ring [2008]
■ Each node has one token range
■ Token ranges are assigned
randomly on node creation
■ A new node split an existing
range
■ Replica sets determined
algorithmically
Owned range
Replica set
A
D C
F
Advantages
■ Nodes can be added with no central coordination
Disadvantages
■ Nodes cannot be added without central coordination
■ Can only join/decommission one node at a time
■ Bad data distribution
● ./shardsim --nodes 12 --vnodes 1 --shards 1
maximum node overcommit: 3.42607
● Support for manual balancing by changing tokens
■ New nodes on stream from only three neighbors
● Generates uneven CPU, disk load during join
Cassandra Virtual Nodes (vnodes, 2012)
■ Each node has N (typically 32 or
256) randomly assigned tokens
Owned range
Replica set
A
A
Advantages
■ Better data distribution
$ ./shardsim --nodes 12 --vnodes 256 --shards 1
maximum node overcommit: 1.12073
Disadvantages
■ Many more replica sets
● Scans of small tables have to traverse 256*nr_nodes ranges
● Overhead can dominate scan time for near-empty tables
● 2 failures - Good likelihood of availability issue in a single zone deployment
■ Even worse for Cassandra secondary indexes (that are local to the node)
ScyllaDB Architecture: vnodes & shard per core
Homogeneous nodes Ring Architecture
2014 - ScyllaDB 0.1 shards
■ Introduce orthogonal
distribution, internal to the
node
■ Different nodes can have
different shard counts
Owned range
Replica set
Shard 0
Shard 1
Shard 2
Advantages
■ Per-shard data ownership removes need for locks
Disadvantages
■ Poor data distribution
● $ ./shardsim --nodes 12 --vnodes 256 --shards 30 --ignore-msb-bits 0
12 nodes, 256 vnodes, 30 shards
maximum node overcommit: 1.11148
maximum shard overcommit: 2.512377
■ Poor scan performance over near-empty tables
● Need to scan each vnode, then each shard’s intersection in the vnode
Owned range
Replica set
2016 - ScyllaDB 1.6 -
murmur3_ignore_msb_bits
■ Split token range into
4096*nr_shard ranges
■ Each shard owns 4096 ranges
Shard 0
Shard 1
Shard 2
Shard 0
Shard 0
Shard 0
Shard 1
Shard 1
Shard 1
Shard 2
Shard 2
Shard 2
Advantages
■ Better data distribution
● $ ./shardsim --nodes 12 --vnodes 256 --shards 30 --ignore-msb-bits 12
12 nodes, 256 vnodes, 30 shards
maximum node overcommit: 1.11791
maximum shard overcommit: 1.140437
Disadvantages
■ Even worse small-table performance
● 60 shards * 4096 per-shard subranges * 30 nodes * 256 vnodes = ???
■ Terrible repair/streaming with mismatched shard counts
■ Complexity, bugs
Hello Tablets
Tablet History: ClayDB > Bigtable > ScyllaDB
ScyllaDB Tablet
■ A small range of keys (tokens)
■ Size of 5GB
■ Dynamic, per table
■ Dynamically assigned to nodes
■ A full LSM tree {compaction, sstable files, memtable}
■ Small table -> fit in a single tablets, great scan/index
The Tablets Table
// Managed by the Raft protocol
// Replicated on every node
CREATE TABLE system.tablets (
table_id uuid,
last_token bigint,
keyspace_name text static,
replicas list<tuple<uuid, int>>,
new_replicas list<tuple<uuid, int>>,
session uuid,
stage text,
transition text,
table_name text static,
tablet_count int static,
PRIMARY KEY (table_id, last_token)
);
RAFT
Group 0
RAFT tablet table key1
key2
tablet
tablet
replica
tablet
replica
Tablets - balancing
Table starts with a few tablets.
Small tables end there
Not fragmented into tiny pieces
like with tokens
Tablets - balancing
When tablet becomes too heavy
(disk, CPU, …) it is split
Tablets - balancing
When tablet becomes too heavy
(disk, CPU, …) it is split
Tablets - balancing
The load balancer can decide to
move tablets
Tablets
Resharding is cheap.
SStables split at tablet boundary.
Reassign tablets to shards (logical operation).
Tablets
Cleanup after topology change is cheap.
Just delete SStables.
Tablet Scheduler
Scheduler globally controls movement, maintenance
operation on a per tablet basis
repair
migration
tablet 0
tablet 1
schedule schedule
repair
Backup
Tablet Scheduler
Goals:
■ Maximize throughput (saturate)
■ Keep migrations short (don’t overload)
Rules:
■ migrations-in <= 2 per shard
■ migrations-out <= 4 per shard
Sharding Considerations
■ Data distribution/fairness -> Perfect
■ Mixed node sizes, add/remove 5% in size
■ Multiple nodes operations - double capacity in minutes
■ Amount of node streams to/from single node
■ Scanning is efficient
■ Small tables are fast & efficient
■ Drivers are tablet-aware
■ But without reading the tablets table
● Driver contacts a random node/shard
● On miss, gets updated routing
information for that tablet
● Ensures fast start-up even with 100,000
entries in the tablets table
Driver Support
■ Fencing - each write is signed with topology version
■ If there is a version mismatch, the write doesn’t go through
Changes in the Data Plane
Replica
Coordinator
Topology
coordinator
R
e
q
(
V
1
)
Req(V 1
)
F
e
n
c
e
(
V
1
)
D
r
a
i
n
(
V
1
)
Req(V 2
)
Tablet
Tablet
Tablet
Tablet
Add node
Tablet
Tablet
Tablet
Tablet
Serving
reads/writes
Tablet
Tablet
Tablet
Tablet
Add node stream first tablet
Tablet
Tablet
Tablet
Tablet
Tablet
Tablet
Tablet
Tablet
Add node - start serving
Tablet
Tablet
Tablet
Tablet
Start Serving
reads/writes
Demo time
Not a demo, a test suite
Tablets => Serverless
Time
Volu
me
ie3n’s
i4i
Time
Thr
oug
hpu
t
Capacity
Required
Time
Thr
oug
hpu
t
On-demand
Base
Typeless Sizeless limitless
Serverless tests (even more)
Thank you! Let’s connect.
@DorLaor
Ad

More Related Content

Similar to Overcoming Distributed Databases Scaling Challenges with Tablets (20)

High-availability with Galera Cluster for MySQL
High-availability with Galera Cluster for MySQLHigh-availability with Galera Cluster for MySQL
High-availability with Galera Cluster for MySQL
FromDual GmbH
 
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDBScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB
 
Things you should know about Oracle truncate
Things you should know about Oracle truncateThings you should know about Oracle truncate
Things you should know about Oracle truncate
Kazuhiro Takahashi
 
UKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningUKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL Tuning
FromDual GmbH
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
ScyllaDB
 
Managing terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets bigManaging terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets big
Selena Deckelmann
 
Managing terabytes: When Postgres gets big
Managing terabytes: When Postgres gets bigManaging terabytes: When Postgres gets big
Managing terabytes: When Postgres gets big
Selena Deckelmann
 
Retaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingRetaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate Limiting
ScyllaDB
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
JawaharPrasad3
 
Quick Wins
Quick WinsQuick Wins
Quick Wins
HighLoad2009
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
DataStax
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
AbdulhseynAayev1
 
Elasticsearch Data Analyses
Elasticsearch Data AnalysesElasticsearch Data Analyses
Elasticsearch Data Analyses
Alaa Elhadba
 
MySQL HA
MySQL HAMySQL HA
MySQL HA
Kris Buytaert
 
HBase at Flurry
HBase at FlurryHBase at Flurry
HBase at Flurry
ddlatham
 
Apache Kudu
Apache KuduApache Kudu
Apache Kudu
Mike Frampton
 
Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013
Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013
Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013
odnoklassniki.ru
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
percona2013
 
M|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change MethodsM|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change Methods
MariaDB plc
 
High-availability with Galera Cluster for MySQL
High-availability with Galera Cluster for MySQLHigh-availability with Galera Cluster for MySQL
High-availability with Galera Cluster for MySQL
FromDual GmbH
 
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDBScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB
 
Things you should know about Oracle truncate
Things you should know about Oracle truncateThings you should know about Oracle truncate
Things you should know about Oracle truncate
Kazuhiro Takahashi
 
UKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningUKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL Tuning
FromDual GmbH
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
ScyllaDB
 
Managing terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets bigManaging terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets big
Selena Deckelmann
 
Managing terabytes: When Postgres gets big
Managing terabytes: When Postgres gets bigManaging terabytes: When Postgres gets big
Managing terabytes: When Postgres gets big
Selena Deckelmann
 
Retaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingRetaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate Limiting
ScyllaDB
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
JawaharPrasad3
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
DataStax
 
Elasticsearch Data Analyses
Elasticsearch Data AnalysesElasticsearch Data Analyses
Elasticsearch Data Analyses
Alaa Elhadba
 
HBase at Flurry
HBase at FlurryHBase at Flurry
HBase at Flurry
ddlatham
 
Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013
Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013
Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013
odnoklassniki.ru
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
percona2013
 
M|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change MethodsM|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change Methods
MariaDB plc
 

More from ScyllaDB (20)

Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
ScyllaDB
 
Leading a High-Stakes Database Migration
Leading a High-Stakes Database MigrationLeading a High-Stakes Database Migration
Leading a High-Stakes Database Migration
ScyllaDB
 
Achieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
Achieving Extreme Scale with ScyllaDB: Tips & TradeoffsAchieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
Achieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
ScyllaDB
 
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
ScyllaDB
 
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn IsarathamHow Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
ScyllaDB
 
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd ColemanHow Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
ScyllaDB
 
ScyllaDB: 10 Years and Beyond by Dor Laor
ScyllaDB: 10 Years and Beyond by Dor LaorScyllaDB: 10 Years and Beyond by Dor Laor
ScyllaDB: 10 Years and Beyond by Dor Laor
ScyllaDB
 
Reduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
Reduce Your Cloud Spend with ScyllaDB by Tzach LivyatanReduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
Reduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
ScyllaDB
 
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence LiuMigrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
ScyllaDB
 
Vector Search with ScyllaDB by Szymon Wasik
Vector Search with ScyllaDB by Szymon WasikVector Search with ScyllaDB by Szymon Wasik
Vector Search with ScyllaDB by Szymon Wasik
ScyllaDB
 
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
ScyllaDB
 
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
ScyllaDB
 
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
ScyllaDB
 
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
Object Storage in ScyllaDB by Ran Regev, ScyllaDBObject Storage in ScyllaDB by Ran Regev, ScyllaDB
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
ScyllaDB
 
Lessons Learned from Building a Serverless Notifications System by Srushith R...
Lessons Learned from Building a Serverless Notifications System by Srushith R...Lessons Learned from Building a Serverless Notifications System by Srushith R...
Lessons Learned from Building a Serverless Notifications System by Srushith R...
ScyllaDB
 
A Dist Sys Programmer's Journey into AI by Piotr Sarna
A Dist Sys Programmer's Journey into AI by Piotr SarnaA Dist Sys Programmer's Journey into AI by Piotr Sarna
A Dist Sys Programmer's Journey into AI by Piotr Sarna
ScyllaDB
 
High Availability: Lessons Learned by Paul Preuveneers
High Availability: Lessons Learned by Paul PreuveneersHigh Availability: Lessons Learned by Paul Preuveneers
High Availability: Lessons Learned by Paul Preuveneers
ScyllaDB
 
How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...
How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...
How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...
ScyllaDB
 
Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...
Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...
Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...
ScyllaDB
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
ScyllaDB
 
Leading a High-Stakes Database Migration
Leading a High-Stakes Database MigrationLeading a High-Stakes Database Migration
Leading a High-Stakes Database Migration
ScyllaDB
 
Achieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
Achieving Extreme Scale with ScyllaDB: Tips & TradeoffsAchieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
Achieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
ScyllaDB
 
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
ScyllaDB
 
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn IsarathamHow Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
ScyllaDB
 
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd ColemanHow Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
ScyllaDB
 
ScyllaDB: 10 Years and Beyond by Dor Laor
ScyllaDB: 10 Years and Beyond by Dor LaorScyllaDB: 10 Years and Beyond by Dor Laor
ScyllaDB: 10 Years and Beyond by Dor Laor
ScyllaDB
 
Reduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
Reduce Your Cloud Spend with ScyllaDB by Tzach LivyatanReduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
Reduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
ScyllaDB
 
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence LiuMigrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
ScyllaDB
 
Vector Search with ScyllaDB by Szymon Wasik
Vector Search with ScyllaDB by Szymon WasikVector Search with ScyllaDB by Szymon Wasik
Vector Search with ScyllaDB by Szymon Wasik
ScyllaDB
 
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
ScyllaDB
 
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
ScyllaDB
 
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
ScyllaDB
 
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
Object Storage in ScyllaDB by Ran Regev, ScyllaDBObject Storage in ScyllaDB by Ran Regev, ScyllaDB
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
ScyllaDB
 
Lessons Learned from Building a Serverless Notifications System by Srushith R...
Lessons Learned from Building a Serverless Notifications System by Srushith R...Lessons Learned from Building a Serverless Notifications System by Srushith R...
Lessons Learned from Building a Serverless Notifications System by Srushith R...
ScyllaDB
 
A Dist Sys Programmer's Journey into AI by Piotr Sarna
A Dist Sys Programmer's Journey into AI by Piotr SarnaA Dist Sys Programmer's Journey into AI by Piotr Sarna
A Dist Sys Programmer's Journey into AI by Piotr Sarna
ScyllaDB
 
High Availability: Lessons Learned by Paul Preuveneers
High Availability: Lessons Learned by Paul PreuveneersHigh Availability: Lessons Learned by Paul Preuveneers
High Availability: Lessons Learned by Paul Preuveneers
ScyllaDB
 
How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...
How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...
How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...
ScyllaDB
 
Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...
Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...
Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...
ScyllaDB
 
Ad

Recently uploaded (20)

Viam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdfViam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdf
camilalamoratta
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
Automate Studio Training: Building Scripts for SAP Fiori and GUI for HTML.pdf
Automate Studio Training: Building Scripts for SAP Fiori and GUI for HTML.pdfAutomate Studio Training: Building Scripts for SAP Fiori and GUI for HTML.pdf
Automate Studio Training: Building Scripts for SAP Fiori and GUI for HTML.pdf
Precisely
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
AI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of DocumentsAI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of Documents
UiPathCommunity
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Jignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah - The Innovator and Czar of ExchangesJignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah Innovator
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
Financial Services Technology Summit 2025
Financial Services Technology Summit 2025Financial Services Technology Summit 2025
Financial Services Technology Summit 2025
Ray Bugg
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Bepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firmBepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firm
Benard76
 
The Microsoft Excel Parts Presentation.pdf
The Microsoft Excel Parts Presentation.pdfThe Microsoft Excel Parts Presentation.pdf
The Microsoft Excel Parts Presentation.pdf
YvonneRoseEranista
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Safe Software
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
Viam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdfViam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdf
camilalamoratta
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
Automate Studio Training: Building Scripts for SAP Fiori and GUI for HTML.pdf
Automate Studio Training: Building Scripts for SAP Fiori and GUI for HTML.pdfAutomate Studio Training: Building Scripts for SAP Fiori and GUI for HTML.pdf
Automate Studio Training: Building Scripts for SAP Fiori and GUI for HTML.pdf
Precisely
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
AI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of DocumentsAI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of Documents
UiPathCommunity
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Jignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah - The Innovator and Czar of ExchangesJignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah Innovator
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
Financial Services Technology Summit 2025
Financial Services Technology Summit 2025Financial Services Technology Summit 2025
Financial Services Technology Summit 2025
Ray Bugg
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Bepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firmBepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firm
Benard76
 
The Microsoft Excel Parts Presentation.pdf
The Microsoft Excel Parts Presentation.pdfThe Microsoft Excel Parts Presentation.pdf
The Microsoft Excel Parts Presentation.pdf
YvonneRoseEranista
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Safe Software
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
Ad

Overcoming Distributed Databases Scaling Challenges with Tablets

  • 1. A ScyllaDB Community Overcoming Distributed Databases Scaling Challenges with Tablets Dor Laor Co-founder & CEO
  • 2. Dor Laor Co-founder of ScyllaDB ■ Something cool I’ve done: Dyed my hair ■ My perspective on P99s: Less is more ■ Another thing about me: Own a Phd in Snowboard ■ What I do away from work: Work remotely
  • 3. Low p99 latency algorithm: #1: write code #2: Optimize #3: if (fast) break goto #2
  • 4. Algorithm #2: rewrite in assembly internal::memory_prefaulter::work(...) { auto fault_in_memory = [] (char* p); // Touch the page for write, but be sure not to modify anything // The compilers tend to optimize things, so prefer assembly asm volatile ("lock orb $0, %0" : "=&m"(*p)); while (start < end) { fault_in_memory(start); start += page_size; } };
  • 5. Algorithm #3: Thread per core Architecture Shards Threads
  • 7. Most times, |server| > 1 Hmm, so local optimizations matter more!
  • 11. Cassandra Token Ring [2008] ■ Each node has one token range ■ Token ranges are assigned randomly on node creation ■ A new node split an existing range ■ Replica sets determined algorithmically Owned range Replica set A D C F
  • 12. Advantages ■ Nodes can be added with no central coordination
  • 13. Disadvantages ■ Nodes cannot be added without central coordination ■ Can only join/decommission one node at a time ■ Bad data distribution ● ./shardsim --nodes 12 --vnodes 1 --shards 1 maximum node overcommit: 3.42607 ● Support for manual balancing by changing tokens ■ New nodes on stream from only three neighbors ● Generates uneven CPU, disk load during join
  • 14. Cassandra Virtual Nodes (vnodes, 2012) ■ Each node has N (typically 32 or 256) randomly assigned tokens Owned range Replica set A A
  • 15. Advantages ■ Better data distribution $ ./shardsim --nodes 12 --vnodes 256 --shards 1 maximum node overcommit: 1.12073
  • 16. Disadvantages ■ Many more replica sets ● Scans of small tables have to traverse 256*nr_nodes ranges ● Overhead can dominate scan time for near-empty tables ● 2 failures - Good likelihood of availability issue in a single zone deployment ■ Even worse for Cassandra secondary indexes (that are local to the node)
  • 17. ScyllaDB Architecture: vnodes & shard per core Homogeneous nodes Ring Architecture
  • 18. 2014 - ScyllaDB 0.1 shards ■ Introduce orthogonal distribution, internal to the node ■ Different nodes can have different shard counts Owned range Replica set Shard 0 Shard 1 Shard 2
  • 19. Advantages ■ Per-shard data ownership removes need for locks
  • 20. Disadvantages ■ Poor data distribution ● $ ./shardsim --nodes 12 --vnodes 256 --shards 30 --ignore-msb-bits 0 12 nodes, 256 vnodes, 30 shards maximum node overcommit: 1.11148 maximum shard overcommit: 2.512377 ■ Poor scan performance over near-empty tables ● Need to scan each vnode, then each shard’s intersection in the vnode
  • 21. Owned range Replica set 2016 - ScyllaDB 1.6 - murmur3_ignore_msb_bits ■ Split token range into 4096*nr_shard ranges ■ Each shard owns 4096 ranges Shard 0 Shard 1 Shard 2 Shard 0 Shard 0 Shard 0 Shard 1 Shard 1 Shard 1 Shard 2 Shard 2 Shard 2
  • 22. Advantages ■ Better data distribution ● $ ./shardsim --nodes 12 --vnodes 256 --shards 30 --ignore-msb-bits 12 12 nodes, 256 vnodes, 30 shards maximum node overcommit: 1.11791 maximum shard overcommit: 1.140437
  • 23. Disadvantages ■ Even worse small-table performance ● 60 shards * 4096 per-shard subranges * 30 nodes * 256 vnodes = ??? ■ Terrible repair/streaming with mismatched shard counts ■ Complexity, bugs
  • 25. Tablet History: ClayDB > Bigtable > ScyllaDB
  • 26. ScyllaDB Tablet ■ A small range of keys (tokens) ■ Size of 5GB ■ Dynamic, per table ■ Dynamically assigned to nodes ■ A full LSM tree {compaction, sstable files, memtable} ■ Small table -> fit in a single tablets, great scan/index
  • 27. The Tablets Table // Managed by the Raft protocol // Replicated on every node CREATE TABLE system.tablets ( table_id uuid, last_token bigint, keyspace_name text static, replicas list<tuple<uuid, int>>, new_replicas list<tuple<uuid, int>>, session uuid, stage text, transition text, table_name text static, tablet_count int static, PRIMARY KEY (table_id, last_token) );
  • 28. RAFT Group 0 RAFT tablet table key1 key2 tablet tablet replica tablet replica
  • 29. Tablets - balancing Table starts with a few tablets. Small tables end there Not fragmented into tiny pieces like with tokens
  • 30. Tablets - balancing When tablet becomes too heavy (disk, CPU, …) it is split
  • 31. Tablets - balancing When tablet becomes too heavy (disk, CPU, …) it is split
  • 32. Tablets - balancing The load balancer can decide to move tablets
  • 33. Tablets Resharding is cheap. SStables split at tablet boundary. Reassign tablets to shards (logical operation).
  • 34. Tablets Cleanup after topology change is cheap. Just delete SStables.
  • 35. Tablet Scheduler Scheduler globally controls movement, maintenance operation on a per tablet basis repair migration tablet 0 tablet 1 schedule schedule repair Backup
  • 36. Tablet Scheduler Goals: ■ Maximize throughput (saturate) ■ Keep migrations short (don’t overload) Rules: ■ migrations-in <= 2 per shard ■ migrations-out <= 4 per shard
  • 37. Sharding Considerations ■ Data distribution/fairness -> Perfect ■ Mixed node sizes, add/remove 5% in size ■ Multiple nodes operations - double capacity in minutes ■ Amount of node streams to/from single node ■ Scanning is efficient ■ Small tables are fast & efficient
  • 38. ■ Drivers are tablet-aware ■ But without reading the tablets table ● Driver contacts a random node/shard ● On miss, gets updated routing information for that tablet ● Ensures fast start-up even with 100,000 entries in the tablets table Driver Support
  • 39. ■ Fencing - each write is signed with topology version ■ If there is a version mismatch, the write doesn’t go through Changes in the Data Plane Replica Coordinator Topology coordinator R e q ( V 1 ) Req(V 1 ) F e n c e ( V 1 ) D r a i n ( V 1 ) Req(V 2 )
  • 41. Tablet Tablet Tablet Tablet Add node stream first tablet Tablet Tablet Tablet Tablet
  • 42. Tablet Tablet Tablet Tablet Add node - start serving Tablet Tablet Tablet Tablet Start Serving reads/writes
  • 44. Not a demo, a test suite
  • 47. Thank you! Let’s connect. @DorLaor
  翻译: