Cassandra Data Modeling Workshop
  Matthew F. Dennis // @mdennis
Overview
●   Hopefully interactive
●   Use cases submitted via Google Moderator,
    email, IRC, etc
●   Interesting and/or common requests in the
    slides to get us started
●   Bring up others if you have them!
Data Modeling Goals
●   Keep data queried together on disk together
●   In a more general sense think about the
    efficiency of querying your data and work
    backward from there to a model in Cassandra
●   Don't try to normalize your data (contrary to
    many use cases in relational databases)
●   Usually better to keep a record that something
    happened as opposed to changing a value (not
    always advisable or possible)
ClickStream Data
                     (use case #1)

●   A ClickStream (in this context) is the sequence
    of actions a user of an application performs
●   Usually this refers to clicking links in a WebApp
●   Useful for ad selection, error recording, UI/UX
    improvement, A/B testing, debugging, et cetera
●   Not a lot of detail in the Google Moderator
    request on what the purpose of collecting the
    ClickStream data was – so I made some up
ClickStream Data Defined
●   Record actions of a user within a session for
    debugging purposes if app/browser/page/server
    crashes
Recording Sessions
●   CF for sessions a user has had
    ●   Row Key is user name/id
    ●   Column Name is session id (TimeUUID)
    ●   Column Value is empty (or length of session, or some
        aggregated details about the session after it ended)
●   CF for actual sessions
    ●   Row Key is TimeUUID session id
    ●   Column Name is timestamp/TimeUUID of each click
    ●   Column Value is details about that click (serialized)
UserSessions Column Family
              Session_01    Session_02    Session_03
    userId    (TimeUUID)    (TimeUUID)    (TimeUUID)

              (empty/agg)   (empty/agg)   (empty/agg)


●   Most recent session
●   All sessions for a given time period
Sessions Column Family
                 timestamp_01 timestamp_02 timestamp_03
 SessionId
(TimeUUID)          ClickData      ClickData      ClickData
                 (json/xml/etc) (json/xml/etc) (json/xml/etc)



●   Retrieve entire session's ClickStream (row)
●   Order of clicks/events preserved
●   Retrieve ClickStream for a slice of time within the session
●   First action taken in a session
●   Most recent action taken in a session
●   Why JSON/XML/etc?
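The two column families and the queries listed for them can be sketched in memory. This is only an illustration of the access pattern, not Cassandra client code: the `Row` class, `record_click`, and the use of plain integers in place of TimeUUIDs are all assumptions made to keep the sketch self-contained.

```python
import bisect
import json


class Row:
    """A row whose columns stay sorted by name, as columns in a Cassandra row do."""

    def __init__(self):
        self.names = []
        self.values = {}

    def insert(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)
        self.values[name] = value

    def slice(self, start=None, end=None, count=None, reverse=False):
        """Return (name, value) pairs with start <= name <= end, in column order."""
        lo = 0 if start is None else bisect.bisect_left(self.names, start)
        hi = len(self.names) if end is None else bisect.bisect_right(self.names, end)
        names = self.names[lo:hi]
        if reverse:
            names = names[::-1]
        return [(n, self.values[n]) for n in names[:count]]


user_sessions = {}  # row key: user id, column name: session id, value: empty
sessions = {}       # row key: session id, column name: click time, value: click data


def record_click(user_id, session_id, ts, details):
    # Integers stand in for TimeUUIDs here (assumption); both sort by time.
    user_sessions.setdefault(user_id, Row()).insert(session_id, "")
    sessions.setdefault(session_id, Row()).insert(ts, json.dumps(details))


record_click("u1", 100, 101, {"page": "/home"})
record_click("u1", 100, 105, {"page": "/cart"})
record_click("u1", 200, 201, {"page": "/home"})
record_click("u1", 200, 207, {"page": "/pay"})

most_recent_session = user_sessions["u1"].slice(count=1, reverse=True)[0][0]
first_action = sessions[100].slice(count=1)[0]
last_action = sessions[200].slice(count=1, reverse=True)[0]
time_slice = sessions[100].slice(start=102, end=110)
```

Every query on the two slides is a single slice of one row, which is what keeps data queried together on disk together.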
Alternatives?
Of Course
         (depends on what you want to do)
●   Secondary Indexes
●   All Sessions in one row
●   Track by time of activity instead of session
Secondary Indexes Applied
●   Drop UserSessions CF and use secondary
    indexes
●   Uses a “well known” column to record the user
    in the row; secondary index is created on that
    column
●   Doesn't work so well when storing aggregates
    about sessions in the UserSessions CF
●   Better when you want to retrieve all sessions a
    user has had
All Sessions In One Row Applied
●   Row Key is userId
●   Column Name is composite of timestamp and
    sessionId
●   Can efficiently request activity of a user across
    all sessions within a specific time range
●   Rows could potentially grow quite large, be
    careful
●   Reads will almost always require at least two
    seeks on disk
Time Period Partitioning Applied
●   Row Key is composite of userId and time “bucket”
    ●   e.g. jan_2011 or jan_01_2011 for month or day buckets respectively
●   Column Name is TimeUUID of click
●   Column Value is serialized click data
●   Avoids always requiring multiple seeks when the user has old
    data but only recent data is requested
●   Easy to lazily aggregate old activity
●   Can still efficiently request activity of a user across all
    sessions within a specific time range
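A minimal sketch of how an application might build the composite row keys for month buckets and work out which rows a time range touches. The `:` separator and the function names are assumptions; only the `jan_2011` bucket format comes from the slide.

```python
from datetime import date

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def bucket_key(user_id, day):
    """Row key for the month bucket that contains `day`."""
    return f"{user_id}:{MONTHS[day.month - 1]}_{day.year}"


def bucket_keys(user_id, start, end):
    """Row keys for every month bucket touched by the range start..end inclusive."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"{user_id}:{MONTHS[month - 1]}_{year}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


keys = bucket_keys("user3", date(2010, 11, 20), date(2011, 1, 5))
```

A request for recent data only reads the newest row, so old buckets are never touched unless the range asks for them.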
Rolling Time Window Of Data Points
                    (use case #2)
●   The example given was something similar to RRDTool
●   Essentially store a series of data points within a
    rolling window
●   Common request from Cassandra users for this
    and/or similar
Data Points Defined
●   Each data point has a value (or multiple values)
●   Each data point corresponds to a specific point
    in time or an interval/bucket (e.g. 5th minute of
    17th hour on some date)
Time Window Model
              System7:RenderTime

               TimeUUID0   TimeUUID1     TimeUUID2

    s7:rt        0.051       0.014          0.173

                                     Some request took 0.014 seconds to render


●   Row Key is the id of the time window data you are
    tracking (e.g. server7:render_time)
●   Column Name is timestamp (or TimeUUID) the event
    occurred at
●   Column Value is the value of the event (e.g. 0.051)
The Details
●   Cassandra TTL values are key here
    ●   When you insert each data point set the TTL to the max time
        range you will ever request; there is very little overhead to
        expiring columns
●   When querying, construct TimeUUIDs for the min/max of
    the time range in question and use them as the start/end
    in your get_slice call
●   Consider partitioning the rows by a known time period
    (e.g. “year”) if you plan on keeping a long history of data
    (NB: requires slightly more complex logic in the app if a
    time range spans such a period)
●   Very efficient queries for any window of time
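Constructing TimeUUIDs for the two ends of a time range might look like the sketch below, which uses only Python's standard `uuid` module. The helper names are hypothetical, and filling the clock sequence and node with all-zero and all-one bits is an assumption: how a given Cassandra version orders TimeUUIDs that share a timestamp is not covered here, so check that before relying on exact boundary behavior.

```python
import uuid
from datetime import datetime, timezone

# Number of 100 ns intervals between the UUID epoch (1582-10-15) and 1970-01-01.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def uuid_timestamp(dt):
    """Convert an aware datetime to the 60-bit timestamp of a version 1 UUID."""
    delta = dt - datetime(1970, 1, 1, tzinfo=timezone.utc)
    micros = (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * 10 + UUID_EPOCH_OFFSET


def time_uuid(dt, clock_seq, node):
    """Build a version 1 UUID for `dt` with the given clock sequence and node."""
    ts = uuid_timestamp(dt)
    return uuid.UUID(fields=(
        ts & 0xFFFFFFFF,                    # time_low
        (ts >> 32) & 0xFFFF,                # time_mid
        ((ts >> 48) & 0x0FFF) | 0x1000,     # time_hi with version 1
        ((clock_seq >> 8) & 0x3F) | 0x80,   # clock_seq_hi with RFC 4122 variant
        clock_seq & 0xFF,                   # clock_seq_low
        node,
    ))


range_start = datetime(2011, 1, 7, tzinfo=timezone.utc)
range_end = datetime(2011, 1, 8, tzinfo=timezone.utc)
slice_start = time_uuid(range_start, 0, 0)
slice_end = time_uuid(range_end, 0x3FFF, 0xFFFFFFFFFFFF)
```

`slice_start` and `slice_end` would then be passed as the start and end column names of the slice.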
Rolling Window Of Counters
                (use case #3)
●   “How to model rolling time window that contains counters with time
    buckets of monthly (12 months), weekly (4 weeks), daily (7 days),
    hourly (24 hours)? Example would be; how many times user logged
    into a system in last 24 hours, last 7 days ...”
●   Timezones and “rolling window” are what make this interesting
Rolling Time Window Details
●   One row for every granularity you want to track
    (e.g. day, hour)
●   Row Key consists of the granularity, metric, user
    and system
●   Column Name is a “fixed” time bucket on UTC time
●   Column Values are counts of the logins in that
    bucket
●   get_slice calls to return multiple counters which
    are then summed up
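A rough in-memory sketch of the layout described above. The row key format follows the `user3:system5:logins:by_day` example in the model; the `counters` dictionary, `record_login`, and `sum_slice` are hypothetical stand-ins for counter columns and a get_slice followed by a sum.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

# row key -> {column name (UTC time bucket) -> count}
counters = defaultdict(Counter)


def record_login(user, system, when):
    """Increment the day and hour buckets (fixed on UTC) for one login."""
    when = when.astimezone(timezone.utc)
    counters[f"{user}:{system}:logins:by_day"][when.strftime("%Y%m%d")] += 1
    counters[f"{user}:{system}:logins:by_hour"][when.strftime("%Y%m%d%H")] += 1


def sum_slice(row_key, start, end):
    """Sum the counters whose bucket name lies in start..end inclusive."""
    row = counters[row_key]
    return sum(count for bucket, count in row.items() if start <= bucket <= end)


record_login("user3", "system5", datetime(2011, 1, 7, 10, 15, tzinfo=timezone.utc))
record_login("user3", "system5", datetime(2011, 1, 7, 15, 40, tzinfo=timezone.utc))
record_login("user3", "system5", datetime(2011, 1, 9, 8, 5, tzinfo=timezone.utc))

jan_7_to_8 = sum_slice("user3:system5:logins:by_day", "20110107", "20110108")
jan_7_hour_10 = sum_slice("user3:system5:logins:by_hour", "2011010710", "2011010710")
```

Because the bucket names are fixed-width digit strings, comparing them as strings gives the same order as comparing the times they represent.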
Rolling Time Window Counter Model
                     user3:system5:logins:by_day

                                     20110107          ...          20110523
            U3:S5:L:D
                                        2              ...               7

    2 logins on Jan 7th 2011           7 logins on May 23rd 2011
    for user 3 on system 5               for user 3 on system 5


                    user3:system5:logins:by_hour

                                    2011010710         ...         2011052316
            U3:S5:L:H
                                        1              ...               7

one login for user 3 on system 5     2 logins for user 3 on system 5
on Jan 7th 2011 for the 10th hour   on May 23rd 2011 for the 16th hour
Rolling Time Window Queries
●   Time window is rolling and there are other
    timezones besides UTC
    ●   one get_slice for the “middle” counts
    ●   one get_slice for the “left end”
    ●   one get_slice for the “right end”
Example: logins for the past 7 days
●   Determine date/time boundaries
●   Determine UTC days that are wholly contained
    within your boundaries to select and sum
●   Select and sum counters for the remaining hours
    on either side of the UTC days
●   O(1) queries (3 in this case), can be requested
    from C* in parallel
●   NB: some timezones are annoying (e.g. 15 or 30
    minute offsets); I try to ignore them
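The boundary logic for a rolling window can be sketched as below. It assumes the window bounds are aligned to the hour and that the bucket names follow the `20110523` and `2011052316` formats from the model; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def hour_buckets(start, end):
    """Hour bucket names for start <= t < end."""
    out = []
    while start < end:
        out.append(start.strftime("%Y%m%d%H"))
        start += timedelta(hours=1)
    return out


def day_buckets(start, end):
    """Day bucket names for start <= t < end, both at UTC midnight."""
    out = []
    while start < end:
        out.append(start.strftime("%Y%m%d"))
        start += timedelta(days=1)
    return out


def split_window(start, end):
    """Split a window into (left hours, whole UTC days, right hours)."""
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    # First UTC midnight at or after start, last UTC midnight at or before end.
    first_midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if first_midnight < start:
        first_midnight += timedelta(days=1)
    last_midnight = end.replace(hour=0, minute=0, second=0, microsecond=0)
    if first_midnight >= last_midnight:
        # No whole UTC day fits, so everything comes from the hourly row.
        return hour_buckets(start, end), [], []
    return (hour_buckets(start, first_midnight),
            day_buckets(first_midnight, last_midnight),
            hour_buckets(last_midnight, end))


# Past 7 days for a user whose local time is UTC-5 (a made-up example).
local = timezone(timedelta(hours=-5))
window_end = datetime(2011, 5, 23, 16, 0, tzinfo=local)
left, middle, right = split_window(window_end - timedelta(days=7), window_end)
```

The first and last name in each of the three lists are the start and end of one slice, and the three slices can be issued in parallel.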
Alternatives?
                         (of course)
●   If you're counting logins and each user doesn't log
    in hundreds of times a day, just have one row per
    user with a TimeUUID column name for the time the
    login occurred
●   Supports any timezone/range/granularity easily
●   More expensive for large ranges (e.g. year)
    regardless of granularity, so cache results (in C*)
    lazily.
●   NB: caching results for rolling windows is not usually
    helpful (because, well it's rolling and always changes)
Eventually Atomic
                            (use case #4)
●   “When there are many to many or one to many relations involved how
    to model that and also keep it atomic? for eg: one user can upload
    many pictures and those pictures can somehow be related to other
    users as well.”
●   Attempting full ACID compliance in distributed systems is a bad idea
    (and impossible in the general sense)
●   However, consistency is important and can certainly be achieved in
    C*
●   Many approaches / alternatives
●   I like the transaction log approach, especially in the context of C*
Transaction Logs
                   (in this context)
●   Records what is going to be performed before it
    is actually performed
●   Performs the actions that need to be atomic (in
    the indivisible sense, not the all at once sense)
●   Marks that the actions were performed
In Cassandra
●   Serialize all actions that need to be performed
    in a single column – JSON, XML, YAML (yuck!),
    cpickle, JSO, et cetera
    ●   Row Key = randomly chosen C* node token
    ●   Column Name = TimeUUID
●   Perform actions
●   Delete Column
Configuration Details
●   Short GC_Grace on the XACT_LOG Column
    Family (e.g. 1 hour)
●   Write to XACT_LOG at CL.QUORUM or
    CL.LOCAL_QUORUM for durability (if it fails
    with an unavailable exception, pick a different
    node token and/or node and try again; same
    semantics as a traditional relational DB)
●   1M memtable ops, 1 hour memtable flush time
Failures
●   Before insert into the XACT_LOG
●   After insert, before actions
●   After insert, in middle of actions
●   After insert, after actions, before delete
●   After insert, after actions, after delete
Recovery
●   Each C* node has a cron job offset from every other
    by some time period
●   Each job runs the same code: multiget_slice for
    all node tokens for all columns older than some
    time period
●   Any columns found need to be replayed in their
    entirety and are deleted after replay (normally
    there are no columns because normally things
    are working normally)
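The log, perform, delete cycle and the recovery job can be sketched in memory as follows. Nothing here talks to Cassandra: the dictionaries stand in for the XACT_LOG column family and the application's own column families, the token names are made up, and the write time is stored beside the payload rather than read back from the TimeUUID to keep the sketch short.

```python
import json
import random
import time
import uuid

NODE_TOKENS = ["token_a", "token_b", "token_c"]   # hypothetical node tokens
xact_log = {token: {} for token in NODE_TOKENS}   # stand-in for XACT_LOG
store = {}                                        # stand-in for application data


def apply_actions(actions):
    """Apply idempotent writes; setting a key to the same value twice is harmless."""
    for key, value in actions:
        store[key] = value


def run_transaction(actions, fail_midway=False):
    token = random.choice(NODE_TOKENS)            # row key
    column = uuid.uuid1()                         # column name
    xact_log[token][column] = (time.time(), json.dumps(actions))  # 1. record intent
    if fail_midway:
        apply_actions(actions[:1])                # simulate a crash after one action
        return
    apply_actions(actions)                        # 2. perform the actions
    del xact_log[token][column]                   # 3. mark them as performed


def recover(older_than_seconds, now=None):
    """Replay and delete every logged transaction older than the cutoff."""
    now = time.time() if now is None else now
    replayed = 0
    for token in NODE_TOKENS:
        for column, (written_at, payload) in list(xact_log[token].items()):
            if now - written_at >= older_than_seconds:
                apply_actions(json.loads(payload))
                del xact_log[token][column]
                replayed += 1
    return replayed


run_transaction([["user:1:pics", "p9"], ["pic:p9:tagged", "user:2"]], fail_midway=True)
half_done = "pic:p9:tagged" not in store
replayed = recover(older_than_seconds=0, now=time.time() + 1)
```

Replaying the whole entry is safe only because every action is an idempotent write; the first action runs twice here and the result is unchanged.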
XACT_LOG Comments
●   Idempotent writes are awesome (that's why this
    works so well)
●   Doesn't work so well for counters (they're not
    idempotent)
●   Clients must be able to deal with temporarily
    inconsistent data (they have to do this anyway)
●   Could use a reliable queuing service (e.g. SQS)
    instead of polling – push to SQS first, then
    XACT log.
Q?
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Web and Graphics Designing Training in Rajpura
Web and Graphics Designing Training in RajpuraWeb and Graphics Designing Training in Rajpura
Web and Graphics Designing Training in Rajpura
Erginous Technology
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Viam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdfViam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdf
camilalamoratta
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
GyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
GyrusAI - Broadcasting & Streaming Applications Driven by AI and MLGyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
GyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
Gyrus AI
 
TrsLabs - Leverage the Power of UPI Payments
TrsLabs - Leverage the Power of UPI PaymentsTrsLabs - Leverage the Power of UPI Payments
TrsLabs - Leverage the Power of UPI Payments
Trs Labs
 
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à GenèveUiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPathCommunity
 
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
Connect and Protect: Networks and Network Security
Connect and Protect: Networks and Network SecurityConnect and Protect: Networks and Network Security
Connect and Protect: Networks and Network Security
VICTOR MAESTRE RAMIREZ
 
Play It Safe: Manage Security Risks - Google Certificate
Play It Safe: Manage Security Risks - Google CertificatePlay It Safe: Manage Security Risks - Google Certificate
Play It Safe: Manage Security Risks - Google Certificate
VICTOR MAESTRE RAMIREZ
 
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Web and Graphics Designing Training in Rajpura
Web and Graphics Designing Training in RajpuraWeb and Graphics Designing Training in Rajpura
Web and Graphics Designing Training in Rajpura
Erginous Technology
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Viam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdfViam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdf
camilalamoratta
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 

Cassandra Data Modeling

  • 1. Cassandra Data Modeling Workshop Matthew F. Dennis // @mdennis
  • 2. Overview ● Hopefully interactive ● Use cases submitted via Google Moderator, email, IRC, etc ● Interesting and/or common requests in the slides to get us started ● Bring up others if you have them !
  • 3. Data Modeling Goals ● Keep data queried together on disk together ● In a more general sense think about the efficiency of querying your data and work backward from there to a model in Cassandra ● Don't try to normalize your data (contrary to many use cases in relational databases) ● Usually better to keep a record that something happened as opposed to changing a value (not always advisable or possible)
  • 4. ClickStream Data (use case #1) ● A ClickStream (in this context) is the sequence of actions a user of an application performs ● Usually this refers to clicking links in a WebApp ● Useful for ad selection, error recording, UI/UX improvement, A/B testing, debugging, et cetera ● Not a lot of detail in the Google Moderator request on what the purpose of collecting the ClickStream data was – so I made some up
  • 5. ClickStream Data Defined ● Record actions of a user within a session for debugging purposes if app/browser/page/server crashes
  • 6. Recording Sessions ● CF for sessions a user has had ● Row Key is user name/id ● Column Name is session id (TimeUUID) ● Column Value is empty (or length of session, or some aggregated details about the session after it ended) ● CF for actual sessions ● Row Key is TimeUUID session id ● Column Name is timestamp/TimeUUID of each click ● Column Value is details about that click (serialized)
  • 7. UserSessions Column Family ● Row userId: columns Session_01, Session_02, Session_03 (TimeUUID names), each with an empty or aggregate value ● Most recent session ● All sessions for a given time period
  • 8. Sessions Column Family ● Row SessionId (TimeUUID): columns timestamp_01, timestamp_02, timestamp_03, each holding ClickData (json/xml/etc) ● Retrieve entire session's ClickStream (row) ● Order of clicks/events preserved ● Retrieve ClickStream for a slice of time within the session ● First action taken in a session ● Most recent action taken in a session ● Why JSON/XML/etc?
  • 10. Of Course (depends on what you want to do) ● Secondary Indexes ● All Sessions in one row ● Track by time of activity instead of session
  • 11. Secondary Indexes Applied ● Drop UserSessions CF and use secondary indexes ● Uses a “well known” column to record the user in the row; secondary index is created on that column ● Doesn't work so well when storing aggregates about sessions in the UserSessions CF ● Better when you want to retrieve all sessions a user has had
  • 12. All Sessions In One Row Applied ● Row Key is userId ● Column Name is composite of timestamp and sessionId ● Can efficiently request activity of a user across all sessions within a specific time range ● Rows could potentially grow quite large, be careful ● Reads will almost always require at least two seeks on disk
  • 13. Time Period Partitioning Applied ● Row Key is composite of userId and time “bucket” ● e.g. jan_2011 or jan_01_2011 for month or day buckets respectively ● Column Name is TimeUUID of click ● Column Value is serialized click data ● Avoids always requiring multiple seeks when the user has old data but only recent data is requested ● Easy to lazily aggregate old activity ● Can still efficiently request activity of a user across all sessions within a specific time range
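The bucketed row keys described on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenter's code: the `:` separator between userId and bucket, and the helper names `bucket_row_key` and `row_keys_for_range`, are assumptions; only the `jan_2011` / `jan_01_2011` bucket spellings come from the slide.

```python
from datetime import datetime, timedelta, timezone

# Explicit month names so the result does not depend on the process locale.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def bucket_row_key(user_id, when, granularity="month"):
    """Return a composite row key of userId and a UTC time bucket.

    `when` must be a timezone-aware datetime. The ':' separator is an
    assumption made for this sketch.
    """
    utc = when.astimezone(timezone.utc)
    month = MONTHS[utc.month - 1]
    if granularity == "month":
        bucket = f"{month}_{utc.year}"
    elif granularity == "day":
        bucket = f"{month}_{utc.day:02d}_{utc.year}"
    else:
        raise ValueError("granularity must be 'month' or 'day'")
    return f"{user_id}:{bucket}"


def row_keys_for_range(user_id, start, end):
    """Day-bucket row keys covering the range start..end, oldest first."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    keys = []
    while day <= last:
        moment = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        keys.append(bucket_row_key(user_id, moment, "day"))
        day += timedelta(days=1)
    return keys
```

A request for recent activity then only touches the rows returned by `row_keys_for_range`, which is what keeps old data out of the read path.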
  • 14. Rolling Time Window Of Data Points (use case #2) ● The example given was something similar to RRDTool ● Essentially store a series of data points within a rolling window ● A common request from Cassandra users for this and/or similar
  • 15. Data Points Defined ● Each data point has a value (or multiple values) ● Each data point corresponds to a specific point in time or an interval/bucket (e.g. 5th minute of the 17th hour on some date)
  • 16. Time Window Model ● Example row System7:RenderTime (s7:rt): columns TimeUUID0 = 0.051, TimeUUID1 = 0.014, TimeUUID2 = 0.173 (some request took 0.014 seconds to render) ● Row Key is the id of the time window data you are tracking (e.g. server7:render_time) ● Column Name is timestamp (or TimeUUID) the event occurred at ● Column Value is the value of the event (e.g. 0.051)
  • 17. The Details ● Cassandra TTL values are key here ● When you insert each data point set the TTL to the max time range you will ever request; there is very little overhead to expiring columns ● When querying, construct TimeUUIDs for the min/max of the time range in question and use them as the start/end in your get_slice call ● Consider partitioning the rows by a known time period (e.g. “year”) if you plan on keeping a long history of data (NB: requires slightly more complex logic in the app if a time range spans such a period) ● Very efficient queries for any window of time
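Constructing the start/end TimeUUIDs for a slice might look like the sketch below. It builds version 1 UUIDs by hand from a timestamp using only the standard library. The helper names are hypothetical, and filling the clock sequence and node bits with all zeros or all ones is an assumption: how two TimeUUIDs with the same timestamp compare depends on the comparator in the Cassandra version in use, so client drivers may use different constants for these bounds.

```python
import uuid
from datetime import datetime, timezone

# 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def _timeuuid(dt, clock_seq, node):
    """Build a version 1 UUID for a timezone-aware datetime."""
    delta = dt - UNIX_EPOCH
    # Integer arithmetic avoids float rounding on the microsecond part.
    micros = (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds
    ts = micros * 10 + UUID_EPOCH_OFFSET
    return uuid.UUID(fields=(
        ts & 0xFFFFFFFF,                    # time_low
        (ts >> 32) & 0xFFFF,                # time_mid
        ((ts >> 48) & 0x0FFF) | 0x1000,     # time_hi with version 1
        ((clock_seq >> 8) & 0x3F) | 0x80,   # clock_seq_hi with RFC 4122 variant
        clock_seq & 0xFF,                   # clock_seq_low
        node,
    ))


def min_timeuuid(dt):
    """Assumed lower bound: clock sequence and node bits all zero."""
    return _timeuuid(dt, 0x0000, 0x000000000000)


def max_timeuuid(dt):
    """Assumed upper bound: clock sequence and node bits all one."""
    return _timeuuid(dt, 0x3FFF, 0xFFFFFFFFFFFF)


def slice_bounds(range_start, range_end):
    """Return (start, end) column names to pass to a slice query."""
    return min_timeuuid(range_start), max_timeuuid(range_end)
```

The TTL itself is set on each insert by the client library, so it is not shown here.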
  • 18. Rolling Window Of Counters (use case #3) ● “How to model rolling time window that contains counters with time buckets of monthly (12 months), weekly (4 weeks), daily (7 days), hourly (24 hours)? Example would be; how many times user logged into a system in last 24 hours, last 7 days ...” ● Timezones and “rolling window” is what makes this interesting
  • 19. Rolling Time Window Details ● One row for every granularity you want to track (e.g. day, hour) ● Row Key consists of the granularity, metric, user and system ● Column Name is a “fixed” time bucket on UTC time ● Column Values are counts of the logins in that bucket ● get_slice calls to return multiple counters which are then summed up
  • 20. Rolling Time Window Counter Model ● Row user3:system5:logins:by_day (U3:S5:L:D): columns 20110107 = 2, ..., 20110523 = 7 ● Captions: 2 logins on Jan 7th 2011 for user 3 on system 5; 7 logins on May 23rd 2011 for user 3 on system 5 ● Row user3:system5:logins:by_hour (U3:S5:L:H): columns 2011010710 = 1, ..., 2011052316 = 7 ● Captions: one login for user 3 on system 5 on Jan 7th 2011 for the 10th hour; 2 logins for user 3 on system 5 on May 23rd 2011 for the 16th hour
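The row and column layout of this counter model can be mimicked in memory to see how the keys fit together. The sketch below is only an illustration under stated assumptions: a `defaultdict` of `Counter` objects stands in for the counter column family, and `record_login` and `sum_slice` are hypothetical helpers rather than Cassandra calls. The row key and column name formats follow the slide.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def record_login(store, user, system, when):
    """Increment the day and hour counters for one login (UTC buckets)."""
    utc = when.astimezone(timezone.utc)
    prefix = f"{user}:{system}:logins"
    store[f"{prefix}:by_day"][utc.strftime("%Y%m%d")] += 1
    store[f"{prefix}:by_hour"][utc.strftime("%Y%m%d%H")] += 1


def sum_slice(store, row_key, first, last):
    """Sum the counters whose column names fall in first..last inclusive.

    Column names are fixed-width digit strings, so plain string comparison
    matches chronological order.
    """
    row = store.get(row_key, {})
    return sum(count for name, count in row.items() if first <= name <= last)


# Stand-in for the counter column family: row key -> {column name: count}.
store = defaultdict(Counter)
record_login(store, "user3", "system5", datetime(2011, 1, 7, 10, 5, tzinfo=timezone.utc))
record_login(store, "user3", "system5", datetime(2011, 1, 7, 11, 30, tzinfo=timezone.utc))
logins_that_day = sum_slice(store, "user3:system5:logins:by_day", "20110107", "20110107")
```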
  • 21. Rolling Time Window Queries ● Time window is rolling and there are other timezones besides UTC ● one get_slice for the “middle” counts ● one get_slice for the “left end” ● one get_slice for the “right end”
  • 22. Example: logins for the past 7 days ● Determine date/time boundaries ● Determine UTC days that are wholly contained within your boundaries to select and sum ● Select and sum counters for the remaining hours on either side of the UTC days ● O(1) queries (3 in this case), can be requested from C* in parallel ● NB: some timezones are annoying (e.g. 15 minute or 30 minutes offsets); I try to ignore them
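The boundary step can be sketched as a function that splits a window into the three slices: hour buckets on the left, whole UTC days in the middle, hour buckets on the right. The function name `split_window` is hypothetical, and the sketch assumes hour-aligned, timezone-aware boundaries (so it ignores the 15 and 30 minute offsets, as the slide does).

```python
from datetime import datetime, timedelta, timezone


def _hours(start, end):
    """Hour bucket names (YYYYMMDDHH) for each whole hour from start up to end."""
    names = []
    while start < end:
        names.append(start.strftime("%Y%m%d%H"))
        start += timedelta(hours=1)
    return names


def split_window(start, end):
    """Split the window from start up to end into (left_hours, whole_days, right_hours).

    Both arguments must be timezone-aware and aligned to the hour.
    """
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    # First UTC midnight at or after the start of the window.
    first_day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if first_day < start:
        first_day += timedelta(days=1)
    # Last UTC midnight at or before the end of the window.
    last_day = end.replace(hour=0, minute=0, second=0, microsecond=0)
    if last_day < first_day:
        # No UTC midnight inside the window: hour counters only.
        return _hours(start, end), [], []
    days = []
    day = first_day
    while day < last_day:
        days.append(day.strftime("%Y%m%d"))
        day += timedelta(days=1)
    return _hours(start, first_day), days, _hours(last_day, end)
```

Each of the three lists maps to one slice request (the hour row for the two ends, the day row for the middle), and the returned counters are summed by the client.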
  • 23. Alternatives? (of course) ● If you're counting logins and each user doesn't log in hundreds of times a day, just have one row per user with a TimeUUID column name for the time the login occurred ● Supports any timezone/range/granularity easily ● More expensive for large ranges (e.g. year) regardless of granularity, so cache results (in C*) lazily. ● NB: caching results for rolling windows is not usually helpful (because, well it's rolling and always changes)
  • 24. Eventually Atomic (use case #4) ● “When there are many to many or one to many relations involved how to model that and also keep it atomic? for eg: one user can upload many pictures and those pictures can somehow be related to other users as well.” ● Attempting full ACID compliance in distributed systems is a bad idea (and impossible in the general sense) ● However, consistency is important and can certainly be achieved in C* ● Many approaches / alternatives ● I like transaction log approach, especially in the context of C*
  • 25. Transaction Logs (in this context) ● Records what is going to be performed before it is actually performed ● Performs the actions that need to be atomic (in the indivisible sense, not the all at once sense) ● Marks that the actions were performed
  • 26. In Cassandra ● Serialize all actions that need to be performed in a single column – JSON, XML, YAML (yuck!), cpickle, JSO, et cetera ● Row Key = randomly chosen C* node token ● Column Name = TimeUUID ● Perform actions ● Delete Column
  • 27. Configuration Details ● Short GC_Grace on the XACT_LOG Column Family (e.g. 1 hour) ● Write to XACT_LOG at CL.QUORUM or CL.LOCAL_QUORUM for durability (if it fails with an unavailable exception, pick a different node token and/or node and try again; same semantics as a traditional relational DB) ● 1M memtable ops, 1 hour memtable flush time
  • 28. Failures ● Before insert into the XACT_LOG ● After insert, before actions ● After insert, in middle of actions ● After insert, after actions, before delete ● After insert, after actions, after delete
  • 29. Recovery ● Each C* node has a cron job offset from every other by some time period ● Each job runs the same code: multiget_slice for all node tokens for all columns older than some time period ● Any columns found need to be replayed in their entirety and are deleted after replay (normally there are no columns because normally things are working normally)
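The log, perform, delete cycle and the recovery replay can be sketched in memory to show why idempotent writes make this safe. Everything here is a stand-in: plain dicts replace the XACT_LOG column family and the data store, `NODE_TOKENS` is a made-up list, and `log_and_apply` / `recover` are hypothetical names. The `fail_after` parameter exists only to simulate a crash in the middle of the actions.

```python
import json
import random
import time
import uuid

NODE_TOKENS = ["token_a", "token_b", "token_c"]  # hypothetical node tokens


def log_and_apply(xact_log, store, actions, fail_after=None):
    """Record the actions, perform them, then delete the log column."""
    row = random.choice(NODE_TOKENS)   # Row Key = randomly chosen node token
    col = uuid.uuid1()                 # Column Name = TimeUUID
    # All actions are serialized into a single column value (JSON here).
    xact_log.setdefault(row, {})[col] = (time.time(), json.dumps(actions))
    for i, (key, value) in enumerate(actions):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("simulated crash mid-transaction")
        store[key] = value             # idempotent write
    del xact_log[row][col]             # mark the actions as performed


def recover(xact_log, store, min_age_seconds=0.0):
    """Replay, in full, every logged transaction older than min_age_seconds."""
    now = time.time()
    replayed = 0
    for row in NODE_TOKENS:
        for col, (written_at, payload) in list(xact_log.get(row, {}).items()):
            if now - written_at < min_age_seconds:
                continue               # may still be in flight; leave it alone
            for key, value in json.loads(payload):
                store[key] = value     # replaying an idempotent write is harmless
            del xact_log[row][col]
            replayed += 1
    return replayed
```

Because every action is a plain overwrite, replaying a transaction that had already partly (or fully) completed leaves the store in the same final state, which covers each of the failure points listed on the previous slide.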
  • 30. XACT_LOG Comments ● Idempotent writes are awesome (that's why this works so well) ● Doesn't work so well for counters (they're not idempotent) ● Clients must be able to deal with temporarily inconsistent data (they have to do this anyway) ● Could use a reliable queuing service (e.g. SQS) instead of polling – push to SQS first, then XACT log.
  • 31. Q? Cassandra Data Modeling Workshop Matthew F. Dennis // @mdennis