Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Handling realtime and analytic
workloads in a single cluster
with Hadoop and Cassandra
Piotr Kołaczkowski
pkolaczk@datastax.com
@pkolaczk

Basic Cassandra + Hadoop Integration
[Diagram: a Cassandra cluster of eight C* nodes and a separate Hadoop cluster (NameNode & JobTracker, six DataNodes), connected by CFIF and CFOF]

ColumnFamilyInputFormat
row key   columns (column name: value)
jim       age: 36   car: camaro   gender: M
carol     age: 37   car: subaru
johnny    age: 12   gender: M
suzy      age: 10   gender: F
Key: ByteBuffer (row key)
Value: SortedMap<ByteBuffer, IColumn> (keyed by column name; each IColumn holds column name, value, timestamp)

Input Key: jim
Input Value: age: 36 car: camaro gender: M

Input Key: carol
Input Value: age: 37 car: subaru

Input Key: johnny
Input Value: age: 12 gender: M

Input Key: suzy
Input Value: age: 10 gender: F
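
The record shape above can be modelled with plain JDK types. This is only a sketch: ByteBuffer values stand in for Cassandra's IColumn, values are stored as strings for simplicity, and the class and method names (CfifRowSketch, row, columnValue) are illustrative rather than part of the Cassandra API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;

final class CfifRowSketch {
    // Encode a string as bytes, the way row keys and column names arrive.
    static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // Decode without moving the position of the caller's buffer.
    static String string(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    // Build the value of one input record: columns sorted by column name.
    static SortedMap<ByteBuffer, ByteBuffer> row(String... namesAndValues) {
        SortedMap<ByteBuffer, ByteBuffer> columns = new TreeMap<>();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            columns.put(bytes(namesAndValues[i]), bytes(namesAndValues[i + 1]));
        }
        return columns;
    }

    // What a map() call might do: look up one column of the row by name.
    static String columnValue(SortedMap<ByteBuffer, ByteBuffer> columns, String name) {
        ByteBuffer value = columns.get(bytes(name));
        return value == null ? null : string(value);
    }

    public static void main(String[] args) {
        ByteBuffer key = bytes("carol");
        SortedMap<ByteBuffer, ByteBuffer> value = row("age", "37", "car", "subaru");
        System.out.println(string(key) + " drives a " + columnValue(value, "car"));
    }
}
```

Missing columns (carol has no gender) simply are not in the map, so a mapper has to handle a null lookup.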

CFIF – Wide Row Support
Input Key: jim
Input Value: age: 36

Input Key: jim
Input Value: car: camaro

Input Key: jim
Input Value: gender: M

Input Key: carol
Input Value: age: 37

Input Key: carol
Input Value: car: subaru
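
With wide row support, each input record carries a single column instead of the whole row. The following toy model (plain JDK types, hypothetical names, not Cassandra code) shows the same splitting applied to the jim row:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class WideRowSketch {
    // Turn one row into one {row key, column name, column value} record per column.
    static List<String[]> split(String rowKey, SortedMap<String, String> columns) {
        List<String[]> records = new ArrayList<>();
        for (Map.Entry<String, String> column : columns.entrySet()) {
            records.add(new String[] {rowKey, column.getKey(), column.getValue()});
        }
        return records;
    }

    public static void main(String[] args) {
        SortedMap<String, String> jim = new TreeMap<>();
        jim.put("age", "36");
        jim.put("car", "camaro");
        jim.put("gender", "M");
        for (String[] record : split("jim", jim)) {
            System.out.println("Input Key: " + record[0]
                    + ", Input Value: " + record[1] + ": " + record[2]);
        }
    }
}
```

The row key repeats across records, so a job that needs the whole row has to regroup by key itself.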

CFIF – Cassandra Secondary Index Support
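// IndexExpression, IndexOperator: org.apache.cassandra.thrift
// Select only rows whose indexed column "car" equals "subaru"
IndexExpression expr =
    new IndexExpression(
        ByteBufferUtil.bytes("car"),
        IndexOperator.EQ,
        ByteBufferUtil.bytes("subaru")
    );
ConfigHelper.setInputRange(
    job.getConfiguration(),
    Arrays.asList(expr)
);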

ColumnFamilyOutputFormat
● Key: ByteBuffer (row key)
● Value: List<Mutation>
– Mutation: insert or delete a column
[Diagram: ColumnFamilyRecordWriter puts each write on a queue; a Thrift client sends the queued writes to the Cassandra cluster (eight C* nodes)]

CFOF – Creating Mutations
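// Column, Mutation, ColumnOrSuperColumn: org.apache.cassandra.thrift
ByteBuffer rowkey = ByteBufferUtil.bytes("carol");

Column column = new Column();
column.name = ByteBufferUtil.bytes("age");
column.value = ByteBufferUtil.bytes(37);
column.setTimestamp(System.currentTimeMillis()); // Thrift writes require a timestamp

// No deletion is set, so this mutation inserts the column
Mutation mutation = new Mutation();
mutation.column_or_supercolumn = new ColumnOrSuperColumn();
mutation.column_or_supercolumn.column = column;

List<Mutation> mutations = new ArrayList<Mutation>();
mutations.add(mutation);
context.write(rowkey, mutations);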

BulkOutputFormat
[Diagram: BulkRecordWriter writes into a memory buffer; the buffer is flushed as SSTable 1, SSTable 2, ... SSTable N into a Hadoop temporary directory]

DataStax Enterprise:
Cassandra and Hadoop in a Single Cluster

Basic Features
● Single, simplified component
● Workload separation
● No SPOF
● Peer to peer
● JobTracker failover
● No additional Cassandra config

System Administrator's View
Address          DC         Rack   Workload       Status  State   Load      Owns    Token
                                                                                    148873535527910577765226390751398592512
101.202.204.101  Analytics  rack1  Analytics(JT)  Up      Normal  78,96 GB  12,50%  0
101.202.204.102  Analytics  rack1  Analytics(TT)  Up      Normal  82,65 GB  12,50%  21267647932558653966460912964485513216
101.202.204.105  Cassandra  rack1  Cassandra      Up      Normal  67,42 GB  12,50%  85070591730234615865843651857942052864

Easy monitoring of your nodes, regardless of their workload type
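
The tokens in this listing are consistent with eight evenly spaced tokens on a ring of size 2^127, which also matches the 12,50% ownership per node. That ring size is an inference from the values shown (the slide does not name the partitioner), and the helper below is a hypothetical sketch of the arithmetic, not a DSE tool.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

final class RingTokenSketch {
    // Assumed ring size: 2^127.
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127);

    // Token of node i (0-based) when `nodes` tokens are spaced evenly: i * 2^127 / nodes.
    static List<BigInteger> evenTokens(int nodes) {
        List<BigInteger> tokens = new ArrayList<>();
        for (int i = 0; i < nodes; i++) {
            tokens.add(RING_SIZE.multiply(BigInteger.valueOf(i))
                                .divide(BigInteger.valueOf(nodes)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (BigInteger token : evenTokens(8)) {
            System.out.println(token);
        }
    }
}
```

For eight nodes, positions 0, 1, 4 and 7 give the four token values that appear in the listing.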

Wait, but where are my files?
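[Diagram: two stacks side by side. In plain Hadoop, Hadoop M/R runs on top of HDFS; in DataStax Enterprise, Hadoop M/R runs on top of CFS, which runs on the Cassandra server]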

Cassandra File System Properties
● Decentralized
● Replicated
● HDFS compatible
– compatible with Hadoop filesystem utilities
– allows for running M/R programs on DSE without any change
● Compressed

CFS Compaction
● Keeps track of deleted rows (blocks)
● When all blocks in an SSTable are removed, deletes the whole SSTable
[Diagram: Cassandra storage with SSTables written at timestamps ts 1 to ts 4, each holding CFS blocks (block 1 to block 10); an SSTable whose blocks have all been deleted is crossed out]
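
The rule on this slide can be illustrated with a toy model: track which blocks are still live in each SSTable and drop an SSTable once none remain. This is a sketch of the idea only, with hypothetical names; it is not how DSE implements CFS compaction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CfsCompactionSketch {
    // SSTable name -> blocks in it that have not been deleted yet
    private final Map<String, Set<Integer>> liveBlocks = new LinkedHashMap<>();

    void addSSTable(String name, Integer... blocks) {
        liveBlocks.put(name, new HashSet<>(Arrays.asList(blocks)));
    }

    // Record that a block has been deleted.
    void deleteBlock(int block) {
        for (Set<Integer> live : liveBlocks.values()) {
            live.remove(Integer.valueOf(block));
        }
    }

    // Drop every SSTable with no live blocks left; return the names dropped.
    List<String> compact() {
        List<String> dropped = new ArrayList<>();
        Iterator<Map.Entry<String, Set<Integer>>> it = liveBlocks.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<Integer>> entry = it.next();
            if (entry.getValue().isEmpty()) {
                dropped.add(entry.getKey());
                it.remove();
            }
        }
        return dropped;
    }

    Set<String> ssTables() {
        return liveBlocks.keySet();
    }

    public static void main(String[] args) {
        CfsCompactionSketch cfs = new CfsCompactionSketch();
        cfs.addSSTable("ts 1", 1, 2, 3);
        cfs.addSSTable("ts 2", 4, 5);
        cfs.deleteBlock(4);
        cfs.deleteBlock(5);
        System.out.println("dropped: " + cfs.compact());
        System.out.println("remaining: " + cfs.ssTables());
    }
}
```

An SSTable with even one live block is kept whole in this model, which is what makes dropping a fully deleted SSTable cheap: nothing has to be rewritten.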

Hive Integration
● CassandraHiveMetaStore
– stores Hive database metadata in Cassandra
– no need to run a separate RDBMS
● CassandraStorageHandler
– allows for direct access to C* tables with CFIF and CFOF

Hive Integration – Example
CREATE EXTERNAL TABLE MyHiveTable(row_key string, col1 string, col2 string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
TBLPROPERTIES ("cassandra.ks.name" = "MyCassandraKS");
SELECT count(*) FROM MyHiveTable;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201306041030_0001, Tracking URL = http://192.168.123.10:50030/jobdetails.jsp?jobid=job_201306041030_0001
Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=192.168.123.10:8012 -kill job_201306041030_0001
Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
2013-06-04 15:11:54,573 Stage-1 map = 0%, reduce = 0%
2013-06-04 15:11:58,622 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
...
MapReduce Total cumulative CPU time: 31 seconds 910 msec
Ended Job = job_201306041030_0001
MapReduce Jobs Launched:
Job 0: Map: 9 Reduce: 1 Cumulative CPU: 31.91 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 31 seconds 910 msec
OK
1000000
Time taken: 46.246 seconds

Custom Column Mapping
CREATE EXTERNAL TABLE Users(
    userid string, name string, email string, phone string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.columns.mapping" = ":key,user_name,primary_email,home_phone");

Cassandra:  row key  user_name  primary_email  home_phone
Hive:       userid   name       email          phone
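
The table above suggests the mapping is positional: the n-th entry of cassandra.columns.mapping pairs with the n-th Hive column, with :key standing for the row key. A sketch of that pairing, using a hypothetical helper rather than the storage handler's own code:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnMappingSketch {
    // Pair each Hive column with the Cassandra column at the same position.
    static Map<String, String> hiveToCassandra(List<String> hiveColumns, String mapping) {
        String[] cassandraColumns = mapping.split(",");
        if (cassandraColumns.length != hiveColumns.size()) {
            throw new IllegalArgumentException(
                    "mapping has " + cassandraColumns.length + " entries but the table has "
                    + hiveColumns.size() + " columns");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < cassandraColumns.length; i++) {
            result.put(hiveColumns.get(i), cassandraColumns[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = hiveToCassandra(
                Arrays.asList("userid", "name", "email", "phone"),
                ":key,user_name,primary_email,home_phone");
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            System.out.println("Hive " + entry.getKey() + " <- Cassandra " + entry.getValue());
        }
    }
}
```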
