1) Cassandra is a distributed database management system that provides high availability with no single point of failure.
2) It is well suited for applications that need to store large amounts of structured data and can handle very high write throughput.
3) Cassandra offers easy setup, maintenance, and scalability but requires careful data modeling to achieve high performance.
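The data-modeling requirement follows from how Cassandra places data: each row's partition key is hashed onto a token ring, and the ring determines which nodes hold replicas. A minimal sketch of that placement, with hypothetical node names and MD5 standing in for Cassandra's Murmur3 partitioner:

```python
import hashlib

def token(partition_key: str) -> int:
    # Hash the partition key onto the ring (Cassandra uses Murmur3; MD5 here for brevity).
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

def replicas(partition_key: str, ring: list, rf: int = 3) -> list:
    # Walk the sorted ring clockwise from the key's token and take the
    # next `rf` distinct nodes as replicas (SimpleStrategy-style placement).
    t = token(partition_key)
    ordered = sorted(ring)
    start = next((i for i, (tok, _) in enumerate(ordered) if tok >= t), 0)
    out = []
    for i in range(len(ordered)):
        node = ordered[(start + i) % len(ordered)][1]
        if node not in out:
            out.append(node)
        if len(out) == rf:
            break
    return out

# Hypothetical 4-node ring with evenly spaced tokens.
ring = [(i * (2**128 // 4), f"node{i}") for i in range(4)]
print(replicas("user:42", ring, rf=3))  # three distinct nodes, always the same three
```

The same key always lands on the same replicas, which is why queries that omit the full partition key cannot be routed efficiently and why partition-key choice dominates Cassandra performance.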
DataStax: Extreme Cassandra Optimization: The Sequel (DataStax Academy)
Al has been using Cassandra since version 0.6 and has spent the last few months doing little else but tuning Cassandra clusters. In this talk, Al will show how to tune Cassandra for efficient operation using multiple views into system metrics, including OS stats, GC logs, JMX, and cassandra-stress.
This document discusses scaling Cassandra for big data applications. It describes how Ooyala uses Cassandra for fast access to data generated by MapReduce, high availability key-value storage from Storm, and playhead tracking for cross-device resume. It outlines Ooyala's experience migrating to newer Cassandra versions as data doubled yearly, including removing expired tombstones, schema changes, and Linux performance tuning.
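The tombstone cleanup mentioned above hinges on one rule: a delete marker may only be purged at compaction once it is older than gc_grace_seconds, so every replica has had a chance to see the delete. A toy version of that check (the constant matches Cassandra's documented default; everything else is illustrative):

```python
import time

GC_GRACE_SECONDS = 864000  # Cassandra's default: 10 days

def is_droppable(tombstone_ts: float, now: float, gc_grace: int = GC_GRACE_SECONDS) -> bool:
    # A tombstone can be purged at compaction only after gc_grace_seconds,
    # giving replicas that missed the delete time to receive it via repair.
    return now - tombstone_ts > gc_grace

now = time.time()
print(is_droppable(now - 11 * 86400, now))  # 11 days old -> True
print(is_droppable(now - 1 * 86400, now))   # 1 day old  -> False
```

Purging a tombstone too early risks deleted data "resurrecting" from a replica that never saw the delete, which is why expired-tombstone removal is a deliberate operational step rather than automatic housekeeping.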
The document provides guidance on tuning Apache Spark jobs. It discusses tuning memory and garbage collection, optimizing shuffle operations, increasing parallelism through partitioning, monitoring jobs, and testing Spark applications.
Elastic HBase on Mesos aims to improve resource utilization of HBase clusters by running HBase in Docker containers managed by Mesos and Marathon. This allows HBase clusters to dynamically scale based on varying workload demands, increases utilization by running mixed workloads on shared resources, and simplifies operations through standard containerization. Key benefits include easier management, higher efficiency through elastic scaling and resource sharing, and improved cluster tunability.
(SDD403) Amazon RDS for MySQL Deep Dive | AWS re:Invent 2014 (Amazon Web Services)
Learn about architecting a highly available RDS MySQL implementation to support your high-performance applications and production workloads. We will also talk about best practices in the areas of security, storage, compute configurations, and management that will contribute to your success with Amazon RDS for MySQL. In addition, you will learn about how to effectively move data between Amazon RDS and on-premises instances.
Petabyte search at scale: understand how DataStax Enterprise search enables complex real-time multi-dimensional queries on massive datasets. This talk will cover when and why to use DSE search, best practices, data modeling and performance tuning/optimization. Also covered will be a deep dive into how DSE Search operates, and the fundamentals of bitmap indexing.
This document discusses managing Apache Cassandra at scale. It provides an overview of Cassandra's history and evolution from Dynamo and BigTable. It also discusses Cassandra's data model and how it handles operations like reads, writes and updates in a distributed system without relying on read-modify-writes. The document also covers Cassandra best practices like using collections, lightweight transactions and time series data modeling to optimize for scalability.
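For the time-series modeling mentioned above, a common pattern is to fold a time bucket into the partition key so no single partition grows without bound. A sketch under an assumed schema like PRIMARY KEY ((sensor_id, day), ts):

```python
from datetime import datetime, timezone

def partition_key(sensor_id: str, ts: datetime) -> tuple:
    # Composite partition key (sensor_id, day bucket) caps any one partition
    # at a day of samples; the hypothetical table would declare
    # PRIMARY KEY ((sensor_id, day), ts).
    return (sensor_id, ts.strftime("%Y-%m-%d"))

t = datetime(2014, 9, 10, 13, 45, tzinfo=timezone.utc)
print(partition_key("sensor-7", t))  # ('sensor-7', '2014-09-10')
```

Reads for a time range then touch a predictable, small set of partitions instead of one ever-growing row.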
C* Summit 2013: Cassandra at Instagram by Rick Branson (DataStax Academy)
Speaker: Rick Branson, Infrastructure Engineer at Instagram
Cassandra is a critical part of Instagram's large scale site infrastructure that supports more than 100 million active users. This talk is a practical deep dive into data models, systems architecture, and challenges encountered during the implementation process.
This document provides tips and best practices for debugging and tuning Spark applications. It discusses Spark concepts like RDDs, transformations, actions, and the DAG execution model. It then gives recommendations for improving correctness, reducing overhead from parallelism, avoiding data skew, and tuning configurations like storage level, number of partitions, executor resources and joins. Common failures are analyzed along with their causes and fixes. Overall it emphasizes the importance of tuning partitioning, avoiding shuffles when possible, and using the right configurations to optimize Spark jobs.
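One standard fix for the data skew the talk covers is key salting: split a hot key across several synthetic keys, aggregate per salted key, then merge in a cheap second pass. A pure-Python sketch of the two-stage aggregation (a Spark job would do the same thing with two reduceByKey passes):

```python
import random
from collections import Counter

def salt(key: str, buckets: int = 8) -> str:
    # Spread one hot key over `buckets` synthetic keys so no single
    # partition receives all of its records.
    return f"{key}#{random.randrange(buckets)}"

random.seed(0)
events = ["hot_user"] * 1000 + ["cold_user"] * 10

partials = Counter(salt(k) for k in events)   # stage 1: aggregate per salted key
totals = Counter()
for salted, n in partials.items():            # stage 2: strip the salt and merge
    totals[salted.split("#")[0]] += n

print(totals["hot_user"], totals["cold_user"])  # 1000 10
```

The trade-off is one extra (small) shuffle in exchange for removing the straggler task that otherwise processes the entire hot key alone.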
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentric) (DataStax)
We built an application based on the principles of CQRS and Event Sourcing using Cassandra and Spark. During the project we encountered a number of challenges and problems with Cassandra and the Spark Connector.
In this talk we want to outline a few of those problems and our actions to solve them. While some problems are specific to CQRS and Event Sourcing applications most of them are use case independent.
About the Speakers
Matthias Niehoff IT-Consultant, codecentric AG
works as an IT consultant at codecentric AG in Germany. His focus is on big data & streaming applications with Apache Cassandra & Apache Spark, yet he does not lose track of other tools in the big data space. Matthias shares his experiences at conferences, meetups, and user groups.
Stephan Kepser Senior IT Consultant and Data Architect, codecentric AG
Dr. Stephan Kepser is an expert on cloud computing and big data. He has written a number of journal articles and blog posts on both fields. His interests range from legal questions to the architecture and design of cloud computing and big data systems, down to technical details of NoSQL databases.
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution (Karan Singh)
In this presentation, I explain how Ceph object storage performance can be improved drastically, along with some object storage best practices, recommendations, and tips. I have also covered the Ceph shared data lake, which is becoming very popular.
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data Analysis (DataStax Academy)
Presenter: Ben Vanberg, Senior Software Engineer at FullContact
Here at FullContact we have lots and lots of contact data. In particular we have more than a billion profiles over which we would like to perform ad hoc data analysis. Much of this data resides in Cassandra, and we have many analytics MapReduce jobs that require us to iterate across terabytes of Cassandra data. To solve this problem we've implemented our own splittable input format which allows us to quickly process large SSTables for downstream analytics.
Cassandra Summit 2014: Performance Tuning Cassandra in AWS (DataStax Academy)
Presenter: Michael Nelson, Development Manager at FamilySearch
A recent research project at FamilySearch.org pushed Cassandra to very high scale and performance limits in AWS using a real application. Come see how we achieved 250K reads/sec with latencies under 5 milliseconds on a 400-core cluster holding 6 TB of data while maintaining transactional consistency for users. We'll cover tuning of Cassandra's caches, other server-side settings, client driver, AWS cluster placement and instance types, and the tradeoffs between regular & SSD storage.
Scylla Summit 2018: Keeping Your Latency SLAs No Matter What! (ScyllaDB)
For a real-time big data database, few things are more important than keeping latencies low and bounded. Scylla has been delivering great tail latencies from day one, but the job of making them better never ends and there is always more to do. In this talk we will explore some of the changes made to Scylla in the past few releases to help keep latencies down.
(SDD409) Amazon RDS for PostgreSQL Deep Dive | AWS re:Invent 2014 (Amazon Web Services)
Learn the specifics of Amazon RDS for PostgreSQL's capabilities and extensions that make it powerful. This session covers database data import, performance tuning and monitoring, troubleshooting, security, and leveraging open source solutions with RDS. Throughout, this session focuses on capabilities particular to RDS for PostgreSQL.
AWS Redshift Introduction - Big Data Analytics (Keeyong Han)
Redshift is a scalable SQL database in AWS that can store up to 1.6PB of data across multiple servers. It uses a columnar data storage model that makes adding or removing columns fast. Data is uploaded from S3 using SQL COPY commands and queried using standard SQL. The document provides recommendations for getting started with Redshift, such as performing daily full refreshes initially and then implementing incremental update mechanisms to enable more frequent updates.
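The columnar model the summary refers to can be pictured as pivoting rows into one array per column, so an aggregate touches only the columns it needs. A toy illustration (not Redshift's actual storage format):

```python
def to_columnar(rows: list) -> dict:
    # Pivot row-oriented records into one array per column; an analytic
    # query then scans only the columns it actually references.
    cols = {}
    for row in rows:
        for name, value in row.items():
            cols.setdefault(name, []).append(value)
    return cols

rows = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 3.0}]
cols = to_columnar(rows)
print(sum(cols["amount"]))  # 12.5 -- reads the 'amount' column only
```

This is also why column-oriented stores compress well: each array holds values of a single type, often with long runs of similar data.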
Scylla Summit 2018: What's New in Scylla Manager? (ScyllaDB)
Scylla Manager is a centralized cluster administration and task automation tool; it can automate repairs, and new features are coming. I'll demo the new 1.3 version and talk about the future of the project.
Slides from my talk at Cassandra Summit 2016 on troubleshooting Cassandra. This is a reprise of my popular talk from last summit, reorganized, expanded, and updated for Cassandra 3.0. In it I share the secrets I've learned in four years of supporting hundreds of customers using Apache Cassandra and DataStax Enterprise. Be sure to check out presenter notes for additional tips and links to further resources.
Postgres & Redis Sitting in a Tree - Rimas Silkaitis, Heroku (Redis Labs)
Postgres and Redis Sitting in a Tree | In today's world of polyglot persistence, it's likely that companies will be using multiple data stores for storing and working with data based on the use case. Typically a company will start with a relational database like Postgres and then add Redis for more high-velocity use cases. What if you could tie the two systems together to enable so much more?
Automation of Hadoop cluster operations in Arm Treasure Data (Yan Wang)
This talk focuses on the journey the Arm Treasure Data Hadoop team is on to simplify and automate how we deploy Hadoop. Until recently, we were running Hadoop clusters in two clouds. With the rapid increase of deployments into more sites, the overhead of manual operations started to strain us, so last year we began a project to automate and simplify our deployments using tools like AWS Auto Scaling groups. Steps taken so far include modernizing and standardizing instance types, moving from manually executed deployment scripts to API-triggered workflows, and actively working to deprecate Chef in favor of Debian packages and AWS CodeDeploy. We have also begun automating operations that until recently were manual, such as scaling clusters in and out and routing traffic between clusters, and we have started simplifying health checks and node snapshotting. Our goal for the year is close-to-fully-automated cluster operations.
This document discusses using Apache Cassandra for business intelligence, reporting and analytics. It covers:
- Data modeling and querying Cassandra data using CQL
- Accessing Cassandra data through drivers, ODBC/JDBC, and analytics frameworks like Spark and Hadoop
- Doing reporting, dashboards, and analytics on Cassandra data using CQL, Solr, Spark, and BI tools
- Capabilities of DataStax Enterprise for integrated search, batch analytics, and real-time analytics on Cassandra
- Example architectures that isolate workloads and handle hot vs cold data
Spark + Cassandra = Real Time Analytics on Operational Data (Victor Coustenoble)
This document discusses using Apache Spark and Cassandra together for real-time analytics on transactional data. It provides an overview of Cassandra and how it can be used for operational applications like recommendations, fraud detection, and messaging. It then discusses how the Spark Cassandra Connector allows reading and writing Cassandra data from Spark, enabling real-time analytics on streaming and batch data using Spark SQL, MLlib, and Spark Streaming. It also covers some DataStax Enterprise features for high availability and integration of Spark and Cassandra.
This document discusses Apache Cassandra, a distributed database management system. It provides an overview of Cassandra's features such as linear scalability, high performance and availability. The document also discusses how Cassandra addresses big data challenges through its integration of analytics and real-time capabilities. Several companies that use Cassandra share how it meets their needs for scalability, high performance and lower total cost of ownership compared to alternative solutions.
Cassandra DataTables Using RESTful API (Simran Kedia)
This project exposes Cassandra data tables through a REST API for querying large volumes of data. It builds a web interface to access the API and enables paginated results for user convenience. The interface automatically organizes data into Cassandra tables, handles REST queries to retrieve and display paginated results, and provides APIs for keyspace and column family management. It was implemented using Flask for the REST API, Cassandra's Python driver, and Jinja2/HTML for the user interface.
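The paginated-results part can be sketched as a function that returns one page plus an opaque state for the next request, loosely mirroring the fetch-size/paging-state idea in the Cassandra drivers (the names here are illustrative, not the driver API):

```python
def paginate(rows: list, page_size: int, page_state: int = 0):
    # Return one page of results plus an opaque state token the client
    # sends back to fetch the next page; None means no more pages.
    page = rows[page_state:page_state + page_size]
    next_state = page_state + page_size if page_state + page_size < len(rows) else None
    return page, next_state

rows = list(range(7))
page, state = paginate(rows, 3)
print(page, state)        # [0, 1, 2] 3
page, state = paginate(rows, 3, state)
print(page, state)        # [3, 4, 5] 6
```

A REST endpoint would echo the state token in the response so the web interface can request subsequent pages statelessly.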
This document provides an overview of using Cassandra in web applications. It discusses why developers may consider using a NoSQL solution like Cassandra over traditional SQL databases. It then covers topics like Cassandra's architecture, data modeling, configuration options, APIs, development tools, and examples of companies using Cassandra in production systems. Key points emphasized are that Cassandra offers high performance but requires rewriting code and developing new processes and tools to support its flexible schema and data model.
This document provides an overview of the NodeJS Cassandra driver. It begins with a brief introduction of Cassandra and then discusses the driver's architecture, streaming capabilities, and API. Key aspects covered include connection pooling, request pipelining, load balancing policies, automatic failover, and data type mappings. The presentation concludes with a code example and demonstration of the driver.
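A round-robin load-balancing policy like the one such drivers ship can be reduced to a few lines: rotate through the contact points and skip hosts marked down. A simplified Python stand-in (the real driver's policy interface is richer):

```python
import itertools

class RoundRobinPolicy:
    # Hand out contact points in rotation, skipping hosts marked down;
    # a deliberately simplified take on a driver load-balancing policy.
    def __init__(self, hosts):
        self.hosts = hosts
        self.down = set()
        self._cycle = itertools.cycle(hosts)

    def next_host(self):
        for _ in range(len(self.hosts)):
            host = next(self._cycle)
            if host not in self.down:
                return host
        raise RuntimeError("no hosts available")

policy = RoundRobinPolicy(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
policy.down.add("10.0.0.2")
print([policy.next_host() for _ in range(4)])
# ['10.0.0.1', '10.0.0.3', '10.0.0.1', '10.0.0.3']
```

Automatic failover is essentially this skip step plus a health check that moves hosts in and out of the down set.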
Cassandra Java APIs Old and New – A Comparison (shsedghi)
The document compares old Java APIs for Cassandra like Thrift, Hector and JDBC to the new DataStax Java driver. It provides an overview of each API, including how each interacts with Cassandra (e.g. via Thrift), examples of basic operations like reading rows, and references for more information. It also briefly introduces Cassandra's data model and the binary protocol which the new driver uses.
Application Development with Apache Cassandra as a Service (WSO2)
WSO2 is an open source software company, founded in 2005, that produces an entire middleware platform under the Apache license. Its business model involves selling comprehensive support and maintenance for its products. It has over 150 employees with offices globally. The document discusses using Apache Cassandra as a NoSQL database with WSO2's Column Store Service, including how to install the Cassandra feature, manage keyspaces and column families, and develop applications using the Hector Java API.
This document discusses using Node.js and Cassandra for highly concurrent systems. It explains that Node.js is well suited for I/O-bound applications with low CPU usage that require high concurrency, because its event-driven, non-blocking model handles many connections efficiently in a single thread without much overhead. The document also introduces the Cassandra driver for Node.js, which features connection pooling, load balancing, retry functions, and row/field streaming for efficiently accessing Cassandra from Node.js applications. Examples show how to perform queries and stream rows and fields into responses.
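Python's asyncio makes a convenient stand-in for Node's event loop when illustrating this model: one thread, many in-flight I/O waits, no locks. A minimal sketch:

```python
import asyncio

async def handle(request_id: int) -> str:
    # Simulate an I/O-bound request: the await yields the single event-loop
    # thread to other connections while this one waits on "the network".
    await asyncio.sleep(0.01)
    return f"response-{request_id}"

async def main():
    # 100 concurrent "connections" on one thread, no thread pool, no locks.
    return await asyncio.gather(*(handle(i) for i in range(100)))

results = asyncio.run(main())
print(len(results), results[0])  # 100 response-0
```

All 100 requests overlap their waits, so the whole batch finishes in roughly one sleep interval rather than 100 of them, which is the essence of why an event loop suits I/O-bound, low-CPU workloads.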
NodeJS: Communication and Round Robin Way (Edureka!)
The document provides an overview of the Mastering Node.js course offered by Edureka. It outlines the course objectives which include introducing Node.js, NPM, use cases, network communication, two-way communication using Socket.io, and cluster round robin load balancing. It also lists topics that will be covered in the course modules and highlights features like live online classes, class recordings, 24/7 support, quizzes, projects, and a verifiable certificate.
Pollfish is a survey platform which provides access to millions of targeted users. Pollfish allows easy distribution and targeting of surveys through existing mobile apps (https://www.pollfish.com/). At Pollfish we use Cassandra for different use cases, e.g. as an application data store to maximize write throughput when appropriate, and for our analytics project to find insights in application-generated data. As a medium to accomplish our success so far, we use DataStax's DSE 4.6 environment, which integrates Apache Cassandra, Spark and a Hadoop-compatible file system (CFS). We will discuss how we started, how the journey has been, and the impressions gained so far, along with some tips learned the hard way. This is the result of joint work by an excellent team here at Pollfish.
High concurrency, Low latency analytics using Spark/Kudu (Chris George)
With the right combination of open source projects, you can have high-concurrency, low-latency Spark jobs for data analysis. We'll show both REST and JDBC access to data from a persistent Spark context, then show how the combination of Spark Job Server, Spark Thrift Server and Apache Kudu can create a scalable backend for low-latency analytics.
1) The document proposes making a key-value storage system (CDP KVS) 10 times more scalable to support real-time data delivery.
2) Three ideas are presented: using an alternative distributed KVS, implementing a storage hierarchy on the existing KVS, and shipping edit logs to indexed archives.
3) The storage hierarchy approach of partitioning, compressing, and writing data to DynamoDB in batches is selected as it improves write performance and reduces storage costs while remaining stateless.
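The selected storage-hierarchy idea (partition, compress, write in batches) can be sketched in a few lines; the batch size of 25 below matches DynamoDB's BatchWriteItem limit, and everything else is illustrative:

```python
import json
import zlib

def pack_batch(records: list, batch_size: int = 25) -> list:
    # Group records into batches and compress each batch before writing,
    # trading a little CPU for fewer and smaller KVS writes.
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    return [zlib.compress(json.dumps(b).encode()) for b in batches]

records = [{"id": i, "payload": "x" * 50} for i in range(60)]
blobs = pack_batch(records)
print(len(blobs))  # 3 batches: 25 + 25 + 10
print(sum(len(b) for b in blobs) < len(json.dumps(records)))  # True: smaller on the wire
```

Because each blob is derived purely from its input records, the writer stays stateless, which is one of the reasons this option was preferred in the document.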
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All (ScyllaDB)
The idea of implementing a brand-new Rust driver for ScyllaDB emerged from an internal hackathon in 2020. The initial goal was to provide a native implementation of a CQL driver, fully compatible with Apache Cassandra™, while also containing a variety of Scylla-specific optimizations. Development was later continued as a Warsaw University project led by ScyllaDB.
Now it's an officially supported driver with excellent performance and a wide range of features. This session shares the design decisions taken in implementing the driver and its roadmap. It also presents a forward-thinking plan to unify other Scylla-specific drivers by translating them to bindings to our Rust driver, using work on our C++ driver as an example.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Quick trip around the Cosmos - Things every astronaut is supposed to know (Rafał Hryniewski)
Slides for my talk, which gives an overview of Microsoft's new(ish) multi-model cloud database, Cosmos DB.
Recorded talk (in Polish) is available here: https://youtu.be/ZWpJne0kcds?t=1h52m45s
Scylla is a new open source NoSQL database that is compatible with Apache Cassandra but provides significantly higher performance through a redesign that takes advantage of modern hardware. Scylla is capable of over 1.8 million operations per second per node with predictable low latencies. It uses an architecture with shard-per-core and reactor programming that avoids locks and threads for near-linear scaling. Scylla also has its own efficient unified cache and I/O scheduler that maximize throughput and allow it to outperform Cassandra on benchmarks by an order of magnitude. Scylla is fully compatible with Cassandra and aims to build an open source community around ongoing core database improvements.
Introduction to Sqoop - Aaron Kimball, Cloudera - Hadoop User Group UK (Skills Matter)
In this talk of Hadoop User Group UK meeting, Aaron Kimball from Cloudera introduces Sqoop, the open source SQL-to-Hadoop tool. Sqoop helps users perform efficient imports of data from RDBMS sources to Hadoop's distributed file system, where it can be processed in concert with other data sources. Sqoop also allows users to export Hadoop-generated results back to an RDBMS for use with other data pipelines.
After this session, users will understand how databases and Hadoop fit together, and how to use Sqoop to move data between these systems. The talk will provide suggestions for best practices when integrating Sqoop and Hadoop in your data processing pipelines. We'll also cover some deeper technical details of Sqoop's architecture, and take a look at some upcoming aspects of Sqoop's development roadmap.
This document summarizes new features and upcoming releases for Ceph. In the Jewel release in April 2016, CephFS became more stable with improvements to repair and disaster recovery tools. The BlueStore backend was introduced experimentally to replace Filestore. Future releases Kraken and Luminous will include multi-active MDS support for CephFS, erasure code overwrites for RBD, management tools, and continued optimizations for performance and scalability.
The document discusses data partitioning and distribution across multiple machines in a cluster. It explains that data replication does not scale well, but data partitioning, where each record exists on only one machine, allows write latency to scale with the number of machines in the cluster. Coherence provides a distributed cache that partitions data and offers functions for server-side processing near the data through tools like entry processors.
Cassandra Summit 2014: Down with Tweaking! Removing Tunable Complexity for Cassandra (DataStax Academy)
Presenters: Don Marti, Glauber Costa, and Dor Laor of Cloudius Systems
The need for performance tuning of the JVM and OS is making administrators the bottleneck for Cassandra deployments--especially in virtual environments. Over the past two years, the OSv project has profiled tuning-sensitive applications with a special focus on Cassandra. Today, many of the important bottlenecks for NoSQL applications are tunable on a conventional OS, but do not require tuning in the OSv environment. OSv gives Cassandra a simpler environment, set up to run one application in a single address space. This talk will cover how to use OSv to improve performance in key areas such as JVM memory allocation and network throughput--without loading up your to-do list with difficult tuning tasks.
"Puppet and Apache CloudStack" by David Nalley, Citrix, at Puppet Camp San Francisco 2013. Find a Puppet Camp near you: puppetlabs.com/community/puppet-camp/
Infrastructure as code with Puppet and Apache CloudStack (ke4qqq)
Puppet can now be used to define not only the configuration of machines, but also the machines themselves and entire collections of machines when using CloudStack. New Puppet types and providers allow defining CloudStack instances, groups of instances, and entire application stacks that can then be deployed on CloudStack. This brings infrastructure as code to a new level by allowing Puppet to define and manage the entire CloudStack infrastructure.
Presentation at March 2019 Dutch Postgres User Group Meetup on lessons learnt while migrating from Oracle to Postgres, demo'ed via vagrant test environments and using generic pgbench datasets.
OSv is a new, high-performance OS for virtual machines in the cloud. Designed to run one application per guest with minimal overhead, OSv eliminates important bottlenecks for NoSQL applications through improvements in memory management, network I/O, and scheduling. And many important bottlenecks for NoSQL applications are tunable on a conventional OS, but do not require tuning in the OSv environment.
OSv is fully stateless and can be configured at runtime with cloud-init or through a REST API, with zero configuration files. OSv offers unified tracing from the application layer through the JVM and the OS kernel. Attendees will learn how to boot Cassandra in one second, and create a simple cluster in a minute.
Listen up, developers. You are not special. Your infrastructure is not a beautiful and unique snowflake. You have the same tech debt as everyone else. This is a talk about a better way to build and manage infrastructure: Terraform modules. It goes over how to build infrastructure as code, package that code into reusable modules, design clean and flexible APIs for those modules, write automated tests for the modules, and combine multiple modules into an end-to-end tech stack in minutes.
You can find the video here: https://www.youtube.com/watch?v=LVgP63BkhKQ
Under The Hood Of A Shard-Per-Core Database Architecture (ScyllaDB)
This document summarizes the key design decisions behind ScyllaDB's shard-per-core database architecture. It discusses how ScyllaDB addresses the challenges of scaling databases across hundreds of CPU cores by utilizing an asynchronous task model with one thread and one data shard per CPU core. This allows for linear scalability. It also overhauls the I/O scheduling to prioritize workloads and maximize throughput from SSDs under mixed read/write workloads. Benchmark results show ScyllaDB's architecture can handle petabyte-scale databases with high performance and low latency even on commodity hardware.
OCF.tw's talk "Introduction to Spark" (Giivee The)
A talk on Spark, shared at the invitation of OCF and OSSF.
If you are interested in the Open Culture Foundation (OCF) or the Open Source Software Foundry (OSSF),
please check http://ocf.tw/ or http://www.openfoundry.org/
Thanks also to CLBC for providing the venue.
If you would like to work in a great working environment,
feel free to get in touch with CLBC: http://clbc.tw/
High-Performance Storage Services with HailDB and Java (sunnygleason)
This document summarizes an approach to providing high-performance storage services using Java and HailDB. It discusses using the optimized "guts" of MySQL without needing to go through JDBC and SQL. It presents HailDB as a storage engine alternative to NoSQL options like Voldemort. It describes integrating HailDB with Java using JNA, building a REST API on top called St8, and examples of nifty applications like graph stores and counters. It concludes with discussing future work like improving packaging, online backup, and exploring JNI bindings.
This document discusses using Apache Spark to perform analytics on Cassandra data. It provides an overview of Spark and how it can be used to query and aggregate Cassandra data through transformations and actions on resilient distributed datasets (RDDs). It also describes how to use the Spark Cassandra connector to load data from Cassandra into Spark and write data from Spark back to Cassandra.
Gameplay concept and architecture in Creach: The Depleted World (Sperasoft)
Presentation by Evgeniy Muralev (Sperasoft) and Konstantin Muralev (Trace Studio) during the Unreal Engine 4 meetup at the Sperasoft office in St. Petersburg
April 8th, 2017
This document discusses code and memory optimization techniques for software engineers developing AAA game titles. It begins with an introduction to the speaker and provides an overview of hardware architecture including CPU registers, caches, and memory access times. The bulk of the document focuses on optimizing for data caches through techniques like improving data layout, prefetching, and utilizing cache lines efficiently. It also discusses optimizing branches through removing branches, computing both paths, and splitting data to avoid branches. Resources for further reading are provided.
The document discusses key concepts in relational database models including:
- Data is stored in tables called relations with rows and columns where rows represent records and columns represent attributes.
- Relations can be normalized to eliminate redundant data and optimize storage.
- Database normalization involves organizing data into tables through a multi-step process to remove anomalies.
- SQL is a programming language used to interact with relational databases through operations like joins, transactions, and indexing/hashing techniques.
Automated layout testing using Galen Framework (Sperasoft)
The Galen Framework tests page layouts using Selenium by verifying elements' positions relative to each other. It uses .gspec files to describe layouts with objects, groups, sections and tags. Verifications include checking widths, heights, alignments, text values, and relative positions using keywords like "near", "inside" and ranges. Results can be saved to HTML reports.
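The relative-position checks Galen performs ("near", "inside") boil down to simple rectangle arithmetic. A toy re-implementation of two of them, with boxes as x/y/w/h dicts (not Galen's actual code):

```python
def near(a: dict, b: dict, max_gap: int = 10) -> bool:
    # Is box `a` horizontally adjacent to box `b`, within max_gap pixels?
    gap = b["x"] - (a["x"] + a["w"])
    return 0 <= gap <= max_gap

def inside(a: dict, b: dict) -> bool:
    # Is box `a` fully contained within box `b`?
    return (a["x"] >= b["x"] and a["y"] >= b["y"]
            and a["x"] + a["w"] <= b["x"] + b["w"]
            and a["y"] + a["h"] <= b["y"] + b["h"])

logo = {"x": 10, "y": 10, "w": 50, "h": 20}
menu = {"x": 65, "y": 10, "w": 200, "h": 20}
header = {"x": 0, "y": 0, "w": 300, "h": 40}
print(near(logo, menu), inside(logo, header))  # True True
```

A .gspec line like "logo: near menu 0 to 10px left of" compiles down to exactly this kind of predicate evaluated against element rectangles reported by Selenium.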
The document discusses various security threats related to Android applications. It begins by introducing the OWASP Mobile Top 10 risks framework for categorizing common mobile vulnerabilities. It then provides more details on each of the top 10 risk categories, including examples, impacts, and tips for prevention. It also discusses techniques for protecting Android apps from reverse engineering and tampering, such as code obfuscation, anti-debugging, and license verification.
Sperasoft Talks: RxJava Functional Reactive Programming on Android (Sperasoft)
RxJava is a library for composing asynchronous and event-based programs by using observable sequences. It provides APIs for asynchronous programming using observable streams and the observer pattern to allow publishing and subscribing to multiple streams of events. Some key features include transformations on observable streams, combining multiple observables, filtering streams, and handling asynchronous operations without callbacks using reactive extensions. The document provides examples of creating observables from various sources, transforming streams through mapping and filtering, and combining multiple observables. It also discusses subjects, schedulers, and how RxJava can help eliminate AsyncTasks for asynchronous operations on Android.
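The core idea, an observable stream that pushes values through composable map/filter stages to a subscriber, fits in a few lines of any language. A miniature Python stand-in for RxJava's Observable (not the RxJava API):

```python
class Observable:
    # A push-based stream: subscribing runs the source, which feeds values
    # to the subscriber through any map/filter stages composed on top.
    def __init__(self, source):
        self._subscribe = source

    def subscribe(self, on_next):
        self._subscribe(on_next)

    def map(self, fn):
        return Observable(lambda on_next: self.subscribe(lambda v: on_next(fn(v))))

    def filter(self, pred):
        return Observable(lambda on_next: pred_chain(self, pred, on_next))

def pred_chain(obs, pred, on_next):
    # Forward only the values that satisfy the predicate.
    obs.subscribe(lambda v: on_next(v) if pred(v) else None)

def from_iterable(items):
    return Observable(lambda on_next: [on_next(i) for i in items])

seen = []
(from_iterable(range(6))
    .filter(lambda v: v % 2 == 0)
    .map(lambda v: v * 10)
    .subscribe(seen.append))
print(seen)  # [0, 20, 40]
```

Nothing runs until subscribe is called, which mirrors how Rx pipelines stay lazy until a subscriber attaches.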
2. Storage options along the "3 V" axes (Velocity, Volume, Variety):
• In-memory Key-Value: Redis, CouchBase
• File system: BerkeleyDB
• Distributed file system: HDFS
• In-memory NewSQL / domain specific: VoltDB
• Traditional [SQL] RDBMS
• Document store: MongoDB
• Distributed DBMS: Cassandra, HBase
CAP theorem:
• Consistency
• Availability
• Partition tolerance
- pick two
Transactional? • No
=> eventual consistency is acceptable
Choose the DB
3. Why select Cassandra?
• [relatively] easy to set up
• [relatively] easy to use
• ~zero routine ops
• it works (!!) as promised:
o real-time replication
o node/site failure recovery
o zero load writes
o double the nodes = double the speed
4. Because Cassandra is Fast!
But needs some time to deliver
• 12,000 WPS on a laptop
• ~0.1 ms / ~1 ms constant latency for writes / reads
6. Good for:
• log-like data
(TTL helps)
• massive writes
(is 1M WPS enough?)
• simple real-time analytics
Not so good for:
• a dump of junk
(consider HDFS)
• OLAP
(depends on the "O")
Good and Not So Good
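The "TTL helps" point for log-like data can be sketched with a tiny in-memory model (plain illustrative Python, not driver code; the class and key names are invented): each cell carries an expiry time, and reads treat expired cells as deleted, which is roughly how Cassandra's per-cell TTL removes old log entries without explicit cleanup.

```python
import time

class TTLStore:
    """Toy model of per-cell TTL: old entries expire instead of needing deletes."""
    def __init__(self):
        self._cells = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl is not None else None
        self._cells[key] = (value, expires_at)

    def get(self, key):
        cell = self._cells.get(key)
        if cell is None:
            return None
        value, expires_at = cell
        if expires_at is not None and time.time() >= expires_at:
            return None  # expired: behaves as if deleted
        return value

store = TTLStore()
store.put("log:1", "event", ttl=0.05)  # short TTL, just for the demo
store.put("log:2", "kept")             # no TTL
assert store.get("log:1") == "event"   # still visible before expiry
time.sleep(0.1)
print(store.get("log:1"))  # None: expired without any explicit delete
print(store.get("log:2"))  # kept
```

In real CQL the same idea is a per-write option (`INSERT ... USING TTL <seconds>`), so log tables stay bounded by themselves.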
7. Distributed DBMS
Just a DBMS: a closed, monolithic solution
o not a platform to run custom code (unlike MongoDB);
o not an extension on top of another store (unlike HBase);
o highly optimized
Masterless ("no-master"), eventually consistent
NoSQL
Data model: Key-Value
http://cassandra.apache.org
Apache Cassandra
8. Developed at Facebook for Inbox search
Released to open source in 2008
In use:
• Netflix: main non-content data store, ~500 Cassandra nodes (2012)
• eBay: recommendation system, "dozens of nodes", 200 TB of storage (2012)
• Twitter: tweet analysis, 100+ TB of data
• More users: http://www.datastax.com/cassandrausers
History
9. 1.0 - October 2011
1.1 - April 2012
1.2 - January 2013
2.0 - expected this summer (2013)
As of June 26, 2013: 158 open bugs, 89 of them worth noticing
Sperasoft experience:
• hit 1 bug in production (a stability issue)
• hit 1 bug in QA (in a crafted corner case)
Mature & Agile
10. Apache .tar.gz and Debian packages
http://cassandra.apache.org/download/
DataStax DSC: Cassandra + OpsCenter
http://planetcassandra.org/Download/DataStaxCommunityEdition
Embedded: for functional tests of Java apps, via Maven
Documentation:
http://wiki.apache.org/cassandra/
http://www.datastax.com/docs
Distributions
12. Bare metal
CPU: 8 cores (4 works too)
RAM: 16 - 64 GB (min 8 GB)
Storage: rotating disks, 3 - 5 TB total (SSD is better)
VMs work too, but...
Storage: local disks, avoid NAS
More on Hardware
16. [Diagram: a client performs parallel reads and writes against Node 1, Node 2, and Node 3; each node holds a replica of the column family (CF), and replicas 1, 2, 3 end up as data on the nodes' disks.]
Data on Discs
17. https://github.com/datastax/java-driver
Client API Options

Thrift RPC           | Native protocol + CQL3
---------------------|-------------------------
Apache Thrift        | Custom protocol
Synchronous          | Asynchronous
Schema-less          | Static schema
Store & Forward      | Cursors promised in 2.0
API for any language | Java; Python, C# coming
Cryptic API          | JDBC-like API
Still supported      | Going forward
18. • Forget RDB design principles
• Forget the abstract data model
- shape data for your queries
• No joins - use materialized views
• Data duplication is OK
• Remember eventual consistency
• Queries are precious
• Use the right data types - timestamp, uuid
Why? Because NoSQL is a low-level tool for heavy optimization.
Data Modeling for NoSQL
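The "shape data for queries" and "data duplication is OK" rules above can be sketched with an in-memory model (illustrative Python, not CQL or driver code; the table and field names are invented): instead of joining at read time, every write duplicates the row into one structure per query, so each query reads exactly one "table".

```python
# Two query-shaped "tables", both written on every insert:
# one to look up a song by id, one to list songs in a playlist.
songs_by_id = {}        # song_id -> song row
songs_by_playlist = {}  # playlist -> list of duplicated song rows

def add_song(song_id, title, playlist):
    """A single logical write lands in every table that serves a query."""
    row = {"song_id": song_id, "title": title, "playlist": playlist}
    songs_by_id[song_id] = row                                     # serves "get song by id"
    songs_by_playlist.setdefault(playlist, []).append(dict(row))   # serves "list a playlist"

add_song(1, "Intro", "morning")
add_song(2, "Outro", "morning")

# Each query is a single lookup; no join is ever needed.
print(songs_by_id[2]["title"])                              # Outro
print([r["title"] for r in songs_by_playlist["morning"]])   # ['Intro', 'Outro']
```

This is the hand-rolled "materialized view" the slide mentions: writes get more expensive (one per view), but every precious query stays a single-partition read.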
22. Insert = Update = Delete: every write is an append
[Diagram: row id = 1 starts as columns (a, b, c, d) = (A, B, C, D); each later statement appends a new fragment - b = 'Y', then c = 'Z', then a tombstone for d - and a read merges them into (A, Y, Z).]
UPDATE ... SET b = 'Y' WHERE id = 1
INSERT INTO ... (id, c) VALUES (1, 'Z')
DELETE d FROM ... WHERE id = 1
SELECT * FROM ... WHERE id = 1
has to fetch 4 row fragments
slo-o-ow
Plan Data Immutable
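The read cost described above can be modeled in a few lines (a sketch of the idea, not Cassandra internals): each statement appends a timestamped fragment, a delete appends a tombstone, and a read must visit and merge every fragment of the row, with the newest write winning per column.

```python
# Each write appends a (timestamp, columns) fragment; value None is a tombstone.
fragments = [
    (1, {"a": "A", "b": "B", "c": "C", "d": "D"}),  # original insert
    (2, {"b": "Y"}),                                # UPDATE ... SET b = 'Y'
    (3, {"c": "Z"}),                                # INSERT INTO ... (id, c) VALUES (1, 'Z')
    (4, {"d": None}),                               # DELETE d FROM ...
]

def read_row(fragments):
    """Merge all fragments, newest timestamp wins; tombstoned columns disappear."""
    row = {}
    for _, cols in sorted(fragments):  # every fragment must be visited: the slow part
        row.update(cols)
    return {col: val for col, val in row.items() if val is not None}

print(read_row(fragments))  # {'a': 'A', 'b': 'Y', 'c': 'Z'}
```

Four fragments merged for one logical row is exactly why the SELECT on the slide is "slo-o-ow", and why planning data to be written once (immutable) keeps reads cheap.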
24. Remember: eventual consistency.
SELECT count(*)
FROM ... WHERE ... ;
Full scan over the selection
=> slo-o-ow
Default 10,000 rows limit
=> wrong count
Have an integer column and increment it?
Concurrent updates
=> wrong count, a mess
Use a counter column family instead:
CREATE TABLE count_table (
  id uuid,
  value counter,
  PRIMARY KEY (id)
);
...
UPDATE count_table
SET value = value + 1
WHERE id = ... ;
http://www.datastax.com/documentation/cassandra/1.2/cassandra/cql_using/use_counter_t.html
How Many?
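Why the plain integer column turns into a mess can be shown deterministically (a toy interleaving in Python, not real cluster behavior): two clients read the same value and both write back value + 1, so one increment is lost, whereas a counter column applies commutative deltas that all survive.

```python
# Naive read-modify-write: two clients interleave and one increment is lost.
value = 0
read_by_a = value        # client A reads 0
read_by_b = value        # client B also reads 0, before A writes
value = read_by_a + 1    # A writes 1
value = read_by_b + 1    # B overwrites with 1: A's increment is lost
print(value)             # 1, although two increments happened

# Counter-style: each client submits a delta; deltas commute, so none is lost.
deltas = []
deltas.append(1)         # client A: value = value + 1
deltas.append(1)         # client B: value = value + 1
print(sum(deltas))       # 2
```

Cassandra's counter column works like the second half: `SET value = value + 1` ships the delta, not a freshly read total, so concurrent increments do not overwrite each other.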
25. Storing a whole blob in one cell (columns: id, data) risks OutOfMemory:
CREATE TABLE blob (
  id uuid,
  data blob,
  PRIMARY KEY (id)
);
Split large blobs into chunks instead (columns: id, chunk_no, data):
CREATE TABLE blob (
  id uuid,
  chunk_no int,
  data blob,
  PRIMARY KEY (id, chunk_no)
);
http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage
http://wiki.apache.org/cassandra/CassandraLimitations
Blobs
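Client-side chunking for the second schema might look like this (a Python sketch under assumptions: the chunk size and helper names are invented, and real code would write each (id, chunk_no, data) row through the driver rather than keep them in a list):

```python
CHUNK_SIZE = 4  # tiny for the demo; in practice something like 1 MB per chunk

def to_chunks(blob_id, data, chunk_size=CHUNK_SIZE):
    """Split a blob into (id, chunk_no, data) rows matching PRIMARY KEY (id, chunk_no)."""
    return [
        (blob_id, chunk_no, data[offset:offset + chunk_size])
        for chunk_no, offset in enumerate(range(0, len(data), chunk_size))
    ]

def from_chunks(rows):
    """Reassemble one blob: read all chunks of an id and join them by chunk_no."""
    return b"".join(data for _, _, data in sorted(rows, key=lambda row: row[1]))

rows = to_chunks("blob-1", b"0123456789")
print([chunk_no for _, chunk_no, _ in rows])  # [0, 1, 2]
print(from_chunks(rows))                      # b'0123456789'
```

Because chunk_no is the clustering column, reading all chunks of one id is a single ordered partition scan, and no single cell is ever large enough to blow up node memory.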