Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store

Azure + DSE Powers O365
Per-User Store
© 2015. All Rights Reserved.

Overview
1 Introduction
2 What We Built
3 What to Pay Close Attention To
4 Deployment
5 Wrap Up

Introduction
Sean Usher
Office 365
Email: seusher@microsoft.com
Twitter: @seanushermsft
Mahesh Thiagarajan
Microsoft Azure
Email: mahthi@microsoft.com
Twitter: @_cloudguy
Ben Lackey
DataStax
Email: ben.lackey@datastax.com

Introduction – Office 365
Email
Collaboration
Document Authoring
Social Networking
Calendaring
File Storage
Business Intelligence
Etc…

Introduction – Azure
Azure is Microsoft’s cloud computing platform, a growing collection of
integrated services—analytics, computing, database, mobile, networking,
storage, and web—for moving faster, achieving more, and saving money.

What We Built – Overview
A way to understand our users and organizations at a deeper level!
Questions:
• Are users happy with the service they are receiving?
• Are users fully utilizing the services they are paying us for?
• Are users hitting issues that we can proactively help them with?
• How has a user’s experience been over their lifetime?
• Can we discover insights that we aren’t even aware of?
This requires ingesting and storing a lot of data. We need to be able to
perform fast, scalable analytics on that data, or we will discover issues too
late!

What We Built – Why Cassandra
The Good
• Low Latency ✓
• Linear Scale ✓
• Highly Available ✓
• Aggregations (Spark/Spark Streaming) ✓
• Machine Learning (Spark ML) ✓
• No Enforcement of Full Consistency ✓ ✓ ✓
The Not-So-Good
• No Hosted Option in Azure ✗
• Have to Install and Configure it Ourselves ✗

What We Built – DSE Clusters
Cluster 1:
Cassandra: 12 Nodes
Analytics: 12 Nodes
VM Size: G4
Heap Size: 30 GB
GC: G1
Ingestion: 20k – 50k events/sec
Data on ephemeral SSD drives.
RF = 3 in both DCs
Cluster 2:
Cassandra: 30 Nodes
Analytics: 15 Nodes (30 within 1 month)
VM Size: G4
Heap Size: 30 GB
GC: G1
Ingestion: 200k+ events/sec
Data on ephemeral SSD drives.
RF = 3 in both DCs

What We Built – Pipeline Evolution
[Diagram: two pipelines, one per cluster.]
Cluster 1 components: REST API, O365, Event Hub, Ingestion Worker (Azure worker role using DataStax C# driver), C*, Analytics
Cluster 2 components: REST API, O365, Kafka, C*/Spark Streaming, Analytics
Sizing labels in the diagram:
• G4 – Local SSD
• Kafka: G4 – Data Disk
• ZooKeeper: A7 – Data Disk
• PaaS Small

What to Pay Close Attention To – Azure Disks
VHD Storage: No more than 40 VMs per storage account
“… and for a Standard Tier VM, it is about 40 (20,000/500 IOPS per disk)…..”
https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/
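The 40-VM figure is the account-level IOPS cap divided by the per-disk cap. A minimal sketch of that arithmetic, using only the two numbers quoted above; the function name is illustrative, and it assumes each VM keeps a single VHD in the account, which is a simplification:

```python
def max_disks_per_storage_account(account_iops_limit: int, per_disk_iops: int) -> int:
    """Disks that can all run at their full IOPS before the account cap is reached."""
    if per_disk_iops <= 0:
        raise ValueError("per_disk_iops must be positive")
    return account_iops_limit // per_disk_iops


# Figures quoted on the slide for Standard Tier storage.
ACCOUNT_IOPS_LIMIT = 20_000
PER_DISK_IOPS = 500

print(max_disks_per_storage_account(ACCOUNT_IOPS_LIMIT, PER_DISK_IOPS))
```

If a VM attaches several data disks from the same account, each disk counts separately, so the number of VMs per account drops accordingly.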
Disk Choice:
1. Local SSD (Ephemeral) – Fast, but data can be lost.
2. Data Disk (Standard Storage) – No data loss; network-attached, which can add latency. 20k IOPS limit per storage account.
3. Data Disk (Premium Storage) – No data loss; network-attached, which can add latency. Per-disk IOPS limit.
https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-how-to-attach-disk/
https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storage-limits
[Diagram: disk layout of a VM]
OS: /dev/sda – Storage Account (OS Disk)
SSD: /dev/sdb – local to the VM
Data Disk – Storage Account (Data Disk)

What to Pay Close Attention To – Azure VM Size
VM Size: We chose G4 nodes, but are investigating moving to D14 nodes. Having a larger number of smaller
nodes allows faster rebuilds, which can reduce recovery time.
https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-size-specs/

What to Pay Close Attention To – Azure Networking
Networking: Virtual Network (VNet) vs Public IP
1. Public IPs – Default limit of 5 per subscription. Allows geo-redundant replication over the Internet.
2. VNet – Define your own subnets and IP ranges. Allows geo-redundant replication via Gateways/Express Route.
No bandwidth limit within a VNet.
   • Standard Gateway – Max 100 Mbps.
   • High-Performance Gateway – Max 200 Mbps.
   • Express Route – Max 10 Gbps.
https://azure.microsoft.com/en-us/documentation/articles/virtual-networks-instance-level-public-ip/
https://azure.microsoft.com/en-us/documentation/articles/vpn-gateway-vnet-vnet-rm-ps/
https://msdn.microsoft.com/en-us/library/azure/mt586720.aspx
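To get a feel for what these link limits mean for cross-region replication, the sketch below divides each link's bandwidth by an ingestion rate to get the largest average event size the link could carry. It uses the 200k events/sec figure from Cluster 2 and makes several simplifying assumptions that are not from the slides: every event crosses the link exactly once, there is no protocol overhead or compression, and 1 Mbps means 10^6 bits per second.

```python
def max_bytes_per_event(link_megabits_per_sec: float, events_per_sec: float) -> float:
    """Largest average event size (bytes) a link can sustain at the given event rate."""
    bits_per_sec = link_megabits_per_sec * 1_000_000
    return bits_per_sec / 8 / events_per_sec


# Link limits as listed on the slide, in Mbps.
LINKS_MBPS = {
    "Standard Gateway": 100,
    "High-Performance Gateway": 200,
    "Express Route": 10_000,
}
EVENTS_PER_SEC = 200_000  # Cluster 2 ingestion rate from the slides

for name, mbps in LINKS_MBPS.items():
    print(f"{name}: {max_bytes_per_event(mbps, EVENTS_PER_SEC):.1f} bytes/event")
```

Under these assumptions the gateway options leave room for only very small events at that rate, so measuring actual replication traffic is worthwhile before choosing a link.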

What to Pay Close Attention To – Azure Networking
Test performance of every dependency and see if it meets the expectations of your application.
Network Performance: Iperf (https://iperf.fr/) – Test bandwidth between two VMs within various DCs
Within a VNet:
VM 10.1.0.10 (server): iperf -s
VM 10.1.0.11 (client): iperf -c 10.1.0.10
user@machine:~$ iperf -c 10.1.0.10
------------------------------------------------------------
Client connecting to 10.1.0.10, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[ 3] local 10.1.0.10 port 42892 connected with 10.1.0.10 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 45.7 GBytes 39.2 Gbits/sec
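A test like this is easier to repeat across regions if the result is checked automatically. The sketch below pulls the bandwidth out of a summary line shaped like the one above and compares it with a threshold. The regular expression is written against this sample output only, so other iperf versions or flags may need adjustments, and the 1,000 Mbit/s threshold is a hypothetical value, not a figure from the talk.

```python
import re
from typing import Optional

# Matches a summary line such as:
# [ 3] 0.0-10.0 sec 45.7 GBytes 39.2 Gbits/sec
SUMMARY_RE = re.compile(
    r"\[\s*\d+\]\s+[\d.]+-\s*[\d.]+\s+sec\s+[\d.]+\s+\w+\s+([\d.]+)\s+([KMG])bits/sec"
)
UNIT_TO_MBITS = {"K": 0.001, "M": 1.0, "G": 1000.0}


def bandwidth_mbits(iperf_output: str) -> Optional[float]:
    """Return the last reported bandwidth in Mbit/s, or None if no summary line is found."""
    matches = SUMMARY_RE.findall(iperf_output)
    if not matches:
        return None
    value, unit = matches[-1]
    return float(value) * UNIT_TO_MBITS[unit]


def meets_expectation(iperf_output: str, minimum_mbits: float) -> bool:
    """True when a bandwidth was parsed and it is at least the required minimum."""
    measured = bandwidth_mbits(iperf_output)
    return measured is not None and measured >= minimum_mbits


SAMPLE = """\
[ 3] local 10.1.0.10 port 42892 connected with 10.1.0.10 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 45.7 GBytes 39.2 Gbits/sec
"""

print(bandwidth_mbits(SAMPLE))
print(meets_expectation(SAMPLE, minimum_mbits=1000.0))  # hypothetical threshold
```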

What to Pay Close Attention To – Azure Storage
Test performance of every dependency and see if it meets the expectations of your application.
Disk: SysBench (https://wiki.gentoo.org/wiki/Sysbench) – Test write throughput and IOPS
user@machine:/mnt$ sysbench --test=fileio --file-total-size=1000G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12: multi-threaded system evaluation benchmark
<….. Excess Logging Removed….>
Operations performed: 402240 Read, 268160 Write, 858065 Other = 1528465 Total
Read 6.1377Gb Written 4.0918Gb Total transferred 10.229Gb (34.917Mb/sec)
2234.67 Requests/sec executed
Test execution summary:
total time: 300.0002s
total number of events: 670400
total time taken by event execution: 16.1526
per-request statistics:
min: 0.00ms
avg: 0.02ms
max: 2.20ms
approx. 95 percentile: 0.05ms
Threads fairness:
events (avg/stddev): 670400.0000/0.00
execution time (avg/stddev): 16.1526/0.00
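The headline numbers in this output can be cross-checked against each other, which is a quick way to confirm you are reading the report correctly. The sketch below recomputes requests per second, throughput, and average request size from the counts printed above. It assumes that sysbench's "Gb" and "Mb" mean GiB and MiB, which is an interpretation rather than something the output states.

```python
def summarize_fileio(reads: int, writes: int, total_gib: float, seconds: float) -> dict:
    """Derive rates from the raw counts in a sysbench fileio report."""
    io_requests = reads + writes
    return {
        "io_requests": io_requests,
        "requests_per_sec": io_requests / seconds,
        "throughput_mib_per_sec": total_gib * 1024 / seconds,
        "avg_request_kib": total_gib * 1024 * 1024 / io_requests,
    }


# Values copied from the sysbench output above.
summary = summarize_fileio(reads=402_240, writes=268_160, total_gib=10.229, seconds=300.0002)

for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

The recomputed requests per second and throughput agree with the reported 2234.67 and 34.917 to within rounding, and the average request size comes out at about 16 KiB.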

What to Pay Close Attention To – Cassandra
Metrics!
Need to tune? Al Tobey can help - https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

SSTable Count
• Too many SSTables can lead to OOM errors and nodes becoming unavailable.
• Watch count and balance compaction throughput with system limits.
• SSTable count may spike during repairs if data is inconsistent.
Dropped Mutations
• Dropped mutations mean more repairs need to be done.
• Impact of dropped mutations can be controlled by tuning write consistency.
• Check iostat to see if disk queue is building up or write latency is high.
• iostat -x /dev/sdb 1 5
• Do drops only happen when Spark Jobs batch write? Tune Spark write throughput
(https://github.com/datastax/spark-cassandra-connector/blob/v1.2.5/doc/FAQ.md)
See memtables & flushing in Al’s Tuning Guide.

Pending Compactions
• If you aren’t keeping up with compactions, performance will suffer.
• Too many SSTables impact read speed, but also can lead to hitting OS limits. See:
• /etc/sysctl.conf - vm.max_map_count
• /etc/security/limits.d/cassandra.conf – nofile
• /etc/init.d/dse – Certain DSE versions overwrite nofile with: FD_LIMIT=100000
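The OS limits listed above can be checked from a script before they become a problem. The sketch below reads the open-file limit of the current process and the system-wide vm.max_map_count on Linux, then reports which values fall below a chosen minimum. The nofile minimum of 100000 mirrors the FD_LIMIT value mentioned above; the vm.max_map_count minimum is a placeholder assumption that should be replaced with the value recommended for your DSE version. Note that the limits read are those of the process running the script, so it should run as the same user as Cassandra to be meaningful.

```python
import resource  # Unix only
from pathlib import Path

MAX_MAP_COUNT_PATH = Path("/proc/sys/vm/max_map_count")  # Linux only

# Assumed minimums: nofile mirrors FD_LIMIT from the slide,
# vm.max_map_count is a placeholder to be replaced with your own target.
MINIMUMS = {"nofile": 100_000, "vm.max_map_count": 1_048_575}


def read_current_limits() -> dict:
    """Collect the limits visible to the current process."""
    soft_nofile, _hard_nofile = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_nofile == resource.RLIM_INFINITY:
        soft_nofile = float("inf")
    limits = {"nofile": soft_nofile}
    if MAX_MAP_COUNT_PATH.exists():
        limits["vm.max_map_count"] = int(MAX_MAP_COUNT_PATH.read_text().strip())
    return limits


def limits_below_minimum(current: dict, minimums: dict) -> list:
    """Names of limits that are present in `current` and lower than the required minimum."""
    return sorted(
        name
        for name, minimum in minimums.items()
        if name in current and current[name] < minimum
    )


current = read_current_limits()
print("current limits:", current)
print("below minimum:", limits_below_minimum(current, MINIMUMS))
```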
Heap Used
• Heap usage changes over time. What works in week 1 may not work in week 10.
• We used a 20GB heap until nodes started hitting OOM when they needed 25 GB.
• Use G1 if at all possible to see GC times decrease, and use a large (25 – 30 GB) heap.
• Let G1 tune your young generation heap size.

What to Pay Close Attention To – Spark
We are still learning!
Things to watch:
• Scheduler output (NOT CRON!)
• Spark UI
• Spark job logs
If you don’t enable the Spark UI for security reasons, ship your Spark logs off box for analysis.
You may also find that jobs fail to read data because partitions are missing or nodes are timing out.
This can indicate you are overwhelming Cassandra.

Deployment
Use the Azure/DataStax Template
Azure will be investing in building more features into the Azure template, and you will get those more easily if you use the
existing template.
https://www.youtube.com/watch?v=vacp267zLBA&noredirect=1
https://github.com/DSPN/azure-resource-manager-dse
We didn’t use the template because it wasn’t ready yet. We had to write our own logic to deploy nodes, and we need to
transition to the template to get these new features. We are scheduling time to do this because it will
save us a lot of work!
Consider Security and Compliance: This will influence how you deploy (VNet vs Public IP), what Cassandra configuration
you use (internode encryption, require_client_auth: true), and what OS configuration you use (CIS standards).
C* Hardening: http://thelastpickle.com/blog/2015/09/30/hardening-cassandra-step-by-step-part-1-server-to-server.html
CIS Standards: https://benchmarks.cisecurity.org/downloads/show-single/?file=ubuntu1404.100

Azure Templates can:
• Ensure Idempotency
• Simplify Orchestration
• Simplify Roll-back
• Provide Cross-Resource Configuration and Update Support
Azure Templates are:
• Source file, checked-in
• Specifies resources and dependencies (VMs, WebSites, DBs) and connections (config, LB sets)
• Parameterized input/output
Instantiation of repeatable config.
Power of Repeatability
Configuration → Resource Group
[Diagram: a resource group containing SQL-A, a Website, and Virtual Machines (2x). The Website and the VMs each depend on SQL and receive the SQL config.]
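Declaring dependencies lets the deployment engine work out a safe ordering by itself. The sketch below illustrates the idea with the resources from the diagram, using a topological sort from the Python standard library (Python 3.9 or later). The resource names are hypothetical, and this is only a model of the concept, not how Azure Resource Manager is implemented.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it depends on (names are illustrative).
DEPENDS_ON = {
    "sql-a": set(),
    "website": {"sql-a"},
    "vm-1": {"sql-a"},
    "vm-2": {"sql-a"},
}


def deployment_order(depends_on: dict) -> list:
    """Return an order in which every resource comes after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = deployment_order(DEPENDS_ON)
print(order)
```

The database always comes first; the website and the two VMs have no ordering constraint between them, so an engine is free to create them in parallel.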

Azure VM Extensions
Extending the power of your VM
Enable easier management
Support partner ecosystem
Full control still with you!
[Diagram labels: Curated Extensions, Agent]

Thank you
