Data processing and analysis is where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing, and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing, and interactive analytics with AWS services such as Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Created by: Jason Morris, Solutions Architect
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.
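As a rough illustration of that crawl-then-transform flow, here is a minimal boto3 sketch; the crawler name, IAM role, bucket path, and job name are hypothetical placeholders, not details from the original session:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Point a crawler at data on S3; Glue infers the schema and stores
# table definitions in the AWS Glue Data Catalog.
glue.create_crawler(
    Name="sales-crawler",                                 # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueRole",       # placeholder role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")

# Once the data is cataloged, kick off an ETL job (Glue-generated or
# hand-authored) against the cataloged tables.
glue.start_job_run(JobName="sales-etl-job")               # hypothetical job
```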
Level: Intermediate
Speakers:
Ryan Malecky - Solutions Architect, EdTech, AWS
Rajakumar Sampathkumar - Sr. Technical Account Manager, AWS
Amazon Elastic Compute Cloud (Amazon EC2) provides a broad selection of instance types to accommodate a diverse mix of workloads. In this technical session, we provide an overview of the Amazon EC2 instance platform, key platform features, and the concept of instance generations. We dive into the current generation design choices of the different instance families, including the General Purpose, Compute Optimized, Storage Optimized, Memory Optimized, and GPU instance families. We also detail best practices and share performance tips for getting the most out of your Amazon EC2 instances.
This document discusses Amazon Redshift, a fully managed data warehousing service. It provides petabyte-scale data warehousing capabilities with performance up to 3x faster and 80% lower cost than traditional data warehousing solutions. The document outlines use cases, architecture details, pricing and total cost of ownership, security features, integration options and best practices. It also shares customer examples and an ecosystem of partners building solutions on Amazon Redshift.
Amazon Redshift is a fast, inexpensive, fully managed, petabyte-scale data warehouse service in the cloud. Key benefits include being 10x faster and cheaper than traditional data warehouses, with high availability and disaster recovery built in. It is easy to set up and use, and has a large ecosystem of integration and business intelligence tools. Common use cases include analytics on large volumes of mobile, web, IoT, and operational data. The presentation provides an overview of Amazon Redshift and how to get started, including provisioning a cluster, data modeling best practices, and loading and querying data.
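For the provisioning step mentioned above, a minimal boto3 sketch; the cluster identifier, node type, and credentials are placeholders rather than values from the presentation:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Provision a small multi-node cluster; Redshift handles replication,
# backups, and patching behind the scenes.
redshift.create_cluster(
    ClusterIdentifier="analytics-cluster",    # hypothetical
    NodeType="dw2.large",                     # SSD-backed DW2 family from the deck
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="ChangeMe123!",        # placeholder only
    DBName="analytics",
)

# Poll until the cluster is available, then it exposes a SQL endpoint.
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="analytics-cluster")
```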
Amazon Redshift is a managed service that gives you a ready-to-use data warehouse: you worry about loading and using your data, while the infrastructure details, servers, replication, and backups are handled by AWS.
(APP204) NEW LAUNCH: Introduction to AWS Service Catalog | AWS re:Invent 2014 (Amazon Web Services)
Running an IT department in a large organization is not easy. Providing your internal users with access to the latest and greatest technology, so that they can be as efficient and productive as possible, must be balanced with the need to set and maintain corporate standards, collect and disseminate best practices, and provide oversight to avoid runaway spending and technology sprawl. Introducing AWS Service Catalog, a service that allows end users in your organization to easily find and launch products using a personalized portal. You can manage catalogs of standardized offerings and control which users have access to which products, enabling compliance with business policies. Your organization can benefit from increased agility and reduced costs. Attend this session to be one of the first to learn about this new service.
In this session, we introduce AWS Glue, provide an overview of its components, and share how you can use AWS Glue to automate discovering your data, cataloging it, and preparing it for analysis.
This document describes how Amazon EC2 Spot Instances can help save up to 90% on batch processing costs by running jobs on available instances at lower prices. It explains how batch processing applications can take advantage of Spot Instances for simulations, molecular modeling, video processing, and more, while maintaining reliability through SQS queues and S3 storage.
Amazon EC2 is a cloud computing service that provides virtual computing resources such as servers and storage. It allows users to launch virtual machine instances that can be used to build and host applications. EC2 has grown significantly since its launch in 2006 to include many instance types, operating systems, pricing options, and features to improve performance, security, and scalability. Customers use EC2 for its flexibility, low costs, global accessibility, security, and ability to easily scale resources to meet variable computing needs.
This document provides an introduction to Amazon Aurora, AWS's managed relational database service. It discusses how Aurora was built to provide the speed and availability of commercial databases at the simplicity and cost-effectiveness of open source databases. The document outlines key Aurora features like automatic scaling, continuous backups, replication across Availability Zones, and integration with other AWS services. Customer case studies show how Aurora provides better performance at lower costs than alternative database options. The document also covers migration options and how Aurora offers a simpler, more cost-effective database solution than on-premises or self-managed options.
AWS Compute Evolved Week: High Performance Computing on AWS (Amazon Web Services)
AWS Compute Evolved Week at the San Francisco Loft: High Performance Computing on AWS
High Performance Computing (HPC) has been driving technology advancements for many decades. HPC enables performance-demanding applications and workloads to solve complex problems while dramatically reducing time to solution. With a history of requiring very large data centers, HPC is now on the edge of a paradigm shift. The AWS Cloud will allow customers to have access to near infinite compute and storage resources, without the overhead of running their own data centers. There are a vast number of HPC segments and verticals that are already seeing great success running their workloads on AWS. Life Sciences, Financial Services, Energy & Geo Sciences, as well as Manufacturing are successfully deploying their applications on AWS. In these two sessions we will discuss how AWS can help you run HPC workloads in the cloud.
Speakers: Pierre-Yves Aquilanti - Sr. HPC Specialized Solutions Architect, AWS & Anh Tran - Sr. HPC Specialized Solutions Architect, AWS
by Robbie Wright, Head of Amazon S3 & Amazon Glacier Product Marketing, AWS
Learn from AWS on how we've designed S3 and Glacier to be durable, available, and massively scalable. Hear how customers are using these services to enhance the accessibility and usability of their data. We will also dive into the benefits of object storage, its applications, and some best practices to follow.
For more training on AWS, visit: https://www.qa.com/amazon
AWS Loft | London - Deep Dive: Amazon RDS by Toby Knight, Manager Solutions Architecture, 18 April 2016
Sparklint is a tool for identifying and tuning inefficient Spark jobs across a cluster. It provides live views of application statistics or event-by-event analysis of historical logs for metrics like idle time, core usage, and task locality. The demo shows Sparklint analyzing access logs to count by IP, status, and verb over multiple jobs, tuning configuration settings to improve efficiency and reduce idle time. Future features may include more job detail, auto-tuning, and streaming optimizations. Sparklint is an open source project for contributing to Spark job monitoring and optimization.
This document provides an overview of AWS Lambda and how it enables serverless application development. It discusses how AWS Lambda allows developers to run code without provisioning or managing servers, and how it can automatically scale in response to triggers from various AWS services. It also summarizes some key capabilities like bringing custom code, parallel execution, integration with other AWS services, and event-driven or pull-based invocation models.
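To make the "bring custom code" point concrete, here is a minimal sketch of a Python handler for an S3-triggered function; the event shape is the standard S3 notification, while the function's purpose and names are hypothetical:

```python
import json
import urllib.parse

def lambda_handler(event, context):
    """Invoked by AWS Lambda for each S3 'ObjectCreated' notification."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(f"New object: s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps("ok")}
```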
This document discusses AWS Step Functions, a serverless workflow service for coordinating Lambda functions and other services. It provides examples of how Step Functions allows sequencing functions, selecting functions based on data, retrying functions, handling errors with try/catch/finally, running long-running code over hours, and running functions in parallel. Quotes from customers discuss how Step Functions allows building automated workflows for food delivery operations and product data updates.
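Step Functions workflows are declared in Amazon States Language; as a hedged sketch of the sequencing-and-retry idea, the following registers a two-state machine with boto3 (the Lambda ARNs, role ARN, and state names are invented for illustration):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Two Lambda tasks in sequence, with a retry policy on the first.
definition = {
    "StartAt": "ExtractOrder",
    "States": {
        "ExtractOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract",  # placeholder
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 3}],
            "Next": "NotifyCourier",
        },
        "NotifyCourier": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify",   # placeholder
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="order-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",  # placeholder
)
```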
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018 (Amazon Web Services)
Building Serverless ETL Pipelines with AWS Glue
In this session we will introduce key ETL features of AWS Glue and cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We will also discuss how to build scalable, efficient, and serverless ETL pipelines.
Ben Thurgood, Solutions Architect, Amazon Web Services
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance, and you'll hear from a customer about how their use case takes advantage of fast performance on enormous datasets, leveraging economies of scale on the AWS platform.
Amazon Redshift is a fully managed cloud data warehousing service that makes it simple and cost-effective to analyze petabytes of structured and semi-structured data. It provides fast query performance through massively parallel processing and columnar storage. Customers like NTT Docomo, Nasdaq, and Amazon have analyzed petabytes of data faster and at lower cost with Amazon Redshift than with their previous on-premises solutions.
Data & Analytics - Session 2 - Introducing Amazon Redshift (Amazon Web Services)
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. This presentation will give an introduction to the service and its pricing before diving into how it delivers fast query performance on data sets ranging from hundreds of gigabytes to a petabyte or more.
Steffen Krause, Technical Evangelist, AWS
Padraic Mulligan, Architect and Lead Developer and Mike McCarthy, CTO, Skillspage
Amazon Elastic MapReduce (Amazon EMR) is a web service that allows you to easily and securely provision and manage your Hadoop clusters. In this talk, we will introduce you to Amazon EMR design patterns, such as using various data stores like Amazon S3, how to take advantage of both transient and active clusters, and how to work with other Amazon EMR architectural patterns. We will dive deep on how to dynamically scale your cluster and address the ways you can fine-tune your cluster. We will discuss bootstrapping Hadoop applications from our partner ecosystem that you can use natively with Amazon EMR. Lastly, we will share best practices on how to keep your Amazon EMR cluster cost-effective.
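One way such a transient cluster with Spot task capacity might be launched is sketched below with boto3; the cluster name, log bucket, instance counts, and bid price are illustrative assumptions:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="nightly-etl",                        # hypothetical
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    LogUri="s3://my-bucket/emr-logs/",         # placeholder bucket
    Instances={
        "InstanceGroups": [
            # On-demand master and core nodes for reliability.
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "ON_DEMAND"},
            # Spot task nodes add cheap, interruptible capacity.
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 4, "Market": "SPOT", "BidPrice": "0.10"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient cluster
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```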
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. You'll also hear from Dan Wagner, CEO at Civis Analytics, as he discusses why the Civis data science platform was designed on top of Amazon Redshift and the AWS platform to help smart organizations bridge their data silos, build a 360-degree view of their customer relationships, and identify opportunities for driving their companies forward by leveraging enormous datasets, the power of analytics, and economies of scale on the AWS platform.
This document provides an overview of Amazon Redshift data warehousing capabilities. It discusses how Redshift is fast, inexpensive, fully managed, secure, and innovates quickly. It describes how to get started with Redshift, provision clusters, model data, load and query data, and monitor performance. It also provides an example of how MakerBot uses Redshift as part of its "Dream Stack" along with other AWS services for analytics.
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200) (Amazon Web Services)
Amazon Redshift is a fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using standard SQL and existing BI tools. It provides fast query performance by using columnar storage and data compression, automatic parallelization and distribution of data and queries, continuous backups and automated repairs. Customers use Redshift for analytics, business intelligence, and reporting on large datasets for use cases such as reducing costs, improving performance, and making more data available for analysis.
AWS Webcast - Managing Big Data in the AWS Cloud_20140924 (Amazon Web Services)
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
This document provides an overview and use cases for Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service from Amazon Web Services. It summarizes Redshift's features including columnar storage, data compression, and massively parallel query processing. It also provides examples of how Redshift is used by companies to reduce costs, improve query performance, and scale their data warehousing needs. Specific use cases and customers of Redshift are highlighted.
AWS June Webinar Series - Getting Started: Amazon Redshift (Amazon Web Services)
Amazon Redshift is a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. In this presentation, you'll get an overview of Amazon Redshift, including how it uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. Learn how, with just a few clicks in the AWS Management Console, you can set up a fully functional data warehouse that is ready to accept data without learning any new languages and that plugs in easily to the business intelligence tools and applications you use today. This webinar is ideal for anyone looking to gain deeper insight into their data without the usual challenges of time, cost, and effort.
In this webinar, you will learn how to:
• Understand what Amazon Redshift is and how it works
• Create a data warehouse interactively through the AWS Management Console
• Load data into your new Amazon Redshift data warehouse from S3
Who should attend: IT professionals, developers, and line-of-business managers.
Antoine Genereux takes us through a detailed overview of the database solutions available on the AWS Cloud, addressing the needs and requirements of customers at all levels. He also discusses Business Intelligence and Analytics solutions.
This document provides an overview of Amazon Redshift presented by Pavan Pothukuchi and Chris Liu. The agenda includes an introduction to Redshift, its benefits, use cases, and Coursera's experience using Redshift. Some key benefits highlighted are that Redshift is fast, inexpensive, fully managed, secure, and innovates quickly. Example use cases from NTT Docomo and Nasdaq are discussed. Chris Liu then discusses Coursera's experience moving from no data warehouse to using Redshift over three years, including their current ecosystem involving Redshift, other AWS services, and business intelligence applications. Lessons learned around thinking in Redshift, communicating with users, surprises, and reflections are also shared.
Speaker: Xiaoyong Han, Solution Architect, AWS
Data collection and storage is a primary challenge for any big data architecture. In this webinar, gain a thorough understanding of AWS solutions for data collection and storage, and learn architectural best practices for applying those solutions to your projects. This session will also include a discussion of popular use cases and reference architectures. In this webinar, you will learn:
• Overview of the different types of data that customers are handling to drive high-scale workloads on AWS, and how to choose the best approach for your workload
• Optimization techniques that improve performance and reduce the cost of data ingestion
• Leveraging Amazon S3, Amazon DynamoDB, and Amazon Kinesis for storage and data collection (see the producer sketch below)
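For the Kinesis piece of that list, a minimal producer sketch; the stream name and payload are hypothetical:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Write one event; the partition key controls shard placement, so a
# high-cardinality key spreads load evenly across shards.
kinesis.put_record(
    StreamName="clickstream",                 # hypothetical stream
    Data=json.dumps({"user": "u123", "action": "page_view"}),
    PartitionKey="u123",
)
```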
Selecting the Right AWS Database Solution - AWS 2017 Online Tech Talks (Amazon Web Services)
• Get an overview of managed database services available on AWS
• Learn how to combine them for high-performance cost effective architectures
• Learn how to choose between the AWS database services based on your use case
On AWS you can choose from a variety of managed database services that save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We'll explain the fundamentals of Amazon RDS, a managed relational database service in the cloud; Amazon DynamoDB, a fully managed NoSQL database service; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be economical. We will cover how each service might help support your application and how to get started.
Getting Started with Big Data and HPC in the Cloud - August 2015 (Amazon Web Services)
How can you use Big Data to grow your business and discover new opportunities? When organizations effectively capture, analyze, visualize and apply big data insights to their business goals, they differentiate themselves from their competitors and outperform them in terms of operational efficiency and the bottom line. With Amazon Web Services, businesses and researchers can easily fulfill their high performance computing (HPC) requirements with the added benefit of ad-hoc provisioning, pay-as-you-go pricing and faster time-to-results. Join this session to understand how to run HPC applications in AWS cloud, and about different AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (Data Warehouse) and Amazon Kinesis (Streaming), when to use them and how they work together.
Getting Started with Managed Database Services on AWS - September 2016 Webina... (Amazon Web Services)
On AWS you can choose from a variety of managed database services that save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We'll explain the fundamentals of Amazon RDS, a managed relational database service in the cloud; Amazon DynamoDB, a fully managed NoSQL database service; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be surprisingly economical. We will cover how each service might help support your application, how much each service costs, and how to get started.
Learning Objectives:
• Overview of managed database services available on AWS
• How to combine them for high-performance cost effective architectures
• Learn how to choose between the AWS database services based on the use case
Who Should Attend:
• IT Managers, DBAs, Enterprise and Solution Architects, DevOps Engineers, and Developers
A quick overview of Redshift and common use cases, followed by tools and links for performance tuning and a look at how Redshift fits into the AWS data services. Includes a list of key new features since the last meetup in September 2016, such as Redshift Spectrum, which lets you run SQL directly on data sitting in Amazon S3, as well as the Redshift ecosystem of data integration, BI, consultancy, and data modelling partners.
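To give a flavor of Redshift Spectrum, here is a hedged sketch that registers an external schema over the AWS Glue Data Catalog and queries S3-resident data through an ordinary SQL connection; the endpoint, credentials, IAM role, database, and table names are all placeholders:

```python
import psycopg2

# Redshift speaks the PostgreSQL wire protocol on port 5439.
conn = psycopg2.connect(
    host="analytics-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="admin", password="...",
)
conn.autocommit = True
cur = conn.cursor()

# Map a Data Catalog database to an external schema, then query files
# on S3 as if they were local tables.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'sales_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
""")
cur.execute("SELECT count(*) FROM spectrum.events WHERE year = 2017")
print(cur.fetchone())
```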
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools.
This webinar will provide an overview of Redshift with an emphasis on the many changes we recently introduced. In particular, we will address the newly released DW2 instance types and what you can do with them.
This content is designed for database developers and architects interested in Amazon Redshift.
How to build Forecasting services using ML and deep learning algorithms... (Amazon Web Services)
Forecasting is an important process for many companies, used in many areas to try to accurately predict product growth and distribution, the resources needed on production lines, financial projections, and much more. Amazon uses advanced forecasting techniques, and some of these services have been made available to all AWS customers.
In this session we will show how to pre-process data that has a temporal component and then apply an algorithm that, based on the type of data analyzed, produces an accurate forecast.
Big Data for Startups: how to create Big Data applications in Serverless mode... (Amazon Web Services)
The variety and quantity of data created every day is accelerating ever faster and represents a unique opportunity to innovate and create new startups.
Managing large amounts of data can seem complex, however: building large-scale Big Data clusters looks like an investment within reach only of established companies. But the elasticity of the cloud and, in particular, serverless services let us break through these limits.
Let's see, then, how to develop Big Data applications quickly, without worrying about infrastructure, devoting all our resources to developing the ideas behind innovative products.
You can now use Amazon Elastic Kubernetes Service (EKS) to run Kubernetes pods on AWS Fargate, the serverless compute engine built for containers on AWS. This makes it easier than ever to build and run your Kubernetes applications in the AWS cloud. In this session we will present the main features of the service and how to deploy your application in a few steps.
Twenty years ago, Amazon went through a radical transformation aimed at increasing the pace of innovation. In that time we learned how changing our approach to application development dramatically increased our agility and release velocity and, ultimately, let us build more reliable and scalable applications. In this session we will explain how we define modern applications and how building modern apps affects not only application architecture but also organizational structure, development release pipelines, and even the operating model. We will also describe common approaches to modernization, including the approach used by Amazon.com itself.
How to spend up to 90% less with containers and Spot Instances (Amazon Web Services)
Container usage keeps growing.
When properly designed, container-based applications are very often stateless and flexible.
AWS ECS, EKS, and Kubernetes on EC2 can take advantage of Spot Instances, with average savings of 70% compared to On-Demand Instances. In this session we will look at the characteristics of Spot Instances and how easily they can be used on AWS. We will also learn how Spreaker uses Spot Instances to run applications of various kinds, in production, at a fraction of the on-demand cost!
In recent months, many customers have been asking us how to monetise Open APIs, simplify Fintech integrations, and accelerate adoption of various Open Banking business models. AWS and FinConecta would therefore like to invite you to the Open Finance marketplace presentation on October 20th.
Event Agenda:
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a strategic approach
Q&A
Make your startup's offering unique in the market with Machine Learning services... (Amazon Web Services)
To create value and build a differentiated, recognizable offering, successful startups know how to combine established technologies with innovative, purpose-built components.
AWS provides ready-to-use services and, at the same time, lets you customize and create the differentiating elements of your own offering.
Focusing on Machine Learning technologies, we will see how to select the artificial intelligence services offered by AWS and, with the help of a demo, how to build custom Machine Learning models using SageMaker Studio.
OpsWorks Configuration Management: automate the management and deployment of... (Amazon Web Services)
With the traditional approach to IT, implementing DevOps techniques was difficult for many years: they often involved manual activities that occasionally caused application downtime and interrupted users' work. With the advent of the cloud, DevOps techniques are now within everyone's reach, at low cost, for any kind of workload, ensuring greater system reliability and delivering significant business continuity improvements.
AWS offers AWS OpsWorks as a Configuration Management tool that aims to automate and simplify the management and deployment of EC2 instances using Chef and Puppet.
Discover how to use AWS OpsWorks to guarantee the reliability of your application running on EC2 instances.
Microsoft Active Directory on AWS to support your Windows workloads (Amazon Web Services)
Want to know your options for running Microsoft Active Directory on AWS? When moving Microsoft workloads to AWS, it is important to consider how to deploy Microsoft Active Directory to support group policy management, authentication, and authorization. In this session we discuss options for deploying Microsoft Active Directory on AWS, including AWS Directory Service for Microsoft Active Directory and running Active Directory on Windows on Amazon Elastic Compute Cloud (Amazon EC2). We cover topics such as integrating your on-premises Microsoft Active Directory environment with the cloud and using SaaS applications, such as Office 365, with AWS Single Sign-On.
From facial recognition to spotting fraud or manufacturing defects, image and video analysis built on artificial intelligence techniques is evolving and being refined at a rapid pace. In this webinar we will explore what AWS services make possible by applying state-of-the-art computer vision techniques to real-world scenarios.
Amazon Web Services and VMware are hosting a free virtual event on Wednesday, October 14 from 12:00 to 13:00 dedicated to VMware Cloud™ on AWS, the on-demand service that lets you run applications in cloud environments based on VMware vSphere® and access a wide range of AWS services, taking full advantage of the AWS cloud while protecting your existing VMware investments.
Build your first serverless ledger-based app with QLDB and NodeJS (Amazon Web Services)
Many companies today build applications with ledger-style functionality, for example to verify the history of credits and debits in banking transactions or to track the supply-chain flow of their products.
At the heart of these solutions are ledger databases, which provide a transparent, immutable, and cryptographically verifiable transaction log, but they are complex and costly tools to manage.
Amazon QLDB removes the need to build custom, complex systems by providing a fully managed, serverless ledger database.
In this session we will see how to build a complete serverless application that uses QLDB's capabilities.
With the rise of microservice architectures and rich mobile and web applications, APIs are more important than ever for giving end users an exceptional user experience. In this session we will learn how to tackle modern API design challenges with GraphQL, an open-source API query language used by Facebook, Amazon, and others, and how to use AWS AppSync, a managed serverless GraphQL service on AWS. We will dig into several scenarios, understanding how AppSync can help solve these use cases by building modern APIs with real-time and offline data-update capabilities.
We will also learn how Sky Italia uses AWS AppSync to deliver real-time sports updates to the users of its web portal.
Oracle databases and VMware Cloud™ on AWS: debunking the myths (Amazon Web Services)
Many organizations reap the benefits of the cloud by migrating their Oracle workloads, securing significant gains in agility and cost efficiency.
Migrating these workloads can create complexity during application modernization and refactoring, along with performance risks that can be introduced when moving applications out of on-premises data centers.
In these slides, AWS and VMware experts present simple, practical tips that ease and streamline the migration of Oracle workloads while accelerating the transformation to the cloud; they dive into the architecture and show how to take full advantage of VMware Cloud™ on AWS.
1) The document discusses building a minimum viable product (MVP) using Amazon Web Services (AWS).
2) It provides an example of an MVP for an omni-channel messenger platform, built starting in 2017, that connects ecommerce stores to customers via web chat, Facebook Messenger, WhatsApp, and other channels.
3) The founder discusses how they started with an MVP in 2017 with 200 ecommerce stores in Hong Kong and Taiwan, and have since expanded to over 5000 clients across Southeast Asia using AWS for scaling.
This document discusses pitch decks and fundraising materials. It explains that venture capitalists will typically spend only 3 minutes and 44 seconds reviewing a pitch deck. Therefore, the deck needs to tell a compelling story to grab their attention. It also provides tips on tailoring different types of decks for different purposes, such as creating a concise 1-2 page teaser, a presentation deck for pitching in-person, and a more detailed read-only or fundraising deck. The document stresses the importance of including key information like the problem, solution, product, traction, market size, plans, team, and ask.
This document discusses building serverless web applications using AWS services like API Gateway, Lambda, DynamoDB, S3 and Amplify. It provides an overview of each service and how they can work together to create a scalable, secure and cost-effective serverless application stack without having to manage servers or infrastructure. Key services covered include API Gateway for hosting APIs, Lambda for backend logic, DynamoDB for database needs, S3 for static content, and Amplify for frontend hosting and continuous deployment.
This document provides tips for fundraising from startup founders Roland Yau and Sze Lok Chan. It discusses generating competition to create urgency for investors, fundraising in parallel rather than sequentially, having a clear fundraising narrative focused on what you do and why it's compelling, and prioritizing relationships with people over firms. It also notes how the pandemic has changed fundraising, with examples of deals done virtually during this time. The tips emphasize being fully prepared before fundraising and cultivating connections with investors in advance.
AWS_HK_StartupDay_Building Interactive websites while automating for efficien... (Amazon Web Services)
This document discusses Amazon's machine learning services for building conversational interfaces and extracting insights from unstructured text and audio. It describes Amazon Lex for creating chatbots, Amazon Comprehend for natural language processing tasks like entity extraction and sentiment analysis, and how they can be used together for applications like intelligent call centers and content analysis. Pre-trained APIs simplify adding machine learning to apps without requiring ML expertise.
Amazon Elastic Container Service (Amazon ECS) is a highly scalable container management service that simplifies running Docker containers through an orchestration layer controlling deployment and lifecycle. In this session we will present the main features of the service, reference architectures for different workloads, and the simple steps needed to quickly migrate one or more of your containers.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottle necks, making a fix and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their application fast and understanding how things work under the hood.
Keynote : AI & Future Of Offensive SecurityPriyanka Aash
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
Redefining Cybersecurity with AI CapabilitiesPriyanka Aash
In this comprehensive overview of Cisco's latest innovations in cybersecurity, the focus is squarely on resilience and adaptation in the face of evolving threats. The discussion covers the imperative of tackling Mal information, the increasing sophistication of insider attacks, and the expanding attack surfaces in a hybrid work environment. Emphasizing a shift towards integrated platforms over fragmented tools, Cisco introduces its Security Cloud, designed to provide end-to-end visibility and robust protection across user interactions, cloud environments, and breaches. AI emerges as a pivotal tool, from enhancing user experiences to predicting and defending against cyber threats. The blog underscores Cisco's commitment to simplifying security stacks while ensuring efficacy and economic feasibility, making a compelling case for their platform approach in safeguarding digital landscapes.
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
It's your unstructured data: How to get your GenAI app to production (and spe...Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Snarky Security
How wonderful it is that in our modern age, every bit of our biological data can be digitized, stored, and potentially pilfered by cyber thieves! Isn't it just splendid to think that while scientists are busy pushing the boundaries of biotechnology, hackers could be plotting the next big bio-data heist? This delightful scenario is brought to you by the ever-expanding digital landscape of biology and biotechnology, where the integration of computer science, engineering, and data science transforms our understanding and manipulation of biological systems.
While the fusion of technology and biology offers immense benefits, it also necessitates a careful consideration of the ethical, security, and associated social implications. But let's be honest, in the grand scheme of things, what's a little risk compared to potential scientific achievements? After all, progress in biotechnology waits for no one, and we're just along for the ride in this thrilling, slightly terrifying, adventure.
So, as we continue to navigate this complex landscape, let's not forget the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. After all, what could possibly go wrong?
-------------------------
This document provides a comprehensive analysis of the security implications biological data use. The analysis explores various aspects of biological data security, including the vulnerabilities associated with data access, the potential for misuse by state and non-state actors, and the implications for national and transnational security. Key aspects considered include the impact of technological advancements on data security, the role of international policies in data governance, and the strategies for mitigating risks associated with unauthorized data access.
This view offers valuable insights for security professionals, policymakers, and industry leaders across various sectors, highlighting the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. The analysis serves as a crucial resource for understanding the complex dynamics at the intersection of biotechnology and security, providing actionable recommendations to enhance biosecurity in an digital and interconnected world.
The evolving landscape of biology and biotechnology, significantly influenced by advancements in computer science, engineering, and data science, is reshaping our understanding and manipulation of biological systems. The integration of these disciplines has led to the development of fields such as computational biology and synthetic biology, which utilize computational power and engineering principles to solve complex biological problems and innovate new biotechnological applications. This interdisciplinary approach has not only accelerated research and development but also introduced new capabilities such as gene editing and biomanufact
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
Finetuning GenAI For Hacking and DefendingPriyanka Aash
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
Retrieval Augmented Generation Evaluation with RagasZilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
2. Agenda Overview
10:00 AM Registration
10:30 AM Introduction to Big Data @ AWS
12:00 PM Lunch + Registration for Technical Sessions
12:30 PM Data Collection and Storage
1:45 PM Real-time Event Processing
3:00 PM Analytics (incl. Machine Learning)
4:30 PM Open Q&A Roundtable
3. Primitive patterns: Collect → Store → Process → Analyze, spanning data collection and storage, event processing, data processing, and data analysis, with EMR, Redshift, and Machine Learning as the core services.
5. Why Amazon EMR?
Easy to use: launch a cluster in minutes
Low cost: pay an hourly rate
Elastic: easily add or remove capacity
Reliable: spend less time monitoring
Secure: manage firewalls
Flexible: control the cluster
7. Choose your instance types, and try different configurations to find your optimal architecture:
CPU: c3 family, cc1.4xlarge, cc2.8xlarge
Memory: m2 family, r3 family
Disk/IO: d2 family, i2 family
General: m1 family, m3 family
Typical workloads: batch processing, machine learning, Spark and interactive analysis, large HDFS
8. Resizable clusters: easy to add or remove compute capacity, matching compute demands with cluster sizing.
9. Easy to use Spot Instances:
Spot Instances for task nodes (up to 90% off Amazon EC2 on-demand pricing): exceed your SLA at lower cost
On-demand for core nodes (standard Amazon EC2 pricing for on-demand capacity): meet your SLA at predictable cost
10. Amazon S3 as your persistent data store:
• Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at the same data in Amazon S3
11. EMRFS makes it easier to leverage S3
• Better performance and error handling options
• Transparent to applications: use "s3://"
• Consistent view: consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
12. Amazon S3 EMRFS metadata in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations

Fast listing of S3 objects using EMRFS metadata*:

Number of objects | Without Consistent Views | With Consistent Views
1,000,000         | 147.72                   | 29.70
100,000           | 12.70                    | 3.69

*Tested using a single-node cluster with an m3.xlarge instance.
14. Optimize to leverage HDFS
• Iterative workloads: if you're processing the same dataset more than once
• Disk I/O intensive workloads
Persist data on Amazon S3 and use S3DistCp to copy it to HDFS for processing
15. Pattern #1: Batch processing
GBs of logs pushed to Amazon S3 hourly → daily Amazon EMR cluster using Hive to process the data → input and output stored in Amazon S3 → load a subset into a Redshift DW
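A minimal HiveQL sketch of this pattern, assuming a hypothetical s3://mybucket layout and log schema (table and column names are illustrative, not from the deck):

-- External table over the raw hourly logs already sitting in Amazon S3
CREATE EXTERNAL TABLE raw_logs (
  event_time STRING,
  user_id    STRING,
  url        STRING,
  status     INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://mybucket/logs/raw/';

-- Output also lives in S3, so the cluster can be shut down after the
-- daily run with no data loss
CREATE EXTERNAL TABLE daily_summary (
  url  STRING,
  hits BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://mybucket/logs/summary/';

-- Daily aggregation job run by the transient EMR cluster
INSERT OVERWRITE TABLE daily_summary
SELECT url, COUNT(*) AS hits
FROM raw_logs
GROUP BY url;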
16. Pattern #2: Online data-store
Data pushed to Amazon S3 → daily Amazon EMR cluster extracts, transforms, and loads (ETL) data into a database → a 24/7 Amazon EMR cluster running HBase holds the last 2 years' worth of data → a front-end service uses the HBase cluster to power a dashboard with high concurrency
17. Pattern #3: Interactive query
TBs of logs sent daily → logs stored in S3 → transient EMR clusters → Hive Metastore
18. Example: Log Processing using Amazon EMR
• Aggregating small files using s3distcp
• Defining Hive tables with data on Amazon S3
• Interactive querying using Hue
Flow: Amazon S3 log bucket → Amazon EMR → processed and structured log data
19. Example: months of user history analyzed using EMR to find common misspellings (Westen, Wistin, Westan, Whestin) and power automatic spelling corrections.
22. Clickstream Analysis for Amazon.com
• Redshift runs web log analysis for Amazon.com
100-node Redshift cluster
Over one petabyte workload
Largest table: 400 TB
2 TB of data per day
• Understand customer behavior
Who is browsing but not buying
Which products / features are winners
What sequence led to higher customer conversion
23. Redshift Performance Realized
• Scan 15 months of data (2.25 trillion rows): 14 minutes
• Load one day's worth of data (5 billion rows): 10 minutes
• Backfill one month of data (150 billion rows): 9.75 hours
• Pig → Amazon Redshift: 2 days to 1 hour (10B-row join with 700M rows)
• Oracle → Amazon Redshift: 90 hours to 8 hours (reduced the number of SQL statements by a factor of 3)
24. Amazon Redshift Architecture
• Leader node
SQL endpoint (JDBC/ODBC)
Stores metadata
Coordinates query execution
• Compute nodes
Local, columnar storage
Execute queries in parallel
Interconnected via 10 GigE (HPC) networking
Load, backup, restore via Amazon S3 (ingestion, backup, restore); load from Amazon DynamoDB or SSH
• Two hardware platforms, optimized for data processing
DW1: HDD; scale from 2 TB to 2 PB
DW2: SSD; scale from 160 GB to 325 TB
25. Amazon Redshift Node Types
DW1 (HDD):
• Optimized for I/O-intensive workloads
• High disk density
• On demand at $0.85/hour; as low as $1,000/TB/year
• Scale from 2 TB to 1.6 PB
• DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage
• DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed, 2 GB/sec scan rate
DW2 (SSD):
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour; as low as $5,500/TB/year
• Scale from 160 GB to 256 TB
• DW2.L: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
• DW2.8XL: 256 GB RAM, 32 cores, 2.56 TB of compressed SSD storage
26. Amazon Redshift dramatically reduces I/O
(Column storage | Data compression | Zone maps | Direct-attached storage)
• With row storage you do unnecessary I/O
• To get the total amount, you have to read everything

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
27. Amazon Redshift dramatically reduces I/O
(Column storage | Data compression | Zone maps | Direct-attached storage)
• With column storage, you only read the data you need (e.g., only the Amount column)

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
28. analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
Amazon Redshift dramatically reduces I/O
(Column storage | Data compression | Zone maps | Direct-attached storage)
• COPY compresses automatically
• You can analyze and override
• More performance, less cost
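For instance, you could pin the encodings suggested by ANALYZE COMPRESSION above explicitly in the DDL. This is a hedged sketch only: the column types are assumptions, since the deck shows only names and encodings.

-- Recreate the table with explicit column encodings instead of relying
-- on COPY's automatic compression analysis (column types assumed)
CREATE TABLE listing (
  listid         INTEGER      ENCODE delta,
  sellerid       INTEGER      ENCODE delta32k,
  eventid        INTEGER      ENCODE delta32k,
  dateid         SMALLINT     ENCODE bytedict,
  numtickets     SMALLINT     ENCODE bytedict,
  priceperticket DECIMAL(8,2) ENCODE delta32k,
  totalprice     DECIMAL(8,2) ENCODE mostly32,
  listtime       TIMESTAMP    ENCODE raw
);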
29. Amazon Redshift dramatically reduces I/O
(Column storage | Data compression | Zone maps | Direct-attached storage)
• Track the minimum and maximum value for each block
• Skip over blocks that don't contain relevant data
Example zone maps: a block holding 10 | 13 | 14 | 26 | … | 100 | 245 | 324 is tagged min 10 / max 324; the next block (375 | 393 | 417 | … | 512 | 549 | 623) min 375 / max 623; the next (637 | 712 | 809 | … | 834 | 921 | 959) min 637 / max 959.
30. Amazon Redshift dramatically reduces I/O
(Column storage | Data compression | Zone maps | Direct-attached storage)
• Use local storage for performance
• Maximize scan rates
• Automatic replication and continuous backup
• HDD & SSD platforms
32. Amazon Redshift parallelizes and distributes everything
(Query | Load | Backup/Restore | Resize)
• Load in parallel from Amazon S3, DynamoDB, or any SSH connection
• Data automatically distributed and sorted according to DDL
• Scales linearly with the number of nodes in the cluster
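For example, a parallel load from an S3 prefix might look like the following sketch (bucket, table, and credentials are placeholders, not from the deck):

-- Every file under the prefix is split across the cluster's slices
-- and loaded in parallel
COPY sales
FROM 's3://mybucket/sales/2015/06/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '|'
GZIP;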
33. Amazon Redshift parallelizes and distributes everything
(Query | Load | Backup/Restore | Resize)
• Backups to Amazon S3 are automatic, continuous, and incremental
• Configurable system snapshot retention period; take user snapshots on demand
• Cross-region backups for disaster recovery
• Streaming restores enable you to resume querying faster
34. Amazon Redshift parallelizes and distributes everything
(Query | Load | Backup/Restore | Resize)
• Resize while remaining online
• Provision a new cluster in the background
• Copy data in parallel from node to node
• Only charged for the source cluster
35. Amazon Redshift parallelizes and distributes everything
(Query | Load | Backup/Restore | Resize)
• Automatic SQL endpoint switchover via DNS
• Decommission the source cluster
• Simple operation via Console or API
37. Table Distribution Styles
Each compute node holds slices (e.g., Node 1: slices 1 and 2; Node 2: slices 3 and 4). Rows are distributed across those slices in one of three styles:
• Distribution Key: same key to same location (rows hashed on a chosen column)
• All: all data on every node
• Even: round-robin distribution
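A hedged DDL sketch of the three styles (table and column names are illustrative):

-- KEY: rows with the same customer_id land on the same slice,
-- co-locating them for joins on customer_id
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  total       DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- ALL: a full copy on every node; good for small dimension tables
CREATE TABLE regions (region_id INT, name VARCHAR(64))
DISTSTYLE ALL;

-- EVEN: round-robin across slices (the default)
CREATE TABLE clicks (click_id BIGINT, url VARCHAR(1024))
DISTSTYLE EVEN;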
38. Sorting Data
• In the slices (on disk), the data is sorted by a sort key
If no sort key exists, Redshift uses the data insertion order
• Choose a sort key that is frequently used in your queries
As a query predicate (date, identifier, …)
As a join parameter (it can also be the hash key)
• The sort key allows Redshift to avoid reading entire blocks based on predicates
For example, a table with a timestamp sort key where only recent data is accessed will skip blocks containing "old" data
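A hedged sketch of that recent-data case (names and schema are illustrative):

-- Timestamp sort key: blocks are stored in event_time order, and the
-- per-block min/max zone maps let Redshift skip blocks of "old" data
CREATE TABLE events (
  event_time TIMESTAMP,
  user_id    BIGINT,
  action     VARCHAR(32)
)
SORTKEY (event_time);

-- Only blocks whose zone map overlaps the last 7 days are read
SELECT action, COUNT(*)
FROM events
WHERE event_time >= DATEADD(day, -7, GETDATE())
GROUP BY action;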
39. Interleaved Multi-Column Sort
• Compound sort keys
Optimized for applications that filter data by one leading column
• Interleaved sort keys (new)
Optimized for filtering data by up to eight columns
No storage overhead, unlike an index
Lower maintenance penalty compared to indexes
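Illustrative DDL for the two options, using the cust_id/prod_id table from the next two slides (schema assumed):

-- Compound: best when queries filter on the leading column (cust_id)
CREATE TABLE sales_compound (
  cust_id INT,
  prod_id INT,
  amount  DECIMAL(10,2)
)
COMPOUND SORTKEY (cust_id, prod_id);

-- Interleaved: equal weight to each key, so filters on either
-- cust_id or prod_id alone can skip blocks
CREATE TABLE sales_interleaved (
  cust_id INT,
  prod_id INT,
  amount  DECIMAL(10,2)
)
INTERLEAVED SORTKEY (cust_id, prod_id);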
40. Compound Sort Keys Illustrated
• Records in Redshift are stored in blocks.
• For this illustration, let's assume that four records fill a block.
• Records with a given cust_id are all in one block.
• However, records with a given prod_id are spread across four blocks.
[Illustration: a 4x4 grid of (cust_id, prod_id) pairs, cust_id 1-4 by prod_id 1-4; with a compound sort key on (cust_id, prod_id), each block holds all four rows for one cust_id, e.g. [1,1] [1,2] [1,3] [1,4].]
41. Interleaved Sort Keys Illustrated
• Records with a given cust_id are spread across two blocks
• Records with a given prod_id are also spread across two blocks
• Data is sorted in equal measures for both keys
[Illustration: the same 4x4 grid of (cust_id, prod_id) pairs; with an interleaved sort key on (cust_id, prod_id), each block mixes key values so any single cust_id or prod_id value touches only two blocks.]
43. Custom ODBC and JDBC Drivers
• Up to 35% higher performance than open source drivers
• Supported by Informatica, MicroStrategy, Pentaho, Qlik, SAS, Tableau, Tibco, and others
• Will continue to support PostgreSQL open source drivers
• Download drivers from the console
44. User Defined Functions
• We're enabling User Defined Functions (UDFs) so you can add your own
Scalar and aggregate functions supported
• You'll be able to write UDFs using Python 2.7
Syntax is largely identical to PostgreSQL UDF syntax
System and network calls within UDFs are prohibited
• Comes with Pandas, NumPy, and SciPy pre-installed
You'll also be able to import your own libraries for even more flexibility
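A minimal scalar UDF sketch in the announced Python syntax (the haversine-distance function itself is illustrative, not from the deck):

-- Scalar Python UDF; runs per row, no system or network calls allowed
CREATE OR REPLACE FUNCTION f_distance_km (lat1 FLOAT, lon1 FLOAT, lat2 FLOAT, lon2 FLOAT)
RETURNS FLOAT
IMMUTABLE
AS $$
  from math import radians, sin, cos, asin, sqrt
  dlat = radians(lat2 - lat1)
  dlon = radians(lon2 - lon1)
  a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
  return 12742 * asin(sqrt(a))  # 12742 km = Earth's diameter
$$ LANGUAGE plpythonu;

-- Call it like any built-in function
SELECT f_distance_km(47.6, -122.3, 40.7, -74.0);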
46. Operational Reporting with Redshift
Flow: Amazon S3 log bucket → Amazon EMR (processed and structured log data) → Amazon Redshift → operational reports
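An operational report at the end of this pipeline can be a plain aggregate query over the structured log table; this is a sketch assuming a hypothetical web_logs schema:

-- Daily error-rate report over EMR-processed logs loaded into Redshift
SELECT TRUNC(event_time) AS day,
       COUNT(*) AS requests,
       SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END) AS server_errors
FROM web_logs
GROUP BY TRUNC(event_time)
ORDER BY day;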
47. Amazon Web Services' global customer and partner conference
Learn more and register: reinvent.awsevents.com
October 6-9, 2015 | The Venetian, Las Vegas, NV
Amazon EMR is more than just MapReduce.
Bootstrap actions available on GitHub
In the next few slides, we'll talk about data persistence models with Amazon EMR. The first pattern is Amazon S3 as HDFS. With this data persistence model, data gets stored on Amazon S3, and HDFS does not play any role in storing it; HDFS is only there for temporary storage. Another common thing I hear is that storing data on Amazon S3 instead of HDFS slows a job down a lot because data has to get copied to HDFS/disk first before processing starts. That's incorrect. If you tell Hadoop that your data is on Amazon S3, Hadoop reads directly from Amazon S3 and streams data to the mappers without touching the disk. To be completely accurate, data does touch HDFS when it has to shuffle from mappers to reducers, but as I mentioned, HDFS acts as the temp space and nothing more.
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
And EMRFS brings every other feature that comes with Amazon S3, such as SSE, lifecycle policies, etc. And again, keep in mind that Amazon S3 as the storage layer is the main reason why we can build elastic clusters where nodes get added and removed dynamically without any data loss.
EMR example #3: EMR as an ETL and query engine for investigations that require all the raw data.
CloudFront logs arrive out of order.
200-node cluster: spin it up daily, shut it down.
Customer examples: Nasdaq (security), HasOffers (loads 60M rows per day in 2-minute intervals), Desk (high-concurrency user-facing portal on a read/write cluster), and Amazon.com/NTT (PB scale). Pinterest saw 50-100x speedups when it moved 300 TB from Hadoop to Redshift; Nokia saw a 50% reduction in costs.
Today we will go over the role of Amazon Redshift in addressing the web log analysis problem for one of the largest online retailers, Amazon.com.
<go over the slide with restated language>
Read only the data you need
Comments on next slide.
Redshift is a distributed system:
A cluster contains a leader node and compute nodes
A compute node contains slices (one per core) that contain data
Data is distributed among slices in 3 ways:
Even – Rows distributed in Round Robin fashion (default)
Key – Rows distributed based on a distribution key (hash of a defined column)
All - Rows distributed to all slices
Queries run on all slices in parallel
Optimal query throughput can be achieved when data is evenly spread across slices
Redshift leverages sorting in storage.
Redshift stores column data in blocks; for the sort key, the data blocks are "marked" with the min and max values of that column, allowing Redshift to skip reading blocks that are not relevant to the current query.
Check that join parameter statement is true (best practices on designing tables)
Redshift works with the customer's BI tool of choice through PostgreSQL drivers over a JDBC or ODBC connection. A number of partners shown here have certified integrations with Redshift, meaning they have done testing to validate and build the integration and make using Redshift easy from a UI perspective. If there are tools customers use that are not shown, we can work with the Redshift team on getting them integrated.
So, we started with our MySQL server, but this time we ran SQL statements directly on the server itself to dump the data out to local files. Then, using s3cmd, we copied the flat files into our S3 bucket.
Select data from MySQL and use s3cmd to copy the resulting flat files to S3, as sketched below.
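A sketch of that export step in MySQL (path, table, and column names are illustrative; the s3cmd upload happens outside SQL):

-- Dump a table to a local flat file in a Redshift-friendly format;
-- the file is then pushed to S3 with:
--   s3cmd put /tmp/orders.csv s3://mybucket/stage/
SELECT order_id, customer_id, total, created_at
FROM orders
INTO OUTFILE '/tmp/orders.csv'
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';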
Use BCP to export data into an EC2 instance, which generates and copies flat files to S3.
And then instead of using EMR, we just run some crazy SQL statements to transform the data into the Production version of Redshift.
Copy data into a staging schema in Redshift where it can be transformed via SQL to the final table structure and loaded into the production schema.
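A hedged sketch of that staging-to-production transform (schema and column names are illustrative):

BEGIN;
-- Cast and filter raw staged rows, then append to the production table
INSERT INTO prod.orders (order_id, customer_id, total, created_at)
SELECT order_id, customer_id, total, created_at::TIMESTAMP
FROM staging.orders_raw
WHERE total IS NOT NULL;
-- Clear the staging table for the next batch
DELETE FROM staging.orders_raw;
COMMIT;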
Use standard tools, like MicroStrategy and Tableau, to provide business views into the data.
And then of course we need a good way for business users to look at the data, and that’s where MicroStrategy and Tableau come into play.