AWS Analytics

agenda overview
10:00 AM Registration
10:30 AM Introduction to Big Data @ AWS
12:00 PM Lunch + Registration for Technical Sessions
12:30 PM Data Collection and Storage
1:45PM Real-time Event Processing
3:00PM Analytics (incl Machine Learning)
4:30 PM Open Q&A Roundtable

Collect Process Analyze
Store
Data Collection
and Storage
Data
Processing
Event
Processing
Data
Analysis
primitive patterns
EMR Redshift
Machine
Learning

Process and Analyze
• Hadoop
 Ad-hoc exploration of un-structured datasets
 Batch Processing on Large datasets
• Data Warehouses
 Analysis via Visualization tools
 Interactive querying of structured data
• Machine learning
 Predictions for what will happen
 Smart applications

Hadoop and Data Warehouses
Databases
Files
Data warehouse Data Marts Reports
Hadoop
Ad-hoc Exploration
Media
Cloud
ETL

Why Amazon EMR?
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Elastic
Easily add or remove capacity
Reliable
Spend less time monitoring
Secure
Manage firewalls
Flexible
Control the cluster

Try different configurations to find your optimal architecture
CPU
c3 family
cc1.4xlarge
cc2.8xlarge
Memory
m2 family
r3 family
Disk/IO
d2 family
i2 family
General
m1 family
m3 family
Choose your instance types
Batch Machine Spark and Large
process learning interactive HDFS

Easy to add and remove compute capacity on your cluster
Match compute
demands with
cluster sizing.
Resizable clusters

Spot Instances
for task nodes
Up to 90%
off Amazon EC2
on-demand
pricing
On-demand for
core nodes
Standard
Amazon EC2
pricing for
on-demand
capacity
Easy to use Spot Instances
Meet SLA at predictable cost Exceed SLA at lower cost

Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon EMR
clusters with no data loss
• Point multiple Amazon EMR clusters
at same data in Amazon S3
EMR
EMR
Amazon
S3

EMRFS makes it easier to leverage S3
• Better performance and error handling options
• Transparent to applications – Use “s3://”
• Consistent view
 For consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side
encryption
• Faster listing using EMRFS metadata

EMRFS - S3 client-side encryption
Amazon S3
AmazonS3encryptionclients
EMRFSenabledfor
AmazonS3client-sideencryption
Key vendor (AWS KMS or your custom key vendor)
(client-side encrypted objects)

Amazon S3 EMRFS metadata
in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations
Number
of objects
Without
Consistent
Views
With Consistent
Views
1,000,000 147.72 29.70
100,000 12.70 3.69
Fast listing of S3 objects using
EMRFS metadata
*Tested using a single node cluster with a m3.xlarge instance.

Optimize to leverage HDFS
• Iterative workloads
 If you’re processing the same dataset more than once
• Disk I/O intensive workloads
Persist data on Amazon S3 and use S3DistCp to copy
to HDFS for processing

Pattern #1: Batch processing
GBs of logs pushed
to Amazon S3 hourly
Daily Amazon EMR
cluster using Hive to
process data
Input and output
stored in Amazon S3
Load subset into
Redshift DW

Pattern #2: Online data-store
Data pushed to
Amazon S3
Daily Amazon EMR cluster
Extract, Transform, and Load
(ETL) data into database
24/7 Amazon EMR cluster
running HBase holds last 2
years’ worth of data
Front-end service uses
HBase cluster to power
dashboard with high
concurrency

Pattern #3: Interactive query
TBs of logs sent
daily
Logs stored in S3
Transient EMR
clusters
Hive Metastore

File formats
• Row oriented
 Text files
 Sequence files
• Writable object
 Avro data files
• Described by schema
• Columnar format
 Object Record Columnar (ORC)
 Parquet
Logical Table
Row oriented
Column oriented

Choosing the right file format
• Processing and query tools
 Hive, Impala, and Presto.
• Evolution of schema
 Avro for schema and Presto for storage.
• File format “splittability”
 Avoid JSON/XML Files. Use them as records.

Choosing the right compression
• Time sensitive: faster compressions are a better choice
• Large amount of data: use space-efficient compressions
Algorithm Splittable? Compression Ratio
Compress +
Decompress Speed
Gzip (DEFLATE) No High Medium
bzip2 Yes Very high Slow
LZO Yes Low Fast
Snappy No Low Very fast

Dealing with small files
• Reduce HDFS block size (e.g., 1 MB [default is 128 MB])
 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-
hadoop --args “-m,dfs.block.size=1048576”
• Better: use S3DistCp to combine smaller files together
 S3DistCp takes a pattern and target path to combine smaller input files
into larger ones
 Supply a target size and compression codec

DEMO: Log Processing using Amazon EMR
• Aggregating small files using s3distcp
• Defining Hive tables with data on Amazon S3
• Transforming dataset using Batch processing
• Interactive querying using Presto and Spark-Sql
Amazon S3
Log Bucket
Amazon
EMR
Processed and
structured log data

Amazon Redshift Architecture
• Leader Node
 SQL endpoint
 Stores metadata
 Coordinates query execution
• Compute Nodes
 Local, columnar storage
 Execute queries in parallel
 Load, backup, restore via
Amazon S3; load from
Amazon DynamoDB or SSH
• Two hardware platforms
 Optimized for data processing
 DW1: HDD; scale from 2TB to 1.6PB
 DW2: SSD; scale from 160GB to 256TB
10 GigE
(HPC)
Ingestion
Backup
Restore
JDBC/ODBC

Amazon Redshift Node Types
• Optimized for I/O intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/Year
• Scale from 2TB to 1.6PB
DW1.XL: 16 GB RAM, 2 Cores
3 Spindles, 2 TB compressed storage
DW1.8XL: 128 GB RAM, 16 Cores, 24 Spindles
16 TB compressed, 2 GB/sec scan rate
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/Year
• Scale from 160GB to 256TB
DW2.L *New*: 16 GB RAM, 2 Cores,
160 GB compressed SSD storage
DW2.8XL *New*: 256 GB RAM, 32 Cores,
2.56 TB of compressed SSD storage

Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
Direct-attached storage
• With row storage you do
unnecessary I/O
• To get total amount, you have
to read everything
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375

Column storage
Data compression
Zone maps
With column storage, you only
read the data you need
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375

Column storage
Data compression
Zone maps
• Track the minimum and
maximum value for each block
• Skip over blocks that don’t
contain relevant data
10 | 13 | 14 | 26 |…
… | 100 | 245 | 324
375 | 393 | 417…
… 512 | 549 | 623
637 | 712 | 809 …
… | 834 | 921 | 959
10
324
375
623
637
959

Column storage
Data compression
Zone maps
• Use local storage for
performance
• Maximize scan rates
• Automatic replication
and continuous backup
• HDD & SSD platforms

Amazon Redshift parallelizes and
distributes everything
Query
Load
Backup/Restore
Resize

Query
Load
Backup/Restore
Resize
• Load in parallel from Amazon S3 or
DynamoDB or any SSH connection
• Data automatically distributed and
sorted according to DDL
• Scales linearly with the number of
nodes in the cluster

Query
Load
Backup/Restore
Resize
• Backups to Amazon S3 are automatic,
continuous and incremental
• Configurable system snapshot retention
period. Take user snapshots on-demand
• Cross region backups for disaster recovery
• Streaming restores enable you to resume
querying faster

Query
Load
Backup/Restore
Resize
• Resize while remaining online
• Provision a new cluster in the background
• Copy data in parallel from node to node
• Only charged for source cluster

Query
Load
Backup/Restore
Resize
• Automatic SQL endpoint
switchover via DNS
• Decommission the source cluster
• Simple operation via Console or API

Amazon Redshift works with your
existing analysis tools
JDBC/ODBC
Connect using drivers
from PostgreSQL.org
Amazon Redshift

Custom ODBC and JDBC Drivers
• Up to 35% higher performance than open source drivers
• Supported by Informatica, Microstrategy, Pentaho, Qlik,
SAS, Tableau
• Will continue to support PostgreSQL open source drivers
• Download drivers from console

User Defined Functions
• We’re enabling User Defined Functions (UDFs) so
you can add your own
 Scalar and Aggregate Functions supported
• You’ll be able to write UDFs using Python 2.7
 Syntax is largely identical to PostgreSQL UDF Syntax
 System and network calls within UDFs are prohibited
• Comes with Pandas, NumPy, and SciPy pre-installed
 You’ll also be able import your own libraries for even more
flexibility

Scalar UDF example – URL parsing
Rather than using complex REGEX expressions, you can import
standard Python URL parsing libraries and use them in your SQL

Interleaved Multi Column Sort
• Currently support Compound Sort Keys
 Optimized for applications that filter data by one leading column
• Adding support for Interleaved Sort Keys
 Optimized for filtering data by up to eight columns
 No storage overhead unlike an index
 Lower maintenance penalty compared to indexes

Compound Sort Keys Illustrated
• Records in Redshift are
stored in blocks.
• For this illustration, let’s
assume that four records fill
a block
• Records with a given cust_id
are all in one block
• However, records with a
given prod_id are spread
across four blocks
1
1
1
1
2
3
4
1
4
4
4
2
3
4
4
1
3
3
3
2
3
4
3
1
2
2
2
2
3
4
2
1
1 [1,1] [1,2] [1,3] [1,4]
2 [2,1] [2,2] [2,3] [2,4]
3 [3,1] [3,2] [3,3] [3,4]
4 [4,1] [4,2] [4,3] [4,4]
1 2 3 4
prod_id
cust_id
cust_id prod_id other columns blocks

1 [1,1] [1,2] [1,3] [1,4]
2 [2,1] [2,2] [2,3] [2,4]
3 [3,1] [3,2] [3,3] [3,4]
4 [4,1] [4,2] [4,3] [4,4]
1 2 3 4
prod_id
cust_id
Interleaved Sort Keys Illustrated
• Records with a given
cust_id are spread
across two blocks
• Records with a given
prod_id are also spread
across two blocks
• Data is sorted in equal
measures for both keys
1
1
2
2
2
1
2
3
3
4
4
4
3
4
3
1
3
4
4
2
1
2
3
3
1
2
2
4
3
4
1
1
cust_id prod_id other columns blocks

How to use the feature
• New keyword ‘INTERLEAVED’ when defining sort keys
 Existing syntax will still work and behavior is unchanged
 You can choose up to 8 columns to include and can query with any or
all of them
• No change needed to queries
• Benefits are significant
[ SORTKEY [ COMPOUND | INTERLEAVED ] ( column_name [, ...] ) ]

SELECT
INTO OUTFILE
s3cmd
COPY
Staging Prod
SQL
bcp
SQL Server
Redshift Use Case

Operational Reporting with Redshift
Amazon S3
Log Bucket
Amazon
EMR
Processed and
structured log data
Amazon
Redshift
Operational
Reports

AWS Analytics

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to AWS Analytics

Similar to AWS Analytics (20)

More from Amazon Web Services

More from Amazon Web Services (20)

Recently uploaded

Recently uploaded (20)

AWS Analytics

Editor's Notes