Amazon Redshift: Introduction and Hands-on Lab
Amazon Web Services Solutions Architect
May 30, 2017
13:00 – 15:00
• Redshift overview, table design, and data loading considerations (30 min)
• Qwiklab Lab 1 - Advanced Amazon Redshift: Data Loading (45 min)
• Qwiklab Lab 2 - Advanced Amazon Redshift: Table Layout and Schema Design (45 min)
Amazon Redshift
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; starts at $0.25/hour
a lot faster
a lot simpler
a lot cheaper
Amazon Redshift Architecture
Leader Node
Simple SQL end point
Stores metadata
Optimizes query plan
Coordinates query execution
Compute Nodes
Local columnar storage
Parallel/distributed execution of all queries, loads, backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS1/DS2: HDD; scale from 2 TB to 2 PB
10 GigE
How Amazon Redshift Is Built for Speed
Dramatically less I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
analyze compression listing;
Table | Column | Encoding
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
10 | 13 | 14 | 26 | … | 100 | 245 | 324
375 | 393 | 417 | … | 512 | 549 | 623
637 | 712 | 809 | … | 834 | 921 | 959
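The sorted values above illustrate zone maps: if only the minimum and maximum of each block are kept in memory, a range filter can skip blocks that cannot contain a match. The following is a minimal sketch of that idea, not Redshift's internal implementation. The grouping into three blocks and the helper name blocks_to_scan are assumptions made for illustration, and the values elided on the slide are simply left out.

```python
from typing import List, Tuple

# Assumed grouping of the slide's sorted values into three blocks.
# Values elided on the slide ("…") are omitted here.
blocks: List[List[int]] = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]

# A zone map stores only the (min, max) pair of each block.
zone_map: List[Tuple[int, int]] = [(min(b), max(b)) for b in blocks]


def blocks_to_scan(lo: int, hi: int) -> List[int]:
    """Return the indexes of blocks whose [min, max] range overlaps [lo, hi]."""
    return [i for i, (bmin, bmax) in enumerate(zone_map) if bmax >= lo and bmin <= hi]


print(zone_map)
print(blocks_to_scan(400, 600))
```

A filter such as "value BETWEEN 400 AND 600" only needs to read the middle block here; the other two are skipped without any disk I/O.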
How Amazon Redshift Is Built for Speed
Parallel and Distributed
ID Name
1 John Smith
2 Jane Jones
3 Peter Black
4 Pat Partridge
5 Sarah Cyan
6 Brian Snail
1 John Smith
4 Pat Partridge
2 Jane Jones
5 Sarah Cyan
3 Peter Black
6 Brian Snail
How Amazon Redshift Is Built for Speed
Distribution Keys
Amazon Redshift Security
Petabyte-Scale Data Warehousing Service
Amazon Redshift Table and Schema Design
Amazon Redshift Supports Existing Data Models
Star schema, Snowflake schema
Choosing Appropriate Data Types
Redshift performance is about efficient I/O
Don’t make columns wider than necessary, e.g.:
• Avoid BIGINT for country identifier
• Avoid CHAR(MAX) for country names
• Oversizing VARCHAR columns impacts loading and runtime performance
Use appropriate types
• Use TIMESTAMP or DATE instead of CHAR
• Use CHAR instead of VARCHAR when appropriate
Multibyte Characters
• The VARCHAR data type supports UTF-8 multibyte characters of up to four bytes each
• The CHAR data type does not support multibyte characters
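Because VARCHAR lengths in Redshift are counted in bytes rather than characters, multibyte text needs more room than its character count suggests. The sketch below, with hypothetical helper names utf8_bytes and fits_varchar, shows one way to check sample values before choosing a column width.

```python
def utf8_bytes(text: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_varchar(text: str, declared_bytes: int) -> bool:
    """True if the value fits a VARCHAR column declared with this byte length."""
    return utf8_bytes(text) <= declared_bytes


# Illustrative country names: character count versus UTF-8 byte count.
for name in ["KOREA", "대한민국", "Côte d'Ivoire"]:
    print(name, len(name), utf8_bytes(name), fits_varchar(name, 10))
```

Running a check like this over a sample of the source data gives a width that is neither too narrow for multibyte values nor needlessly oversized.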
Architecture and Schema Design Considerations
• Redshift is a distributed system:
– A compute node contains slices (one per core)
– A slice contains data
• Queries run on all slices in parallel: optimal query throughput can be achieved when data is evenly spread across slices
Table Distribution Styles
[Figure: three panels, each showing Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4)]
Distribution Key: same key to same location
All: all data on every node
Even: round robin
Choosing a Distribution Style
Choose a distribution style of KEY for
• Large data tables, like a FACT table in a star schema
• Large or rapidly changing tables used in joins or aggregations
• Improved performance even if the key is not used as a join column
Choose a distribution style of ALL for tables that
• Have slowly changing data
• Are of reasonable size (i.e., a few million rows, not hundreds of millions)
• Have no common distribution key for frequent joins
• Typical use case: a joined dimension table without a common distribution key
Choose a distribution style of EVEN for tables that are not joined and have no aggregate queries
Distributing Data with Distribution Keys
[Figure: Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4)]
Example records:
uri = /games/g1.exe
uri = /imgs/ad1.png
order_line_id = 25693
uri = /img/ad_5.img
Distribution Keys determine which data resides on which slices
[Figure: the example records placed on the slices of Node 1 and Node 2]
Records with the same distribution key for a table are on the same slice
Distribution Keys help with data locality for join evaluation
[Figure: the example records placed on the slices of Node 1 and Node 2]
Records with the same distribution key for a table are on the same slice
Records from other tables with the same distribution key value are also on the same slice
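The data-locality point can be sketched in a few lines: when two tables are distributed on the same column, every slice can join its own rows without fetching anything from another slice. This is an illustration only. SHA-256 stands in for Redshift's internal hash function, and the two tables (requests, assets) and their columns are hypothetical.

```python
import hashlib

NUM_SLICES = 4  # two nodes with two slices each, as in the diagrams


def slice_for(key: str) -> int:
    """Map a distribution key value to a slice (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SLICES


def distribute(rows):
    """Place each row on the slice chosen by its first column, the distribution key."""
    placed = {s: [] for s in range(NUM_SLICES)}
    for row in rows:
        placed[slice_for(row[0])].append(row)
    return placed


# Hypothetical tables, both distributed on uri.
requests = [("/games/g1.exe", 200), ("/imgs/ad1.png", 404), ("/img/ad_5.img", 200)]
assets = [("/games/g1.exe", "game"), ("/imgs/ad1.png", "ad"), ("/img/ad_5.img", "ad")]

request_slices = distribute(requests)
asset_slices = distribute(assets)

# Each slice joins only the rows it already holds.
local_join = []
for s in range(NUM_SLICES):
    kinds = dict(asset_slices[s])
    for uri, status in request_slices[s]:
        if uri in kinds:
            local_join.append((uri, status, kinds[uri]))

print(sorted(local_join))
```

The per-slice join finds every match because equal key values always hash to the same slice, which is the property a co-located join relies on.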
Poor key choices lead to uneven distribution of records
[Figure: the uri records hashed onto Node 1 and Node 2; the four slices hold 2M, 5M, 1M, and 4M records]
Unevenly distributed data causes processing imbalances!
[Figure: each of the four slices holds 2M records]
Evenly distributed data improves query performance
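The effect of key choice on balance can be simulated roughly as follows. This is a sketch under stated assumptions: SHA-256 stands in for Redshift's hash function, the row counts are made up, and skew_ratio is a hypothetical metric (largest slice divided by the average slice).

```python
import hashlib
from collections import Counter

NUM_SLICES = 4


def slice_for(value) -> int:
    """Map a key value to a slice (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SLICES


def rows_per_slice(keys):
    """Count how many rows land on each slice."""
    counts = Counter(slice_for(k) for k in keys)
    return [counts.get(s, 0) for s in range(NUM_SLICES)]


def skew_ratio(counts):
    """Largest slice divided by the average slice; 1.0 means perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


# Hypothetical data: 8,000 rows keyed two different ways.
low_cardinality = (["/imgs/ad1.png"] * 5000
                   + ["/games/g1.exe"] * 2000
                   + ["/img/ad_5.img"] * 1000)
high_cardinality = list(range(8000))  # e.g. a unique order_line_id per row

for label, keys in [("uri", low_cardinality), ("order_line_id", high_cardinality)]:
    counts = rows_per_slice(keys)
    print(label, counts, round(skew_ratio(counts), 2))
```

With only three distinct uri values, at least one slice stays empty and the busiest slice dictates how long the whole query takes; the high-cardinality key spreads rows almost evenly.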
Choosing a Distribution Key
• Distribute data evenly across nodes
• Minimize data movement: Co-located Joins & Aggregates
Best Practice
• Use the joined columns of the largest commonly joined tables as the key (example: fact table and large dimension table)
• Consider using a GROUP BY column as the key
• Never use a distribution key that causes severe data skew
• Choose a key with high cardinality, i.e., a large number of discrete values
• Avoid using a column that serves as an equality filter as your distribution key (it concentrates processing on one node)
Sorting Data
The sort key helps Redshift minimize I/O
• For example, a table sorted on timestamp and queried on a date range will skip all blocks not in the query range
In the slices (on disk), the data is sorted by the sort key
• If no sort key exists, Redshift uses the data insertion order
Choose a sort key that is frequently used in your queries
• Primarily as a query predicate (date, identifier, …)
• Optionally choose a column frequently used for aggregates
• Optionally choose the same column as the distribution key for the most efficient joins (merge join)
Don’t use too many columns per table as sort keys
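The merge join mentioned above benefits from sorted input because both tables can be walked once, in order, with no hash table or re-sorting. The following is a generic textbook sketch of a merge join over pre-sorted lists, not Redshift's executor; the function name and sample rows are illustrative.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right row that shares this key, then advance the left side.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


# Both inputs are sorted on the join key, as they would be with a shared sort key.
sales = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
products = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
print(merge_join(sales, products))
```

If either input were unsorted, the pointers could not advance monotonically, which is why a sort key on the join column is a precondition for this join strategy.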
Example: Distribution and Sort Keys
-- Total products sold in Washington in January 2015
SELECT SUM( S.Price * S.Quantity )
FROM SALES S  -- table name assumed
JOIN CATEGORY C ON C.ProductId = S.ProductId
JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
WHERE C.CategoryId = 'Produce' AND F.State = 'WA'
AND S.Date BETWEEN '1/1/2015' AND '1/31/2015';
Dist key (S) = ProductID
Dist key (C) = ProductID
Dist key (F) = FranchiseID
Sort key (S) = Date
Optimizing the Database for Queries
Make sure your columns are compressed appropriately
Co-locate frequently joined tables using distribution keys or the ALL distribution style to avoid data transfers between nodes
For joined tables, consider using sort keys on the joined columns, allowing fast merge joins
Compression allows you to de-normalize without penalizing storage, simplifying queries and limiting joins
Vacuum and Analyze regularly
Amazon Redshift Data Loading
Petabyte-Scale Data Warehousing Service
Data Loading Process
Data Source → Extraction → Transformation → Loading
Amazon Redshift
Amazon Redshift Data Loading Overview
AWS Cloud
Corporate Data center
Amazon S3
Amazon Elastic
Amazon Redshift
logs / files
Source DBs
AWS Direct Connect
S3 Multipart Upload
AWS Import/Export
EC2 or On-Prem (using SSH)
Database Migration
AWS Lambda
AWS Data Pipeline
Uploading Files to Amazon S3
Amazon Redshift
Corporate Data center
• Ensure that your data resides in the same Region as your Redshift clusters
• Split the data into multiple files to facilitate parallel processing
• Optionally, you can encrypt your data using Amazon S3 Server-Side or Client-Side Encryption
• Files should be individually compressed using GZIP or LZOP
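Splitting and compressing can be done locally before upload. The sketch below writes rows round-robin into several individually gzip-compressed parts using only the standard library. The function name, the part naming scheme, and the sample rows are assumptions for illustration; uploading the parts to Amazon S3 is a separate step that is not shown.

```python
import gzip
import os
import tempfile


def split_and_compress(lines, num_parts, out_dir, prefix="customer.txt"):
    """Write lines round-robin into num_parts gzip files and return their paths."""
    paths = [os.path.join(out_dir, f"{prefix}.{n}.gz") for n in range(1, num_parts + 1)]
    handles = [gzip.open(p, "wt", encoding="utf-8") for p in paths]
    try:
        for i, line in enumerate(lines):
            handles[i % num_parts].write(line.rstrip("\n") + "\n")
    finally:
        for h in handles:
            h.close()
    return paths


# Hypothetical pipe-delimited rows.
rows = [f"{i}|Customer#{i:09d}" for i in range(1, 11)]
out_dir = tempfile.mkdtemp()
parts = split_and_compress(rows, 4, out_dir)
print([os.path.basename(p) for p in parts])
```

Giving all parts a common prefix lets a single COPY command pick them up together, and keeping them roughly equal in size keeps the slices equally busy.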
Loading Data from Amazon S3
Preparing Input Data Files
Uploading files to Amazon S3
Using COPY to load data from Amazon S3
Input Data Using Delimiters
1|Customer#000000001|j5JsirBM9P|MOROCCO 0|MOROCCO|AFRICA|25-989-741-2988|BUILDING
2|Customer#000000002|487LW1dovn6Q4dMVym|JORDAN 1|JORDAN|MIDDLE EAST|23-768-687-3665|AUTOMOBILE
4|Customer#000000004|4u58h f|EGYPT 4|EGYPT|MIDDLE EAST|14-128-190-5944|MACHINERY
Example of pipe (‘|’) delimited file
CREATE TABLE customer (
  c_custkey    integer     not null,
  c_name       varchar(25) not null,
  c_address    varchar(25) not null,
  c_city       varchar(10) not null,
  c_nation     varchar(15) not null,
  c_region     varchar(12) not null,
  c_phone      varchar(15) not null,
  c_mktsegment varchar(10) not null
);
COPY customer FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
DELIMITER '|';
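Before loading, it can help to check locally that each delimited line has the right number of fields and that every value fits its declared column width. The sketch below mirrors the customer table definition; the COLUMNS list and parse_line helper are hypothetical and are not part of COPY itself.

```python
# (column name, maximum byte length); None marks the integer column.
COLUMNS = [
    ("c_custkey", None), ("c_name", 25), ("c_address", 25), ("c_city", 10),
    ("c_nation", 15), ("c_region", 12), ("c_phone", 15), ("c_mktsegment", 10),
]


def parse_line(line: str, delimiter: str = "|") -> dict:
    """Split one delimited line and validate it against the column definitions."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = {}
    for (name, max_bytes), value in zip(COLUMNS, fields):
        if max_bytes is None:
            row[name] = int(value)
        elif len(value.encode("utf-8")) > max_bytes:
            raise ValueError(f"{name} exceeds {max_bytes} bytes: {value!r}")
        else:
            row[name] = value
    return row


sample = "1|Customer#000000001|j5JsirBM9P|MOROCCO 0|MOROCCO|AFRICA|25-989-741-2988|BUILDING"
print(parse_line(sample))
```

Catching an oversized or missing field at this stage is cheaper than diagnosing a failed load afterwards.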
Input Data Using Fixed-Width Columns
1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING
2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE
4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINERY
CREATE TABLE customer (
  c_custkey    integer     not null,
  c_name       varchar(25) not null,
  c_address    varchar(25) not null,
  c_city       varchar(10) not null,
  c_nation     varchar(15) not null,
  c_region     varchar(12) not null,
  c_phone      varchar(15) not null,
  c_mktsegment varchar(10) not null
);
COPY customer FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
FIXEDWIDTH '0:3,1:25,2:25,3:10,4:15,5:12,6:15,7:10';
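The fixed-width specification is a list of label:width pairs, and each record is cut at the cumulative widths. The sketch below shows that slicing in plain Python. The helper names are hypothetical, and the padded record is rebuilt from the first sample row because the slide's original column padding was lost.

```python
def parse_fixedwidth_spec(spec: str):
    """Turn 'label:width,label:width' into a list of (label, width) pairs."""
    pairs = []
    for item in spec.split(","):
        label, width = item.strip().split(":")
        pairs.append((label.strip(), int(width)))
    return pairs


def split_fixed_width(line: str, spec: str):
    """Cut a record at the cumulative column widths and trim the padding."""
    values = []
    pos = 0
    for _label, width in parse_fixedwidth_spec(spec):
        values.append(line[pos:pos + width].rstrip())
        pos += width
    return values


SPEC = "0:3,1:25,2:25,3:10,4:15,5:12,6:15,7:10"

# Rebuild a padded record from the first sample row (padding is assumed).
fields = ["1", "RFK", "900 Columbus", "MOROCCO", "MOROCCO", "AFRICA",
          "25-989-741-2988", "BUILDING"]
widths = [w for _, w in parse_fixedwidth_spec(SPEC)]
record = "".join(f.ljust(w) for f, w in zip(fields, widths))

print(len(record))
print(split_fixed_width(record, SPEC))
```

The widths add up to 115 characters per record, so any line of a different length is a sign that the file and the specification disagree.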
Input Data in JSON Format
COPY uses a jsonpaths text file to parse JSON data
JSONPath expressions specify the path to JSON name elements
Each JSONPath expression corresponds to a column in the Amazon Redshift target table
Suppose you want to load the VENUE table with the following content
{ "id": 15, "name": "Gillette Stadium", "location": [ "Foxborough", "MA" ], "seats": 68756 }
{ "id": 15, "name": "McAfee Coliseum", "location": [ "Oakland", "CA" ], "seats": 63026 }
You would use the following jsonpaths file to parse the JSON data.
{ "jsonpaths": [ "$['id']", "$['name']", "$['location'][0]", "$['location'][1]", "$['seats']" ] }
Splitting Data Files
[Figure: 2 XL Compute Nodes (Node 0 and Node 1), each with Slice 0 and Slice 1]
COPY customer FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
DELIMITER '|';
Use the COPY command
Each slice can load one file at a time
A single input file means only one slice is ingesting data
Instead of 100MB/s, you’re only getting
Use Multiple Input Files to Maximize Throughput
Use the COPY command
You need at least as many input files as you have slices
With 16 input files, all slices are working so you maximize throughput
Get 100MB/s per node; scale linearly as you add nodes
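The relationship between file count and slice count is simple arithmetic, sketched below. It assumes, as stated above, that each slice loads one file at a time and that the files are of similar size. The function name load_plan and the 16-slice cluster are illustrative.

```python
import math


def load_plan(num_files: int, num_slices: int) -> dict:
    """Estimate slice utilization when each slice loads one file at a time."""
    busy = min(num_files, num_slices)
    return {
        "busy_slices": busy,
        "idle_slices": num_slices - busy,
        # Number of passes needed if every busy slice takes one file per pass.
        "rounds": math.ceil(num_files / num_slices),
    }


for files in (1, 16, 20, 32):
    print(files, load_plan(files, 16))
```

A file count that is a multiple of the slice count avoids a final pass in which most slices sit idle, as happens with 20 files on 16 slices.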
Loading Data Using Manifest Files
Use a manifest to load all required files
Supply a JSON-formatted text file that lists the files to be loaded
Can load files from different buckets or with different prefixes
{
  "entries": [
    {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
  ]
}
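A manifest like the one above is easy to generate rather than write by hand. The sketch below builds it from a list of S3 URLs with the standard json module; build_manifest is a hypothetical helper, and writing the result to S3 is not shown.

```python
import json


def build_manifest(urls, mandatory=True):
    """Build the manifest structure: one entry per file to be loaded."""
    return {"entries": [{"url": u, "mandatory": mandatory} for u in urls]}


urls = [
    "s3://mybucket-alpha/2013-10-04-custdata",
    "s3://mybucket-alpha/2013-10-05-custdata",
    "s3://mybucket-beta/2013-10-04-custdata",
    "s3://mybucket-beta/2013-10-05-custdata",
]

manifest_text = json.dumps(build_manifest(urls), indent=2)
print(manifest_text)
```

Marking an entry as mandatory makes the load fail if that file is missing, which is usually preferable to silently loading a partial data set.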
AWS Database Migration Service (AWS DMS)
Supports both homogeneous and heterogeneous data replication.
Supported database sources include:
(1) Oracle, (2) SQL Server, (3) MySQL, (4) Amazon Aurora, (5) PostgreSQL, and
(6) ODBC. All sources are supported on-premises, in EC2, and RDS.
Supported database targets include:
(1) Amazon Aurora, (2) Oracle, (3) SQL Server, (4) MySQL, (5) PostgreSQL, and
(6) Amazon Redshift. All Oracle, SQL Server, MySQL and Postgres targets are
supported on-premises, in EC2 and RDS.
AWS DMS – Online Migration
Keep your apps running during the migration
[Figure: Application Users, Database Migration Service]
• Start a replication instance
• Connect to source and target databases
• Select tables, schemas, or databases
• Let AWS Database Migration Service create tables, load data, and keep them in sync
• Switch applications over to the target at your convenience

