Amazon Redshift is a fast, petabyte-scale, fully managed data warehouse that makes it simple and cost-effective to analyze all of your data with your existing business intelligence tools. This session examines the distributed processing architecture that gives Redshift its massively parallel processing capability, and walks through hands-on best practices for integrating and loading data from a variety of data sources and formats.
Speaker: 김상필, Solutions Architect, Amazon Web Services
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...Amazon Web Services
Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), application programming interfaces (API), clickstreams, unstructured and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. Building scalable big data pipelines with automated extract-transform-load (ETL) and machine learning processes can address these limitations. JustGiving is the world’s largest social platform for online giving. In this session, we describe how we created several scalable and loosely coupled event-driven ETL and ML pipelines as part of our in-house data science platform called RAVEN. You learn how to leverage AWS Lambda, Amazon S3, Amazon EMR, Amazon Kinesis, and other services to build serverless, event-driven, data and stream processing pipelines in your organization. We review common design patterns, lessons learned, and best practices, with a focus on serverless big data architectures with AWS Lambda.
This document discusses database migration and replication using AWS services. It provides an overview of AWS Database Migration Service (DMS) and Schema Conversion Tool (SCT) and how they can be used to migrate databases to AWS, replicate data across different databases, and modernize database platforms. Specific migration use cases are presented, along with a demonstration and discussion of pricing and resources available to customers.
Aurora is Amazon's relational database service that is compatible with MySQL and PostgreSQL. It is optimized for fast performance, high availability, and automatic scaling. Aurora provides up to 5x better performance than MySQL and 3x better than PostgreSQL for the same cost. It automatically scales up to 64TB in storage and can support tens of thousands of low-latency transactions. Aurora replicates data across three Availability Zones for high availability and durability with storage replicated six ways. It is easy to use with simple pricing and management capabilities.
This document provides an overview of Amazon Redshift presented by Pavan Pothukuchi and Chris Liu. The agenda includes an introduction to Redshift, its benefits, use cases, and Coursera's experience using Redshift. Some key benefits highlighted are that Redshift is fast, inexpensive, fully managed, secure, and innovates quickly. Example use cases from NTT Docomo and Nasdaq are discussed. Chris Liu then discusses Coursera's experience moving from no data warehouse to using Redshift over three years, including their current ecosystem involving Redshift, other AWS services, and business intelligence applications. Lessons learned around thinking in Redshift, communicating with users, surprises, and reflections are also shared.
Accelerate your Business with SAP on AWS - AWS Summit Cape Town 2017 Amazon Web Services
Michael Needham, a senior manager at AWS, presented on accelerating businesses with SAP on AWS. He discussed how AWS provides scalable, cost-effective infrastructure for SAP workloads. Rob Enslin, president of SAP Global Operations, praised how AWS provisioned a 14 TB HANA system in an "unbelievable" time, helping deliver simplicity. AWS offers a variety of compute instances certified for SAP and manages all infrastructure, allowing customers to focus on innovation.
Amazon Web Services (AWS) offers a wide range of database options to fit your application requirements. From database services that are fully managed and that can be launched in minutes with just a few clicks to self-managed databases running on EC2. AWS managed database services include Amazon Relational Database Service (Amazon RDS), with support for six commonly used database engines, Amazon Aurora, a MySQL and PostgreSQL-compatible relational database, Amazon DynamoDB, a NoSQL database service or Amazon Redshift, a petabyte-scale data warehouse service. AWS also provides the AWS Database Migration Service, a service which makes it easy and inexpensive to migrate your databases to AWS cloud.
In this webinar, we take a closer look at the AWS database offerings and learn how to quickly select, set up, operate, and scale your database in the cloud.
Learning Objectives:
• Gain insights into the AWS database offering and know which to select for your workload.
• Learn how the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS) can facilitate and simplify migrating your business critical applications to Amazon Web Services.
• Learn how Amazon DynamoDB Accelerator (DAX) can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second.
• Hear from our partners like Version1 and Clckwrk who can help you in your journey towards Database freedom.
Database Migration – Simple, Cross-Engine and Cross-Platform MigrationAmazon Web Services
Learn about the new AWS Database Migration Service, which helps you migrate databases with minimal downtime from on-premises and Amazon EC2 environments to Amazon RDS, Amazon Redshift, Amazon Aurora and EC2 databases.
This document summarizes Amazon Web Services database migration and replication services. It discusses how the AWS Database Migration Service can migrate databases between on-premises and cloud environments within 10 minutes with no application downtime. It also describes how the AWS Schema Conversion Tool can help migrate databases off Oracle and SQL Server to other database engines like MySQL. Finally, it provides an overview of Amazon RDS managed database services and high availability features.
Migrating Your Databases to AWS: Deep Dive on Amazon RDS and AWS Database Mig...Amazon Web Services
Amazon RDS allows you to launch an optimally configured, secure and highly available database with just a few clicks. It provides cost-efficient and resizable capacity, automates time-consuming database administration tasks, and provides you with six familiar database engines to choose from: Amazon Aurora, Oracle, Microsoft SQL Server, PostgreSQL, MySQL and MariaDB. In this session, we will take a close look at the capabilities of Amazon RDS and explain how it works. We’ll also discuss the AWS Database Migration Service and AWS Schema Conversion Tool, which help you migrate databases and data warehouses with minimal downtime from on-premises and cloud environments to Amazon RDS and other Amazon services. Gain your freedom from expensive, proprietary databases while providing your applications with the fast performance, scalability, high availability, and compatibility they need.
AWS Speaker: Andrew Kane, Solutions Architect - Amazon Web Services
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...Amazon Web Services
In this session, we provide a peek behind the scenes to learn about Amazon ElastiCache's design and architecture. See common design patterns with our Redis and Memcached offerings and how customers have used them for in-memory operations to reduce latency and improve application throughput. During this session, we review ElastiCache best practices, design patterns, and anti-patterns.
AWS re:Invent 2016: Deep Dive on Amazon Relational Database Service (DAT305)Amazon Web Services
Amazon RDS allows customers to launch an optimally configured, secure and highly available database with just a few clicks. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business. Amazon RDS provides you with six database engines to choose from, including Amazon Aurora, Oracle, Microsoft SQL Server, PostgreSQL, MySQL and MariaDB. In this session, we take a closer look at the capabilities of RDS and all the different options available. We do a deep dive into how RDS works and the best practices to achieve optimal performance, flexibility, and cost savings for your databases.
This document discusses running databases on AWS. It provides an overview of several AWS database services including Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon ElastiCache. It highlights how these services provide scalability, reliability, automation of tasks like backups and patching, and pay-as-you-go pricing. It also discusses how AWS database services eliminate the need to manage hardware, databases, backups, and other complex tasks when compared to operating databases in an on-premises data center.
Amazon DynamoDB is a fully managed NoSQL database service for applications that need consistent, single-digit millisecond latency at any scale. This talk explores DynamoDB capabilities and benefits in detail and discusses how to get the most out of your DynamoDB database. We go over schema design best practices with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We also explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, Streams, Time-to-Live (TTL), and more.
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...Amazon Web Services
Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon Redshift
GE Power & Water develops advanced technologies to help solve some of the world’s most complex challenges related to water availability and quality. They had amassed billions of rows of data on on-premises databases, but decided to migrate some of their core big data projects to the AWS Cloud. When they decided to transform and store it all in Amazon Redshift, they knew they needed an ETL/ELT tool that could handle this enormous amount of data and safely deliver it to its destination. In this session, Ryan Oates, Enterprise Architect at GE Water, shares his use case, requirements, outcomes and lessons learned. He also shares the details of his solution stack, including Amazon Redshift and Matillion ETL for Amazon Redshift in AWS Marketplace. You learn best practices on Amazon Redshift ETL supporting enterprise analytics and big data requirements, simply and at scale. You learn how to simplify data loading, transformation and orchestration onto Amazon Redshift and how to build out a real data pipeline. Get the insights to deliver your big data project in record time.
AWS re:Invent 2016: How to Launch a 100K-User Corporate Back Office with Micr...Amazon Web Services
Learn how to build a scalable, compliance-ready, and automated deployment of the Microsoft “backoffice” servers for 100K users running on AWS. In this session, we show a reference architecture deployment of Exchange, SharePoint, Skype for Business, SQL Server and Active Directory in a single VPC. We discuss the following: (1) how the solution is automated for 100K users, (2) how the solution is enabled for compliance (e.g., FedRAMP, HIPAA, PCI), and (3) how the solution is built from modular 10K user blocks. Attendees should have knowledge of AWS CloudFormation, PowerShell, instance bootstrapping, VPCs, and Amazon Route 53, as well as the relevant Microsoft technologies.
The document discusses Amazon Relational Database Service (Amazon RDS), which allows users to set up and manage relational databases in the cloud. Some key points:
- Amazon RDS provides a managed database service running MySQL, Oracle, SQL Server, PostgreSQL, MariaDB, or Aurora databases. It handles time-consuming administration tasks.
- Databases can be easily launched and scaled up or down as needed. Additional storage, compute power, and throughput can also be provisioned.
- Automated backups provide point-in-time recovery. Multi-AZ deployments provide high availability and durability. Security features include encryption and IAM access control.
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...Amazon Web Services
Easy scalability is a powerful feature of Amazon Aurora. Scalability in its actual definition refers to being able to get larger or smaller depending on the need. Amazon Aurora allows you to easily achieve this by scaling the database instance up or down and adding or removing read replicas. Scaling across regions brings additional resilience to your architectures and could boost your application performance due to geographic proximity. You can perform all of these scaling operations through the Aurora console. You can also automate instance and read scaling using lambda function or scripts based on the usage pattern you define. You can extend the automation by feeding your database usage data from Aurora enhanced monitoring into Machine Learning to provide more sophisticated predictive patterns to drive your automation. In this session we will do a deep dive into how scalability works in Aurora and how to make the best use of it to reduce your cost, increase application performance and architect resilient applications.
You should have good database knowledge and at least some experience with Amazon RDS or Amazon Aurora and should bring your own laptop.
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premise deployments to Amazon EMR in order to save costs, increase availability, and improve performance. Amazon EMR is a managed service that lets you process and analyze extremely large data sets using the latest versions of over 15 open-source frameworks in the Apache Hadoop and Spark ecosystems. This session will focus on identifying the components and workflows in your current environment and providing the best practices to migrate these workloads to Amazon EMR. We will explain how to move from HDFS to Amazon S3 as a durable storage layer, and how to lower costs with Amazon EC2 Spot instances and Auto Scaling. Additionally, we will go over common security recommendations and tuning tips to accelerate the time to production.
The document provides an overview of database migration options using AWS Database Migration Service (DMS) and AWS Schema Conversion Tool (SCT). It discusses how DMS can be used to migrate databases across different database platforms with minimal downtime. It also outlines how SCT can be used to convert schemas from commercial databases to open-source databases like PostgreSQL. The document shares customer examples and benefits of using DMS and SCT for heterogeneous, scale-up, and split migrations. It also lists available resources for customers on DMS and SCT.
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance and, you’ll hear from a specific customer and their use case to take advantage of fast performance on enormous datasets leveraging economies of scale on the AWS platform.
Speakers:
Ian Meyers, AWS Solutions Architect
Toby Moore, Chief Technology Officer, Space Ape
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftSnapLogic
In this webinar, we discuss how the secret sauce to your business analytics strategy remains rooted in your approach, methodologies, and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, and tips and tricks on designing the right information architecture, data models and other tactical optimizations.
To learn more, visit: http://www.snaplogic.com/redshift-trial
Data processing and analysis is where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing, and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing, and interactive analytics with AWS services, such as, Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Created by: Jason Morris, Solutions Architect
This document provides an overview of Amazon Redshift data warehousing capabilities. It discusses how Redshift is fast, inexpensive, fully managed, secure, and innovates quickly. It describes how to get started with Redshift, provision clusters, model data, load and query data, and monitor performance. It also provides an example of how MakerBot uses Redshift as part of its "Dream Stack" along with other AWS services for analytics.
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftAmazon Web Services
by Darin Briskman, Technical Evangelist, AWS
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process. We’ll learn about AWS Database Migration Service and the AWS Schema Conversion Tool, which were recently enhanced to import data from six common data warehouse platforms. Level: 200
Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all of your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing, scale-out architecture, and columnar direct-attached storage to minimize I/O time and maximize performance. Learn how you can gain deeper business insights and save money and time by migrating to Amazon Redshift. Take away strategies for migrating from on-premises data warehousing solutions, tuning schema and queries, and utilizing third party solutions.
Best Practices for Migrating your Data Warehouse to Amazon Redshift Amazon Web Services
This document provides best practices for migrating a data warehouse to Amazon Redshift. It discusses why companies migrate to Redshift due to its scalability, performance and cost advantages. Example migration stories are provided from companies that achieved significant improvements after migrating large datasets from Oracle, Greenplum and SQL on Hadoop to Redshift. The document also outlines the Redshift cluster architecture, data loading best practices including file splitting and column encoding, schema design considerations and available migration tools.
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process.
Data Warehousing in the Era of Big Data: Intro to Amazon RedshiftAmazon Web Services
An overview of how Amazon Redshift uses columnar technology, massively parallel processing, and other techniques to deliver fast query performance on petabyte-size datasets.
Data warehousing is a critical component for analysing and extracting actionable insights from your data. Amazon Redshift allows you to deploy a scalable data warehouse in a matter of minutes and start analysing your data right away using your existing business intelligence tools.
This is an introduction to Amazon Redshift that covers the essentials you need to deploy your data warehouse in the cloud so that you can achieve faster analytics and save costs.
(1) Amazon Redshift is a fully managed data warehousing service in the cloud that makes it simple and cost-effective to analyze large amounts of data across petabytes of structured and semi-structured data. (2) It provides fast query performance by using massively parallel processing and columnar storage techniques. (3) Customers like NTT Docomo, Nasdaq, and Amazon have been able to analyze petabytes of data faster and at a lower cost using Amazon Redshift compared to their previous on-premises solutions.
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. You'll also hear from Dan Wagner, CEO at Civis Analytics, as he discusses why the Civis data science platform was designed on top of Amazon Redshift and the AWS platform in order to help smart organizations bridge their data silos, build a 360-degree view of their customer relationships, and identify opportunities for driving their companies forward by leveraging enormous datasets, the power of analytics, and economies of scale on the AWS platform.
This document summarizes a presentation on Amazon Redshift. Redshift is a fully managed data warehouse service that makes it easy to analyze large amounts of data for less than $1,000 per terabyte per year. The presentation covers how to get started with Redshift, best practices for table design and data loading, using Redshift for analytics, and upgrading and scaling a Redshift data warehouse over time.
Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all of your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing, scale-out architecture, and columnar direct-attached storage to minimize I/O time and maximize performance. Learn how you can gain deeper business insights and save money and time by migrating to Amazon Redshift. Take away strategies for migrating from on-premises data warehousing solutions, tuning schema and queries, and utilizing third party solutions.
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process. We’ll learn about AWS Database Migration Service and the AWS Schema Conversion Tool, which were recently enhanced to import data from six common data warehouse platforms.
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...Amazon Web Services
In this session, you will learn the key differences between a relational database management service (RDBMS) and non-relational (NoSQL) databases like Amazon DynamoDB. You will learn about suitable and unsuitable use cases for NoSQL databases. You'll learn strategies for migrating from an RDBMS to DynamoDB through a 5-phase, iterative approach. See how Sony migrated an on-premises MySQL database to the cloud with Amazon DynamoDB, and see the results of this migration.
Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)Trivadis
"Modern" data warehouse/data lake architectures are often brimming with layers and services. Such systems can manage and analyze petabytes of data, but that comes at a price (complexity, latency, stability), and not every project ends up happy with this approach.
The talk traces the journey from a technology-infatuated solution to an environment tailored to user needs. It shows the bright and dark sides of massively parallel systems and aims to sharpen awareness for capturing real customer requirements.
Learn how Amazon Redshift, our fully managed data warehouse, can help you quickly and cost-effectively analyze all of your data using your BI tools. The session also gives an introduction to the service, which uses MPP, a scale-out architecture, and columnar storage.
Amazon DynamoDB is a fully managed NoSQL database service for applications that need consistent, single-digit millisecond latency at any scale. This talk explores DynamoDB capabilities and benefits in detail and discusses how to get the most out of your DynamoDB database. We go over schema design best practices with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We also explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, Streams, and more.
Similar to 2017 AWS DB Day | Amazon Redshift 소개 및 실습 (20)
This session explains how to back up and restore databases in the cloud. It introduces a range of data protection methods with AWS Backup, from full backup/restore to PITR (Point in Time Recovery) backups, as well as multi-account and multi-Region protection (demo included). It also looks at how quickly data can be recovered and replicated when the Amazon FSx for NetApp ONTAP storage service is used as the data store for a self-managed DB.
When traffic spikes unexpectedly due to events or new product launches, companies often face database overload, service delays, and outages. Aurora auto scaling struggles to react in real time because of provisioning time, which leads to over-provisioning for traffic spikes. To address this, this session introduces a mixed-configuration Amazon Aurora cluster architecture that combines a provisioned Amazon Aurora cluster with Aurora Serverless v2 (ASV2) instances, along with a custom auto scaling solution based on high-resolution metrics.
This session introduces Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second and manage petabytes of data, scaling relational database workloads in Aurora beyond the limits of a single Aurora writer instance without creating custom application logic or managing multiple databases.
Standard support for Amazon Aurora MySQL-compatible edition version 2 (MySQL 5.7 compatibility) ends on October 31, 2024. If you are therefore considering a major version upgrade of Aurora MySQL, Amazon Blue/Green Deployments is an ideal solution for performing the upgrade without impacting your production environment. In this session, you practice a major version upgrade of Aurora MySQL using Blue/Green Deployments.
Amazon DocumentDB (with MongoDB compatibility) is a fast, reliable, fully managed database service. With Amazon DocumentDB, you can easily set up, operate, and scale MongoDB-compatible databases in the cloud. In this hands-on session, you run the same application code and use the same drivers and tools that you use with MongoDB.
Learning Database Migration Service through customer cases: a tool for database and data migration, consolidation, separation, and analytics - Speaker: ...Amazon Web Services Korea
Database Migration Service (DMS) supports migrating many kinds of databases beyond RDBMSs. Using real customer cases, this session looks at how DMS is used for database migration, consolidation, and separation, and also at the role it plays in data ingest for analytics.
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...Amazon Web Services Korea
Amazon ElastiCache is a fully managed service compatible with Redis and Memcached that improves the performance of modern applications in real time at optimal cost. This session covers ElastiCache best practices for achieving the best performance and optimizing your service.
Internal Architecture of Amazon Aurora (Level 400) - Speaker: 정달영, APAC RDS Speci...Amazon Web Services Korea
Amazon Aurora is a relational database built for the cloud. Aurora combines the performance and availability of commercial databases with the simplicity and cost-effectiveness of open-source databases. This session is aimed at advanced Aurora users and covers Aurora's internal architecture and performance optimization.
[Keynote] Choosing AWS databases wisely - Speaker: 강민석, Korea Database SA Manager, WWSO, A...Amazon Web Services Korea
For a long time, relational databases were the most widely used and appeared in almost every application. That made choosing a database for an application architecture easier, but it limited the types of applications you could build. A relational database is like a Swiss Army knife: it can do many things, but it is not perfectly suited to any particular task. With the advent of cloud computing, it became possible to build more elastic and scalable applications economically, and what was technically feasible changed. This shift led to the rise of purpose-built databases. Developers no longer have to default to a relational database; they can carefully consider their application's requirements and choose a database that fits them.
Demystify Streaming on AWS - Speaker: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...Amazon Web Services Korea
Real-time analytics is a growing use case among AWS customers. Join this session to learn how streaming data technologies let you analyze data immediately, move data between systems in real time, and get actionable insights faster. It covers common streaming data use cases, the steps to easily enable real-time analytics in your business, and how AWS helps you use AWS streaming data services such as Amazon Kinesis.
Amazon EMR - Enhancements on Cost/Performance, Serverless - Speaker: 김기영, Sr Anal...Amazon Web Services Korea
Amazon EMR provides a managed service that makes it easy to run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtimes for Spark and Presto include optimizations that deliver more than twice the performance of open-source Apache Spark and Presto. Amazon EMR Serverless is a new deployment option for Amazon EMR that lets data engineers and analysts run petabyte-scale data analytics in the cloud easily and cost-effectively. Join this session to explore Amazon EMR and EMR Serverless through concepts, design patterns, and live demos, and see how easy it is to run Spark and Hive workloads and to use Amazon EMR integrations with Amazon EMR Studio and Amazon SageMaker Studio.
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...Amazon Web Services Korea
Learn more about the new features and capabilities of Amazon OpenSearch, including easily ingesting log and metric data, using the OpenSearch search APIs, and building visualizations with OpenSearch Dashboards. Learn about OpenSearch's observability features for debugging application issues, and how Amazon OpenSearch Service lets you focus on your search or monitoring problems instead of worrying about infrastructure management.
Enabling Agility with Data Governance - Speaker: 김성연, Analytics Specialist, WWSO,...Amazon Web Services Korea
Data governance is the process of managing data throughout its lifecycle to ensure its accuracy and completeness and to make sure the people who need it can access it. Join this session to learn how AWS provides comprehensive data governance across its analytics services, from data preparation and integration to data access, data quality, and metadata management. Learn more about streaming on AWS.
4. Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; starts at $0.25/hour
Amazon Redshift
a lot faster
a lot simpler
a lot cheaper
5. Amazon Redshift architecture
Leader Node
Simple SQL endpoint
Stores metadata
Optimizes query plan
Coordinates query execution
Compute Nodes
Local columnar storage
Parallel/distributed execution of all queries, loads, backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS1/DS2: HDD; scale from 2 TB to 2 PB
Ingestion/Backup
Backup
Restore
JDBC/ODBC
10 GigE
(HPC)
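To see how this architecture looks on a running cluster, a minimal sketch (run from any SQL client connected to the leader node's JDBC/ODBC endpoint) is to query the STV_SLICES system view, which maps each slice to its compute node:

-- List the compute node and slice layout of the cluster
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;

The number of rows returned is the total slice count, which becomes relevant later when sizing the number of input files for parallel loads.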
7. What makes Amazon Redshift fast
Parallel and Distributed
Query
Load
Export
Backup
Restore
Resize
8. What makes Amazon Redshift fast: Distribution Keys
[Figure: a table of (ID, Name) rows — 1 John Smith, 2 Jane Jones, 3 Peter Black, 4 Pat Partridge, 5 Sarah Cyan, 6 Brian Snail — is spread across slices, e.g. rows 1 and 4 on one slice, rows 2 and 5 on another, rows 3 and 6 on another]
11. Choosing appropriate data types
Redshift performance is about efficient I/O
Don’t make columns wider than necessary, e.g.:
• Avoid BIGINT for country identifier
• Avoid CHAR(MAX) for country names
• Oversizing VARCHAR impacts loading and runtime performance
Use appropriate types
• Use TIMESTAMP or DATE instead of CHAR
• Use CHAR instead of VARCHAR when appropriate
Multibyte Characters
• VARCHAR data type supports UTF-8 multibyte characters up to a maximum of four bytes
• The CHAR data type does not support multibyte characters
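As a concrete illustration, a hedged DDL sketch (hypothetical table and column names) that applies these guidelines:

CREATE TABLE web_events (
  country_id    SMALLINT     NOT NULL,  -- small integer instead of BIGINT
  country_name  VARCHAR(64)  NOT NULL,  -- right-sized instead of an oversized VARCHAR
  event_code    CHAR(4)      NOT NULL,  -- fixed-length code: CHAR fits better than VARCHAR
  event_time    TIMESTAMP    NOT NULL   -- native TIMESTAMP instead of CHAR
);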
12. Architecture and schema design considerations
• Redshift is a distributed system:
– A compute node contains slices (one per core)
– A slice contains data
• Queries run on all slices in parallel: optimal query throughput can be achieved when data is evenly spread across slices
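One way to check that a loaded table is in fact spread evenly is to count its rows per slice. A sketch, assuming a table named customer, using the STV_TBL_PERM system view:

-- Rows of the customer table stored on each slice
SELECT slice, SUM(rows) AS rows_on_slice
FROM stv_tbl_perm
WHERE name = 'customer'
GROUP BY slice
ORDER BY slice;

Large differences between slices indicate skew and uneven work during queries.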
13. Table distribution styles
Distribution Key: same key to same location (rows with the same key value go to the same slice)
All: all data on every node
Even: round-robin distribution across slices
[Figure: Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4) shown for each distribution style]
14. Choosing a distribution style
Choose a distribution style of KEY for
• Large data tables, like a FACT table in a star schema
• Large or rapidly changing tables used in joins or aggregations
• Improved performance even if the key is not used in the join column
Choose a distribution style of ALL for tables that
• Have slowly changing data
• Are of reasonable size (i.e., a few million rows, but not hundreds of millions)
• Have no common distribution key for frequent joins
• Typical use case – a joined dimension table without a common distribution key
Choose a distribution style of EVEN for tables that are not joined and have no aggregate queries
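A hedged DDL sketch (hypothetical star-schema tables) showing how each style is declared:

-- Large fact table: KEY distribution on the most common join column
CREATE TABLE sales (
  product_id   INTEGER NOT NULL,
  franchise_id INTEGER NOT NULL,
  quantity     INTEGER,
  price        DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (product_id);

-- Small, slowly changing dimension table: ALL distribution (copied to every node)
CREATE TABLE category (
  product_id  INTEGER NOT NULL,
  category_id VARCHAR(20)
)
DISTSTYLE ALL;

-- Staging table that is never joined: EVEN (round-robin) distribution
CREATE TABLE raw_events (
  event_id  BIGINT,
  payload   VARCHAR(512)
)
DISTSTYLE EVEN;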
16. Data distribution using Distribution Keys
Distribution Keys determine which data resides on which slices: records with the same distribution key for a table are on the same slice.
[Figure: Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4) holding user_profile, cloudfront, and order_line records, with the cloudfront records for user_id=1234 placed on the same slice]
17. Data distribution using Distribution Keys
Distribution Keys help with data locality for join evaluation: records from other tables with the same distribution key value are also on the same slice, alongside the records of the first table that share that key.
[Figure: the same cluster layout, with user_profile and cloudfront records for user_id=1234 and user_id=2345 co-located on the same slices]
18. Data distribution using Distribution Keys
Poor key choices lead to uneven distribution of records…
[Figure: four slices holding 2M, 5M, 1M, and 4M cloudfront records respectively]
19. Data distribution using Distribution Keys
Unevenly distributed data causes processing imbalances!
[Figure: the same four slices holding 2M, 5M, 1M, and 4M records]
20. Data distribution using Distribution Keys
Evenly distributed data improves query performance.
[Figure: the same four slices, each now holding 2M records]
21. Choosing a distribution key
Goal
• Distribute data evenly across nodes
• Minimize data movement: co-located joins & aggregates
Best Practice
• Use the joined columns of the largest commonly joined tables as the key (example: fact table and large dimension table)
• Consider using a GROUP BY column as a key (GROUP BY clause)
• Never use a distribution key that causes severe data skew
• Choose a key with high cardinality; a large number of discrete values
Avoid
• Keys used as an equality filter as your distribution key (concentrates processing on one node)
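After loading, a quick way to verify that the chosen key does not cause severe skew is the SVV_TABLE_INFO system view; a minimal sketch:

-- skew_rows = ratio of rows on the fullest slice to rows on the emptiest slice (closer to 1.0 is better)
SELECT "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;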
22. Sorting data
The sort key helps Redshift minimize I/O
• For example, a table sorted on timestamp and queried on a date range will skip all blocks not in the query range
In the slices (on disk), the data is sorted by a sort key
• If no sort key exists, Redshift uses the data insertion order
Choose a sort key that is frequently used in your queries
• Primarily as a query predicate (date, identifier, …)
• Optionally choose a column frequently used for aggregates
• Optionally choose the same column as the distribution key for the most efficient joins (merge join)
Don’t use too many columns per table as sort keys
23. Example – Distribution and Sort Keys
-- Total products sold in Washington in January 2015
SELECT SUM( S.Price * S.Quantity )
FROM SALES S
JOIN CATEGORY C ON C.ProductId = S.ProductId
JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
WHERE C.CategoryId = 'Produce' AND F.State = 'WA'
AND S.Date BETWEEN '1/1/2015' AND '1/31/2015'
Sort key (S) = Date
Dist key (S) = ProductID
Dist key (C) = ProductID
Dist key (F) = FranchiseID
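A hedged DDL sketch for the SALES table in this example (column types are assumptions), declaring the keys listed above:

CREATE TABLE sales (
  productid   INTEGER NOT NULL,
  franchiseid INTEGER NOT NULL,
  price       DECIMAL(10,2),
  quantity    INTEGER,
  date        DATE
)
DISTKEY (productid)   -- co-locates SALES rows with CATEGORY rows for the join
SORTKEY (date);       -- lets the date-range predicate skip blocks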
24. Optimizing the database for queries
Make sure your columns are compressed appropriately
Co-locate frequently joined tables using distribution keys or distribution ALL to avoid data transfers between nodes
For joined tables, consider using sort keys on the joined columns, allowing fast merge joins
Compression allows you to de-normalize without penalizing storage, simplifying queries and limiting joins
Vacuum and Analyze regularly
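A minimal sketch of the corresponding maintenance commands, assuming a table named sales:

-- Suggest column encodings based on a sample of the data
ANALYZE COMPRESSION sales;

-- Reclaim deleted space and restore sort order, then refresh planner statistics
VACUUM sales;
ANALYZE sales;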
26. The data loading process
[Figure: Data Source → Extraction → Transformation → Loading → Amazon Redshift (Target); the focus of this section is the loading step]
27. Overview of loading data into Amazon Redshift
[Figure: data flows into Amazon Redshift from the AWS Cloud and from a corporate data center]
AWS sources: Amazon DynamoDB, Amazon S3, Amazon Elastic MapReduce, Amazon RDS, Amazon Glacier, Kinesis, AWS Lambda, AWS Data Pipeline, Database Migration Service, and EC2 or on-premises hosts (using SSH)
Corporate data center sources: logs / files, source DBs
Transfer paths: VPN connection, AWS Direct Connect, S3 multipart upload, AWS Import/Export
28. Uploading files to Amazon S3
[Figure: Client.txt in the corporate data center is split into Client.txt.1, Client.txt.2, Client.txt.3, and Client.txt.4 and uploaded to the mydata bucket in the same Region as the Amazon Redshift cluster]
Ensure that your data resides in the same Region as your Redshift clusters
Split the data into multiple files to facilitate parallel processing
Files should be individually compressed using GZIP or LZOP
Optionally, you can encrypt your data using Amazon S3 Server-Side or Client-Side Encryption
29. Loading data from Amazon S3
Preparing Input Data Files
Uploading files to Amazon S3
Using COPY to load data from Amazon S3
30. Input data using delimiters
1|Customer#000000001|j5JsirBM9P|MOROCCO 0|MOROCCO|AFRICA|25-989-741-2988|BUILDING
2|Customer#000000002|487LW1dovn6Q4dMVym|JORDAN 1|JORDAN|MIDDLE EAST|23-768-687-3665|AUTOMOBILE
3|Customer#000000003|fkRGN8n|ARGENTINA7|ARGENTINA|AMERICA|11-719-748-3364|AUTOMOBILE
4|Customer#000000004|4u58h f|EGYPT 4|EGYPT|MIDDLE EAST|14-128-190-5944|MACHINERY
Example of pipe (‘|’) delimited file
CREATE TABLE customer (
c_custkey integer not null,
c_name varchar(25) not null,
c_address varchar(25) not null,
c_city varchar(10) not null,
c_nation varchar(15) not null,
c_region varchar(12) not null,
c_phone varchar(15) not null,
c_mktsegment varchar(10) not null
);
COPY customer FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
DELIMITER '|';
31. Input data using fixed-width fields
1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING
2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE
3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE
4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINERY
CREATE TABLE customer (
c_custkey integer not null,
c_name varchar(25) not null,
c_address varchar(25) not null,
c_city varchar(10) not null,
c_nation varchar(15) not null,
c_region varchar(12) not null,
c_phone varchar(15) not null,
c_mktsegment varchar(10) not null
);
COPY customer FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
FIXEDWIDTH '0:3, 1:25, 2:25, 3:10, 4:15, 5:12, 6:15, 7:10';
Client.txt
32. Input data in JSON format
COPY uses a jsonpaths text file to parse JSON data
JSONPath expressions specify the path to JSON name elements
Each JSONPath expression corresponds to a column in the Amazon Redshift target table
Suppose you want to load the VENUE table with the following content:
{ "id": 15, "name": "Gillette Stadium", "location": [ "Foxborough", "MA" ], "seats": 68756 }
{ "id": 15, "name": "McAfee Coliseum", "location": [ "Oakland", "MA" ], "seats": 63026 }
You would use the following jsonpaths file to parse the JSON data:
{ "jsonpaths": [ "$['id']", "$['name']", "$['location'][0]", "$['location'][1]", "$['seats']" ] }
33. Splitting data files
[Figure: Client.txt.1 … Client.txt.4 in the mydata bucket are loaded in parallel onto 2 XL compute nodes (Node 0 and Node 1, each with Slice 0 and Slice 1)]
COPY customer FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
DELIMITER '|';
The S3 path acts as a key prefix, so this single COPY command picks up all four split files.
34. Use multiple input files to maximize throughput
Use the COPY command
Each slice can load one file at a time
A single input file means only one slice is ingesting data
Instead of 100 MB/s, you’re only getting 6.25 MB/s
35. Use multiple input files to maximize throughput
Use the COPY command
You need at least as many input files as you have slices
With 16 input files, all slices are working, so you maximize throughput
Get 100 MB/s per node; scale linearly as you add nodes
36. Loading data using manifest files
Use a manifest to load all required files
Supply a JSON-formatted text file that lists the files to be loaded
Can load files from different buckets or with different prefixes
{
"entries": [
{"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
{"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
{"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
{"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
]
}
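A sketch of the matching COPY statement (table and file names are assumptions); the MANIFEST keyword tells COPY that the S3 object is a manifest rather than a data file:

COPY customer FROM 's3://mybucket-alpha/custdata.manifest'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
DELIMITER '|'
MANIFEST;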
37. AWS Database Migration Service (AWS DMS)
Supports both homogeneous and heterogeneous data replication.
Supported database sources include: (1) Oracle, (2) SQL Server, (3) MySQL, (4) Amazon Aurora, (5) PostgreSQL, and (6) ODBC. All sources are supported on-premises, in EC2, and RDS.
Supported database targets include: (1) Amazon Aurora, (2) Oracle, (3) SQL Server, (4) MySQL, (5) PostgreSQL, and (6) Amazon Redshift. All Oracle, SQL Server, MySQL and Postgres targets are supported on-premises, in EC2 and RDS.
Keep your apps running during the migration
38. AWS DMS – Online migration
[Figure: application users reach databases in the customer premises and in AWS over the Internet/VPN while AWS Database Migration Service replicates between source and target]
Start a replication instance
Connect to source and target databases
Select tables, schemas, or databases
Let AWS Database Migration Service create tables, load data, and keep them in sync
Switch applications over to the target at your convenience