The document summarizes Terapot, a commercial email archiving system that uses Hadoop. It discusses how Terapot addresses the challenges of archiving massive amounts of email data at low cost and high scalability. Terapot leverages Hadoop's distributed architecture for crawling, indexing, and searching emails across thousands of servers. Key components include batch processing for archiving, real-time indexing, distributed search, and analysis tools that mine the archived email data.
The data management industry has matured over the last three decades, built primarily on relational database management system (RDBMS) technology. As the volume, variety, and velocity of data collected and analyzed in enterprises have increased severalfold, organizations have begun to struggle with the architectural limitations of traditional RDBMS designs. As a result, a new class of systems had to be designed and implemented, giving rise to the phenomenon of "Big Data". In this paper we trace the origin of Hadoop, a new class of system built to handle Big Data.
Module 01 - Understanding Big Data and Hadoop 1.x, 2.x (NPN Training)
This document provides an overview of Big Data and Hadoop. It discusses what Big Data is, why existing data analytics approaches have limitations, and how Hadoop addresses these issues. Hadoop uses a master-slave architecture with the NameNode as master and DataNodes as slaves. It stores data in HDFS as blocks across DataNodes and allows distributed processing via MapReduce. The document covers Hadoop 1.0 and 2.0 components as well as challenges of Hadoop 1.x like single point of failure and lack of high availability of the NameNode.
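The MapReduce flow described above can be sketched in miniature, with an in-process stand-in for the framework's map, shuffle, and reduce phases (a toy word count, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in an input split.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reducer: aggregate all counts collected for one key.
    return word, sum(counts)

def run_job(lines):
    grouped = defaultdict(list)
    for line in lines:                       # shuffle: group mapper output by key
        for word, count in map_phase(line):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

result = run_job(["big data", "big hadoop"])
```

In the real framework the same three phases run in parallel across DataNodes, with HDFS blocks as the input splits.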
This document summarizes a PhD dissertation defense about developing a semantic document architecture (SDArch) for desktop data integration and management. The key points are:
1. It proposes a semantic document model (SDM) that represents documents as semantically annotated and interlinked data units identified by URIs and composed of ontological concepts.
2. The semantic document architecture (SDArch) integrates desktop data into a unified information space and enables sharing data across social communities through semantic linking and annotations.
3. An evaluation validated that semantic documents improved information retrieval over traditional keyword search and full text indexing by leveraging semantic annotations and links between document units.
The document describes Megastore, a storage system developed by Google to meet the requirements of interactive online services. Megastore blends the scalability of NoSQL databases with the features of relational databases. It uses partitioning and synchronous replication across datacenters using Paxos to provide strong consistency and high availability. Megastore has been widely deployed at Google to handle billions of transactions daily storing nearly a petabyte of data across global datacenters.
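The synchronous-replication idea can be illustrated with a much-simplified majority-quorum sketch. This is not Paxos itself (whose leader election and log agreement are considerably more involved), and replica failures are not modeled; it only shows why a majority read always observes the last majority write:

```python
class Replica:
    def __init__(self):
        self.store = {}                       # key -> (value, version)

    def write(self, key, value, version):
        # Accept the write only if it is newer than what we hold.
        current = self.store.get(key, (None, -1))
        if version > current[1]:
            self.store[key] = (value, version)
        return True                           # ack (failure injection omitted)

class QuorumClient:
    def __init__(self, replicas):
        self.replicas = replicas
        self.majority = len(replicas) // 2 + 1
        self.version = 0

    def write(self, key, value):
        self.version += 1
        acks = sum(r.write(key, value, self.version) for r in self.replicas)
        return acks >= self.majority

    def read(self, key):
        # Any majority of replicas overlaps the last successful majority write.
        votes = [r.store.get(key, (None, -1)) for r in self.replicas[:self.majority]]
        return max(votes, key=lambda v: v[1])[0]
```

Megastore additionally partitions data into entity groups and runs this kind of agreement per group, across datacenters.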
2012.04.26 big insights streams im forum2 (Wilfried Hoge)
This document summarizes IBM's Big Data platform called InfoSphere BigInsights and InfoSphere Streams. It discusses how the platform can integrate and manage large volumes, varieties and velocities of data, apply advanced analytics to data in its native form, and enable visualization and development of new analytic applications. It also describes the key components of the BigInsights platform including Hadoop, data integration, governance and various accelerators.
Characterization of Hadoop jobs using unsupervised learning (João Gabriel Lima)
This document summarizes research characterizing Hadoop jobs using unsupervised learning techniques. The researchers clustered over 11,000 Hadoop jobs from Yahoo production clusters into 8 groups based on job metrics. The centroids of each cluster represent characteristic jobs and show differences in map/reduce tasks and data processed. Identifying common job profiles can help benchmark and optimize Hadoop performance.
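The clustering approach can be sketched with a toy k-means over two invented per-job metrics (map tasks, GB read); the data, metric names, and deterministic initialization are simplifications for illustration, not details from the Yahoo study:

```python
def dist2(a, b):
    # squared Euclidean distance between two metric vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, iters=10):
    # deterministic init for k=2: first and last point as seed centroids
    centroids = [points[0], points[-1]]
    clusters = [[], []]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            i = 0 if dist2(p, centroids[0]) <= dist2(p, centroids[1]) else 1
            clusters[i].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# six hypothetical jobs: three small, three large
jobs = [(10, 1), (12, 2), (11, 1), (1000, 500), (990, 480), (1010, 520)]
centroids, clusters = kmeans(jobs)
```

As in the study, each resulting centroid can be read as a "characteristic job" for its cluster.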
This document summarizes the 25 most promising open source projects of 2010 according to Bruno Michel of af83. It provides descriptions and advantages for several key-value stores, document databases, distributed databases, workqueue services, and configuration management systems, including Redis, MongoDB, Riak, Cassandra, Resque, Beanstalkd, and Puppet.
This paper discusses implementing NoSQL databases for robotics applications. NoSQL databases are well-suited for robotics because they can store massive amounts of data, retrieve information quickly, and easily scale. The paper proposes using a NoSQL graph database to store robot instructions and relate them according to tasks. MapReduce processing is also suggested to break large robot data problems into parallel pieces. Implementing a NoSQL system would allow building more intelligent humanoid robots that can process billions of objects and learn quickly from massive sensory inputs.
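The proposed graph model, with instructions as nodes related to tasks by edges, can be sketched with a tiny in-memory property graph; the schema, node identifiers, and relation name are hypothetical:

```python
class Graph:
    def __init__(self):
        self.nodes = {}        # node id -> properties
        self.edges = {}        # node id -> list of (relation, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def relate(self, src, relation, dst):
        self.edges[src].append((relation, dst))

    def neighbors(self, node_id, relation):
        # follow only edges with the given relation label
        return [dst for rel, dst in self.edges[node_id] if rel == relation]

g = Graph()
g.add_node("task:grasp", kind="task")
g.add_node("instr:open_gripper", kind="instruction")
g.add_node("instr:close_gripper", kind="instruction")
g.relate("task:grasp", "STEP", "instr:open_gripper")
g.relate("task:grasp", "STEP", "instr:close_gripper")
```

A production graph database adds indexing, persistence, and a query language on top of this basic node/edge structure.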
A Study of I/O and Virtualization Performance with a Search Engine based on ... (Lucidworks, archived)
Documentum xPlore provides an integrated search facility for the Documentum Content Server. The standalone search engine is based on EMC's xDB (a native XML database) and Lucene. In this talk we will introduce xPlore and some of its key components and capabilities. These include aspects of a tight integration of Lucene with the XML database: XQuery translation and optimization into Lucene queries/APIs, as well as transactional updates to Lucene. In addition, xPlore is being deployed aggressively into virtualized environments (both disk I/O and VM). We cover some performance results and tuning tips in these areas.
This document discusses Facebook's deployment of Hadoop and HBase to support real-time applications at massive scale. It describes how Facebook Messages, Insights, and other applications require high throughput writes, large datasets, and low-latency reads. The document outlines why Hadoop and HBase were chosen over other systems to meet these needs, including elasticity, consistency, availability, and fault tolerance. It also describes enhancements made to HDFS and HBase to optimize for Facebook's workloads.
Hadoop Summit EU - Crowd Sourcing Reflected Intelligence (Ted Dunning)
This document discusses how search and big data technologies are evolving to enable reflected intelligence capabilities. It provides backgrounds of Ted Dunning from MapR and Ivan Provalov from LucidWorks. The document outlines various use cases that combine search, analytics and discovery on big data to gain insights from user interactions. It argues that the combination of MapR's data platform and LucidWorks' search technologies provides an integrated solution for building next generation search and discovery applications.
This document discusses the emergence of cloud libraries and cloud computing applications for science and technology (S&T) libraries. It begins by describing the evolution from traditional paper-based libraries to digital libraries without physical walls. It then defines cloud computing and explains how many popular web services already utilize cloud computing. The document outlines different types of cloud services, including SaaS, PaaS, IaaS, and DaaS, and provides examples of how libraries currently use cloud computing applications. It raises questions about data ownership, costs, and technical requirements for libraries adopting cloud-based systems and services.
Brig Lamoreaux of Apollo Group worked with his colleagues to put together this white paper detailing their evaluation of MongoDB. He also presented at Oracle OpenWorld 2012 on their use case with MongoDB.
MongoDB on Windows Azure provides two options for deploying the MongoDB database on Microsoft's cloud platform:
1) Windows Azure Virtual Machines allow more control over infrastructure but require more operational effort. Users can choose Windows or Linux and install software themselves.
2) Windows Azure Cloud Services decrease operational effort through automated management but provide less infrastructure control. Only Windows is supported and configurations are pre-defined.
Both options provide scalability and high availability through features like replication and sharding. Developers should evaluate the level of control and effort needed to determine the best deployment model for their application on the Windows Azure cloud.
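Sharding, one of the scalability features mentioned, routes each document to a shard by its shard key. A minimal hashed-routing sketch (this is not MongoDB's implementation, and the key names and shard count are illustrative):

```python
import hashlib

def shard_for(shard_key, num_shards):
    # Hash the key so documents spread evenly across shards, regardless
    # of any natural ordering in the key values.
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# every document with the same shard key always lands on the same shard
target = shard_for("user:42", 4)
```

Hashed routing trades range-query locality for even data distribution, which is why document databases typically offer both hashed and ranged sharding.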
Dynamo Systems - QCon SF 2012 Presentation (Shanley Kane)
A look at Dynamo-based systems: the architectural principles, use cases and requirements; where they differ from relational databases; and where they are going.
Cassandra is used as an email store to provide horizontal scalability and high availability for storing email metadata and indexing labels. The document discusses storing email metadata such as headers and body in Cassandra, while file attachments go to a blob store like S3 for better performance, given Cassandra's limitations with large blobs. The result is a polyglot data model: Cassandra for metadata and indexing, and a blob store for file attachments.
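The polyglot split can be sketched with in-memory stand-ins for the metadata store and the blob store; the class and field names are illustrative only:

```python
class EmailStore:
    def __init__(self):
        self.metadata = {}       # stand-in for a Cassandra table
        self.blobs = {}          # stand-in for an S3 bucket

    def put(self, msg_id, headers, body, attachments):
        refs = []
        for name, data in attachments:
            blob_key = f"{msg_id}/{name}"
            self.blobs[blob_key] = data        # large blob goes to the blob store
            refs.append(blob_key)
        # only small metadata plus blob *references* go to the metadata store
        self.metadata[msg_id] = {"headers": headers, "body": body,
                                 "attachments": refs}

    def get_attachment(self, msg_id, name):
        return self.blobs[f"{msg_id}/{name}"]
```

Keeping only references in the row means metadata reads stay fast even when a message carries multi-megabyte attachments.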
This document is a bulletin of the Association for the Advancement of Science and Technology in Spain containing news, opinion pieces, and scientific articles on topics related to science and technology. It covers the creation of a new Ministry of Science and Innovation in Spain and the implications of this change for the management of public research in the country.
This document describes the characteristics and benefits of motherhood blogs. It explains that these blogs share personal motherhood experiences in a warm way that lets mothers identify with them. They also work as a diary for expressing emotions and receiving support from other mothers. Motherhood blogs have a future because brands are paying more attention to them as a source of influence on mothers.
Serious Games and Social Media: A Future Market (Johannes Konert)
Using user-generated content and interaction patterns from social media applications enables knowledge exchange and interaction among players (for example, of a learning game). Combining serious games (computer games used for a purpose beyond pure entertainment) with social media opens up the research field of "social serious games".
The talk at Learntec 2013 in Karlsruhe discusses definitions, market volume, and required components, and presents first implementations and evaluation results.
CFW Domestic Sprinkler Regulations - BAFSA Fire Sprinkler Wales (Rae Davies)
The document discusses the British Automatic Fire Sprinkler Association's (BAFSA) efforts to develop vocational qualifications and skills training for the fire sprinkler industry in the UK. It outlines BAFSA's achievements since 2012, which include developing the first National Occupational Standards, conducting a labor market survey, and creating a Level 2 National Qualification in Fire Sprinkler Installation. The qualification is being delivered through BAFSA-preferred training providers to help formalize training and address skills gaps in the industry workforce. BAFSA aims to continue developing career pathways and additional qualifications to ensure industry workers have the skills needed.
The document is about learning to see sculpture. It explains that sculpture can be executed in various materials and that each sculptor expresses something personal in their works to show feelings and forms to the viewer. However, viewers may perceive the works differently depending on their personal experience or sensibility.
This document is a membership form for the ATES Catalunya trade union. It contains sections for personal information such as name, address, and contact details. It also includes options for authorizing the deduction of union dues either through a bank or via payroll, indicating the required bank details. The applicant signs at the end to give consent.
This document presents the syllabus for the course on Ocular Propaedeutics and Therapeutics. The course covers concepts of ophthalmic drug formulation and palliative treatments for primary eye care in accordance with the law, developing skills in diagnosis, recognition of pathologies, and supporting examinations. The methodology includes simulated clinical cases and workshops with pharmaceutical laboratories. The evaluation consists of four components: concepts
This document outlines the author's qualifications and training which include:
- Numerous NVQ qualifications in areas like business improvement techniques, lean office practices, manufacturing and engineering operations, warehousing and storage, business and administration, and customer service.
- Apprenticeships in customer service, business and administration, information technology, manufacturing operations, and engineering operations.
- Training in rapid improvement workshops covering topics like 5S workplace organization, continuous improvement, failure mode effect analysis, and lean leadership.
The document also provides information on the benefits and costs of various training programs the author can facilitate.
This document describes the Expedia website, including its main functions such as booking flights, hotels, car rentals, and travel packages. It also explains how businesses can promote themselves on the site through an affiliate program offering commissions and incentives. Finally, it summarizes mixed testimonials about users' experience with the site.
The document describes a consulting firm called Progestión Occidente. Progestión Occidente offers advisory, consulting, training, and support services in areas such as public finance, local economic development, land-use planning, social development, and human capital. The firm is made up of a multidisciplinary team of professionals with experience in these areas. Its mission is to provide quality solutions to public and private entities to drive regional development in a s
Ethical Hacking Chapter 2 - TCP/IP (Eric Vanderburg)
The document describes the TCP/IP protocol stack and key networking concepts. It explains that TCP/IP has four layers - network, internet, transport, and application. The transport layer handles encapsulation and uses TCP for connection-oriented communication, while the internet layer handles packet routing between hosts using IP addresses. It also covers binary, octal, and hexadecimal numbering systems used in IP addressing and packet headers.
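The numbering systems mentioned can be made concrete with one IPv4 address rendered in dotted decimal, binary, and hexadecimal:

```python
def ip_representations(addr):
    # Render each octet of a dotted-decimal IPv4 address in binary and hex,
    # the two forms most often seen in packet headers and subnet work.
    octets = [int(o) for o in addr.split(".")]
    return {
        "decimal": addr,
        "binary": ".".join(f"{o:08b}" for o in octets),
        "hex": ".".join(f"{o:02X}" for o in octets),
    }

reps = ip_representations("192.168.1.10")
```

Reading the binary form octet by octet is exactly the skill needed for subnet masks and for decoding the fixed-width fields of IP and TCP headers.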
This document mentions various historic places and monuments of the city of Granada, Spain, including the Alhambra, the Albaicín, the Generalife, and others such as the Plaza del Triunfo, the Fuente del Triunfo, the Faculty of Medicine, several churches and the cathedral, as well as the city's emblematic rivers and fountains. It closes by wishing the reader a good day.
Ameet Talwalkar, assistant professor of Computer Science, UCLA, at MLconf SF (MLconf)
Abstract:
Apache Spark’s MLlib is a terrific library for fitting large-scale machine learning models. However, translating high-level problem statements like “learn a classifier” into a working model presently requires significant manual effort (via ad hoc parameter tuning) and computational resources (to fit several models). We present our work on the MLbase optimizer – a system designed on top of Spark to quickly and automatically search through a hyperparameter space and find a good model. By leveraging performance enhancements, better search algorithms, and statistical heuristics, our system offers an order of magnitude speedup over standard methods.
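The hyperparameter-space search can be illustrated with plain random search over an invented objective; MLbase's optimizer layers better search algorithms and statistical heuristics on top of this kind of baseline:

```python
import random

def random_search(objective, space, trials=50, seed=1):
    # Sample candidate settings uniformly from each parameter's range,
    # score each candidate, and keep the best one seen.
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# toy objective (made up): score peaks when learning_rate is near 0.1
space = {"learning_rate": (0.001, 1.0)}
params, score = random_search(lambda p: -abs(p["learning_rate"] - 0.1), space)
```

Replacing the toy objective with "fit a model and return its validation accuracy" turns this loop into the manual tuning process the abstract says MLbase automates and accelerates.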
A conversation manager facilitates conversations between consumers, and between consumers and the brand. This stems from a strong belief that word-of-mouth is the key driver of business growth. And so: integrate word-of-mouth into everything that you do.
This document presents a school's coexistence plan. It sets general objectives such as promoting a culture of peace, improving coexistence, and fostering values such as respect. The specific objectives include raising teachers' awareness of intercultural and gender-based coexistence, equipping them with tools to manage conflicts, and involving students in preventing and resolving coexistence problems.
Acne is a chronic inflammatory skin disease caused by clogged pores. It is diagnosed by identifying lesions such as blackheads, whiteheads, and other inflammatory lesions. Treatments include topical therapies with retinoids and oral therapies with antibiotics such as tetracycline, erythromycin, doxycycline, or minocycline, depending on the severity of the lesions.
Please view to understand why this is Australia's fastest-adopted email archiving solution. Highly functional e-discovery capability, unlimited storage, zero hardware. Solve your PST nightmare while having real-time access to ALL company email.
This document discusses real-time big data applications and provides a reference architecture for search, discovery, and analytics. It describes combining analytical and operational workloads using a unified data model and operational database. Examples are given of organizations using this approach for real-time search, analytics and continuous adaptation of large and diverse datasets.
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr (DataWorks Summit)
This document discusses search, discovery, and analytics (SDA) using large-scale distributed technologies. It describes an SDA architecture using Apache Solr, Apache Mahout, and Apache Hadoop. Challenges of implementing SDA at scale are discussed, such as determining authoritative data stores and balancing real-time and batch processing. Specific techniques for implementing search, discovery, and experiment management are also covered.
ApacheCon Europe 2012: Elastic, Multi-tenant Hadoop on Demand (Richard McDougall)
Elastic, Multi-tenant Hadoop on Demand! Richard McDougall, Chief Architect, Application Infrastructure and Big Data, VMware, Inc. (@richardmcdougll), ApacheCon Europe, 2012. The talk broadens the application of Hadoop technology with horizontal and vertical use cases. Hadoop enables highly parallel data processing through the MapReduce programming framework and the Hadoop Distributed File System (HDFS) for distributed data storage. Serengeti automates deployment of Hadoop on virtual platforms in under 30 minutes, enabling multi-tenant, elastic Hadoop as a service.
This document provides an overview and introduction to Hadoop, an open-source framework for storing and processing large datasets in a distributed computing environment. It discusses what Hadoop is, common use cases like ETL and analysis, key architectural components like HDFS and MapReduce, and why Hadoop is useful for solving problems involving "big data" through parallel processing across commodity hardware.
The document provides an overview of big data technologies including Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, MongoDB, and Cassandra. It discusses how these technologies enable processing and analyzing very large datasets across commodity hardware. It also outlines the growth and market potential of the big data sector, which is expected to reach $48 billion by 2018.
Predictive Analytics and Machine Learning with SAS and Apache Hadoop (Hortonworks)
In this interactive webinar, we'll walk through use cases on how you can use advanced analytics like SAS Visual Statistics and In-Memory Statistic with Hortonworks’ data platform (HDP) to reveal insights in your big data and redefine how your organization solves complex problems.
The millions of people that use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets of that scale, and what can be done with them? I will briefly cover how Spotify uses data to provide a better music listening experience and to strengthen its business. Most of the talk will be spent on our data processing architecture, and how we leverage state-of-the-art data processing and storage tools such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Last, I'll present observations and thoughts on innovation in the data processing (aka Big Data) field.
Crowd-Sourced Intelligence Built into Search over Hadoop (DataWorks Summit)
Search is increasingly being used to gather intelligence on multi-structured data leveraging distributed platforms such as Hadoop in the background. This session will provide details on how search engines can be abused to use not text, but mathematically derived tokens to build models that implement reflected intelligence. The session will describe how to integrate Apache Solr/Lucene with Hadoop. Then we will show how crowd-sourced search behavior can be looped back into analysis and how constantly self-correcting models can be created and deployed. Finally, we will show how these models can respond with intelligent behavior in realtime.
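The idea of indexing mathematically derived tokens instead of text can be sketched with a trivial feature-bucketing function standing in for a learned model; the token scheme and scoring are invented for illustration:

```python
from collections import defaultdict

def feature_tokens(vector, bucket=0.25):
    # Quantize each numeric feature into a synthetic token like "f1_b3",
    # so similar vectors share tokens the way similar texts share words.
    return [f"f{i}_b{int(v // bucket)}" for i, v in enumerate(vector)]

class TokenIndex:
    def __init__(self):
        self.postings = defaultdict(set)    # token -> set of doc ids

    def add(self, doc_id, vector):
        for tok in feature_tokens(vector):
            self.postings[tok].add(doc_id)

    def query(self, vector):
        # documents sharing the most derived tokens rank first
        scores = defaultdict(int)
        for tok in feature_tokens(vector):
            for doc in self.postings[tok]:
                scores[doc] += 1
        return sorted(scores, key=scores.get, reverse=True)
```

With a real model producing the tokens (e.g. cluster assignments learned offline in Hadoop), an ordinary inverted index like Solr/Lucene becomes a fast nearest-neighbor lookup, which is the "abuse" the session describes.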
Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application perspective, like accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on-premises or on Azure.
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics... (Amr Awadallah)
Apache Hadoop is revolutionizing business intelligence and data analytics by providing a scalable and fault-tolerant distributed system for data storage and processing. It allows businesses to explore raw data at scale, perform complex analytics, and keep data alive for long-term analysis. Hadoop provides agility through flexible schemas and the ability to store any data and run any analysis. It offers scalability from terabytes to petabytes and consolidation by enabling data sharing across silos.
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen... (Cloudera, Inc.)
Opower is a fast-moving energy management SaaS company that collects sensor data from nearly all of the major utilities in the United States (meaning from more than 45 million American households), along with major utilities in 5 countries across Europe and Asia-Pacific. Opower manages more than 100 billion meter reads, ranging from high-frequency power data (AMI) to smart thermostat data and weather data. Currently all data at Opower is stored in HBase or Hadoop (and is notably not security sensitive). This session will discuss Opower's HBase architecture, highlight potential and current uses of data in HBase, share the vision of Opower's future projects and directions, and reveal how Opower's big data management has allowed the company to help its utility clients save enough energy to power a city of nearly 200,000 people and save utility customers more than $70 million since 2008!
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa... (Cloudera, Inc.)
One of the first challenges Hadoop developers face is accessing all the data they need and getting it into Hadoop for analysis. Informatica PowerExchange accesses a variety of data types and structures at different latencies (e.g. batch, real-time, or near real-time) and ingests data directly into Hadoop. The next step is to parse the data in preparation for analysis in Hadoop. Informatica provides a visual IDE to deploy pre-built parsers or design specific parsers for complex data formats and deploy them on Hadoop. Once the analysis is complete, Informatica PowerExhange delivers the resulting output to other information management systems such as a data warehouse. Learn in this session from Informatica and one of their customers, how to get all the data you need into Hadoop, parse a variety of data formats and structures, and egress the resultant output to other systems.
Offline processing with Hadoop allows for scalable, simplified batch processing of large datasets across distributed systems. It enables increased innovation by supporting complex analytics over large data sets without strict schemas. Hadoop adoption is moving beyond legacy roles to focus on data processing and value creation through scalable and customizable systems like Cascading.
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc... (Cloudera, Inc.)
"Amr Awadallah served as the VP of Engineering of Yahoo's Product Intelligence Engineering (PIE) team for a number of years. The PIE team was responsible for business intelligence and advanced data analytics across a number of Yahoo's key consumer-facing properties (search, mail, news, finance, sports, etc.). Amr will share the data architecture that PIE had implemented before Hadoop was deployed and the headaches that architecture entailed. He will then show how most, if not all, of these headaches were eliminated once Hadoop was deployed. Amr will illustrate how Hadoop and relational databases complement each other within the traditional business intelligence data stack, and how that enables organizations to access all their data under different operational and economic constraints."
Solr is an open source enterprise search platform built on Apache Lucene. It provides powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., word, pdf) handling. Solr powers the search capabilities of many large websites and is highly scalable, fault tolerant, and easy to use.
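Faceted search, one of the capabilities listed, can be illustrated in miniature: after a query selects matching documents, facet counts summarize field values across the result set. The documents here are invented, and simple substring matching stands in for full-text search:

```python
from collections import Counter

docs = [
    {"title": "hadoop guide", "format": "pdf", "year": 2011},
    {"title": "solr in action", "format": "pdf", "year": 2012},
    {"title": "lucene notes", "format": "word", "year": 2011},
]

def facet(results, field):
    # Count how often each value of `field` appears in the result set.
    return Counter(d[field] for d in results)

matches = [d for d in docs if "solr" in d["title"]]   # stand-in query
format_facets = facet(docs, "format")                 # facets over all docs
```

In Solr the same counts come back alongside the hits themselves, letting a user drill down ("only PDFs", "only 2011") without issuing a new kind of query.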
The document describes the evolution of Facebook's big data architectures from 2007 to 2011. It started with a traditional data warehouse using MySQL and grew significantly over time. Facebook moved to Hadoop and Hive in 2008 to enable data science at scale and store all data online. In 2009, they further democratized data with tools to make it accessible. Later improvements focused on isolation, efficiency, utilization and monitoring to control the growing chaos. By 2011, they developed Puma for real-time analytics and Peregrine for fast queries to go beyond Hadoop.
Similar to Hw09 Terapot Email Archiving With Hadoop (20)
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
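The filter-and-normalize step described can be sketched as follows; the log format, noise rules, and field names are invented for illustration, not Cloudera DataFlow's actual configuration:

```python
import re

# drop chatter before it is stored or analyzed
NOISE = re.compile(r"\b(DEBUG|heartbeat)\b")

# reshape security-relevant lines into uniform records
EVENT = re.compile(
    r"(?P<ts>\S+) (?P<host>\S+) login (?P<result>failed|ok) user=(?P<user>\S+)"
)

def normalize(lines):
    records = []
    for line in lines:
        if NOISE.search(line):
            continue                      # filter noise early, cutting volume
        m = EVENT.search(line)
        if m:
            records.append(m.groupdict()) # uniform dicts are easy to query
    return records
```

Once every machine's logs arrive in the same normalized shape, cross-host anomalies (e.g. a burst of failed logins for one user) become a simple aggregation.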
Cloudera Data Impact Awards 2021 - Finalists (Cloudera, Inc.)
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
2020 Cloudera Data Impact Awards Finalists (Cloudera, Inc.)
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most-cutting edge data projects and represent innovation and leadership in their respective industries.
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Machine Learning with Limited Labeled Data 4/3/19 (Cloudera, Inc.)
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 (Cloudera, Inc.)
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19 (Cloudera, Inc.)
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers, such as:
- Powerful data ingestion powered by Apache NiFi
- Edge data collection by Apache MiNiFi
- IoT-scale streaming data processing with Apache Kafka
- Enterprise services to offer unified security and governance from edge to enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19 (Cloudera, Inc.)
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 (Cloudera, Inc.)
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the Cloud for Analytics and Machine Learning 1.29.19 (Cloudera, Inc.)
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 (Cloudera, Inc.)
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18 (Cloudera, Inc.)
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
The document discusses the benefits and trends of modernizing a data warehouse. It outlines how a modern data warehouse can provide deeper business insights at extreme speed and scale while controlling resources and costs. Examples are provided of companies that have improved fraud detection, customer retention, and machine performance by implementing a modern data warehouse that can handle large volumes and varieties of data from many sources.
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
Cloudera SDX is by no means no restricted to just the platform; it extends well beyond. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptxFwdays
I will share my personal experience of full-time development on wasm Blazor
What difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, which technology stack and architectural patterns we chose
What conclusions we made and what mistakes we committed
Choosing the Best Outlook OST to PST Converter: Key Features and Considerationswebbyacad software
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
Generative AI technology is a fascinating field that focuses on creating comp...Nohoax Kanont
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
Redefining Cybersecurity with AI CapabilitiesPriyanka Aash
In this comprehensive overview of Cisco's latest innovations in cybersecurity, the focus is squarely on resilience and adaptation in the face of evolving threats. The discussion covers the imperative of tackling Mal information, the increasing sophistication of insider attacks, and the expanding attack surfaces in a hybrid work environment. Emphasizing a shift towards integrated platforms over fragmented tools, Cisco introduces its Security Cloud, designed to provide end-to-end visibility and robust protection across user interactions, cloud environments, and breaches. AI emerges as a pivotal tool, from enhancing user experiences to predicting and defending against cyber threats. The blog underscores Cisco's commitment to simplifying security stacks while ensuring efficacy and economic feasibility, making a compelling case for their platform approach in safeguarding digital landscapes.
Top 12 AI Technology Trends For 2024.pdfMarrie Morris
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
It's your unstructured data: How to get your GenAI app to production (and spe...Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceQuentin Reul
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and securing a formidable competitive advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
1. Next Revolution
Toward Open Platform
Terapot: Massive Email Archiving
with Hadoop & Friends
- Commercial Hadoop Application
Jaesun Han
Founder & CEO of NexR
jshan@nexrcorp.com
2. #2
About NexR
Offering Hadoop & Cloud Computing Platform and Services
- Hadoop & Cloud Computing Services: Hadoop provisioning & management, academic support, massive email archiving, MapReduce workflow program
- Massive Data Storage & Processing Platform: cloud computing platform (compatible with Amazon AWS) with icube-cc (compute) and icube-sc (storage)
3. #3
What is Email Archiving?
The Objectives of Email Archiving
- Regulatory compliance
- e-Discovery: Litigation and legal discovery
- E-mail backup and disaster recovery
- Messaging system & storage optimization
- Monitoring of internal and external e-mail content
4. #4
The Architecture of Email Archiving
- Data Acquisition: journaling, mailbox crawling
- Data Processing: indexing, filtering
- Data Access: search, discovery
[Diagram: email servers feed the email archiving server through journaling and crawling; the server builds indexes and writes email data to archival storage; employees use search, auditors use discovery, and administrators operate the system]
5. #5
The Challenges of Email Archiving
Explosive growth of digital data
- 6x more data (988 EB) in 2010 than in 2006
- 95% of it (939 EB) is unstructured data, including email
- Increasing the cost and complexity of archiving
Requiring scalable & low-cost archiving
Reinforcement of data retention regulation
- Retention, disposal, e-Discovery, security
- HIPAA (healthcare) 21~23 yrs, SEC 17 (trading) 6 yrs, OSHA (toxic exposure) 30 yrs, SOX (finance) 5 yrs, J-SOX, K-SOX
Requiring scalable archiving & fast discovery
Needs for intelligent data management
- Knowledge management from email data
- Filtering, monitoring, data mining, etc.
Requiring integration with an intelligent system
6. #6
New Requirements of Email Archiving
High Scalability
Low Cost
High Performance
Intelligence
7. #7
Terapot: When Hadoop Met Email Archiving…
Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery
[Diagram: email servers journal messages to a journaling server; Hadoop MapReduce performs distributed crawling and indexing; Hadoop HDFS stores the archive; search & discovery run distributed across the cluster]
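The crawl-then-index split described above can be sketched as a toy MapReduce job: mappers "crawl" mailboxes and emit (user, message) pairs, and reducers build a per-user inverted index. This is a plain-Python illustration, not Hadoop's or Lucene's actual API, and the mailbox data is hypothetical.

```python
from collections import defaultdict

# Toy mailboxes standing in for journaled email sources (hypothetical data).
MAILBOXES = {
    "alice": ["quarterly report attached", "lunch on friday"],
    "bob": ["quarterly numbers look good"],
}

def map_crawl(user, messages):
    """Map phase: 'crawl' a mailbox, emitting (user, message) pairs."""
    for msg in messages:
        yield user, msg

def reduce_index(user, messages):
    """Reduce phase: build a tiny inverted index (term -> message ids) per user."""
    index = defaultdict(set)
    for doc_id, msg in enumerate(messages):
        for term in msg.split():
            index[term].add(doc_id)
    return user, dict(index)

# Shuffle: group map output by key (user), as the MapReduce framework would.
grouped = defaultdict(list)
for user, messages in MAILBOXES.items():
    for key, value in map_crawl(user, messages):
        grouped[key].append(value)

indexes = dict(reduce_index(u, msgs) for u, msgs in grouped.items())
print(sorted(indexes["alice"]["quarterly"]))  # message ids containing the term
```

Because emails are partitioned by user at the shuffle step, each reducer can write its index independently, which is what makes the indexing stage scale out.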
8. #8
Features of Terapot
Distributed Massive Email Archiving
High Scalability by Shared-Nothing Architecture
- Thousands of servers, billions of emails
Low Cost by Inexpensive Hardware
- Entry servers under $5,000
High Performance by Parallelism
- Fast search under 1-2 seconds for each user account
- Fast discovery in parallel with MapReduce
Intelligence by Data Mining
- Contact network analysis, content analysis, statistics
Support for Both On-Premise and Cloud (Hosted) Versions
Development with Various Open Source Software
9. #9
The Architecture of Terapot
Email sources: POP3 mail server, NAS/NFS, FTP/SFTP server
Terapot clients connect over HTTP via SOAP, REST, and JSON
Terapot Frontend: MR Workflow Manager, Mail Server, Search Gateway, Analyzer
Four key components:
- Batch Processing: crawling, indexing, merging
- Real-Time Indexing
- Search
- Analysis: ETL, mining
Built on Hadoop MapReduce, Lucene, & Hive, with email stored in HDFS and indexes on the local file system
10. #10
Batch Processing Component
Archiving policies
- An archive file per user
- Several archive files per configured crawling period
Pipeline
- Crawling (MR): pulls email from the sources into HDFS as an archive file per user (sequence file)
- Indexing (MR): builds a temporary index file per user (Lucene index file)
- Merging: produces a merged index file (for backing up) and index shards (3-copy replication), copied to the local file system for search
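The merging step's shard placement can be sketched as hash-based assignment with 3-copy replication, matching the replication factor on this slide. The node names, shard count, and placement rule below are hypothetical simplifications, not Terapot's actual scheme.

```python
import hashlib

NUM_SHARDS = 4
NODES = ["node-0", "node-1", "node-2", "node-3", "node-4"]  # hypothetical cluster
REPLICAS = 3  # matches the slide's 3-copy replication

def shard_for(user):
    """Stable shard assignment: hash the user id into a shard number."""
    h = int(hashlib.md5(user.encode()).hexdigest(), 16)
    return h % NUM_SHARDS

def replica_nodes(shard):
    """Place the 3 copies of a shard on consecutive nodes (simplified placement)."""
    return [NODES[(shard + i) % len(NODES)] for i in range(REPLICAS)]

shard = shard_for("alice@example.com")
print(shard, replica_nodes(shard))
```

Hashing the user id keeps all of one user's index in a single shard, which is why a per-user search only has to touch one shard's replicas.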
11. #11
Real-Time Indexing Component
[Diagram: the journaling server forwards incoming email; the real-time indexing component indexes it in memory, with a database backing the real-time archive; the in-memory index is periodically flushed to HDFS, where the batch processing component crawls the archive and builds the durable index]
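The buffer-then-flush pattern behind real-time indexing can be sketched as follows. This is a minimal toy: a Python list stands in for both the in-memory index and the HDFS files, and the flush threshold is an invented parameter; the real component writes Lucene segments and HDFS files.

```python
class RealTimeIndexer:
    """Toy in-memory indexer that flushes to 'HDFS' (a list here) when full."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memory = []  # in-memory index buffer for freshly journaled mail
        self.hdfs = []    # stands in for flushed index files on HDFS

    def index(self, message):
        """Index one journaled message; flush when the buffer fills up."""
        self.memory.append(message)
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffered batch out as one 'index file' and clear memory."""
        self.hdfs.append(list(self.memory))
        self.memory.clear()

idx = RealTimeIndexer(flush_threshold=2)
for m in ["m1", "m2", "m3"]:
    idx.index(m)
print(len(idx.hdfs), idx.memory)  # one flushed batch, one message still buffered
```

Keeping the newest messages in memory is what lets searches see mail immediately, while periodic flushing hands durable storage and re-indexing over to the batch component.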
12. #12
Search & Discovery Component
[Diagram: the Search Gateway locates index shards and scatters queries across search nodes for distributed search; shards are assigned to nodes, which copy their index shards from HDFS to the local file system; ZooKeeper tracks shard status and locations; real-time indexing nodes are also queried for the freshest mail]
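The gateway's role can be sketched as scatter-gather: send the query to every relevant shard, then merge and rank the partial hit lists. The shard names, scores, and local lookups below are hypothetical stand-ins; a real gateway would discover shard locations from ZooKeeper and make remote calls to the search nodes.

```python
# Hypothetical per-shard hit lists: (doc_id, score) pairs a search node returns.
SHARD_RESULTS = {
    "shard-0": [("msg-17", 0.9), ("msg-3", 0.4)],
    "shard-1": [("msg-42", 0.7)],
}

def search_gateway(query_shards, top_k=2):
    """Scatter the query to every shard, then gather and rank the hits."""
    hits = []
    for shard in query_shards:             # scatter phase
        hits.extend(SHARD_RESULTS[shard])  # each node searches its local shard copy
    hits.sort(key=lambda h: h[1], reverse=True)  # gather phase: rank by score
    return hits[:top_k]

print(search_gateway(["shard-0", "shard-1"]))
```

Because each node searches only its local shard copy, latency stays near the 1–2 second per-account figure quoted earlier even as the shard count grows.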