The growth of the mobile Internet, social media, and smart devices has triggered an information explosion, producing massive volumes of unstructured and semi-structured data. This data arrives in diverse formats and at high speed, posing unprecedented challenges to enterprise information architectures. Faced with heterogeneous data structures and a variety of analysis tools, what kind of architecture should we adopt to integrate them, manage the data lifecycle effectively, and extract value from the data? Within this larger architecture, the Hadoop ecosystem will undoubtedly play the role of the foundational data platform, realizing the enterprise Data Lake.
2. Agenda
• EDW today: status and challenges
• What is an Enterprise Data Lake (EDL)?
• What characteristics does an EDL need?
• The evolving Hadoop ecosystem
• How the Hadoop ecosystem realizes an Enterprise Data Lake
• Takeaways
3. Toward a Data-Driven Enterprise
Today: Bring Data to Compute
• Process-centric businesses use structured data mainly, internal data only, "important" data only
Future: Bring Compute to Data
• Information-centric businesses use all data: multi-structured, internal and external data of all types
[Diagram: the relative size and complexity of the Data and Compute blocks in each model]
4. EDW Today: Status and Challenges
Sources: ERP, CRM, RDBMS, machines; files, images, video, logs, clickstreams; external data sources
Systems: EDWs, data marts, archives, search servers, document stores, storage
• Complex architecture
  – Many special-purpose systems
  – Moving data around
  – No complete views
• Visibility
  – Leaving data behind
  – Risk and compliance
  – High cost of storage
• Time to data
  – Up-front modeling
  – Transforms are slow
  – Transforms lose data
• Cost of analytics
  – Existing systems strained
  – No agility
  – BI backlog
5. What is an Enterprise Data Lake?
• "A data lake is a large storage repository that holds data until it is needed"
• Put an end to data silos
• Hadoop allows holding data in its original, native format rather than forcing integration of large volumes of data up front
https://en.wikipedia.org/wiki/Data_lake
6. Data Lake is a reference architecture
Sources: ERP, CRM, RDBMS, machines; files, images, video, logs, clickstreams; external data sources
Serving: EDWs, marts, storage, search servers, documents, archives
• Active archive
  – Full-fidelity original data
  – Indefinite time, any source
  – Lowest-cost storage
• Data management, transformations
  – One source of data for all analytics
  – Persisted state of transformed data
  – Significantly faster and cheaper
• Self-service exploratory BI
  – Simple search + BI tools
  – "Schema on read" agility
  – Reduce BI user backlog requests
• Multi-workload analytic platform (Big Data Platform)
  – Bring applications to data
  – Combine different workloads on common data (e.g. SQL + Search)
  – True BI agility
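The "schema on read" agility mentioned above can be sketched in a few lines of Python: raw records land in the lake untouched, and a schema is applied only when a consumer reads them. The field names and records here are hypothetical, purely for illustration.

```python
import json

# Raw events are stored exactly as they arrived -- no up-front modeling.
raw_events = [
    '{"user": "u1", "action": "view", "price": "9.99"}',
    '{"user": "u2", "action": "buy", "price": "19.90", "coupon": "X1"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, cast types, default missing."""
    for line in lines:
        record = json.loads(line)
        yield {name: cast(record.get(name, default))
               for name, (cast, default) in schema.items()}

# Two consumers can read the same raw data with different schemas.
billing_schema = {"user": (str, ""), "price": (float, 0.0)}
rows = list(read_with_schema(raw_events, billing_schema))
print(rows[0])  # {'user': 'u1', 'price': 9.99}
```

Because nothing is transformed at write time, a new consumer with different needs simply defines its own schema; no data was lost by an up-front transform.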
9. Hadoop & Spark
• YARN (resource management) over HDFS, HBase (data store)
• Core Hadoop: MapReduce2, Hive, Pig, Impala, Search
• Spark on YARN: Spark, Spark Streaming, SparkSQL, GraphX, MLlib
[Diagram legend distinguishes supported Spark components from not-yet-supported add-ons]
10. Cloudera's Position on Spark
Cloudera is a member of, and aligned with, the broader Spark community.
• Spark will replace MapReduce as the general-purpose Hadoop framework
  – Broad community and vendor adoption
  – Hadoop ecosystem integration (native + 3rd party)
• Spark goes beyond data science/machine learning
  – Cloudera is working on Spark Core, Streaming, Security, YARN, and MLlib
• Spark does not replace special-purpose frameworks
  – One size does not fit all for SQL, Search, Graph, Stream
11. Spark Engineering in Cloudera
• Cloudera embraced Spark in early 2014
• One Platform Initiative (Mike's blog)
• Engineering with Intel to broaden the Spark ecosystem
  – Hive-on-Spark
  – Pig-on-Spark
  – Spark-over-YARN
  – Spark Streaming reliability
  – General Spark optimization
  – Security, compliance, and audit
12. Hive on Spark
• Technology
  – Hive: the "standard" SQL tool in Hadoop
  – Spark: next-gen distributed processing framework
  – Hive + Spark
    • Performance
    • Minimum feature gap
• Industry
  – Many customers have invested heavily in Hive
  – They want to leverage the Spark engine
13. 13
Design Principles
• No or limited impact on Hive’s existing code path
• Maximize code reuse
• Minimum feature customization
• Low future maintenance cost
15. Current Status
• All functionality in Hive is implemented
• First round of optimization is completed
  – Map join, SMB
  – Split generation and grouping
  – CBO, vectorization
• More optimization and benchmarking coming
• Beta in CDH: https://issues.apache.org/jira/browse/HIVE-7292
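For reference, enabling the beta amounts to switching Hive's execution engine at the session level, per the Hive on Spark documentation; the executor memory value below is only an illustrative tuning choice, not a recommendation:

```sql
-- Run subsequent queries in this session on Spark instead of MapReduce.
set hive.execution.engine=spark;
-- Spark-side settings can be passed through Hive as well (illustrative value).
set spark.executor.memory=4g;
```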
16. How the Hadoop Ecosystem Realizes an EDL
• Data Lake building blocks
• Common pain points when building a Data Lake
  – Heterogeneous data sources
  – Overly wide tables
  – Slow queries
• Groundwork before adopting a Data Lake
  – How many columns can your current relational database handle?
  – How will you validate and benchmark candidate solutions? How will you generate sample datasets?
18. Components of the Data Lake
• Processing tools
  – Hive: SQL-like query and analysis
  – Pig: transformation for big data
  – Sqoop: extracts data from external sources and loads it into Hadoop
  – MR/Spark: general-purpose cluster computing frameworks
  – Spark Streaming: on-the-fly ETL
• NoSQL
  – HBase
• Search
• Log streaming
  – Kafka/Scribe
  – Flume
  – Fluentd
• Languages
  – Python/Java/R/Scala
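As a sketch of how the ingestion piece fits together, the command below uses Sqoop to import a relational table into the raw zone of the lake; the JDBC URL, database, table name, and target path are all hypothetical:

```shell
# Pull a (hypothetical) orders table from MySQL into HDFS, split across 4 mappers.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --target-dir /data/raw/sales/orders \
  --num-mappers 4
```

Once the files land in HDFS, the same data can be exposed to Hive, Impala, or Spark without further copies.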
41. TPC-DS defines the table schemas used for benchmarking

[master:21000] > show databases;
Query: show databases
+------------------+
| name             |
+------------------+
| _impala_builtins |
| default          |
| tpcds            |
| tpcds_parquet    |
| tpcds_rcfile     |
+------------------+
Fetched 5 row(s) in 0.03s

[master:21000] > use tpcds;
Query: use tpcds

[master:21000] > show tables;
Query: show tables
+------------------------+
| name                   |
+------------------------+
| customer               |
| customer_address       |
| customer_demographics  |
| date_dim               |
| household_demographics |
| inventory              |
| item                   |
| promotion              |
| store                  |
| store_sales            |
| time_dim               |
+------------------------+
Fetched 11 row(s) in 0.01s
42. TPC-DS also defines the SQL queries used for benchmarking

-- start query 1 in stream 0 using template query27.tpl
select
  i_item_id,
  s_state, -- grouping(s_state) g_state,
  avg(ss_quantity) agg1,
  avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3,
  avg(ss_sales_price) agg4
from
  store_sales, customer_demographics, date_dim, store, item
where
  ss_sold_date_sk = d_date_sk
  and ss_item_sk = i_item_sk
  and ss_store_sk = s_store_sk
  and ss_cdemo_sk = cd_demo_sk
  and cd_gender = 'F'
  and cd_marital_status = 'W'
  and cd_education_status = 'Primary'
  and d_year = 1998
  and s_state in ('WI', 'CA', 'TX', 'FL', 'WA', 'TN')
  and ss_sold_date_sk between 2450815 and 2451 -- partition key filter
group by -- rollup (i_item_id, s_state)
  i_item_id, s_state
order by
  i_item_id, s_state
limit 100;
-- end query 1 in stream 0 using template quer
43. TPC-DS also provides a tool to generate data of a specified row count
Sample table: 3.8 GB, text format, uncompressed, 30 million rows

[master:21000] > show table stats customer;
Query: show table stats customer
+-------+--------+--------+--------------+--------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Format | Incremental stats |
+-------+--------+--------+--------------+--------+-------------------+
| -1    | 1      | 3.81GB | NOT CACHED   | TEXT   | false             |
+-------+--------+--------+--------------+--------+-------------------+
Fetched 1 row(s) in 0.00s

[master:21000] > select count(*) from customer;
Query: select count(*) from customer
+----------+
| count(*) |
+----------+
| 30000000 |
+----------+
Fetched 1 row(s) in 0.77s

Using the same dataset and the same queries makes it much easier to compare and evaluate different technologies.
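When the TPC-DS toolkit is not at hand, the idea of generating a repeatable sample dataset of a specified row count can be sketched in plain Python. The column names below are hypothetical, not the actual TPC-DS customer schema; the pipe-delimited text output mimics the flat-file format such generators typically emit:

```python
import csv
import io
import random

def generate_customers(n, seed=42):
    """Yield n synthetic customer rows; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    for i in range(1, n + 1):
        yield {
            "c_customer_sk": i,
            "c_first_name": rng.choice(["Ann", "Bob", "Eve", "Max"]),
            "c_birth_year": rng.randint(1940, 2000),
        }

# Write a pipe-delimited text file, ready to load into Hive or Impala.
buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["c_customer_sk", "c_first_name", "c_birth_year"],
    delimiter="|",
)
for row in generate_customers(5):
    writer.writerow(row)
print(buf.getvalue())
```

Fixing the random seed is the key design choice: every engine under evaluation sees byte-identical input, so timing differences reflect the engine, not the data.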