Kite SDK: HBase Datasets
Ryan Blue, Software Engineer
What problem is Kite solving?
©2014 Cloudera, Inc. All rights reserved.
• Accessibility
• Hadoop is flexible, but low level
• Should be easy to use, without being an expert
Kite SDK
• A set of off-the-shelf tools
• Based on experience and best practices
• Lets you focus on your problem
• Helps you solve new challenges
Kite Datasets: Motivation
Focus on using your data, not managing it
• You shouldn’t have to maintain data files
• This is the first thing you need
Kite Datasets: Motivation
[Diagram, built up over three slides. First: an Application (your code) sits on a Database (provided), and the Data files are maintained by the database. Second: a Hadoop application sits directly on Data files and HBase, so managing them falls to your code. Third: Kite Data sits between the application and the Data files and HBase, which are then maintained by Kite.]
Kite Datasets: Goals
• Think in terms of data, not files
• Describe your data and Kite does the right thing
• Should work consistently across the platform
• Reliable
Kite Datasets: Compatibility
Project      HDFS (Avro)   HDFS (Parquet)   HBase
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *
* Depends on a common HBase encoding format
Kite Datasets: What is it?
• A high-level API for data management
• Work with records and datasets
• Not files, directories, or byte arrays
• Standard descriptions for records and storage
• Schemas describe records
• Partition strategies describe layout
• Opinionated
Kite Datasets: Example
1. Describe your data
dataset obj-schema org.movielens.Rating --jar app.jar \
  --output rating.avsc
2. Describe your layout
dataset partition-config ts:year ts:month ts:day \
  --schema rating.avsc --output ymd.json
3. Create a dataset
dataset create ratings --schema rating.avsc \
  --partition-by ymd.json
Kite Datasets: Example
datasets/
└── ratings/
    ├── year=1997/
    │   ├── month=09/
    │   │   ├── day=20/
    │   │   ├── ...
    │   │   └── day=30/
    │   ├── month=10/
    │   │   ├── day=01/
    │   │   ├── ...
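The directory layout above can be read as a function from a record's timestamp to a path. The following is a minimal sketch of that idea, not Kite's implementation: the class and method names are hypothetical, UTC is assumed for the calendar fields, and the sample timestamp was chosen only because it falls on 20 September 1997.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Illustrative only: mirrors the year=/month=/day= directory layout. */
class PartitionPathSketch {
    /** Maps an epoch-millisecond timestamp to a year/month/day path (UTC is an assumption). */
    static String partitionPath(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        // Month and day are zero-padded to two digits, as in the listing.
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        // Hypothetical rating timestamp: 1997-09-20T03:05:10Z.
        System.out.println("datasets/ratings/" + partitionPath(874724710000L));
    }
}
```

Every record with a timestamp on the same day maps to the same directory, which is what lets readers skip whole directories when filtering by date.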
Kite SDK: HBase Datasets
Ryan Blue, Software Engineer
Kite HBase: Background
• Rows are identified by keys, managed by HBase
• Columns are organized as cells
• Cells are identified by column family and qualifier
• The catch: everything is a byte array
family           name                  ...
row key          last        first     ...
buzz@pixar.com   Lightyear   Buzz      ...
Kite HBase
• Uniform interaction with HBase and HDFS datasets
• Need to make keys from records
• Need configuration to map fields to cells
Kite HBase: Partitioning
• Use partition strategy to define unique keys
• Kite builds the key from each record
• Kite translates keys to HBase row id bytes
Kite HBase: Partitioning
• Partition strategy produces a storage key
• HDFS partitioning uses a group key
1403028411014 => (2014, 6, 17)
• HBase partitioning uses a unique key
• Grouping is done dynamically by HBase
1403028411014 => (1403028411014)
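The difference between the two kinds of storage key can be sketched in a few lines. This is an illustration under stated assumptions rather than Kite code: the class and method names are made up, and UTC is assumed when splitting the timestamp into calendar fields.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: contrasts a group key with a unique key. */
class StorageKeySketch {
    /** HDFS-style group key: many timestamps share one (year, month, day) key. */
    static List<Long> groupKey(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return Arrays.asList((long) t.getYear(), (long) t.getMonthValue(),
                (long) t.getDayOfMonth());
    }

    /** HBase-style unique key: the timestamp itself is the key. */
    static List<Long> uniqueKey(long epochMillis) {
        return Collections.singletonList(epochMillis);
    }

    public static void main(String[] args) {
        long a = 1403028411014L;
        long b = a + 1; // one millisecond later
        System.out.println(groupKey(a));                       // [2014, 6, 17]
        System.out.println(groupKey(a).equals(groupKey(b)));   // true: same group
        System.out.println(uniqueKey(a).equals(uniqueKey(b))); // false: distinct rows
    }
}
```

Two records one millisecond apart land in the same HDFS partition but in different HBase rows, which is why an HBase partition strategy has to produce a key that is unique per record.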
Kite HBase: Example partitioning
• Define key format from data
$ ./dataset partition-config --schema user.avsc \
    email:copy
[ {
  "source" : "email", "type" : "identity",
  "name" : "email_copy"
} ]
Kite HBase: Example partitioning
$ ./dataset partition-config --schema user.avsc \
    email:hash[16] email:copy
[ {
  "source" : "email", "type" : "hash",
  "buckets" : 16, "name" : "email_hash"
}, {
  "source" : "email", "type" : "identity",
  "name" : "email_copy"
} ]
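As a rough sketch of what the two-part strategy above produces, the code below builds a key of the form (email_hash, email_copy) for a record's email. It is not Kite's implementation: the names are hypothetical, and the use of `String.hashCode()` to pick a bucket is an assumption made for illustration, so the bucket numbers need not match what Kite would store.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a hash-bucket prefix followed by an identity copy of the field. */
class CompositeKeySketch {
    /** Maps a value to one of `buckets` buckets (String.hashCode is an assumed hash). */
    static int hashBucket(String value, int buckets) {
        // Mask the sign bit so the remainder is never negative.
        return (value.hashCode() & Integer.MAX_VALUE) % buckets;
    }

    /** Builds the key for "email:hash[16] email:copy": (email_hash, email_copy). */
    static List<Object> storageKey(String email) {
        return Arrays.<Object>asList(hashBucket(email, 16), email);
    }

    public static void main(String[] args) {
        System.out.println(storageKey("buzz@pixar.com"));
    }
}
```

The identity part keeps the key unique per email, while the hash prefix spreads lexicographically adjacent emails across 16 key ranges, a common way to avoid concentrating writes on one HBase region.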
Kite HBase: Partitioning
• Use partition strategy to define unique keys
• Kite builds the key from each record
• Kite translates keys to HBase row id bytes
• Some operations require keys
Kite HBase: Field mapping
• Configure the column family and qualifier for a field
{ "email": "buzz@pixar.com",
  "firstName": "Buzz", ... }
family           name                  ...
row key          last        first     ...
buzz@pixar.com   Lightyear   Buzz      ...
Kite HBase: Basic column mapping
column
{ "source": "firstName", "type": "column",
"family": "name", "qualifier": "first" }
Kite HBase: Counter mapping
counter (can be incremented)
{ "source": "visits", "type": "counter",
"family": "counts", "qualifier": "visits" }
Kite HBase: Key mapping
key (stored in the row key using identity)
{ "source": "email", "type": "key" }
{
"type" : "record",
"name" : "User",
"fields" : [ {
"name" : "email",
"type" : "string"
}, ... ]
}
Kite HBase: Example
Schema (the build slides show one field at a time; other fields are elided):
{
  "type" : "record",
  "name" : "User",
  "fields" : [
    { "name" : "email", "type" : "string" },
    { "name" : "lastName", "type" : "string" },
    { "name" : "visits", "type" : "long" },
    ...
  ]
}
Column mapping (one entry per field):
[
  { "source": "email", "type": "key" },
  { "source": "lastName", "type": "column",
    "family": "name", "qualifier": "last" },
  { "source": "visits", "type": "counter",
    "family": "counts", "qualifier": "visits" },
  ...
]
Resulting HBase row:
family           name                  counts   prefs
row key          last        first     visits   flash
buzz@pixar.com   Lightyear   Buzz      315      true
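To make the mapping concrete, here is a small sketch of how "column" entries could turn record fields into HBase-style cells addressed by family and qualifier. It is a simplified model, not Kite's code: the `Mapping` class and `toCells` method are hypothetical, records are modeled as string maps, and values are encoded as UTF-8 bytes purely for illustration (the actual encoding used by Kite is not stated in these slides).

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: applies "column" mappings to a record. */
class ColumnMappingSketch {
    /** One mapping entry: record field -> column family and qualifier. */
    static final class Mapping {
        final String source;
        final String family;
        final String qualifier;

        Mapping(String source, String family, String qualifier) {
            this.source = source;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /** Returns "family:qualifier" -> value bytes for each mapped field present in the record. */
    static Map<String, byte[]> toCells(Map<String, String> record, Mapping... mappings) {
        Map<String, byte[]> cells = new LinkedHashMap<>();
        for (Mapping m : mappings) {
            String value = record.get(m.source);
            if (value != null) {
                // UTF-8 is an assumed encoding; everything in HBase is a byte array.
                cells.put(m.family + ":" + m.qualifier,
                        value.getBytes(StandardCharsets.UTF_8));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("lastName", "Lightyear");
        user.put("firstName", "Buzz");
        Map<String, byte[]> cells = toCells(user,
                new Mapping("lastName", "name", "last"),
                new Mapping("firstName", "name", "first"));
        System.out.println(cells.keySet());
    }
}
```

Key and counter mappings are left out of the sketch: a key field goes into the row key rather than a cell, and a counter needs a numeric encoding that supports atomic increments.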
Kite HBase: Interaction
• Working with a dataset does not change when it is stored in HBase
• Readers / writers are backed by scans
• CLI tools work:
dataset csv-import pixar_users.csv users --use-hbase
• Additional methods on RandomAccessDataset: get, put, delete, increment
Kite HBase: Interaction using keys
RandomAccessDataset<User> users = ...;
// Build the key that identifies Buzz's row
Key buzzEmailKey = new Key.Builder()
    .add("email", "buzz@pixar.com")
    .build();
// Look up the record, change it, and write it back
User buzz = users.get(buzzEmailKey);
buzz.addPreference("flash", true);
users.put(buzz);
Kite HBase: More features
• Versioning and concurrency
  • Additional occVersion type, like a counter
  • Rejects a put if the record has changed
• Key-as-column mapping
  • Stores maps or records in a column family
  • Uses the key or field name as the qualifier
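The "rejects a put if the record has changed" behavior is a form of optimistic concurrency control. The sketch below shows the general technique with an in-memory map; it is not how Kite or HBase implements it, and every name in it (`OccStoreSketch`, `Versioned`, the `expectedVersion` parameter) is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a version check on put, backed by an in-memory map. */
class OccStoreSketch {
    /** A stored value together with the version it was written at. */
    static final class Versioned {
        final String value;
        final long version;

        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Versioned> rows = new HashMap<>();

    synchronized Versioned get(String key) {
        return rows.get(key);
    }

    /**
     * Writes only if the stored version still equals expectedVersion
     * (0 means "no row yet"). Returns false when another writer got there first.
     */
    synchronized boolean put(String key, String value, long expectedVersion) {
        Versioned current = rows.get(key);
        long currentVersion = (current == null) ? 0L : current.version;
        if (currentVersion != expectedVersion) {
            return false; // the record changed since it was read
        }
        rows.put(key, new Versioned(value, currentVersion + 1));
        return true;
    }

    public static void main(String[] args) {
        OccStoreSketch store = new OccStoreSketch();
        System.out.println(store.put("buzz@pixar.com", "v1", 0)); // true
        System.out.println(store.put("buzz@pixar.com", "v2", 0)); // false: stale version
        System.out.println(store.put("buzz@pixar.com", "v2", 1)); // true
    }
}
```

A caller whose put is rejected would typically re-read the record, reapply its change, and try again with the new version.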
Kite HBase: Conclusion
• Translation between objects and byte arrays in Kite
  • Configuration to define key format
  • Configuration to define how fields are stored
• Decreases the code and time required to experiment
  • Key format and column mappings are hard
  • Try out configurations to find the right one
Questions
Ryan Blue: blue@cloudera.com
Kite mailing list: cdk-dev@cloudera.org

More Related Content

What's hot

Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
Hyunsik Choi
 
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopBig Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Gruter
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
markgrover
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
markgrover
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
Todd Lipcon
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
markgrover
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
Jonathan Seidman
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
Scott Leberknight
 
What's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondWhat's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its Beyond
Gruter
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
DataWorks Summit/Hadoop Summit
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
DataWorks Summit/Hadoop Summit
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
markgrover
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposal
DataWorks Summit
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
DataWorks Summit
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
DataWorks Summit/Hadoop Summit
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Gruter
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
DataWorks Summit
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
 

What's hot (20)

Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopBig Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
What's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondWhat's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its Beyond
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposal
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 

Similar to Kite SDK: Working with Datasets

HBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDKHBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDK
HBaseCon
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the Basics
HBaseCon
 
HBaseCon 2014-Just the Basics
HBaseCon 2014-Just the BasicsHBaseCon 2014-Just the Basics
HBaseCon 2014-Just the Basics
Jesse Anderson
 
Build AWS CloudFormation Custom Resources (DEV417-R2) - AWS re:Invent 2018
Build AWS CloudFormation Custom Resources (DEV417-R2) - AWS re:Invent 2018Build AWS CloudFormation Custom Resources (DEV417-R2) - AWS re:Invent 2018
Build AWS CloudFormation Custom Resources (DEV417-R2) - AWS re:Invent 2018
Amazon Web Services
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
Amazon Web Services
 
Working with Terraform on Azure
Working with Terraform on AzureWorking with Terraform on Azure
Working with Terraform on Azure
tombuildsstuff
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
Alejandro Fernandez
 
EWD 3 Training Course Part 26: Event-driven Indexing
EWD 3 Training Course Part 26: Event-driven IndexingEWD 3 Training Course Part 26: Event-driven Indexing
EWD 3 Training Course Part 26: Event-driven Indexing
Rob Tweed
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
Alejandro Fernandez
 
Ambari Views - Overview
Ambari Views - OverviewAmbari Views - Overview
Ambari Views - Overview
Hortonworks
 
The Future of Securing Access Controls in Information Security
The Future of Securing Access Controls in Information SecurityThe Future of Securing Access Controls in Information Security
The Future of Securing Access Controls in Information Security
Amazon Web Services
 
Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure Databricks
Dustin Vannoy
 
Elegant Rest Design Webinar
Elegant Rest Design WebinarElegant Rest Design Webinar
Elegant Rest Design Webinar
Stormpath
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
Amazon Web Services
 
How I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine Yard
How I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine YardHow I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine Yard
How I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine Yard
SV Ruby on Rails Meetup
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
Sungmin Kim
 
AWS CloudFormation macros: Coding best practices - MAD201 - New York AWS Summit
AWS CloudFormation macros: Coding best practices - MAD201 - New York AWS SummitAWS CloudFormation macros: Coding best practices - MAD201 - New York AWS Summit
AWS CloudFormation macros: Coding best practices - MAD201 - New York AWS Summit
Amazon Web Services
 
Data freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWSData freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWS
Amazon Web Services
 

Similar to Kite SDK: Working with Datasets (20)

HBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDKHBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDK
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the Basics
 
HBaseCon 2014-Just the Basics
HBaseCon 2014-Just the BasicsHBaseCon 2014-Just the Basics
HBaseCon 2014-Just the Basics
 
Build AWS CloudFormation Custom Resources (DEV417-R2) - AWS re:Invent 2018
Build AWS CloudFormation Custom Resources (DEV417-R2) - AWS re:Invent 2018Build AWS CloudFormation Custom Resources (DEV417-R2) - AWS re:Invent 2018
Build AWS CloudFormation Custom Resources (DEV417-R2) - AWS re:Invent 2018
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
 
Working with Terraform on Azure
Working with Terraform on AzureWorking with Terraform on Azure
Working with Terraform on Azure
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 
EWD 3 Training Course Part 26: Event-driven Indexing
EWD 3 Training Course Part 26: Event-driven IndexingEWD 3 Training Course Part 26: Event-driven Indexing
EWD 3 Training Course Part 26: Event-driven Indexing
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 
Ambari Views - Overview
Ambari Views - OverviewAmbari Views - Overview
Ambari Views - Overview
 
The Future of Securing Access Controls in Information Security
The Future of Securing Access Controls in Information SecurityThe Future of Securing Access Controls in Information Security
The Future of Securing Access Controls in Information Security
 
Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure Databricks
 
Elegant Rest Design Webinar
Elegant Rest Design WebinarElegant Rest Design Webinar
Elegant Rest Design Webinar
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
How I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine Yard
How I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine YardHow I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine Yard
How I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine Yard
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
AWS CloudFormation macros: Coding best practices - MAD201 - New York AWS Summit
AWS CloudFormation macros: Coding best practices - MAD201 - New York AWS SummitAWS CloudFormation macros: Coding best practices - MAD201 - New York AWS Summit
AWS CloudFormation macros: Coding best practices - MAD201 - New York AWS Summit
 
Data freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWSData freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWS
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

Kite SDK: Working with Datasets

  • 1. Kite SDK: HBase Datasets Ryan Blue, Software Engineer
  • 2. What problem is Kite solving? ©2014 Cloudera, Inc. All rights reserved.
      • Accessibility
      • Hadoop is flexible, but low level
      • Should be easy to use, without being an expert
  • 3. Kite SDK
      • A set of off-the-shelf tools
      • Based on experience and best practices
      • Lets you focus on your problem
      • Helps you solve new challenges
  • 4. Kite Datasets: Motivation
      Focus on using your data, not managing it
      • You shouldn’t have to maintain data files
      • This is the first thing you need
  • 5. Kite Datasets: Motivation
      [Diagram labels: Application, Database, Data files; Your code, Provided, Maintained by the database]
  • 6. Kite Datasets: Motivation
      [Diagram labels: Application (x2), Database, Data files (x2), HBase; Your code]
  • 7. Kite Datasets: Motivation
      [Diagram labels: Application (x3), Database, Data files (x3), HBase (x2), Kite Data; Maintained by Kite]
  • 8. Kite Datasets: Goals
      • Think in terms of data, not files
      • Describe your data and Kite does the right thing
      • Should work consistently across the platform
      • Reliable
  • 9. Kite Datasets: Compatibility
      Project      HDFS (avro)   HDFS (parquet)   HBase
      Flume Sink   1.0           1.0              1.0
      MapReduce    1.0           1.0              1.0
      Crunch       1.0           1.0              1.0
      Hive         1.0           1.0              1.1
      Impala       1.0           1.0              *
      * depends on common HBase encoding format
  • 10. Kite Datasets: What is it?
      • A high-level API for data management
      • Work with records and datasets
      • Not files, directories, or byte arrays
      • Standard descriptions for records and storage
      • Schemas describe records
      • Partition strategies describe layout
      • Opinionated
  • 11. Kite Datasets: Example
      1. Describe your data
         dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
  • 12. Kite Datasets: Example
      1. Describe your data
         dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
      2. Describe your layout
         dataset partition-config ts:year ts:month ts:day --schema rating.avsc --output ymd.json
  • 13. Kite Datasets: Example
      1. Describe your data
         dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
      2. Describe your layout
         dataset partition-config ts:year ts:month ts:day --schema rating.avsc --output ymd.json
      3. Create a dataset
         dataset create ratings --schema rating.avsc --partition-by ymd.json
  • 14. Kite Datasets: Example
      datasets/
      └── ratings/
          ├── year=1997/
          │   ├── month=09/
          │   │   ├── day=20/
          │   │   ├── ...
          │   │   └── day=30/
          │   ├── month=10/
          │   │   ├── day=01/
          │   │   ├── ...
  • 15. Kite SDK: HBase Datasets Ryan Blue, Software Engineer
  • 16. Kite HBase: Background
      [Diagram labels: Application (x3), Database, Data files (x3), HBase (x2), Kite Data; Maintained by Kite]
  • 17. Kite HBase: Background
      • Rows identified by keys, managed by HBase
      • Columns are organized as cells
      • Cells are identified by column family, qualifier
      • The catch: everything is a byte array
      family           name                ...
      row key          last        first   ...
      buzz@pixar.com   Lightyear   Buzz    ...
  • 18. Kite HBase
      • Uniform interaction with HBase and HDFS datasets
      • Need to make keys from records
      • Need configuration to map fields to cells
  • 19. Kite HBase: Partitioning
      • Use partition strategy to define unique keys
      • Kite builds the key from each record
      • Kite translates keys to HBase row id bytes
  • 20. Kite HBase: Partitioning
      • Partition strategy produces a storage key
      • HDFS partitioning uses a group key
          1403028411014 => (2014, 6, 17)
      • HBase partitioning uses a unique key
      • Grouping is done dynamically by HBase
          1403028411014 => (1403028411014)
  • 21. Kite HBase: Example partitioning
      • Define key format from data
      $ ./dataset partition-config --schema user.avsc email:copy
  • 22. Kite HBase: Example partitioning
      • Define key format from data
      $ ./dataset partition-config --schema user.avsc email:copy
      [ {
        "source" : "email",
        "type" : "identity",
        "name" : "email_copy"
      } ]
  • 23. Kite HBase: Example partitioning
      $ ./dataset partition-config --schema user.avsc email:hash[16] email:copy
  • 24. Kite HBase: Example partitioning
      $ ./dataset partition-config --schema user.avsc email:hash[16] email:copy
      [ {
        "source" : "email",
        "type" : "hash",
        "buckets" : 16,
        "name" : "email_hash"
      }, {
        "source" : "email",
        "type" : "identity",
        "name" : "email_copy"
      } ]
  • 25. Kite HBase: Partitioning
      • Use partition strategy to define unique keys
      • Kite builds the key from each record
      • Kite translates keys to HBase row id bytes
      • Some operations require keys
  • 26. Kite HBase: Field mapping
      • Configure the column family and qualifier for a field
      { "email": "buzz@pixar.com", "firstName": "Buzz", ... }
      family           name                ...
      row key          last        first   ...
      buzz@pixar.com   Lightyear   Buzz    ...
  • 27. Kite HBase: Basic column mapping
      column
      { "source": "firstName", "type": "column", "family": "name", "qualifier": "first" }
  • 28. Kite HBase: Counter mapping
      column
      { "source": "firstName", "type": "column", "family": "name", "qualifier": "first" }
      counter (can be incremented)
      { "source": "visits", "type": "counter", "family": "counts", "qualifier": "visits" }
  • 29. Kite HBase: Key mapping
      key (stored in the row key using identity)
      { "source": "email", "type": "key" }
  • 30. Kite HBase: Example
      { "type" : "record", "name" : "User",
        "fields" : [ { "name" : "email", "type" : "string" }, ... ] }
  • 31. Kite HBase: Example
      { "type" : "record", "name" : "User",
        "fields" : [ { "name" : "email", "type" : "string" }, ... ] }
      [ { "source": "email", "type": "key" }, ... ]
  • 32. Kite HBase: Example
      { "type" : "record", "name" : "User",
        "fields" : [ { "name" : "email", "type" : "string" }, ... ] }
      [ { "source": "email", "type": "key" }, ... ]
      family           name                counts   prefs
      row key          last        first   visits   flash
      buzz@pixar.com   Lightyear   Buzz    315      true
  • 33. Kite HBase: Example
      { "type" : "record", "name" : "User",
        "fields" : [ { "name" : "lastName", "type" : "string" }, ... ] }
  • 34. Kite HBase: Example
      { "type" : "record", "name" : "User",
        "fields" : [ { "name" : "lastName", "type" : "string" }, ... ] }
      [ { "source": "lastName", "type": "column", "family": "name", "qualifier": "last" }, ... ]
  • 35. Kite HBase: Example
      { "type" : "record", "name" : "User",
        "fields" : [ { "name" : "lastName", "type" : "string" }, ... ] }
      [ { "source": "lastName", "type": "column", "family": "name", "qualifier": "last" }, ... ]
      family           name                counts   prefs
      row key          last        first   visits   flash
      buzz@pixar.com   Lightyear   Buzz    315      true
  • 36. Kite HBase: Example
      { "type" : "record", "name" : "User",
        "fields" : [ { "name" : "visits", "type" : "long" }, ... ] }
      [ { "source": "visits", "type": "counter", "family": "counts", "qualifier": "visits" }, ... ]
      family           name                counts   prefs
      row key          last        first   visits   flash
      buzz@pixar.com   Lightyear   Buzz    315      true
  • 37. Kite HBase: Interaction
      • Working with a dataset in HBase does not change
      • Readers / writers are backed by scans
      • CLI tools work:
          dataset csv-import pixar_users.csv users --use-hbase
      • Additional methods on RandomAccessDataset
      • get, put, delete, increment
  • 38. Kite HBase: Interaction using keys
      RandomAccessDataset<User> users = ...;
      Key buzzEmailKey = new Key.Builder()
          .add("email", "buzz@pixar.com")
          .build();
      User buzz = users.get(buzzEmailKey);
      buzz.addPreference("flash", true);
      users.put(buzz);
  • 39. Kite HBase: More features
      • Versioning and concurrency
      • Additional occVersion type, like a counter
      • Rejects a put if the record has changed
      • Key-as-column mapping
      • Stores maps or records in a column family
      • Uses the key or field name as the qualifier
  • 40. Kite HBase: Conclusion
      • Translation between objects and byte arrays in Kite
      • Configuration to define key format
      • Configuration to define how fields are stored
      • Decreases the code and time required to experiment
      • Key format and column mappings are hard
      • Try out configurations to find the right one
  • 41. Questions
      Ryan Blue: blue@cloudera.com
      Kite mailing list: cdk-dev@cloudera.org
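
The partitioning slides (19 to 24) describe how a partition strategy turns record fields into a storage key: a group key such as (year, month, day) for HDFS, and a unique key such as a hash bucket followed by the email for HBase. The following is a minimal Java sketch of that idea, not Kite's implementation. The class `PartitionKeySketch`, its method names, the UTC time zone, the `/` separator, and the use of `String.hashCode()` are all assumptions made for illustration; Kite's actual key encoding may differ.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.List;

public class PartitionKeySketch {

    // Group key (HDFS style): many records share one (year, month, day) value.
    // Assumption: the timestamp is interpreted in UTC.
    static List<Integer> groupKey(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return Arrays.asList(t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    // Hash bucket in [0, buckets): spreads keys across a fixed number of ranges.
    // Assumption: String.hashCode() stands in for whatever hash Kite applies.
    static int hashBucket(String value, int buckets) {
        return (value.hashCode() & Integer.MAX_VALUE) % buckets;
    }

    // Unique key (HBase style): bucket prefix plus an identity copy of the field,
    // mirroring the "email:hash[16] email:copy" strategy from the slides.
    static String rowKey(String email, int buckets) {
        return hashBucket(email, buckets) + "/" + email;
    }

    public static void main(String[] args) {
        System.out.println(groupKey(1403028411014L)); // prints [2014, 6, 17]
        System.out.println(rowKey("buzz@pixar.com", 16));
    }
}
```

The group key collapses every timestamp on the same day into one partition, while the row key stays unique per email; the bucket prefix only changes where the row sorts.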
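
The field-mapping slides (26 to 36) configure a column family and qualifier for each record field, with values ultimately stored as byte arrays. The sketch below illustrates that translation with plain Java collections rather than the HBase or Kite APIs. The `ColumnMappingSketch` and `Column` names, the `family:qualifier` string used as a cell identifier, and UTF-8 string encoding are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnMappingSketch {

    // Hypothetical mapping entry: the column family and qualifier that hold a field.
    static final class Column {
        final String family;
        final String qualifier;

        Column(String family, String qualifier) {
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    // Translate a record (field name -> value) into cells keyed by "family:qualifier".
    // Fields without a mapping entry (for example the key field) are skipped here.
    static Map<String, byte[]> toCells(Map<String, String> record, Map<String, Column> mapping) {
        Map<String, byte[]> cells = new LinkedHashMap<>();
        for (Map.Entry<String, Column> entry : mapping.entrySet()) {
            String value = record.get(entry.getKey());
            if (value != null) {
                Column column = entry.getValue();
                cells.put(column.family + ":" + column.qualifier,
                          value.getBytes(StandardCharsets.UTF_8));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, Column> mapping = new LinkedHashMap<>();
        mapping.put("lastName", new Column("name", "last"));
        mapping.put("firstName", new Column("name", "first"));

        Map<String, String> record = new LinkedHashMap<>();
        record.put("email", "buzz@pixar.com");
        record.put("firstName", "Buzz");
        record.put("lastName", "Lightyear");

        for (Map.Entry<String, byte[]> cell : toCells(record, mapping).entrySet()) {
            System.out.println(cell.getKey() + " = "
                + new String(cell.getValue(), StandardCharsets.UTF_8));
        }
    }
}
```

Keeping the mapping as data rather than code is what lets different layouts be tried without rewriting the application, which is the point made on the conclusion slide.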
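
Slide 39 mentions an occVersion type that rejects a put if the record has changed since it was read. The general technique is optimistic concurrency control, sketched below against an in-memory map. This is not how Kite or HBase implement it; the `OccSketch` class, the `Versioned` holder, and the convention that a missing row has version 0 are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class OccSketch {

    // A stored value together with the version at which it was written.
    static final class Versioned {
        final String value;
        final long version;

        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Versioned> rows = new HashMap<>();

    synchronized Versioned get(String key) {
        return rows.get(key);
    }

    // Accept the write only if the stored version still equals the version the
    // caller read; otherwise another writer got there first and the put is rejected.
    // Assumption: a row that does not exist yet has version 0.
    synchronized boolean put(String key, String value, long expectedVersion) {
        Versioned current = rows.get(key);
        long currentVersion = (current == null) ? 0L : current.version;
        if (currentVersion != expectedVersion) {
            return false;
        }
        rows.put(key, new Versioned(value, currentVersion + 1));
        return true;
    }

    public static void main(String[] args) {
        OccSketch store = new OccSketch();
        System.out.println(store.put("buzz@pixar.com", "Buzz", 0L));  // prints true
        System.out.println(store.put("buzz@pixar.com", "Woody", 0L)); // prints false (stale version)
    }
}
```

A caller whose put is rejected would typically re-read the record, reapply its change, and retry with the new version.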