This is the hands-on lab document I created to accompany my presentation at the Information On Demand 2013 conference for Session Number 1687 - Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL.
*Contact me for data files*
This lab has three independent parts:
Part I - Creating Big SQL Tables and Loading Data
(explores different ways to create and load HBase tables with Big SQL; includes an optional section on HBase access via JAQL)
Part II - Query Handling
(how to query HBase tables with Big SQL)
Part III - Connecting to Big SQL Server via JDBC
(using BIRT, a business intelligence and reporting tool, to run a simple report on a tpch orders table showcasing use of the BigSQL JDBC driver)
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Hands-on Lab
Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Session Number 1687
Piotr Pruski, IBM, piotr.pruski@ca.ibm.com (@ppruski)
Benjamin Leonhardi, IBM
Table of Contents
Lab Setup ............................................................................................................................ 3
Getting Started .................................................................................................................... 3
Administering the Big SQL and HBase Servers................................................................. 4
Part I – Creating Big SQL Tables and Loading Data ......................................................... 6
Background ..................................................................................................................... 6
One-to-one Mapping....................................................................................................... 9
Adding New JDBC Drivers ...................................................................................... 11
One-to-one Mapping with UNIQUE Clause................................................................. 13
Many-to-one Mapping (Composite Keys and Dense Columns)................................... 16
Why do we need many-to-one mapping? ................................................................. 17
Data Collation Problem............................................................................................. 19
Many-to-one Mapping with Binary Encoding.............................................................. 20
Many-to-one Mapping with HBase Pre-created Regions and External Tables ............ 22
Load Data: Error Handling ........................................................................................... 26
[OPTIONAL] HBase Access via JAQL ....................................................................... 27
PART II – A – Query Handling........................................................................................ 31
The Data........................................................................................................................ 31
Projection Pushdown .................................................................................................... 33
Predicate Pushdown ...................................................................................................... 34
Point Scan ................................................................................................................. 34
Partial Row Scan....................................................................................................... 35
Range Scan................................................................................................................ 35
Full Table Scan ......................................................................................................... 36
Automatic Index Usage................................................................................................. 37
Pushing Down Filters into HBase................................................................................. 38
Table Access Hints ....................................................................................................... 39
Accessmode .............................................................................................................. 39
PART II – B – Connecting to Big SQL Server via JDBC ................................................ 40
Business Intelligence and Reporting via BIRT............................................................. 41
Communities ..................................................................................................................... 48
Thank You! ....................................................................................................................... 48
Acknowledgements and Disclaimers................................................................................ 49
Lab Setup
This lab exercise uses the IBM InfoSphere BigInsights Quick Start Edition, v2.1. The Quick
Start Edition uses a non-warranted program license, and is not for production use.
The purpose of the Quick Start Edition is for experimenting with the features of InfoSphere
BigInsights, while being able to use real data and run real applications. The Quick Start
Edition puts no data limit on the cluster and there is no time limit on the license.
The following table outlines the users and passwords that are pre-configured on the image:
username     password
root         password
biadmin      biadmin
db2inst1     password
Getting Started
To prepare for the contents of this lab, you must go through the following process to start all
of the Hadoop components.
1. Start the VMware image by clicking the “Power on this virtual machine” button in
VMware Workstation if the VM is not already on.
2. Log into the VMware virtual machine using the following information
user: biadmin
password: biadmin
3. Double-click on the BigInsights Shell folder icon from the desktop of the Quick Start
VM. This view provides you with quick links to access the following functions that will be
used throughout the course of this exercise:
Big SQL Shell
HBase Shell
Jaql Shell
Linux gnome-terminal
4. Open the Terminal (gnome-terminal) and start the Hadoop components (daemons).
Linux Terminal
start-all.sh
Note: This command may take a few minutes to finish.
Once all components have started successfully as shown below you may move to the next
section.
…
[INFO] Progress - 100%
[INFO] DeployManager - Start; SUCCEEDED components: [zookeeper, hadoop, derby, hive,
hbase, bigsql, oozie, orchestrator, console, httpfs]; Consumes : 174625ms
Administering the Big SQL and HBase Servers
BigInsights provides both command-line tools and a user interface to manage the Big SQL
and HBase servers. In this section, we will briefly go over the user interface which is part of
BigInsights Web Console.
1. Bring up the BigInsights web console by double clicking on the BigInsights
WebConsole icon on the desktop of the VM and open the Cluster Status tab. Select
HBase to view the status of HBase master and region servers.
2. Similarly, click on Big SQL from the same tab to view its status.
3. Use the hbase-master and hbase-regionserver web interfaces to visualize tables, regions
and other metrics. Go to the BigInsights Welcome tab and select “Access Secure
Cluster Servers.” You may need to enable pop-ups from the site when prompted.
Alternatively, point your browser to the bottom two URLs noted in the image below.
Some of the interesting information available from the web interfaces includes:
HBase root directory
• This can be used to find the size of an HBase table.
List of tables with descriptions.
Each table displays lists of regions with start and end keys.
• This information can be used to compact or split tables as needed.
Metrics for each region server.
• These can be used to determine if there are hot regions which are serving
the majority of requests to a table. Such regions can be split. It also helps
determine the effects and effectiveness of block cache, bloom filters and
memory settings.
4. Perform a health check of HBase and Big SQL. Unlike the status checks done above, this
verifies that the services are actually functioning. From the Linux gnome-terminal,
issue the following commands.
Linux Terminal
$BIGINSIGHTS_HOME/bin/healthcheck.sh hbase
[INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ]
[INFO] Progress - Health check hbase
[INFO] Deployer - Try to start hbase if hbase service is stopped...
[INFO] Deployer - Double check whether hbase is started successfully...
[INFO] @bivm - hbase-master(active) started, pid 6627
[INFO] @bivm - hbase-regionserver started, pid 6745
[INFO] Deployer - hbase service started
[INFO] Deployer - hbase service is healthy
[INFO] Progress - 100%
[INFO] DeployManager - Health check; SUCCEEDED components: [hbase]; Consumes :
26335ms
Linux Terminal
$BIGINSIGHTS_HOME/bin/healthcheck.sh bigsql
[INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ]
[INFO] Progress - Health check bigsql
[INFO] @bivm - bigsql-server already running, pid 6949
[INFO] Deployer - Ping Check Success: bivm/192.168.230.137:7052
[INFO] @bivm - bigsql is healthy
[INFO] Progress - 100%
[INFO] DeployManager - Health check; SUCCEEDED components: [bigsql]; Consumes : 1121ms
Part I – Creating Big SQL Tables and Loading Data
In this part of the lab, our main goal is to demonstrate the migration of a table from a relational
database to BigInsights using Big SQL over HBase. We will see how HBase handles
row keys and some pitfalls that users may encounter when moving data from a relational
database to HBase tables. We will also try some useful options, such as pre-creating regions,
to see how they can help with data loading and queries. Finally, we will explore various ways
to load data.
Background
In this lab, we will use one table from the Great Outdoors Sales Data Warehouse model
(GOSALESDW), SLS_SALES_FACT.
The details of the table, along with its primary key information, are depicted in the figure
below.
SLS_SALES_FACT
  Primary key columns: ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
                       RETAILER_SITE_KEY, PROMOTION_KEY, ORDER_METHOD_KEY
  Other columns:       SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY, QUANTITY, UNIT_COST,
                       UNIT_PRICE, UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT
There is an instance of DB2 on this image that contains this table, with data already loaded,
which we will use in our migration.
From the Linux gnome-terminal, switch to the DB2 instance user as shown below.
Linux Terminal
su - db2inst1
Note: The password for the db2inst1 is password. Enter this when prompted.
As db2inst1, connect to the pre-created database, gosales.
Linux Terminal
db2 CONNECT TO gosales
Upon successful connection, you should see the following output on the terminal.
   Database Connection Information

 Database server        = DB2/LINUXX8664 10.5.0
 SQL authorization ID   = DB2INST1
 Local database alias   = GOSALES
Issue the following command to list all of the tables contained in this database.
Linux Terminal
db2 LIST TABLES
Note: Here you will see three tables. Each one is essentially the same except with one key difference –
the amount of data that is contained within them. The remaining instructions in this lab exercise will use the
SLS_SALES_FACT_10P table simply for the fact that it has a smaller amount of data and will be faster to
work with for demonstration purposes. If you would like to use the larger tables with more data feel free to
do so but just remember to change the names appropriately.
Table/View                      Schema          Type  Creation time
------------------------------- --------------- ----- --------------------------
SLS_SALES_FACT                  DB2INST1        T     2013-08-22-14.51.27.228148
SLS_SALES_FACT_10P              DB2INST1        T     2013-08-22-14.54.01.622569
SLS_SALES_FACT_25P              DB2INST1        T     2013-08-22-14.55.46.416787

  3 record(s) selected.
Examine how many rows we have in this table so that we can later confirm that everything was
migrated properly. Issue the following select statement.
Linux Terminal
db2 "SELECT COUNT(*) FROM sls_sales_fact_10p"
You should expect 44603 rows in this table.
1
-----------
      44603

  1 record(s) selected.
Use the following describe command to view all of the columns and data types that are
contained within this table.
Linux Terminal
db2 "DESCRIBE TABLE sls_sales_fact_10p"
                                Data type                     Column
Column name                     schema    Data type name      Length     Scale Nulls
------------------------------- --------- ------------------- ---------- ----- -----
ORDER_DAY_KEY                   SYSIBM    INTEGER                      4     0 Yes
ORGANIZATION_KEY                SYSIBM    INTEGER                      4     0 Yes
EMPLOYEE_KEY                    SYSIBM    INTEGER                      4     0 Yes
RETAILER_KEY                    SYSIBM    INTEGER                      4     0 Yes
RETAILER_SITE_KEY               SYSIBM    INTEGER                      4     0 Yes
PRODUCT_KEY                     SYSIBM    INTEGER                      4     0 Yes
PROMOTION_KEY                   SYSIBM    INTEGER                      4     0 Yes
ORDER_METHOD_KEY                SYSIBM    INTEGER                      4     0 Yes
SALES_ORDER_KEY                 SYSIBM    INTEGER                      4     0 Yes
SHIP_DAY_KEY                    SYSIBM    INTEGER                      4     0 Yes
CLOSE_DAY_KEY                   SYSIBM    INTEGER                      4     0 Yes
QUANTITY                        SYSIBM    INTEGER                      4     0 Yes
UNIT_COST                       SYSIBM    DECIMAL                     19     2 Yes
UNIT_PRICE                      SYSIBM    DECIMAL                     19     2 Yes
UNIT_SALE_PRICE                 SYSIBM    DECIMAL                     19     2 Yes
GROSS_MARGIN                    SYSIBM    DOUBLE                       8     0 Yes
SALE_TOTAL                      SYSIBM    DECIMAL                     19     2 Yes
GROSS_PROFIT                    SYSIBM    DECIMAL                     19     2 Yes

  18 record(s) selected.
One-to-one Mapping
In this section, we will use Big SQL to do a one-to-one mapping of the columns in the
relational DB2 table to an HBase table row key and columns. This is not a recommended
approach; however, the goal of this exercise is to demonstrate the inefficiency and pitfalls
that can occur with such a mapping.
Big SQL supports both one-to-one and many-to-one mappings.
In a one-to-one mapping, the HBase row key and each HBase column are mapped to a single
SQL column. In the following example, the HBase row key is mapped to the SQL column id.
Similarly, the cq_name column within the cf_data column family is mapped to the SQL
column name, and so on.
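For illustration, a minimal sketch of such a one-to-one mapping in Big SQL DDL is shown below.
The table and column names (id, name, age) are placeholders only and are not used elsewhere
in this lab.

-- illustrative sketch only; names and types are placeholders
CREATE HBASE TABLE example.simple_table
(
  id   int,
  name varchar(40),
  age  int
)
COLUMN MAPPING
(
  key             mapped by (id),
  cf_data:cq_name mapped by (name),
  cf_data:cq_age  mapped by (age)
);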
To begin, first create a schema to keep our tables organized. Open the BigSQL Shell from
the BigInsights Shell folder on the desktop and use the create schema command to create a
schema named gosalesdw.
BigSQL Shell
CREATE SCHEMA gosalesdw;
Issue the following command in the same BigSQL shell that is open. This DDL statement will
create the SQL table with the one-to-one mapping of what we have in our relational DB2
source. Notice all the column names are the same with the same data types. The column
mapping section requires a mapping for the row key. HBase columns are identified using
family:qualifier.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT
(
  ORDER_DAY_KEY     int,
  ORGANIZATION_KEY  int,
  EMPLOYEE_KEY      int,
  RETAILER_KEY      int,
  RETAILER_SITE_KEY int,
  PRODUCT_KEY       int,
  PROMOTION_KEY     int,
  ORDER_METHOD_KEY  int,
  SALES_ORDER_KEY   int,
  SHIP_DAY_KEY      int,
  CLOSE_DAY_KEY     int,
  QUANTITY          int,
  UNIT_COST         decimal(19,2),
  UNIT_PRICE        decimal(19,2),
  UNIT_SALE_PRICE   decimal(19,2),
  GROSS_MARGIN      double,
  SALE_TOTAL        decimal(19,2),
  GROSS_PROFIT      decimal(19,2)
)
COLUMN MAPPING
(
  key                          mapped by (ORDER_DAY_KEY),
  cf_data:cq_ORGANIZATION_KEY  mapped by (ORGANIZATION_KEY),
  cf_data:cq_EMPLOYEE_KEY      mapped by (EMPLOYEE_KEY),
  cf_data:cq_RETAILER_KEY      mapped by (RETAILER_KEY),
  cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
  cf_data:cq_PRODUCT_KEY       mapped by (PRODUCT_KEY),
  cf_data:cq_PROMOTION_KEY     mapped by (PROMOTION_KEY),
  cf_data:cq_ORDER_METHOD_KEY  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_SALES_ORDER_KEY   mapped by (SALES_ORDER_KEY),
  cf_data:cq_SHIP_DAY_KEY      mapped by (SHIP_DAY_KEY),
  cf_data:cq_CLOSE_DAY_KEY     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY          mapped by (QUANTITY),
  cf_data:cq_UNIT_COST         mapped by (UNIT_COST),
  cf_data:cq_UNIT_PRICE        mapped by (UNIT_PRICE),
  cf_data:cq_UNIT_SALE_PRICE   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_GROSS_MARGIN      mapped by (GROSS_MARGIN),
  cf_data:cq_SALE_TOTAL        mapped by (SALE_TOTAL),
  cf_data:cq_GROSS_PROFIT      mapped by (GROSS_PROFIT)
);
Big SQL supports a load from source command that can be used to load data from
warehouse sources, which we will use first. It also supports loading data from delimited files
using a load hbase command, which we will use later.
Adding New JDBC Drivers
The load from source command uses Sqoop internally to do the load. Therefore, before
using the load command from a BigSQL shell, we first need to add the driver for the JDBC
source into 1) the Sqoop library directory, and 2) the JSQSH terminal shared directory.
From a Linux gnome-terminal, issue the following command (as biadmin) to copy the JDBC
driver JAR file used to access the database into the $SQOOP_HOME/lib directory.
Linux Terminal
cp /opt/ibm/db2/V10.5/java/db2jcc.jar $SQOOP_HOME/lib
From the BigSQL shell, examine the drivers currently loaded for the JSQSH terminal.
BigSQL Shell
drivers
Terminate the BigSQL shell with the quit command.
BigSQL Shell
quit
Copy the same DB2 driver to the JSQSH share directory with the following command.
Linux Terminal
cp /opt/ibm/db2/V10.5/java/db2jcc.jar
$BIGINSIGHTS_HOME/bigsql/jsqsh/share/
When a user adds new drivers, the Big SQL server must be restarted. You could do this
either from the web console, or use the following command from the Linux gnome-terminal.
Linux Terminal
stop.sh bigsql && start.sh bigsql
Open the BigSQL Shell from the BigInsights Shell folder on the desktop once again (it was
closed in an earlier step with the quit command) and check whether the driver was in fact
loaded into JSQSH.
BigSQL Shell
drivers
Now that the drivers have been set, the load can finally take place. The load from
source statement extracts data from a source outside of an InfoSphere BigInsights cluster
(DB2 in this case) and loads that data into an InfoSphere BigInsights HBase (or Hive) table.
Issue the following command to load the SLS_SALES_FACT_10P table from DB2 into the
SLS_SALES_FACT table we have defined in BigSQL.
BigSQL Shell
LOAD USING JDBC CONNECTION URL 'jdbc:db2://localhost:50000/GOSALES'
WITH PARAMETERS (user = 'db2inst1',password = 'password') FROM TABLE
SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY INTO HBASE TABLE
gosalesdw.sls_sales_fact APPEND;
You should expect to load 44603 rows which is the same number of rows that the select
count statement on the original DB2 table verified earlier.
44603 rows affected (total: 1m37.74s)
Try to verify this with a select count statement as shown.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact;
Notice there is a discrepancy between the results from the load operation and the select
count statement.
+----+
|    |
+----+
| 33 |
+----+
1 row in results(first row: 3.13s; total: 3.13s)
Also verify from an HBase shell. Open the HBase Shell from the BigInsights Shell folder on
the desktop and issue the following count command to verify the number of rows.
HBase Shell
count 'gosalesdw.sls_sales_fact'
It should be apparent that the results from the Big SQL statement and HBase commands
conform to one another.
33 row(s) in 0.7000 seconds
However, this doesn’t yet explain why there is a mismatch between the number of loaded
rows and the number of retrieved rows when we query the table.
The load (and insert -- to be examined later) command behaves like an upsert: if a
row with the same row key exists, HBase writes the new value as a new version for that
column/cell. When querying the table, only the latest value is returned by Big SQL.
In many cases, this behaviour can be confusing. In our case, we loaded data with
repeating row-key values from a DB2 table with 44603 rows, and the load reported
44603 rows affected. However, the select count(*) showed fewer rows; 33 to be exact. No
errors are thrown in such scenarios, so it is always recommended to cross-check the
number of rows by querying the table as we did.
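If you want to see where the number 33 comes from, you can optionally count the distinct
row-key values in the source table. This step is not part of the original lab flow; it assumes
the db2inst1 connection to the gosales database opened earlier.
Linux Terminal
db2 "SELECT COUNT(DISTINCT order_day_key) FROM sls_sales_fact_10p"
Since each distinct ORDER_DAY_KEY becomes exactly one HBase row key, the result should match
the row count returned by the select count(*) above.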
Now that we understand that all the rows are actually versioned in HBase, we can examine a
possible way to retrieve all versions of a particular row.
First, from the BigSQL shell, issue the following select query with a predicate on the order
day key. In the original table, there are most likely many tuples with the same order day key.
BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact WHERE
order_day_key = 20070720;
As expected, we only retrieve one row, which is the latest or newest version of the row
inserted into HBase with the specified order day key.
+------------------+
| organization_key |
+------------------+
|            11171 |
+------------------+
Using the HBase shell, we can retrieve previous versions for a row key. Use the following
get command to get the top 4 versions of the row with row key 20070720.
HBase Shell
get 'gosalesdw.sls_sales_fact', '20070720', {COLUMN =>
'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
Since the previous command specified only 4 versions (VERSIONS => 4), we only retrieve
4 rows in the output.
COLUMN                        CELL
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546430, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546429, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546428, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546427, value=11171
4 row(s) in 0.0360 seconds
Optionally try the same command again specifying a larger version number. For example,
VERSIONS => 100.
Either way, this is most likely not the behaviour users expect when performing such a
migration. They probably wanted to get all the data into the HBase table
without versioned cells. There are a couple of solutions for this. One is to define the table
with a composite row key to enforce uniqueness, which will be explored later in this lab.
Another option, outlined in the next section, is to force each row key to be unique by
appending a UUID.
One-to-one Mapping with UNIQUE Clause
Another option while performing such a migration is to use the force key unique option
when creating the table using BigSQL syntax. This option will force the load to add a UUID to
the row key. It helps to prevent versioning of cells. However, this method is quite inefficient
as it stores more data and also makes queries slower.
Issue the following command in the BigSQL shell. This statement will create the SQL table
with the one-to-one mapping of what we have in our relational DB2 source. This DDL
statement is almost identical to what was seen in the previous section with one exception:
the force key unique clause is specified for the column mapping of the row key.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_UNIQUE
(
  ORDER_DAY_KEY     int,
  ORGANIZATION_KEY  int,
  EMPLOYEE_KEY      int,
  RETAILER_KEY      int,
  RETAILER_SITE_KEY int,
  PRODUCT_KEY       int,
  PROMOTION_KEY     int,
  ORDER_METHOD_KEY  int,
  SALES_ORDER_KEY   int,
  SHIP_DAY_KEY      int,
  CLOSE_DAY_KEY     int,
  QUANTITY          int,
  UNIT_COST         decimal(19,2),
  UNIT_PRICE        decimal(19,2),
  UNIT_SALE_PRICE   decimal(19,2),
  GROSS_MARGIN      double,
  SALE_TOTAL        decimal(19,2),
  GROSS_PROFIT      decimal(19,2)
)
COLUMN MAPPING
(
  key                          mapped by (ORDER_DAY_KEY) force key unique,
  cf_data:cq_ORGANIZATION_KEY  mapped by (ORGANIZATION_KEY),
  cf_data:cq_EMPLOYEE_KEY      mapped by (EMPLOYEE_KEY),
  cf_data:cq_RETAILER_KEY      mapped by (RETAILER_KEY),
  cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
  cf_data:cq_PRODUCT_KEY       mapped by (PRODUCT_KEY),
  cf_data:cq_PROMOTION_KEY     mapped by (PROMOTION_KEY),
  cf_data:cq_ORDER_METHOD_KEY  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_SALES_ORDER_KEY   mapped by (SALES_ORDER_KEY),
  cf_data:cq_SHIP_DAY_KEY      mapped by (SHIP_DAY_KEY),
  cf_data:cq_CLOSE_DAY_KEY     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY          mapped by (QUANTITY),
  cf_data:cq_UNIT_COST         mapped by (UNIT_COST),
  cf_data:cq_UNIT_PRICE        mapped by (UNIT_PRICE),
  cf_data:cq_UNIT_SALE_PRICE   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_GROSS_MARGIN      mapped by (GROSS_MARGIN),
  cf_data:cq_SALE_TOTAL        mapped by (SALE_TOTAL),
  cf_data:cq_GROSS_PROFIT      mapped by (GROSS_PROFIT)
);
In the previous section, we used the load from source command to get the data from
our table on the DB2 source into HBase. This may not always be feasible, which is why in this
section we explore another loading statement, load hbase. This will load data into HBase
using flat files – perhaps an export of the data from the relational source.
Issue the following statement which will load data from a file into an InfoSphere BigInsights
HBase table.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.sls_sales_fact_unique;
Note: The load hbase command can take in an optional list of columns. If no column list is specified, it
will use the column ordering in the table definition. The input file can be on DFS or on the local file system
where the Big SQL server is running.
Once again, you should expect to load 44603 rows which is the same number of rows that
the select count statement on the original DB2 table verified.
44603 rows affected (total: 26.95s)
Verify the number of rows loaded with a select count statement as shown.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;
This time there is no discrepancy between the results from the load operation and the select
count statement.
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 1.61s; total: 1.61s)
Issue the same count from the HBase shell to be sure.
HBase Shell
count 'gosalesdw.sls_sales_fact_unique'
The values are consistent across the load, select, and count operations.
...
44603 row(s) in 6.8490 seconds
As in the previous section, from the BigSQL shell, issue the following select query with a
predicate on the order day key.
BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact_unique WHERE
order_day_key = 20070720;
In the previous section, only one row was returned for the specified date. This time, expect to
see 1405 rows since the rows are now forced to be unique due to our clause in the create
statement and therefore no versioning should be applied.
1405 rows in results(first row: 0.47s; total: 0.58s)
Once again, as in the previous section, we can check from the HBase shell if there are
multiple versions of the cells. Issue the following get statement to attempt to retrieve the top
4 versions of the row with row key 20070720.
HBase Shell
get 'gosalesdw.sls_sales_fact_unique', '20070720', {COLUMN =>
'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
Zero rows are returned because the row key 20070720 does not exist. This is because a
UUID has been appended to each row key (20070720 + UUID).
COLUMN                        CELL
0 row(s) in 0.0850 seconds
Therefore, instead, issue the following HBase command to do a scan rather than a get. This will
scan the table using the first part of the row key. We also specify start and stop row
values so that the scanner only returns the results we are interested in retrieving.
HBase Shell
scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '20070720',
STOPROW => '20070721'}
Notice there are no discrepancies between the results from Big SQL select and HBase scan.
1405 row(s) in 12.1350 seconds
Many-to-one Mapping (Composite Keys and Dense Columns)
This section is dedicated to the other option for enforcing uniqueness of the cells: defining
a table with a composite row key (also known as many-to-one mapping).
In a many-to-one mapping, multiple SQL columns are mapped to a single HBase entity (row
key or a column). There are two terms that may be used frequently: composite key and
dense column. A composite key is an HBase row key that is mapped to multiple SQL
columns. A dense column is an HBase column that is mapped to multiple SQL columns.
In the following example, the row key contains two parts – userid and account number. Each
part corresponds to a SQL column. Similarly, the HBase columns are mapped to multiple
SQL columns. Note that we can have a mix. For example, we can have a composite key, a
dense column and a non-dense column or any mix of these.
[Figure: an example HBase row illustrating a composite key and dense columns. The row key
11111_ac11 is built from the SQL columns userid and acc_no. In the column family cf_data,
the dense column cq_names holds fname1_lname1 and maps to the SQL columns first_name and
last_name, while the dense column cq_acct holds 11111#11#0.25 and maps to account-related
SQL columns (balance, minimum balance, interest).]
Issue the following DDL statement from the BigSQL shell, which represents all entities from
our relational table using a many-to-one mapping. Take notice of the column mapping
section, where multiple columns can be mapped to a single family:qualifier.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY
int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int,
ORDER_METHOD_KEY int,
SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int,
QUANTITY int,
UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE
decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2),
GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
key
mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY,
ORDER_METHOD_KEY),
cf_data:cq_OTHER_KEYS
mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY,
CLOSE_DAY_KEY),
cf_data:cq_QUANTITY
mapped by (QUANTITY),
cf_data:cq_DOLLAR_VALUES
mapped by (UNIT_COST, UNIT_PRICE,
UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
);
Why do we need many-to-one mapping?
HBase stores a lot of information for each value. For each value stored, a key consisting of
the row key, column family name, column qualifier, and timestamp is also stored. This
means a lot of duplicate information is kept.
HBase is very verbose and is primarily intended for sparse data. In most cases, data in the
relational world is not sparse. If we were to store each SQL column individually in HBase, as
in our previous two sections, the required storage space would grow dramatically. When
querying that data back, the query also returns the entire key (meaning the row key, column
family, and column qualifier) for each value. As an example, after loading data into this table
we will examine the storage space for each of the three tables created thus far.
As in the previous section, issue the following statement which will load data from a file into
the InfoSphere BigInsights HBase table.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.sls_sales_fact_dense;
Notice that the number of rows loaded into the table with many-to-one mapping remains the same
even though we are storing less data. For this same reason, the statement also executes much
faster than the previous load.
44603 rows affected (total: 3.42s)
Issue the same statements and commands from both the BigSQL and HBase shells as in
the previous two sections to verify that the number of rows is the same as in the original
dataset. All of the results should be the same as in the previous section.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 0.93s; total: 0.93s)
BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact_dense WHERE
order_day_key = 20070720;
1405 rows in results(first row: 0.65s; total: 0.68s)
HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW
=> '20070721'}
1405 row(s) in 4.3830 seconds
As noted earlier, one-to-one mapping uses much more storage space than the same data
mapped with composite keys or dense columns, where the HBase row key or HBase
column is made up of multiple relational table columns. This is because HBase repeats the
row key, column family name, column name, and timestamp for each column value. For
relational data, which is usually dense, this causes an explosion in the required storage space.
Issue the following command as biadmin from a Linux gnome-terminal to check the directory
sizes for the three tables we created thus far.
Linux Terminal
hadoop fs -du /hbase/
…
17731926   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact
3188       hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_dense
47906322   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_unique
…
Notice that the dense table is significantly smaller than the others. The table in which we
forced uniqueness is the largest since it needs to append a UUID to each row key.
Data Collation Problem
All data represented thus far has been stored as strings. That is the default encoding on
HBase tables created by BigSQL. Therefore, numeric data is not collated correctly. HBase
uses lexicographic ordering, so you may run into cases where a query returns wrong results.
The following scenario walks through a situation where data is not collated correctly.
Using the Big SQL insert into hbase statement, add the following row to the
sls_sales_fact_dense table we previously defined and loaded. Notice that the date
we are specifying for the ORDER_DAY_KEY column (which has data type int) is a
larger numeric value and does not conform to any date standard, since it contains an extra
digit.
BigSQL Shell
INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY,
ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY,
PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) VALUES (200707201, 11171,
4428, 7109, 5588, 30265, 5501, 605);
Note: The insert command is available for HBase tables; however, it is not a supported feature.
Issue a scan on the table with the following start and stop criteria.
HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW
=> '20070721'}
Take notice of the last three rows/cells returned from the output of this scan. The newly
added row shows up in the scan even though its integer value is not between 20070720 and
20070721.
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value=
1406 row(s) in 4.2400 seconds
Now insert another row into the table with the following command. This time we are
conforming to the date format of YYYYMMDD and incrementing the day by 1 from the last
value returned in the table; i.e., 20070721.
BigSQL Shell
INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY,
ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY,
PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) VALUES (20070721, 11171,
4428, 7109, 5588, 30265, 5501, 605);
Issue another scan on the table. Keep in mind to increase the stoprow criteria by 1 day.
HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW
=> '20070722'}
Now notice that the newly added row is included in the result set, and that the row with
ORDER_DAY_KEY of 200707201 appears before the row with ORDER_DAY_KEY of 20070721,
even though it is numerically larger. This is an example of numeric data not being collated
properly: the rows are stored in byte-lexicographic order (where '200707201...' sorts before
'20070721...') rather than in the numeric order one might expect.
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value=
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692480966, value=
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692480966, value=
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692480966, value=
1407 row(s) in 2.8840 seconds
Many-to-one Mapping with Binary Encoding
Big SQL supports two types of data encodings: string and binary. Each HBase entity can
also have its own encoding. For example, a row key can be encoded as a string, one HBase
column can be encoded as binary and another as string.
String is the default encoding used in Big SQL HBase tables. The value is converted to string
and stored as UTF-8 bytes. When multiple parts are packed into one HBase entity,
separators are used to delimit data. The default separator is the null byte. As it is the lowest
byte, it maintains data collation and allows range queries and partial row scans to work
correctly.
Binary encoding in Big SQL is sortable, so numeric data, including negative numbers, collates
properly. It handles separators internally and avoids problems with separators appearing within
the data by escaping them.
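Although this lab only overrides the encoding at the table level (default encoding binary), a per-entity encoding could in principle be expressed in the column mapping itself, as described above. The snippet below is a sketch only: the per-entry encoding keyword and its placement are assumptions, not something exercised in this lab, and the table name is invented; check the Big SQL reference for your release.
BigSQL Shell
-- sketch only: per-entry encoding keywords are assumed, not taken from this lab
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_MIXED
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, QUANTITY int
)
COLUMN MAPPING
(
key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY) encoding binary,
cf_data:cq_QUANTITY mapped by (QUANTITY) encoding string
);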
Issue the following DDL statement from the BigSQL shell to create a dense table as we did
in the previous section, but this time overriding the default encoding to binary.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE_BINARY
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY
int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int,
ORDER_METHOD_KEY int,
SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int,
QUANTITY int,
UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE
decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2),
GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
key
mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY,
ORDER_METHOD_KEY),
cf_data:cq_OTHER_KEYS
mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY,
CLOSE_DAY_KEY),
cf_data:cq_QUANTITY
mapped by (QUANTITY),
cf_data:cq_DOLLAR_VALUES
mapped by (UNIT_COST, UNIT_PRICE,
UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
)
default encoding binary;
Once again, use the load hbase data command to load the data into the table. This time
we are adding the DISABLE WAL clause. By using the option to disable WAL (write-ahead
log), writes into HBase can be sped up. However, this is not a safe option. Turning off WAL
can result in data loss if a region server crashes. Another possible option to speed up load is
to increase the write buffer size.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.sls_sales_fact_dense_binary DISABLE WAL;
44603 rows affected (total: 5.54s)
Issue a select statement on the newly created and loaded table with binary encoding,
sls_sales_fact_dense_binary.
BigSQL Shell
SELECT * FROM gosalesdw.sls_sales_fact_dense_binary
go -m discard;
Note: The "go -m discard" option is used so that the results of the command are not displayed in
the terminal.
44603 rows in results(first row: 0.35s; total: 2.89s)
Issue another select statement, this time on the previous table that uses string encoding,
sls_sales_fact_dense.
BigSQL Shell
SELECT * FROM gosalesdw.sls_sales_fact_dense
go -m discard;
44605 rows in results(first row: 0.31s; total: 3.1s)
(The two extra rows are those inserted in the Data Collation Problem section.)
The main point to see here is that the query against the binary-encoded table can return faster
(numeric types are also collated properly).
Note: You will probably not see much, if any, performance differences in this lab exercise since we are
working with such a small dataset.
No custom serialization/deserialization logic is required for string encoding. This makes the
data portable in case another application needs to read the HBase tables. A main use case
for string encoding is mapping existing data: delimited data is a very common storage format
and can easily be mapped using Big SQL string encoding. However, parsing strings is
expensive, so queries over string-encoded data are slower, and, as seen above, numeric data
is not collated correctly.
Queries on data encoded as binary have faster response times, and numeric data, including
negative numbers, is collated correctly. The downside is that the data is encoded by Big SQL's
own logic and may not be portable as-is.
Many-to-one Mapping with HBase Pre-created Regions and
External Tables
HBase automatically handles splitting regions when they reach a set limit. In some scenarios,
like bulk loading, it is more efficient to pre-create regions so that the load operation can take
place in parallel. The sales data covers four months, April through July of 2007. We can
pre-create regions by specifying splits in the create table command.
In this section, we will create a table within the HBase shell with pre-defined splits, not using
any Big SQL features at first. Then we will show how users can map existing data in
HBase to Big SQL, which can prove to be a very common practice. This is made possible by
creating what are called external tables.
Start by issuing the following statement in the HBase shell. This will create the
sls_sales_fact_dense_split table with pre-defined region splits for April through July in 2007.
HBase Shell
create 'gosalesdw.sls_sales_fact_dense_split', {NAME => 'cf_data',
REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION =>
'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS =>
'0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER
=> 'NONE', TTL => '2147483647', VERSIONS => '2147483647', BLOCKSIZE =>
'65536'}, {SPLITS => ['200704', '200705', '200706', '200707']}
Issue the following list command on the HBase shell to verify the newly created table.
HBase Shell
list
Note that if we were to list the tables from the Big SQL shell, we would not see this table
because we have not made any association yet to Big SQL.
Open and point a browser to the following URL: http://bivm:60010/. Scroll down and click on
the table we had just defined in the HBase shell, gosalesdw.sls_sales_fact_dense_split.
Examine the pre-created regions for this table, as defined when creating the table.
Execute the following create external hbase command to map the existing table we have just
created in HBase to Big SQL. Some things to note about the command:
• The create table statement allows specifying a different name for the SQL table through
the hbase table name clause. Using external tables, you can also create multiple views
of the same HBase table; for example, one table can map a few columns and another
table a different set of columns.
• The column mapping section of the create table statement allows specifying a
different separator for each column and for the row key.
• External tables can also be used to map tables created using the Hive HBase storage
handler, which cannot be read directly using the Big SQL storage handler.
BigSQL Shell
CREATE EXTERNAL HBASE TABLE
GOSALESDW.EXTERNAL_SLS_SALES_FACT_DENSE_SPLIT
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY
int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int,
ORDER_METHOD_KEY int,
SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int,
QUANTITY int,
UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE
decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2),
GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
key
mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY,
ORDER_METHOD_KEY) SEPARATOR '-',
cf_data:cq_OTHER_KEYS
mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY,
CLOSE_DAY_KEY) SEPARATOR '/',
cf_data:cq_QUANTITY
mapped by (QUANTITY),
cf_data:cq_DOLLAR_VALUES
mapped by (UNIT_COST, UNIT_PRICE,
UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) SEPARATOR '|'
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';
The data in external tables is not validated at creation time. For example, if the separators for a
column in the external table are defined incorrectly, query results will be unpredictable.
Note: External tables are not owned by Big SQL and hence cannot be dropped via Big SQL. Also,
secondary indexes cannot be created via Big SQL on external tables.
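To illustrate the point above about multiple views of the same HBase table, a second external table could map only a subset of its entities. The sketch below is for illustration only and is not used in the rest of this lab; the table name and the chosen subset are invented, while the syntax simply mirrors the statement above.
BigSQL Shell
CREATE EXTERNAL HBASE TABLE GOSALESDW.EXTERNAL_SLS_SALES_FACT_QTY_VIEW
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int,
RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int,
QUANTITY int
)
COLUMN MAPPING
(
key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) SEPARATOR '-',
cf_data:cq_QUANTITY mapped by (QUANTITY)
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';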
Use the following command to load data into the external table
gosalesdw.external_sls_sales_fact_dense_split that we defined above.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.external_sls_sales_fact_dense_split;
44603 rows affected (total: 1m57.2s)
Verify that the same number of rows loaded is also the same number of rows returned by
querying the external SQL table.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.external_sls_sales_fact_dense_split;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 6.44s; total: 6.46s)
Verify the same from the HBase shell directly on the underlying HBase table.
HBase Shell
count 'gosalesdw.sls_sales_fact_dense_split'
...
44603 row(s) in 9.1620 seconds
Issue a get command from the HBase shell specifying the row key as follows. Notice the
separator between each part of the row key is a “-” which is what we defined when originally
creating the external table.
HBase Shell
get 'gosalesdw.sls_sales_fact_dense_split', '20070720-11171-4428-7109-5588-30263-5501-605'
In the following output you can also see the other separators we defined for the external
table: "|" for cq_DOLLAR_VALUES and "/" for cq_OTHER_KEYS.
COLUMN                      CELL
cf_data:cq_DOLLAR_VALUES    timestamp=1376690502630, value=33.59|62.65|62.65|0.4638|1566.25|726.50
cf_data:cq_OTHER_KEYS       timestamp=1376690502630, value=481896/20070723/20070723
cf_data:cq_QUANTITY         timestamp=1376690502630, value=25
3 row(s) in 0.0610 seconds
Of course, in Big SQL we do not need to specify separators such as "-" when querying
against the table, as in the command below.
BigSQL Shell
SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split WHERE
ORDER_DAY_KEY = 20070720 AND ORGANIZATION_KEY = 11171 AND EMPLOYEE_KEY
= 4428 AND RETAILER_KEY = 7109 AND RETAILER_SITE_KEY = 5588 AND
PRODUCT_KEY = 30263 AND PROMOTION_KEY = 5501 AND ORDER_METHOD_KEY =
605;
Load Data: Error Handling
In this final section of the part of the lab, we will examine how to handle errors during the
load operation.
The load hbase command has an option to continue past errors. The LOG ERROR ROWS
IN FILE clause can be used to specify a file name to log any rows that could not be loaded
because of errors. Some of the common errors are invalid numeric types, and a separator
existing within the data for string encoding.
Linux Terminal
hadoop fs -cat /user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt
2007072a    11171   …
b0070720    11171   …
2007-07-20  11171   …
20070720    11-71   …
20070721    11171   …
Note that a separator appearing within the data (the value 11-71) is an issue with string encoding.
Knowing there are errors in the input data, proceed to issue the following load command,
specifying a file in which to log the "bad" rows.
BigSQL Shell
LOAD HBASE DATA INPATH
'/user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt' DELIMITED FIELDS
TERMINATED BY '\t' INTO TABLE
gosalesdw.external_sls_sales_fact_dense_split LOG ERROR ROWS IN FILE
'/tmp/SLS_SALES_FACT_load.err';
In this example, 4 rows did not get loaded because of errors. Note that the load reports only the
rows that were loaded successfully.
1 row affected (total: 2.74s)
Examine the file specified in the load command to view the rows that were not loaded.
Linux Terminal
hadoop fs -cat /tmp/SLS_SALES_FACT_load.err
"2007072a","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"b0070720","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"2007-07-20","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"20070720","11-71","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
[OPTIONAL] HBase Access via JAQL
Jaql has an HBase module that can be used to create and insert data into HBase tables and
query them efficiently using multiple modes: a local mode that accesses HBase directly, as well
as a map reduce mode. It allows specifying query optimization options similar to what is
available in the hbase shell. The capability to transparently use map reduce jobs makes it work
well with bigger tables. At the same time, users can force local mode when they run point or
range queries. It allows use of a SQL language subset, termed Jaql SQL, which provides
the capability to join, group, and perform other aggregations on tables. It also provides
access to data from different sources, such as relational DBMSs, and different formats such as
delimited files, Avro, and anything else supported by Jaql. The results of a query can be
written in different formats to HDFS and read by other BigInsights applications like BigSheets
for further analysis. In this section, we'll first pull information from our relational DBMS and
then go over the Jaql HBase module, specifically the additional features that it provides.
Start by opening a Jaql shell. You can open the same (JSQSH) terminal that was used for
Big SQL by adding the "--jaql" option as shown below. This is a much better environment to
work with than the standard Jaql shell, as it provides features such as command history via
the up arrow key and the ability to move through the current command with the left/right arrow
keys.
Linux Terminal
/opt/ibm/biginsights/bigsql/jsqsh/bin/jsqsh --jaql;
Once in the JSQSH shell with Jaql option, load the dbms::jdbc driver with the following
command.
BigSQL/JAQL Shell
import dbms::jdbc;
Add the JDBC driver JAR file to the classpath.
BigSQL/JAQL Shell
addRelativeClassPath(getSystemSearchPath(),
'/opt/ibm/db2/V10.5/java/db2jcc.jar');
Supply the connection information.
BigSQL/JAQL Shell
db := jdbc::connect(
driver = 'com.ibm.db2.jcc.DB2Driver',
url = 'jdbc:db2://localhost:50000/gosales',
properties = {user: "db2inst1", password: "password"} );
Specify the rows to be retrieved with a SQL select statement.
BigSQL/JAQL Shell
DESC := jdbc::prepare( db, query =
"SELECT * FROM db2inst1.sls_sales_fact_10p");
In many-to-one mapping for row key, we went over creation of a composite key. In the next
few steps, we will use Jaql to load the same data using a composite key and dense columns.
We’ll pack all columns that make up primary key of the relational table into a HBase row key,
and we’ll also pack other columns into dense HBase columns.
Define a variable to read the original data from the relational JDBC source. This converts
each tuple of the table into a JSON record.
BigSQL/JAQL Shell
ssf = localRead(DESC);
Transform the record into the required format. Essentially we are doing the same procedure
as when we defined the many-to-one mapping in the previous sections. For the first element,
which we will use for HBase row key, concatenate the values of the columns that form the
primary key of the sales fact table using a “-” separator. For the remaining columns, pack
them into other dense HBase columns: cq_OTHER_KEYS (using “/” separator),
cq_QUANTITY, and cq_DOLLAR_VALUES (using “|” separator).
BigSQL/JAQL Shell
ssft = ssf -> transform [$."ORDER_DAY_KEY", $."ORGANIZATION_KEY",
$."EMPLOYEE_KEY", $."RETAILER_KEY", $."RETAILER_SITE_KEY",
$."PRODUCT_KEY", $."PROMOTION_KEY", $."ORDER_METHOD_KEY",
$."SALES_ORDER_KEY", $."SHIP_DAY_KEY", $."CLOSE_DAY_KEY", $."QUANTITY",
$."UNIT_COST", $."UNIT_PRICE", $."UNIT_SALE_PRICE", $."GROSS_MARGIN",
$."SALE_TOTAL", $."GROSS_PROFIT"] -> transform
{
key: strcat($[0],"-",$[1],"-",$[2],"-",$[3],"-",$[4],"-",$[5],"-",$[6],"-",$[7]),
cf_data: {
cq_OTHER_KEYS: strcat($[8],"/",$[9],"/",$[10]),
cq_QUANTITY: strcat($[11]),
cq_DOLLAR_VALUES:
strcat($[12],"|",$[13],"|",$[14],"|",$[15],"|",$[16],"|",$[17])
}
};
Verify the data is in the correct format by querying the first record.
BigSQL/JAQL Shell
ssft -> top 1;
{
"key": "20070418-11114-4415-7314-5794-30124-5501-605",
"cf_data": {
"cq_OTHER_KEYS": "254121/20070423/20070423",
"cq_QUANTITY": "60",
"cq_DOLLAR_VALUES": "610.00m|1359.72m|1291.73m|0.5278|77503.80m|40903.80m"
}
}
(1 row in 2.40s)
Now we have the data ready to be written into HBase. First import the hbase module which
prepares jaql by loading required jars and preparing the environment using the HBase
configuration files.
BigSQL/JAQL Shell
import hbase(*);
Use hbaseString to define a schema for the HBase table. The HBase table does not get
created until something is written into it. An array of records that match the specified schema
should be used to write into the HBase table. The data types correspond to how Jaql will
interpret the data.
BigSQL/JAQL Shell
SSFHT = hbaseString('sales_fact2', schema { key: string, cf_data?: {*:
string}}, create=true, replace=true, rowBatchSize=10000,
colBatchSize=200 );
Note: As this could be a big table, specify rowBatchSize and colBatchSize, which are used for
scanner caching and the column batch size by the internal HBase scan object. The column batch size is useful
when rows have a huge number of columns.
Write to the table using the previously created ssft array which matches the specified
schema.
BigSQL/JAQL Shell
ssft -> write(SSFHT);
A write operation will create the HBase table, and populate it with the input data. To confirm,
use hbase shell to count (or scan) the table and verify the data was written with the right
number of rows.
HBase Shell
count 'sales_fact2'
44603 row(s) in 3.6230 seconds
To read the contents of the HBase table using Jaql, use read on the hbaseString. In the
following command we are also passing the read directly into a count function to verify the
right number of rows.
BigSQL/JAQL Shell
count(read(SSFHT));
44603
To query for rows matching a particular order day key 20070720, use setKeyRange for the
partial range query. Use localRead for point and range queries as Jaql is tuned for local
execution and performs efficiently.
BigSQL/JAQL Shell
localRead(SSFHT -> setKeyRange('20070720', '20070721'));
Perform the same query using the HBase shell. Both complete in a similar amount of time.
HBase Shell
scan 'sales_fact2', {STARTROW => '20070720', STOPROW => '20070721',
CACHE => 10000}
To query for a row when we have the values for all primary key columns, we can construct
the entire row key and perform a point query.
BigSQL/JAQL Shell
localRead(SSFHT -> setKey('20070720-11171-4428-7109-5588-30263-5501-605'));
Equivalently, this is what the statement would look like from the HBase shell.
HBase Shell
get 'sales_fact2', '20070720-11171-4428-7109-5588-30263-5501-605'
To use a filter from Jaql, use the setFilter function along with addFilter. In the case
below, the predicate is on the sales order key, which is the leading part of the cq_OTHER_KEYS
dense column and hence can be used in the filter.
BigSQL/JAQL Shell
read(SSFHT -> setFilter([addFilter(filterType.SingleColumnValueFilter,
HBaseKeyArrayToBinary(["481896/"]),
compareOp.equal,
comparators.BinaryPrefixComparator,
"cf_data",
"cq_OTHER_KEYS",
true
)
])
);
PART II – A – Query Handling
Efficiently querying HBase requires pushing as much to the server(s) as possible. This
includes projection pushdown or fetching the minimal set of columns that are required by the
query. It also includes pushing down query predicates into the server as scan limits, filters,
index lookups, etc. Setting scan limits is extremely powerful as it can help to narrow down
regions we need to scan. With a full row key, HBase can quickly pinpoint the region and the
row. With partial keys and key ranges (upper, lower limits or both), HBase can narrow down
regions or eliminate regions which fall outside the range.
Indexes help to leverage this key lookup, but they use two tables to achieve it. Filters
cannot eliminate regions, but some have the capability to skip within a region; they help to
narrow down the data set returned to the client.
With limited metadata/statistics about HBase tables, supporting a variety of hints helps
improve query efficiency.
The Data
This section describes the schema which the sample data will use to demonstrate the effects
of pushdown from Big SQL.
We will use a TPC-H table: the orders table, with 150,000 rows, defined using the mapping shown
below.
Issue the following command from a Big SQL shell to create the orders table. Notice this
table has a many-to-one mapping, meaning there is a composite key and dense columns.
BigSQL Shell
CREATE HBASE TABLE ORDERS
(
O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1),
O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15),
O_CLERK VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79)
)
column mapping
(
key
mapped by (O_CUSTKEY,O_ORDERKEY),
cf:d mapped by
(O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_CO
MMENT),
cf:od mapped by (O_ORDERDATE)
)
default encoding binary;
Load the sample data into the newly created table by issuing the following command.
Note: As in Part I, there are three sample data sets provided for you. Each one is essentially the same except
for the amount of data it contains. The remaining instructions in this lab exercise use the orders.10p.tbl
dataset simply because it has the least data and is faster to work with for demonstration purposes. If you
would like to use the larger tables with more data, feel free to do so, but remember to change the names
appropriately.
BigSQL Shell
LOAD HBASE DATA INPATH 'tpch/orders.10p.tbl' DELIMITED FIELDS
TERMINATED BY '|' INTO TABLE ORDERS;
150000 rows affected (total: 21.52s)
In the next set of sections, we examine output from the Big SQL log file to point out what you
can check to confirm pushdown from Big SQL. To view the log messages, you may have to
first change logging levels using the commands below.
BigSQL Shell
log com.ibm.jaql.modules.hcat.mapred.JaqlHBaseInputFormat info;
BigSQL Shell
log com.ibm.jaql.modules.hcat.hbase info;
Note that columns are pushed down at the HBase column level. So, in many-to-one mappings, if the
query requires only one part of a dense column with many parts, the entire value for the dense
column will be returned. Therefore it is efficient to pack together columns that are usually
queried together.
Use the following command to tail the Big SQL log file. Keep this open in a terminal
throughout this entire part of this lab. We will be referring to it quite often to see what is
going on behind the scenes when running certain commands.
Linux Terminal
tail -f /var/ibm/biginsights/bigsql/logs/bigsql.log
Projection Pushdown
The first query here does a SELECT * and requests all HBase columns used in the table
mapping. The original HBase table could have many more columns; we may have defined an
external table mapping to just a few of them. In such cases, only the HBase columns used
in the mapping will be retrieved.
BigSQL Shell
SELECT * FROM orders
go -m discard;
150000 rows in results(first row: 1.73s; total: 10.69s)
In the Big SQL log file, we can see that we returned data from both columns.
BigSQL Log
…
…HBase scan details:{…, families={cf=[d, od]}, …, stopRow=, startRow=,
totalColumns=2, …}
This second query requests only one HBase column:
BigSQL Shell
SELECT o_totalprice FROM orders
go -m discard;
Notice that the query returns much faster since we are returning much less data.
150000 rows in results(first row: 0.27s; total: 2.83s)
Verify from the log file that this query only executed against one column.
BigSQL Log
…
…HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, totalColumns=1,
…}
The third query also requests only one HBase column.
BigSQL Shell
SELECT o_orderdate FROM orders
go -m discard;
Although this query returns less data, it has a higher response time because
serialization/deserialization of the timestamp type is expensive.
150000 rows in results(first row: 0.37s; total: 4.5s)
BigSQL Log
…
…HBase scan details:{…, families={cf=[od]}, …, stopRow=, startRow=, totalColumns=1,
…}
Predicate Pushdown
Point Scan
Identifying and using point scans is the most effective optimization for queries into HBase.
To convert to a point scan, we need predicate values covering the full row key.
These may come as multiple predicates, since Big SQL supports composite keys.
The query analyzer in Big SQL is capable of combining multiple predicates to identify a full
row scan. Currently, this analysis happens at run time in the storage handler. At that point,
the decision of whether or not to use map reduce has already been made. To bypass map
reduce, a user has to provide explicit local mode access hints currently.
In the example below, the command “set force local on” makes sure all queries
executing in the session do not use map reduce.
BigSQL Shell
set force local on;
Issue the following select statement, which provides predicates for the columns that
comprise the full row key: custkey and orderkey.
BigSQL Shell
select o_orderkey, o_totalprice from orders where o_custkey=4 and
o_orderkey=5612065;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5612065 |  71845.25781 |
+------------+--------------+
1 row in results(first row: 0.18s; total: 0.18s)
If we check the logs, you can see that Big SQL successfully took both predicates specified
and combined them to do a row scan using all parts of the composite key.
BigSQL Log
…
… Found a row scan by combining all composite key parts.
… Found a row scan from row key parts
… HBase filter list created using AND.
… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):
[PrefixFilter x01x80x00x00x04], …,
stopRow=x01x80x00x00x04x01x80x00x00x00x00UxA2!,
startRow=x01x80x00x00x04x01x80x00x00x00x00UxA2!, totalColumns=1, …}
Partial Row Scan
This section shows the capability of the Big SQL server to process predicates on leading parts of
the row key, not necessarily the full row key as in the previous section.
Issue the following example query that provides a predicate for the first part of the row key,
custkey.
BigSQL Shell
select o_orderkey, o_totalprice from orders where o_custkey=4;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5453440 |  17938.41016 |
|    5612065 |  71845.25781 |
+------------+--------------+
2 rows in results(first row: 0.19s; total: 0.19s)
Checking the logs, you can see the predicate on the first part of the row key is converted to a range
scan. The stop row in the scan is non-inclusive, so the key prefix is internally appended with the
highest possible byte (0xFF) to cover the partial range.
BigSQL Log
…
… Found a row scan that uses the first 1 part(s) of composite key.
… Found a row scan from row key parts
… HBase filter list created using AND.
… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):
[PrefixFilter x01x80x00x00x04], …, stopRow=x01x80x00x00x04xFF,
startRow=x01x80x00x00x01, totalColumns=1, …}
Range Scan
When there are range predicates, we can set the start or stop row or both.
In our example query below we have a ‘less than’ predicate; therefore we only know the stop
row. However, even setting this will help eliminate regions with row keys that fall above the
stop row. Issue the following command.
BigSQL Shell
select o_orderkey, o_totalprice from orders where o_custkey < 15;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5453440 |  17938.41016 |
|    5612065 |  71845.25781 |
|    5805349 | 255145.51562 |
|    5987111 |  97765.57812 |
|    5692738 | 143292.53125 |
|    5885190 | 125285.42969 |
|    5693440 | 117319.15625 |
|    5880160 | 198773.68750 |
|    5414466 | 149205.60938 |
|    5534435 | 136184.51562 |
|    5566567 |  56285.71094 |
+------------+--------------+
11 rows in results(first row: 0.22s; total: 0.22s)
Notice in the log file that, similarly to the previous section, we are only using the first part
of the composite key, since we are specifying custkey as the predicate. However, in this case
we only know the stop row (less than 15), so there is no value for the start row portion of
the scan.
BigSQL Log
…
… Found a row scan that uses the first 1 part(s) of composite key.
… Found a row scan from row key parts
…
… HBase scan details:{…, families={cf=[d]}, …, stopRow=x01x80x00x00x0F,
startRow=, totalColumns=1, …}
Full Table Scan
This section simply shows an example of what happens when none of the predicates can be
pushed down to HBase
In this example query, the predicate (orderkey) is on non-leading part of row key and
therefore is not pushed down. Issue the command to see this will result in a full table scan.
BigSQL Shell
select o_orderkey, o_totalprice from orders where o_orderkey=5612065;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5612065 |  71845.25781 |
+------------+--------------+
1 row in results(first row: 1.90s; total: 1.90s)
As can be determined by examining the logs, in cases where none of the predicates can be
pushed to HBase, a full table scan is required, meaning there are no specified values for
either the start or the stop row.
BigSQL Log
…
… HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, …}
Automatic Index Usage
This section will demonstrate the benefits of an index lookup.
Before creating an index, let's first execute a query that invokes a full table scan, so we
can compare it later against the performance benefit achieved by creating an index on
particular column(s). Notice that we are specifying a predicate on the clerk column, which is
a middle part of the dense column defined earlier.
BigSQL Shell
SELECT * FROM orders WHERE o_clerk='Clerk#000000999'
go -m discard;
154 rows in results(first row: 2.40s; total: 4.32s)
As you can see below in the log file, there is no usage of an index.
BigSQL Log
…
… indexScanInfo: [isIndexScan: false], valuesInfo: [minValue: undefined,
minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo:
[numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false],
indexScanCandidateInfo: [hasIndexScanCandidate: false]]
…
Issue the following command to create the index on the clerk column, which is the middle part
of a dense column in the table. This creates a new table to store the index data. The index table
stores the column value and the row key in which it appears.
BigSQL Shell
CREATE INDEX ix_clerk ON TABLE orders (o_clerk) AS 'hbase';
Note:
The create index statement creates a new index table named
<base_table_name>_<index_name>, deploys the coprocessor, and populates the index table
using a map reduce index builder. The "as 'hbase'" clause indicates the type of index handler to use; for
HBase, there is a separate index handler.
0 rows affected (total: 1m17.47s)
Re-issue the exact same command as we did earlier.
BigSQL Shell
SELECT * FROM orders WHERE o_clerk='Clerk#000000999'
go -m discard;
After creating the index and issuing the same select statement, Big SQL will automatically
take advantage of the index that was created and avoids a full table scan which results in a
much faster response time.
154 rows in results(first row: 0.73s; total: 0.74s)
You can verify in the log file that Big SQL used the index. The index table is scanned for all
rows matching the value of the clerk predicate, in this case Clerk#000000999. From
the matching row(s), the row key(s) of the base table are extracted, and get requests are batched
and sent to the data table.
BigSQL Log
…
… indexScanInfo: [isIndexScan: true, keyLookupType: point_query, indexDetails:
JaqlHBaseIndex[indexName: ix_clerk, indexSpec: {"bin_terminator": "#","columns":
[{"cf": "cf","col": "o_clerk","cq": "d","from_dense": "true"}],"comp_seperator":
"%","composite": "false","key_seperator": "/","name": "ix_clerk"}, numColumns: 1,
columns: [Ljava.lang.String;@3ced3ced, startValue: x01Clerk#000000999x00,
stopValue: x01Clerk#000000999x00]], valuesInfo: [minValue: [B@4b834b83,
minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo:
[numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false],
indexScanCandidateInfo: [hasIndexScanCandidate: true, indexScanCandidate:
IndexScanCandidate[columnName: o_clerk,indexColValue: [B@4cda4cda,[operator:
=,isVariableLength: false,type: null,encoding: BINARY]]]
… Found an index scan from index scan candidates. Details:
… Index name: ix_clerk
…
… Index query details: [indexSpec:ix_clerk, startValueBytes: #Clerk#000000999,
stopValueBytes: #Clerk#000000999,baseTableScanStart:,baseTableScanStop:]
… Index query successful.
Note: For a composite index where multiple columns are used to define an index, predicates are handled
and pushed down similar to what is done for composite row keys.
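For example, a composite index over two columns could be defined with the same syntax as above; the index name and the column pair below are illustrative only and not part of the lab steps.
BigSQL Shell
CREATE INDEX ix_clerk_prio ON TABLE orders (o_clerk, o_orderpriority) AS 'hbase';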
If there were no index, the predicate could not be pushed down, as it is on a non-leading part of
a dense column. In such cases a full table scan is required, as seen at the beginning of this
section.
Pushing Down Filters into HBase
Though HBase filters do not avoid a full table scan, they limit the rows and data returned to the
client. HBase filters have a skip facility that lets them skip over certain portions of data; many
of the built-in filters implement this and thus prove more efficient than a raw table scan.
There are also filters that can limit the data within a row. For example, when a query only needs
data that is part of the row key, filters such as FirstKeyOnlyFilter and KeyOnlyFilter can be
applied to return only a single instance of the row key portion of the data.
The sample query below will demonstrate a case where Big SQL pushes down a row scan
and a column filter.
BigSQL Shell
SELECT o_orderkey FROM orders WHERE o_custkey>100000 AND
o_orderstatus='P'
go -m discard;
1278 rows in results(first row: 0.37s; total: 0.38s)
Notice that the predicate on the custkey column triggers the row scan. The column filter,
SingleColumnValueFilter, is triggered because there is a predicate on the leading part
of a dense column (cf:d).
BigSQL Log
…
… Found a row scan that uses the first 1 part(s) of composite key.
… Found a row scan from row key parts
… HBase filter list created using AND.
…
… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):
[SingleColumnValueFilter (cf, d, EQUAL, x01Px00)], …, stopRow=,
startRow=x01x80x01x86xA1, totalColumns=1, …}
This way Big SQL can automatically convert predicates into many of these filters and thus
handle queries more efficiently.
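As one more sketch (not part of the lab steps, with invented predicate values), a range predicate on the leading key part combined with an equality on the leading part of the dense column cf:d would be expected to produce a similar combination of a row range scan and a SingleColumnValueFilter; the exact pushdown can be confirmed in the log as above.
BigSQL Shell
SELECT o_orderkey FROM orders WHERE o_custkey BETWEEN 1000 AND 2000 AND
o_orderstatus='F'
go -m discard;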
Table Access Hints
Access hints affect the strategy used to read the table, identify the source of the data,
and optimize a query. For example, the strategy can determine whether MapReduce is
employed to implement a join or whether an in-memory (hash) join is used. These hints can
also control how to access data from specific sources. The table access hint that we will
explore here is accessmode.
Accessmode
The accessmode hint is very important for HBase because it avoids MapReduce overhead.
Combined with point queries, it ensures sub-second response times that are not affected by
the total data size.
There are multiple ways to specify the accessmode hint: as a query hint or at the session level.
Note that session-level hints take precedence. If "set force local off;" is run in a session,
all subsequent queries will always use MapReduce, even if an explicit accessmode='local'
hint is specified on the query.
You can check the state of accessmode, if it was explicitly set, on the session with the
following command in the Big SQL shell.
BigSQL Shell
set;
If you kept the same shell open throughout this part of the lab, you will see the following
output. This is because we used “set force local on” earlier in one of the previous
sections.
+--------------------+-------+
| key                | value |
+--------------------+-------+
| bigsql.force.local | true  |
+--------------------+-------+
1 row in results(first row: 0.0s; total: 0.0s)
To change the setting back to the default, you can change the value to automatic with the
following command.
BigSQL Shell
set force local auto;
Issue the following select query.
BigSQL Shell
select o_orderkey from orders where o_custkey=4 and o_orderkey=5612065;
Notice how long the query takes.
+------------+
| o_orderkey |
+------------+
|    5612065 |
+------------+
1 row in results(first row: 7.2s; total: 7.2s)
Issue the same query with an accessmode hint this time.
BigSQL Shell
select o_orderkey from orders /*+ accessmode='local' +*/ where
o_custkey=4 and o_orderkey=5612065;
Notice that the query responds much faster. This is because of the local accessmode;
no MapReduce job is employed.
+------------+
| o_orderkey |
+------------+
|    5612065 |
+------------+
1 row in results(first row: 0.32s; total: 0.32s)
PART II – B – Connecting to Big SQL Server via JDBC
Organizations interested in Big SQL often have considerable SQL skills in-house, as well as
a suite of SQL-based business intelligence applications and query/reporting tools. The idea
of being able to leverage existing skills and tools — and perhaps reuse portions of existing
applications — can be quite appealing to organizations new to Hadoop.
Therefore Big SQL supports a JDBC driver that conforms to the JDBC 3.0 specification to
provide connectivity to Java™ applications. (Big SQL also supports a 32-bit or a 64-bit
ODBC driver, on either Linux or Windows, that conforms to the Microsoft Open Database
Connectivity 3.0.0 specification, to provide connectivity to C and C++ applications).
In this part of the lab, we will explore how to use Big SQL's JDBC driver with BIRT, an open
source business intelligence and reporting tool that plugs into Eclipse. We will use this tool to
run some very simple reports using SQL queries on data stored in HBase on our Hadoop
environment.
Business Intelligence and Reporting via BIRT
To start, open Eclipse from the Desktop of the virtual machine by clicking on the Eclipse icon.
When prompted to do so, leave the default workspace as is.
Once Eclipse has loaded, switch to the 'Report Design' perspective so that we can work with
BIRT. To do so, from the menu bar click on: Window -> Open Perspective -> Other....
Then click on: Report Design -> OK as shown below.
Once in the Report Design perspective, double-click on Orders.rptdesign from the
Navigator pane (on the bottom left-hand side) to open the pre-created report.
Note: A report has been created on your behalf to more quickly illustrate the functionality/usage of the Big
SQL drivers, while removing the tedious steps of designing a report in BIRT.
Expand 'Data Sets' in the Data Explorer. You will notice the data sets (or report queries)
have a red 'X' beside them. This is because the pre-created report queries are not yet
associated with a data source. Now all that is necessary, prior to being able to run the report, is
to set up the JDBC connection to Big SQL.
To obtain the client drivers, open the BigInsights web console from the Desktop of the VM, or
point your browser to: http://bivm:8080. From the Welcome tab, in the Quick Links section,
select Download the Big SQL Client drivers.
Save the file to /home/biadmin/Desktop/IBD-1687A/.
Open the folder where you saved the file and extract the contents of the client package
under the same directory.
Back in Eclipse, add Big SQL as a source. Right-click on Data Sources -> New Data
Source from the Data Explorer pane on the top left-hand side. In the New Data Source
window, select JDBC Data Source and specify “Big SQL” for the Data Source Name. Click
Next.
In the New JDBC Data Source Profile window, click on Manage Drivers…. Once the
Manage JDBC Drivers window appears click on Add…
Point to the location where the client drivers were extracted, then click OK.
Once added, you should have an entry for the BigSQLDriver in the Driver Class
dropdown field list. Select it, and complete the fields with the following information:
• Database URL: jdbc:bigsql://localhost:7052
• User Name: biadmin
• Password: biadmin
Click on 'Test Connection...' to ensure we can connect to Big SQL using the JDBC driver.
Double-click 'Orders per year' and add the Big SQL connection that was just defined.
Examine the query:
WITH test
(order_year, order_date)
AS
(SELECT YEAR(o_orderdate), o_orderdate FROM orders FETCH FIRST 20 ROWS
ONLY)
SELECT order_year, COUNT(*) AS cnt FROM test GROUP BY order_year
Carry out the same procedure to add the Big SQL connection for the 'Top 5 salesmen'
data set and examine the query.
WITH base (o_clerk, tot) AS
(SELECT o_clerk, SUM(o_totalprice) AS tot FROM orders GROUP BY o_clerk
ORDER BY tot DESC)
SELECT o_clerk, tot FROM base FETCH FIRST 5 ROWS ONLY
Note: Disregard the red ‘X’ that may still exist on the Data Sets. This is a bug and can safely be ignored.
Now that we have defined the Data Source and have the Data Sets configured, run the
report in Web Viewer as shown in the diagram below.
The output from the web viewer against the orders table on Big SQL should be as follows.
As seen in this part of the lab, a variety of IBM and non-IBM software that supports JDBC
and ODBC data sources can also be configured to work with Big SQL. We used BIRT here,
but as another example, Cognos Business Intelligence can use Big SQL's JDBC interface
to query data, generate reports, and perform other analytical functions. Similarly, other tools
like Tableau can leverage Big SQL's ODBC drivers to work with data stored in a BigInsights
cluster.
Communities
• On-line communities, user groups, technical forums, blogs, social networks, and more
  o Find the community that interests you:
    • Information Management bit.ly/InfoMgmtCommunity
    • Business Analytics bit.ly/AnalyticsCommunity
    • Enterprise Content Management bit.ly/ECMCommunity
• IBM Champions
  o Recognizing individuals who have made the most outstanding contributions to the
    Information Management, Business Analytics, and Enterprise Content Management communities
  o ibm.com/champion
Thank You!