Apache Phoenix with the Actor Model (Akka.io) for a real-time Big Data programming stack. Why do we still need SQL for Big Data? How can we make Big Data more responsive and faster?
Both Spark and HBase are widely used, but how to use them together with high performance and simplicity is a very challenging topic. The Spark HBase Connector (SHC) provides feature-rich and efficient access to HBase through Spark SQL. It bridges the gap between the simple HBase key-value store and complex relational SQL queries, and enables users to perform complex data analytics on top of HBase using Spark. SHC implements the standard Spark data source APIs and leverages the Spark Catalyst engine for query optimization. To achieve high performance, SHC constructs the RDD from scratch instead of using the standard HadoopRDD. With the customized RDD, all critical techniques can be applied and fully implemented, such as partition pruning, column pruning, predicate pushdown and data locality. The design makes maintenance easy while achieving a good tradeoff between performance and simplicity. In addition to fully supporting all the Avro schemas natively, SHC also integrates natively with Phoenix data types. With SHC, Spark can execute batch jobs to read/write data from/into Phoenix tables. Phoenix can also read/write data from/into HBase tables created by SHC. For example, users can run a complex SQL query on top of an HBase table created by Phoenix inside Spark, perform a table join against a DataFrame which reads the data from a Hive table, or integrate with Spark Streaming to implement a more complicated system. In this talk, apart from explaining why SHC is of great use, we will also demo how SHC works, how to use SHC in secure/non-secure clusters, how SHC works with multiple secure HBase clusters, etc. This talk will also benefit people who use Spark and other data sources (besides HBase), as it inspires them with ideas of how to support high-performance data source access at the Spark DataFrame level.
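To make the DataFrame-level access concrete, here is a minimal sketch of reading an HBase table through SHC from Java. The catalog JSON, table and column names are placeholders, and SHC is assumed to be on the Spark classpath; the "catalog" option key corresponds to SHC's HBaseTableCatalog.tableCatalog constant.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ShcReadSketch {
    public static void main(String[] args) {
        // JSON catalog mapping a Spark SQL schema onto an HBase table;
        // namespace, table, family and column names here are illustrative.
        String catalog = "{\n"
            + "  \"table\": {\"namespace\": \"default\", \"name\": \"Contacts\"},\n"
            + "  \"rowkey\": \"key\",\n"
            + "  \"columns\": {\n"
            + "    \"id\":   {\"cf\": \"rowkey\",   \"col\": \"key\",  \"type\": \"string\"},\n"
            + "    \"name\": {\"cf\": \"personal\", \"col\": \"name\", \"type\": \"string\"}\n"
            + "  }\n"
            + "}";

        SparkSession spark = SparkSession.builder().appName("shc-demo").getOrCreate();

        // SHC registers itself as a Spark data source; partition pruning and
        // predicate pushdown happen inside the connector's custom RDD.
        Dataset<Row> df = spark.read()
            .option("catalog", catalog)
            .format("org.apache.spark.sql.execution.datasources.hbase")
            .load();

        // The filter below is pushed down to HBase rather than scanned in Spark.
        df.filter("id = '42'").show();
    }
}
```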
Zeppelin has become a popular way to unlock the value of the data lake thanks to its user interface and appeal to business users. These business users ask their IT departments for access to Zeppelin. Enterprise IT departments want to help their business users, but they have several enterprise concerns, such as security, integration with their corporate LDAP/AD, scalability and multi-user environments, and integration with Ranger and Kerberos. This session will walk through these enterprise concerns and how they can be handled with Zeppelin.
This document summarizes Salesforce's use of HBase and Phoenix for storing and querying large amounts of unstructured data at scale. Some key details include:
- Salesforce uses over 100 HBase clusters to store both customer and internal data, handling over 4 billion write requests and 600 million read requests per day.
- This includes storing login data, archived relational data, user activity, machine metrics and more, totaling over 80 terabytes written and 500 gigabytes read daily.
- An internal metrics database collects data from over 80,000 machines, storing 11.4 trillion metrics and growing, with 2.8 trillion metrics added in the last 6 months alone.
This document discusses Apache Falcon and its Pipeline Designer tool. It provides an overview of key concepts in Pipeline Designer including feeds, processes, actions, transforms, and deployment. Pipeline Designer allows composing ETL workflows visually with a graphical interface and handles orchestration, monitoring, and execution on Hadoop clusters. Transformation actions are compiled into Pig scripts and the entire workflow is deployed as a Falcon process.
This talk will give an overview of two exciting releases, Apache HBase 2.0 and Phoenix 5.0. HBase provides a NoSQL column store on Hadoop for random, real-time read/write workloads. Phoenix provides SQL on top of HBase. HBase 2.0 contains a large number of features that were a long time in development, including rewritten region assignment, performance improvements (RPC, a rewritten write pipeline, etc.), async clients and WAL, a C++ client, off-heaping of the memstore and other buffers, shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into detail on some of the most important improvements in the release, as well as the implications for users in terms of APIs and upgrade paths. Phoenix 5.0 is the next big Phoenix release because of its integration with HBase 2.0 and its many performance improvements in support of secondary indexes. It has many important new features such as encoded columns, Kafka and Hive integration, and many other performance improvements. This session will also describe the use cases that HBase and Phoenix are a good architectural fit for. Speaker: Alan Gates, Co-Founder, Hortonworks
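For a flavor of the async client mentioned above, here is a minimal sketch against the HBase 2.0 Java client API; the table, row and column names are invented for illustration.

```java
import java.util.concurrent.CompletableFuture;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.AsyncConnection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.util.Bytes;

public class AsyncGetDemo {
    public static void main(String[] args) throws Exception {
        // HBase 2.0 adds a non-blocking client: the connection and every
        // table operation return CompletableFutures instead of blocking.
        CompletableFuture<AsyncConnection> connFuture =
            ConnectionFactory.createAsyncConnection(HBaseConfiguration.create());

        connFuture
            .thenCompose(conn -> conn.getTable(TableName.valueOf("events"))
                                     .get(new Get(Bytes.toBytes("row-1"))))
            .thenAccept(result -> System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("payload")))))
            .join();  // block only here, at the end of the demo

        connFuture.join().close();
    }
}
```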
The document discusses Microsoft's Azure IoT platform for connecting, managing, and analyzing Internet of Things devices and data. It provides an overview of the key components of Azure IoT including Azure IoT Hub for device connectivity and management, analytics services like Azure Machine Learning and Stream Analytics, and connectivity to other Azure services. It also highlights aspects of Azure IoT like its open ecosystem, support for open standards, and global infrastructure running on Microsoft's Azure cloud.
The document discusses getting involved with open source projects at the Apache Software Foundation. It provides an overview of the ASF, how it works, and how to contribute to Apache projects. The key points are:
- The ASF is a non-profit organization that oversees hundreds of open source projects and thousands of volunteers. Popular projects include Hadoop, Hive, and Pig.
- To get involved, individuals can start by joining mailing lists, reviewing documentation, reporting issues, and submitting code patches. More responsibilities come with becoming a committer or PMC member.
- Projects follow an open development process based on consensus. Voting on decisions helps include contributors from different time zones.
- Contributing is rewarding.
The document discusses sharing metadata across data lakes and streams. It proposes unifying the Hive Metastore (HMS) and Schema Registry so that batch and streaming systems can see each other's metadata. This would reduce the number of separate metadata systems administrators need to maintain. The document also describes making the HMS standalone, so that it can be used without installing Hive, enabling other systems like Spark and Impala to use HMS independently. Finally, it provides use cases where streaming applications need access to batch data in Hive tables and vice versa.
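As a sketch of what a standalone HMS enables, a non-Hive system can talk to the metastore directly over Thrift using the metastore client. The host and port below are placeholders, and the metastore client JAR is assumed to be on the classpath.

```java
import java.util.List;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;

public class HmsListTables {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        // Point directly at the (standalone) metastore service; no Hive
        // server or execution engine is involved in this call path.
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");

        IMetaStoreClient client = new HiveMetaStoreClient(conf);
        List<String> tables = client.getAllTables("default");
        tables.forEach(System.out::println);
        client.close();
    }
}
```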
When interacting with analytics dashboards, two key requirements for a smooth user experience are sub-second response times and data freshness. Cluster computing frameworks such as Hadoop or Hive/HBase work well for storing large volumes of data, but they are not optimized for ingesting streaming data and making it available for queries in real time. Long query latencies also make these systems sub-optimal choices for powering interactive dashboards and BI use cases. In this talk we will present Druid as a complementary solution to existing Hadoop-based technologies. Druid is an open-source analytics data store, designed from scratch for OLAP and business intelligence queries over massive data streams. It provides low-latency real-time data ingestion and fast, sub-second, ad-hoc data exploration queries. Many large companies are switching to Druid for analytics, and we will cover how Druid is able to handle massive data streams and why it is a good fit for BI use cases. Agenda:
1) Introduction and ideal use cases for Druid
2) Data architecture
3) Streaming ingestion with Kafka
4) Demo using Druid, Kafka and Superset
5) Recent improvements in Druid, moving from the lambda architecture to exactly-once ingestion
6) Future work
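One concrete way to issue those sub-second exploratory queries is Druid SQL over Druid's Avatica JDBC endpoint. A minimal sketch, assuming a broker reachable on localhost:8082, a datasource named "wikipedia", and the Avatica JDBC driver on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DruidSqlQuery {
    public static void main(String[] args) throws Exception {
        // Druid exposes SQL through an Apache Calcite Avatica endpoint on the broker.
        String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica/";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT channel, COUNT(*) AS edits "
               + "FROM wikipedia GROUP BY channel ORDER BY edits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("channel") + ": " + rs.getLong("edits"));
            }
        }
    }
}
```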
Presented Apache Falcon at Hadoop Summit 2013, SJC. Delves into the motivation behind Falcon, overview of the architecture, and looking forward into the future.
This document discusses various tools and techniques for diagnosing slow Hadoop jobs, including metrics and monitoring, logging and correlation, and tracing and analysis. It describes how to use Ambari metrics and Grafana dashboards to monitor cluster health and performance. It also explains how to leverage Hadoop audit logs and caller context to correlate job activity. Techniques for application tracing using YARN timeline service and tools like Zeppelin and Tez analyzer are presented to enable deep performance analysis of Hadoop jobs.
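To give a feel for the caller-context correlation mentioned above, the snippet below tags subsequent HDFS RPCs from the current thread so they appear in the NameNode audit log. This is a sketch assuming Hadoop 2.8 or later with caller context enabled; the context string itself is arbitrary.

```java
import org.apache.hadoop.ipc.CallerContext;

public class TaggedJob {
    public static void main(String[] args) {
        // Every HDFS RPC issued from this thread after this call carries the
        // context string, which the NameNode records in its audit log, letting
        // operators correlate low-level filesystem activity back to the job.
        CallerContext.setCurrent(
            new CallerContext.Builder("nightly-etl,run=42").build());

        // ... submit the job / perform HDFS operations here ...
    }
}
```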
Apache Hive is an enterprise data warehouse built on top of Hadoop. Hive supports Insert/Update/Delete SQL statements with transactional semantics and read operations that run at snapshot isolation. This talk will describe the intended use cases, the architecture of the implementation, new features such as the SQL MERGE statement, and recent improvements. The talk will also cover the Streaming Ingest API, which allows writing batches of events into a Hive table without using SQL. This API is used by Apache NiFi, Storm and Flume to stream data directly into Hive tables and make it visible to readers in near real time.
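A minimal sketch of the Streaming Ingest API flow (the hive-hcatalog-streaming variant): the metastore URI, table and column names are placeholders, and the target table is assumed to be transactional, bucketed, and unpartitioned.

```java
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class StreamEvents {
    public static void main(String[] args) throws Exception {
        // Endpoint identifies the target table; null = no partition values.
        HiveEndPoint endPoint = new HiveEndPoint(
            "thrift://metastore-host:9083", "default", "events", null);
        StreamingConnection conn = endPoint.newConnection(true);
        DelimitedInputWriter writer = new DelimitedInputWriter(
            new String[]{"id", "msg"}, ",", endPoint);

        // Writes are grouped into transaction batches; each commit makes the
        // rows visible to readers at snapshot isolation.
        TransactionBatch batch = conn.fetchTransactionBatch(10, writer);
        batch.beginNextTransaction();
        batch.write("1,hello".getBytes());
        batch.write("2,world".getBytes());
        batch.commit();
        batch.close();
        conn.close();
    }
}
```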
As organizations pursue Big Data initiatives to capture new opportunities for data-driven insights, data governance has become table stakes, both for external regulatory compliance and for business value extraction within an enterprise. This session will introduce Apache Atlas, a project that was incubated by Hortonworks along with a group of industry leaders across several verticals, including financial services, healthcare, pharma, oil and gas, retail and insurance, to help address data governance and metadata needs with an open, extensible platform governed under the aegis of the Apache Software Foundation. Apache Atlas empowers organizations to harvest metadata across the data ecosystem and to govern and curate data lakes by applying consistent data classification with a centralized metadata catalog. In this talk, we will present the underpinnings of the architecture of Apache Atlas and conclude with a tour of its governance capabilities as we showcase various features for open metadata modeling, data classification, and visualizing cross-component lineage and impact. We will also demo how Apache Atlas delivers a complete view of data movement across several analytic engines such as Apache Hive, Apache Storm and Apache Kafka, along with capabilities to effectively classify and discover datasets.
As HBase and Hadoop continue to become routine across enterprises, these enterprises inevitably shift priorities from effective deployments to cost-efficient operations. Consolidation of infrastructure, the sum of hardware, software, and system-administrator effort, is the most common strategy to reduce costs. As a company grows, the number of business organizations, development teams, and individuals accessing HBase grows commensurately, creating a not-so-simple requirement: HBase must effectively service many users, each with a variety of use cases. This problem is known as multi-tenancy. While multi-tenancy isn’t a new problem, it also isn’t a solved one, in HBase or otherwise. This talk will present a high-level view of the common issues organizations face when multiple users and teams share a single HBase instance and how certain HBase features were designed specifically to mitigate the issues created by the sharing of finite resources.
Apache Ambari is now the preferred way of provisioning, managing and monitoring Hadoop clusters. Ambari helps users manage Hadoop clusters, simplifying actions such as upgrades, configuration management, service management, etc. From release 2.0, Ambari started supporting automated Rolling Upgrades. This was further enhanced in release 2.2.0.0 with support for Express Upgrades, which allow users to upgrade large-scale clusters faster, but require cluster downtime. This talk will cover planning and execution of Hadoop cluster upgrades from an operational perspective. The talk will also cover the internals of the upgrade process, including the various stages such as pre-upgrade, backup, service checks, configuration upgrades, and finalization. Finally, the talk will cover troubleshooting upgrade failures, monitoring services during upgrades, and post-upgrade actions. The presentation will conclude with a case study that covers how the upgrade process works on a large cluster (including aspects such as planning the upgrade, the amount of time required for the various stages, and troubleshooting).
This document describes the implementation of data replication in Apache Accumulo. It discusses justifying the need for replication to handle failures, describes how replication is implemented using write-ahead logs, and outlines future work including replicating to other systems and improving consistency.
This document provides an overview of using Apache NiFi to build data pipelines that index data into Apache Solr. It introduces NiFi and its capabilities for data routing, transformation and monitoring. It describes how Solr accepts data through different update handlers like XML, JSON and CSV. It demonstrates how NiFi processors can be used to stream data to Solr via these update handlers. Example use cases are presented for indexing tweets, commands, logs and databases into Solr collections. Future enhancements are discussed like parsing documents and distributing commands across a Solr cluster.
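For context on what NiFi's Solr processors do over HTTP, here is the same indexing operation expressed with SolrJ; the Solr URL, collection name and fields are illustrative, and SolrJ is used here only as a stand-in for the update handlers the NiFi flow would post to.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexTweet {
    public static void main(String[] args) throws Exception {
        // A NiFi processor such as PutSolrContentStream posts to the same
        // update handlers that SolrJ uses under the covers.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/tweets").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "tweet-1");
            doc.addField("text_t", "hello from nifi");
            solr.add(doc);
            solr.commit();  // make the document visible to searchers
        }
    }
}
```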
Apache HBase is an open source, non-relational, distributed datastore modeled after Google’s Bigtable, that runs on top of the Apache Hadoop Distributed Filesystem and provides low-latency random-access storage for HDFS-based compute platforms like Apache Hadoop and Apache Spark. Apache Phoenix is a high performance relational database layer over HBase optimized for low latency applications. This session will explore how the Data Platform and Services group at Salesforce.com supports teams of application developers accustomed to structured relational data access, while surfacing additional advantages of the underlying flexible scale-out datastore.
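As a sketch of the relational access pattern Phoenix layers over HBase (the ZooKeeper quorum, table and column names below are placeholders, and the Phoenix client JAR is assumed to be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQuickstart {
    public static void main(String[] args) throws Exception {
        // The thick JDBC driver locates HBase via the ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS users ("
                + "id BIGINT NOT NULL PRIMARY KEY, email VARCHAR, name VARCHAR)");
            stmt.execute("UPSERT INTO users VALUES (1, 'a@example.com', 'Ada')");
            conn.commit();  // Phoenix buffers mutations until commit
            try (ResultSet rs = stmt.executeQuery(
                     "SELECT name FROM users WHERE id = 1")) {
                while (rs.next()) System.out.println(rs.getString(1));
            }
        }
    }
}
```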
This document discusses secondary indexing in HBase. It introduces HBase and how scans work. Secondary indexes are created to avoid full table scans by storing the row keys corresponding to indexed column values in a separate table. Coprocessors can be used to implement secondary indexes by assigning index regions and keeping them in sync with data regions. Benchmark results show that secondary indexing improves query performance at the cost of extra storage and the processing overhead of the coprocessors. Challenges include indexing across regions and handling region splits.
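A deliberately simplified sketch of the coprocessor-based approach, written against the HBase 1.x observer API: the tables, family and qualifier names are invented, and a production design would also handle deletes, updates of old index entries, failures, and region splits.

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Keeps an "email -> user row key" index table in sync with puts on the
// data table, so queries by email avoid a full table scan.
public class EmailIndexObserver extends BaseRegionObserver {
    private static final byte[] CF = Bytes.toBytes("d");
    private static final byte[] EMAIL = Bytes.toBytes("email");
    private static final TableName INDEX = TableName.valueOf("users_by_email");

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        List<Cell> cells = put.get(CF, EMAIL);
        if (cells.isEmpty()) return;
        byte[] email = CellUtil.cloneValue(cells.get(0));
        try (Table index = ctx.getEnvironment().getTable(INDEX)) {
            // The index row key is the indexed value; the data row key is
            // stored as a column so lookups can jump back to the data table.
            Put indexPut = new Put(email);
            indexPut.addColumn(CF, Bytes.toBytes("rowkey"), put.getRow());
            index.put(indexPut);
        }
    }
}
```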
Transform Salesforce into the system of engagement for your big data. Discuss best practices and lessons learned in accessing external data sets in Hadoop or Spark using Salesforce Connect. Leave the big data sets behind the firewall, and get on-demand access for your users to big data insights using external objects with Salesforce Connect. In this session we will cover:
- Intro to Salesforce Connect
- Intro to the Big Data landscape
- How to connect Salesforce to Big Data using External Data Sources
- Lessons learned accessing Big Data using External Objects for native reporting, writes, lookups, search and more
- Resources (how to learn more)
Vladimir Rodionov (Hortonworks). Time-series applications (sensor data, application/system logging events, user interactions, etc.) present a new set of data storage challenges: very high velocity and very high volume of data. This talk will present recent developments in Apache HBase that make it a good fit for time-series applications.
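Independent of the specific HBase features the talk covers, a common pattern for time-series row keys combines a salt byte (to spread hot sequential writes across regions) with a reversed timestamp (so the newest points sort first). A sketch with invented names:

```java
import org.apache.hadoop.hbase.util.Bytes;

public class TsRowKey {
    static final int NUM_BUCKETS = 16;

    // Row key layout: [salt][metricId][Long.MAX_VALUE - timestamp]
    static byte[] rowKey(String metricId, long timestampMillis) {
        byte salt = (byte) (Math.abs(metricId.hashCode()) % NUM_BUCKETS);
        long reversedTs = Long.MAX_VALUE - timestampMillis;  // newest first
        return Bytes.add(new byte[]{salt},
                         Bytes.toBytes(metricId),
                         Bytes.toBytes(reversedTs));
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toStringBinary(
            rowKey("host1.cpu.user", System.currentTimeMillis())));
    }
}
```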
HBase, as the NoSQL database of choice in the Hadoop ecosystem, has already proven itself at scale and in many mission-critical workloads in hundreds of companies. Phoenix, as the SQL layer on top of HBase, has increasingly become the tool of choice and the perfect complement to HBase. Phoenix is now being used more and more for super-low-latency querying and fast analytics across a large number of users in production deployments. In this talk, we will cover what makes Phoenix attractive among current and prospective HBase users, like SQL support, JDBC, data modeling, secondary indexing, UDFs, and also go over recent improvements like the Query Server, ODBC drivers, ACID transactions, Spark integration, etc. We will conclude by looking into items in the pipeline and how Phoenix and HBase interact with other engines like Hive and Spark.
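To make the secondary-indexing support concrete: Phoenix lets you declare an index in SQL and maintains it transparently on writes. A sketch, reusing the hypothetical users table and connection URL from the earlier Phoenix example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePhoenixIndex {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            // A covered index: queries that filter on email and select only
            // name can be served entirely from the index table.
            stmt.execute("CREATE INDEX IF NOT EXISTS users_email_idx "
                + "ON users (email) INCLUDE (name)");
        }
    }
}
```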
The document summarizes Apache Phoenix and HBase as an enterprise data warehouse solution. It discusses how Phoenix provides OLTP and analytics capabilities over HBase. It then covers various use cases where companies are using Phoenix and HBase, including for web analytics and time series data. Finally, it discusses optimizations that can be made to the schema design, queries, and writes in Phoenix to improve performance.
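As one example of the schema-design optimizations mentioned, salting pre-splits a table and prefixes row keys with a hash byte, which avoids region hotspotting when keys (such as timestamps) increase monotonically. The DDL and names below are illustrative:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SaltedTimeSeriesTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            // SALT_BUCKETS spreads sequential writes across region servers at
            // the cost of fanning out range scans over the salt buckets.
            stmt.execute("CREATE TABLE IF NOT EXISTS metrics ("
                + "host VARCHAR NOT NULL, ts DATE NOT NULL, val DOUBLE "
                + "CONSTRAINT pk PRIMARY KEY (host, ts)) SALT_BUCKETS = 16");
        }
    }
}
```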
This document provides an overview of Apache HBase and Apache Phoenix. It discusses how HBase is a scalable, non-relational database that can store large volumes of data across commodity servers. Phoenix provides a SQL interface for HBase, allowing users to interact with HBase data using familiar SQL queries and functions. The document outlines new features in Phoenix for HDP 2.2, including improved support for secondary indexes and basic window functions.