At the end of the day, the only thing data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights rather than preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh... and there are a bunch more data sources that you need to ingest, and the current providers of data are changing their structure.
At GoPro, we have massive amounts of heterogeneous data being streamed at us from our consumer devices
and applications, and we have developed a concept of "dynamic DDL" to structure our streamed data on the fly using
Spark Streaming, Kafka, HBase, Hive, and S3. The idea is simple. Add structure (schema) to the data as soon as possible.
Allow the providers of the data to dictate the structure. And automatically create event-based and state-based tables (DDL)
for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
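To make the idea concrete, here is a minimal PySpark sketch of the "add schema as early as possible" pattern - not GoPro's actual pipeline; the Kafka topic, broker, bucket paths, and sample event are hypothetical, and it assumes the spark-sql-kafka package is available:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json

spark = SparkSession.builder.appName("dynamic-ddl-sketch").getOrCreate()

# Read raw events from Kafka (topic and broker are placeholders).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "device-events")
    .load()
)

# Let the data provider dictate the structure: infer the schema from a
# sample event instead of hard-coding it.
sample_event = '{"device_id": "cam-001", "ts": "2017-06-01T12:00:00Z", "battery": 87}'
event_schema = spark.read.json(
    spark.sparkContext.parallelize([sample_event])
).schema

# Apply that schema to the stream so it becomes tabular immediately.
events = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json("json", event_schema).alias("event"))
    .select("event.*")
)

# Persist to a queryable location so analysts can reach it via SQL quickly.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/events/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
    .outputMode("append")
    .start()
)
```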
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...DataWorks Summit
Today enterprises desire to move more and more of their data lakes to the cloud to help them execute faster, increase productivity, drive innovation while leveraging the scale and flexibility of the cloud. However, such gains come with risks and challenges in the areas of data security, privacy, and governance. In this talk we cover how enterprises can overcome governance and security obstacles to leverage these new advances that the cloud can provide to ease the management of their data lakes in the cloud. We will also show how the enterprise can have consistent governance and security controls in the cloud for their ephemeral analytic workloads in a multi-cluster cloud environment without sacrificing any of the data security and privacy/compliance needs that their business context demands. Additionally, we will outline some use cases and patterns as well as best practices to rationally manage such a multi-cluster data lake infrastructure in the cloud.
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudDataWorks Summit
This document discusses how organizations can leverage data and analytics to power their business models. It provides examples of Fortune 100 companies that are using Attunity products to build data lakes and ingest data from SAP and other sources into Hadoop, Apache Kafka, and the cloud in order to perform real-time analytics. The document outlines the benefits of Attunity's data replication tools for extracting, transforming, and loading SAP and other enterprise data into data lakes and data warehouses.
Cloudy with a chance of Hadoop - real world considerationsDataWorks Summit
Over the last eighteen months, we have seen significant adoption of Hadoop eco-system centric big data processing in Microsoft Azure and Amazon AWS. In this talk we present some of the lessons learned and architectural considerations for cloud-based deployments including security, fault tolerance and auto-scaling.
We look at how Hortonworks Data Cloud and Cloudbreak can automate that scaling of Hadoop clusters, showing how it can react dynamically to workloads, and what that can deliver in cost-effective Hadoop-in-cloud deployments.
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsDataWorks Summit
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics - Apache Spark's in-memory capabilities catapulted it to become the premier processing framework for Hadoop. Apache Ignite and Alluxio, both high-performance, integrated, and distributed in-memory platforms, take Apache Spark to the next level by providing an even more powerful, faster, and more scalable platform for the most demanding data processing and analytic environments.
Speaker
Irfan Elahi, Consultant, Deloitte
Hadoop clusters can store nearly everything in your data lake cheaply and blazingly fast. Answering questions and gaining insights from this ever-growing stream becomes the decisive part for many businesses. Increasingly, data has a natural structure as a graph, with vertices linked by edges, and many questions about the data involve graph traversals or other complex queries for which one does not have an a priori bound on the length of paths.
Spark with GraphX is great for answering relatively simple graph questions that are worth starting a Spark job for, because they essentially involve the whole graph. But does it make sense to start one for every ad-hoc query, and is it suitable for complex real-time queries?
In this talk I will introduce an alternative solution that adds those features to an existing Hadoop/Spark setup and enables real-time insights. I will address the following topics:
* Challenges in gaining deeper insights from large amounts of graph data
* Benefits and limitations of graph analysis with Spark
* Introduction to ArangoDB SmartGraphs
* Deployment of Hadoop, Spark and ArangoDB using DC/OS
* Performing complex queries on billions of nodes and edges leveraging ArangoDB SmartGraphs (Live Demo)
Treat your enterprise data lake indigestion: Enterprise ready security and go...DataWorks Summit
Most enterprises with large data lakes today are flying blind when it comes to understanding how the data in their data lakes is organized, accessed, and utilized to create real business value. Coupled with the need to democratize data, enterprises often realize they have created a data swamp loaded with all kinds of data assets, without any curation and without appropriate security controls, hoping that developers and analysts can responsibly collaborate to generate insights. In this talk we will provide a broad overview of how organizations can use open source frameworks such as Apache Ranger and Apache Knox to secure their data lakes and Apache Atlas to effectively provide open metadata and governance services for the Hadoop ecosystem. We will provide an overview of the new features that have been added in each of these Apache projects recently and how enterprises can leverage these new features to build a robust security and governance model for their data lakes.
Speaker
Owen O'Malley, Co-Founder & Technical Fellow, Hortonworks
The document discusses how EMC Isilon scale-out NAS storage improves Hadoop resiliency and operational efficiency. It analyzes the impact of DataNode and TaskTracker failures on Hadoop jobs. EMC Isilon provides high availability, independent scalability of storage and compute, data protection features, and support for multiple Hadoop distributions and protocols like HDFS, NFS, SMB. This allows using existing data for analysis without replication and reduces time-to-results for Hadoop jobs.
This document discusses Hadoop integration with cloud storage. It describes the Hadoop-compatible file system architecture, which allows Hadoop applications to work with both HDFS and cloud storage transparently. Recent enhancements to the S3A file system connector for Amazon S3 are discussed, including performance improvements and support for encryption. Benchmark results show significant performance gains for Hive queries with S3A compared to earlier versions. Upcoming work on output committers, object store abstraction, and consistency is outlined.
The challenge of computing big data for evolving digital business processes demands a variety of computation techniques and engines (SQL, OLAP, time-series, graph, document store) working in a unified framework. A simple architecture for data transformations, together with security, governance, and operational administration, is a necessary critical component for enterprise production environments supporting day-to-day business processes. In this session, you will learn about best practices and critical components to ensure business value from the latest production deployments. Hear how existing customers are using SAP Vora and the value they have achieved so far with this in-memory engine for distributed data processing. The session provides you with a clear understanding of how SAP Vora and open source components like Apache Hadoop and Apache Spark offer an architecture that supports a wide variety of use cases and industries. You will also receive very useful insight into where to find development resources, test-drive demos, and general documentation.
The document discusses deploying Hadoop in the cloud. Some key benefits of using Hadoop in the cloud include scalability, automated failover of replicated data, and cost efficiency through distributed processing and storage. Microsoft's Azure HDInsight offering provides a fully managed Hadoop and Spark service in the cloud that allows clusters to be provisioned in minutes and is optimized for analytics workloads. The Cortana Intelligence Suite integrates big data technologies like HDInsight with machine learning and data processing tools.
Today enterprises desire to move more and more of their data lakes to the cloud to help them execute faster, increase productivity, drive innovation while leveraging the scale and flexibility of the cloud. However, such gains come with risks and challenges in the areas of data security, privacy, and governance. In this talk we cover how enterprises can overcome governance and security obstacles to leverage these new advances that the cloud can provide to ease the management of their data lakes in the cloud. We will also show how the enterprise can have consistent governance and security controls in the cloud for their ephemeral analytic workloads in a multi-cluster cloud environment without sacrificing any of the data security and privacy/compliance needs that their business context demands. Additionally, we will outline some use cases and patterns as well as best practices to rationally manage such a multi-cluster data lake infrastructure in the cloud.
Speaker:
Jeff Sposetti, Product Management, Hortonworks
Big SQL: Powerful SQL Optimization - Re-Imagined for open sourceDataWorks Summit
Let's be honest - there are some pretty amazing capabilities locked in proprietary SQL engines which have had decades of R&D baked into them. In this session, learn how IBM, working with the Apache community, has unlocked the value of its SQL optimizer for Hive, HBase, ObjectStore, and Spark - helping customers avoid lock-in while providing the best performance, concurrency, and scalability for complex, analytical SQL workloads. You'll also learn how the SQL engine was extended and integrated with Ambari, Ranger, YARN/Slider, and HBase. We share the results of this project, which has enabled running all 99 TPC-DS queries at a world-record-breaking 100 TB scale factor.
1. HAWQ is an open source MPP database for Hadoop that provides SQL querying capabilities and integration with data in HDFS and other sources.
2. It uses a master-segment architecture with dynamic resource management through YARN to enable high performance SQL queries across large datasets.
3. The document discusses HAWQ's architecture, performance advantages, extensions for querying external data through PXF, and integration with Hive through different connectors and a unified catalog.
Protecting your Critical Hadoop Clusters Against DisastersDataWorks Summit
Our enterprise customers are deploying business-critical applications on Hadoop clusters and now want a business continuity solution that will protect against disasters and cover both processed and unstructured data with varying recovery point objective (RPO) requirements. Our customers are also asking for backup and restore of select unstructured data and databases, in case of accidental deletion by users. They are asking us to automagically tier and move data that becomes less frequently accessed over time to high-density, slower media or to the cloud. We will unveil a product suite that is going to solve those customer pain points in phases, starting with disaster recovery of the Hadoop ecosystem with single-source-of-truth enforcement. We will also cover the deep-dive architecture that required extensive changes in Hive, HDFS, Ranger, and Atlas (more in the pipeline) and demonstrate the end-to-end functioning of our data lifecycle management.
Speakers:
Jeff Sposetti, Product Management, Hortonworks
Venkat Ranganathan, Director of Engineering, Hortonworks
This document discusses enterprise-grade big data solutions from HPE. It outlines HPE's reference architecture for big data workloads including components like data lakes, data warehouses, archival storage, event processing, and in-memory analytics. It also discusses HPE's investments in Hortonworks and collaboration to optimize Hadoop for performance. The document promotes attending an HPE session at the Hadoop Summit on modernizing data warehouses and visiting the HPE booth for demos and a trivia game.
Format Wars: from VHS and Beta to Avro and ParquetDataWorks Summit
The document discusses different data storage formats such as text, Avro, Parquet, and their suitability for writing and reading data. It provides examples of how to choose a format based on factors like query needs, data types, and whether schemas need to evolve. The document also demonstrates how Avro can handle schema evolution by adding or changing fields while still reading existing data.
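As a small illustration of the schema-evolution point, here is a hedged Python sketch using the fastavro library (the record and field names are hypothetical): a record written with an old schema is read back with a newer schema that adds a defaulted field.

```python
import io
import fastavro

# Writer's (old) schema.
writer_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "battery", "type": "int"},
    ],
}

# Reader's (new) schema adds a field with a default, so old records still resolve.
reader_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "battery", "type": "int"},
        {"name": "firmware", "type": "string", "default": "unknown"},
    ],
}

buf = io.BytesIO()
fastavro.writer(buf, writer_schema, [{"device_id": "cam-001", "battery": 87}])
buf.seek(0)

# Old data read with the new schema: the missing field takes its default value.
for record in fastavro.reader(buf, reader_schema=reader_schema):
    print(record)  # {'device_id': 'cam-001', 'battery': 87, 'firmware': 'unknown'}
```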
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
This document discusses challenges and solutions for using object storage with Apache Spark and Hive. It covers:
- Eventual consistency issues in object storage and lack of atomic operations
- Improving performance of object storage connectors through caching, optimized metadata operations, and consistency guarantees
- Techniques like S3Guard and committers that address consistency and correctness problems with output commits in object storage
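As a rough illustration of the committer point, the sketch below configures a PySpark session to use an S3A staging committer instead of rename-based commits; it assumes Hadoop 3.1+ S3A committers and Spark's optional spark-hadoop-cloud module are on the classpath, and the bucket path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-committer-sketch")
    # Use the "directory" staging committer rather than rename-based commits.
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    # Bind Spark's commit protocol to the S3A committer factory.
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

df = spark.range(1000).withColumnRenamed("id", "event_id")
# Output is committed via multipart uploads, avoiding slow and non-atomic
# directory renames on the object store.
df.write.mode("overwrite").parquet("s3a://example-bucket/events/")
```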
This document discusses designing a new big data platform to replace an existing complex and outdated one. It analyzes challenges with the current platform, including inability to keep up with business needs. The proposed new platform called Dredge would use abstraction layers to integrate big data tools in a loosely coupled and scalable way. This would simplify development and maintenance while supporting business goals. Key aspects of Dredge include declarative configuration, logical workflows, and plug-and-play integration of tools like HDFS, Hive, HBase, Kafka and Spark in a reusable and event-driven manner. The new platform aims to improve scalability, reduce costs and better support analytics needs over time.
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
Back in 2014, our team set out to change the way the world exchanges and collaborates with data. Our vision was to build a single tenant environment for multiple organisations to securely share and consume data. And we did just that, leveraging multiple Hadoop technologies to help our infrastructure scale quickly and securely.
Today Data Republic’s technology delivers a trusted platform for hundreds of enterprise level companies to securely exchange, commercialise and collaborate with large datasets.
Join Head of Engineering, Juan Delard de Rigoulières and Senior Solutions Architect, Amin Abbaspour as they share key lessons from their team’s journey with Hadoop:
* How a startup leveraged a clever combination of Hadoop technologies to build a secure data exchange platform
* How Hadoop technologies helped us deliver key solutions around governance, security and controls of data and metadata
* An evaluation of the maturity and usefulness of some Hadoop technologies in our environment: Hive, HDFS, Spark, Ranger, Atlas, Knox, Kylin - we've used them all extensively.
* Our bold approach of exposing APIs directly to end users, as well as the challenges, learnings, and code we created in the process
* Learnings from the front-line: How our team coped with code changes, performance tuning, issues and solutions while building our data exchange
Whether you're an enterprise-level business or a start-up looking to scale, this case study discussion offers behind-the-scenes lessons and key tips for using Hadoop technologies to manage data governance and collaboration in the cloud.
Speakers:
Juan Delard De Rigoulieres, Head of Engineering, Data Republic Pty Ltd
Amin Abbaspour, Senior Solutions Architect, Data Republic
The document provides an overview of new features in Apache Ambari 2.1, including rolling upgrades, alerts, metrics, an enhanced dashboard, smart configurations, views, Kerberos automation, and blueprints. Key highlights include the ability to perform rolling upgrades of Hadoop clusters without downtime by managing different software versions side-by-side, new alert types and a user interface for viewing and customizing alerts, integration of a metrics service for collecting and querying metrics from Hadoop services, customizable service dashboards with new widget types, smart configurations that provide recommended values and validate configurations based on cluster attributes and dependencies, and automated Kerberos configuration.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
At the end of day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
Adding structure to your streaming pipelines: moving from Spark streaming to ...DataWorks Summit
How do you go from a strictly typed object-based streaming pipeline with simple operations to a structured streaming pipeline with higher order complex relational operations? This is what the Data Engineering team did at GoPro to scale up the development of streaming pipelines for the rapidly growing number of devices and applications.
When big data frameworks such as Hadoop first came into existence, developers were happy because they could finally process large amounts of data without writing complex multi-threaded code or, worse yet, complicated distributed code. Unfortunately, only very simple operations were available, such as map and reduce. Almost immediately, higher-level operations were desired, similar to relational operations. And so Hive and dozens (hundreds?) of SQL-based big data tools became available for more developer-efficient batch processing of massive amounts of data.
In recent years, big data has moved from batch processing to stream-based processing since no one wants to wait hours or days to gain insights. Dozens of stream processing frameworks exist today and the same trend that occurred in the batch-based big data processing realm has taken place in the streaming world, so that nearly every streaming framework now supports higher level relational operations.
In this talk, we will discuss in a very hands-on manner how the streaming data pipelines for GoPro devices and apps have moved from the original Spark streaming with its simple RDD-based operations in Spark 1.x to Spark's structured streaming with its higher level relational operations in Spark 2.x. We will talk about the differences, advantages, and necessary pain points that must be addressed in order to scale relational-based streaming pipelines for massive IoT streams. We will also talk about moving from “hand built” Hadoop/Spark clusters running in the cloud to using a Spark-based cloud service. DAVID WINTERS, Big Data Architect, GoPro and HAO ZOU, Senior Software Engineer, GoPro
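For readers unfamiliar with the two APIs, here is a minimal, hedged sketch (not GoPro's code; topic and broker names are hypothetical) contrasting the Spark 1.x DStream style, shown in comments, with an equivalent Spark 2.x structured streaming aggregation:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object, window

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Spark 1.x style (RDD-based DStreams), for comparison only:
#   stream = KafkaUtils.createDirectStream(ssc, ["device-events"], kafka_params)
#   counts = (stream.map(lambda msg: (extract_device_id(msg), 1))
#                   .reduceByKey(lambda a, b: a + b))

# Spark 2.x structured streaming: the same pipeline expressed as relational
# operations on an unbounded DataFrame, including event-time windows.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "device-events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json", "timestamp")
    .withColumn("device_id", get_json_object(col("json"), "$.device_id"))
)

counts = events.groupBy(
    window(col("timestamp"), "10 minutes"), col("device_id")
).count()

query = counts.writeStream.outputMode("update").format("console").start()
```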
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance, and you'll hear from a specific customer about their use case, taking advantage of fast performance on enormous datasets and leveraging economies of scale on the AWS platform.
Speakers:
Ian Meyers, AWS Solutions Architect
Toby Moore, Chief Technology Officer, Space Ape
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and learn how to tune Spark SQL performance.
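As a tiny illustration of the kind of tuning the talk covers, the hedged sketch below (table contents are synthetic) inspects a query plan produced by Catalyst, forces a broadcast join for a small dimension table, and adjusts shuffle parallelism:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("spark-sql-tuning-sketch").getOrCreate()

# Fewer shuffle partitions for a modest dataset avoids many tiny tasks.
spark.conf.set("spark.sql.shuffle.partitions", "64")

facts = spark.range(10_000_000).withColumnRenamed("id", "user_id")
dims = spark.range(1_000).withColumnRenamed("id", "user_id")

# Hint a broadcast hash join for the small side instead of a shuffle join.
joined = facts.join(broadcast(dims), "user_id")

# explain(True) prints the parsed, analyzed, optimized, and physical plans.
joined.explain(True)
```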
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
Challenges in Data Analytics:
Different application scenarios need different storage solutions: HBase is ideal for point-query scenarios but unsuitable for multi-dimensional queries. MPP is suitable for data warehouse scenarios, but the engine and data are coupled together, which hampers scalability. OLAP stores used in BI applications perform best for aggregate queries, but full-scan queries perform sub-optimally. Moreover, they are not suitable for real-time analysis. These distinct systems lead to low resource sharing and need different pipelines for data and application management.
At wetter.com we build analytical B2B data products and heavily use Spark and AWS technologies for data processing and analytics. I explain why we moved from AWS EMR to Databricks and Delta and share our experiences from different angles like architecture, application logic and user experience. We will look at how security, cluster configuration, resource consumption and workflow changed by using Databricks clusters, as well as how using Delta tables simplified our application logic and data operations.
A summarized version of a presentation on Big Data architecture, covering everything from the Big Data concept to Hadoop and tools like Hive, Pig and Cassandra
This document discusses logging scenarios using DynamoDB and Elastic MapReduce. It covers collecting log data in real-time using tools like Fluentd and storing it in DynamoDB. It then describes using EMR to perform ETL processes on the data, extracting from DynamoDB, transforming the data across EC2 instances, and loading to S3 or DynamoDB. Finally, it discusses analyzing the data using Redshift for queries or CloudSearch for search capabilities.
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
The document discusses integrating Couchbase NoSQL with Apache Spark for augmenting operational databases with analytics. It outlines architectural alignment between Couchbase and Spark, including automatic data sharding and locality, data streaming replication from Couchbase to Spark, predicate pushdown to Couchbase global indexes from Spark, and flexible schemas. Integration points discussed include using the Couchbase data locality hints in Spark, limitations on predicate pushdown for Couchbase views and N1QL, and using the Couchbase change data capture protocol for low-latency data streaming into Spark Streaming.
Technologies for Data Analytics PlatformN Masahiro
This document discusses building a data analytics platform and summarizes various technologies that can be used. It begins by outlining reasons for analyzing data like reporting, monitoring, and exploratory analysis. It then discusses using relational databases, parallel databases, Hadoop, and columnar storage to store and process large volumes of data. Streaming technologies like Storm, Kafka, and services like Redshift, BigQuery, and Treasure Data are also summarized as options for a complete analytics platform.
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
Nasdaq has extended its use of Amazon Redshift to include Amazon EMR and Amazon S3 in order to better manage storage and compute resources separately. Data is ingested into Redshift and then transformed and unloaded to S3. EMR is then used to convert the data to Parquet format and write it to S3 partitioned by date. The data in S3 is accessed using Presto with encryption at rest. Hive is used to manage schemas and partitions across data sources. Tools were developed to help with encryption, schema management, and data migrations between systems while maintaining security and performance.
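A hedged sketch of the "unload, then convert to partitioned Parquet" step is shown below; the bucket paths, delimiter, and column names are hypothetical, not Nasdaq's actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("unload-to-parquet-sketch").getOrCreate()

# Files previously produced by a Redshift UNLOAD ... TO 's3://...' statement.
trades = (
    spark.read.option("delimiter", "|")
    .csv("s3a://example-bucket/unload/trades/")
    .toDF("trade_id", "symbol", "price", "executed_at")
)

# Convert to columnar Parquet, partitioned by date so downstream engines
# (Presto, Hive) can prune partitions.
(
    trades.withColumn("trade_date", to_date(col("executed_at")))
    .write.mode("append")
    .partitionBy("trade_date")
    .parquet("s3a://example-bucket/warehouse/trades_parquet/")
)
```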
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...Amazon Web Services
In this session, you will learn the key differences between a relational database management service (RDBMS) and non-relational (NoSQL) databases like Amazon DynamoDB. You will learn about suitable and unsuitable use cases for NoSQL databases. You'll learn strategies for migrating from an RDBMS to DynamoDB through a 5-phase, iterative approach. See how Sony migrated an on-premises MySQL database to the cloud with Amazon DynamoDB, and see the results of this migration.
Serverless Cloud Data Lake with Spark for Serving Weather Data
1) The document discusses using a serverless architecture with IBM Cloud services like SQL Query powered by Spark, Cloud Object Storage, and Cloud Functions to build a cost-effective cloud data lake for serving historical weather data on demand.
2) It describes how data skipping techniques and geospatial indexes in SQL Query can accelerate queries by an order of magnitude by pruning irrelevant data.
3) The new serverless solution provides unlimited storage, global coverage, and supports large queries for machine learning and analytics at an order of magnitude lower cost than the previous implementation.
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
This document discusses engineering machine learning data pipelines and addresses five big challenges: 1) scattered and difficult to access data, 2) data cleansing at scale, 3) entity resolution, 4) tracking data lineage, and 5) ongoing real-time changed data capture and streaming. It presents DMX Change Data Capture as a solution to capture changes from various data sources and replicate them in real-time to targets like Kafka, HDFS, databases and data lakes to feed machine learning models. Case studies demonstrate how DMX-h has helped customers like a global hotel chain and insurance and healthcare companies build scalable data pipelines.
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Similar to Dynamic DDL: Adding structure to streaming IoT data on the fly (20)
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick, hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through everything from setting up your environment to exploring datasets and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, know what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of Python is highly recommended.
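In the spirit of the workshop exercises (though not the actual lab code), a minimal scikit-learn train-and-evaluate example looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a popular dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a classifier and evaluate it on unseen data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```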
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" that can be embedded inside HBase and provides the level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
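As a rough illustration of querying such a Phoenix table from Python (not the actual Spring Boot service; the query-server URL, table, and column names are hypothetical), one could go through the Phoenix Query Server with the phoenixdb driver:

```python
import phoenixdb

# Connect to a Phoenix Query Server endpoint (placeholder URL).
conn = phoenixdb.connect("http://phoenix-query-server:8765/", autocommit=True)
cursor = conn.cursor()

# Hypothetical crime-events table and columns.
cursor.execute(
    "SELECT incident_id, offense_type, occurred_at "
    "FROM CRIME_EVENTS WHERE occurred_at > CURRENT_DATE() - 1 LIMIT 10"
)
for row in cursor.fetchall():
    print(row)

conn.close()
```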
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor is it the simplest to operate. As it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges they have encountered in scaling to support the world catalog, and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
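A minimal MLflow Tracking example of those "few lines of code" might look like the following (the experiment name and model are hypothetical):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

mlflow.set_experiment("demo-experiment")

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C, max_iter=200).fit(X, y)

    # Log a parameter, a metric, and a deployable packaging of the model.
    mlflow.log_param("C", C)
    mlflow.log_metric("train_accuracy", accuracy_score(y, model.predict(X)))
    mlflow.sklearn.log_model(model, "model")
```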
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries to help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process, the challenges we faced supporting data analytics at Twitter scale in the cloud, and our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we deep dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies and enabling real-time customer engagement
● Enhancing loss prevention capabilities and response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations by attaching a deep learning system to a camera stream - for example, spotting item stock levels on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
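For a sense of what "a deep learning system attached to a camera stream" involves, here is a minimal, hedged object-detection sketch using a pretrained torchvision model as a stand-in for a retail-specific detector (the image file is a hypothetical camera frame):

```python
import torch
import torchvision
from PIL import Image
from torchvision.transforms import functional as F

# A single frame standing in for a live camera feed.
frame = Image.open("shelf_camera_frame.jpg").convert("RGB")

# Pretrained COCO detector; a production system would use a retail-tuned model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

with torch.no_grad():
    predictions = model([F.to_tensor(frame)])[0]

# Keep only confident detections (bounding box, class id, score).
for box, label, score in zip(
    predictions["boxes"], predictions["labels"], predictions["scores"]
):
    if score > 0.8:
        print(label.item(), [round(v, 1) for v in box.tolist()], round(score.item(), 2))
```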
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
It's your unstructured data: How to get your GenAI app to production (and spe...Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
Top 12 AI Technology Trends For 2024.pdfMarrie Morris
Technology has become an irreplaceable component of our daily lives, and AI is revolutionizing it for the better. In this article, we will learn about the top 12 AI technology trends for 2024.
Redefining Cybersecurity with AI CapabilitiesPriyanka Aash
2. OUR SPEAKERS
Hao Zou
Software Engineer
Data Science & Engineering
GoPro
David Winters
Big Data Architect
Data Science & Engineering
GoPro
3. TOPICS TO COVER
• Background and Business
• GoPro Data Platform Architecture
• Old File-based Pipeline Architecture
• New Dynamic DDL Architecture
• Dynamic DDL Deep Dive
• Using Cloud-Based Services (Optional)
• Questions
8. DATA CHALLENGES AT GOPRO
• Variety of data - Hardware and Software products
• Software - Mobile and Desktop Apps
• Hardware - Cameras, Drones, Controllers, Accessories, etc.
• External - CRM, ERP, OTT, E-Commerce, Web, Social, etc.
• Variety of data ingestion mechanisms - Lambda Architecture
• Real-time streaming pipeline - GoPro products
• Batch pipeline - External 3rd party systems
• Complex Transformations
• Data often stored in binary to conserve space in cameras
• Heterogeneous data formats (JSON, XML, and packed binary)
• Seamless Data Aggregations
• Blend data between different sources, hardware, and software
15. PROS AND CONS OF OLD SYSTEM
Pros:
• Isolation of workloads
• Fast ingest
• Secure
• Fast delivery/queries
• Loosely coupled clusters
Cons:
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
16. NEW DYNAMIC DDL ARCHITECTURE
[Architecture diagram: streaming data, a batch induction framework, and downloads (REST API, FTP downloads, S3 sync) flow through the real-time cluster into an Amazon S3 bucket with a centralized Hive Metastore; an ephemeral ETL cluster produces Parquet + DDL (events + state, aggregates), which ephemeral Data Mart clusters #1..#N serve to notebooks, Tableau, Plotly, Python, and R. Dynamic DDL!]
Improvements:
Single copy of data
Separate storage from compute
Elastic clusters
Single long running cluster to maintain
18. NEW DYNAMIC DDL ARCHITECTURE
[Architecture diagram: HTTP traffic enters the streaming cluster through an ELB; the pipeline for processing of streaming logs writes to S3 and registers tables in the centralized Hive Metastore.]
For each topic, dynamically add the table structure and create the table, or insert data into the table if it already exists.
19. DYNAMIC DDL
• What is Dynamic DDL?
• Dynamic DDL is adding structure (schema) to the data on the fly whenever the providers of the data change their structure.
• Why is Dynamic DDL needed?
• Providers of data change their structure constantly. Without Dynamic DDL, the table schema is hard coded and has to be manually updated whenever the incoming data changes.
• All of the aggregation SQL would also have to be manually updated after each schema change.
• Faster turnaround for data ingestion: data can be ingested and made available within minutes (sometimes seconds).
• How did we do this?
• Using Spark SQL/DataFrames
• See the example that follows
20. DYNAMIC DDL
• Example incoming event:
{"_data":{"record":{"id": "1", "first_name": "John", "last_name": "Fork", "state": "California", "city": "san Mateo"}, "log_ts":"2016-07-20T00:06:01Z"}}
• Flatten the data first into a fixed schema (record_key, record_value, id, log_ts):
{"record_key":"state","record_value":"California","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"last_name","record_value":"Fork","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"city","record_value":"san Mateo","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"first_name","record_value":"John","id":"1","log_ts":"2016-07-20T00:06:01Z"}
• Pivot the keys into a dynamically generated schema:
SELECT MAX(CASE WHEN record_key = 'state' THEN record_value ELSE null END) AS data_record_state,
       MAX(CASE WHEN record_key = 'last_name' THEN record_value ELSE null END) AS data_record_last_name,
       MAX(CASE WHEN record_key = 'first_name' THEN record_value ELSE null END) AS data_record_first_name,
       MAX(CASE WHEN record_key = 'city' THEN record_value ELSE null END) AS data_record_city,
       id AS data_record_id, log_ts AS data_log_ts
FROM test
GROUP BY id, log_ts
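A minimal Spark (Scala) sketch of the same flatten-and-pivot idea, assuming an illustrative S3 input path and the nested layout of the example event above; variable and path names are assumptions for illustration, not the production code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("DynamicDDLFlattenPivot").enableHiveSupport().getOrCreate()

// Read raw JSON events shaped like the example above (path is illustrative)
val raw = spark.read.json("s3a://example-bucket/raw/events/")

// Flatten every attribute under _data.record (except id) into the fixed schema
// (record_key, record_value, id, log_ts)
val recordKeys = raw.select("_data.record.*").columns.filterNot(_ == "id")
val flattened = recordKeys.map { k =>
  raw.select(
    lit(k).as("record_key"),
    col(s"_data.record.$k").cast("string").as("record_value"),
    col("_data.record.id").as("id"),
    col("_data.log_ts").as("log_ts"))
}.reduce(_ union _)

// Pivot the keys back out into the dynamically generated wide schema,
// equivalent to the MAX(CASE WHEN ...) SQL above
val dynamic = flattened
  .groupBy(col("id"), col("log_ts"))
  .pivot("record_key")
  .agg(first(col("record_value")))

New attributes appearing in later events simply become new record_key values, and therefore new pivoted columns, without any hand-edited DDL.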
21. DYNAMIC DDL USING SPARK SQL/DATAFRAME
• Code snippet of Dynamic DDL transforming new JSON attributes into relational columns
Add the partition columns
Manually create the table due to a bug in Spark
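A hedged sketch of that manual table creation (the actual code was shown as a screenshot); the helper name, table name, and partition columns are assumptions for illustration:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: create the destination Hive table from the DataFrame's schema if it is missing.
// Creating the table ourselves avoids the Spark/Hive metadata mismatch described in SPARK-9761.
def createTableIfMissing(spark: SparkSession, df: DataFrame, table: String, partitionCols: Seq[String]): Unit = {
  if (!spark.catalog.tableExists(table)) {
    val dataCols = df.schema.fields
      .filterNot(f => partitionCols.contains(f.name))
      .map(f => s"`${f.name}` ${f.dataType.simpleString}")
      .mkString(", ")
    val partSpec = partitionCols.map(c => s"`$c` STRING").mkString(", ")
    spark.sql(s"CREATE TABLE $table ($dataCols) PARTITIONED BY ($partSpec) STORED AS PARQUET")
  }
}

This runs once per topic-derived table, before any data is inserted.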
22. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new columns that exist in the incoming data frame but do not exist yet in the destination table
This ALTER TABLE syntax stopped working after upgrading to Spark 2.x
23. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Three temporary ways to work around the problem in Spark 2.x (a sketch of the first option follows):
• Launch a HiveServer2 service, then issue the ALTER TABLE to Hive over JDBC
• Use Spark to connect directly to the Hive Metastore and update the metadata
• Patch the Spark source code to support the ALTER TABLE syntax and repackage it
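For illustration, a sketch of the first option, assuming the Hive JDBC driver is on the classpath; the host, database, user, and table names are assumptions. The column diff compares the incoming DataFrame against the existing destination table:

import java.sql.DriverManager
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch of workaround #1: push ALTER TABLE ... ADD COLUMNS through HiveServer2 over JDBC,
// since early Spark 2.x could not execute that DDL itself.
def addMissingColumns(spark: SparkSession, incoming: DataFrame, table: String): Unit = {
  val existing = spark.table(table).columns.map(_.toLowerCase).toSet
  val newCols  = incoming.schema.fields.filterNot(f => existing.contains(f.name.toLowerCase))

  if (newCols.nonEmpty) {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver2.example.com:10000/default", "etl_user", "")
    try {
      val colSpec = newCols.map(f => s"`${f.name}` ${f.dataType.simpleString}").mkString(", ")
      conn.createStatement().execute(s"ALTER TABLE $table ADD COLUMNS ($colSpec)")
    } finally {
      conn.close()
    }
  }
}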
24. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Project all columns from the table
Append the data into the destination table
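A minimal sketch of those two steps, assuming illustrative names: the incoming DataFrame is projected onto the destination table's column order (so the positional insert lines up), with columns the batch lacks filled in as nulls, and then appended:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

def appendToTable(spark: SparkSession, incoming: DataFrame, table: String): Unit = {
  val destCols = spark.table(table).columns
  // All attribute columns are strings in this scheme, so missing ones become null strings
  val aligned = incoming.select(destCols.map { c =>
    if (incoming.columns.contains(c)) col(c) else lit(null).cast("string").as(c)
  }: _*)
  aligned.write.mode("append").insertInto(table)   // insertInto matches columns by position
}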
25. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new partition key
• Reprocessing the DDL table with a new partition key (tuning tips; see the sketch below)
Choose the partition key wisely
Use coalesce if there are too many partitions
Use coalesce to control the number of job tasks
Use filter if the data is still too large
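A hedged sketch of such a reprocessing job; the table name, filter window, new partition key, and output path are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RepartitionDDLTable").enableHiveSupport().getOrCreate()
import spark.implicits._

spark.table("analytics.events_state")          // assumed source table
  .filter($"data_log_ts" >= "2016-07-01")      // filter first if the data is still too large
  .coalesce(200)                               // cap the number of tasks and output files
  .write
  .mode("overwrite")
  .partitionBy("data_record_state")            // the assumed new partition key
  .parquet("s3a://example-bucket/events_state_by_state/")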
27. USING S3: WHAT IS S3?
• S3 is not a file system.
• S3 is an object store. Similar to a key-value store.
• S3 objects are presented in a hierarchical view but are not stored in that manner.
• S3 objects are stored with a key derived from a “path”.
• The key is used to fan out the objects across shards.
• The path is for display purposes only. Only the first 3 to 4 characters are used for sharding.
• S3 does not have strong transactional semantics but instead has eventual consistency.
• S3 is not appropriate for realtime updates.
• S3 is suited for longer term storage.
28. USING S3: BEHAVIORS
• S3 has similar behaviors to HDFS but even more extreme.
• Larger latencies
• Larger files/writes – Think GBs
• Write and read latencies are larger but the bandwidth is much larger with S3.
• Thus throughput can be increased with parallel writers (same latency but more throughput through parallel operations).
• Partition your RDDs/DataFrames and increase your workers/executors to optimize the parallelism.
• Each write/read has more overhead due to the web service calls.
• So use larger buffers.
• Match the size of your HDFS buffers/blocks if reading/writing from/to HDFS.
• Collect data for longer durations before writing large buffers in parallel to S3.
• Retry logic – Writes to S3 can and will fail.
• Cannot stream to S3 – Complete files must be uploaded.
• Technically, you can simulate streaming with multipart upload.
29. USING S3: TIPS
• Tips for using S3 with HDFS
• Use the s3a scheme.
• Many optimizations, including buffering options (disk-based, on-heap, or off-heap) and incremental parallel uploads (S3A Fast Upload); a configuration sketch follows this list.
• More here: http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A
• Don’t use rename/move.
• Moves are great for HDFS to support better transactional semantics when streaming files.
• For S3, moves/renames are copy-and-delete operations which can be very slow, especially due to the eventual consistency.
• Other advanced S3 techniques:
• Hash object names to better shard the objects in a bucket.
• Use multiple buckets to increase bandwidth.
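As a concrete illustration of the s3a tuning above, a sketch setting the relevant Hadoop properties from Spark; the property names are the standard S3A options (Hadoop 2.8+), while the values are assumptions to tune per workload:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("S3ATuningExample").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

hadoopConf.set("fs.s3a.fast.upload", "true")          // enable incremental, parallel multipart uploads
hadoopConf.set("fs.s3a.fast.upload.buffer", "disk")   // buffer blocks on disk ("array"/"bytebuffer" for on-/off-heap)
hadoopConf.set("fs.s3a.multipart.size", "104857600")  // 100 MB parts: fewer, larger writes
hadoopConf.set("fs.s3a.connection.maximum", "100")    // more concurrent connections for parallel writers
hadoopConf.set("fs.s3a.attempts.maximum", "10")       // retry failed S3 calls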
GoPro is the ultimate accessory for active people with smartphones.
Products and services include:
Core products -- Cameras
Advanced Solutions – Karma, stabilization and VR
Accessories and Mounts
Software suite that ties it all together
GoPro has become the ultimate, end-to-end storytelling solution.
This is what we wanted to build.
These were the challenges to solve.
High Level Architecture of Data Platform
Isolation of workloads: 3 clusters (ingest, ETL, delivery)
Lambda architecture
Input and output data formats
Cadence of clusters
A word about Data Sources:
IoT data
Logs from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc.
Some Raw and Gzip, Some Binary and JSON
Some streaming and some batch
Batch data
Web marketing, campaigns
Social media
ERP
CRM
Lambda architecture
Both batch and stream processing
Basic needs/workloads in a Data Platform
High throughput ingestion
Transformations: joins, aggregations, etc.
Fast queries
Today, we have 3 clusters to isolate these workloads
We started with one cluster, ETL
Everything ran there
Ingest (Flume)
Batch (Framework)
ETL (Hive)
Analytical (Impala)
Lots of resource contention (I/O, memory, cores)
To alleviate the resource contention, we opted for 3 clusters to isolate the workloads.
Ingest cluster for near real-time streaming
Kafka, Spark Streaming (Cloudera Parcels)
Input: Logs, Output: JSON
Minutes cadence
Moving towards more real-time in seconds
Induction framework for scheduled batch ingestion
ETL cluster for heavy duty aggregation
Input: JSON flat files, Output: Aggregated Parquet files
Hive (Map/Reduce)
Hourly cadence
Secure Data Mart
Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers)
Input: Compressed Parquet files
Analytical SQL engine for Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio)
With all that said, we will examine the newer technologies that will enable us to simplify our architecture and merge clusters in the future.
Kudu is one possible new technology that could help us to consolidate some of the clusters.
Let’s take a deeper dive into our streaming ingestion…
Logs are streamed from devices and software applications (desktop and mobile) to web service endpoint
Endpoint is an elastic pool of Tomcat servers sitting behind ELB in AWS
Custom servlet pushes logs into Kafka topics by environment
A series of Spark streaming jobs process the logs from Kafka
Landing place in ingestion cluster is HDFS with JSON flat files
Rationalization of tech stacks…
Why Kafka?
Unrivaled write throughput for a queue
Traditional queue throughput: 100K writes/sec on the biggest box you can buy
Kafka throughput: 1M writes/sec on 3-4 commodity servers
Strong ordering policy of messages
Distributed
Fault-tolerant through replication
Support synchronous and asynchronous writes
Pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks)
Why Spark Streaming?
Strong transactional semantics - "exactly once" processing
Leverage Spark technology for both data ingest and analytics
Horizontally scalable - High throughput for micro-batching
Large open source community
Keyword: Impedance mismatch
As previously stated, logs are streamed from devices and software applications (desktop and mobile) to web service endpoint
Logs are diverse: gzipped, raw, binary, JSON, batched events, streamed single events
Vary significantly in size from < 1 KB to > 1 MB
Logs are redirected based on data category and routed to appropriate Kafka topic and respective Spark Streaming job
Logs move from Kafka topic to Kafka topic with each Kafka topic having a Spark Streaming job that consumes the log, processes the log, and writes the log to another topic
Tree like structure of jobs with more generic logic towards the root of the tree and more specialized logic moving towards the leaf nodes
There are generic jobs/services and specialized jobs/services
Generic services include PII removal and hashing, IP to Geo lookups, and batched writing to HDFS
We perform batched HDFS writing since Kafka likes small messages (1 KB ideal) and HDFS likes large files (100+ MB)
Specialized services contain business logic
Finally, the logs are written into HDFS as JSON flat files (which are sometimes compressed depending on the type of data)
Scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further heavier aggregations
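A minimal sketch of one such Spark Streaming job against the spark-streaming-kafka-0-10 API; topic names, broker addresses, batch interval, and output paths are assumptions, and the real jobs also apply the generic and specialized services described above:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("DeviceLogIngest")
val ssc  = new StreamingContext(conf, Seconds(60))   // minutes-level micro-batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka1:9092,kafka2:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "device-log-ingest",
  "auto.offset.reset"  -> "latest")

// Kafka topic partitions map directly to Spark partitions/tasks
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("device-logs"), kafkaParams))

// Batch many small Kafka messages into a few large files per micro-batch (HDFS prefers big files)
stream.map(_.value).foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.coalesce(4).saveAsTextFile(s"hdfs:///landing/device-logs/batch_${System.currentTimeMillis}")
  }
}

ssc.start()
ssc.awaitTermination()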
A few things…
Two flows of data: streaming and batch
Join data sources
Aggregate data sources
Convert to compressed columnar format (gzipped Parquet files)
On the ETL cluster…
Here’s where we do our heavy lifting.
Almost entirely all Hive Map Reduce jobs
Some Impala to make the really big gnarly aggregations more performant
Previously, had a custom Java Map Reduce job for sessionization of events
This has been replaced with a Spark Streaming job on the ingestion cluster
In the future, want to push as much of the ETL processing back into the ingestion cluster for more real-time processing
We also have a custom Java Induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.)
The output from the ETL cluster is Parquet files that are added to partitioned managed tables in the Hive metastore.
The Parquet files are then copied via distcp to the Secure Data Mart.
Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart.
The Secure Data Mart is protected with Apache Sentry.
Kerberos is used for authentication. Corporate Standard
Active Directory stores the groups. Corporate Standard
Access control is role based and the roles are assigned with Sentry.
Hue has a Sentry UI app to manage authorization.
Store data in one place Data (S3) + Structure (Hive Metastore)
Separate compute nodes from storage nodes
Elasticity size of clusters and number of clusters
Lower operational overhead of maintaining HDFS storage nodes
Redirect batch ingest into stream ingest (Pump batch data into Kafka) RESULT: One codebase for both stream and batch ingestion
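One way to picture that redirect, as a hedged sketch: replay a batch export line by line into the Kafka topic that the streaming pipeline already consumes (file path, topic, and broker names are assumptions):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

val props = new Properties()
props.put("bootstrap.servers", "kafka1:9092,kafka2:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Each line of the batch export becomes an event on the existing streaming topic,
// so one ingestion codebase serves both the stream and batch paths
Source.fromFile("/data/batch/crm_export.json").getLines().foreach { line =>
  producer.send(new ProducerRecord[String, String]("external-batch-logs", line))
}
producer.flush()
producer.close()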
Even though the fixed schema resolves the problem of data providers changing the data structure frequently, it makes the data really difficult for analysts and data scientists to analyze:
1. How would they know that these rows come from one event?
2. They have to use the id to associate the rows.
/**
 * The table does not exist, so create it. We create it manually due to a bug in Spark where its metadata
 * gets out of synch with Hive's metadata if we let Spark automatically create the table and we later alter
 * the table and add columns to it. See below for more details:
 * https://issues.apache.org/jira/browse/SPARK-9761
 */