In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that lets you generate greater value by applying analytics and AI to all your data at scale. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform, along with the resources available to help you re-skill your data teams.
- Delta Lake is an open source project that provides ACID transactions, schema enforcement, and time travel capabilities to data stored in data lakes such as S3 and ADLS.
- It allows building a "Lakehouse" architecture where the same data can be used for both batch and streaming analytics.
- Key features include ACID transactions, scalable metadata handling, time travel to view past data states, schema enforcement, schema evolution, and change data capture for streaming inserts, updates and deletes.
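For illustration, here is a minimal sketch of those features using Delta Lake with PySpark. The paths and column names are illustrative, and the session config assumes the open source delta-spark package is on the classpath:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is available on the classpath.
spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# ACID write: create a Delta table at an illustrative path.
df = spark.range(100).withColumnRenamed("id", "event_id")
df.write.format("delta").save("/tmp/events")

# Schema enforcement rejects mismatched writes; schema evolution is opt-in.
(df.withColumn("source", df.event_id.cast("string"))
   .write.format("delta").mode("append")
   .option("mergeSchema", "true")
   .save("/tmp/events"))

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")
```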
Databricks: A Tool That Empowers You To Do More With Data (Databricks)
In this talk we will present how Databricks has enabled the author to achieve more with data: a single person can build a coherent data project with data engineering, analysis, and science components, with better collaboration, better productionalization methods, larger datasets, and faster turnaround.
The talk will include a demo that will illustrate how the multiple functionalities of Databricks help to build a coherent data project with Databricks jobs, Delta Lake and auto-loader for data engineering, SQL Analytics for Data Analysis, Spark ML and MLFlow for data science, and Projects for collaboration.
Architect’s Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
The document discusses data mesh vs data fabric architectures. It defines data mesh as a decentralized data processing architecture with microservices and event-driven integration of enterprise data assets across multi-cloud environments. The key aspects of data mesh are that it is decentralized, processes data at the edge, uses immutable event logs and streams for integration, and can move all types of data reliably. The document then provides an overview of how data mesh architectures have evolved from hub-and-spoke models to more distributed designs using techniques like kappa architecture and describes some use cases for event streaming and complex event processing.
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit... (Amazon Web Services)
Snowflake is a data warehouse built for the cloud. It was founded in 2012 and has raised $1 billion in funding. Snowflake's architecture separates storage, compute, and metadata services, allowing it to offer unlimited scalability, multiple clusters that can access shared data with no downtime, and full transactional consistency across the system. Snowflake has over 2,000 customers, including large enterprises that use it for analytics, data science, and sharing large volumes of data securely.
Achieving Lakehouse Models with Spark 3.0 (Databricks)
It’s very easy to be distracted by the latest and greatest approaches with technology, but sometimes there’s a reason old approaches stand the test of time. Star schemas and Kimball modelling are among those things that aren’t going anywhere, but as we move towards the “Data Lakehouse” paradigm, how appropriate is this modelling technique, and how can we harness the Delta Engine and Spark 3.0 to maximise its performance?
Apache Iceberg Presentation for the St. Louis Big Data IDEA (Adam Doyle)
Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an open table format, an alternative to Hive tables, that works with Hive and Spark.
Data Lakehouse, Data Mesh, and Data Fabric (r1) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard (Paris Data Engineers!)
Delta Lake is an open source framework living on top of Parquet in your data lake to provide reliability and performance. It was open-sourced by Databricks this year and is gaining traction to become the de facto data lake table format.
We’ll see all the good Delta Lake can do for your data, with ACID transactions, DDL operations, schema enforcement, batch and stream support, and more!
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga... (DataScienceConferenc1)
Dragan Berić will take a deep dive into Lakehouse architecture, a game-changing concept bridging the best elements of data lake and data warehouse. The presentation will focus on the Delta Lake format as the foundation of the Lakehouse philosophy, and Databricks as the primary platform for its implementation.
Delta Lake brings reliability, performance, and security to data lakes. It provides ACID transactions, schema enforcement, and unified handling of batch and streaming data to make data lakes more reliable. Delta Lake also features lightning fast query performance through its optimized Delta Engine. It enables security and compliance at scale through access controls and versioning of data. Delta Lake further offers an open approach and avoids vendor lock-in by using open formats like Parquet that can integrate with various ecosystems.
This migration plan aims to explore the potential of migrating from on-premises Hadoop to Azure Databricks. By leveraging Databricks' scalability, performance, collaboration, and advanced analytics capabilities, organizations can unlock faster insights and facilitate data-driven decision-making.
The document discusses modern data architectures. It presents conceptual models for data ingestion, storage, processing, and insights/actions. It compares traditional vs modern architectures. The modern architecture uses a data lake for storage and allows for on-demand analysis. It provides an example of how this could be implemented on Microsoft Azure using services like Azure Data Lake Storage, Azure Databricks, and Azure Data Warehouse. It also outlines common data management functions such as data governance, architecture, development, operations, and security.
Doug Bateman, a principal data engineering instructor at Databricks, presented on how to build a Lakehouse architecture. He began by introducing himself and his background. He then discussed the goals of describing key Lakehouse features, explaining how Delta Lake enables it, and developing a sample Lakehouse using Databricks. The key aspects of a Lakehouse are that it supports diverse data types and workloads while enabling using BI tools directly on source data. Delta Lake provides reliability, consistency, and performance through its ACID transactions, automatic file consolidation, and integration with Spark. Bateman concluded with a demo of creating a Lakehouse.
Hadoop Migration to databricks cloud project plan.pptx (yashodhannn)
Telecom Bell is migrating their core applications to the cloud to improve network quality of service and enable personalized customer engagement using customer data. They are facing challenges with their on-premise data platform's lack of scalability, data silos, and governance issues. Databricks will help design a new cloud-based data platform architecture using their platform and Confluent for event streaming. The joint delivery approach between Telecom Bell and Databricks teams will include establishing data governance, migrating applications in phases, change management support, and reaching the desired timeline of May 2024.
Five Things to Consider About Data Mesh and Data Governance (DATAVERSITY)
Data mesh was among the most discussed and controversial enterprise data management topics of 2021. One of the reasons people struggle with data mesh concepts is that there are still a lot of open questions we are not thinking about:
Are you thinking beyond analytics? Are you thinking about all possible stakeholders? Are you thinking about how to be agile? Are you thinking about standardization and policies? Are you thinking about organizational structures and roles?
Join data.world VP of Product Tim Gasper and Principal Scientist Juan Sequeda for an honest, no-bs discussion about data mesh and its role in data governance.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Along with the Hive Metastore, these table formats are trying to solve problems that have long stood in traditional data lakes, with declared features like ACID, schema evolution, upsert, time travel, incremental consumption, etc.
Data Warehouse or Data Lake, Which Do I Choose? (DATAVERSITY)
Today’s data-driven companies have a choice to make – where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al.) or the data lake (AWS S3 et al.). There are pros and cons to each approach. While the data warehouse gives you strong data management with analytics, it doesn’t do well with semi-structured and unstructured data, tightly couples storage and compute, and comes with expensive vendor lock-in. On the other hand, data lakes allow you to store all kinds of data and are extremely affordable, but they’re only meant for storage and by themselves provide no direct value to an organization.
Enter the Open Data Lakehouse, the next evolution of the data stack that gives you the openness and flexibility of the data lake with the key aspects of the data warehouse like management and transaction support.
In this webinar, you’ll hear from Ali LeClerc who will discuss the data landscape and why many companies are moving to an open data lakehouse. Ali will share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lakehouse.
HdInsight Essentials: Hadoop on Microsoft Platform (nvvrajesh)
This book gives a quick introduction to the kinds of problems Hadoop addresses, and a primer on the real value of HDInsight. Next, it will show how to set up your HDInsight cluster.
Then, it will take you through the four stages: collect, process, analyze, and report.
For each of these stages you will see a practical example with the working code.
The document outlines the goals and contents of a book about HDInsight, Microsoft's Hadoop distribution. The book aims to provide an overview of Hadoop, describe how to deploy HDInsight on-premise and on Azure, and provide examples of ingesting, transforming, and analyzing data with HDInsight. Each chapter is summarized briefly, covering topics like Hadoop concepts, installing HDInsight, administering HDInsight clusters, loading and processing data in HDInsight.
These slides provide highlights of my book HDInsight Essentials. Book link is here: http://www.packtpub.com/establish-a-big-data-solution-using-hdinsight/book
Customer Education Webcast: New Features in Data Integration and Streaming CDC (Precisely)
View our quarterly customer education webcast to learn about the new advancements in Syncsort DMX and DMX-h data integration software and DataFunnel - our new easy-to-use browser-based database onboarding application. Learn about DMX Change Data Capture and the advantages of true streaming over micro-batch.
View this webcast on-demand where you'll hear the latest news on:
• Improvements in Syncsort DMX and DMX-h
• What’s next in the new DataFunnel interface
• Streaming data in DMX Change Data Capture
• Hadoop 3 support in Syncsort Integrate products
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
Hadoop Administrator online training course by Knowledgebee Trainings, covering Hadoop cluster planning & deployment, monitoring, performance tuning, security using Kerberos, HDFS High Availability using Quorum Journal Manager (QJM) and Oozie, and HCatalog/Hive administration.
Contact : knowledgebee@beenovo.com
Sahara is an OpenStack project that provides an abstraction layer for provisioning and managing Apache Hadoop clusters and jobs in OpenStack clouds. It allows users to easily deploy and scale Hadoop clusters on demand without having to manage the underlying infrastructure. Sahara uses plugins to integrate various Hadoop distributions like Hortonworks Data Platform (HDP) and Cloudera Distribution including Apache Hadoop (CDH). It leverages other OpenStack services like Nova, Neutron, Swift, Cinder, Heat etc. to provision, configure and manage the Hadoop clusters and jobs.
Hadoop and OpenStack - Hadoop Summit San Jose 2014 (spinningmatt)
This document discusses Hadoop and OpenStack Sahara. Sahara is an OpenStack project that allows users to provision and manage Hadoop clusters within OpenStack. It provides a plugin mechanism to support different Hadoop distributions like Hortonworks Data Platform (HDP). The HDP plugin fully integrates HDP clusters with Sahara using the Ambari API for cluster management. Sahara handles tasks like cluster scaling, integration with Swift for storage, and data locality. Its plugin architecture allows different Hadoop versions and distributions to be deployed and managed through Sahara.
FireEye & Scylla: Intel Threat Analysis Using a Graph Database (ScyllaDB)
FireEye believes in intelligence driven cyber security. Their legacy system used PostgreSQL with a custom graph database system to store and facilitate analysis of threat intelligence data. As their user base increased they ran into scaling issues requiring a system redesign with a new platform.
This presentation will focus on the backend systems and migration path to a new technology stack using JanusGraph running on top of Scylla plus Elasticsearch.
Using ScyllaDB turned out to be a game-changer in terms of performance and the types of analysis our application is able to do effortlessly.
The document summarizes Oracle's Big Data Appliance and solutions. It discusses the Big Data Appliance hardware which includes 18 servers with 48GB memory, 12 Intel cores, and 24TB storage per node. The software includes Oracle Linux, Apache Hadoop, Oracle NoSQL Database, Oracle Data Integrator, and Oracle Loader for Hadoop. Oracle Loader for Hadoop can be used to load data from Hadoop into Oracle Database in online or offline mode. The Big Data Appliance provides an optimized platform for storing and analyzing large amounts of data and is integrated with Oracle Exadata.
Key trends in Big Data and new reference architecture from Hewlett Packard En... (Ontico)
The rapid evolution of Big Data processing tools is giving rise to new approaches to improving performance. Key new technologies in Hadoop 2.0, such as YARN labeling and storage tiering, are already used by Yahoo and eBay. These new technologies open the way to serious gains in the efficiency of Hadoop IT infrastructure, delivering performance improvements of several tens of percent while reducing memory and power consumption.
HP's reference architecture for Hadoop, the HP Big Data Reference Architecture, proposes using specialized HP Moonshot "microservers" together with high-density HP Apollo storage nodes to achieve today's best hardware utilization for Hadoop.
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016 (Mladen Kovacevic)
The document introduces Apache Kudu (incubating), a new updatable columnar storage system for Apache Hadoop designed for fast analytics on fast and changing data. It was designed to simplify architectures that use HDFS and HBase together. Kudu aims to provide high throughput for scans, low latency for individual rows, and database-like ACID transactions. It uses a columnar format and is optimized for SSD and new storage technologies.
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013) (VMware Tanzu)
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at:
http://www.gopivotal.com/big-data/pivotal-hd
ScyllaDB CTO Avi Kivity looks at the present state of Scylla's capabilities, and offers a glimpse of what's to come. From incremental compaction strategy to take advantage of newer, denser nodes, to data transformations with User Defined Functions (UDFs) and User Defined Aggregates (UDAs), ScyllaDB continues to expand its horizons for capabilities, use cases and APIs.
This document provides an overview of Spark driven big data analytics. It begins by defining big data and its characteristics. It then discusses the challenges of traditional analytics on big data and how Apache Spark addresses these challenges. Spark improves on MapReduce by allowing distributed datasets to be kept in memory across clusters. This enables faster iterative and interactive processing. The document outlines Spark's architecture including its core components like RDDs, transformations, actions and DAG execution model. It provides examples of writing Spark applications in Java and Java 8 to perform common analytics tasks like word count.
This document provides an overview of Hadoop and related big data technologies. It discusses the core Hadoop projects like HDFS, MapReduce, Hive and Spark. It also covers ingestion tools like Flume and Sqoop and real-time streaming tools like Storm and Kafka. Example use cases for web analytics, data warehousing and IoT are presented. Finally deployment options on premise and in the cloud are briefly discussed.
Similar to 5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
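For a flavor of the stage-level scheduling API described above, here is a hedged PySpark 3.1+ sketch; the resource amounts, discovery script path, and the train_partition function are illustrative assumptions, not code from the talk:

```python
from pyspark.sql import SparkSession
from pyspark.resource import (ExecutorResourceRequests, TaskResourceRequests,
                              ResourceProfileBuilder)

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def train_partition(rows):
    # Placeholder for a per-partition training step (e.g., Horovod/Keras).
    return [sum(1 for _ in rows)]

# Request GPU-equipped executors for the training stage only
# (requires dynamic allocation on YARN/Kubernetes).
ereqs = (ExecutorResourceRequests()
         .cores(4).memory("16g")
         .resource("gpu", 1, discoveryScript="/opt/spark/getGpus.sh"))
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

# ETL runs under the default profile; the training stage uses the GPU profile.
etl_rdd = sc.textFile("/data/raw").map(lambda line: line.split(","))
train_rdd = etl_rdd.withResources(profile).mapPartitions(train_partition)
print(train_rdd.count())
```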
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
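As a rough sketch of the converter API described above (Petastorm's petastorm.spark module, assuming an active SparkSession named spark; the cache directory and input path are illustrative):

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Petastorm materializes the DataFrame under this cache directory.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

df = spark.read.parquet("/data/features")  # preprocessed Spark DataFrame
converter = make_spark_converter(df)

# TensorFlow: iterate over the DataFrame as a tf.data.Dataset.
with converter.make_tf_dataset() as dataset:
    for batch in dataset.take(1):
        print(batch)

# PyTorch: the same converter yields a DataLoader-style iterator.
with converter.make_torch_dataloader() as dataloader:
    for batch in dataloader:
        break
```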
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not "abelian groups" and that operate over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
- Why?
  - Custom queries on top of a table; we load the data once and query N times
- Why not Structured Streaming?
- Working solution using Redis
Niche 2: Distributed Counters (a sketch follows this list)
- Problems with Spark Accumulators
- Utilize Redis hashes as distributed counters
- Precautions for retries and speculative execution
- Pipelining to improve performance
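For Niche 2, here is a minimal sketch of Redis hashes used as distributed counters with pipelining, assuming the redis-py client and an active SparkSession; the host, key names, and toy DataFrame are illustrative, not Adobe's production code:

```python
import redis

df = spark.createDataFrame(
    [("click",), ("view",), ("click",)], ["event_type"])

def count_partition(rows):
    # One connection per partition task; HINCRBY is atomic on the server.
    r = redis.Redis(host="redis-host", port=6379)
    pipe = r.pipeline(transaction=False)  # pipelining cuts round trips
    for row in rows:
        # Real deployments need idempotency safeguards for task retries
        # and speculative execution (e.g., per-attempt keys).
        pipe.hincrby("event_counts", row["event_type"], 1)
    pipe.execute()
    return iter([])

df.rdd.mapPartitions(count_partition).count()  # force execution
```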
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
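For a flavor of the library, here is a minimal local sketch of whylogs profiling, assuming the whylogs v1 Python API and a toy pandas DataFrame; the Spark-scale integration shown in the talk is not reproduced here:

```python
import pandas as pd
import whylogs as why

df = pd.DataFrame({"price": [9.99, 12.50, 7.25], "sku": ["a", "b", "a"]})

results = why.log(df)            # profile the dataframe
profile_view = results.view()    # lightweight statistical profile
print(profile_view.to_pandas())  # per-column metrics: counts, types, stats
```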
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
- What are we storing?
- Multi Source – Multi Channel Problem
- Data Representation and Nested Schema Evolution
- Performance Trade-offs with Various Formats
- Go over anti-patterns used (String FTW)
- Data Manipulation using UDFs
- Writer Worries and How to Wipe Them Away
- Staging Tables FTW
- Datalake Replication Lag Tracking
- Performance Time!
Machine Learning CI/CD for Email Attack Detection (Databricks)
Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, adversarial nature of the problem, and scale of data. In order to move quickly and adapt to the newest threat we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack including joined datasets for hydration, feature extraction code, detection logic, and develop/train ML models.
In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.
Jeeves Grows Up: An AI Chatbot for Performance and Quality (Databricks)
Sarah: CEO-Finance-Report pipeline seems to be slow today. Why?
Jeeves: SparkSQL query dbt_fin_model in CEO-Finance-Report is running 53% slower on 2/28/2021. Data skew issue detected. Issue has not been seen in last 90 days.
Jeeves: Adding 5 more nodes to cluster recommended for CEO-Finance-Report to finish in its 99th percentile time of 5.2 hours.
Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of problems quickly. Instead of being stuck to screens displaying logs and metrics, users can now have a more refreshing experience via a two-way conversation with their own personal Spark expert.
We presented Jeeves at Spark Summit 2019. In the two years since, Jeeves has grown up a lot. Jeeves can now learn continuously as telemetry information streams in from more and more applications, especially SQL queries. Jeeves now “knows” about data pipelines that have many components. Jeeves can also answer questions about data quality in addition to performance, cost, failures, and SLAs. For example:
Tom: I am not seeing any data for today in my Campaign Metrics Dashboard.
Jeeves: 3/5 validations failed on the cmp_kpis table on 2/28/2021. Run of pipeline cmp_incremental_daily failed on 2/28/2021.
This talk will give an overview of the newer capabilities of the chatbot, and how it now fits in a modern data stack with the emergence of new data roles like analytics engineers and machine learning engineers. You will learn how to build chatbots that tackle your complex data operations challenges.
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue (Databricks)
This presentation introduces Tune and Fugue, frameworks for intuitive and scalable hyperparameter optimization (HPO). Tune supports both non-iterative and iterative HPO problems. For non-iterative problems, Tune supports grid search, random search, and Bayesian optimization. For iterative problems, Tune generalizes algorithms like Hyperband and Asynchronous Successive Halving. Tune allows tuning models both locally and in a distributed manner without code changes. The presentation demonstrates Tune's capabilities through examples tuning Scikit-Learn and Keras models. The goal of Tune and Fugue is to make HPO development easy, testable, and scalable.
When it comes to large-scale data processing and machine learning, Apache Spark is no doubt one of the top battle-tested frameworks out there for handling batched or streaming workloads. The ease of use, built-in machine learning modules, and multi-language support make it a very attractive choice for data wonks. However, bootstrapping and getting off the ground can be difficult for most teams without leveraging a Spark cluster that is already pre-provisioned and provided as a managed service in the cloud. While that is a very attractive choice to get going, in the long run it can be a very expensive option if it's not well managed.
As an alternative to this approach, our team has been exploring and working a lot with running Spark and all our Machine Learning workloads and pipelines as containerized Docker packages on Kubernetes. This provides an infrastructure-agnostic abstraction layer for us, and as a result, it improves our operational efficiency and reduces our overall compute cost. Most importantly, we can easily target our Spark workload deployment to run on any major Cloud or On-prem infrastructure (with Kubernetes as the common denominator) by just modifying a few configurations.
In this talk, we will walk you through the process our team follows to make it easy for us to run a production deployment of our Machine Learning workloads and pipelines on Kubernetes which seamlessly allows us to port our implementation from a local Kubernetes set up on the laptop during development to either an On-prem or Cloud Kubernetes environment
Getting Started with Interactive Brokers API and Python.pdf (Riya Sen)
In the fast-paced world of finance, automation is key to staying ahead of the curve. Traders and investors are increasingly turning to programming languages like Python to streamline their strategies and enhance their decision-making processes. In this blog post, we will delve into the integration of Python with Interactive Brokers, one of the leading brokerage platforms, and explore how this dynamic duo can revolutionize your trading experience.
Annex K RBF's The World Game pdf document (Steven McGee)
Signals & Telemetry Annex K for RBF's The World Game / Trade Federations / USPTO 13/573,002 Heart Beacon Cycle Time - Space Time Chain meters, metrics, standards. Adaptive Procedural template framework structured data derived from DoD / NATO's system of systems engineering tech framework
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ... (JeevanKp7)
Long-term oxygen therapy (LTOT) and novel techniques of evaluating treatment efficacy have enhanced the quality of life and decreased healthcare expenses for COPD patients.
Because the cost of a pulmonary blood gas test is comparable to the cost of two days of oxygen therapy, and the cost of a hospital stay is equivalent to the cost of one month of oxygen therapy, long-term oxygen therapy (LTOT) is a cost-effective technique for treating this disease.
A small number of clinical investigations on LTOT have shown that it improves the quality of life of COPD patients by reducing the loss of their respiratory capacity. A study of 8,487 Danish patients found that LTOT for 15-24 hours per day extended life expectancy from 1.07 to 1.40 years.
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data (Samuel Jackson)
We present our work to improve data accessibility and performance for data-intensive tasks within the fusion research community. Our primary goal is to develop services that facilitate efficient access for data-intensive applications while ensuring compliance with FAIR principles [1], as well as adoption of interoperable tools, methods and standards.
The major outcome of our work is the successful creation and deployment of a data service for the MAST (Mega Ampere Spherical Tokamak) experiment [2], leading to substantial enhancements in data discoverability, accessibility, and overall data retrieval performance, particularly in scenarios involving large-scale data access. Our work follows the principles of Analysis-Ready, Cloud Optimised (ARCO) data [3] by using cloud optimised data formats for fusion data.
Our system consists of a query-able metadata catalogue, complemented with an object storage system for publicly serving data from the MAST experiment. We will show how our solution integrates with the Pandata stack [4] to enable data analysis and processing at scales that would have previously been intractable, paving the way for data-intensive workflows running routinely with minimal pre-processing on the part of the researcher. By using a cloud-optimised file format such as zarr [5] we can enable interactive data analysis and visualisation while avoiding large data transfers. Our solution integrates with common python data analysis libraries for large, complex scientific data such as xarray [6] for complex data structures and dask [7] for parallel computation and lazily working with larger that memory datasets.
The incorporation of these technologies is vital for advancing simulation, design, and enabling emerging technologies like machine learning and foundation models, all of which rely on efficient access to extensive repositories of high-quality data. Relying on the FAIR guiding principles for data stewardship not only enhances data findability, accessibility, and reusability, but also fosters international cooperation on the interoperability of data and tools, driving fusion research into new realms and ensuring its relevance in an era characterised by advanced technologies in data science.
[1] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016) https://doi.org/10.1038/sdata.2016.18
[2] M Cox, The Mega Amp Spherical Tokamak, Fusion Engineering and Design, Volume 46, Issues 2–4, 1999, Pages 397-404, ISSN 0920-3796, https://doi.org/10.1016/S0920-3796(99)00031-9
[3] Stern, Charles, et al. "Pangeo forge: crowdsourcing analysis-ready, cloud optimized data production." Frontiers in Climate 3 (2022): 782909.
[4] Bednar, James A., and Martin Durant. "The Pandata Scalable Open-Source Analysis Stack." (2023).
[5] Alistair Miles (2024) ‘zarr-developers/zarr-python: v2.17.1’. Zenodo. doi: 10.5281/zenodo.10790679
[6] Hoyer, S. & Hamman, J., (20
How AI is Revolutionizing Data Collection.pdf (PromptCloud)
Artificial Intelligence (AI) is transforming the landscape of data collection, making it more efficient, accurate, and insightful than ever before. With AI, businesses can automate the extraction of vast amounts of data from diverse sources, analyze patterns in real-time, and gain deeper insights with minimal human intervention. This revolution in data collection enables companies to make faster, data-driven decisions, enhance their competitive edge, and unlock new opportunities for growth.
AI-powered tools can handle complex and dynamic web content, adapt to changes in website structures, and even understand the context of data through natural language processing. This means that data collection is not only faster but also more precise, reducing the time and effort required for manual data extraction. Furthermore, AI can process unstructured data, such as social media posts and customer reviews, providing valuable insights into customer sentiment and market trends.
Embrace the future of data collection with AI and stay ahead of the curve. Learn more about how PromptCloud’s AI-driven web scraping solutions can transform your data strategy. https://www.promptcloud.com/contact/
1. Overview of statistical software such as ODK, SurveyCTO, and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ... (weiwchu)
We recently discovered that models trained with large-scale speech datasets sourced from the web could achieve superior accuracy and potentially lower cost than traditionally human-labeled or simulated speech datasets. We developed a customizable AI-driven data labeling system. It infers word-level transcriptions with confidence scores, enabling supervised ASR training. It also robustly generates phone-level timestamps even in the presence of transcription or recognition errors, facilitating the training of TTS models. Moreover, it automatically assigns labels such as scenario, accent, language, and topic tags to the data, enabling the selection of task-specific data for training a model tailored to that particular task. We assessed the effectiveness of the datasets by fine-tuning open-source large speech models such as Whisper and SeamlessM4T and analyzing the resulting metrics. In addition to openly-available data, our data handling system can also be tailored to provide reliable labels for proprietary data from certain vertical domains. This customization enables supervised training of domain-specific models without the need for human labelers, eliminating data breach risks and significantly reducing data labeling cost.
5. History of Hadoop
● Created 2005
● Open source distributed processing and storage platform running on commodity hardware
● Originally consisted of HDFS and MapReduce, but now incorporates numerous open source projects (Hive, HBase, Spark)
● On-prem and on the cloud
6. Today Hadoop is very hard
COMPLEX → Slow Innovation
● Many tools: need to understand multiple technologies.
● Real-time and batch ingestion to build AI models requires integrating many components.
FIXED → Cost Prohibitive
● 24/7 clusters.
● Fixed capacity: CPU + RAM + Disk.
● Costly to upgrade.
MAINTENANCE INTENSIVE → Low Productivity
● The Hadoop ecosystem is complex and hard to manage, and prone to failures.
7. Enterprises Need a Modern Data Analytics Architecture
CRITICAL REQUIREMENTS
● Cost-effective scale and performance in the cloud
● Easy to manage and highly reliable for diverse data
● Predictive and real-time insights to drive innovation
8. Lakehouse Platform
Data types: Structured, Semi-structured, Unstructured, Streaming
Workloads: Data Engineering, BI & SQL Analytics, Real-time Data Applications, Data Science & Machine Learning
Data Management & Governance
Open Data Lake
SIMPLE | OPEN | COLLABORATIVE
13. Migration Planning
Technical Planning
● Target state architecture
● Data migration
● Workload migration
○ Lift and shift, transformative, hybrid
● Data governance approach
● Automated deployment
● Monitoring and Operations
14. Migration Planning
Enablement and Evaluation
● Workshops, Technical deep dives
● Training
● Proof of technology / MVP
○ Validate assumptions and designs
15. Migration Planning
Migration Execution
● Environment Deployment
● Iterate over use cases
○ Data Migration
○ Workload Migration
○ Dual Production Deployment - Old and New
○ Validation
○ Cut-over and Decommission of Hadoop
19. Hadoop Ecosystem to Databricks Concepts
[Architecture diagram: a Hadoop cluster of N nodes (Node 1 ... Node N, 2x12c = 24c compute each), where every node runs HDFS over local disks (disk1 ... disk N) alongside YARN-managed Impala, HBase, MapReduce mappers, and Spark workers/driver; shared services include the Hive Metastore, Hive Server, an Impala load balancer, and the HBase API, with Sentry or Ranger providing table metadata permissions and HDFS ACLs, and JDBC/ODBC client access.]
Node makeup
▪ Local disks
▪ Cores/memory carved up among services
▪ Submitted jobs compete for resources
▪ Services constrained to accommodate resources
Metadata and Security
▪ Sentry table metadata permissions combined with syncing of HDFS ACLs, OR
▪ Apache Ranger, policy-based access control
Endpoints
▪ Direct access to HDFS / copied dataset
▪ Hive (on MR or Spark) accepts incoming connections
▪ Impala for interactive queries
▪ HBase APIs as required
20. Hadoop Ecosystem to Databricks Concepts
[Side-by-side architecture diagram mapping the Hadoop cluster from the previous slide (HDFS on local disks, YARN, Impala, HBase, MapReduce mappers, Spark workers/driver, Hive Metastore and Hive Server, Impala load balancer, HBase API, Sentry/Ranger table metadata and HDFS ACLs, JDBC/ODBC) to the Databricks equivalents: a managed Hive Metastore; a Databricks SQL endpoint served by a high-concurrency SQL Analytics cluster over JDBC/ODBC; ephemeral clusters for all-purpose workloads or jobs, covering Spark ETL (batch/streaming), SQL Analytics, and the ML Runtime, each with a Spark driver and Spark workers running the Delta Engine; object storage in place of HDFS, governed by table ACLs and object storage ACLs; and CosmosDB/DynamoDB/Keyspaces in place of HBase.]
21. Hadoop Ecosystem to Databricks Concepts
[Diagram: the Databricks side on its own — managed Hive Metastore; Databricks SQL endpoint (JDBC/ODBC) backed by a High-Concurrency cluster for SQL Analytics; Databricks clusters (Spark driver plus workers, each with the Delta Engine) for Spark ETL (batch/streaming), SQL Analytics, and the ML Runtime, run as ephemeral or long-running clusters for all-purpose or jobs workloads; Table ACLs and object storage ACLs govern access to object storage; CosmosDB/DynamoDB/Keyspaces replace HBase.]
Node makeup
▪ Each node (VM) maps to a single Spark driver/worker
▪ A cluster of nodes is completely isolated from other jobs/compute
▪ De-coupled compute and storage
Metadata and Security
▪ Managed Hive metastore (other options available)
▪ Table ACLs (Databricks) and object storage permissions (example below)
Endpoints
▪ SQL endpoint for both advanced analytics and simple SQL analytics
▪ Code access to data - notebooks
▪ HBase → maps to Azure CosmosDB, AWS DynamoDB/Keyspaces (non-Databricks solution)
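To make the Table ACL model above concrete, permissions are granted in plain SQL on a cluster with Table Access Control enabled. A minimal sketch; the table and group names are hypothetical:

    # Minimal Table ACL sketch (Databricks); table and group names are
    # hypothetical, and the cluster must have Table Access Control enabled.
    spark.sql("GRANT SELECT ON TABLE sales.orders TO `analysts`")
    spark.sql("DENY MODIFY ON TABLE sales.orders TO `analysts`")
    # Inspect the effective grants for the group.
    spark.sql("SHOW GRANT `analysts` ON TABLE sales.orders").show()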
24. Data Migration
On-premises HDFS:
- On-premises block storage.
- Fixed disk capacity.
- Health checks to validate data integrity.
- As data volumes grow, must add more nodes to the cluster and rebalance data.
MIGRATE
Cloud object storage:
- Fully managed cloud object storage.
- Unlimited capacity.
- No maintenance, no health checks, no rebalancing.
- 99.99% availability, 99.999999999% durability.
- Use native cloud services to migrate data.
- Leverage partner solutions.
25. Data Migration
Build a Data Lake in cloud storage with Delta Lake
● Open source and uses Parquet file format.
● Performance: Data indexing → Faster queries.
● Reliability: ACID Transactions → Guaranteed data integrity.
● Scalability: Handles petabyte-scale tables with billions of partitions and files with ease.
● Enhanced Spark SQL: UPDATE, MERGE, and DELETE commands (see the sketch after this list).
● Unify batch and stream processing → No more Lambda architecture.
● Schema Enforcement: Specify schema on write.
● Schema Evolution: Automatically change schemas on the fly.
● Audit History: Full audit trail of the changes.
● Time Travel: Restore data from past versions.
● 100% Compatible with Apache Spark API.
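A minimal PySpark sketch of a few of these features (schema-enforced writes, MERGE, time travel). Paths and column names are hypothetical; it assumes the open source delta-spark package — on Databricks the provided spark session already has Delta enabled, so the explicit builder configs can be dropped:

    # Minimal Delta Lake sketch; paths and columns are hypothetical.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.extensions",
                     "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Schema is enforced on write; mismatched columns fail the append.
    updates = spark.read.parquet("/mnt/landing/customers")
    updates.write.format("delta").mode("append").save("/mnt/lake/customers")

    # MERGE: upsert change records (inserts + updates) in one ACID transaction.
    target = DeltaTable.forPath(spark, "/mnt/lake/customers")
    (target.alias("t")
     .merge(updates.alias("s"), "t.customer_id = s.customer_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

    # Time travel: read the table as it was at an earlier version.
    v0 = (spark.read.format("delta")
          .option("versionAsOf", 0)
          .load("/mnt/lake/customers"))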
26. Start with Dual ingestion
● Add a feed to cloud storage (see the Auto Loader sketch below)
● Enable new use cases with new data
● Introduces options for backup
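One common way to land that second feed on Databricks is Auto Loader. A minimal sketch, assuming a Databricks runtime (the cloudFiles source is Databricks-specific) and hypothetical bucket and lake paths:

    # Minimal Auto Loader sketch; bucket and lake paths are hypothetical.
    raw = (spark.readStream.format("cloudFiles")
           .option("cloudFiles.format", "json")
           .option("cloudFiles.schemaLocation", "/mnt/lake/_schemas/events")
           .load("s3://landing-bucket/events/"))

    # Land the duplicate feed as a Delta table in cloud storage.
    (raw.writeStream.format("delta")
     .option("checkpointLocation", "/mnt/lake/_checkpoints/events")
     .start("/mnt/lake/bronze/events"))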
27. How to migrate data
● Leverage existing Data Delivery tools to point to cloud storage
● Introduce simplified flows to land data into cloud storage
28. How to migrate data
● Push the data
○ DistCP
○ 3rd Party Tooling
○ In-house frameworks
○ Cloud native - AWS Snowmobile, Azure Data Box, Google Transfer Appliance
○ Typically easier to approve (security)
● Pull the data
○ Spark Streaming
○ Spark Batch
■ File Ingest
■ JDBC (see the sketch after this list)
○ 3rd Party Tooling
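As an illustration of the pull path, a Spark batch job can pull a table over JDBC straight into Delta. A sketch with placeholder connection details; the secret scope, key range bounds, and paths are hypothetical, and the matching JDBC driver must be installed on the cluster:

    # Hypothetical JDBC pull from an on-premises database into Delta;
    # dbutils.secrets is Databricks-specific.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://onprem-db:3306/sales")
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", dbutils.secrets.get("etl", "db-password"))
          # Parallelize the read across 8 partitions of the key range.
          .option("numPartitions", 8)
          .option("partitionColumn", "order_id")
          .option("lowerBound", 0)
          .option("upperBound", 10000000)
          .load())

    df.write.format("delta").mode("overwrite").save("/mnt/lake/bronze/orders")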
29. How to migrate data - Pull approach
● Set up connectivity to On Premises
○ AWS Direct Connect
○ Azure ExpressRoute / VPN Gateway
○ This may be needed for some use cases
● Kerberized Hadoop Environments
○ Databricks cluster initialization scripts
■ Kerberos client setup
■ krb5.conf, keytab
■ kinit()
● Shared External Metastore
○ Databricks and Hadoop can share a metastore (configuration sketched below)
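For the shared external metastore, the Databricks cluster points its Hive client at the existing metastore database via Spark configuration. A sketch; the version, driver, and connection values below are assumptions to adapt to your environment:

    # Cluster Spark configuration (key/value pairs) for an external Hive
    # metastore; all values are environment-specific placeholders.
    metastore_conf = {
        "spark.sql.hive.metastore.version": "2.3.9",
        "spark.sql.hive.metastore.jars": "builtin",
        "spark.hadoop.javax.jdo.option.ConnectionURL":
            "jdbc:mysql://onprem-metastore-db:3306/metastore",
        "spark.hadoop.javax.jdo.option.ConnectionDriverName":
            "org.mariadb.jdbc.Driver",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": "hive",
        # Databricks secret reference syntax keeps the password out of configs.
        "spark.hadoop.javax.jdo.option.ConnectionPassword":
            "{{secrets/etl/metastore-pw}}",
    }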
39. Security and Governance
Authentication
- Single Sign On (SSO) with a SAML 2.0-supported corporate directory.
- Leverage cloud-native security: IAM federation and AAD passthrough.
Authorization
- Access Control Lists (ACLs) for Databricks RBAC.
- Table ACLs - dynamic views for column/row permissions (see the sketch below).
- Integration with Ranger and Immuta for more advanced RBAC and ABAC.
Metadata Management
- Integration with 3rd party services, e.g. AWS Glue.
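A sketch of the dynamic-view approach; the view, table, and group names are hypothetical. Column masking and row filtering are expressed with is_member():

    # Dynamic view: mask a column and filter rows by group membership.
    # View, table, and group names are hypothetical.
    spark.sql("""
        CREATE OR REPLACE VIEW sales.customers_masked AS
        SELECT customer_id,
               CASE WHEN is_member('pii_readers') THEN email
                    ELSE '***REDACTED***' END AS email,
               region
        FROM sales.customers
        WHERE is_member('emea_analysts') OR region <> 'EMEA'
    """)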
41. Migrating Security Policies from Hadoop to Databricks
Enabling enterprises to responsibly use their data in the cloud
Powered by Apache Ranger
42. HADOOP ECOSYSTEM
● 100s and 1000s of tables in Apache Hive
● 100s of policies in Apache Ranger
● Variety of policies: resource based, tag based, masking, row level filters, etc.
● Policies for users and groups from AD/LDAP
45. Privacera Value Add - Enhancing Databricks Authorization
● Richer, deeper, and more robust access control
● Row/column level access control in SQL
● Dynamic and static data de-identification
● File level access control for DataFrames, object level access
● Read/write operations supported
Object Store (S3/ADLS) - Privacera + Databricks:
▪ S3 - Bucket Level: Yes
▪ S3 - Object Level: Yes
▪ ADLS: Yes
Spark SQL and R - Privacera + Databricks:
▪ Table: Yes
▪ Column: Yes
▪ Column Masking: Yes
▪ Row Level Filtering: Yes
▪ Tag Based Policies: Yes
▪ Attribute Based Policies: Yes
▪ Centralized Auditing: Yes
46. Privacera + Databricks Architecture
[Diagram: a Ranger plugin in the Spark driver of each Databricks SQL/Python cluster enforces Spark SQL and Spark read/write operations against policies served by the Ranger Policy Manager in Privacera Cloud. Privacera Cloud also provides the Portal, Discovery, Approval Workflow, and Anomaly Detection and Alerting for business and admin users, integrating with AD/LDAP and 3rd party catalogs; the Audit Server persists audit events to Solr and streams them via Apache Kafka to Splunk, CloudWatch, or other SIEM tools.]
49. What about the SQL Community?
Hadoop
● HUE
○ Data browsing
○ SQL Editor
○ Visualizations
● Interactive SQL
○ Impala
○ Hive LLAP
Databricks
● SQL Analytics Workspace
○ Data Browser
○ SQL Editor
○ Visualizations
● Interactive SQL
○ Spark optimizations - Adaptive Query Execution
○ Advanced Caching
○ Project Photon
○ Scaling cluster of clusters
50. SQL & BI Layer
Optimized SQL and BI performance with tuned BI integrations:
- Fast queries with the Delta Engine on Delta Lake.
- Support for high concurrency with auto-scaling clusters.
- Optimized JDBC/ODBC drivers.
- Optimized and tuned for BI and SQL out of the box.
- Compatible with any BI client and tool that supports Spark.
51. Vision
Give SQL users a home in Databricks
Provide SQL workbench, light
dashboarding, and alerting capabilities
Great BI experience on the data lake
Enable companies to effectively leverage
the data lake from any BI tool without
having to move the data around.
Easy to use & price-performant
Minimal setup & configuration. Data lake
price performance.
52. SQL-native user interface for analysts
▪ Familiar SQL Editor
▪ Auto-complete
▪ Built-in visualizations
▪ Data browser
▪ Automatic alerts
▪ Trigger based upon values
▪ Email or Slack integration
▪ Dashboards
▪ Simply convert queries to dashboards
▪ Share with access control
53. Built-in connectors for existing BI tools
▪ Supports your favorite tool
▪ Connectors for top BI & SQL clients, plus other BI & SQL clients that support Spark
▪ Simple connection setup
▪ Optimized performance
▪ OAuth & Single Sign On
▪ Quick and easy authentication experience - no need to deal with access tokens
▪ Power BI available now; others coming soon
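For SQL clients without a native connector, programmatic access works over the same endpoint. A sketch using the databricks-sql-connector Python package; the hostname, HTTP path, token, and table are placeholders:

    # Minimal sketch with the databricks-sql-connector package
    # (pip install databricks-sql-connector); connection values are
    # placeholders for your SQL endpoint.
    from databricks import sql

    with sql.connect(server_hostname="adb-1234567890123456.7.azuredatabricks.net",
                     http_path="/sql/1.0/endpoints/abcdef1234567890",
                     access_token="<personal-access-token>") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT region, COUNT(*) AS orders "
                        "FROM sales.orders GROUP BY region")
            for row in cur.fetchall():
                print(row)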
54. Performance
Delta Metadata Performance
Improved read performance for cold queries on Delta tables. Provides interactive metadata performance regardless of the number of Delta tables in a query or their sizes.
New ODBC / JDBC Drivers
Wire protocol re-engineered to provide lower latencies & higher data transfer speeds:
▪ Lower latency / less overhead (~¼ sec) with reduced round trips per request
▪ Higher transfer rate (up to 50%) using Apache Arrow
▪ Optimized metadata performance for ODBC/JDBC APIs (up to 10x for metadata retrieval operations)
Photon - Delta Engine [Preview]
New MPP engine built from scratch in C++. Vectorized to exploit data-level and instruction-level parallelism. Optimized for modern structured and semi-structured workloads.