This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.
Delta Lake, an open-source innovations which brings new capabilities for transactions, version control and indexing your data lakes. We uncover how Delta Lake benefits and why it matters to you. Through this session, we showcase some of its benefits and how they can improve your modern data engineering pipelines. Delta lake provides snapshot isolation which helps concurrent read/write operations and enables efficient insert, update, deletes, and rollback capabilities. It allows background file optimization through compaction and z-order partitioning achieving better performance improvements. In this presentation, we will learn the Delta Lake benefits and how it solves common data lake challenges, and most importantly new Delta Time Travel capability.
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance from 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark have resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily delta, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
The Delta Architecture pattern has made the lives of data engineers much simpler, but what about improving query performance for data analysts? What are some common places to look at for tuning query performance? In this session we will cover some common techniques to apply to our delta tables to make them perform better for data analysts queries. We will look at a few examples of how you can analyze a query, and determine what to focus on to deliver better performance results.
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
Matthew Powers gave a talk on optimizing data lakes for Apache Spark. He discussed community goals like standardizing method signatures. He advocated for using Spark helper libraries like spark-daria and spark-fast-tests. Powers explained how to build better data lakes using techniques like partitioning data on relevant fields to skip data and speed up queries significantly. He also covered modern Scala libraries, incremental updates, compacting small files, and using Delta Lakes to more easily update partitioned data lakes over time.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
- Delta Lake is an open source project that provides ACID transactions, schema enforcement, and time travel capabilities to data stored in data lakes such as S3 and ADLS.
- It allows building a "Lakehouse" architecture where the same data can be used for both batch and streaming analytics.
- Key features include ACID transactions, scalable metadata handling, time travel to view past data states, schema enforcement, schema evolution, and change data capture for streaming inserts, updates and deletes.
Making Apache Spark Better with Delta LakeDatabricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
Apache Hudi is a data lake platform, that provides streaming primitives (upserts/deletes/change streams) on top of data lake storage. Hudi powers very large data lakes at Uber, Robinhood and other companies, while being pre-installed on four major cloud platforms.
Hudi supports exactly-once, near real-time data ingestion from Apache Kafka to cloud storage, which is typically used in-place of a S3/HDFS sink connector to gain transactions and mutability. While this approach is scalable and battle-tested, it can only ingest data in mini batches, leading to lower data freshness. In this talk, we introduce a Kafka Connect Sink Connector for Apache Hudi, which writes data straight into Hudi's log format, making the data immediately queryable, while Hudi's table services like indexing, compaction, clustering work behind the scenes, to further re-organize for better query performance.
This document provides an introduction and overview of Apache Spark with Python (PySpark). It discusses key Spark concepts like RDDs, DataFrames, Spark SQL, Spark Streaming, GraphX, and MLlib. It includes code examples demonstrating how to work with data using PySpark for each of these concepts.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL and can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as, the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Deep Dive into the New Features of Apache Spark 3.0Databricks
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3000 resolved JIRAs. We will talk about the exciting new developments in the Spark 3.0 as well as some other major initiatives that are coming in the future.
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
We have covered the need for CDC and the benefits of building a CDC pipeline. We will compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in the production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail and our contributions to the open-source community.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
Doug Bateman, a principal data engineering instructor at Databricks, presented on how to build a Lakehouse architecture. He began by introducing himself and his background. He then discussed the goals of describing key Lakehouse features, explaining how Delta Lake enables it, and developing a sample Lakehouse using Databricks. The key aspects of a Lakehouse are that it supports diverse data types and workloads while enabling using BI tools directly on source data. Delta Lake provides reliability, consistency, and performance through its ACID transactions, automatic file consolidation, and integration with Spark. Bateman concluded with a demo of creating a Lakehouse.
Dustin Vannoy presented on using Delta Lake with Azure Databricks. He began with an introduction to Spark and Databricks, demonstrating how to set up a workspace. He then discussed limitations of Spark including lack of ACID compliance and small file problems. Delta Lake addresses these issues with transaction logs for ACID transactions, schema enforcement, automatic file compaction, and performance optimizations like time travel. The presentation included demos of Delta Lake capabilities like schema validation, merging, and querying past versions of data.
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
Data Quality With or Without Apache Spark and Its EcosystemDatabricks
Few solutions exist in the open-source community either in the form of libraries or complete stand-alone platforms, which can be used to assure a certain data quality, especially when continuous imports happen. Organisations may consider picking up one of the available options – Apache Griffin, Deequ, DDQ and Great Expectations. In this presentation we’ll compare these different open-source products across different dimensions, like maturity, documentation, extensibility, features like data profiling and anomaly detection.
Parquet performance tuning: the missing guideRyan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...StreamNative
The Apache Pulsar community is collaborating with the Delta Lake community to add up to both ecosystems: Pulsar - Delta Lake Connector.
In this session, we will first provide an overview of Pulsar - Delta Lake connector and Delta Lake Standalone Reader, then introduce the design of Pulsar - Delta Lake CDC source connector including how to capture data change of Delta lake and how to recover from last checkpoint with the help of Pulsar Function state store . We will also discuss the scalability of this Pulsar - Delta lake CDC source connector.
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...Flink Forward
Apache Flink's DataStream API is very expressive and gives users precise control over time and state. However, many applications do not require this level of expressiveness and can be implemented more concisely and easily with a domain-specific API. SQL is undoubtedly the most widely used language for data processing but usually applied in the domain of batch processing. Apache Flink features two relational APIs for unified stream and batch processing, the Table API, a language-integrated relational query API for Scala and Java, and SQL. A Table API or SQL query computes the same result regardless whether it is evaluated on a static file or on a Kafka topic. While Flink evaluates queries on batch input like a conventional query engine, queries on streaming input are continuously processed and their results constantly updated and refined. In this talk we present Flink’s unified relational APIs, show how streaming SQL queries are processed, and discuss exciting new use-cases.
The document discusses Pentaho Data Integration (Kettle), an open-source ETL tool. It describes Kettle's components like Spoon for designing transformations and jobs, Pan and Kitchen for executing them. It covers extracting, transforming and loading data, transformation steps, job entries, and how Kettle is used to integrate data from various sources into data warehouses.
Buckle Up! With Valerie Burchby and Xinran Waibe | Current 2022HostedbyConfluent
Buckle Up! With Valerie Burchby and Xinran Waibe | Current 2022
Are you considering converting your daily batch ETLs into a new and exhilarating realtime framework? We’ll help you look before you leap as we take a deep dive into the unique operational challenges entailed in transitioning data processing paradigms.
As batched data pipelines consume data from well defined time intervals and write results to partitioned data storage, batched jobs are often idempotent, so the failure recovery is simply rerunning the faulty job instances. Batched data processes are triggered at a certain frequency (e.g. daily or hourly), so the data latency is determined by both the job scheduler and job run time. Therefore, many advanced data use cases, such as frequency capping, requires event streaming to enable real-time data insights. Event streaming applications process unbounded input data in real-time and append output to message queues and/or tables to be further processed. However, real-time data insights are no free meal - because event streaming comes with many unique engineering challenges, such as handling late-arriving and duplicate events, implementing event-time partitioning, and backfilling historical data after failures. In addition, batched-driven and even streaming are not incompatible to each other but can often be better together, as the Delta and Kappa Architecture are commonly adopted in modern data systems.
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
The document discusses Delta Live Tables (DLT), a tool from Databricks that allows users to build reliable data pipelines in a declarative way. DLT automates complex ETL tasks, ensures data quality, and provides end-to-end visibility into data pipelines. It unifies batch and streaming data processing with a single SQL API. Customers report that DLT helps them save significant time and effort in managing data at scale, accelerates data pipeline development, and reduces infrastructure costs.
Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)Andreas Buckenhofer
Part 3(4)
The slides contain a DWH lecture given for students in 5th semester. Content:
- Introduction DWH and Business Intelligence
- DWH architecture
- DWH project phases
- Logical DWH Data Model
- Multidimensional data modeling
- Data import strategies / data integration / ETL
- Frontend: Reporting and anaylsis, information design
- OLAP
Tomer Shiran est le fondateur et chef de produit (CPO) de Dremio. Tomer était le 4e employé et vice-président produit de MapR, un pionnier de l'analyse du Big Data. Il a également occupé de nombreux postes de gestion de produits et d'ingénierie chez IBM Research et Microsoft, et a fondé plusieurs sites Web qui ont servi des millions d'utilisateurs. Il est titulaire d'un Master en génie informatique de l'Université Carnegie Mellon et d'un Bachelor of Science en informatique du Technion - Israel Institute of Technology.
Le Modern Data Stack meetup est ravi d'accueillir Tomer Shiran. Depuis Apache Drill, Apache Arrow maintenant Apache Iceberg, il ancre avec ses équipes des choix pour Dremio avec une vision de la plateforme de données “ouverte” basée sur des technologies open source. En plus, de ces valeurs qui évitent le verrouillage de clients dans des formats propriétaires, il a aussi le souci des coûts qu’engendrent de telles plateformes. Il sait aussi proposer un certain nombre de fonctionnalités qui transforment la gestion de données grâce à des initiatives telles Nessie qui ouvre la route du Data As Code et du transactionnel multi-processus.
Le Modern Data Stack Meetup laisse “carte blanche” à Tomer Shiran afin qu’il nous partage son expérience et sa vision quant à l’Open Data Lakehouse.
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
Delta Lake is an open source framework living on top of parquet in your data lake to provide Reliability and performances. It has been open-sourced by Databricks this year and is gaining traction to become the defacto delta lake format.
We’ll see all the goods Delta Lake can do to your data with ACID transactions, DDL operations, Schema enforcement, batch and stream support etc !
Delivering rapid-fire Analytics with Snowflake and TableauHarald Erb
Until recently, advancements in data warehousing and analytics were largely incremental. Small innovations in database design would herald a new data warehouse every
2-3 years, which would quickly become overwhelmed with rapidly increasing data volumes. Knowledge workers struggled to access those databases with development intensive BI tools designed for reporting, rather than exploration and sharing. Both databases and BI tools were strained in locally hosted environments that were inflexible to growth or change.
Snowflake and Tableau represent a fundamentally different approach. Snowflake’s multi-cluster shared data architecture was designed for the cloud and to handle logarithmically larger data volumes at blazing speed. Tableau was made to foster an interactive approach to analytics, freeing knowledge workers to use the speed of Snowflake to their greatest advantage.
Containerized Stream Engine to Build Modern Delta LakeDatabricks
As days goes, everything is changing, your business, your analytics platform and your data. So, Deriving the real time insights from this humongous volume of data are key for survival. This robust solution can operate you to the speed of change.
SF Big Analytics 2020-07-28
Anecdotal history of Data Lake and various popular implementation framework. Why certain tradeoff was made to solve the problems, such as cloud storage, incremental processing, streaming and batch unification, mutable table, ...
Flink SQL: The Challenges to Build a Streaming SQL EngineHostedbyConfluent
Flink SQL is Apache Flink's streaming SQL engine that supports data movement, data warehousing, and event-driven scenarios. There are four main challenges in building a streaming SQL engine: late data with unbounded operators, retractions amplification in complex query graphs, maintaining event ordering across distributed systems, and dealing with nondeterminism from functions like random and timestamps. The document discusses how Flink SQL addresses these challenges and the state and storage solutions in Flink, including using local state, disaggregated state in external storage, and the Apache Paimon lake storage format which can improve performance by 10x.
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
Many had dubbed 2020 as the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach, where data plays a central theme in our everyday lives.
As the volume and variety of data garnered from myriad data sources continue to grow at an astronomical scale and as cloud computing offers cheap computing and data storage resources at scale, the data platforms have to match in their abilities to process, analyze, and visualize at scale and speed and with ease — this involves data paradigm shifts in processing and storing and in providing programming frameworks to developers to access and work with these data platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they facilitate the future data scientists to start quickly.
In particular, we will examine in detail two open-source tools MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage for structured and unstructured data).
Other emerging tools such as Koalas help data scientists to do exploratory data analysis at scale in a language and framework they are familiar with as well as emerging data + AI trends in 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
CCI2017 - Considerations for Migrating Databases to Azure - Gianluca Sartoriwalk2talk srl
The document discusses considerations for migrating databases to Microsoft Azure SQL Database. It covers cloud options like Infrastructure as a Service (IaaS) using SQL Server on Azure VMs and Platform as a Service (PaaS) options like Azure SQL Database. It also discusses analyzing database compatibility, different migration methods like using BACPAC files or the Data Migration Assistant, and ways to optimize the migration process like monitoring tempdb usage.
Using Databricks as an Analysis PlatformDatabricks
Over the past year, YipitData spearheaded a full migration of its data pipelines to Apache Spark via the Databricks platform. Databricks now empowers its 40+ data analysts to independently create data ingestion systems, manage ETL workflows, and produce meaningful financial research for our clients.
This document provides steps for LO extraction from SAP R/3 to SAP BI/BW. It discusses LBWE activities like scheduling extraction jobs, different update modes including direct delta, queued delta, and unserialized V3 update. It also covers setup tables which collect required data from application tables to initialize delta loads and full loads, avoiding direct access to R/3 tables. The author is an SAP BI consultant who has experience with ABAP, custom report development, and system migrations.
The document provides information about Trivadis, an IT consulting company, and discusses database upgrade and migration. It begins with an overview of Trivadis' mission to reduce complexity for customers through technological expertise. The next sections cover Trivadis' key figures, the experience of consultant Markus Flechtner, and an agenda for discussing database upgrade and migration methods and steps. The checklist then reviews choosing the right upgrade/migration methods, considerations for using DataPump, preparing clients and databases, creating a project plan, testing, and post-upgrade checks.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.
We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top a table; We load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Data analytics is a powerful tool that can transform business decision-making across industries. Contact District 11 Solutions, which specializes in data analytics, to make informed decisions and achieve your business goals.
The Rise of Python in Finance,Automating Trading Strategies: _.pdfRiya Sen
In the dynamic realm of finance, where every second counts, the integration of technology has become indispensable. Aspiring traders and seasoned investors alike are turning to coding as a powerful tool to unlock new avenues of financial success. In this blog, we delve into the world of Python live trading strategies, exploring how coding can be the key to navigating the complexities of the market and securing your path to prosperity.
Introduction to Data Science
1.1 What is Data Science, importance of data science,
1.2 Big data and data Science, the current Scenario,
1.3 Industry Perspective Types of Data: Structured vs. Unstructured Data,
1.4 Quantitative vs. Categorical Data,
1.5 Big Data vs. Little Data, Data science process
1.6 Role of Data Scientist
Combined supervised and unsupervised neural networks for pulse shape discrimi...Samuel Jackson
Our methodology for pulse shape discrimination is split into two steps. Firstly, we learn a model to discriminate between pulses using "clean" low-rate examples by removing pile-up & saturated events. In addition to traditional tail sum discrimination, we investigate three different choices for discrimination between γ-pulses, fast, thermal neutrons. We consider clustering the pulses directly using Gaussian Mixture Modelling (GMM), using variational autoencoders to learn a representation of the pulses and then clustering the learned representation (VAE+GMM) and using density ratio estimation to discriminate between a mixed (γ + neutron) and pure (γ only) sources using a multi-layer perceptron (MLP) as a supervised learning problem.
Secondly, we aim to classify and recover pile-up events in the < 150 ns regime by training a single unified multi-label MLP. To frame the problem as a multi-label supervised learning method, we first simulate pile-up events with known components. Then, using the simulated data and combining it with single event data, we train a final multi-label MLP to output a binary code indicating both how many and which type of events are present within an event window.
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...rightmanforbloodline
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B. Fraleigh, Verified Chapters 1 - 56,.pdf
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B. Fraleigh, Verified Chapters 1 - 56,.pdf
Big Data and Analytics Shaping the future of PaymentsRuchiRathor2
The payments industry is experiencing a data-driven revolution powered by big data and analytics.
Here's a glimpse into 5 ways this dynamic duo is transforming how we pay.
In essence, big data and analytics are playing a pivotal role in building a future filled with faster, more secure, and convenient payment methods for everyone.
Overview of Statistical software such as ODK, surveyCTO,and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
2. Agenda
▪ Data capture challenges
▪ Change Data Feed on the
Lakehouse
▪ Capture Changes
▪ Process Changes
▪ Demo
3. Who are we? - Jose Torres
● Software Engineer - Databricks
● Committer - Delta Lake, Apache
Spark
● Database and cooking enthusiast
4. Who are we? - Rahul Mahadev
● Software Engineer– Databricks
● Delta Lake Committer
● MS Computer Science, University of
Illinois Urbana-Champaign
5. Who are we? - Itai Weiss
• Lead Solution Architect– Databricks
• Working with Apache Spark! since
v1.6
• Consulted for numerous firms across
Financial, Insurance, Tech, Pharma,
Manufacturing, Retail and
Transportation
6. Delta Lake
An open, reliable, performant,
and secure data storage and
management layer for your
data
■ Fresh & reliable data with a single source of truth
■ Data warehouse performance with data lake
economics
■ Advanced security and standards to meet
compliance needs
■ Open format
7. Current Challenges
Delta table v-1
Spark Insert
Spark Updates
Delta table v
Spark Delete
Spark Union
■ Big data increases the complexity of
• Lots of data
• Changes infrequently
• Just want to process the newest changes
Full Outer Join
Inserts
Updates
Deletes
8. ■ Read only the changed data
■ Avoid full table scans
■ Reduce compute and memory
Solve data challenges with CDF
Improve ETL
pipelines
Unify batch and
streaming
BI on your
data lake
Meet regulatory
needs
For All your use cases
Delta table with CDF
Spark Union
Delta Change Data Feed
9. Where Delta Change Data Feed Applies
Raw Ingestion
and History
BRONZE
Filtered, Cleaned,
Augmented
SILVER
Business-level
Aggregates
GOLD
CD
F
External feeds,
Other CDC output,
Extracts
10. How Does Delta Change Data Feed Work?
PK B
A1 B1
A2 B2
A3 B3
Original Table (v1)
PK B
A2 Z2
A3 B3
A4 B4
Change data
(Merged as v2)
PK B
Change
Type
Time
Versio
n
A2 B2 Preimage
12:00:0
0
2
A2 Z2 Postimage
12:00:0
0
2
A3 B3 Delete
12:00:0
0
2
A4 B4 Insert
12:00:0
0
2
Change Data Feed Output
A1 record did not receive an update or delete.
So it will not be output by CDF.
11. Consuming the Delta Change Data Feed
Stream-based Consumption
● Delta Change Feed is processed as
each source commit completes
● Rapid source commits can result
in multiple Delta versions being
included in a single micro-batch
Batch Consumption
● Batches are constructed based in
time-bound windows which may
contain multiple Delta versions
A2 B2 Preimage 12:00:00 2
A2 Z2 Postimage 12:00:00 2
A3 B3 Delete 12:00:00 2
A4 B4 Insert 12:00:00 2
A5 B5 Insert
12:08:0
0
3 A6 B6 Insert 12:09:00 4
A6 B6 Preimage 12:10:05 5
A6 Z6 Postimage 12:10:05 5
A5 B5 Insert 12:08:00 3
A6 B6 Insert 12:09:00 4
12:00
A6 B6 Preimage 12:10:05 5
A6 Z6
Postimag
e
12:10:05 5
12:10:00 12:20
12:08 12:10:10
Stream - micro batches
Batch - every 10 mins
12:00
12. Storing the Delta Change Data Feed
Changes to the table
are stored as ordered,
atomic units called
commits
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
000000.json
000001.json
…
13. Storing the Delta Change Data Feed
● When Change Data
Feed is enabled,
commits will reference
an additional set of
files containing the
change data events
● These files contains
the updated and
deleted records
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
Cdf 1.parquet
000000.json
000001.json
…
14. Storing the Delta Change Data Feed
● Silver & Gold Tables
○ Improve Delta performance by processing only data changes, and
simplify ETL/ELT operations
● Materialized Views
○ Create up-to-date, aggregated views of information based on the data
changes
● Transmit Changes
○ Send only the data changes to downstream systems
● Audit Trail Table
○ Capture and store all the data changes over time, including inserts,
updates, and deletes
15. When to Use Delta Change Data Feed
● Delta changes include updates and/or deletes ● Delta changes are append only
● Small fraction of records updated in each
batch
● Most records in the table updated in each
batch
● Data received from external sources is in CDC
format
● Data received comprises destructive loads
● Send data changes to downstream application
● Find and ingest data outside of the Lakehouse
16. Getting started with Delta Change Data Feed
spark.conf.set('spark.databri
cks.delta.properties.defaults
.enableChangeDataFeed', True)
ALTER TABLE ...
SET TBLPROPERTIES
(delta.enableChangeDataFeed =
true);
Enable CDF on TABLE… … or Establish CDF on CLUSTER
in SQL, Python, or Scala in Python, or Scala
17. Using Delta Change Data Feed
INSERT INTO TABLE ...
USING delta ...
as
SELECT … FROM
table_changes(...)
SELECT … FROM
table_changes('tableName',
startingVersion
[,endingVersion])
or
SELECT … FROM
table_changes('tableName',
'startingTimestamp'
[,'endingTimestamp'])
Query changes… … and store them
in SQL, Python, or Scala