Tez is the next-generation Hadoop query processing framework written on top of YARN. Computation topologies in higher-level languages like Pig and Hive can be naturally expressed in the new graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job, resulting in lower latency for short queries and improved throughput for large-scale queries. MapReduce has been the workhorse for Hadoop, but its monolithic structure has made innovation slower. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic framework for data processing that benefits the entire Hadoop query ecosystem.
This document provides an overview of Hive and its performance capabilities. It discusses Hive's SQL interface for querying large datasets stored in Hadoop, its architecture which compiles SQL queries into MapReduce jobs, and its support for SQL semantics and datatypes. The document also covers techniques for optimizing Hive performance, including data abstractions like partitions, buckets and skews. It describes different join strategies in Hive like shuffle joins, broadcast joins and sort-merge bucket joins and how they are implemented in MapReduce. The overall presentation aims to explain how Hive provides scalable SQL processing for big data.
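The difference between a shuffle join and a broadcast join, as named in the summary above, can be sketched in a few lines. This is a toy in-memory illustration in Python, not Hive's implementation; the function names `shuffle_join` and `broadcast_join`, the `num_partitions` parameter, and the tuple layout are all assumptions made for the example.

```python
from collections import defaultdict


def shuffle_join(left, right, num_partitions=4):
    """Partition both inputs by join key, then join inside each partition.

    Mimics the idea of a reduce-side join: rows with the same key from
    both tables are routed to the same partition before being matched.
    """
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    joined = []
    for p in range(num_partitions):
        # Build a per-partition lookup of the right side.
        index = defaultdict(list)
        for key, value in right_parts[p]:
            index[key].append(value)
        for key, lvalue in left_parts[p]:
            for rvalue in index.get(key, []):
                joined.append((key, lvalue, rvalue))
    return joined


def broadcast_join(big, small):
    """Hash the small table once and stream the big table past it.

    Mimics the idea of a map-side join: the small table is copied to
    every worker, so the big table never has to be repartitioned.
    """
    index = defaultdict(list)
    for key, value in small:
        index[key].append(value)
    return [(k, bv, sv) for k, bv in big for sv in index.get(k, [])]
```

Both functions return the same rows (possibly in a different order); the broadcast variant only makes sense when the small input fits in memory on every worker.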
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
Apache Hive is data warehouse software built on top of Hadoop that allows users to query data stored in various databases and file systems using an SQL-like interface. It provides a way to summarize, query, and analyze large datasets stored in the Hadoop Distributed File System (HDFS). Hive gives users SQL capabilities for analyzing data without needing MapReduce programming. Users can build a data warehouse by creating Hive tables, loading data files into HDFS, and then querying and analyzing the data using HiveQL, which Hive then converts into MapReduce jobs.
HBase and HDFS: Understanding FileSystem Usage in HBase (enissoz)
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short-circuit reads, checksums, and block placement hints. The document is intended to help readers understand HBase's interactions with HDFS for tuning IO performance.
The document discusses LLAP (Live Long and Process), a new capability in Apache Hive that uses long-lived daemon processes to improve query performance. LLAP eliminates Hive query startup costs by keeping query execution engines alive between queries. It allows queries to leverage just-in-time optimization and data caching to enable interactive query performance directly on HDFS data. LLAP utilizes asynchronous I/O, in-memory caching, and a query fragment API to optimize query processing. It integrates with Apache Tez to coordinate query execution across long-lived daemon processes and traditional YARN containers.
This document summarizes techniques for optimizing Hive queries, including recommendations around data layout, format, joins, and debugging. It discusses partitioning, bucketing, sort order, normalization, text format, sequence files, RCFiles, ORC format, compression, shuffle joins, map joins, sort merge bucket joins, count distinct queries, using explain plans, and dealing with skew.
The document discusses Apache NiFi and its role in the Hadoop ecosystem. It provides an overview of NiFi and describes how it can integrate with Hadoop components like HDFS, HBase, and Kafka. It also discusses how NiFi supports stream processing integrations and outlines some use cases. The document concludes by discussing future work, including improving NiFi's high availability and multi-tenancy and expanding its ecosystem integrations.
Apache Tez - Accelerating Hadoop Data Processing (hitesh1892)
Apache Tez - A New Chapter in Hadoop Data Processing. Talk at Hadoop Summit, San Jose, 2014, by Bikas Saha and Hitesh Shah.
Apache Tez is a modern data processing engine designed for YARN on Hadoop 2. Tez aims to provide high performance and efficiency out of the box, across the spectrum of low latency queries and heavy-weight batch processing.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization, and different loading schemes.
This document discusses Apache Ranger, an open source framework for centralized security administration across Hadoop components like HDFS, Hive, HBase, Knox, Storm, YARN, Kafka, and Solr. It provides authorization and auditing capabilities. Ranger allows defining flexible access policies in a centralized manner and enforcing them. It has an extensible architecture to easily add new components and customize authorization decisions using conditions and context enrichers. The document outlines Ranger's key capabilities and provides examples of its policy definitions and extensibility features.
Cosco: An Efficient Facebook-Scale Shuffle Service (Databricks)
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Along with the Hive Metastore, these table formats aim to solve long-standing problems in traditional data lakes with their declared features such as ACID transactions, schema evolution, upsert, time travel, and incremental consumption.
Transactional operations in Apache Hive: present and future (DataWorks Summit)
Apache Hive is an enterprise data warehouse built on top of Hadoop. Hive supports insert, update, delete, and merge SQL operations with transactional semantics and read operations that run at snapshot isolation. The well-defined semantics of these operations in the face of failure and concurrency are critical to building robust applications on top of Apache Hive. In the past there were many preconditions to enabling these features, which meant giving up other functionality. The need to make these tradeoffs is rapidly being eliminated.
This talk will describe the intended use cases, the architecture of the implementation, recent improvements, and new features built for Hive 3.0. For example, bucketing transactional tables, while supported, is no longer required. The performance overhead of using transactional tables is nearly eliminated relative to identical non-transactional tables. We’ll also cover the Streaming Ingest API, which allows writing batches of events into a Hive table without using SQL.
Speaker
Eugene Koifman, Hortonworks, Principal Software Engineer
Apache Iceberg: An Architectural Look Under the Covers (ScyllaDB)
Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve this is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard has been the Hive table format, which addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
The Apache Iceberg table format is now used and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design
This deck presents the best practices of using Apache Hive with good performance. It covers getting data into Hive, using ORC file format, getting good layout into partitions and files based on query patterns, execution using Tez and YARN queues, memory configuration, and debugging common query performance issues. It also describes Hive Bucketing and reading Hive Explain query plans.
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It presents a SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop. The document provides links to Hive documentation, tutorials, presentations and other resources for learning about and using Hive. It also includes a table describing common Hive CLI commands and their usage.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Storm is a distributed and fault-tolerant realtime computation system. It was created at BackType/Twitter to analyze tweets, links, and users on Twitter in realtime. Storm provides scalability, reliability, and ease of programming. It uses components like Zookeeper, ØMQ, and Thrift. A Storm topology defines the flow of data between spouts that read data and bolts that process data. Storm guarantees processing of all data through its reliability APIs and guarantees no data loss even during failures.
The document discusses Apache Tez, a distributed execution framework for data processing applications. Tez is designed to improve performance over Hadoop MapReduce by expressing computations as dataflow graphs and optimizing resource usage. It aims to empower users with expressive APIs, a flexible runtime model, and simplified deployment. Tez also works to improve execution performance by eliminating overhead from MapReduce, optimizing dynamically at runtime, and managing resources optimally with YARN.
This document discusses Apache Tez, a framework for accelerating Hadoop query processing. Tez is designed to express query computations as dataflow graphs and execute them efficiently on YARN. It addresses limitations of MapReduce by allowing for custom dataflows and optimizations. Tez provides APIs for defining DAGs of tasks and customizing inputs/outputs/processors. This allows applications to focus on business logic while Tez handles distributed execution, fault tolerance, and resource management for Hadoop clusters.
The document discusses Apache Tez, a framework for accelerating Hadoop query processing. Some key points:
- Tez is a dataflow framework that expresses computations as directed acyclic graphs (DAGs) of tasks. This allows optimizations like container reuse and locality-aware scheduling.
- It is built on YARN and provides a customizable execution engine as well as runtime and DAG APIs for applications to define computations.
- Compared to MapReduce, Tez can provide better performance, predictability, and resource utilization through its DAG execution model and optimizations like reducing intermediate data writes.
- It has been used to improve performance for workloads like Hive, Pig, and large TPC-DS queries.
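As a rough illustration of the DAG model described in the points above, a multi-stage computation can be modeled as named tasks plus dependency edges and run in topological order, with intermediate results handed from task to task in memory. This is a toy Python sketch, not the Tez API; `run_dag`, the task names, and the sample data are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run tasks in dependency order, passing outputs along edges.

    tasks: task name -> function taking a list of upstream outputs
    deps:  task name -> list of upstream task names
    """
    results = {}
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        inputs = [results[upstream] for upstream in deps.get(name, [])]
        results[name] = tasks[name](inputs)
    return order, results


def aggregate(inputs):
    # Sum amounts per key from the single upstream scan.
    totals = {}
    for key, amount in inputs[0]:
        totals[key] = totals.get(key, 0) + amount
    return totals


def join(inputs):
    # Attach user names to the aggregated totals.
    totals, users = inputs
    return {users[key]: total for key, total in totals.items()}


tasks = {
    "scan_orders": lambda _: [("a", 3), ("b", 5), ("a", 4)],
    "scan_users": lambda _: {"a": "Ann", "b": "Bob"},
    "aggregate": aggregate,
    "join": join,
}
deps = {
    "scan_orders": [],
    "scan_users": [],
    "aggregate": ["scan_orders"],
    "join": ["aggregate", "scan_users"],
}

order, results = run_dag(tasks, deps)
print(results["join"])
```

The whole scan, aggregate, and join pipeline runs as one graph, which is the property the bullets attribute to Tez: no stage has to be written out as a separate job before the next one starts.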
This document discusses Apache Tez, a framework for accelerating Hadoop query processing. Some key points:
- Tez is a dataflow framework that expresses computations as directed acyclic graphs (DAGs) of tasks, allowing for optimizations like container reuse and locality-aware scheduling.
- It is built on YARN and provides a customizable execution engine as well as APIs for applications like Hive and Pig.
- By expressing jobs as DAGs, Tez can reduce overheads and queueing delays and make better use of cluster resources than the traditional MapReduce framework.
- The document provides examples of how Tez can improve performance for operations like joins, aggregations, and handling of multiple outputs.
Apache Tez : Accelerating Hadoop Query Processing (Bikas Saha)
Apache Tez is the new data processing framework in the Hadoop ecosystem. It runs on top of YARN - the new compute platform for Hadoop 2. Learn how Tez is built from the ground up to tackle a broad spectrum of data processing scenarios in Hadoop/BigData - ranging from interactive query processing to complex batch processing. With a high degree of automation built-in, and support for extensive customization, Tez aims to work out of the box for good performance and efficiency. Apache Hive and Pig are already adopting Tez as their platform of choice for query execution.
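Apache Tez : Accelerating Hadoop Query Processing (Teddy Choi)
Jeff Markham, Technical Director for Asia at Hortonworks, introduces Tez. Tez is software that replaces MapReduce to accelerate query processing in Hadoop. The talk covers why Tez was created, how it is structured, how optimization is carried out, and how much performance has improved.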
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0 (Adam Muise)
The document discusses Hadoop 2.2.0 and new features in YARN and MapReduce. Key points include: YARN introduces a new application framework and resource management system that replaces the JobTracker, allowing multiple data processing engines besides MapReduce; MapReduce is now a library that runs on YARN; Tez is introduced as a new data processing framework to improve performance beyond MapReduce.
Apache Tez - A unifying Framework for Hadoop Data Processing (DataWorks Summit)
This document provides an overview of Apache Tez, a framework for building data processing applications on Hadoop YARN. It describes how Tez allows applications to define complex data flows as directed acyclic graphs (DAGs) and handles distributed execution, fault tolerance, and resource management. Tez has improved the performance of Apache Hive and Pig by an order of magnitude by enabling more flexible DAG definitions and runtime optimizations. It also supports integration with other data processing engines like Spark, Storm and interactive SQL queries. The document outlines how Tez works and provides guidance on how developers can contribute to the open source project.
YARN Ready: Integrating to YARN with Tez (Hortonworks)
The YARN Ready webinar series helps developers integrate their applications with YARN. Tez is one vehicle for doing that. We take a deep dive, including a code review, to help you get started.
Tez: Accelerating Data Pipelines - fifthel (t3rmin4t0r)
This document provides an overview of Tez, an Apache project that provides a framework for executing data processing jobs on Hadoop clusters. Tez allows data processing jobs to be expressed as directed acyclic graphs (DAGs) of tasks and executes these tasks in an optimized manner. It addresses limitations of MapReduce by providing a more flexible execution engine that can optimize performance and resource utilization.
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor... (Data Con LA)
Apache Tez is a library to build data processing engines in Hadoop/YARN. It takes care of many common building blocks like scheduling, fault tolerance, speculation, security etc. so that the engine can focus on its core features. E.g. Apache Hive can focus on SQL optimization. There has been rapid adoption in projects like Hive, Pig, Flink, Cascading, Scalding and commercial products like Datameer and Syncsort. We will provide a brief overview of Tez and then look at new features for job monitoring in the Tez UI and performance debugging tools for Tez applications. Finally we will explore upcoming features like hybrid scheduling that open up new areas of performance and functionality.
Apache Tez is a framework for building data processing applications on top of YARN. It allows expressing a computation as a directed acyclic graph (DAG) to optimize execution. Tez improves on MapReduce by avoiding intermediate data writes to HDFS and enabling optimizations across jobs. The presentation covered Tez features like container reuse, dynamic parallelism, and integration with YARN timeline service. It also discussed ongoing work to improve performance through speculation, intermediate file formats, and shuffle optimizations.
Apache Tez is a framework for executing data processing jobs on Hadoop clusters. It allows expressing jobs as directed acyclic graphs (DAGs) which enables optimizations like running jobs as a single logical unit rather than separate MapReduce jobs. The presentation covered Tez features like container reuse, dynamic parallelism, and integration with YARN and ATS for monitoring. It also discussed ongoing work to improve performance through speculation, intermediate file formats, and shuffle optimizations, as well as better debuggability using tools like the Tez UI.
Hortonworks Get Started Building YARN Applications Dec. 2013. We cover YARN basics, benefits, getting started and roadmap. Actian shares their experience and recommendations on building their real-world YARN application.
Running Non-MapReduce Big Data Applications on Apache Hadoop (hitesh1892)
Apache Hadoop has become popular from its specialization in the execution of MapReduce programs. However, it has been hard to leverage existing Hadoop infrastructure for various other processing paradigms such as real-time streaming, graph processing and message-passing. That was true until the introduction of Apache Hadoop YARN in Apache Hadoop 2.0. YARN supports running arbitrary processing paradigms on the same Hadoop cluster. This allows for development of newer frameworks as well as more efficient implementations of existing frameworks that can all run on and share the resources of a single multi-tenant YARN cluster. This talk gives a brief introduction to YARN. We will illustrate how to create applications and how to best make use of YARN. We will show examples of different applications such as Apache Tez and Apache Samza that can leverage YARN and present best practices/guidelines on building applications on top of Apache Hadoop YARN.
Introduction to Tez by Olivier RENAULT of Hortonworks, Meetup of 25/11/2014 (Modern Data Stack France)
During this presentation, Olivier will introduce Apache Tez: what it does, why many see it as MapReduce v2, and how it helps Hive, Pig, Cascading, and others increase their performance.
Speaker: Olivier Renault is a Principal Solution Engineer at Hortonworks, the company behind the Hortonworks Data Platform. Olivier is an expert on how to deploy Hadoop at scale in a secure and performant manner.
Ever wonder what Hadoop might look like in 12 months or 24 months or longer? Apache Hadoop MapReduce has undergone a complete overhaul to emerge as Apache Hadoop YARN, a generic compute fabric to support MapReduce and other application paradigms. As a result, Hadoop looks very different from what it was 12 months ago. This talk will take you through some ideas for YARN itself and the myriad ways it is really moving the needle for MapReduce, Pig, Hive, Cascading and other data-processing tools in the Hadoop ecosystem.
YARN: Future of Data Processing with Apache Hadoop (Hortonworks)
Vinod Kumar Vavilapalli presented on the future of data processing with Apache Hadoop. He discussed limitations of the classic MapReduce architecture including scalability, single point of failure, and low resource utilization. He then introduced the new YARN architecture which splits up the JobTracker into a ResourceManager and per-application ApplicationMasters for improved fault tolerance, utilization, and scalability. Benchmarks show performance gains of up to 2x compared to classic MapReduce. Hadoop 2.0 alpha is available for testing and feedback.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with Python’s scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, know what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hour in). Basic knowledge of Python is highly recommended.
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS satisfies correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" that can be embedded inside HBase and provides the level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Using Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data, streaming it in real time into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data using Apache Zeppelin against Phoenix tables as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it is not trivial to design applications that make the most of it, nor is it the simplest system to operate. Because it depends on and integrates with other components of the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. In addition, HBase is still an evolving product, with different release versions currently in use, some of which can carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges OCLC has encountered in scaling to support the world catalog, and how they were overcome.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. In addition, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase : Troubleshooting and Supportability Improvements (DataWorks Summit)
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail and discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale, as well as the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, along with various tools and libraries that help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we dive deep into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premise and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud, and to de-anonymize it dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also take a deep dive into Ranger's integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
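To make the tokenize and de-anonymize idea above concrete, here is a minimal, generic sketch of reversible tokenization. It is not Apache Ranger's API or policy model: the `TokenVault` class, its method names, and the `authorized` flag are all hypothetical and exist only for illustration. The sketch assumes an in-memory mapping held on the trusted side, where a real deployment would rely on a secured store and policy-driven authorization.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping sensitive values to random tokens."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value: str) -> str:
        """Return a stable random token for value, creating one on first use."""
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token: str, authorized: bool) -> str:
        """Return the original value, but only for an authorized caller."""
        if not authorized:
            raise PermissionError("caller may not de-anonymize this value")
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("123-45-6789")   # value leaves the trusted side as a token
original = vault.detokenize(token, authorized=True)
```

Because the token is random rather than derived from the value, it reveals nothing by itself; recovering the original requires access to the vault, which is where an access policy would be enforced.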
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enabling real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing the possible ways that a retail store of the near future could operate. A deep learning system attached to a camera stream can identify various storefront situations, such as item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (DataWorks Summit)
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Keynote: AI & Future Of Offensive Security (Priyanka Aash)
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
Retrieval Augmented Generation Evaluation with Ragas (Zilliz)
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judges has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
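As a rough illustration of what a context F1-score measures, the sketch below combines retrieval precision and recall into their harmonic mean. It is an assumption-laden simplification and not Ragas code: Ragas relies on an LLM judge to decide relevance, whereas this sketch assumes relevance is already known as a set of chunk IDs. The function names and the sample IDs are hypothetical.

```python
def context_precision_recall(retrieved, relevant):
    """Precision and recall of retrieved chunk IDs against known relevant IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 4 chunks retrieved, 3 chunks actually relevant, 2 overlap.
precision, recall = context_precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
context_f1 = f1_score(precision, recall)  # precision 1/2, recall 2/3, F1 = 4/7
```

The harmonic mean penalizes a retriever that scores well on only one of the two measures, which is why it is preferred over a simple average here.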
"Hands-on development experience using wasm Blazor", Furdak Vladyslav (Fwdays)
I will share my personal experience of full-time development on wasm Blazor
What difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, which technology stack and architectural patterns we chose
What conclusions we made and what mistakes we committed
How UiPath Discovery Suite supports identification of Agentic Process Automat... (DianaGray10)
📚 Understand the basics of the new persona-based, LLM-powered Agentic Process Automation (APA) and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
The Challenge of Interpretability in Generative AI Models (Sara Kroft)
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification (TrustArc)
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution, TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrates industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
Finetuning GenAI For Hacking and Defending (Priyanka Aash)
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
Discovery Series - Zero to Hero - Task Mining Session 1 (DianaGray10)
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an... (Zilliz)
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Generative AI technology is a fascinating field that focuses on creating comp... (Nohoax Kanont)
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
It's your unstructured data: How to get your GenAI app to production (and spe... (Zilliz)
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while handling data privacy and PII. Our Service for RAG helps AI developers and data scientists scale their apps from thousands to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All of this happens within your environment, without ever having to share private corporate data with anyone - not even Aparavi.