This document provides information about running Spark on YARN including:
- Spark allows processing of large datasets in a distributed manner using Resilient Distributed Datasets (RDDs).
- When running on YARN, Spark is able to leverage existing Hadoop clusters for locality-aware processing, resource management, and other benefits while still using its own execution engine.
- Running Spark on YARN provides advantages like shipping code to where the data is located instead of moving large amounts of data, leveraging existing Hadoop cluster infrastructure, and allowing Spark workloads to run natively within Hadoop.
June 2015 Berlin Buzzwords Presentation
http://berlinbuzzwords.de/file/bbuzz-2015-szehon-ho-hive-spark
https://berlinbuzzwords.de/session/hive-spark
Speaker Interview:
https://berlinbuzzwords.de/news/speaker-interview-szehon-ho
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark Summit
Spark on YARN provides resource management and security features through YARN, but still has areas for improvement. Dynamic allocation in YARN allows Spark applications to grow and shrink executors based on task demand, though latency and data locality could be enhanced. Security supports Kerberos authentication and delegation tokens, but long-lived applications face token expiration issues and encryption needs improvement for control plane, shuffle files, and user interfaces. Overall, usability, security, and performance remain areas of focus.
This document discusses integrating Docker containers with YARN by introducing a Docker container runtime to the LinuxContainerExecutor in YARN. The DockerContainerRuntime allows YARN to leverage Docker for container lifecycle management and supports features like resource isolation, Linux capabilities, privileged containers, users, networking and images. It remains a work in progress to support additional features around networking, users and images fully.
This document summarizes a presentation about new features in Apache Hadoop 3.0 related to YARN and MapReduce. It discusses major evolutions like the re-architecture of the YARN Timeline Service (ATS) to address scalability, usability, and reliability limitations. Other evolutions mentioned include improved support for long-running native services in YARN, simplified REST APIs, service discovery via DNS, scheduling enhancements, and making YARN more cloud-friendly with features like dynamic resource configuration and container resizing. The presentation estimates the timeline for Apache Hadoop 3.0 releases with alpha, beta, and general availability targeted throughout 2017.
The document discusses evolving HDFS to support generalized storage containers in order to better scale the number of files and blocks. It proposes using block containers and a partial namespace approach to initially scale to billions of files and blocks, and eventually much higher numbers. The storage layer is being restructured to support various container types for use cases beyond HDFS like object storage and HBase.
Low latency high throughput streaming using Apache Apex and Apache KuduDataWorks Summit
True streaming is fast becoming a necessity for many business use cases. On the other hand the data set sizes and volumes are also growing exponentially compounding the complexity of data processing pipelines.There exists a need for true low latency streaming coupled with very high throughput data processing. Apache Apex as a low latency and high throughput data processing framework and Apache Kudu as a high throughput store form a nice combination which solves this pattern very efficiently.
This session will walk through a use case which involves writing a high throughput stream using Apache Kafka,Apache Apex and Apache Kudu. The session will start with a general overview of Apache Apex and capabilities of Apex that form the foundation for a low latency and high throughput engine with Apache kafka being an example input source of streams. Subsequently we walk through Kudu integration with Apex by walking through various patterns like end to end exactly once, selective column writes and timestamp propagations for out of band data. The session will also cover additional patterns that this integration will cover for enterprise level data processing pipelines.
The session will conclude with some metrics for latency and throughput numbers for the use case that is presented.
Speaker
Ananth Gundabattula, Senior Architect, Commonwealth Bank of Australia
This document discusses using Spark as an execution engine for Hive queries. It begins by explaining that Hive and Spark are both commonly used in the big data space, and that Hive on Spark uses the Hive optimizer with the Spark query engine, while Spark with a Hive context uses both the Catalyst optimizer and Spark engine. The document then covers challenges in deploying Hive on Spark, such as using a custom Spark JAR without Hive dependencies. It shows how the Hive EXPLAIN command works the same on Spark, and how the execution plan and stages differ between MapReduce and Spark. Overall, the document provides a high-level overview of using Spark as a query engine for Hive.
Hadoop 2 introduces the YARN framework to provide a common platform for multiple data processing paradigms beyond just MapReduce. YARN splits cluster resource management from application execution, allowing different applications like MapReduce, Spark, Storm and others to run on the same Hadoop cluster. HDFS 2 improves HDFS with features like high availability, federation and snapshots. Apache Tez provides a new data processing engine that enables pipelining of jobs to improve performance over traditional MapReduce.
This document proposes a container-based sizing framework for Apache Hadoop/Spark clusters that uses a multi-objective genetic algorithm approach. It emulates container execution on different cloud platforms to optimize configuration parameters for minimizing execution time and deployment cost. The framework uses Docker containers with resource constraints to model cluster performance on various public clouds and instance types. Optimization finds Pareto-optimal configurations balancing time and cost across objectives.
Tuning Apache Ambari performance for Big Data at scale with 3000 agentsDataWorks Summit
Apache Ambari manages Hadoop at large-scale and it becomes increasingly difficult for cluster admins to keep the machinery running smoothly as data grows and nodes scale from 30 to 3000 agents. To test at scale, Ambari has a Performance Stack that allows a VM to host as many as 50 Ambari Agents. The simulated stack and 50 Agents per VM can stress-test Ambari Server with the same load as a 3000 node cluster. This talk will cover how to tune the performance of Ambari and MySQL, and share performance benchmarks for features like deploy times, bulk operations, installation of bits, Rolling & Express Upgrade. Moreover, the speaker will show how to use Ambari Metrics System and Grafana to plot performance, detect anomalies, and pinpoint tips on how to improve performance for a more responsive experience. Lastly, the talk will discuss roadmap features in Ambari 3.0 for improving performance and scale.
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
This document provides an overview of Apache Tez, a framework for building data processing applications on Hadoop YARN. It describes how Tez allows applications to define complex data flows as directed acyclic graphs (DAGs) and handles distributed execution, fault tolerance, and resource management. Tez has improved the performance of Apache Hive and Pig by an order of magnitude by enabling more flexible DAG definitions and runtime optimizations. It also supports integration with other data processing engines like Spark, Storm and interactive SQL queries. The document outlines how Tez works and provides guidance on how developers can contribute to the open source project.
This document provides best practices for YARN administrators and application developers. For administrators, it discusses YARN configuration, enabling ResourceManager high availability, configuring schedulers like Capacity Scheduler and Fair Scheduler, sizing containers, configuring NodeManagers, log aggregation, and metrics. For application developers, it discusses whether to use an existing framework or develop a native application, understanding YARN components, writing the client, and writing the ApplicationMaster.
- The document discusses Apache Hadoop YARN, including its past, present, and future.
- In the past, YARN started as a sub-project of Hadoop and had several alpha and beta releases before the first stable release in 2013.
- Currently, YARN enables rolling upgrades, long running services, node labels, and improved cluster management features like preemption scheduling and fine-grained resource isolation.
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
This workshop will provide an introduction to Big Data Analytics using Apache Spark and Apache Zeppelin.
https://github.com/zeltovhorton/intro_spark_zeppelin_meetup
There will be a short lecture that includes an introduction to Spark, the Spark components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
The lecture will be followed by demo . There will be a short lecture on Hadoop and how Spark and Hadoop interact and compliment each other. You will learn how to move data into HDFS using Spark APIs, create Hive table, explore the data with Spark and SQL, transform the data and then issue some SQL queries. We will be using Scala and/or PySpark for labs.
Spark is a fast and general engine for large-scale data processing. It provides APIs in Java, Scala, and Python and an interactive shell. Spark applications operate on resilient distributed datasets (RDDs) that can be cached in memory for faster performance. RDDs are immutable and fault-tolerant via lineage graphs. Transformations create new RDDs from existing ones while actions return values to the driver program. Spark's execution model involves a driver program that coordinates tasks on executor machines. RDD caching and lineage graphs allow Spark to efficiently run jobs across clusters.
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
Apache Tez is a framework for accelerating Hadoop query processing. It is based on expressing a computation as a dataflow graph and executing it in a highly customizable way. Tez is built on top of YARN and provides benefits like better performance, predictability, and utilization of cluster resources compared to traditional MapReduce. It allows applications to focus on business logic rather than Hadoop internals.
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersDataWorks Summit
This document discusses the Cloud and Information Services Lab (CISL) and its vision of having one cluster that can run all workloads. It describes CISL's research into improving resource management in shared clusters through projects like Mercury, which aims to improve cluster utilization by opportunistically using otherwise idle resources. The document outlines Mercury's architecture and how it implements a hybrid of centralized and distributed scheduling to better schedule tasks with short durations.
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterDataWorks Summit
This document discusses Apache Spark-on-YARN, which allows Spark applications to leverage existing Hadoop clusters. Spark improves efficiency over Hadoop via in-memory computing and supports rich APIs. Spark-on-YARN provides access to HDFS data and resources on Hadoop clusters without extra deployment costs. It supports running Spark jobs in YARN cluster and client modes. The document describes Yahoo's use of Spark-on-YARN for machine learning applications on large datasets.
This document discusses Netflix's use of Spark on Yarn for ETL workloads. Some key points:
- Netflix runs Spark on Yarn across 3000 EC2 nodes to process large amounts of streaming data from over 100 million daily users.
- Technical challenges included optimizing performance for S3, dynamic resource allocation, and Parquet read/write. Improvements led to up to 18x faster job completion times.
- Production Spark applications include recommender systems that analyze user behavior and personalize content across billions of profiles and titles.
Hadoop and Spark Analytics over Better StorageSandeep Patil
This document discusses using IBM Spectrum Scale to provide a colder storage tier for Hadoop & Spark workloads using IBM Elastic Storage Server (ESS) and HDFS transparency. Some key points discussed include:
- Using Spectrum Scale to federate ESS with existing HDFS or Spectrum Scale filesystems, allowing data to be seamlessly accessed even if moved to the ESS tier.
- Extending HDFS across multiple HDFS and Spectrum Scale clusters without needing to move data using Spectrum Scale's HDFS transparency connector.
- Integrating ESS tier with Spectrum Protect for backup and Spectrum Archive for archiving to take advantage of their policy engines and automation.
- Examples of using the unified storage for analytics workflows, life
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...gethue
This document describes building a REST job server for interactive Spark as a service using Livy. It discusses the history and challenges of running Spark jobs in Hue, introduces Livy as a Spark server, and details its local and YARN-cluster modes as well as session creation, execution flows, and interpreter support for Scala, Python, R and more. Magic commands are also covered for JSON, table, plotting and other output formats.
Productionizing Spark and the Spark Job ServerEvan Chan
You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.
Spark Compute as a Service at Paypal with Prabhu KasinathanDatabricks
Apache Spark is a gift to the big data community, which adds tons of new features on every release. However, it’s difficult to manage petabyte-scale Hadoop clusters with hundreds of edge nodes, multiple Spark releases and demonstrate operational efficiencies and standardization. In order to address these challenges, Paypal has developed and deployed a REST0based Spark platform: Spark Compute as a Service (SCaaS),which provides improved application development, execution, logging, security, workload management and tuning.
This session will walk through the top challenges faced by PayPal administrators, developers and operations and describe how Paypal’s SCaaS platform overcomes them by leveraging open source tools and technologies, like Livy, Jupyter, SparkMagic, Zeppelin, SQL Tools, Kafka and Elastic. You’ll also hear about the improvements PayPal has added, which enable it to run greater than 10,000 Spark applications in production effectively.
Introduction to Machine Learning in Spark. Presented at Bangalore Apache Spark Meetup by Shashank L and Shashidhar E S on 17/10/2015.
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/225649429/
ETL with SPARK - First Spark London meetupRafal Kwasny
The document discusses how Spark can be used to supercharge ETL workflows by running them faster and with less code compared to traditional Hadoop approaches. It provides examples of using Spark for tasks like sessionization of user clickstream data. Best practices are covered like optimizing for JVM issues, avoiding full GC pauses, and tips for deployment on EC2. Future improvements to Spark like SQL support and Java 8 are also mentioned.
The document discusses Spark exceptions and errors related to shuffling data between nodes. It notes that tasks can fail due to out of memory errors or files being closed prematurely. It also provides explanations of Spark's shuffle operations and how data is written and merged across nodes during shuffles.
A proxy server acts as an intermediary between a client and the internet. It allows enterprises to ensure security, administrative control, and caching services. There are different types of proxy servers such as caching proxies, web proxies, content filtering proxies, and anonymizing proxies. Proxy servers can operate in either a transparent or opaque mode. They provide benefits like security, performance improvements through caching, and load balancing but also have disadvantages like creating single points of failure.
Tech-Talk at Bay Area Spark Meetup
Apache Spark(tm) has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do I deploy these model to a production environment. How do I embed what I have learned into customer facing data applications. Like all things in engineering, it depends.
In this meetup, we will discuss best practices from Databricks on how our customers productionize machine learning models and do a deep dive with actual customer case studies and live demos of a few example architectures and code in Python and Scala. We will also briefly touch on what is coming in Apache Spark 2.X with model serialization and scoring options.
The document discusses proxy servers, specifically HTTP and FTP proxy servers. It defines a proxy server as a server that acts as an intermediary for requests from clients to other servers. It describes the main purposes of proxy servers as keeping machines behind it anonymous for security purposes and speeding up access to resources via caching. It also provides details on the mechanisms, types, protocols (HTTP and FTP), and functions of proxy servers.
From common errors seen in running Spark applications, e.g., OutOfMemory, NoClassFound, disk IO bottlenecks, History Server crash, cluster under-utilization to advanced settings used to resolve large-scale Spark SQL workloads such as HDFS blocksize vs Parquet blocksize, how best to run HDFS Balancer to re-distribute file blocks, etc. you will get all the scoop in this information-packed presentation.
http://hortonworks.com/hadoop/spark/
Recording:
https://hortonworks.webex.com/hortonworks/lsr.php?RCID=03debab5ba04b34a033dc5c2f03c7967
As the ratio of memory to processing power rapidly evolves, many within the Hadoop community are gravitating towards Apache Spark for fast, in-memory data processing. And with YARN, they use Spark for machine learning and data science use cases along side other workloads simultaneously. This is a continuation of our YARN Ready Series, aimed at helping developers learn the different ways to integrate to YARN and Hadoop. Tools and applications that are YARN Ready have been verified to work within YARN.
Hortonworks tech workshop in-memory processing with sparkHortonworks
Apache Spark offers unique in-memory capabilities and is well suited to a wide variety of data processing workloads including machine learning and micro-batch processing. With HDP 2.2, Apache Spark is a fully supported component of the Hortonworks Data Platform. In this session we will cover the key fundamentals of Apache Spark and operational best practices for executing Spark jobs along with the rest of Big Data workloads. We will also provide a working example to showcase micro-batch and machine learning processing using Apache Spark.
This document discusses Cloudera's initiative to make Spark the standard execution engine for Hadoop. It outlines how Spark improves on MapReduce by leveraging distributed memory and having a simpler developer experience. It also describes Cloudera's investments in areas like management, security, scale, and streaming to further Spark's capabilities and make it production-ready. The goal is for Spark to replace MapReduce as the execution engine and for specialized engines like Impala to handle specific workloads, with all sharing the same data, metadata, resource management, and other platform services.
This document provides an overview of installing and programming with Apache Spark on the Hortonworks Data Platform (HDP). It discusses how Spark fits within HDP and can be used for batch processing, streaming, SQL queries and machine learning. The document outlines how to install Spark on HDP using Ambari and describes Spark programming with Resilient Distributed Datasets (RDDs), transformations, actions and caching/persistence. It provides examples of Spark APIs and programming patterns.
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
Hortonworks Presentation at The Boulder/Denver BigData Meetup on July 22nd, 2015. Topic: Scaling Spark Workloads on YARN. Spark as a workload in a multi-tenant Hadoop infrastructure, scaling, cloud deployment, tuning.
This document provides an overview of installing and programming with Apache Spark on Hortonworks Data Platform (HDP). It introduces Spark and its components, benefits over other frameworks, and Hortonworks' commitment to Spark. The document outlines an example Spark programming workflow using Resilient Distributed Datasets (RDDs) in Scala, and covers common RDD transformations, actions, and persistence methods. It also discusses Spark deployment modes like standalone and on YARN, and reference HDP architectures using Spark.
Spark will replace MapReduce as the standard execution engine for Hadoop. Spark is faster than MapReduce for iterative jobs and can be used for batch processing, streaming, machine learning, and graph processing workloads. Cloudera is leading the effort to integrate Spark with Hadoop and make it enterprise ready through initiatives like the One Platform initiative to unify Spark and Hadoop for management, security, scale, and streaming capabilities.
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The UnionWangda Tan
This document summarizes the state of Apache Hadoop YARN and its evolution over time. It discusses how YARN started as a sub-project of Hadoop to support multiple applications and long-running services. It then outlines recent initiatives like containerization, GPU/FPGA support, federation, and improved scheduling algorithms to handle larger clusters with tens of thousands of nodes. The document also previews upcoming features in YARN 3.2 and beyond such as node attributes, container overcommit, and auto-spawning of system services.
Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer to be a common resource management platform that can host a wide variety of applications. Many organizations leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues, etc.
In this talk, we’ll start with the current status of Apache Hadoop YARN—how it is used today in deployments large and small. We'll then move on to the exciting present and future of YARN—features that are further strengthening YARN as the first-class resource management platform for data centers running enterprise Hadoop.
We’ll discuss the current status as well as the future promise of features and initiatives like: powerful container placement, global scheduling, support for machine learning and deep learning workloads through GPU and FPGA support, extreme scale with YARN federation, containerized apps on YARN, support for long running services (alongside applications) natively without any changes, seamless application upgrades, powerful scheduling features like application priorities, intra-queue preemption across applications, and operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.
Speakers
Wangda Tan, Staff Software Engineer, Hortonworks
Billie Rinaldi, Principal Software Engineer I, Hortonworks
This document summarizes Hortonworks' Hadoop distribution called Hortonworks Data Platform (HDP). It discusses how HDP provides a comprehensive data management platform built around Apache Hadoop and YARN. HDP includes tools for storage, processing, security, operations and accessing data through batch, interactive and real-time methods. The document also outlines new capabilities in HDP 2.2 like improved engines for SQL, Spark and streaming and expanded deployment options.
Lessons Learned Running a Container Cloud on Apache Hadoop YARNBillie Rinaldi
The document discusses lessons learned from running a container cloud on Apache Hadoop YARN. Some key points include:
- Running over 5.8 million containers and 1.1 million tests on the container cloud with improved hardware utilization and test speed.
- Challenges involved issues like IP management, Docker storage drivers causing kernel panics, user namespacing limitations, and image management.
- Solutions involved allocating IP addresses, testing storage configurations, running containers as a common user, and managing a private image registry for cleanup.
Lessons learned running a container cloud on YARNDataWorks Summit
Apache Hadoop YARN is the resource and application manager for Apache Hadoop. In the past, YARN only supported launching containers as processes. However, as containerization has become extremely popular, more and more users wanted support for launching Docker containers. With recent changes, YARN now supports running Docker containers alongside process containers. Coupled with the newly added support for long-running services on YARN, this allows a host of new possibilities.
In this talk, we'll present how to run a container cloud on YARN. Leveraging the support in YARN for Docker and long-running services, we can allow users to easily spin up sets of Docker containers for their applications. These containers can be self contained or wired up to form more complex applications. We will go over some of the lessons we learned as part of our experiences handling issues such as resource management, debugging application failures, running Docker, service discovery, etc.
Speaker
Billie Rinaldi, Principal Software Engineer I, Hortonworks
Hadoop operations started on-prem primarily driven by Apache Ambari. However, due to the agility and flexibility of the cloud, it has driven many Hadoop cluster operations to the cloud and to hybrid environments. Cloud is enabling many ephemeral on-demand use cases which is a game-changing opportunity for analytic workloads. But all of this comes with the challenges of running enterprise workloads in the cloud securely and with ease.
Apache Ambari is used by thousands of Hadoop Operators to manage the deployment, lifecycle, and automation of DevOps for Hadoop ecosystem projects. Starting out, Apache Ambari installed a handful of Apache Hadoop ecosystem projects, on a few operating systems, and helped with the most basic Hadoop operational tasks. Today, the product manages over 20 different services, runs on multiple major operating systems and versions, and automates many of the most challenging Hadoop operational tasks in the most secure customer environments.
In this session, we will also take you through Cloudbreak as a solution to simplify provisioning and managing enterprise workloads while providing an open and common experience for deploying workloads across clouds. We will discuss the challenges (and opportunities) to run enterprise workloads in the cloud and will go through a live demo of how the latest from Cloudbreak enables enterprises to easily and securely run Apache Hadoop. This includes deep-dive discussion on Ambari Blueprints, recipes, custom images, and enabling Kerberos -- which are all key capabilities for Enterprise deployments.
As part of this talk, will walk you through what we've learned, the challenges we've overcome, and how the Apache Ambari and Cloudbreak community has changed the product to handle them. The future is fast approaching, and with it comes new on-premise and cloud deployment architectures. See how Apache Ambari and Cloudbreak are being re-imagined to handle these new challenges.
Hadoop operations started on-prem primarily driven by Apache Ambari. However, due to the agility and flexibility of the cloud, it has driven many Hadoop cluster operations to the cloud and to hybrid environments. Cloud is enabling many ephemeral on-demand use cases which is a game-changing opportunity for analytic workloads. But all of this comes with the challenges of running enterprise workloads in the cloud securely and with ease.
Apache Ambari is used by thousands of Hadoop Operators to manage the deployment, lifecycle, and automation of DevOps for Hadoop ecosystem projects. Starting out, Apache Ambari installed a handful of Apache Hadoop ecosystem projects, on a few operating systems, and helped with the most basic Hadoop operational tasks. Today, the product manages over 20 different services, runs on multiple major operating systems and versions, and automates many of the most challenging Hadoop operational tasks in the most secure customer environments.
In this session, we will also take you through Cloudbreak as a solution to simplify provisioning and managing enterprise workloads while providing an open and common experience for deploying workloads across clouds. We will discuss the challenges (and opportunities) to run enterprise workloads in the cloud and will go through a live demo of how the latest from Cloudbreak enables enterprises to easily and securely run Apache Hadoop. This includes deep-dive discussion on Ambari Blueprints, recipes, custom images, and enabling Kerberos -- which are all key capabilities for Enterprise deployments.
As part of this talk, will walk you through what we've learned, the challenges we've overcome, and how the Apache Ambari and Cloudbreak community has changed the product to handle them. The future is fast approaching, and with it comes new on-premise and cloud deployment architectures. See how Apache Ambari and Cloudbreak are being re-imagined to handle these new challenges.
Speaker: Santosh Gowda, Principal Solutions Engineer, Hortonworks
The document discusses Hortonworks' Slider project, which aims to simplify deploying and managing distributed applications on YARN. Slider provides a packaging format for applications, launches application components as YARN containers via an Application Master, and handles service registration and configuration management. It addresses limitations of earlier frameworks by supporting dynamic configurations, embedded usage, and integration with service discovery in Zookeeper.
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
Apache Tez is a library to build data processing engines in Hadoop/YARN. It takes care of many common building blocks like scheduling, fault tolerance, speculation, security etc. so that the engine can focus on its core features. E.g. Apache Hive can focus on SQL optimization. There has been rapid adoption in projects like Hive, Pig, Flink, Cascading, Scalding and commercial products like Datameer and Syncsort. We will provide a brief overview of Tez and then look at new features for job monitoring in the Tez UI and performance debugging tools for Tez applications. Finally we will explore upcoming features like hybrid scheduling that open up new areas of performance and functionality.
Project Savanna automates the deployment of Apache Hadoop on OpenStack. It provisions Hadoop clusters using templates, allows for elastic scaling of nodes, and provides multi-tenancy. The Hortonworks OpenStack plugin uses Ambari to install and manage Hadoop clusters on OpenStack. It demonstrates provisioning a Hadoop cluster with Ambari, installing services, and monitoring the cluster through the Ambari UI. OpenStack provides operational agility while Hadoop is well-suited for its scale-out architecture.
The document discusses Oracle Database Cloud Service, which allows users to quickly create databases using automated provisioning and easily move data and workloads between on-premise and cloud environments. It highlights the unified management capabilities of Enterprise Manager to manage databases across on-premise and cloud environments using the same architecture, software, and skills.
This document discusses building applications on Hadoop and introduces the Kite SDK. It provides an overview of Hadoop and its components like HDFS and MapReduce. It then discusses that while Hadoop is powerful and flexible, it can be complex and low-level, making application development challenging. The Kite SDK aims to address this by providing higher-level APIs and abstractions to simplify common use cases and allow developers to focus on business logic rather than infrastructure details. It includes modules for data, ETL processing with Morphlines, and tools for working with datasets and jobs. The SDK is open source and supports modular adoption.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques followed by light introduction to DL and short discussion what is current state-of-the-art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used are the specific durability requirements of HBase's write-ahead log (WAL) and HDFS providing that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be so trivial to design applications that make most of its use, neither the most simple to operate. As it depends/integrates with other components from Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc) or external systems ( Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables are to be considered when observing anomalies or even outages. Adding to the equation there's also the fact that HBase is still an evolving product, with different release versions being used currently, some of those can carry genuine software bugs. On this presentation, we'll go through the most common HBase issues faced by different organisations, describing identified cause and resolution action over my last 5 years supporting HBase to our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in house projects to support Data Analytics on hundreds of petabytes of data. Our platform support storage, compute, data ingestion, discovery and management and various tools and libraries to help users for both batch and realtime analytics. Our DataPlatform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use cloud as another datacenter. We walk through our evaluation process, challenges we faced supporting data analytics at Twitter scale on cloud and present our current solution. Extending Twitter's Data platform to cloud was complex task which we deep dive in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images describing the possible ways that a retail store of the near future could operate. Identifying various storefront situations by having a deep learning system attached to a camera stream. Such things as; identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerationswebbyacad software
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
Keynote : AI & Future Of Offensive SecurityPriyanka Aash
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.