This document discusses security features in Apache Kafka including SSL for encryption, SASL/Kerberos for authentication, authorization controls using an authorizer, and securing Zookeeper. It provides details on how these security components work, such as how SSL establishes an encrypted channel and SASL performs authentication. The authorizer implementation stores ACLs in Zookeeper and caches them for performance. Securing Zookeeper involves setting ACLs on Zookeeper nodes and migrating security configurations. Future plans include moving more functionality to the broker side and adding new authorization features.
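As a rough sketch of how those pieces fit together on the broker side (hostnames, paths, and passwords below are placeholders, not taken from the slides), a server.properties enabling SSL encryption, Kerberos authentication, and the ZooKeeper-backed ACL authorizer looks roughly like this:

    # server.properties (illustrative values only)
    listeners=SASL_SSL://broker1.example.com:9093
    security.inter.broker.protocol=SASL_SSL
    ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
    ssl.keystore.password=changeit
    ssl.truststore.location=/var/private/ssl/kafka.server.truststore.jks
    ssl.truststore.password=changeit
    sasl.enabled.mechanisms=GSSAPI
    sasl.kerberos.service.name=kafka
    authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer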
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases like online and offline processing more efficiently than alternatives like RabbitMQ. Kafka works by partitioning topics into segments spread across clusters of machines, and replicates across these partitions for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
KSQL and Security: The Current State of Affairs (Victoria Xia, Confluent)
The document discusses securing connections between KSQL and Kafka. It covers enabling encryption using TLS for the KSQL-Kafka connection. It also covers enabling authentication using SASL and authorization using Kafka ACLs. It provides configuration examples for securing each part of the connection and recommends configuring the KSQL output topic name prefix to more easily manage ACLs for output topics.
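As a minimal illustration of that advice (the talk's own examples aren't reproduced here; values and paths are hypothetical), the KSQL server accepts standard Kafka client security settings, and ksql.output.topic.name.prefix gives output topics a common prefix that a single prefixed ACL can cover:

    # ksql-server.properties (illustrative values only)
    security.protocol=SASL_SSL
    sasl.mechanism=GSSAPI
    ssl.truststore.location=/var/private/ssl/ksql.truststore.jks
    ssl.truststore.password=changeit
    ksql.output.topic.name.prefix=ksql-team1-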
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
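To make the "low barrier to entry" concrete, here is a small word-count application in the Streams DSL; it is a generic sketch (topic names are made up), not an example from the talk:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class WordCount {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> lines = builder.stream("text-input");
            // Split lines into words, group by word, keep a continuously updated count.
            KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)
                .count();
            counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

            new KafkaStreams(builder.build(), props).start();
        }
    }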
Developing Real-Time Data Pipelines with Apache Kafka (Joe Stein)
Apache Kafka is a distributed streaming platform that allows for building real-time data pipelines and streaming apps. It provides a publish-subscribe messaging system with persistence that allows for building real-time streaming applications. Producers publish data to topics which are divided into partitions. Consumers subscribe to topics and process the streaming data. The system handles scaling and data distribution to allow for high throughput and fault tolerance.
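A minimal producer in Java shows the publish side of that model (broker address and topic are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DemoProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records with the same key always land in the same partition,
                // which preserves ordering per key.
                producer.send(new ProducerRecord<>("page-views", "user-42", "/checkout"));
            }
        }
    }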
source : http://www.opennaru.com/opensource/kubernetes/
Kubernetes is an open-source container orchestration system for deploying, scaling, and managing containerized applications.
Kubernetes is a platform designed and developed by Google engineers, based on ideas from "Borg", the container cluster management tool Google used internally.
Google turned Kubernetes into an open-source project, drawing on the experience it accumulated over years of developing and operating Borg, the system from which Kubernetes originated.
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It provides a unified, scalable, and durable platform for handling real-time data feeds. Kafka works by accepting streams of records from one or more producers and organizing them into topics. It allows both storing and forwarding of these streams to consumers. Producers write data to topics which are replicated across clusters for fault tolerance. Consumers can then read the data from the topics in the order it was produced. Major companies like LinkedIn, Yahoo, Twitter, and Netflix use Kafka for applications like metrics, logging, stream processing and more.
Kafka Streams: What it is, and how to use it? (confluent)
Kafka Streams is a client library for building distributed applications that process streaming data stored in Apache Kafka. It provides a high-level streams DSL that allows developers to express streaming applications as set of processing steps. Alternatively, developers can use the lower-level processor API to implement custom business logic. Kafka Streams handles tasks like fault-tolerance, scalability and state management. It represents data as streams for unbounded data or tables for bounded state. Common operations include transformations, aggregations, joins and table operations.
Apache Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming apps. It was developed by LinkedIn in 2011 to solve problems with data integration and processing. Kafka uses a publish-subscribe messaging model and is designed to be fast, scalable, and durable. It allows both streaming and storage of data and acts as a central data backbone for large organizations.
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency (Henning Jacobs)
Kubernetes has the concept of resource requests and limits. Pods get scheduled on the nodes based on their requests and optionally limited in how much of the resource they can consume. Understanding and optimizing resource requests/limits is crucial both for reducing resource "slack" and ensuring application performance/low-latency. This talk shows our approach to monitoring and optimizing Kubernetes resources for 80+ clusters to achieve cost-efficiency and reducing impact for latency-critical applications. All shown tools are Open Source and can be applied to most Kubernetes deployments.
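For reference (not from the talk), requests and limits are declared per container in the pod spec; the scheduler places the pod by its requests, while limits cap actual consumption:

    # Illustrative pod spec; image name is hypothetical
    apiVersion: v1
    kind: Pod
    metadata:
      name: demo
    spec:
      containers:
      - name: app
        image: example/app:1.0
        resources:
          requests:
            cpu: 100m        # used for scheduling decisions
            memory: 128Mi
          limits:
            cpu: 500m        # throttled above this
            memory: 256Mi    # OOM-killed above this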
This document discusses security features in Apache Kafka including SSL, SASL authentication using Kerberos or plaintext, and authorization controls. It provides an overview of how SSL and SASL authentication work in Kafka as well as how the Kafka authorizer controls access at a fine-grained level through ACLs defined on topics, operations, users and hosts. It also briefly mentions securing Zookeeper which stores Kafka metadata and ACLs.
This document provides an introduction to Apache Kafka. It describes Kafka as a distributed messaging system with features like durability, scalability, publish-subscribe capabilities, and ordering. It discusses key Kafka concepts like producers, consumers, topics, partitions and brokers. It also summarizes use cases for Kafka and how to implement producers and consumers in code. Finally, it briefly outlines related tools like Kafka Connect and Kafka Streams that build upon the Kafka platform.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This document provides an overview of Apache Kafka. It begins with defining Kafka as a distributed streaming platform and messaging system. It then lists the agenda which includes what Kafka is, why it is used, common use cases, major companies that use it, how it achieves high performance, and core concepts. Core concepts explained include topics, partitions, brokers, replication, leaders, and producers and consumers. The document also provides examples to illustrate these concepts.
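The consume side of those concepts looks like this in Java; consumers that share a group.id divide the topic's partitions among themselves (names below are placeholders):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DemoConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "demo-group");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("page-views"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records)
                        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }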
Vault is a tool for securely accessing secrets. It encrypts and stores secrets and enforces strict access controls. Secrets have a limited lifetime and must be renewed. Vault supports dynamic secret generation, revocation of access, and audit logging. It uses Shamir's secret sharing algorithm to split the master key into shares, so that a quorum of key holders is required to unseal the server.
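The Shamir split is visible right at server initialization; for example (illustrative share counts):

    # Split the master key into 5 shares; any 3 of them can unseal the server
    vault operator init -key-shares=5 -key-threshold=3
    vault operator unseal        # run three times, each with a different share
    vault kv put secret/db password=s3cr3t
    vault kv get secret/db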
Building Cloud-Native App Series - Part 2 of 11
Microservices Architecture Series
Event Sourcing & CQRS,
Kafka, RabbitMQ
Case Studies (E-Commerce App, Movie Streaming, Ticket Booking, Restaurant, Hospital Management)
Common Patterns of Multi Data-Center Architectures with Apache Kafka (confluent)
Whether you know you want to run Apache Kafka in multiple data centers and need practical advice or you are wondering why some organizations even need more than one cluster, this online talk is for you.
In this short session, we’ll discuss the basic patterns of multi-datacenter Kafka architectures, explore some of the use-cases enabled by each architecture and show how Confluent Enterprise products make these patterns easy to implement.
Visit www.confluent.io for more information.
Watch this talk here: https://www.confluent.io/online-talks/how-apache-kafka-works-on-demand
Pick up best practices for developing applications that use Apache Kafka, beginning with a high level code overview for a basic producer and consumer. From there we’ll cover strategies for building powerful stream processing applications, including high availability through replication, data retention policies, producer design and producer guarantees.
We’ll delve into the details of delivery guarantees, including exactly-once semantics, partition strategies and consumer group rebalances. The talk will finish with a discussion of compacted topics, troubleshooting strategies and a security overview.
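For orientation, the delivery guarantees discussed there map onto a handful of producer settings and the transactions API; a minimal sketch (topic and id names invented):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ExactlyOnceProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("enable.idempotence", "true");      // implies acks=all with retries
            props.put("transactional.id", "payments-1");  // stable id survives restarts
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("payments", "order-1", "debit"));
                producer.send(new ProducerRecord<>("ledger", "order-1", "credit"));
                producer.commitTransaction();  // both records become visible atomically
            } catch (Exception e) {
                producer.abortTransaction();
            } finally {
                producer.close();
            }
        }
    }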
This session is part 3 of 4 in our Fundamentals for Apache Kafka series.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka (Guozhang Wang)
To manage the ever-increasing volume and velocity of data within your company, you have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center, powered by Apache Kafka. But what if one data center is not enough? I will describe building resilient data pipelines with Apache Kafka that span multiple data centers and points of presence, and provide an overview of best practices and common patterns while covering key areas such as architecture guidelines, data replication, and mirroring as well as disaster scenarios and failure handling.
The many benefits of a RESTful architecture have made it the standard way to design web-based APIs. For example, the principles of REST state that we should leverage standard HTTP verbs, which helps to keep our APIs simple. Server components that are considered RESTful should be stateless, which helps to ensure that they can easily scale. We can leverage caching to gain further performance and scalability benefits.
However, the best practices of REST and security often seem to clash. How should a user be authenticated in a stateless application? How can a secured resource also support caching? Securing RESTful endpoints is further complicated by the fact that security best practices evolve so rapidly.
In this talk Rob will discuss how to properly secure your RESTful endpoints. Along the way we will explore some common pitfalls when applying security to RESTful APIs. Finally, we will see how the new features in Spring Security can greatly simplify securing your RESTful APIs.
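As one sketch of the stateless pattern the talk addresses (a generic example using the current Spring Security DSL, not code from the session): require authentication on every request and create no HTTP session, so each request carries its own credentials:

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.security.config.Customizer;
    import org.springframework.security.config.annotation.web.builders.HttpSecurity;
    import org.springframework.security.config.http.SessionCreationPolicy;
    import org.springframework.security.web.SecurityFilterChain;

    @Configuration
    public class ApiSecurityConfig {
        @Bean
        SecurityFilterChain apiSecurity(HttpSecurity http) throws Exception {
            http
                .csrf(csrf -> csrf.disable()) // no session cookie to protect
                .sessionManagement(s -> s.sessionCreationPolicy(SessionCreationPolicy.STATELESS))
                .authorizeHttpRequests(a -> a.anyRequest().authenticated())
                .httpBasic(Customizer.withDefaults());
            return http.build();
        }
    }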
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam... (JAXLondon2014)
This document discusses how Brandwatch uses Apache Kafka and Zookeeper to distribute data processing workloads across multiple Java processes. It describes how Kafka is used to stream social media mentions from crawlers to a processing cluster. Individual processes then use Zookeeper for leader election to coordinate tracking different metrics for queries in a distributed manner.
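The summary doesn't show Brandwatch's actual code; a common JVM recipe for ZooKeeper leader election is Apache Curator's LeaderLatch, sketched here with invented paths and ids:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class MetricWorker {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();
            // Every candidate process creates a latch on the same path;
            // ZooKeeper elects exactly one of them leader.
            try (LeaderLatch latch = new LeaderLatch(client, "/app/metrics/leader", "worker-1")) {
                latch.start();
                latch.await(); // blocks until this process wins the election
                if (latch.hasLeadership()) {
                    // run the singleton workload here
                }
            }
        }
    }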
The document discusses Spring Web Flow, which is an extension of the Spring MVC framework focused on defining and executing page flows. It describes how Spring Web Flow integrates with Spring MVC and other technologies and provides configuration options for defining flows, states, actions, and transitions. It also covers view rendering, data binding, validation, and other features involved in executing flows.
Consumer Driven Contracts and Your Microservice Architecture @ Warsaw JUG (Marcin Grzejszczak)
This document discusses using Spring Cloud Contract to help define contracts between microservices. It notes that without contracts, tests between services can break when implementations change. The document demonstrates coding a sample app to check ages for alcohol purchases, then generating contract stubs using Spring Cloud Contract. It addresses questions like storing contracts in a shared repo, generating stubs without writing scripts using WireMock and RestDocs, and learning more about Spring Cloud Contract.
This document summarizes a presentation about scheduling policies in YARN. It discusses existing scheduling in YARN, adding new resource types and resource profiles, resource scheduling for services like affinity and anti-affinity, and a proposed new GUTS API to provide a unified approach for specifying resource requests and constraints. The new API aims to simplify expressing complex scheduling requirements and relationships between placements in a single request.
This document provides an overview of the Spring Framework, including its core modules, advantages, and requirements for usage. It discusses the Spring runtime environment and modules for core container functionality, data access, web functionality, testing, and aspects/instrumentation. It also covers configuration through Maven dependencies, Java classes, XML files, and web.xml. Finally, it introduces Spring Security modules, the interaction flow, and configurations for security including the web.xml, password encoding, CSRF protection, Spring XML, and authentication providers.
- The document summarizes the state of Apache HBase, including recent releases, compatibility between versions, and new developments.
- Key releases include HBase 1.1, 1.2, and 1.3, which added features like async RPC client, scan improvements, and date-tiered compaction. HBase 2.0 is targeting compatibility improvements and major changes to data layout and assignment.
- New developments include date-tiered compaction for time series data, Spark integration, and ongoing work on async operations, replication 2.0, and reducing garbage collection overhead.
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch... (confluent)
Many companies are adopting Apache Kafka to power their data pipelines, including LinkedIn, Netflix, and Airbnb. Kafka’s ability to handle high throughput real-time data makes it a perfect fit for solving the data integration problem, acting as the common buffer for all your data and bridging the gap between streaming and batch systems.
However, building a data pipeline around Kafka today can be challenging because it requires combining a wide variety of tools to collect data from disparate data systems. One tool streams updates from your database to Kafka, another imports logs, and yet another exports to HDFS. As a result, building a data pipeline can take significant engineering effort and has high operational overhead, because all these different tools require ongoing monitoring and maintenance. Additionally, some of the tools are simply a poor fit for the job: the fragmented nature of the data integration tools ecosystem leads to creative but misguided solutions such as misusing stream processing frameworks for data integration purposes.
We describe the design and implementation of Kafka Connect, Kafka’s new tool for scalable, fault-tolerant data import and export. First we’ll discuss some existing tools in the space and why they fall short when applied to data integration at large scale. Next, we will explore Kafka Connect’s design and how it compares to systems with similar goals, discussing key design decisions that trade off between ease of use for connector developers, operational complexity, and reuse of existing connectors. Finally, we’ll discuss how standardizing on Kafka Connect can ultimately lead to simplifying your entire data pipeline, making ETL into your data warehouse and enabling stream processing applications as simple as adding another Kafka connector.
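To show the shape of a connector, here is the classic file-source quickstart configuration that ships with Kafka (file path and topic are illustrative); Connect handles offsets, scaling, and fault tolerance around it:

    # file-source.properties
    name=local-file-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/tmp/access.log
    topic=connect-file-demo

    # run it with: bin/connect-standalone.sh config/connect-standalone.properties file-source.properties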
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming (Guozhang Wang)
Spark Streaming makes it easy to build scalable, robust stream processing applications — but only once you’ve made your data accessible to the framework. Spark Streaming solves the realtime data processing problem, but to build a large-scale data pipeline we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier.
Rajat Venkatesh from Qubole presented on Quark, a virtualization engine for analytics. Quark uses a multi-store architecture to optimize queries using materialized views, predicate injection, and denormalized/sorted tables. It supports multiple SQL and storage engines. The roadmap includes improvements to the cost-based optimizer, support for OLAP cubes, and developing Quark as a service. Coordinates for the Quark GitHub and mailing list were provided.
Presto is an open source distributed SQL query engine that allows interactive analysis of data across multiple data stores. At Facebook, Presto is used for ad-hoc queries of their Hadoop data warehouse, which processes trillions of rows and scans petabytes of data daily. Presto's low latency also makes it suitable for powering analytics in user-facing products. New features of Presto include improved SQL support, performance optimizations, and connectors to additional data sources like Redis and MongoDB.
The document discusses best practices for operating and supporting Apache HBase. It outlines tools like the HBase UI and HBCK that can be used to debug issues. The top categories of issues covered are region server stability problems, read/write performance, and inconsistencies. SmartSense is introduced as a tool that can help detect configuration issues proactively.
Intro to Spark with Zeppelin Crash Course, Hadoop Summit SJ (Daniel Madrigal)
The document provides an introduction to Apache Spark and related technologies. It discusses the Spark ecosystem including Spark Core, Spark SQL, Spark Streaming and MLlib. It also covers Resilient Distributed Datasets (RDDs), DataFrames, Spark SQL optimizations using Catalyst, and using Spark on a YARN cluster. The document is intended to provide a hands-on intro to Spark and related tools in the Hortonworks Data Platform sandbox environment.
A stream processing platform is not an island unto itself; it must be connected to all of your existing data systems, applications, and sources. In this talk we will provide different options for integrating systems and applications with Apache Kafka, with a focus on the Kafka Connect framework and the ecosystem of Kafka connectors. We will discuss the intended use cases for Kafka Connect and share our experience and best practices for building large-scale data pipelines using Apache Kafka.
This document discusses YARN federation, which allows multiple YARN clusters to be connected together. It summarizes:
- YARN is used at Microsoft for resource management but faces challenges of large scale and diverse workloads. Federation aims to address this.
- The federation architecture connects multiple independent YARN clusters through centralized services for routing, policies, and state. Applications are unaware and can seamlessly run across clusters.
- Federation policies determine how work is routed and scheduled across clusters, balancing objectives like load balancing, scaling, fairness, and isolation. A spectrum of policy options is discussed from full partitioning to full replication to dynamic partial replication.
- A demo is presented showing a job running across
Ranger provides centralized authorization policies for Hadoop resources. A new Ranger Hive Metastore security agent was introduced to address gaps in authorizing Hive CLI and synchronizing policies with object changes. It enforces authorization for Hive CLI, synchronizes access control lists when objects are altered via DDL in HiveServer2, and provides consistent authorization between Hive and HDFS resources. The agent has been implemented through custom classes extending Hive authorization hooks. A demo showed the new capabilities working with Ranger, Hive, and Hadoop.
Marcin Kleczynski founded Malwarebytes in 2004 after getting infected by malware as a teenager and wanting to help others. He taught himself coding and created free anti-malware software, which later became a paid version launched with a business partner. Malwarebytes has since grown significantly with over 500 million downloads worldwide and offices across several countries. It aims to create a malware-free world for all.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
Kafka security includes SSL for wire encryption, SASL (Kerberos) for authentication, and authorization controls. SSL uses certificates for encryption during network communication. SASL performs authentication using Kerberos credentials. Authorization is provided by pluggable authorizers that define access control lists controlling permissions for principals to perform operations on resources and hosts. Securing Zookeeper with ACLs and SASL is also important as Kafka stores metadata there.
Kafka 2018 - Securing Kafka the Right Way (Saylor Twift)
How to evaluate, implement and maintain Kafka Message Broker in a high-throughput production environment.
Protecting your data at rest with Apache Kafka by Confluent and Vormetric (confluent)
This document discusses securing Apache Kafka deployments with Vormetric and Confluent Platform. It begins with an introduction to Apache Kafka and Confluent Platform. It then provides an overview of Vormetric's policy-driven security solution and how it can be used to encrypt Kafka data at rest. The document outlines the typical Confluent Platform deployment architecture and various security considerations, such as authentication, authorization, and data encryption. Finally, it provides steps for implementing secure deployments using SSL, Kerberos, and Vormetric encryption policies.
This document discusses securing Spark applications. It covers encryption to protect data in transit and at rest, authentication using Kerberos to identify users, and authorization for access control through tools like Sentry and a proposed RecordService. While Spark can be secured today by leveraging Hadoop security, continued work is needed for easier encryption, improved Kerberos support for long-running jobs, and row/column-level authorization beyond file permissions.
This document discusses Kafka security and provides tips for implementing it. It covers the three main aspects of Kafka security: encryption, authentication, and authorization. For encryption, it explains how to set up SSL and discusses options for end-to-end encryption. Authentication details how to use SSL client authentication or SASL mechanisms like Kerberos or PLAIN. Authorization explains managing access control lists (ACLs) stored in Zookeeper to control access. The document concludes by emphasizing the challenges of securing Kafka clients and provides advice like creating standardized client wrappers and Docker images.
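As a taste of ACL management against a ZooKeeper-backed authorizer (principal, topic, and group names invented):

    # Grant a principal read access to one topic and one consumer group
    bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
      --add --allow-principal User:app1 \
      --operation Read --topic payments --group app1-group

    # Inspect what is currently allowed on the topic
    bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
      --list --topic payments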
DockerCon Live 2020 - Securing Your Containerized Application with NGINX (Kevin Jones)
NGINX is one of the most popular images on Docker Hub and has been at the forefront of the web since the early 2000s. In this talk we will discuss how and why NGINX's lightweight and powerful architecture makes it a very popular choice for securing containerized applications as a sidecar reverse proxy within containers. We will highlight important aspects of application security that NGINX can help with, such as TLS, HTTP, AuthN, AuthZ and traffic control.
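A bare-bones version of that sidecar pattern (certificate paths and upstream port are placeholders): NGINX terminates TLS and proxies plain HTTP to the application listening on localhost:

    server {
        listen 443 ssl;
        ssl_certificate     /etc/nginx/certs/tls.crt;
        ssl_certificate_key /etc/nginx/certs/tls.key;

        location / {
            proxy_pass http://127.0.0.1:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }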
Hadoop REST API Security with Apache Knox Gateway (DataWorks Summit)
The document discusses the Apache Knox Gateway, which is an extensible reverse proxy framework that securely exposes REST APIs and HTTP-based services from Hadoop clusters. It provides features such as support for common Hadoop services, integration with enterprise authentication systems, centralized auditing of REST API access, and service-level authorization controls. The Knox Gateway aims to simplify access to Hadoop services, enhance security by protecting network details and supporting partial SSL, and enable centralized management and control over REST API access.
Securing Hadoop's REST APIs with Apache Knox Gateway, Hadoop Summit June 6th, ... (Kevin Minder)
The Apache Knox Gateway is an extensible reverse proxy framework for securely exposing REST APIs and HTTP-based services at a perimeter. It provides out of the box support for several common Hadoop services, integration with enterprise authentication systems, and other useful features. Knox is not an alternative to Kerberos for core Hadoop authentication or a channel for high-volume data ingest/export. It has graduated from the Apache incubator and is included in Hortonworks Data Platform releases to simplify access, provide centralized control, and enable enterprise integration of Hadoop services.
This document provides an overview of Apache Hadoop security, both historically and what is currently available and planned for the future. It discusses how Hadoop security is different due to benefits like combining previously siloed data and tools. The four areas of enterprise security - perimeter, access, visibility, and data protection - are reviewed. Specific security capabilities like Kerberos authentication, Apache Sentry role-based access control, Cloudera Navigator auditing and encryption, and HDFS encryption are summarized. Planned future enhancements are also mentioned like attribute-based access controls and improved encryption capabilities.
Exploring Advanced Authentication Methods in Novell Access ManagerNovell
Novell Access Manager provides many different levels of authentication beyond a simple user name and password. In this session, you will learn about its more advanced methods of authentication—from emerging standards like OpenID and CardSpace to tokens and certificates. Attendees will also see a demonstration of FreeRADIUS and the Vasco Digipass with Novell eDirectory, the Vasco NMAS method and an Access Manager plug-in that provides SSO to Web applications that expect a static password.
The document discusses new features and enhancements in MySQL Connector/J including new connection properties, security improvements with SSL and pluggable authentication, and performance considerations when using prepared statements. It provides details on configuration options for SSL encryption and cipher suites as well as the pluggable authentication support in MySQL 5.5.
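For instance, a connection that requires TLS and verifies the server certificate can be configured through standard Connector/J properties (host, credentials, and truststore path are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class SecureJdbc {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("user", "app");
            props.setProperty("password", "secret");
            props.setProperty("useSSL", "true");                   // request TLS
            props.setProperty("requireSSL", "true");               // refuse plaintext fallback
            props.setProperty("verifyServerCertificate", "true");  // authenticate the server
            props.setProperty("trustCertificateKeyStoreUrl", "file:/etc/mysql/truststore.jks");
            props.setProperty("trustCertificateKeyStorePassword", "changeit");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://db.example.com:3306/app", props)) {
                System.out.println("connected over TLS: " + conn.isValid(5));
            }
        }
    }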
This document provides an overview of Amazon Workspaces, a desktop-as-a-service offering that allows customers to provision cloud-based virtual desktops for remote users. It discusses how Workspaces provides a consistent desktop environment across multiple availability zones, the networking architecture which involves authentication and streaming gateways, and how customers can integrate Workspaces with their existing Active Directory. The document also covers provisioning, authentication, application deployment, monitoring, backups and common issues when using Amazon Workspaces.
The document discusses AWS security best practices and common mistakes made when using AWS. It provides examples of real security incidents that occurred due to misconfigurations or lack of security controls. The presentation covers topics like identity and access management, network access control, logging and monitoring, compliance frameworks, and security tools that can be used to harden AWS environments. It also describes advanced VPC networking techniques and the DoD security technical implementation guide (STIG) compliance process.
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key Trustee (Cloudera, Inc.)
This document provides an overview of Cloudera's Navigator Key Trustee, which is a key management server that acts as a proxy between CDH components and an external key store. It discusses how Key Trustee uses encryption zone keys stored in an external hardware security module to encrypt data encryption keys, which are then used to encrypt data at rest in HDFS. The document also covers Key Trustee's architecture, deployment considerations, access control lists, and troubleshooting steps.
This document discusses securing Kafka at PayPal, which processes 500 billion messages per day. It covers:
1. Enabling TLS on Kafka brokers using self-signed certificates to encrypt communication.
2. Implementing mutual TLS authentication between brokers and clients by generating and distributing keystores and truststores.
3. Authenticating clients using SASL and OAuthBearer, with custom callback handlers to retrieve credentials from an internal key management service.
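The keystore/truststore steps above typically boil down to a few keytool commands; this is the generic pattern from the Kafka security docs, with invented aliases and filenames:

    # Create a broker keystore with a self-signed certificate
    keytool -keystore kafka.server.keystore.jks -alias broker \
            -genkeypair -keyalg RSA -validity 365

    # Export the broker certificate and import it into the clients' truststore
    keytool -keystore kafka.server.keystore.jks -alias broker -exportcert -file broker.crt
    keytool -keystore kafka.client.truststore.jks -alias broker -importcert -file broker.crt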
Team Collaboration in Kafka Clusters With Maria Berinde-Tampanariu | Current 2022 (HostedbyConfluent)
When different teams start to use the same Kafka clusters, it opens up opportunities and challenges. During this talk, we will look at different architectures and team structures to explore ways in which to set up authorization in a granular and maintainable way for real-world users, as well as for producing or consuming clients.
What are the options offered by the Kafka built-in Authorizer, how can the Authorizer be customized and how are integrations with external systems built in order to provide group or role-based access control? Confluent Cloud and Confluent Platform provide predefined roles as part of the Role-based Access Control (RBAC) feature. We will look at the permissions included in these role bindings, the scope on which they can be used, and the components for which they are available. Role-based Access Control and Access Control Lists can be used together - let’s explore the options, best practices, and order of precedence.
We will put the capabilities into action by looking at the practices used by an imaginary company where the central Platform Team provisions clusters for its internal customers and provides access for teams to self-manage their domains. What’s the best approach to grant access to team members to their team’s resources and what needs to happen when one team collaborates with another team? What happens when a team member works temporarily on two teams?
We will close the session by looking at the ability to use the authorization mechanisms in conjunction with different authentication options and at the automation options to make the actions predictable and repeatable.
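One building block for that kind of self-service is a prefixed ACL, which covers every topic under a team's namespace in a single grant; a sketch with the Kafka Admin API (principal and prefix are invented):

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.common.acl.AccessControlEntry;
    import org.apache.kafka.common.acl.AclBinding;
    import org.apache.kafka.common.acl.AclOperation;
    import org.apache.kafka.common.acl.AclPermissionType;
    import org.apache.kafka.common.resource.PatternType;
    import org.apache.kafka.common.resource.ResourcePattern;
    import org.apache.kafka.common.resource.ResourceType;

    public class GrantTeamAccess {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // Allow the team's service principal to read every topic named team-a.*
                AclBinding binding = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "team-a.", PatternType.PREFIXED),
                    new AccessControlEntry("User:team-a-svc", "*",
                            AclOperation.READ, AclPermissionType.ALLOW));
                admin.createAcls(List.of(binding)).all().get();
            }
        }
    }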
Introducing new features in Confluent Platform 5.4 and Apache Kafka 2.4...
CP 5.4 (based on AK 2.4)
- Security: Role-Based Access Control (RBAC), Structured Audit Logs
- Resilience: Multi-Region Clusters (MRC)
- Data Compatibility: Server-side Schema Validation
- Management & Monitoring: Control Center enhancements, RBAC management, Replicator monitoring
- Performance & Elasticity: Tiered Storage (preview)
- Stream Processing: New ksqlDB features like Pull Queries and Kafka Connect Integration (preview)
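As a small illustration of the server-side schema validation item (a Confluent Platform topic configuration; names are illustrative and this is not taken from the release notes):

    kafka-topics --bootstrap-server localhost:9092 --create --topic orders \
      --partitions 6 --replication-factor 3 \
      --config confluent.value.schema.validation=true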
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
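In practice much of that surfaces as submit-time configuration; an illustrative launch for a kerberized YARN cluster (principal, keytab, and class names are placeholders):

    spark-submit \
      --master yarn \
      --principal etl@EXAMPLE.COM \
      --keytab /etc/security/keytabs/etl.keytab \
      --conf spark.authenticate=true \
      --conf spark.network.crypto.enabled=true \
      --conf spark.io.encryption.enabled=true \
      --class com.example.EtlJob etl.jar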
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
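A skeletal version of such a pipeline in Spark's ML API (input path and column names are hypothetical):

    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.classification.LogisticRegression;
    import org.apache.spark.ml.feature.HashingTF;
    import org.apache.spark.ml.feature.Tokenizer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TextPipeline {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("text-mining").getOrCreate();
            Dataset<Row> training = spark.read().parquet("/data/labeled_docs");

            // tokenize -> hash terms into feature vectors -> fit a classifier
            Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
            HashingTF tf = new HashingTF().setInputCol("words").setOutputCol("features");
            LogisticRegression lr = new LogisticRegression().setLabelCol("label");

            Pipeline pipeline = new Pipeline()
                    .setStages(new PipelineStage[]{tokenizer, tf, lr});
            PipelineModel model = pipeline.fit(training);
            model.transform(training).select("text", "prediction").show(5);
        }
    }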
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations are currently processing various types of data in different formats. Most often this data will be in free form. As the number of consumers of this data grows, it is imperative that this free-flowing data adheres to a schema. It helps data consumers have an expectation about the type of data they are getting, and they can avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline a really easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume that schema without any impact if the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, etc.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
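The registry API surface is small: register a schema under a subject and get back an id. The talk centers on the Hortonworks registry; purely as an illustration of the pattern, here is the equivalent call with the Confluent Schema Registry Java client (URL and subject are invented):

    import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
    import org.apache.avro.Schema;

    public class RegisterSchema {
        public static void main(String[] args) throws Exception {
            CachedSchemaRegistryClient client =
                    new CachedSchemaRegistryClient("http://localhost:8081", 100);
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":"
                    + "[{\"name\":\"url\",\"type\":\"string\"}]}");
            // Registering an identical schema again returns the existing id
            int id = client.register("page-views-value", schema);
            System.out.println("registered schema id = " + id);
        }
    }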
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
DeepLearning is not just hype - it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how DeepLearning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different BigData engines like ApacheSpark and ApacheFlink. Key in this talk is the absence of any large training corpus, since we are using unsupervised machine learning - a domain that current DL research treats step-motherly. As we can see in this demo, LSTM networks can learn very complex system behavior - in this case data coming from a physical model simulating bearing vibration data. One drawback of DeepLearning is that normally a very large labeled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with DeepLearning - no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open sourced. Only open source components are used.
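A rough DeepLearning4J sketch of that idea, under the assumption (not stated in the abstract) of a reconstruction-style setup: the LSTM learns to predict the sensor signal, and a large prediction error at runtime flags an anomaly. Layer sizes and channel counts are invented:

    import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.layers.LSTM;
    import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.learning.config.Adam;
    import org.nd4j.linalg.lossfunctions.LossFunctions;

    public class VibrationModel {
        public static void main(String[] args) {
            int nChannels = 3; // hypothetical number of vibration channels
            MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                    .updater(new Adam(1e-3))
                    .list()
                    .layer(0, new LSTM.Builder().nIn(nChannels).nOut(32)
                            .activation(Activation.TANH).build())
                    .layer(1, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MSE)
                            .nIn(32).nOut(nChannels)
                            .activation(Activation.IDENTITY).build())
                    .build();
            MultiLayerNetwork net = new MultiLayerNetwork(conf);
            net.init();
            // net.fit(...) is then driven by the streaming engine (Spark or Flink)
        }
    }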
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases that cut across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, false positives for actual defects are more frequent, and are generally wasteful.
At Hortonworks, we’ve designed and implemented an Automated Log Analysis System - Mool, using statistical data science and ML. Currently a work in progress, it has a batch data pipeline followed by an ensemble ML pipeline that feeds into the recommendation engine. The system identifies the root cause of test failures by correlating failing test cases with current and historical error records across multiple components. The system works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file new or reopen past tickets, and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (Not every problem is eligible for Big Data and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
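To make the "spread load evenly" point concrete, one widely used trick is salting the row key so monotonically increasing timestamps don't hot-spot a single region; a sketch with invented table and column names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedWrite {
        // A one-byte salt derived from the entity id spreads writes over
        // up to 16 key ranges while keeping each entity's rows contiguous.
        static byte[] saltedKey(String sensorId, long ts) {
            byte salt = (byte) ((sensorId.hashCode() & 0x7fffffff) % 16);
            return Bytes.add(new byte[]{salt}, Bytes.toBytes(sensorId), Bytes.toBytes(ts));
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("metrics"))) {
                Put put = new Put(saltedKey("sensor-7", System.currentTimeMillis()));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(42.0));
                table.put(put);
            }
        }
    }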
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, discussing some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, to finally show you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee a continuous operation of Hadoop cluster based solutions.
Retrieval Augmented Generation Evaluation with Ragas (Zilliz)
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an... (Zilliz)
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Generative AI technology is a fascinating field that focuses on creating comp... (Nohoax Kanont)
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
Keynote: AI & Future Of Offensive Security (Priyanka Aash)
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
It's your unstructured data: How to get your GenAI app to production (and spe... (Zilliz)
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Increase Quality with User Access Policies - July 2024 (Peter Caitens)
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Keynote: Presentation on SASE Technology (Priyanka Aash)
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
The Challenge of Interpretability in Generative AI Models (Sara Kroft)
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.