The document discusses using Flume and Solr to index log data. It begins with an introduction and example of using Flume to index syslog data into Solr. It then covers aspects like high availability, data routing, and schema design. The document also provides a step-by-step example of transforming a syslog message into a Solr document using Morphlines.
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...DataWorks Summit
This document discusses using Apache Flume to stream data from various sources to Hadoop for telecommunications operators. It introduces Flume, describing its key components like agents, sources, channels, and sinks. It provides an end-to-end architecture example showing data flowing from external sources through Flume into Hadoop and then into an EDW for analysis and user reports. Finally, it discusses next generation architectures using technologies like Spark, machine learning, and real-time analytics.
Stephan Ewen - Running Flink EverywhereFlink Forward
http://flink-forward.org/kb_sessions/running-apache-flink-everywhere-standalone-yarn-mesos-docker-kubernetes-etc/
The world of cluster managers and deployment frameworks is getting complicated. There is zoo of tools to deploy and manage data processing jobs, all of which have different resource management and fault tolerance slightly different. Some tools have a only per-job processes (Yarn, Docker/Kubernetes), while others require some long running processes (Mesos, Standalone). In some frameworks, streaming jobs control their own resource allocation (Yarn, Mesos), while for other frameworks, resource management is handled by external tools (Kubernetes). To be broadly usable in a variety of setups, Flink needs to play well with all these frameworks and their paradigms. This talk describes Flink’s new proposed process and deployment model that will make it work together well with the above mentioned frameworks. The new abstraction is designed to cover a variety of use cases, like isolated single job deployments, sessions of multiple short jobs, and multi-tenant setups.
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEkawamuray
This document summarizes a presentation about Kafka multi-tenancy at LINE Corporation. The presentation discusses how LINE runs a single shared Kafka cluster to handle over 100 billion messages per day from many independent services. It describes the hardware used, requirements for multi-tenancy like protecting against abusive workloads and providing isolation. It then discusses specific issues identified like slow response times caused by disk reads, and the solutions implemented like request quotas, metrics, and pre-loading data into memory to avoid blocking. The presentation concludes that after addressing these issues, their shared Kafka hosting model works efficiently while maintaining a single data hub.
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
We discuss the current state of LLAP (Live Long and Process) – the concurrent sub-second execution of analytical queries engine for Hive 2.0. LLAP is a hybrid execution model that enables performance improvement in and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching and multi-threaded processing. LLAP features robust machine and service failure tolerance achieved by building on top of the time-tested fault tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries, and enabling the system to preempt tasks of lower priority without failing any query in-flight. The talk also aims to cover the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, as well as future work, including how LLAP fits into a unified secure DataFrame access layer.
Cloud deployments of Apache Hadoop are becoming more commonplace. Yet Hadoop and it's applications don't integrate that well —something which starts right down at the file IO operations. This talk looks at how to make use of cloud object stores in Hadoop applications, including Hive and Spark. It will go from the foundational "what's an object store?" to the practical "what should I avoid" and the timely "what's new in Hadoop?" — the latter covering the improved S3 support in Hadoop 2.8+. I'll explore the details of benchmarking and improving object store IO in Hive and Spark, showing what developers can do in order to gain performance improvements in their own code —and equally, what they must avoid. Finally, I'll look at ongoing work, especially "S3Guard" and what its fast and consistent file metadata operations promise.
Apache Flume is a system for collecting streaming data from various sources and transporting it to destinations such as HDFS or HBase. It has a configurable architecture that allows data to flow from clients to sinks via channels. Sources produce events that are sent to channels, which then deliver the events to sinks. Flume agents run sources and sinks in a configurable topology to provide reliable data transport.
The document discusses designing robust data architectures for decision making. It advocates for building architectures that can easily add new data sources, improve and expand analytics, standardize metadata and storage for easy data access, discover and recover from mistakes. The key aspects discussed are using Kafka as a data bus to decouple pipelines, retaining all data for recovery and experimentation, treating the filesystem as a database by storing intermediate data, leveraging Spark and Spark Streaming for batch and stream processing, and maintaining schemas for integration and evolution of the system.
Spark is a fast and general engine for large-scale data processing. It improves on MapReduce by allowing iterative algorithms through in-memory caching and by supporting interactive queries. Spark features include in-memory caching, general execution graphs, APIs in multiple languages, and integration with Hadoop. It is faster than MapReduce, supports iterative algorithms needed for machine learning, and enables interactive data analysis through its flexible execution model.
Multitenancy: Kafka clusters for everyone at LINEkawamuray
Yuto Kawamura from LINE Corporation presented on their use of Apache Kafka clusters to provide multitenancy for different internal teams. They face challenges in ensuring isolation between client workloads and preventing abusive clients. Their solutions include request quotas to limit client resource usage, slow logs to identify slow requests, and changes to the broker code to pre-warm caches and minimize the impact of disk reads during message fetching. With these approaches, they are able to reliably operate shared Kafka clusters with high throughput and multiple tenants.
Apache Kafka is a distributed publish-subscribe messaging system that was originally created by LinkedIn and contributed to the Apache Software Foundation. It is written in Scala and provides a multi-language API to publish and consume streams of records. Kafka is useful for both log aggregation and real-time messaging due to its high performance, scalability, and ability to serve as both a distributed messaging system and log storage system with a single unified architecture. To use Kafka, one runs Zookeeper for coordination, Kafka brokers to form a cluster, and then publishes and consumes messages with a producer API and consumer API.
Hadoop & cloud storage object store integration in production (final)Chris Nauroth
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, recent emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Store, GCS, which are designed for cost-efficiency, scalability and geographic distribution. Hadoop supports pluggable file system implementations to enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL, but applications may encounter unique challenges related to eventual consistency, performance and differences in semantics compared to HDFS. This session explores those challenges and presents recent work to address them in a comprehensive effort spanning multiple Hadoop ecosystem components, including the Object Store FileSystem connector, Hive, Tez and ORC. Our goal is to improve correctness, performance, security and operations for users that choose to integrate Hadoop with Cloud Storage. We use S3 and S3A connector as case study.
Sqoop is a tool for transferring bulk data between Hadoop and structured datastores like relational databases. This document compares Sqoop 1 and the new Sqoop 2 architecture. Sqoop 2 addresses limitations of Sqoop 1 by providing a server-side installation, explicit connector selection, common functionality for all connectors, and role-based security for accessing external systems and managing resources. The document highlights improved ease of use, extension, and security in Sqoop 2 compared to the client-side tool design of Sqoop 1.
hbaseconasia2017: Building online HBase cluster of Zhihu based on KubernetesHBaseCon
Zhiyong Bai
As a high performance and scalable key value database, Zhihu use HBase to provide online data store system along with Mysql and Redis. Zhihu’s platform team had accumulated some experience in technology of container, and this time, based on Kubernetes, we build flexible platform of online HBase system, create multiple logic isolated HBase clusters on the shared physical cluster with fast rapid,and provide customized service for different business needs. Combined with Consul and DNS server, we implement high available access of HBase using client mainly written with Python. This presentation is mainly shared the architecture of online HBase platform in Zhihu and some practical experience in production environment.
hbaseconasia2017 hbasecon hbase
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Steve Hoffman
Apache Flume is a distributed system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store such as Hadoop Distributed File System (HDFS). It consists of agents that collect data from sources and deliver it to sinks using channels. Common sources include log files, Kafka streams, and Avro clients. Common sinks include HDFS, HBase, Elasticsearch, and Kafka. Flume provides reliable and available service for efficiently collecting and moving large amounts of log data.
LINE's messaging service architecture underlying more than 200 million monthl...kawamuray
Yuto Kawamura from LINE Corp presented on the messaging service architecture underlying LINE's 200 million monthly active users. The key points are:
1. LINE uses a distributed architecture with the LEGY gateway, talk-server application servers, and a hybrid Redis/HBase datastore to handle over 25 billion messages per day.
2. The Armeria RPC framework is used for communication between systems like the talk-servers, authentication services, and analytics.
3. Apache Kafka is used as the backbone for asynchronous task processing and data synchronization between services due to its load distribution, fail-over capabilities, and pub-sub model.
4. While LINE leverages many open source technologies, it also
Real-time streaming and data pipelines with Apache KafkaJoe Stein
Get up and running quickly with Apache Kafka http://kafka.apache.org/
* Fast * A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
* Scalable * Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers
* Durable * Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
* Distributed by Design * Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
OpenStack + Cloud Foundry for the OpenStack Boston Meetupragss
The document provides an overview of Cloud Foundry, including:
- An introduction to Cloud Foundry, its architecture, and how to deploy and manage applications on it.
- Details on deploying applications and services, and how services are bound to applications.
- How Cloud Foundry provides health management of applications and the platform itself through high availability, monitoring, and logging.
- Additional resources for learning more about Cloud Foundry.
Webinar - Order out of Chaos: Avoiding the Migration MigrainePeak Hosting
When your business has outgrown your current managed hosting provider, the logical thing is to search for something better. Change can be difficult and chaotic, but it doesn’t have to be.
This webinar focuses on best practices for making your migration from the cloud as pain free as possible, including a discussion on what you need to know and ask of your migration provider to ensure it goes smoothly. As an example of this, we will outline Peak Hosting’s migration process, as well as discuss one of our customer migrations and why they chose to undertake it.
CtrlS: Cloud Solutions for Retail & eCommerceeTailing India
The document discusses various topics related to data centers and IT infrastructure including:
1. Metrics for measuring data center efficiency such as PUE (Power Usage Effectiveness), DCiE (Data Center Infrastructure Efficiency), and strategies for improving efficiency including water-cooled condensers, variable frequency drives, and more efficient server power supplies.
2. CtrlS's data center architecture and disaster recovery solutions including solutions spanning multiple availability zones with synchronous and asynchronous replication for high availability.
3. Trends in IT infrastructure like virtualization, cloud computing, and openflow networking.
4. Managed services solutions discussed including e-commerce platforms, logistics, CRM and other applications.
The document discusses whether healthcare organizations should abandon cloud computing. It notes that while cloud computing provides benefits like cost savings, increased IT complexity and stringent regulations have led to both successes and failures in healthcare cloud adoption. The document considers factors for healthcare organizations to evaluate on initial cloud deployment and ongoing use, such as security, accessibility, governance and meeting key performance indicators. It argues that with the right monitoring platform to gain visibility and control over hybrid cloud environments, healthcare organizations can better anticipate issues and optimize cloud operations while meeting compliance requirements.
By: Marianne Eggett, Linux Emerging Technology Practice Mgr, Mainline Information Systems
Are you considering a migration to Linux on IBM System z? The first step is to develop a detailed plan that outlines the short term and long term benefits of your migration.
In this presentation you will learn:
- How to identify the business case to support consolidation with System z Linux
- Examples of cost savings other businesses have experienced
- How to build a Total Cost of Ownership report specific to your environment
To view this presentation with audio, visit: http://go.mainline.com/pages/start/knowledge-center-building-the-case-zlinux-webcast-june-2009/index.html?Campaign_Id=7071&Activity_Id=6131
For other topics, visit: www.mainline.com/kc
Service-Level Objective for Serverless Applicationsalekn
Deploying commercial applications that meet their expected business needs is challenging due to the differences between how business goals are specified and how the system is evaluated. Furthermore, business goals are dynamic, requiring deployment to change constantly over time. Such difficulties make it costly to maintain application quality as the underlying infrastructure is not always fast enough to keep up with business changes. Nowadays, serverless opens a new approach to build application. By abstracting out the deployment details, serverless application can be implemented with minimum deployment efforts. Serverless also reduces maintenance cost with auto-scaling and pay-as-you-go. Such abilities make us believe that by adopting serverless, we can build application that can meet and quickly adapt to business goals.
However, simply writing applications with serverless is not sufficient. Due to best-effort invocation mechanisms and the lack of application structure awareness, serverless performance is highly variable and often fails to support applications with rigorous quality of service requirements. In this study, we aim to mitigate such limitations by coupling serverless deployment with business needs. In particular, we define an Serverless Service-Level Objective (SLO) interface that allows developers to describe their application structure and business goals in terms of software-level objectives. We implement an SLO enforcer, which uses this information in combination with the system performance metrics to decide a proper serverless deployment and resource allocation for meeting business goals. The Serverless SLO leverages blueprint model, which allow developers to describe applications' architecture and runtime characteristics needs, to map application description to serverless function deployment on the top of Knative. We deploy our proposed system on KinD, a tool to run Kubernetes cluster over our local Docker container, and evaluate it with different system configurations. Evaluation results showed that SLO definition and enforcement helps serverless application use resources in accordance with business goals.
This document provides troubleshooting best practices for Genesys Voice Platform (GVP) 8.x. It discusses how GVP applications communicate with each other using SIP, describes common SIP error codes and how to track calls using identifiers in Genesys logs. The document explains how to enable custom logging, save temp files, and manually produce dump and core files. It also reviews using Wireshark to analyze network traffic and addresses common questions and problems.
The Real World - Plugging the Enterprise Into It (nodejs)Aman Kohli
This document discusses using Node.js as the foundation for building applications that connect the physical world to enterprise systems through mobile devices and sensors. It describes initial work done to build a proxy and protocol for handling requests and addresses challenges with authentication, scalability, and performance testing. The document shares results from benchmarking the system under different network conditions and outlines next steps to improve concurrency, security, and infrastructure elasticity.
MT49 Dell EMC XtremIO: Product Overview and New Use CasesDell EMC World
This session provides an overview of the Dell EMC XtremIO all-flash, scale-out array and its design objectives.
The architecture will be discussed and compared to other flash arrays in the market with the goal of helping the audience understand the unique architectural differentiation and highlight some new and exciting customer use cases that create business agility and demonstrate XtremIO’s transformative value.
We will challenge your assumptions about storage infrastructure.
Logging is important for troubleshooting a DNS service. Conveniently with BIND 9, almost all problems will show up somewhere in the log output, but only if the logging is enabled and configured correctly.
In this webinar, we’ll discuss the BIND 9 logging configuration and best practices in searching through large log-files to find the entries of interest. In addition, we’ll release log-management tools used by Men & Mice Services.
Telecoms Service Assurance & Service Fulfillment with Neo4j Graph DatabaseNeo4j
This document discusses how graph databases can help with telecom network management. It presents a sample telecom network data model with nodes for customers, services, equipment, and connections linked by relationships. Graph databases are well-suited because they can efficiently query the complex relationships in telecom network topology data. The document describes how graph queries can help with tasks like capacity planning, path finding, service impact analysis, and alarm correlation that are important for telecom network assurance and service fulfillment.
Supporting Enterprise System Rollouts with SplunkErin Sweeney
At Cricket Communications, Splunk started as a way to correlate all of our data into one view to help our operations team keep processes humming. Then we gave secured access to our developers, now they’re addicted. In fact, Splunk is critical in helping us speedup deployment of new systems (like our recent multi-million dollar billing system implementation). Learn how we use Splunk to display key metrics for the business, track overall system health, track transactions, optimize license usage, and support capacity
planning.
Jim Stertz: Automation and Robotic Arm: Maximizing Throughput and Capacity360mnbsu
A medical device contract manufacturer developed and implemented an automated system and a robotic arm to tend three inspection Coordinate Measuring Machines. This presentation will be a case study of their throughput, uptime, and capacity expansions using automation.
From the 2014 Taking Shape Summit: The Internet of Things & the Future of Manufacturing.
The document discusses Cloud Foundry, an open source Platform as a Service (PaaS) that provides a way for developers to easily deploy and scale applications in the cloud. It covers topics like deploying and managing Cloud Foundry, developing and deploying applications on Cloud Foundry, and how Cloud Foundry provides health monitoring and management at both the application and platform level. The presentation includes diagrams of the Cloud Foundry architecture and deployment process.
The document discusses Cloud Foundry, an open source Platform as a Service (PaaS) that provides a way for developers to easily deploy and scale applications in the cloud. It covers topics like deploying and managing CloudFoundry, developing and deploying applications on CloudFoundry, and how CloudFoundry provides health monitoring, logging, and high availability of both applications and the platform itself. The document also provides several resources for learning more about CloudFoundry and other PaaS offerings.
Impact2014 session #1317 you have got a friend on z - tales from cics tran...Elena Nanos
IMPACT 2014 - ACU-1317: You've Got a Friend on IBM System z - Tales from CICS Transaction Gateway
I was a guest speaker at IBM IMPACT 2014 conference.
Explore the latest and greatest from CICS Transaction Gateway (TG), and hear how HCSC consolidated more than 80 CICS TG standalone instances into a CICS TG on z/OS high-availability architecture for best performance. I presented how our company eliminated single points of failure, simplified CICS TG infrastructure management and support, improved performance, took advantage of real-time monitoring and reporting capabilities, made use of idle zAAP capacity with a 98 percent offload rate, and drastically reduced maintenance cost. Presentation covers details about CICS TG on the z/OS HA architecture with automatic failover, lessons learned, and what organizations need to know to set this up.
Re-Architect Your Legacy Environment To Enable An Agile, Future-Ready EnterpriseDell World
It’s time to re-architect your legacy environment in order to lay the foundation for an adaptive enterprise. In this session, you'll learn how to increase your business and technical agility using a fit-to-purpose .NET or Java architecture, while deploying your apps intelligently in the cloud and integrating with your complex IT environment, customers and partners.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques followed by light introduction to DL and short discussion what is current state-of-the-art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used are the specific durability requirements of HBase's write-ahead log (WAL) and HDFS providing that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be so trivial to design applications that make most of its use, neither the most simple to operate. As it depends/integrates with other components from Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc) or external systems ( Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables are to be considered when observing anomalies or even outages. Adding to the equation there's also the fact that HBase is still an evolving product, with different release versions being used currently, some of those can carry genuine software bugs. On this presentation, we'll go through the most common HBase issues faced by different organisations, describing identified cause and resolution action over my last 5 years supporting HBase to our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in house projects to support Data Analytics on hundreds of petabytes of data. Our platform support storage, compute, data ingestion, discovery and management and various tools and libraries to help users for both batch and realtime analytics. Our DataPlatform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use cloud as another datacenter. We walk through our evaluation process, challenges we faced supporting data analytics at Twitter scale on cloud and present our current solution. Extending Twitter's Data platform to cloud was complex task which we deep dive in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images describing the possible ways that a retail store of the near future could operate. Identifying various storefront situations by having a deep learning system attached to a camera stream. Such things as; identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottle necks, making a fix and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their application fast and understanding how things work under the hood.
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerationswebbyacad software
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Snarky Security
How wonderful it is that in our modern age, every bit of our biological data can be digitized, stored, and potentially pilfered by cyber thieves! Isn't it just splendid to think that while scientists are busy pushing the boundaries of biotechnology, hackers could be plotting the next big bio-data heist? This delightful scenario is brought to you by the ever-expanding digital landscape of biology and biotechnology, where the integration of computer science, engineering, and data science transforms our understanding and manipulation of biological systems.
While the fusion of technology and biology offers immense benefits, it also necessitates a careful consideration of the ethical, security, and associated social implications. But let's be honest, in the grand scheme of things, what's a little risk compared to potential scientific achievements? After all, progress in biotechnology waits for no one, and we're just along for the ride in this thrilling, slightly terrifying, adventure.
So, as we continue to navigate this complex landscape, let's not forget the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. After all, what could possibly go wrong?
-------------------------
This document provides a comprehensive analysis of the security implications biological data use. The analysis explores various aspects of biological data security, including the vulnerabilities associated with data access, the potential for misuse by state and non-state actors, and the implications for national and transnational security. Key aspects considered include the impact of technological advancements on data security, the role of international policies in data governance, and the strategies for mitigating risks associated with unauthorized data access.
This view offers valuable insights for security professionals, policymakers, and industry leaders across various sectors, highlighting the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. The analysis serves as a crucial resource for understanding the complex dynamics at the intersection of biotechnology and security, providing actionable recommendations to enhance biosecurity in an digital and interconnected world.
The evolving landscape of biology and biotechnology, significantly influenced by advancements in computer science, engineering, and data science, is reshaping our understanding and manipulation of biological systems. The integration of these disciplines has led to the development of fields such as computational biology and synthetic biology, which utilize computational power and engineering principles to solve complex biological problems and innovate new biotechnological applications. This interdisciplinary approach has not only accelerated research and development but also introduced new capabilities such as gene editing and biomanufact
Retrieval Augmented Generation Evaluation with RagasZilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.