Jive is using Flume to deliver the content of the social web (250M messages/day) to HDFS and HBase. Flume's flexible architecture allows us to stream data to our production data center as well as to Amazon Web Services. We periodically build and merge Lucene indices with Hadoop jobs and deploy them to Katta to provide near-real-time search results. This talk will explore our infrastructure and the decisions we've made to handle a fast-growing set of real-time data feeds. We will further explore other uses for Flume throughout Jive, including log collection and our distributed event bus.
Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters – DataWorks Summit
This document discusses enabling exploratory analytics of data in shared-service Hadoop clusters using Hunk. It describes how Hunk allows users to visually browse and analyze data in HDFS through an interactive search interface without needing to understand the data schema. The document provides examples of how Hunk has been used at Yahoo to gain operational insights from Hadoop cluster metrics and optimize performance. It demonstrates how Hunk can create visualizations and dashboards for analyzing jobs, queues, NameNode usage and more.
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar – Databricks
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation's monitoring pipeline processes about 40 billion events every day and generates metrics in near real time (within 30 seconds). All the components used along with Apache Spark are horizontally scalable using any auto-scaling technique, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years, and will share some important lessons learned. For example, they still use Spark Streaming's receiver-based method in certain use cases instead of Direct Streaming, and will discuss the application of both methods, giving the knowledge back to the community.
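To make the receiver-based vs. direct distinction concrete, here is a minimal Java sketch of the direct (receiver-less) Kafka integration using the spark-streaming-kafka-0-10 API; the broker address, topic name, group id, and the 30-second batch interval are illustrative assumptions, not Sony's actual configuration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class MonitoringStream {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("event-monitoring");
    // 30-second batches, matching the "metrics within 30 seconds" goal described above
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "kafka:9092");             // assumed broker address
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "monitoring");                       // assumed consumer group

    // Direct (receiver-less) stream: Spark tracks the Kafka offsets itself
    JavaInputDStream<ConsumerRecord<String, String>> events =
        KafkaUtils.createDirectStream(
            ssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Collections.singletonList("service-events"), kafkaParams)); // assumed topic

    // A trivially simple metric: number of events per 30-second batch
    events.count().print();

    ssc.start();
    ssc.awaitTermination();
  }
}
```

With the direct approach Spark tracks the Kafka offsets itself rather than relying on a receiver and a write-ahead log, which is the trade-off the receiver-based method makes differently.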
During the course of this presentation, forward-looking statements were made regarding Splunk's expected performance and legal notices were provided. The presentation discussed using Splunk to analyze large amounts of data stored in Hadoop by moving computation to the data through MapReduce jobs while supporting Splunk Processing Language and maintaining schema on read. Optimization techniques like partition pruning were covered to improve performance as well as best practices, troubleshooting tips, and resources for using Hunk.
Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters – Brett Sheppard
The document discusses Hunk, a self-service analytics platform for exploring, visualizing, and analyzing data stored in Hadoop clusters and other data stores. Hunk allows users to rapidly interact with data through an interactive search interface and preview results without waiting for full queries to finish. It provides integrated visualization of data through built-in graphs and charts. Hunk deployment is fast, requiring under 60 minutes to connect to Hadoop clusters and begin searching data.
The document provides an overview of Hunk, a product from Splunk that allows users to explore, analyze and visualize data stored in Hadoop. Some key points:
- Hunk uses virtual indexes to enable searching of data in Hadoop using Splunk's interface and capabilities without needing to move the data. It handles MapReduce jobs behind the scenes.
- It provides an interactive interface for business users to explore and query data in Hadoop in an easy and flexible way, with the ability to preview results while MapReduce jobs are running.
- Integration with Hadoop is done through Hadoop client libraries, requiring only read access to data stored in HDFS. Hunk supports various Hadoop distributions and operating systems.
BlueData Hunk Integration: Splunk Analytics for Hadoop – BlueData, Inc.
Hunk is a Splunk analytics tool that allows users to explore, analyze, and visualize raw big data stored in Hadoop and NoSQL data stores. It can interactively query raw data, accelerate reporting, create charts and dashboards, and archive historical data to HDFS. BlueData's EPIC platform enables running Hunk jobs on Hadoop clusters while accessing data from any storage system, such as HDFS, NFS, Gluster, and others. Hunk supports ingesting large amounts of data and provides pre-packaged analytics functions and intuitive visualization of results.
Two of the most frequently asked questions about Pinot’s history are “Why did LinkedIn build Pinot?”, “How is it different from Druid, ElasticSearch, Kylin?”. In this talk, we will go over the use cases that motivated us to build Pinot and how it has changed the analytics landscape at LinkedIn, Uber, and other companies.
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid – DataWorks Summit
This document discusses using an open source Lambda architecture with Kafka, Hadoop, Samza, and Druid to handle event data streams. It describes the problem of interactively exploring large volumes of time series data. It outlines how Druid was developed as a fast query layer for Hadoop to enable low-latency queries over aggregated data. The architecture ingests raw data streams in real-time via Kafka and Samza, aggregates the data in Druid, and enables reprocessing via Hadoop for reliability.
This document provides an agenda and overview for a Splunk TechDay event focused on Splunk Ninja skills. The agenda includes refreshers on search language and structure, examples of SPL commands for searching, charting, and exploring data, and custom commands for extending SPL capabilities. The overview sections explain key aspects of SPL like its large command set, syntax based on Unix pipelines and SQL, and uses for data searching, filtering, and manipulation. Examples are provided for various SPL techniques including search/filter, evaluating/modifying fields, statistics, and charting.
From Eric Baldeschwieler's presentation "Hadoop @ Yahoo! - Internet Scale Data Processing" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/509
The first part of the talk describes the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, we will focus on recent advancements in Oozie for dependency management among pipeline stages; incremental and partial processing; combinatorial, conditional and optional processing; priority processing; late processing; and BCP management. The second part of the talk focuses on out-of-the-box support for Spark jobs.
Speakers:
Purshotam Shah is a senior software engineer with the Hadoop team at Yahoo, and an Apache Oozie PMC member and committer.
Satish Saley is a software engineer at Yahoo!. He contributes to Apache Oozie.
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and not such a challenge anymore. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later. You have to be able to include part of your analytics right after you consume the data streams. Products for doing event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the past few years, another family of products appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming, Flink and Kafka Streams, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations for stream processing, discuss the core properties a stream processing platform should provide, and highlight the differences you might find between the more traditional CEP and the more modern stream processing solutions.
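As a concrete illustration of analysing events right as they are consumed, here is a minimal Kafka Streams sketch in Java that counts readings per sensor over one-minute windows; the application id, broker address, and topic name are assumptions made for the example.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class SensorEventCounts {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sensor-event-counts"); // assumed app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");       // assumed broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> events = builder.stream("sensor-events");      // assumed topic

    // Analyse the stream as it is consumed: count readings per sensor in 1-minute windows
    events.groupByKey()
          .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
          .count()
          .toStream()
          .foreach((window, count) ->
              System.out.printf("%s @ %s -> %d readings%n",
                  window.key(), window.window().startTime(), count));

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}
```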
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi... – Imply
Ensuring a consistently great Netflix experience while continuously pushing innovative technology updates is no easy feat.
We'll look at how Netflix turns log streams into real-time metrics to provide visibility into how devices are performing in the field, and we'll share some of the lessons learned around optimizing Druid to handle our load.
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016) – Sid Anand
This document discusses cloud native data pipelines. It begins by describing the speaker and their work experience. Then, it outlines some key qualities of resilient data pipelines like operability, correctness, timeliness and cost. Two use cases at the speaker's company for applying trust models to messages are presented - one using batch processing and the other using near real-time processing. The document discusses how tools like Apache Airflow, auto-scaling groups, Amazon Kinesis and Avro can help achieve those qualities for data pipelines in the cloud.
Managing your black friday logs Voxxed Luxembourg – David Pilato
The document discusses strategies for optimally scaling Elasticsearch clusters to handle large volumes of time-series data like logs. It recommends creating a new index daily to separate older data and allow deleting indexes after some period. It also suggests techniques like sharding data across nodes, using aliases to query multiple indexes, and load balancing ingest across coordinating nodes to optimize performance and avoid bottlenecks when data volumes increase over time.
This document provides an overview of WANdisco's NonStop HBase solution for making HBase continuously available for enterprise deployments. It discusses traditional high availability approaches that rely on backups and describes how these can fail. It then introduces WANdisco's patented active-active replication technology that provides 100% uptime with zero downtime. The document demonstrates how WANdisco implements multiple active HBase masters and region servers using a distributed coordination engine and Paxos consensus protocol. This allows HBase to avoid single points of failure and provides seamless failover for clients. It concludes with a demo of the NonStop HBase solution in action.
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset – Hortonworks
The document discusses building real-time dashboards on data streams. It describes using Apache Kafka to ingest streaming data from Wikipedia edits. The data is enriched using Kafka Streams and stored in Apache Druid for powering interactive visualizations in Superset. Key components are Kafka for the event flow, Kafka Streams for processing, Druid for the data store, and Superset for visualization.
Managing your Black Friday Logs NDC Oslo – David Pilato
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical of Black Friday.
Topics include:
* monitoring architectures
* optimal bulk size
* distributing the load
* index and shard size
* optimizing disk IO
Takeaway: best practices when building a monitoring system with the Elastic Stack, advanced tuning to optimize and increase event ingestion performance.
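On the optimal-bulk-size point above, here is a minimal sketch of what a bulk request to Elasticsearch looks like, using only the JDK HTTP client; the endpoint, daily index name, and the two sample events are assumptions, and the right number of documents per request is exactly what the talk advises you to find by testing on your own hardware.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    List<String> events = List.of(
        "{\"msg\":\"checkout ok\",\"level\":\"INFO\"}",
        "{\"msg\":\"payment timeout\",\"level\":\"ERROR\"}");

    // Newline-delimited bulk body: one action line, then one document line, per event
    StringBuilder body = new StringBuilder();
    for (String event : events) {
      body.append("{\"index\":{\"_index\":\"logs-2023.11.24\"}}\n"); // daily index, assumed name
      body.append(event).append("\n");
    }

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9200/_bulk"))              // assumed ES endpoint
        .header("Content-Type", "application/x-ndjson")
        .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```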
The document summarizes Druid, an open source data analytics platform, and how it has enhanced the data platform for a company to enable better business decisions. Key features of Druid include sub-second aggregate queries, real-time analytics dashboards, and live queries for unique users. Druid has helped scale to several hundred terabytes of data with thousands of queries per second while supporting new analytics applications, ad hoc reporting, and exploratory analysis. Future plans include improving the query service and migrating components to technologies like Spark, Flink, Mesos and Docker.
Spark Streaming & Kafka - The Future of Stream Processing – Jack Gudenkauf
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates with Kafka natively with no data loss, and even how to do exactly-once processing!
Chicago Data Summit: Flume: An Introduction – Cloudera, Inc.
Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale data storage and analytics platforms such as Apache Hadoop. It has four goals in mind: Reliability, Scalability, Extensibility, and Manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, support for low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS but also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.
Hadoop World 2011: Advanced HBase Schema Design – Cloudera, Inc.
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
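As one example of designing a row key for write-heavy ingest, here is a hedged sketch with the HBase Java client: a salt prefix spreads sequential writes across regions, and a reversed timestamp keeps a user's newest events first in a scan. The table name, column family, and salt-bucket count are assumptions for illustration, not the schema from the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EventWriter {
  private static final int SALT_BUCKETS = 16; // assumed; spreads hot sequential writes

  /** Row key = salt byte + user id + reversed timestamp (newest events sort first). */
  static byte[] rowKey(String userId, long eventTimeMillis) {
    byte salt = (byte) ((userId.hashCode() & 0x7fffffff) % SALT_BUCKETS);
    long reversedTs = Long.MAX_VALUE - eventTimeMillis;
    return Bytes.add(new byte[] {salt}, Bytes.toBytes(userId), Bytes.toBytes(reversedTs));
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"))) {   // assumed table name
      Put put = new Put(rowKey("user-42", System.currentTimeMillis()));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
      table.put(put);
    }
  }
}
```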
This document discusses Spring for Apache Hadoop, a framework that simplifies using Apache Hadoop and related projects like HBase and Hive within the Spring programming model. It provides wrappers and configuration for common Hadoop tasks like MapReduce jobs, scripting, and accessing Hadoop databases and data processing engines. The goals are to provide a programmatic model for the Hadoop ecosystem, simplify client libraries, and leverage Spring features. It supports various Hadoop distributions and provides interfaces for MapReduce, HBase, Hive, Pig and other Hadoop technologies.
Designing a reactive data platform: Challenges, patterns, and anti-patterns Alex Silva
Presentation given at the O'Reilly Software Architecture Conference in NYC, April 2016.
Covers the key architectural decisions made behind the design of a reactive self-service data ingestion analytics platform that is able to fulfill several business use cases at massive scale, both at real-time and batch scopes, while leveraging and integrating Kafka and Spark in an efficient, easy to use way.
The presentation describes a message-driven, reactive and distributed design that leverages REST and Hypermedia protocols, and several open source frameworks and platforms, including Akka, Kafka, Hadoop and Spark.
How to develop Big Data Pipelines for Hadoop, by Costin Leau – Codemotion
Hadoop is not an island. To deliver a complete Big Data solution, a data pipeline needs to be developed that incorporates and orchestrates many diverse technologies. In this session we will demonstrate how the open source Spring Batch, Spring Integration and Spring Hadoop projects can be used to build manageable and robust pipeline solutions to coordinate the running of multiple Hadoop jobs (MapReduce, Hive, or Pig), but also encompass real-time data acquisition and analysis.
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c – Mark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.
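The simplest of those ingestion options, copying a file into HDFS, looks like this through the Java FileSystem API; it is the programmatic equivalent of the hadoop fs -put shell command mentioned above, and the paths are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf)) {
      // Equivalent of: hadoop fs -put /data/export/orders.csv /landing/orders/orders.csv
      fs.copyFromLocalFile(new Path("/data/export/orders.csv"),     // assumed local file
                           new Path("/landing/orders/orders.csv")); // assumed HDFS target
    }
  }
}
```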
Building Continuously Curated Ingestion Pipelines – Arvind Prabhakar
Data ingestion is a critical piece of infrastructure for any Big Data project. Learn about the key challenges in building Ingestion infrastructure and how enterprises are solving them using low level frameworks like Apache Flume, Kafka, and high level systems such as StreamSets.
Open Source Big Data Ingestion - Without the Heartburn! – Pat Patterson
Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail, inducing heartburn in even the most resilient data scientist. This session will survey the big data ingestion landscape, focusing on how open source tools such as Sqoop, Flume, Nifi and StreamSets can keep the data pipeline flowing.
Data Ingestion, Extraction & Parsing on Hadoop – skaluska
The document discusses options for ingesting, extracting, parsing, and transforming data on Hadoop using Informatica products. It outlines Informatica's current capabilities for data integration with Hadoop and its roadmap to enhance capabilities for processing data directly on Hadoop in the first half of 2012. This will allow users to design data processing flows visually and execute them on Hadoop for optimized performance.
The document discusses best practices for streaming applications. It covers common streaming use cases like ingestion, transformations, and counting. It also discusses advanced streaming use cases that involve machine learning. The document provides an overview of streaming architectures and compares different streaming engines like Spark Streaming, Flink, Storm, and Kafka Streams. It discusses when to use different storage systems and message brokers like Kafka for ingestion pipelines. The goal is to understand common streaming use cases and their architectures.
Intro to HBase Internals & Schema Design (for HBase users) – alexbaranau
This document provides an introduction to HBase internals and schema design for HBase users. It discusses the logical and physical views of HBase, including how tables are split into regions and stored across region servers. It covers best practices for schema design, such as using row keys efficiently and avoiding redundancy. The document also briefly discusses advanced topics like coprocessors and compression. The overall goal is to help HBase users optimize performance and scalability based on its internal architecture.
Arvind Prabhakar presented on Apache Flume. He discussed that Flume is an open-source system for aggregating large amounts of log and streaming data from many sources and efficiently transporting it to data stores and processing systems. It is designed to handle high volumes of continuously arriving data from distributed servers or devices. Flume uses a pipeline-based architecture that allows for reliable, scalable, and customizable data ingestion.
This document outlines Apache Flume, a distributed system for collecting large amounts of log data from various sources and transporting it to a centralized data store such as Hadoop. It describes the key components of Flume including agents, sources, sinks and flows. It explains how Flume provides reliable, scalable, extensible and manageable log aggregation capabilities through its node-based architecture and horizontal scalability. An example use case of using Flume for near real-time log aggregation is also briefly mentioned.
This document discusses using Apache Spark and Apache NiFi together for data lakes. It outlines the goals of a data lake including having a central data repository, reducing costs, enabling easier discovery and prototyping. It also discusses what is needed for a Hadoop data lake, including automation of pipelines, governance, and interactive data discovery. The document then provides an example ingestion project and describes using Apache Spark for functions like cleansing, validating, and profiling data. It outlines using Apache NiFi for the pipeline design with drag and drop functionality. Finally, it demonstrates ingesting and preparing data, data self-service and transformation, data discovery, and operational monitoring capabilities.
How to Build Continuous Ingestion for the Internet of Things – Cloudera, Inc.
The Internet of Things is moving into the mainstream and this new world of data-driven products is transforming a vast number of industry sectors and technologies.
However, IoT creates a new challenge: how to build and operationalize continual data ingestion from such a wide and ever-changing array of endpoints so that the data arrives consumption-ready and can drive analysis and action within the business.
In this webinar, Sean Anderson from Cloudera and Kirit Busu, Director of Product Management at StreamSets, will discuss Hadoop's ecosystem and IoT capabilities and provide advice about common patterns and best practices. Using specific examples, they will demonstrate how to build and run end-to-end IOT data flows using StreamSets and Cloudera infrastructure.
1. Hadoop is used extensively at Twitter to handle large volumes of data from logs and other sources totaling 7TB per day. Tools like Scribe and Crane are used to input data and Elephant Bird and HBase for storage.
2. Pig is used for data analysis on these large datasets to perform tasks like counting, correlating, and researching trends in users and tweets.
3. The results of these analyses are used to power various internal and external Twitter products and keep the business agile through ad-hoc analyses.
This document provides an overview of big data processing and how it is implemented at Detik.com. It defines big data as large, complex datasets that cannot be processed by traditional databases. It discusses the four V's of big data: volume, velocity, variety, and veracity. It then gives examples of big data sources and sizes. The document outlines the Hadoop ecosystem including components like HDFS, MapReduce, Hive, and Pig. It describes how Detik uses Hadoop, Akka, Hive and Pig to process large log files and perform analytics calculations on metrics like popular articles, exit rates, and bounce rates within 15 minutes.
Big Data Applications Made Easy: Fact Or Fiction? – Glenn Renfro
With Spring XD the answer is Fact. In short Spring XD provides a one stop shop for writing and deploying Big Data Applications. It provides a scalable, fault tolerant, distributed runtime for Data Ingestion, Analytics, and Workflow Orchestration using a single programming, configuration and extensibility model. By reducing the complexity of Big Data development, developers can focus on the business problem.
In this discussion, we will cover:
• The basics of Spring XD
• Show how to deploy streams that will handle data received from multiple sources, and write the results to various sinks
• Capture some analytics from a live data stream
• Show how to create and execute Jobs
• Demonstrate the failover capabilities of a XD Cluster
• Discuss how to create your own custom modules
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac... – Cloudera, Inc.
Michael Sun presented on CBS Interactive's use of Hadoop for web analytics processing. Some key points:
- CBS Interactive processes over 1 billion web logs daily from hundreds of websites on a Hadoop cluster with over 1PB of storage.
- They developed an ETL framework called Lumberjack in Python for extracting, transforming, and loading data from web logs into Hadoop and databases.
- Lumberjack uses streaming, filters, and schemas to parse, clean, lookup dimensions, and sessionize web logs before loading into a data warehouse for reporting and analytics.
- Migrating to Hadoop provided significant benefits including reduced processing time, fault tolerance, scalability, and cost effectiveness compared to their previous system.
Data Freeway is a system developed by Facebook to handle large volumes of data in real-time at scale. It includes components like Scribe for distributed logging, Calligraphus for persisting logs to HDFS, and Puma for real-time analytics on the data. The system is designed to handle over 10GB/second of data reliably with low latency of less than 10 seconds for 99% of data. It provides a simple interface for applications to access real-time data streams through tools like ptail. The system is open source and used at Facebook to power applications like real-time search, spam detection, and metrics analysis.
The millions of people who use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets of that scale, and what can be done with it? I will briefly cover how Spotify uses data to provide a better music listening experience and to strengthen their business. Most of the talk will be spent on our data processing architecture and how we leverage state-of-the-art data processing and storage tools, such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Last, I'll present observations and thoughts on innovation in the data processing, aka Big Data, field.
Testing Big Data: Automated ETL Testing of Hadoop – Bill Hayduk
Learn why testing your enterprise's data is pivotal for success with Big Data and Hadoop. See how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
This document discusses Hadoop usage at eBay over time from 2007 to 2015. It describes:
- The growth of eBay's Hadoop clusters from 1-10 nodes in 2007 to over 10,000 nodes and 150,000 cores projected for 2015.
- How the amount of data stored in Hadoop has grown from 1PB in 2010 to a projected 150+ PB in 2015.
- The types of clusters eBay uses including dedicated, shared, and HAAS clusters.
- Some key use cases for Hadoop at eBay like building a near real-time search index and processing 1.68 million items in 3 minutes.
- Operational requirements for eBay's large Hadoop ecosystem like high availability, security,
Hive is used at Facebook for data warehousing and analytics tasks on a large Hadoop cluster. It allows SQL-like queries on structured data stored in HDFS files. Key features include schema definitions, data summarization and filtering, extensibility through custom scripts and functions. Hive provides scalability for Facebook's rapidly growing data needs through its ability to distribute queries across thousands of nodes.
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S... – Cloudera, Inc.
Apache Drill is an interactive SQL query engine for analyzing large scale datasets. It allows for querying data stored in HBase and other data sources. Drill uses an optimistic execution model and late binding to schemas to enable fast queries without requiring metadata definitions. It leverages recent techniques like vectorized operators and late record materialization to improve performance. The project is currently in alpha stage but aims to support features like nested queries, Hive UDFs, and optimized joins with HBase.
This document provides an overview of Hive, including:
1. It describes Hive's architecture which uses HDFS for storage, MapReduce for execution, and stores metadata in an RDBMS.
2. It outlines Hive's data types including primitive, collection, and file format types.
3. It discusses Hive's query language (HQL) which resembles SQL and can be used to define databases and tables, load and query data.
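To give a flavour of HQL, here is a small sketch that defines a partitioned table, loads a day of log data, and queries it through the HiveServer2 JDBC driver; the connection URL, table, and columns are assumptions made for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 JDBC endpoint; host, port and database are assumptions
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {

      // HQL looks like SQL: define a partitioned table over delimited files in HDFS...
      stmt.execute("CREATE TABLE IF NOT EXISTS page_views ("
          + " user_id STRING, url STRING)"
          + " PARTITIONED BY (dt STRING)"
          + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

      // ...load a day of data, then query it; each query compiles down to MapReduce jobs
      stmt.execute("LOAD DATA INPATH '/logs/2011-11-08'"
          + " INTO TABLE page_views PARTITION (dt='2011-11-08')");

      try (ResultSet rs = stmt.executeQuery(
          "SELECT url, COUNT(*) AS hits FROM page_views"
          + " WHERE dt='2011-11-08' GROUP BY url ORDER BY hits DESC LIMIT 10")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}
```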
Data infrastructure at Facebook, with reference to the conference paper "Data warehousing and analytics infrastructure at Facebook"
Data warehouse
Hadoop - Hive - Scribe
Presto is a distributed SQL query engine that allows for interactive analysis of large datasets across various data sources. It was created at Facebook to enable interactive querying of data in HDFS and Hive, which were too slow for interactive use. Presto addresses problems with existing solutions like Hive being too slow, the need to copy data for analysis, and high costs of commercial databases. It uses a distributed architecture with coordinators planning queries and workers executing tasks quickly in parallel.
This document discusses using streaming MapReduce to perform real-time big data analytics on click stream data. The goal is to analyze large log streams from web servers to identify products meeting thresholds for impressions within a short time period, such as having over 1,000 views in the last 5 seconds from hundreds of servers. The project was changed to use generated click stream data resembling web server logs to simulate millions of impressions per second from many virtual machines for testing real-time analysis within guaranteed service level agreements. HBase is used for storage.
Pivotal HD is a Hadoop distribution that includes additional components to configure, deploy, monitor and manage Hadoop clusters. It provides tools like the Command Center for visual cluster monitoring and job management, Hadoop Virtualization Extensions to improve resource utilization, and HAWQ for high performance SQL queries and analytics across Hadoop data.
GDPR compliance application architecture and implementation using Hadoop and ... – DataWorks Summit
The General Data Protection Regulation (GDPR) is legislation designed to protect the personal data of European Union citizens and residents. The main requirement is to log personal data accesses/changes in customer-specific applications. These logs can then be audited by owning entities to provide reporting to end users indicating usage of their personal data. Users have the "right to be forgotten," meaning their personal data can be purged from the system at their request. The regulation goes into effect on May 25, 2018 with significant fines for non-compliance.
This session will provide insight on how to approach/implement a GDPR compliance solution using Hadoop and streaming for any enterprise with heavy volumes of data. This session will delve into deployment strategies, the architecture of choice (Kafka, NiFi, and Hive ACID with streaming), implementation best practices, configurations, and security requirements. Hortonworks Professional Services System Architects helped the customer on the ground to design, implement, and deploy this application in production.
Speaker
Saurabh Mishra, Hortonworks, Systems Architect
Arun Thangamani, Hortonworks, Systems Architect
The document discusses LinkedIn's data ecosystem and the challenge of bridging operational transactional data (OLTP) with analytical processing (OLAP) at scale. It describes LinkedIn's solution called Lumos, which is a scalable ETL framework that uses change data capture, delta processing, and virtual snapshots to frequently refresh petabyte-scale data from OLTP databases into Hadoop for OLAP. Lumos supports requirements like handling multiple data centers, schema evolution, and efficient change capture while ensuring data consistency and low latency refresh times.
This document contains the resume of Hassan Qureshi. He has over 9 years of experience as a Hadoop Lead Developer with expertise in technologies like Hadoop, HDFS, Hive, Pig and HBase. Currently he works as the technical lead of a data engineering team developing insights from data. He has extensive hands-on experience installing, configuring and maintaining Hadoop clusters in different environments.
This document provides an introduction to big data and related technologies. It defines big data as datasets that are too large to be processed by traditional methods. The motivation for big data is the massive growth in data volume and variety. Technologies like Hadoop and Spark were developed to process this data across clusters of commodity servers. Hadoop uses HDFS for storage and MapReduce for processing. Spark improves on MapReduce with its use of resilient distributed datasets (RDDs) and lazy evaluation. The document outlines several big data use cases and projects involving areas like radio astronomy, particle physics, and engine sensor data. It also discusses when Hadoop and Spark are suitable technologies.
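A minimal Java sketch of the RDD and lazy-evaluation point: transformations such as filter only record lineage, and nothing executes until an action such as count is called. The input path is an assumption.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvalDemo {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("lazy-eval-demo").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("hdfs:///logs/engine-sensors.log"); // assumed path

      // Transformation: lazily recorded in the RDD lineage, nothing is read yet
      JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

      // Action: triggers the actual distributed computation
      long errorCount = errors.count();
      System.out.println("error lines: " + errorCount);
    }
  }
}
```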
Similar to Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software (20)
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
Cloudera Data Impact Awards 2021 - Finalists – Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
2020 Cloudera Data Impact Awards Finalists – Cloudera, Inc.
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most-cutting edge data projects and represent innovation and leadership in their respective industries.
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Machine Learning with Limited Labeled Data 4/3/19 – Cloudera, Inc.
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 – Cloudera, Inc.
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19 – Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers such as -
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19 – Cloudera, Inc.
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 – Cloudera, Inc.
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19 – Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 – Cloudera, Inc.
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18 – Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
The document discusses the benefits and trends of modernizing a data warehouse. It outlines how a modern data warehouse can provide deeper business insights at extreme speed and scale while controlling resources and costs. Examples are provided of companies that have improved fraud detection, customer retention, and machine performance by implementing a modern data warehouse that can handle large volumes and varieties of data from many sources.
Extending Cloudera SDX beyond the Platform – Cloudera, Inc.
Cloudera SDX is by no means restricted to just the platform; it extends well beyond. In this webinar, we show you how Bardess Group's Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18 – Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360 – Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18 – Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18 – Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software
1. Storing and Indexing Social Media Content in the Hadoop Ecosystem – Lance Riedel, Brent Halsey, Jive Software
2. Jive: Social Networking for the Enterprise Engage Employees Engage Customers Engage the Social Web What Matters Apps
3. Jive Social Media Monitoring Overview – Jive Social Media Engagement stores social media for monitoring (e.g. brand sentiment), searching, and analysis.
15. Why Flume? We need to distribute our data reliably to multiple locations and systems (e.g. servers in our datacenter, in EC2, to HBase, to Hadoop). Flume design goals: Reliability – failover collectors, master failover; Scalability – linear scale by adding collector nodes; Manageability – central ZooKeeper-managed configs; Extensibility – custom sources and sinks. Good match!
16. Flume Overview: The Canonical Use Case – an agent tier (one Flume agent per server) forwards events to a smaller collector tier, which writes them into HDFS.
18. Katta – distributed Lucene: a Katta master assigns Lucene index shards (Index 1, Index 2) to Katta nodes, with replicas spread across nodes; the raw events (Raw.seq) and the indexes live in Hadoop HDFS.
26. Distributed Lucene Indexer Job – map tasks read the raw-event input HDFS blocks and each builds its own Lucene index (Index 1–4).
27. Distributed Lucene Indexer Job – the per-map indexes are shuffled/sorted with key -> shard number and value -> path to index, and the reduce tasks merge them into the final shards (Shard 1, Shard 2).
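A hedged sketch of what the reduce side of such a job can look like, written against a recent Lucene API rather than the 2011-era one used in the talk: each reducer receives the paths of the per-mapper indexes that belong to one shard and merges them with IndexWriter.addIndexes. The class name, local scratch path, and the HDFS-to-local copy (elided into comments) are assumptions, not Jive's actual code.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Reduce side of the distributed indexer: each reducer receives the paths of the
 * per-mapper Lucene indexes assigned to one shard and merges them into a single index.
 */
public class ShardMergeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

  @Override
  protected void reduce(IntWritable shard, Iterable<Text> indexPaths, Context context)
      throws IOException, InterruptedException {
    String shardDir = "/tmp/shard-" + shard.get();                   // local scratch space
    try (Directory merged = FSDirectory.open(Paths.get(shardDir));
         IndexWriter writer = new IndexWriter(merged,
             new IndexWriterConfig(new StandardAnalyzer()))) {
      for (Text indexPath : indexPaths) {
        // In the real job the per-mapper index lives in HDFS and would be copied to
        // local disk first; here we assume indexPath already points at a local directory.
        try (Directory mapperIndex = FSDirectory.open(Paths.get(indexPath.toString()))) {
          writer.addIndexes(mapperIndex);
        }
      }
      writer.forceMerge(1);                                          // one segment per shard
    }
    // The merged shard directory would then be copied back to HDFS for Katta to deploy.
    context.write(shard, new Text(shardDir));
  }
}
```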
28. 5 Minute Index Deployment Incremental Indexer Job Raw.seq
34. Job Controller – incremental indexing: 1. Scan HDFS (/raw holds the closed raw.time-*.seq files plus the still-open raw.time-1.7.seq.tmp) 2. Determine the raw input files 3. Run the INCREMENTAL index job (Distributed Indexer Job) 4. Deploy the index (e.g. Index.INCREMENTAL.time-1.6) to Katta, alongside the existing /indexes (Index.HOUR.time-1, Index.INCREMENTAL.time-1.1 through 1.3).
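The controller loop on this slide boils down to: list the raw sequence files in /raw, skip the file Flume is still writing (the .tmp suffix), submit an incremental index job for anything new, and deploy the result to Katta. A minimal sketch of the scan step using the Hadoop FileSystem API; the class name and the main-method wiring are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IncrementalIndexController {
  /** Returns the raw sequence files that are closed and ready to be indexed. */
  static List<Path> findIndexableFiles(FileSystem fs, Path rawDir) throws Exception {
    List<Path> ready = new ArrayList<>();
    for (FileStatus status : fs.listStatus(rawDir)) {
      String name = status.getPath().getName();
      // Flume is still appending to raw.time-*.seq.tmp, so only pick up closed .seq files
      if (name.endsWith(".seq")) {
        ready.add(status.getPath());
      }
    }
    return ready;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (Path p : findIndexableFiles(fs, new Path("/raw"))) {
      System.out.println("would submit incremental index job for " + p);
      // 3. run the distributed indexer job on p   4. deploy the resulting index to Katta
    }
  }
}
```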
35. Custom sources / sinks / decorators: HBase Sink – there is now a supported HBase sink, but we do some of our own transformations before insertion (e.g. it understands our JSON data); Zoie Realtime Search Sink – real-time searching of events on Flume (more details next slide); Regex Filter Decorator – allows only events through that match a key value.
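The "Regex Filter Decorator" idea (pass an event through only if its body matches a pattern) translates to today's Flume NG as an Interceptor; the sketch below is written against that API rather than the Flume OG decorator API the talk used, and the configuration property name is an assumption. Current Flume releases also ship a built-in regex filtering interceptor that covers the same need.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/** Drops every event whose body does not match the configured regex. */
public class RegexFilterInterceptor implements Interceptor {
  private final Pattern pattern;

  private RegexFilterInterceptor(Pattern pattern) {
    this.pattern = pattern;
  }

  @Override public void initialize() { }

  @Override
  public Event intercept(Event event) {
    String body = new String(event.getBody(), StandardCharsets.UTF_8);
    return pattern.matcher(body).find() ? event : null;   // null = drop the event
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> kept = new ArrayList<>(events.size());
    for (Event e : events) {
      Event out = intercept(e);
      if (out != null) kept.add(out);
    }
    return kept;
  }

  @Override public void close() { }

  public static class Builder implements Interceptor.Builder {
    private Pattern pattern;

    @Override
    public void configure(Context context) {
      // e.g. agent.sources.s1.interceptors.i1.regex = brand-x   (assumed property name)
      pattern = Pattern.compile(context.getString("regex", ".*"));
    }

    @Override
    public Interceptor build() {
      return new RegexFilterInterceptor(pattern);
    }
  }
}
```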
36. Real-time Search and Indexing – the collector fanout writes events to HBase and to Raw.seq files in Hadoop HDFS; the job controller runs the distributed indexer job to build 5-minute indexes that Katta serves; a search broker fans queries out across the Katta indexes and returns results.
37. Real-time Search and Indexing – the same flow with a Zoie Flume sink added, closing the gap between the 5-minute Katta indexes and roughly 10-second freshness.
42–45. Zoie Flume Sink – a Jetty server hosts a rolling set of small in-memory Zoie indexes covering 0–5 min, 5–10 min, 10–15 min, and > 15 min of events, with segments rotating as they age; the search broker queries these alongside the Katta-served indexes.
46. Real-time Search and Indexing – back to the full picture: events flow through the collector fanout into HBase and Raw.seq on HDFS; the job controller, distributed indexer job, and Katta provide the 5-minute indexes while the Zoie Flume sink provides ~10-second freshness, and the search broker queries both.
47. Hadoop Ecosystem @ Jive – track user activity to: power recommendations (the What Matters activity stream, people you should meet, topics you are interested in); social search (search ranking based on the social graph, topical graph, and keywords); and analytics (so a community manager understands what users are collaborating on and how engagement is increasing).
We collect content from Twitter, Facebook, blogs, and news outlets, and allow our users to search on this content, monitor it, and analyze it.
Screen shot of the app shows a user's list of monitors and content matching those monitors. Users can filter by sentiment and by the content source. They can engage in social conversations through twitter and facebook. And they can create discussions within Jive SBS.
Users can analyze social media trends over time with graph views for sentiment and content sources.
The old system takes data from content sources and throws it on a queue. The queue acts as a buffer to processors that process the content and insert it into a MySQL DB. There is some fault tolerance with multiple servers connecting to multiple queues, but it required a fair bit of monitoring and manual intervention when problems arose.
Limited because we throw away most of our content. Pushing the limits of MySQL can be painful.
Wanted to store all content (limited window), search it, and analyze it.
Chose HBase for random lookup. HDFS for chronological streaming. Katta for distributing Lucene shards. Hadoop for running map reduce.
Built out prototype of new system using Amazon's EC2 and needed a way to stream data into these servers. Internal / External IP addresses of EC2 made it difficult to connect directly to HDFS and HBase. Flume provided this connectivity along with desirable delivery guarantees.
Additionally, can fan out the data to bring data into EC2 along with our production system.
KATTA – For those not familiar with Katta, it is a distributed search engine that has two major responsibilities. The first is distributing indexes from HDFS to any number of Katta nodes. Katta nodes can run across as many machines as you want, it is easy to add more, and Katta will redistribute indexes if nodes fail. Katta has a highly customizable distribution policy – you can round-robin, or have hot/cold topologies where newer indexes are placed on faster machines. As part of the distribution there is also replication of indexes for increased load performance and failover. All of this is managed through ZooKeeper, so it is quite resilient and does a very good job of keeping indexes where ZooKeeper says they should be. The second responsibility of Katta is to take a single search request, send it to every Katta node, and gather the results.
OVERVIEW OF SEARCH – 30 days of Twitter, Facebook, major news and blogs. The next few slides show how we tackled searching a moving window of 30 days of Twitter (full firehose), the public Facebook feed, and Spinn3r (which includes all major news and blog sites). SEARCH IS USED TO INVESTIGATE MONITOR CREATION AND FOR AD-HOC ANALYTICS – search is used to investigate what monitor to create, so searching historical data is of course key; it also lets us do ad-hoc analytics over recent history, e.g. show me sentiment or raw counts for an ad-hoc query over the last 30 days.
TRANSITION – OTHER REQUIREMENTS NEED FLEXIBILITY. Other requirements of course pop up, so it was good that we chose Flume, because we could easily add new functionality. One of the key customization areas of Flume is the custom sources, sinks, and decorators you can supply. SOURCES OVERVIEW – sources allow you to create custom hooks into data providers. There is a huge list of sources provided out of the box, from tailing files to Avro HTTP endpoints where you can send raw events to Flume over HTTP using a Flume event Avro schema. SINK OVERVIEW – sinks allow you to create custom places to put the events. Again, there is a slew of out-of-the-box sinks such as HBase and HDFS. DECORATOR OVERVIEW – and then there are decorators that you can add pretty much anywhere in the topology, where you are allowed to inspect each event and add metadata, change the contents, or throw them on the floor. SOME OF OUR OWN – want to highlight a few customizations we did: (rest on slide)