Cloudera Search provides full-text search capabilities for Hadoop data by integrating Apache Solr. It allows for near real-time and batch indexing from data sources like HDFS, HBase, and Flume. Cloudera Search uses components like SolrCloud, Morphlines, and Sentry to provide distributed, scalable, and secure search across the Hadoop ecosystem.
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy (Lucidworks)
Gregg Donovan presented on lessons learned from sharding Solr at Etsy over three versions:
1) Initially, Etsy did not shard to avoid problems, but the single node approach did not scale.
2) The first sharding version used local sharding across multiple JVMs per host for better latency and manageability.
3) The current version uses distributed sharding across data centers for further latency gains, but this introduced challenges of partial failures, synchronization, and distributed queries.
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
The document discusses Lucidworks' Fusion product, which is a search platform that enhances Apache Solr. It provides connectors to various data sources, integrated ETL pipelines, built-in recommendations, and security features. The document outlines Fusion's architecture, demo use cases for basic and code search, and next steps for integrating additional analysis tools like OpenGrok.
Cloudera Morphlines is a new open source framework, recently added to the CDK, that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.
This document summarizes a presentation on using Apache Calcite for cost-based query optimization in Apache Phoenix. Key points include:
- Phoenix is adding Calcite's query planning capabilities to improve performance and SQL compliance over its existing query optimizer.
- Calcite models queries as relational algebra expressions and uses rules, statistics, and a cost model to choose the most efficient execution plan.
- Examples show how Calcite rules like filter pushdown and exploiting sortedness can generate better plans than Phoenix's existing optimizer.
- Materialized views and interoperability with other Calcite data sources like Apache Drill are areas for future improvement beyond the initial Phoenix integration.
Searching The Enterprise Data Lake With Solr - Watch Us Do It!: Presented by … (Lucidworks)
The document discusses searching enterprise data lakes with Apache Solr. It begins with an overview of how data storage has evolved from single databases to data warehouses to modern data lakes that store vast amounts of raw and processed data. The challenge is finding needed data in this environment. The document then covers the process for indexing data lake contents with Solr, including ingesting data, configuring Solr, parsing and indexing data, searching and analyzing data. It concludes with a demonstration of performing these steps and resources for further information.
Taking Splunk to the Next Level - Architecture Breakout Session (Splunk)
This document provides an overview of scaling a Splunk deployment from an initial use case to a larger enterprise deployment. It discusses growing use cases and data volume over time. The agenda covers use case mapping, simple scaling approaches, indexer and search head clustering, distributed management, and hybrid cloud deployments. Best practices are outlined for sizing storage, tuning indexers, and designing high availability into the forwarding, indexing, and search tiers. Clustering impacts on storage sizing and additional hosts are also addressed.
Taking Splunk to the Next Level - Architecture Breakout Session (Splunk)
This document discusses strategies for scaling a Splunk deployment to handle more use cases, data, and critical needs. It covers expanding use cases through business cases, scaling indexers through clustering and storage optimization, scaling search heads through clustering, and using centralized management and hybrid cloud/on-premises deployments. The agenda also promotes attending the upcoming Splunk .conf2015 conference for sessions on high availability, large deployments, search head clustering, and more.
Building a Vibrant Search Ecosystem @ Bloomberg: Presented by Steven Bower & … (Lucidworks)
This document summarizes a presentation given by Steven Bower and Ken LaPorte of Bloomberg about building their search ecosystem. They started by reviewing Bloomberg's existing fragmented search solutions and selected Apache Solr as their new platform. They created a specialized search team and designed Solr as a middleware service. This supported migrating over 1000 applications and indexing over 10 billion documents. They discussed challenges around monitoring, configuration management, and infrastructure scaling. Their solutions involved improved monitoring tools, adopting DevOps practices like Git and continuous integration, and optimizing hardware resources. Future plans include containerization, failure prediction, and expanding Solr's capabilities.
This document discusses using big data tools to build a fraud detection system. It outlines using Azure infrastructure to set up a Hadoop cluster with HDFS, HBase, Kafka and Spark. Mock transaction data will be generated and sent to Kafka. Spark jobs will process the data in batches, identifying potentially fraudulent transactions and writing them to an HBase table. The data will be visualized using Zeppelin notebooks querying Phoenix SQL on HBase. This will allow analysts to further investigate potential fraud patterns in near real-time.
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin (Spark Summit)
This document discusses securing Spark applications. It covers encryption, authentication, and authorization. Encryption protects data in transit using SASL or SSL. Authentication uses Kerberos to identify users. Authorization controls data access using Apache Sentry and the Sentry HDFS plugin, which synchronizes HDFS permissions with higher-level abstractions like tables. A future RecordService aims to provide a unified authorization system at the record level for Spark SQL.
SplunkLive! Atlanta Mar 2013 - University of Alabama at Birmingham (Splunk)
This document summarizes a presentation given by George Starcher at the University of Alabama at Birmingham about their use of Splunk for security and compliance. It discusses how Splunk helped UAB attribute IP addresses to users to resolve DMCA copyright infringement claims in minutes rather than days. It also describes how Splunk is used to identify compromised user credentials accessing their VPN from unusual locations like China or using proxies. The presentation outlines several saved searches and apps developed for UAB to monitor security events, login activity, and locations of wireless users. It concludes by discussing UAB's plans to implement more Splunk security applications and improve data retention, indexing, and compliance.
East Bay Java User Group Oct 2014: Spark Streaming, Kinesis, Machine Learning (Chris Fregly)
This document provides an overview and summary of Spark Streaming. It discusses Spark Streaming's architecture and APIs. Spark Streaming receives live input data streams and divides them into micro-batches, which it processes using Spark's execution engine to perform operations like transformations and actions. This allows for low-latency, high-throughput stream processing with fault tolerance. The document also covers Spark Streaming deployment and integrating it with sources like Kinesis, as well as monitoring and tuning Spark Streaming applications.
SplunkLive Melbourne: Scaling and best practice for Splunk on premise and in the cloud (Gabrielle Knowles)
Leverage the Splunk architecture to provide the best possible performance. Whether deploying on premise, in the cloud or on Splunk Cloud, this session will guide you through scenarios that will assist in getting the best from all these options. The agenda also covers how you can plan your searches and reporting to provide the best results for your end users.
Spark after Dark by Chris Fregly of Databricks (Data Con LA)
Spark After Dark is a mock dating site that uses the latest Spark libraries, AWS Kinesis, Lambda Architecture, and Probabilistic Data Structures to generate dating recommendations.
There will be 5+ demos covering everything from basic data ETL to advanced data processing including Alternating Least Squares Machine Learning/Collaborative Filtering and PageRank Graph Processing.
There is heavy emphasis on Spark Streaming and AWS Kinesis.
Watch the video here
https://www.youtube.com/watch?v=g0i_d8YT-Bs
Rocketfuel processes over 120 billion ad auctions per day and needs to detect fraud in real time to prevent losses. They developed Helios, which ingests event data from Kafka and HDFS into Storm in real time, joins the streams in HBase, then runs MapReduce jobs hourly to populate an OLAP cube for analyzing feature vectors and detecting fraud patterns. This architecture on Hadoop allows them to easily scale real-time processing and experiment with different configurations to quickly react to fraud.
Case study of Rujhaan.com (a social news app) (Rahul Jain)
Rujhaan.com is a news aggregation app that collects trending news and social media discussions from topics of interest to users. It uses various technologies including a crawler to collect data from social media, Apache Solr for search, MongoDB for storage, Redis for caching, and machine learning techniques like classification and clustering. The presenter discussed the technical architecture and challenges of building Rujhaan.com to provide fast, personalized news content to over 16,000 monthly users while scaling to growing traffic levels.
When it comes to data security, Uber’s business has unique needs related to scale, use-case, and technical stacks. This talk will discuss how our data platform team addressed specific challenges in deploying Uber's security requirements for Apache Hadoop, including how we leveraged open source building blocks. We'll share insights on how we augmented our Kerberized Hadoop integration with additional authentications mechanisms as well as our approach to supporting custom authentication in Apache Knox. In particular, we will elaborate Uber’s contributions to Apache Knox, specifically a novel pluggable platform for custom validation of any user request. This talk will also cover how we address table, column, and partition-level access control while ensuring improved developer productivity. In particular, we will explain how we translate RBAC policy into HDFS ACL to control data access, our internal audit platform built to detect and analyze the common security infringements, and real-world examples from our experiences in production.
Speakers
Mohammad Islam, Staff Software Engineer, Uber
Wei Han, Manager, Uber
Embeddable data transformation for real time streams (Joey Echeverria)
This document summarizes Joey Echeverria's presentation on embeddable data transformation for real-time streams. Some key points include:
- Stream processing requires the ability to perform common data transformations like filtering, extracting, projecting, and aggregating on streaming data.
- Tools like Apache Storm, Spark, and Flink can be used to build stream processing topologies and jobs, but also have limitations for embedding transformations.
- Rocana Transform provides a library and DSL for defining reusable data transformation configurations that can be run within different stream processing systems or in batch jobs.
- The library supports common transformations as well as custom actions defined through Java. Configurations can extract metrics, parse logs, and perform other transformations.
Why Kubernetes as a container orchestrator is a right choice for running Spark (DataWorks Summit)
Building and deploying an analytics service in the cloud is a challenge; maintaining that service is a bigger one. In a world where users are gravitating toward provisioning cluster instances on the fly for analytics or other purposes, and shutting them down when the jobs are done, containers and container orchestration are more relevant than ever.
Container orchestrators like Kubernetes can be used to deploy and distribute modules quickly, easily, and reliably. The intent of this talk is to share the experience of building such a service and deploying it on a Kubernetes cluster. In this talk, we will discuss all the requirements which an enterprise grade Hadoop/Spark cluster running on containers bring in for a container orchestrator.
This talk will cover in detail how the Kubernetes orchestrator can be used to meet all our needs for resource management, scheduling, networking and network isolation, volume management, etc. We will discuss how we replaced our home-grown container orchestrator, which used to manage the container lifecycle and resources in accordance with our requirements, with Kubernetes. We will also discuss the orchestrator features that are helping us deploy and patch thousands of containers, as well as a list of capabilities we believe need improvement or can be enhanced in a container orchestrator.
Speaker
Rachit Arora, SSE, IBM
This document discusses key architectural considerations for Internet of Things (IoT) systems. It outlines three main tiers: origin, transport, and analytics. The origin tier includes sensors, devices, and gateways that generate IoT data. Common protocols at this tier are discussed. The transport tier orchestrates data flow and can perform transformations. Apache NiFi and MiNiFi are presented as options. The analytics tier is where insights are derived from the data through streaming and batch processing. Apache Beam is highlighted as a framework that can unify both types of processing. The document also discusses firmware versions, parsers, schemas, and data ownership challenges.
Create a Smarter Data Lake with HP Haven and Apache Hadoop (Hortonworks)
An organization’s information is spread across multiple repositories, on-premise and in the cloud, with limited ability to correlate information and derive insights. The Smart Content Hub solution from HP and Hortonworks enables a shared content infrastructure that transparently synchronizes information with existing systems and offers an open standards-based platform for deep analysis and data monetization.
- Leverage 100% of your data: Text, images, audio, video, and many more data types can be automatically consumed and enriched using HP Haven (powered by HP IDOL and HP Vertica), making it possible to integrate this valuable content and insights into various line of business applications.
- Democratize and enable multi-dimensional content analysis: Empower your analysts, business users, and data scientists to search and analyze Hadoop data with ease, using the 100% open source Hortonworks Data Platform.
- Extend the enterprise data warehouse: Synchronize and manage content from content management systems, and crack open the files in whatever format they happen to be in.
- Dramatically reduce complexity with enterprise-ready SQL engine: Tap into the richest analytics that support JOINs, complex data types, and other capabilities only available with HP Vertica SQL on the Hortonworks Data Platform.
Speakers:
- Ajay Singh, Director, Technical Channels, Hortonworks
- Will Gardella, Product Management, HP Big Data
Hadoop Powers Modern Enterprise Data Architectures (DataWorks Summit)
1) Hadoop enables modern data architectures that can process both traditional and new data sources to power business analytics and other applications.
2) By 2015, organizations that build modern information management systems using technologies like Hadoop will financially outperform their peers by 20%.
3) Hadoop provides an agile "data lake" solution that allows organizations to capture, process, and access all their data in various ways for business intelligence, analytics, and other uses.
Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3 (Hortonworks)
The document discusses using Hortonworks Data Platform (HDP) and Red Hat JBoss Data Virtualization to create a data lake solution and virtual data marts. It describes how a data lake enables storing all types of data in a single repository and accessing it through tools. Virtual data marts allow lines of business to access relevant data through self-service interfaces while maintaining governance and security over the central data lake. The presentation includes demonstrations of virtual data marts integrating data from Hadoop and other sources.
10 Amazing Things To Do With a Hadoop-Based Data Lake (VMware Tanzu)
Greg Chase, Director, Product Marketing, presents Big Data: 10 Amazing Things to do With A Hadoop-based Data Lake at the Strata Conference + Hadoop World 2014 in NYC.
A Windows Azure approach towards building SaaS solutions.
SaaS is fundamentally a business model where the application is owned, operated, and managed by the vendor; the consumer pays for usage and consumes the application.
SaaS offers a "hands-off" model that frees the consumer from the pain of server and application management and instead allows the consumer to focus on the business.
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Data Platform (Hortonworks)
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and convert all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
Solr + Hadoop: Interactive Search for Hadoop (gregchanan)
This document discusses Cloudera Search, which integrates Apache Solr with Cloudera's distribution of Apache Hadoop (CDH) to provide interactive search capabilities. It describes the architecture of Cloudera Search, including components like Solr, SolrCloud, and Morphlines for extraction and transformation. Methods for indexing data in real-time using Flume or batch using MapReduce are presented. The document also covers querying, security features like Kerberos authentication and collection-level authorization using Sentry, and concludes by describing how to obtain Cloudera Search.
This document discusses integrating Apache Solr with Apache Hadoop for big data search capabilities. It provides background on Mark Miller and the history of search on Hadoop. It outlines how Solr, Lucene, Hadoop, and related projects can be integrated to allow full-text search across large datasets in HDFS. Specific integration points discussed include allowing Solr to read and write directly to HDFS, custom directory support in Solr, replication support, and using Morphlines for extraction, transformation, and loading of data into Solr.
This document summarizes how Solr and Lucidworks Fusion can be used for big data search and analytics. It discusses indexing strategies like using MapReduce, Spark, and Fusion connectors to index structured and unstructured data from HDFS. It also covers topics like Solr on HDFS, auto add replicas, security, cluster sizing, and using the lambda architecture with Spark streaming to enable real-time search over batch-processed historical data. The document promotes Lucidworks Fusion as a search platform that can handle massive scales of data, provide real-time search capabilities, and work with any data source securely.
This document provides an overview of SolrCloud on Hadoop. It discusses how SolrCloud allows for distributed, highly scalable search capabilities on Hadoop clusters. Key components that work with SolrCloud are also summarized, including HDFS for storage, MapReduce for processing, and ZooKeeper for coordination services. The document demonstrates how SolrCloud can index and query large datasets stored in Hadoop.
This document discusses integrating Hadoop and Solr. Hadoop is useful for storing and processing large amounts of data, while Solr enables fast search across structured and unstructured data. The document outlines how Hadoop can store documents and Solr can index them for search, as well as how technologies like Flume can process streaming data and index it in real-time in Solr.
Storage Requirements and Options for Running Spark on Kubernetes (DataWorks Summit)
In a world of serverless computing, users tend to be frugal about spending on compute, storage, and other resources; paying for these when they are not in use becomes a significant factor. Offering Spark as a service in the cloud presents unique challenges. Running Spark on Kubernetes raises many challenges, especially around storage and persistence: Spark workloads have particular requirements for intermediate data, long-term persistence, and shared file systems, and these requirements become even tighter when the same platform must be offered as a service to enterprises managing GDPR and other compliance regimes such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially scenarios related to persistence.
It will help people using Kubernetes or the Docker runtime in production understand the various storage options available, which are most suitable for running Spark workloads on Kubernetes, and what more can be done.
This document discusses storage requirements for running Spark workloads on Kubernetes. It recommends using a distributed file system like HDFS or DBFS for distributed storage and emptyDir or NFS for local temp scratch space. Logs can be stored in emptyDir or pushed to object storage. Features that would improve Spark on Kubernetes include image volumes, flexible PV to PVC mappings, encrypted volumes, and clean deletion for compliance. The document provides an overview of Spark, Kubernetes benefits, and typical Spark deployments.
Here are the slides for my talk "An intro to Azure Data Lake" at Techorama NL 2018. The session was held on Tuesday October 2nd from 15:00 - 16:00 in room 7.
Indexing with solr search server and hadoop framework (keval dalasaniya)
Hadoop and Solr are used together for indexing large datasets across distributed systems. Hadoop provides a distributed file system and Solr provides search capabilities. Solr indexes data from Hadoop and allows for fast, scalable search across large datasets even when data and computing resources are spread across multiple machines and locations. The combination of Hadoop and Solr provides a fault-tolerant solution for storing, processing, and searching very large datasets in a distributed environment.
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv (larsgeorge)
This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspects and then correlating them to the skillsets of current Hadoop adopters.
Big Data Developers Moscow Meetup 1 - SQL on Hadoop (bddmoscow)
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
Apache Solr on Hadoop is enabling organizations to collect, process and search larger, more varied data. Apache Spark is making a large impact across the industry, changing the way we think about batch processing and replacing MapReduce in many cases. But how can production users easily migrate ingestion of HDFS data into Solr from MapReduce to Spark? How can they update and delete existing documents in Solr at scale? And how can they easily build flexible data ingestion pipelines? Cloudera Search Software Engineer Wolfgang Hoschek will present an architecture and solution to this problem. How were Apache Solr, Spark, Crunch, and Morphlines integrated to allow for scalable and flexible ingestion of HDFS data into Solr? What are the solved problems and what's still to come? Join us for an exciting discussion on this new technology.
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldAlex Moundalexis
This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined and used to solve specific and sometimes very complex problems. Drawing upon case studies from the field, Mr. Moundalexis demonstrates that one-size, rigid traditional systems don’t fit all, but that combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.
The document provides an introduction to Hadoop. It discusses how Google developed its own infrastructure using Google File System (GFS) and MapReduce to power Google Search due to limitations with databases. Hadoop was later developed based on these Google papers to provide an open-source implementation of GFS and MapReduce. The document also provides overviews of the HDFS file system and MapReduce programming model in Hadoop.
Introduction to Hive and HCatalog presentation by Mark Grover at NYC HUG. A video of this presentation is available at https://www.youtube.com/watch?v=JGwhfr4qw5s
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr (Lucidworks, archived)
The document discusses integrating Hadoop and Solr to enable fast, ad-hoc search across structured and unstructured big data stored in Hadoop. It provides examples of how Hadoop can be used for large-scale storage and processing while Solr is used for real-time querying and search. Specifically, it describes how the Lucidworks HDFS connector can process documents from HDFS and index them into SolrCloud for search, and how log data can be ingested from Flume into HDFS for archiving and extracted fields can be indexed into Solr in real-time for search and analytics dashboards.
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014 (cdmaxime)
Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows for interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers benefits like SQL support, improved performance of 3-4x up to 90x faster than MapReduce, and flexibility to query existing Hadoop data without needing to migrate or duplicate it. The latest release of Impala 2.0 includes new features like window functions, subqueries, and spilling joins and aggregations to disk when memory is exhausted.
1. Adding Search to the Hadoop Ecosystem
Gregory Chanan (gchanan AT cloudera.com)
Frontier Meetup Dec 2013
2. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component deep dive
• Security
• Conclusion
3. Why Search?
• Hadoop for everyone
• Typical case:
  • Ingest data to storage engine (HDFS, HBase, etc)
  • Process data (MapReduce, Hive, Impala)
• Experts know MapReduce
• Savvy people know SQL
• Everyone knows Search!
4. Why Search? An Integrated Part of the Hadoop System
• One pool of data
• One security framework
• One set of system resources
• One management interface
5. Benefits of Search
• Improved Big Data ROI
  • An interactive experience without technical knowledge
  • Single data set for multiple computing frameworks
• Faster time to insight
  • Exploratory analysis, esp. unstructured data
  • Broad range of indexing options to accommodate needs
• Cost efficiency
  • Single scalable platform; no incremental investment
  • No need for separate systems, storage
6. What is Cloudera Search?
• Full-text, interactive search with faceted navigation
• Apache Solr integrated with CDH
  • Established, mature search with vibrant community
  • In production environments for years
• Open Source
  • 100% Apache, 100% Solr
  • Standard Solr APIs
• Batch, near real-time, and on-demand indexing
• Generally Available; released 1.1 last month
8. Apache Hadoop
• Apache HDFS
  • Distributed file system
  • High reliability
  • High throughput
• Apache MapReduce
  • Parallel, distributed programming model
  • Allows processing of large datasets
  • Fault tolerant
9. Apache Lucene
• Full text search
  • Indexing
  • Query
• Traditional inverted index
• Batch and Incremental indexing
• We are using version 4.4 in current release
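To make the indexing and query sides concrete, here is a minimal, self-contained Lucene sketch (not from the slides); it uses a current Lucene API, whereas the 4.4 release referenced above would use RAMDirectory and version-parameterized constructors instead:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();      // in-memory index, just for the demo
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // Indexing: analyze a document and add it to the inverted index
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("body", "Cloudera Search adds full-text search to Hadoop", Field.Store.YES));
      writer.addDocument(doc);
    }

    // Query: parse a user query and run it against the index
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      QueryParser parser = new QueryParser("body", analyzer);
      for (ScoreDoc hit : searcher.search(parser.parse("hadoop AND search"), 10).scoreDocs) {
        System.out.println(searcher.doc(hit.doc).get("body"));
      }
    }
  }
}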
10. Apache Solr
• Search service built using Lucene
  • Ships with Lucene (same TLP at Apache)
• Provides XML/HTTP/JSON/Python/Ruby/… APIs
  • Indexing
  • Query
  • Administrative interface
• Also rich web admin GUI via HTTP
11. Apache SolrCloud
• Provides distributed Search capability
• Part of Solr (not a separate library/codebase)
• Shards – provide scalability
  • partition index for size
  • replicate for query performance
• Uses ZooKeeper for coordination
  • No split-brain issues
  • Simplifies operations
12. SolrCloud Architecture
• Updates automatically sent to the correct shard
• Replicas handle queries, forward updates to the leader
• Leader indexes the document for the shard, and forwards the index notation to itself and any replicas
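A minimal SolrJ sketch of this update/query flow against a SolrCloud collection; the ZooKeeper address and the collection name ("logs") are placeholders, and it uses the newer CloudSolrClient builder API rather than the CloudSolrServer class of the Solr 4.4 era:

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrCloudSketch {
  public static void main(String[] args) throws Exception {
    // Connect via ZooKeeper so the client learns the cluster state (shards, leaders, replicas)
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
      client.setDefaultCollection("logs");   // hypothetical collection name

      // Update: the client routes the document to the leader of the correct shard
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "event-1");
      doc.addField("message", "listening on 0.0.0.0 port 22");
      client.add(doc);
      client.commit();

      // Query: replicas of each shard serve their part of the distributed query
      QueryResponse rsp = client.query(new SolrQuery("message:port"));
      System.out.println("hits = " + rsp.getResults().getNumFound());
    }
  }
}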
17. Near Real Time Indexing with Flume
[Diagram: log files and other sources feed Flume Agents, whose indexers send documents to Solr while the raw data lands in HDFS]
Solr and Flume
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest
18. Apache Flume - MorphlineSolrSink
• A Flume Source…
  • Receives/gathers events
• A Flume Channel…
  • Carries the event – MemoryChannel or reliable FileChannel
• A Flume Sink…
  • Sends the events on to the next location
• Flume MorphlineSolrSink
  • Integrates Cloudera Morphlines library
    • ETL, more on that in a bit
  • Does batching
  • Results sent to Solr for indexing
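A minimal Flume agent configuration sketch for this wiring; the agent name, spool directory, and morphline file path are placeholders, and the sink class is the MorphlineSolrSink that ships with the Cloudera Search Flume integration:

# flume.conf: one agent, spooling-directory source -> memory channel -> MorphlineSolrSink
agent.sources = spool
agent.channels = mem
agent.sinks = solr

# Source: pick up log files dropped into a directory (hypothetical path)
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/flume/incoming
agent.sources.spool.channels = mem

# Channel: MemoryChannel here; use a FileChannel for reliable delivery
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# Sink: runs the morphline (ETL) and sends the resulting documents to Solr in batches
agent.sinks.solr.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solr.channel = mem
agent.sinks.solr.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent.sinks.solr.morphlineId = morphline1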
20. Near Real Time Indexing of Apache HBase
[Diagram: HBase + Search. HBase provides planet-sized tabular data with immediate access & updates and interactive load; Search provides fast & flexible information discovery. HBase replication feeds the HBase Indexer(s), which update a bank of Solr servers; the underlying data is managed in HDFS.]
21. Lily HBase Indexer
• Collaboration between NGData & Cloudera
  • NGData are creators of the Lily data management platform
• Lily HBase Indexer
  • Service which acts as an HBase replication listener
    • HBase replication features, such as filtering, supported
  • Replication updates trigger indexing of updates (rows)
  • Integrates Cloudera Morphlines library for ETL of rows
  • AL2 licensed on github https://github.com/ngdata
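A sketch of what an indexer morphline for this path might look like, assuming the extractHBaseCells command provided by the hbase-indexer integration and hypothetical column names; the indexer itself delivers the resulting documents to Solr, so no loadSolr command is needed here:

# morphline.conf: map HBase cells from a replicated row into Solr document fields
morphlines : [
  {
    id : hbaseMorphline
    importCommands : ["com.ngdata.**", "org.kitesdk.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            { inputColumn : "info:title", outputField : "title", type : string, source : value }
            { inputColumn : "info:body",  outputField : "body",  type : string, source : value }
          ]
        }
      }
      # Optional: log each record while debugging the mapping
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]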
23. Scalable Batch Indexing
[Diagram: files in HDFS are read by MapReduce indexers, which build index shards that are served by Solr servers]
Solr and MapReduce
• Flexible, scalable batch indexing
• Start serving new indices with no downtime
• On-demand indexing, cost-efficient re-indexing
24. MapReduce Indexer
MapReduce Job with two parts:
1) Scan HDFS for files to be indexed
  • Much like Unix “find” – see HADOOP-8989
  • Output is NLineInputFormat’ed file
2) Mapper/Reducer indexing step
  • Mapper extracts content via Cloudera Morphlines
  • Reducer indexes documents via embedded Solr server
  • Originally based on SOLR-1301
  • Many modifications to enable linear scalability
25. MapReduce Indexer “golive”
• Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
• Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
  • No downtime for users
  • No NRT expense
  • Linear scale out to the size of your MR cluster
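A hedged command-line sketch of a batch indexing run with go-live merging; the job jar location, HDFS paths, ZooKeeper address, and collection name are placeholders that vary by installation and release:

# Index files under /user/demo/input with MapReduce, then merge the resulting
# shards into the live SolrCloud collection "logs" (--go-live)
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --output-dir hdfs://namenode:8020/user/demo/outdir \
  --zk-host zk01:2181/solr \
  --collection logs \
  --go-live \
  hdfs://namenode:8020/user/demo/input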
26. HBase + MapReduce
• New in Search 1.1: run MapReduce job over HBase tables
  • Same architecture as running over HDFS
  • Similar to HBase’s CopyTable
27. Cloudera Morphlines
• Open Source framework for simple ETL
• Simplify ETL
  • Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages)
  • Configuration over coding
• Standardize ETL
• Ships as part of Kite SDK, formerly Cloudera Developer Kit (CDK)
  • It’s a Java library
  • AL2 licensed on github https://github.com/kite-sdk
28. Cloudera Morphlines Architecture
Morphlines can be embedded in any application…
[Diagram: logs, tweets, social media, html, images, pdf, text… (anything you want to index) flows through the Morphline Library embedded in Flume, the MR Indexer, the HBase Indexer, etc. (or your application!) and into SolrCloud / Solr servers]
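Because Morphlines is a plain Java library, it can be embedded directly in your own application. A minimal sketch using the Kite Morphlines API, assuming a local morphline.conf like the syslog example on the following slides, a morphline id of "morphline1", and that passing null as the downstream command is acceptable because a loadSolr command inside the morphline handles delivery:

import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.Fields;
import org.kitesdk.morphline.base.Notifications;

public class EmbeddedMorphlineSketch {
  public static void main(String[] args) throws Exception {
    MorphlineContext context = new MorphlineContext.Builder().build();

    // Compile the morphline config into an executable chain of commands
    Command morphline = new Compiler().compile(
        new File("morphline.conf"), "morphline1", context, null /* no extra downstream command */);

    // Feed one raw event through the pipeline (readLine -> grok -> loadSolr in the syslog example)
    Record record = new Record();
    record.put(Fields.ATTACHMENT_BODY, new ByteArrayInputStream(
        "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22"
            .getBytes(StandardCharsets.UTF_8)));

    Notifications.notifyStartSession(morphline);
    boolean success = morphline.process(record);
    System.out.println("processed: " + success);
  }
}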
29. Extraction and Mapping
[Diagram: a syslog event flows from a Flume Agent into the Solr sink’s Morphline Library, through the commands readLine, grok, and loadSolr, producing a document that is sent to Solr]
• Modeled after Unix pipelines
• Simple and flexible data transformation
• Reusable across multiple index workloads
• Over time, extend and reuse across platform workloads
30. Morphline Example – syslog with grok
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input:
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record:
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22
31. Current Command Library
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line record, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika
32. Current Command Library (cont)
• Scripting support for dynamic java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…
34. Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language
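Underneath a UI like Hue, faceted drill-down maps onto standard Solr facet queries. A hedged SolrJ sketch, assuming a hypothetical "logs" collection containing the syslog fields produced by the morphline example above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetedQuerySketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr URL and collection; field names follow the syslog morphline example
    try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build()) {
      SolrQuery q = new SolrQuery("syslog_message:listening");
      q.setFacet(true);
      q.addFacetField("syslog_program", "syslog_hostname");  // fields to drill down on
      q.setRows(10);

      QueryResponse rsp = solr.query(q);
      System.out.println("hits = " + rsp.getResults().getNumFound());
      for (FacetField ff : rsp.getFacetFields()) {
        System.out.println("facet: " + ff.getName());
        ff.getValues().forEach(c -> System.out.println("  " + c.getName() + " (" + c.getCount() + ")"));
      }
    }
  }
}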
35. Security
• Upstream Solr doesn’t deal with security
• Search 1.0 supports Kerberos authentication
  • Similar to Oozie / WebHDFS
• Search 1.1 supports index-level authorization via Apache Sentry (incubating)
36. Index-Level Authorization
• Sentry works via “policy files” stored in HDFS
• Can grant roles administrative-only, query-only, update-only access
• Example:
  [groups]
  # Assigns each Hadoop group to its set of roles
  dev_ops = engineer_role, ops_role
  [roles]
  engineer_role = collection=source_code->action=*
  ops_role = collection=hbase_logs->action=Query
37. Index-Level Authorization 2
• Works by hooking into Solr RequestHandlers:
  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">updateIndexAuthorization</str>
    </lst>
  </requestHandler>
• Also includes secure impersonation support
• Unauthorized attempts get a 401 response and are written to the Solr log
• Future work: more fine-grained authorization
38. Conclusion
• Cloudera Search now Generally Available (1.1)
  • Free Download
  • Extensive documentation
  • Send your questions and feedback to searchuser@cloudera.org
  • Take the Search online training
• Cloudera Manager Standard (i.e. the free version)
  • Simple management of Search
  • Free Download
• QuickStart VM also available!