This document summarizes how Solr and Lucidworks Fusion can be used for big data search and analytics. It discusses indexing strategies like using MapReduce, Spark, and Fusion connectors to index structured and unstructured data from HDFS. It also covers topics like Solr on HDFS, auto add replicas, security, cluster sizing, and using the lambda architecture with Spark streaming to enable real-time search over batch-processed historical data. The document promotes Lucidworks Fusion as a search platform that can handle massive scales of data, provide real-time search capabilities, and work with any data source securely.
The document discusses Solr 4, an open source search platform built on Apache Lucene. Some key points:
- Solr 4 is a NoSQL search server that provides distributed indexing, fault tolerance, and real-time search capabilities.
- Solr Cloud is Solr's distributed architecture which uses Zookeeper for coordination to provide features like automatic sharding and replication of indexes across multiple servers.
- The document outlines Solr 4's capabilities including schema-less options, atomic updates, optimistic concurrency, and a REST API for managing the schema dynamically.
Solr is a great tool to have in the data scientist toolbox. In this talk, I walk through several demos of using Solr to data science activities as well as explore various use cases for Solr and data science
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksLucidworks
Timothy Potter presented at a Big Data conference in Boston from October 11-14, 2016. He discussed how Lucidworks Fusion provides an alternative to traditional big data stacks that emphasizes fast access, agility and automation over integration. Fusion allows for common access patterns like fast lookups, ranked retrieval and distributed scans while integrating technologies like Solr, Spark, HDFS and more. It provides tools for data ingestion, time-based partitioning, analytics, machine learning and more to solve business problems rather than focus on infrastructure.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Lucidworks
This document describes Bloomberg's development of a search analytics component for Solr. It was created by their search team to enable complex calculations and aggregations on numerical time-series data. Key features include statistical and mathematical expressions to facet and analyze data, supporting int, long, float, date and string fields. Examples show calculating a weighted average and variance. Future plans include multi-shard support and filtering result sets based on calculated statistics.
1. Apache Spark is an open source cluster computing framework for large-scale data processing. It is compatible with Hadoop and provides APIs for SQL, streaming, machine learning, and graph processing.
2. Over 3000 companies use Spark, including Microsoft, Uber, Pinterest, and Amazon. It can run on standalone clusters, EC2, YARN, and Mesos.
3. Spark SQL, Streaming, and MLlib allow for SQL queries, streaming analytics, and machine learning at scale using Spark's APIs which are inspired by Python/R data frames and scikit-learn.
Cascalog at May Bay Area Hadoop User Groupnathanmarz
Cascalog is a Clojure-based query language for Hadoop that provides a powerful and easy-to-use tool for data analysis. It allows users to write queries as regular Clojure code, offering features like joins, aggregators, functions, and sorting. Cascalog is unique in that it offers the full power of Clojure at all times by integrating queries directly into the programming language. BackType uses Cascalog for tasks like identifying influencers on social media, determining exposure to URLs, and studying engagement over time.
This document discusses how search has evolved beyond traditional text search to support additional use cases like recommendations and analytics. It introduces LucidWorks' products like Solr and SiLK that leverage Hadoop to power search and discovery across large datasets. New features in Solr 4 like reduced memory usage and joins are highlighted. Demos are presented on applications in ecommerce, healthcare, and finance.
Data IO: Next Generation Search with Lucene and Solr 4Grant Ingersoll
The document summarizes new features and capabilities in Lucene and Solr 4 for search. Key highlights include Lucene being faster and more memory efficient through improvements like native near real-time support and string handling. Solr 4 adds new features for search, faceting, relevance, indexing and geospatial search. It also improves capabilities for scaling Solr through distribution and dynamic scaling in SolrCloud. The document provides examples of how Lucene and Solr can be applied to problems beyond traditional search like recommendations, backups and indexing of documents.
This document discusses using Hadoop and Elasticsearch for real-time analytics. It provides an overview of Elasticsearch, including how it is document-oriented, schema-free, distributed and fast. It also demonstrates indexing, retrieving, updating and deleting documents from Elasticsearch. The demo portion involves extracting data from a SQL database using Hive, transforming it with Hadoop/Hive, and loading it into Elasticsearch to run queries. Lessons learned focus on concurrency, filtering, field data caching and JVM memory usage.
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
The document discusses the development of a new search system for PubChem to allow for exploration of multidimensional biomedical data. The new system was needed to address the challenges of handling large and heterogeneous datasets with many relationships between data types in a way that allows for fast querying. The system leverages Apache SOLR to provide features like full text search, faceting, molecule structure searching and joining of related data. It includes backend components like SOLR, SQL and specialized search engines as well as web APIs and frontend interfaces like reusable widgets and a new search interface.
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit
Streaming Analytics with Spark, Kafka, Cassandra, and Akka discusses rethinking architectures for streaming analytics. The document discusses:
1) The need to build scalable, fault-tolerant systems to handle massive amounts of streaming data from different sources with varying structures.
2) An example use case of profiling cyber threat actors using streaming machine data to detect intrusions and security breaches.
3) Rethinking architectures by moving away from ETL pipelines and dual batch/stream systems like Lambda architecture toward unified stream processing with Spark Streaming, Kafka, Cassandra and Akka. This simplifies analytics and eliminates duplicate code and systems.
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...Caserta
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr sponsored by O'Reilly Media!
Caserta Concepts shared one of their innovative DW projects using Solr. See how open source search technology can serve high performance analytic use cases. Presentation and solution walk-through given by Caserta Concepts' Joe Caserta and Elliott Cordo.
For more information, visit www.casertaconcepts.com
A 1 hour intro to search, Apache Lucene and Solr, and LucidWorks Search. Contains a quick start with LucidWorks Search and a demo using financial data (See Github prj: http://bit.ly/lws-financial) as well as some basic vocab and search explanations
When it comes to data security, Uber’s business has unique needs related to scale, use-case, and technical stacks. This talk will discuss how our data platform team addressed specific challenges in deploying Uber's security requirements for Apache Hadoop, including how we leveraged open source building blocks. We'll share insights on how we augmented our Kerberized Hadoop integration with additional authentications mechanisms as well as our approach to supporting custom authentication in Apache Knox. In particular, we will elaborate Uber’s contributions to Apache Knox, specifically a novel pluggable platform for custom validation of any user request. This talk will also cover how we address table, column, and partition-level access control while ensuring improved developer productivity. In particular, we will explain how we translate RBAC policy into HDFS ACL to control data access, our internal audit platform built to detect and analyze the common security infringements, and real-world examples from our experiences in production.
Speakers
Mohammad Islam, Staff Software Engineer, Uber
Wei Han, Manager, Uber
This document provides a summary of using Apache Spark for continuous analytics and optimization. It discusses using Spark for collecting data from various sources, processing the data using Spark's capabilities for streaming, machine learning and SQL queries, and reporting insights. An example use case is presented for social media analysis using Spark Streaming to process a real-time data stream from Kafka and analyze the data using both the Spark SQL and core Spark APIs in Scala.
This document summarizes Sarah Guido's talk on using Apache Spark for data science at Bitly. She discusses how Bitly uses Spark to extract, explore, and model subsets of their data including decoding Bitly links, performing topic modeling using LDA, and trend detection. While Spark provides performance benefits over MapReduce for these tasks, she notes issues with Hadoop servers, JVM, and lack of documentation that must be addressed for full production usage at Bitly.
From R Script to Production Using rsparkling with Navdeep GillDatabricks
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
MongoDB uses replication to provide high availability and redundancy. The document discusses MongoDB replication fundamentals including replica sets, oplogs, and reading from secondary nodes. It provides an overview of primary/secondary roles in replica sets, how writes are logged to oplogs, and how secondaries replicate by reading the primary's oplog. It also covers read preference settings and write concerns in MongoDB replication.
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetLucidworks
This document summarizes Target's implementation of Solr as its search platform. It discusses how Target transitioned from Oracle-Endeca to Solr to handle its large scale data and enable more flexible relevancy controls. It describes how Target tested Solr through handling live guest traffic in two sprints and moving its typeahead functionality to the public cloud. Finally, it outlines how Target leverages key Solr capabilities like collection aliases, atomic updates, and configurable facets to synchronize designer and product launches.
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Lucidworks
Thoth is a real-time Solr monitoring system developed at Trulia to understand search infrastructure without accessing logs. It collects Solr request data, indexes it in another Solr core for search and analysis, and provides a dashboard and APIs for monitoring metrics. It also uses machine learning to predict query times and identify query patterns through topic modeling. The system was designed to be modular and its components like data collection, indexing, dashboard and monitoring are open-sourced.
This document discusses the author's experience with the ELK stack and Kibana. The author has been using ELK since 2012 and has published content on Logstash and written chapters about ELK in their book. The document then provides an overview of Kibana, describing its core components and features like dashboards, visualizations, and search functionality. It also outlines some custom panels the author created for Kibana through custom development, including range, percentile, and map panels. Lastly, it discusses the author's solution for adding authentication to Kibana.
This document provides an introduction to Kibana4 and how to use its features. It discusses the major components of Kibana4 including Discover, Visualize, and Dashboard. It also covers visualization types like metrics, buckets, and aggregations. The document provides examples of using aggregations versus facets and describes settings, scripted fields, and plugins. It concludes by discussing potential future directions for Kibana.
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaLucidworks
Solr is a search engine that can run indexes stored on HDFS for improved fault tolerance and operations. Past attempts had Solr directly run on HDFS, but were slow. The current approach uses a block cache to improve performance by keeping frequently accessed index blocks in memory. Future work includes improving the block cache and enabling HDFS-only replication for SolrCloud replicas to reduce unnecessary data copying.
The document discusses how real-time monitoring of cloud services works. It describes collecting metrics from various sources like CloudWatch, application logs stored in S3, and error logs. These metrics and logs are analyzed to diagnose issues, identify risks, and ensure service level agreements around availability, performance and errors are met. Unified monitoring and alerting across the stack is important for effective real-time monitoring.
Logstash is a tool for managing logs that allows for input, filter, and output plugins to collect, parse, and deliver logs and log data. It works by treating logs as events that are passed through the input, filter, and output phases, with popular plugins including file, redis, grok, elasticsearch and more. The document also provides guidance on using Logstash in a clustered configuration with an agent and server model to optimize log collection, processing, and storage.
The document discusses using Elasticsearch and Kibana for big data search and visualization. It describes some problems with traditional request-response models like long cycles and engineers being bottlenecks. It then outlines requirements for the solution like being easy to use, fast, scalable, and solving 80% of problems. Elasticsearch and Kibana are presented as a solution, with Elasticsearch indexing and storing large amounts of data and Kibana providing interactive visualizations without coding. Examples of how it has been used at Gogolook are provided, and future plans like logging all user events are discussed.
Presented by Mark Miller, Software Developer, Cloudera
Apache Lucene/Solr committer Mark Miller talks about how Solr has been integrated into the Hadoop ecosystem to provide full text search at "Big Data" scale. This talk will give an overview of how Cloudera has tackled integrating Solr into the Hadoop ecosystem and highlights some of the design decisions and future plans. Learn how Solr is getting 'cozy' with Hadoop, which contributions are going to what project, and how you can take advantage of these integrations to use Solr efficiently at "Big Data" scale. Learn how you can run Solr directly on HDFS, build indexes with Map/Reduce, load Solr via Flume in 'Near Realtime' and much more.
Attack monitoring using ElasticSearch Logstash and KibanaPrajal Kulkarni
This document discusses using the ELK stack (Elasticsearch, Logstash, Kibana) for attack monitoring. It provides an overview of each component, describes how to set up ELK and configure Logstash for log collection and parsing. It also demonstrates log forwarding using Logstash Forwarder, and shows how to configure alerts and dashboards in Kibana for attack monitoring. Examples are given for parsing Apache logs and syslog using Grok filters in Logstash.
Cloudera Search provides full-text search capabilities for Hadoop data by integrating Apache Solr. It allows for near real-time and batch indexing from data sources like HDFS, HBase, and Flume. Cloudera Search uses components like SolrCloud, Morphlines, and Sentry to provide distributed, scalable, and secure search across the Hadoop ecosystem.
The document discusses adding search capabilities to the Hadoop ecosystem through Cloudera Search. It provides an overview of Cloudera Search's architecture and components, which integrate Apache Solr with Cloudera Distribution of Hadoop to enable distributed, full-text search across data stored in HDFS. Key components described include HDFSDirectory, which allows Solr to read and write indexes and transaction logs to and from HDFS, and BlockDirectoryCache, which caches index file blocks in memory for performance.
1. Cloudera Search provides full-text search capabilities for Hadoop ecosystems by integrating Apache Solr. It allows batch, near real-time, and on-demand indexing of data in HDFS, HBase, and other data sources.
2. Indexing can be done through various methods like Flume for near real-time indexing, HBase indexer for indexing HBase data, and MapReduce jobs for scalable batch indexing. Extraction and mapping of data is done through the Cloudera Morphlines framework.
3. Queries can be done through the built-in Solr web UI, custom UIs like Hue, or Solr APIs. Security features include Kerberos authentication and
Solr + Hadoop: Interactive Search for Hadoopgregchanan
This document discusses Cloudera Search, which integrates Apache Solr with Cloudera's distribution of Apache Hadoop (CDH) to provide interactive search capabilities. It describes the architecture of Cloudera Search, including components like Solr, SolrCloud, and Morphlines for extraction and transformation. Methods for indexing data in real-time using Flume or batch using MapReduce are presented. The document also covers querying, security features like Kerberos authentication and collection-level authorization using Sentry, and concludes by describing how to obtain Cloudera Search.
- Cloudera Search provides an overview of using Solr on Hadoop for search capabilities.
- Key projects involved include Lucene, Solr, and Hadoop which can be integrated to allow indexing of data on HDFS and querying via search.
- The presentation discusses architectural details of running Solr on HDFS and integrating other Hadoop projects like HBase, MapReduce, and Hue.
This document discusses Cloudera Search, which provides full-text search capabilities integrated with Apache Hadoop. It summarizes Cloudera Search's architecture, which uses Apache Lucene/Solr for indexing and search, Apache Flume and HBase for near real-time indexing, Apache MapReduce for batch indexing, and Apache Sentry for security. The document also discusses use cases for near real-time and batch search and concludes by encouraging questions.
This document discusses integrating Hadoop and Solr. Hadoop is useful for storing and processing large amounts of data, while Solr enables fast search across structured and unstructured data. The document outlines how Hadoop can store documents and Solr can index them for search, as well as how technologies like Flume can process streaming data and index it in real-time in Solr.
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Thread Detection, Datawarehouse optimization, Marketing Efficiency, Biometric Database are some examples exposed during this presentation.
Apache Solr on Hadoop is enabling organizations to collect, process and search larger, more varied data. Apache Spark is is making a large impact across the industry, changing the way we think about batch processing and replacing MapReduce in many cases. But how can production users easily migrate ingestion of HDFS data into Solr from MapReduce to Spark? How can they update and delete existing documents in Solr at scale? And how can they easily build flexible data ingestion pipelines? Cloudera Search Software Engineer Wolfgang Hoschek will present an architecture and solution to this problem. How was Apache Solr, Spark, Crunch, and Morphlines integrated to allow for scalable and flexible ingestion of HDFS data into Solr? What are the solved problems and what's still to come? Join us for an exciting discussion on this new technology.
Here are the slides for my talk "An intro to Azure Data Lake" at Techorama NL 2018. The session was held on Tuesday October 2nd from 15:00 - 16:00 in room 7.
The document introduces Yann Yu from Lucidworks and provides information about Lucidworks and its products Solr and Hadoop. It discusses how Solr can be used to provide search capabilities for large amounts of both structured and unstructured data stored in Hadoop. Integrating Solr and Hadoop allows for fast search across big data stored in Hadoop along with real-time indexing and querying capabilities. Examples discussed include enabling enterprise-wide search of documents stored in Hadoop and using Flume to index log data from Hadoop into Solr for real-time analytics and search.
This document provides an overview of SolrCloud on Hadoop. It discusses how SolrCloud allows for distributed, highly scalable search capabilities on Hadoop clusters. Key components that work with SolrCloud are also summarized, including HDFS for storage, MapReduce for processing, and ZooKeeper for coordination services. The document demonstrates how SolrCloud can index and query large datasets stored in Hadoop.
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
This document provides an introduction to Big Data and Apache Hadoop. It defines Big Data as large and complex datasets that are difficult to process using traditional database tools. It describes how Hadoop uses MapReduce and HDFS to provide scalable storage and parallel processing of Big Data. It provides examples of companies using Hadoop to analyze exabytes of data and common Hadoop use cases like log analysis. Finally, it summarizes some popular Hadoop ecosystem projects like Hive, Pig, and Zookeeper that provide SQL-like querying, data flows, and coordination.
Swiss Big Data User Group - Introduction to Apache DrillMapR Technologies
This document provides an introduction and overview of Apache Drill, an open source distributed SQL query engine designed for interactive analysis of large-scale datasets. It describes Drill's architecture as being inspired by Google's Dremel, with support for standard SQL queries, pluggable data sources, and schema flexibility. Drill distributes query execution across multiple nodes to maximize data locality and parallelism. Key features highlighted include full ANSI SQL support, support for nested data, optional schemas, and extensibility points.
Big Data Retrospective - STL Big Data IDEA Jan 2019Adam Doyle
Slides from the STL Big Data IDEA meeting from January 2019. The presenters discussed technologies to continue using, stop using, and start using in 2019.
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
This talk is about showing the complexity in building a data pipeline in Hadoop, starting with the technology aspect, and the correlating to the skillsets of current Hadoop adopters.
The document provides an introduction to Hadoop. It discusses how Google developed its own infrastructure using Google File System (GFS) and MapReduce to power Google Search due to limitations with databases. Hadoop was later developed based on these Google papers to provide an open-source implementation of GFS and MapReduce. The document also provides overviews of the HDFS file system and MapReduce programming model in Hadoop.
Similar to Webinar: Solr & Fusion for Big Data (20)
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
With ecommerce experiencing explosive growth, it seems intuitive that the B2B segment of that ecosystem is mirroring the same trajectory. That said, B2B has very different needs when it comes to transacting with the same style of experiences that we see in B2C. For instance, B2B ecommerce is about precision findability, whereas B2C customers can convert at higher rates when they’re just browsing online. In order for the B2B buying experience to be successful, search needs to be tuned to meet the unique needs of the segment.
In this webinar with Forrester senior analyst Joe Cicman, you’ll learn:
-Which verticals in B2B will drive the most growth, and how machine-learning powered personalization tactics can be deployed to support those specific verticals
-Why an omnichannel selling approach must be deployed in order to see success in B2B
-How deploying content search capabilities will support a longer sales cycle at scale
-What the next steps are to support a robust B2B commerce strategy supported by new technology
Speakers
Joe Cicman, Senior Analyst, Forrester
Jenny Gomez, VP of Marketing, Lucidworks
Customer loyalty starts with quickly responding to your customer’s needs. When it comes to resolving open support cases, time is of the essence. Time spent searching for answers adds up and creates inefficiencies in resolving cases at scale. Relevant answers need to be a few clicks away and easily accessible for agents directly from their service console.
We will explore how Lucidworks’ Agent Insights application automatically connects agents with the correct answers and resources. You’ll learn how to:
-Configure a proactive widget in an agent’s case view page to access resources across third-party systems (such as Sharepoint, Confluence, JIRA, Zendesk, and ServiceNow).
-Easily set up query pipelines to autonomously route assets and resources that are relevant to the case-at-hand—directly to the right agent.
-Identify subject matter experts within your support data and access tribal knowledge with lightning-fast speed.
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
Lunch and Learn during Retail TouchPoints #RIC21 virtual event.
***
Crate & Barrel’s previous search solution couldn’t provide its shoppers with an online search and browse experience consistent with the customer-centric Crate & Barrel brand. Meanwhile, Crate & Barrel merchandisers spent the bulk of their time manually creating and maintaining search rules. The search experience impacted customer retention, loyalty, and revenue growth.
Join this lunch & learn for an interactive chat on how Crate & Barrel partnered with Lucidworks to:
-Improve search and browse by modernizing the technology stack with ML-based personalization and merchandising solutions
-Enhance the experience for both shoppers and merchandisers
-Explore signals to transform the omnichannel shopping experience
Questions? Visit https://lucidworks.com/contact/
Learn how to guide customers to relevant products using eCommerce search, hyper-personalisation, and recommendations in our ‘Best-In-Class Retail Product Discovery’ webinar.
Nowadays, shoppers want their online experience to be engaging, inspirational and fulfilling. They want to find what they’re looking for quickly and easily. If the sought after item isn’t available, they want the next best product or content surfaced to them. They want a website to understand their goals as though they were talking to a sales assistant in person, in-store.
In this webinar, we explore IMRG industry data insights and a best-in-class example of retail product discovery. You’ll learn:
- How AI can drive increased revenue through hyper-personalised experiences
- How user intent can be easily understood and results displayed immediately
- How merchandisers can be empowered to curate results and product placement – all without having to rely on IT.
Presented by:
Dave Hawkins, Principal Sales Engineer - Lucidworks
Matthew Walsh, Director of Data & Retail - IMRG
Connected Experiences Are Personalized ExperiencesLucidworks
Many companies claim personalization and omnichannel capabilities are top priorities. Few are able to deliver on those experiences.
For a recent Lucidworks-commissioned study, Forrester Consulting surveyed 350+ global business decision-makers to see what gets in the way of achieving these goals. They discovered that inefficient technology, lack of behavioral insights, and failure to tie initiatives to enterprise-wide goals are some of the most frequent blockers to personalization success.
Join guest speaker, Forrester VP and Principal Analyst, Brendan Witcher, and Lucidworks CEO, Will Hayes, to hear the results of the Forrester Consulting study, how to avoid “digital blindness,” and how to apply VoC data in real-time to delight customers with personalized experiences connected across every touchpoint.
In this webinar, you’ll learn:
- Why companies who utilize real-time customer signals report more effective personalization
- How to connect employees and customers in a shared experience through search and browse
- How Lucidworks clients Lenovo, Morgan Stanley and Red Hat fast-tracked improvements in conversion, engagement and customer satisfaction
Featuring
- Will Hayes, CEO, Lucidworks
- Brendan Witcher, VP, Principal Analyst, Forrester
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
Intelligent Policing. Leveraging Data to more effectively Serve Communities.
Policing in the next decade is anticipated to be very different from historical methods. More data driven, more focused on the intricacies of communities they serve and more open and collaborative to make informed recommendations a reality. Whether its social populations, NIBRS or organization improvement that’s the driver, the IT requirement is largely the same. Provide 360 access to large volumes of siloed data to gain a full 360 understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
-The technology needs of an intelligent police force.
-How a Global Search improves an officer's interaction with existing data.
Featuring:
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
Policing in the next decade is anticipated to be very different from historical methods. More data driven, more focused on the intricacies of communities they serve and more open and collaborative to make informed recommendations a reality. Whether its social populations, NIBRS or organization improvement that’s the driver, the IT requirement is largely the same. Provide 360 access to large volumes of siloed data to gain a full 360 understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
The technology needs of an intelligent police force.
How a Global Search improves an officer's interaction with existing data.
Featuring
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
This document provides a framework for prioritizing onsite search problems and key performance indicators (KPIs) to measure for e-commerce search optimization. It recommends prioritizing fixing searches that yield no results, improving relevance of results, and reducing false positives. The most essential KPIs to measure include query latency, throughput, result relevance through click-through rates and NDCG scores. The document also provides tips for self-benchmarking search performance and examples of search performance benchmarks across nine e-commerce sites from various industries.
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
Wish your conversion rates were higher? Can’t figure out how to efficiently and effectively serve all the visitors on your site? Embarrassed by the quality of your product discovery experience? The bar is high and the influx of online shopping over recent months has reminded us that the opportunities are real. We’re all deep in holiday prep, but let’s take a few minutes to think about January 2021 and beyond. How can we position ourselves for success with our customers and against our competition?
Grab your lunch and let’s dive into three strategies that need to be part of your 2021 roadmap. You don’t need an army to get there. But you do need to take action and capitalize on the shoppers abandoning the product discovery journey on your site.
In this session, attendees will find out how to:
-Take control of merchandising at scale;
-Implement hands-free search relevancy; and
-Address personalization challenges.
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
For a personalized search experience, search curation requires robust text interpretation, data enrichment, relevancy tuning and recommendations. In order to achieve this, language and entity identification are crucial.
For teams working on search applications, advanced language packages allow them to achieve greater recall without sacrificing precision.
Join us for a guided tour of our new Advanced Linguistics packages, available in Fusion, thanks to the technology partnership between Lucidworks and Basistech.
We’ll explore the application of language identification and entity extraction in the context of search, along with practical examples of personalizing search and enhancing entity extraction.
In this webinar, we’ll cover:
-How Fusion uses the Rosette Basic Linguistics and Entity Extraction packages
-Tips for improving language identification and treatment as well as data enrichment for personalization
-Speech2 demo modeling Active Recommendation
-Use Rosette’s packages with Fusion Pipelines to build custom entities for specific domain use cases
Featuring:
-Radu Miclaus, Director of Product, AI and Cloud, Lucidworks, Lucidworks
-Robert Lucarini, Senior Software Engineer, Lucidworks
-Nick Belanger, Solutions Engineer, Basis Technology
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
Before COVID-19, almost 80% of the US workforce worked service in jobs that involve in-person interaction with strangers. Now, leaders of service organizations must reshape their offerings during the pandemic and prepare for whatever the new normal turns out to be. Our three panelists will share ideas for adapting their service businesses, now that closer-than-six-feet isn’t an option.
Join Lucidworks as we talk shop with 3 service business leaders, covering:
-Common impacts of the pandemic on service businesses (and what to do about them),
-How service teams can maintain a human touch across virtual channels, and
-Plans for the future, before and after the pandemic subsides.
Featuring
-Sara Nathan, President & CEO, AMIGOS
-Anthony Carruesco, Founder, AC Fly Fishing
-sara bradley, chef and proprietor, freight house
-Justin Sears, VP Product Marketing, Lucidworks
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
The COVID-19 pandemic has forced companies to support far more customers and employees through digital channels than ever before. Many are turning to chatbots to help meet increasing demand, but traditional rules-based approaches can’t keep up. Our new Smart Answers add-on to Lucidworks Fusion makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
Watch our on-demand webinar showcasing Smart Answers on Lucidworks Fusion. This technology makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
In this webinar, we’ll cover off:
-How search and deep learning extend conversational frameworks for improved experiences
-How Smart Answers improves customer care, call deflection, and employee self-service
-A live demo of Smart Answers for multi-channel self-service support
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
In the current climate, it’s now more important than ever to digitally enable your workforce and customers.
Hear from Simon Taylor, VP Global Partners & Alliances, Lucidworks and Matt Aslett, Research Vice President, 451 Research to get the inside scoop on how industry leaders in Europe are developing and executing their digital transformation strategies.
In this webinar, we’ll discuss:
The top challenges and aspirations European business and technology leaders are solving using AI and search technology
Which search and AI use cases are making the biggest impact in industries such as finance, healthcare, retail and energy in Europe
What technology buyers should look for when evaluating AI and search solutions
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
This document introduces Fusion 5.1 and its new capabilities for integrating with data science tools like Tensorflow, Scikit-Learn, and Spacy.
It provides an overview of Fusion's capabilities for understanding content, users, and delivering insights at scale. The document then demonstrates Fusion's Jupyter Notebook integration for reading and writing data and running SQL queries.
Finally, it shows how Fusion integrates with Seldon Core to easily deploy machine learning models with tools like Tensorflow and Scikit-Learn. A live demo is provided of deploying a custom model and using it in Fusion's query and indexing pipelines.
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
In this webinar with 451 Research, you'll understand how retailers are using AI to predict customer intent and learn which key performance metrics are used by more than 120 online retailers in Lucidworks’ 2019 Retail Benchmark Survey.
In this webinar, you’ll learn:
● What trends and opportunities are facing the ecommerce industry in 2020
● Why search is the universal path to understanding customer intent
● How large online retailers apply AI to maximize the effectiveness of their personalization efforts
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
Nordstrom Rack | Hautelook curates and serves customers a wide selection of on-trend apparel, accessories, and shoes at an everyday savings of up to 75 percent off regular prices. With over a million visitors shopping across different platforms every day, and a realization that customers have become accustomed to robust and personalized search interactions, Nordstrom Rack | Hautelook launched an initiative over a year ago to provide data science-driven digital experiences to their customers.
In this session, we’ll discuss Nordstrom Rack | Hautelook’s journey of operationalizing a hefty strategy, optimizing a fickle infrastructure, and rallying troops around a single vision of building an expansible machine-learning driven product discovery engine.
The audience will learn about:
-The key technical challenges and outcomes that come with onboarding a solution
-The lessons learned of creating and executing operational design
-The use of Lucidworks Fusion to plug custom data science models into search and browse applications to understand user intent and deliver personalized experiences
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
Knowledge graphs and machine learning are on the rise as enterprises hunt for more effective ways to connect the dots between the data and the business world. With newer technologies, the digital workplace can dramatically improve employee engagement, data-driven decisions, and actions that serve tangible business objectives.
In this webinar, you will learn
-- Introduction to knowledge graphs and where they fit in the ML landscape
-- How breakthroughs in search affect your business
-- The key features to consider when choosing a data discovery platform
-- Best practices for adopting AI-powered search, with real-world examples
Webinar: Building a Business Case for Enterprise SearchLucidworks
The document discusses building a business case for enterprise search. It notes that 85% of information is unstructured data locked in various locations and applications. Many knowledge workers spend a significant portion of their day searching across multiple systems for information. The rise of unstructured data and AI capabilities can help organizations unlock value from their information assets. Effective enterprise search powered by AI can provide real-time intelligence, personalized information, and more efficient research to help knowledge workers.
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
Generative AI technology is a fascinating field that focuses on creating comp...Nohoax Kanont
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
Redefining Cybersecurity with AI CapabilitiesPriyanka Aash
In this comprehensive overview of Cisco's latest innovations in cybersecurity, the focus is squarely on resilience and adaptation in the face of evolving threats. The discussion covers the imperative of tackling Mal information, the increasing sophistication of insider attacks, and the expanding attack surfaces in a hybrid work environment. Emphasizing a shift towards integrated platforms over fragmented tools, Cisco introduces its Security Cloud, designed to provide end-to-end visibility and robust protection across user interactions, cloud environments, and breaches. AI emerges as a pivotal tool, from enhancing user experiences to predicting and defending against cyber threats. The blog underscores Cisco's commitment to simplifying security stacks while ensuring efficacy and economic feasibility, making a compelling case for their platform approach in safeguarding digital landscapes.
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Snarky Security
How wonderful it is that in our modern age, every bit of our biological data can be digitized, stored, and potentially pilfered by cyber thieves! Isn't it just splendid to think that while scientists are busy pushing the boundaries of biotechnology, hackers could be plotting the next big bio-data heist? This delightful scenario is brought to you by the ever-expanding digital landscape of biology and biotechnology, where the integration of computer science, engineering, and data science transforms our understanding and manipulation of biological systems.
While the fusion of technology and biology offers immense benefits, it also necessitates a careful consideration of the ethical, security, and associated social implications. But let's be honest, in the grand scheme of things, what's a little risk compared to potential scientific achievements? After all, progress in biotechnology waits for no one, and we're just along for the ride in this thrilling, slightly terrifying, adventure.
So, as we continue to navigate this complex landscape, let's not forget the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. After all, what could possibly go wrong?
-------------------------
This document provides a comprehensive analysis of the security implications biological data use. The analysis explores various aspects of biological data security, including the vulnerabilities associated with data access, the potential for misuse by state and non-state actors, and the implications for national and transnational security. Key aspects considered include the impact of technological advancements on data security, the role of international policies in data governance, and the strategies for mitigating risks associated with unauthorized data access.
This view offers valuable insights for security professionals, policymakers, and industry leaders across various sectors, highlighting the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. The analysis serves as a crucial resource for understanding the complex dynamics at the intersection of biotechnology and security, providing actionable recommendations to enhance biosecurity in an digital and interconnected world.
The evolving landscape of biology and biotechnology, significantly influenced by advancements in computer science, engineering, and data science, is reshaping our understanding and manipulation of biological systems. The integration of these disciplines has led to the development of fields such as computational biology and synthetic biology, which utilize computational power and engineering principles to solve complex biological problems and innovate new biotechnological applications. This interdisciplinary approach has not only accelerated research and development but also introduced new capabilities such as gene editing and biomanufact
Retrieval Augmented Generation Evaluation with RagasZilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceQuentin Reul
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and securing a formidable competitive advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottle necks, making a fix and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their application fast and understanding how things work under the hood.
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Increase Quality with User Access Policies - July 2024
Webinar: Solr & Fusion for Big Data
2. Solr & Fusion for Big Data
• Where search fits in the
big data landscape?
• Solr on HDFS
• Indexing strategies
• End-to-end security
• Lambda architecture
• Spark and how we use it
in Fusion
4. Why search for big data?
• Speed at scale
• Basic analytics (facets, pivot facets, facets + stats) +
visualizations
• Query structured and unstructured data
• Ad hoc exploration is inherent in big data
• People grok search
• Context for aggregations (drill into the numbers)
5. Common use case:
log analysis
• Time-ordered data
• Raw data stored in
HDFS
• How much data? How
fast?
• Access patterns?
• Schema design ~ no free
lunch at scale
6. Time-based Partitioning Scheme
Fusion
Log Analytics
Dashboard
logs_feb26
(daily collection)
logs_feb25
(daily collection)
logs_feb01
(daily collection)
h00
(shard)
h22
(shard)
h23
(shard)
h00
(shard)
h22
(shard)
h23
(shard)
Add replicas
to support higher
query volume &
fault-tolerance
recent_logs
(colllection alias)
Use a collection
alias to make multiple
collections look like a
single collection; minimize
exposure to partitioning
strategy in client layer
Every daily collection has 24 shards (h00-h23), each covering 1-hour blocks of log messages
7. Solr on HDFS
• Maturing solution still some issues
• My test showed ~23-25% slower than local SSD
• Better ROI, operational efficiency, security
• Needed for YARN
• Enables auto add replicas
• Interesting features coming soon: ZooKeeper lock (SOLR-
8169) and replicas share index (SOLR-6237)
8. Solr on HDFS
Solr
shard1 / replica1
block cache
Solr
shard1 / replica2
block cache
writes
reads
HDFS
DataNode C
HDFS
DataNode B
HDFS
DataNode A writes
reads
HDFS block replication
Solr replication
12. Fusion Indexing Pipelines in MapReduce
Solr
Map Task (or reducer if needed)
ZooKeeper
CloudSolr
Client
HDFS
Get collection metadata
from ZooKeeper
(e.g. shard leader URL)
Send updates to shard
leaders in parallel
Fusion Pipeline
docs
…N map tasks (1 per block)
30+ index stages
- Field mapping
- JavaScript
- Tika parsing
- NLP
- Regex
- JDBC lookup
Many common file formats supported:
CSV, SequenceFile, grok, XML, warc
13. Security
• End-to-end security is now a reality for Hadoop
• Kerberos authentication (ZK, Solr, HDFS, jobs)
• Pluggable authorization framework
• Collection and document-level access controls (via Fusion)
• SSL
• Apache Ranger (centralized admin, auditing, monitoring for
Hadoop)
14. Cluster Sizing Worksheet
• There is no formula, only guidelines!
• # of documents / avg. doc size / number of fields
• Updates per second / soft-commit frequency
• Storage type (local SSD vs. HDFS)
• Sharding scheme (time-based vs. hash-based)
• Peak QPS / 95th percentile response time / query complexity
• Must test your data on your servers ;-)
15. • Search engine fits
perfectly with lambda
• Use batch layer to build
indexes instead of
“views”
• Speed layer uses Spark
streaming to build near
real-time index
• Aggregation collections
for historical data
Lambda Architecture
source: http://lambda-architecture.net/
I don’t have to tell you that big data is a popular topic of discussion in IT circles these days.
What I want to talk about today is how Solr and Lucidworks Fusion fit into the big data landscape. Don’t worry, I’ll try to keep the hype and grandiose statements to a minimum. I will get technical in a few places because it’s important to understand the details.
Search is a critical component of any big data strategy
Fusion & Solr are first-class citizens in the Hadoop ecosystem
Big data doesn’t have to be hard – Fusion makes it easy
Search engines contain mission-critical data and are typically on the front-line, directly serving users
Before I was a Solr committer, I was a Solr user, one of the first adopters of SolrCloud actually. I worked on a team that built and supported a big data framework built on Hadoop, Storm, Cassandra, Solr, and Postgres. Effectively, we computed performance metrics for brands by analyzing social media data
IT organizations are consolidating data infrastructure for improved ROI, efficiency, security, and governance.
Solr is included as part of Cloudera, Hortonworks, and MapR Hadoop distributions
Users get search because they see it everyday; BI / dashboards / SQL are powerful, but not necessarily intuitive
Vast amount of exhaust created by users interacting with searchable content
Often times, it’s a small department in a larger organization that uses search to expose medium data to deliver business insights and then the “search engine” evolves into an insights engine on larger and larger data sets.
If you could plan for all the possible queries you need to serve, then traditional BI / data warehousing techniques will still serve you well. Search fills the void where users need fast, ad hoc query capabilities to do exploratory analysis.
Let’s imagine we have time-ordered data, such as logs of user activity. You can insert any scale that fits your needs here. We work with customers that have a billions log events per day up to 10’s of billions.
Let’s work through a quick example to illustrate some of the questions that come up and how we tackle them at Lucidworks
Bunch of log data HDFS, want to index it for ad hoc queries and basic visualizations, i.e. the kinds that you can power with simple analytical functions like faceting
First thing we have to identify is what data are we indexing? where is it coming from?
how much data is there?
how quickly do we need it to be indexed?
But wait … step back a sec … how are people going to search this data?
There are three important decisions emerge when designing your search solution:
data partitioning scheme (time-based: hourly, daily, 15-minutes, etc)
doc values: fields you need to sort and facet on should have doc values
range queries need trie-fields to be indexed
what fields must be stored / indexed
What type of visualizations make sense for this data? What type of aggregations do you want to perform and at what time granularity?
So here we’re starting to see some of the same considerations when designing data warehouse, i.e. there’s no free lunch, esp. at scale
The key-takeaway here is that you use your investment in Hadoop to scale complex document processing using Fusion pipelines by running a pipeline in each map or reduce task