Running SolrCloud in Public Cloud is the future. This presentation and the code that will be contributed back to the community will allow such clusters to be highly efficient, scalable and elastic. Attendees will understand the challenges and potential of sharing index data between servers.
Speakers: Ilan Ginzburg & Yonik Seeley, Salesforce
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart (Lucidworks)
This document summarizes a presentation given by Umesh Prasad and Thejus V M of Flipkart on building a real-time search index for e-commerce. It discusses the need for real-time indexing to support high update rates and microservices architecture at Flipkart. It evaluates using SolrCloud but finds that update-by-delete-and-add hinders performance. The presentation then describes Flipkart's approach using a near real-time Lucene store with optimized data structures and filtering to enable low-latency search across updated documents.
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia... (Lucidworks)
Thoth is a real-time Solr monitoring system developed at Trulia to understand search infrastructure without accessing logs. It collects Solr request data, indexes it in another Solr core for search and analysis, and provides a dashboard and APIs for monitoring metrics. It also uses machine learning to predict query times and identify query patterns through topic modeling. The system was designed to be modular and its components like data collection, indexing, dashboard and monitoring are open-sourced.
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target (Lucidworks)
This document summarizes Target's implementation of Solr as its search platform. It discusses how Target transitioned from Oracle-Endeca to Solr to handle its large scale data and enable more flexible relevancy controls. It describes how Target tested Solr through handling live guest traffic in two sprints and moving its typeahead functionality to the public cloud. Finally, it outlines how Target leverages key Solr capabilities like collection aliases, atomic updates, and configurable facets to synchronize designer and product launches.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P. (Lucidworks)
This document describes Bloomberg's development of a search analytics component for Solr. It was created by their search team to enable complex calculations and aggregations on numerical time-series data. Key features include statistical and mathematical expressions to facet and analyze data, supporting int, long, float, date and string fields. Examples show calculating a weighted average and variance. Future plans include multi-shard support and filtering result sets based on calculated statistics.
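The weighted average and variance mentioned above are simple to state in code. The following is a minimal Python sketch of the arithmetic only; it does not use the component's expression syntax, which the summary does not show, and the function names are hypothetical.

```python
from typing import Sequence


def weighted_average(values: Sequence[float], weights: Sequence[float]) -> float:
    """Return sum(value * weight) / sum(weight)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def population_variance(values: Sequence[float]) -> float:
    """Return the mean squared deviation from the mean."""
    if not values:
        raise ValueError("values must not be empty")
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

In a search setting, `values` would be a numeric field (for example a price) over the matching documents and `weights` a second field (for example a traded volume); both field choices are illustrative assumptions.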
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang (Databricks)
At Facebook, millions of Hive queries are executed on a daily basis, and the workload contributes to important analytics that drive product decisions and insights. Spark SQL in Apache Spark provides much of the same functionality as Hive query language (HQL) more efficiently, and Facebook is building a framework to migrate existing production Hive workload to Spark SQL with minimal user intervention.
Before Facebook began large-scale migration to SparkSQL, they worked on identifying the gap between HQL and SparkSQL. They built an offline syntax analysis tool that parses, analyzes, optimizes and generates physical plans on daily HQL workload. In this session, they’ll share their results. After finding their syntactic analysis encouraging, they built tooling for offline semantic analysis where they run HQL queries in their Spark shadow cluster and validate the outputs. Output validation is necessary since the runtime behavior in Spark SQL may be different from HQL. They have built a migration framework that supports HQL in both Hive and Spark execution engines, can shadow and validate HQL workloads in Spark, and makes it easy for users to convert their workloads.
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks (Lucidworks)
Timothy Potter presented at a Big Data conference in Boston from October 11-14, 2016. He discussed how Lucidworks Fusion provides an alternative to traditional big data stacks that emphasizes fast access, agility and automation over integration. Fusion allows for common access patterns like fast lookups, ranked retrieval and distributed scans while integrating technologies like Solr, Spark, HDFS and more. It provides tools for data ingestion, time-based partitioning, analytics, machine learning and more to solve business problems rather than focus on infrastructure.
PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa... (Lucidworks)
The document discusses search capabilities for the PlayStation Network. It describes how the initial system indexed 200,000 documents per second for the PS Store, but more capabilities were needed. It then details the challenges in moving from a relational database to NoSQL to support indexing 1 million documents per second across multiple services for 65 million monthly active users.
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge... (Databricks)
When you run an Apache Spark application on a large cluster, you want to make sure you’re getting the most from that cluster. Any CPU or memory left on the table represents either a waste of money or a lost opportunity to speed up your Spark jobs. What many people don’t realize is how sensitive Spark cluster utilization is to the resource manager. Resource managers decide how to allocate cluster resources among the many users and applications contending for them. In this deep dive session, we will discuss how Spark integrates with two common open source resource managers, YARN and Mesos, as well as a new commercial product called IBM Spectrum Conductor with Spark. You will learn how resource managers arbitrate resources in multi-user/multi-tenant Spark clusters, and how this affects application performance. You will come away with new techniques for tuning Spark resource management to optimize goals like speed and fairness. The session will include a demo of a new open source benchmark designed to help analyse Spark multi-user/multi-tenant performance. The benchmark uses Spark SQL and machine learning jobs to load the cluster, and can be used during a pre-production cycle to tune Spark and resource manager configurations.
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu (Databricks)
XGBoost (https://github.com/dmlc/xgboost) is a library designed and optimized for tree boosting. XGBoost attracts users from a broad range of organizations in both industry and academia, and more than half of the winning solutions in machine learning challenges hosted at Kaggle adopt XGBoost.
While XGBoost is one of the most popular machine learning systems, it is only one component of a complete data analytics pipeline. The data ETL/exploration/serving functionality is built on top of more general data processing frameworks, like Apache Spark. As a result, users have to build a communication channel between Apache Spark and XGBoost (usually through HDFS) and face difficulties in moving data and in developing and deploying applications.
We (Distributed (Deep) Machine Learning Community) develop XGBoost4J-Spark (https://github.com/dmlc/xgboost/tree/master/jvm-packages), which seamlessly integrates Apache Spark and XGBoost.
The communication channel between Spark and XGBoost is established based on RDDs/DataFrames/Datasets, all of which are standard data interfaces in Spark. Additionally, XGBoost can be embedded into a Spark MLlib pipeline and tuned through the tools provided by MLlib. In this talk, I will cover the motivation/history/design philosophy/implementation details as well as the use cases of XGBoost4J-Spark. I expect that this talk will share insights on building a heterogeneous data analytics pipeline based on Spark and other data intelligence frameworks and bring more discussion on this topic.
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks (Lucidworks)
The document discusses Lucidworks' Fusion product, which is a search platform that enhances Apache Solr. It provides connectors to various data sources, integrated ETL pipelines, built-in recommendations, and security features. The document outlines Fusion's architecture, demo use cases for basic and code search, and next steps for integrating additional analysis tools like OpenGrok.
Rental Cars and Industrialized Learning to Rank with Sean Downes (Databricks)
Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp.
In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.
Designing Distributed Machine Learning on Apache Spark (Databricks)
This document summarizes Joseph Bradley's presentation on designing distributed machine learning on Apache Spark. Bradley is a committer and PMC member of Apache Spark and works as a software engineer at Databricks. He discusses how Spark provides a unified engine for distributed workloads and libraries like MLlib make it possible to perform scalable machine learning. Bradley outlines different methods for distributing ML algorithms, using k-means clustering as an example of reorganizing an algorithm to fit the MapReduce framework in a way that minimizes communication costs.
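The idea of reorganizing k-means so that little data crosses the network can be sketched in plain Python. This is an illustration of the general map/reduce pattern, not MLlib's implementation; all names are hypothetical. Each partition reports only per-center coordinate sums and counts, so the amount of data exchanged depends on the number of centers rather than the number of points.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
# center index -> (coordinate sums, point count)
Partial = Dict[int, Tuple[List[float], int]]


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_partition(points: Sequence[Point], centers: Sequence[Point]) -> Partial:
    """Map step: summarize one partition as per-center sums and counts."""
    partial: Partial = {}
    for point in points:
        idx = nearest(point, centers)
        sums, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def reduce_partials(partials: Sequence[Partial], centers: Sequence[Point]) -> List[Point]:
    """Reduce step: merge the partition summaries and compute new centers."""
    merged: Partial = {}
    for partial in partials:
        for idx, (sums, count) in partial.items():
            acc_sums, acc_count = merged.get(idx, ([0.0] * len(sums), 0))
            merged[idx] = ([a + s for a, s in zip(acc_sums, sums)], acc_count + count)
    new_centers: List[Point] = []
    for idx, old in enumerate(centers):
        if idx in merged:
            sums, count = merged[idx]
            new_centers.append(tuple(s / count for s in sums))
        else:
            new_centers.append(old)  # keep a center that attracted no points
    return new_centers
```

One iteration calls `map_partition` once per partition and `reduce_partials` once on the results; repeating until the centers stop moving gives the usual Lloyd's algorithm.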
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using... (Caserta)
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr sponsored by O'Reilly Media!
Caserta Concepts shared one of their innovative DW projects using Solr. See how open source search technology can serve high performance analytic use cases. Presentation and solution walk-through given by Caserta Concepts' Joe Caserta and Elliott Cordo.
For more information, visit www.casertaconcepts.com
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson (Spark Summit)
Streaming Analytics with Spark, Kafka, Cassandra, and Akka discusses rethinking architectures for streaming analytics. The document discusses:
1) The need to build scalable, fault-tolerant systems to handle massive amounts of streaming data from different sources with varying structures.
2) An example use case of profiling cyber threat actors using streaming machine data to detect intrusions and security breaches.
3) Rethinking architectures by moving away from ETL pipelines and dual batch/stream systems like Lambda architecture toward unified stream processing with Spark Streaming, Kafka, Cassandra and Akka. This simplifies analytics and eliminates duplicate code and systems.
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology (Lucidworks)
This document describes a custom Solr plugin for fuzzy name matching. The plugin handles challenges like name variations and ambiguity. It creates a custom field type that scores name matches and supports multiple fields and values per document. At query time, it generates a custom Lucene query to find candidates, then uses Solr's rerank feature to rescore the top results based on the name matching algorithm. The plugin is configurable to trade off accuracy versus speed and supports multi-lingual name matching.
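The two-pass approach described above (a cheap query to find candidates, then a costlier rescoring of the top results) can be sketched in Python. This is only an illustration of the pattern: the candidate filter and the `difflib` similarity are stand-ins chosen for this example, not the plugin's matching algorithm, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def cheap_candidates(query: str, names: List[str]) -> List[str]:
    """First pass: keep names that share a token initial with the query."""
    initials = {tok[0] for tok in query.lower().split()}
    return [n for n in names if initials & {tok[0] for tok in n.lower().split()}]


def rerank(query: str, candidates: List[str], top_n: int = 10) -> List[Tuple[str, float]]:
    """Second pass: rescore only the first top_n candidates with a costlier similarity."""
    scored = [
        (name, SequenceMatcher(None, query.lower(), name.lower()).ratio())
        for name in candidates[:top_n]
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The `top_n` parameter is where the accuracy-versus-speed trade-off shows up: a larger value rescoring more candidates is slower but less likely to miss a good match that ranked poorly in the first pass.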
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While... (Databricks)
In this session, IBM will present details on advanced Apache Spark analytics currently being performed through a collaborative project with the SETI Institute, NASA, Swinburne University, Stanford University and IBM. The Allen Telescope Array in northern California has been continuously scanning the skies for over two decades, generating data archives with over 200 million signal events.
Come and learn how astronomers and researchers are using Apache Spark, in conjunction with assets such as IBM’s Cognitive Compute Cluster with over 700 GPUs, to train neural net models for signal classification, and to perform computationally intensive Spark workloads on multi-terabyte binary signal files. The speakers will also share details on one of the key components of this implementation: Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark, specifically designed to optimize their performance with object stores. Learn how Stocator works, and see how it was able to greatly improve performance and reduce the quantity of resources used, both for ground-to-cloud uploads of very large signal files, and for subsequent access of radio data for analysis using Spark.
This document provides an overview of a presentation comparing Apache Flink and Apache Spark. The presentation aims to address marketing claims, confusing statements, and outdated information regarding Flink vs Spark. It outlines key criteria to evaluate the two platforms, such as streaming capabilities, state management, and scalability. The document then directly compares some criteria, such as their support for iterative processing and streaming engines. The presenter hopes this evaluation framework will help others assess Flink and Spark for stream processing use cases.
This document summarizes how Solr and Lucidworks Fusion can be used for big data search and analytics. It discusses indexing strategies like using MapReduce, Spark, and Fusion connectors to index structured and unstructured data from HDFS. It also covers topics like Solr on HDFS, auto add replicas, security, cluster sizing, and using the lambda architecture with Spark streaming to enable real-time search over batch-processed historical data. The document promotes Lucidworks Fusion as a search platform that can handle massive scales of data, provide real-time search capabilities, and work with any data source securely.
Building a Large Scale Recommendation Engine with Spark and Redis-ML with Sha... (Databricks)
Redis-ML is a Redis module for high performance, real-time serving of Spark-ML models. It allows users to train large complex models in Spark, and then store and query the models directly on Redis clusters. The high throughput and low latency of Redis-ML allows users to perform heavy classification operations in real time while using a minimal number of servers. This unique architecture enables significant savings in resources compared to current commonly used methods, without loss in precision or server performance.
This session will demonstrate how to build a production-level recommendation system from the ground up using Spark-ML and Redis-ML. It will also describe performance and accuracy benchmarks, comparing the results with current standard methods.
Dictionary Based Annotation at Scale with Spark by Sujit Pal (Spark Summit)
This document summarizes a presentation about annotating millions of documents at scale using dictionary-based annotation with Apache Spark, Apache Solr, and Apache OpenNLP. The key points discussed include:
- The problem of annotating millions of documents from science corpora and the need to do it efficiently without model training.
- The architecture of SoDA (Dictionary Based Named Entity Annotator), which uses Apache Solr, SolrTextTagger, and OpenNLP for annotation and can be run on Spark for scaling.
- Performance optimizations made including combining paragraphs, tuning Solr garbage collection, using a larger Spark cluster, and scaling out Solr. These helped achieve over 25 documents per second annotation throughput.
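Dictionary-based annotation, as opposed to a trained model, amounts to looking up phrases from the text in a fixed vocabulary. A minimal Python sketch of that idea, using greedy longest-match over whitespace tokens, is given here. It is not how SolrTextTagger works internally; the function name, the dictionary contents, and the entity IDs are made up for illustration.

```python
from typing import Dict, List, Tuple

Annotation = Tuple[int, int, str, str]  # (start token, end token exclusive, phrase, entity id)


def annotate(text: str, dictionary: Dict[str, str], max_len: int = 4) -> List[Annotation]:
    """Tag dictionary phrases in text, preferring the longest match at each position."""
    tokens = text.lower().split()
    annotations: List[Annotation] = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest window first so "heart attack" wins over "heart".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                match = (i, i + n, phrase, dictionary[phrase])
                break
        if match:
            annotations.append(match)
            i = match[1]  # continue after the matched phrase
        else:
            i += 1
    return annotations
```

Because each document is annotated independently, a function like this can be applied per document in parallel, which is the property that makes the task a natural fit for a Spark job.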
This document discusses MongoDB and scaling strategies when using MongoDB. It begins with an overview of MongoDB's architecture, data model, and operations. It then describes some early performance issues encountered with MongoDB including issues with durability settings, queries locking servers, and updates moving documents. The document recommends strategies for scaling such as adding more RAM, partitioning data through sharding, and monitoring replication delay closely for disaster recovery.
Taking Splunk to the Next Level - Architecture Breakout Session (Splunk)
This document provides an overview of scaling a Splunk deployment from an initial use case to a larger enterprise deployment. It discusses growing use cases and data volume over time. The agenda covers use case mapping, simple scaling approaches, indexer and search head clustering, distributed management, and hybrid cloud deployments. Best practices are outlined for sizing storage, tuning indexers, and designing high availability into the forwarding, indexing, and search tiers. Clustering impacts on storage sizing and additional hosts are also addressed.
This document provides an overview of MongoDB including:
- MongoDB is an open-source document database that is schemaless and document-oriented.
- It has advantages like rich querying, horizontal scalability, high availability, and flexibility in schemas.
- The document includes information on MongoDB's data model, querying capabilities, indexing, availability through replication, and scaling through sharding.
- Case studies are presented showing how companies like Mailbox, Visual China, and Youku use MongoDB for applications processing large amounts of data.
Taking Splunk to the Next Level - Architecture Breakout Session (Splunk)
This document provides an overview and agenda for taking a Splunk deployment to the next level by addressing scaling needs and high availability requirements. It discusses growing use cases and data volumes, making Splunk mission critical through clustering, and supporting global deployments. The agenda covers scaling strategies like indexer clustering, search head clustering, and hybrid cloud deployments. It also promotes justifying increased spending by mapping dependencies and costs of failures across an organization's systems.
ProxySQL provides native support for high availability solutions like PXC, InnoDB Cluster, and regular MySQL replication. It can monitor the health of nodes and redirect traffic away from unavailable or stale nodes, improving availability. It supports various topologies out of the box through host groups, health checks, and failure detection. ProxySQL helps implement robust HA architectures by integrating these functions and allowing automatic traffic redirection based on node status.
The document discusses NoSQL technologies including Cassandra, MongoDB, and ElasticSearch. It provides an overview of each technology, describing their data models, key features, and comparing them. Example documents and queries are shown for MongoDB and ElasticSearch. Popular use cases for each are also listed.
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs (Lucidworks)
This document discusses Solr distributed indexing at WalmartLabs. It describes customizing an existing MapReduce indexing tool to index large XML files in a distributed manner across multiple servers. Key points covered include using two custom utilities for index generation and merging, experiments showing indexing is CPU-bound while merging is I/O-bound, and lessons learned around data locality and using n-way merging of shards for best performance. Solutions discussed include dedicating an indexing Hadoop cluster to improve I/O speeds for merging indexes.
This document summarizes Eliot Horowitz's presentation on practical scaling and sharding in MongoDB. The key points covered include:
1) Horizontal scaling is needed to scale beyond the limits of vertical scaling in the cloud. Replica sets allow scaling reads but not writes.
2) Sharding distributes write load across shards, keeps working data set in RAM, enables consistent reads, and allows capacity to increase with no downtime.
3) Sharding design goals are to scale linearly, increase capacity non-disruptively, be transparent to applications, and have low administration overhead.
This document provides an introduction to Docker and Openshift including discussions around infrastructure, storage, monitoring, metrics, logs, backup, and security considerations. It describes the recommended infrastructure for a 3 node Openshift cluster including masters, etcd, and nodes. It also discusses strategies for storage, monitoring both internal pod status and external infrastructure metrics, collecting and managing logs, backups, and security features within Openshift like limiting resource usage and isolating projects.
This is a summary of the sessions I attended at PASS Summit 2017. I put these slides together to summarize the week-long conference and present it at my company. They cover the sessions I found most valuable and include screenshots of demos I developed and tested myself, modeled on the speakers' demos.
This document discusses PostgreSQL database architecture patterns for running PostgreSQL at scale when a relational database as a service like Amazon RDS won't meet needs. It describes challenges faced with MySQL, Redshift and Vertica and how PostgreSQL was better suited through techniques like partitioning by date, TOAST compression, foreign data wrappers, and poor man's parallel processing. Key takeaways are that PostgreSQL supported scaling to petabytes of data, sub-second queries across large date ranges, and custom extensions needed while avoiding limitations and expenses of other database options.
Taking Splunk to the Next Level - Architecture Breakout Session (Splunk)
This document provides an agenda for scaling a Splunk deployment beyond initial use cases. It discusses growing use cases and data volume over time. As Splunk becomes mission critical, the document recommends implementing high availability through indexer and search head clustering. It also suggests using a distributed management console and centralized configuration management. Finally, the document briefly discusses Splunk Cloud and hybrid deployments as options to scale without waiting for additional on-premise hardware.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes.
1) MongoDB databases can grow very large due to flexible document schemas that allow large and denormalized data. This leads to increased storage requirements.
2) MongoDB replication can introduce lag on secondary nodes as they process write operations. This limits the ability to use secondary nodes for scaling reads.
3) MongoDB performance declines dramatically when indexes do not fit in memory, requiring more RAM, sharding, or reduced write performance.
4) MongoDB implements database-level locking, limiting write concurrency and the ability to run multiple shards on a single server.
5) MongoDB does not support ACID transactions, multi-version concurrency control, or consistent reads in the presence of concurrent writes. Workarounds
Cloud computing UNIT 2.1 presentation in (RahulBhole12)
Cloud storage allows users to store files online through cloud storage providers like Apple iCloud, Dropbox, Google Drive, Amazon Cloud Drive, and Microsoft SkyDrive. These providers offer various amounts of free storage and options to purchase additional storage. They allow files to be securely uploaded, accessed, and synced across devices. The best cloud storage provider depends on individual needs and preferences regarding storage space requirements and features offered.
Designing your SaaS Database for Scale with Postgres (Ozgun Erdogan)
If you’re building a SaaS application, you probably already have the notion of tenancy built in your data model. Typically, most information relates to tenants / customers / accounts and your database tables capture this natural relation.
With smaller amounts of data, it’s easy to throw more hardware at the problem and scale up your database. As these tables grow however, you need to think about ways to scale your multi-tenant (B2B) database across dozens or hundreds of machines.
In this talk, we're first going to talk about motivations behind scaling your SaaS (multi-tenant) database and several heuristics we found helpful on deciding when to scale. We'll then describe three design patterns that are common in scaling SaaS databases: (1) Create one database per tenant, (2) Create one schema per tenant, and (3) Have all tenants share the same table(s). Next, we'll highlight the tradeoffs involved with each design pattern and focus on one pattern that scales to hundreds of thousands of tenants. We'll also share an example architecture from the industry that describes this pattern in more detail.
Last, we'll talk about key PostgreSQL properties, such as semi-structured data types, that make building multi-tenant applications easy. We'll also mention Citus as a method to scale out your multi-tenant database. We'll conclude by answering frequently asked questions on multi-tenant databases and Q&A.
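The third pattern in the abstract above, where all tenants share the same tables, usually means that every table carries a tenant identifier and every query filters on it. The sketch below illustrates that idea with Python's built-in `sqlite3` module rather than PostgreSQL or Citus, purely so that it runs without a server; the table, columns, and function name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Every row carries tenant_id, and it leads the primary key so that
# one tenant's rows are stored and looked up together.
conn.execute(
    """
    CREATE TABLE orders (
        tenant_id INTEGER NOT NULL,
        order_id  INTEGER NOT NULL,
        total     REAL,
        PRIMARY KEY (tenant_id, order_id)
    )
    """
)
conn.executemany(
    "INSERT INTO orders (tenant_id, order_id, total) VALUES (?, ?, ?)",
    [(1, 1, 9.5), (1, 2, 20.0), (2, 1, 7.25)],
)


def orders_for_tenant(connection: sqlite3.Connection, tenant_id: int):
    """Return (order_id, total) rows for a single tenant only."""
    cursor = connection.execute(
        "SELECT order_id, total FROM orders WHERE tenant_id = ? ORDER BY order_id",
        (tenant_id,),
    )
    return cursor.fetchall()
```

With this layout the tenant identifier is also the natural key to distribute rows on when the table is later spread across machines, since queries for one tenant then touch a single node.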
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons (DataStax)
We'll be covering some aspects of our architecture, highlighting differences between MongoDB and Cassandra. We'll go in depth to explain why Cassandra is a better choice for our general purpose Application Platform (SHIFT) as well as our Media Buying Analytics tool (the SHIFT Media Manager). We'll be going over common design patterns people might be familiar with coming from a background with MongoDB and highlight how Cassandra would be used as a better alternative. We'll also touch more on cqlengine which is nearing feature completeness as the Cassandra object mapper for Python.
The document discusses Snowflake architecture and data ingestion workflows. It provides an overview of Snowflake's architecture including its storage, compute, and virtual warehouse components. It also covers topics like file formats, stages, the COPY command, load metadata, and automation of data ingestion including for semi-structured data formats.
Sharding in MongoDB allows for horizontal scaling of data and operations across multiple servers. When determining if sharding is needed, factors like available storage, query throughput, and response latency on a single server are considered. The number of shards can be calculated based on total required storage, working memory size, and input/output operations per second across servers. Different types of sharding include range, tag-aware, and hashed sharding. Choosing a high cardinality shard key that matches query patterns is important for performance. Reasons to shard include scaling to large data volumes and query loads, enabling local writes in a globally distributed deployment, and improving backup and restore times.
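The summary above names three inputs to the shard-count estimate (storage, working memory, and IOPS) without giving a formula. One plausible way to combine them, shown here as an assumption and not as MongoDB's official sizing method, is to compute the number of servers each resource would need and take the largest. The function and parameter names are hypothetical.

```python
import math


def shards_needed(
    total_storage_gb: float,
    storage_per_server_gb: float,
    working_set_gb: float,
    ram_per_server_gb: float,
    required_iops: float,
    iops_per_server: float,
) -> int:
    """Estimate shard count as the most demanding of three resource limits.

    Assumption: each resource is divided evenly across shards, so the
    resource that needs the most servers determines the total.
    """
    by_storage = math.ceil(total_storage_gb / storage_per_server_gb)
    by_memory = math.ceil(working_set_gb / ram_per_server_gb)
    by_iops = math.ceil(required_iops / iops_per_server)
    return max(by_storage, by_memory, by_iops, 1)
```

For example, 9,000 GB of data on 2,000 GB servers needs 5 shards for storage alone, even if memory and IOPS would be satisfied by fewer; real sizing would also leave headroom for growth and uneven key distribution.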
This document discusses Scality's experiences building their first Node.js project. It summarizes that the project was building a TiVo-like cloud service for 25 million users, which required high parallelism and throughput of terabytes per second. It also discusses lessons learned around logging performance, optimizing the event loop and buffers, and useful Node.js tools.
Similar to SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan Ginzburg & Yonik Seeley, Salesforce (20)
Search is the Tip of the Spear for Your B2B eCommerce Strategy (Lucidworks)
With ecommerce experiencing explosive growth, it seems intuitive that the B2B segment of that ecosystem is mirroring the same trajectory. That said, B2B has very different needs when it comes to transacting with the same style of experiences that we see in B2C. For instance, B2B ecommerce is about precision findability, whereas B2C customers can convert at higher rates when they’re just browsing online. In order for the B2B buying experience to be successful, search needs to be tuned to meet the unique needs of the segment.
In this webinar with Forrester senior analyst Joe Cicman, you’ll learn:
-Which verticals in B2B will drive the most growth, and how machine-learning powered personalization tactics can be deployed to support those specific verticals
-Why an omnichannel selling approach must be deployed in order to see success in B2B
-How deploying content search capabilities will support a longer sales cycle at scale
-What the next steps are to support a robust B2B commerce strategy supported by new technology
Speakers
Joe Cicman, Senior Analyst, Forrester
Jenny Gomez, VP of Marketing, Lucidworks
Customer loyalty starts with quickly responding to your customer’s needs. When it comes to resolving open support cases, time is of the essence. Time spent searching for answers adds up and creates inefficiencies in resolving cases at scale. Relevant answers need to be a few clicks away and easily accessible for agents directly from their service console.
We will explore how Lucidworks’ Agent Insights application automatically connects agents with the correct answers and resources. You’ll learn how to:
-Configure a proactive widget in an agent’s case view page to access resources across third-party systems (such as Sharepoint, Confluence, JIRA, Zendesk, and ServiceNow).
-Easily set up query pipelines to autonomously route assets and resources that are relevant to the case-at-hand—directly to the right agent.
-Identify subject matter experts within your support data and access tribal knowledge with lightning-fast speed.
How Crate & Barrel Connects Shoppers with Relevant Products (Lucidworks)
Lunch and Learn during Retail TouchPoints #RIC21 virtual event.
***
Crate & Barrel’s previous search solution couldn’t provide its shoppers with an online search and browse experience consistent with the customer-centric Crate & Barrel brand. Meanwhile, Crate & Barrel merchandisers spent the bulk of their time manually creating and maintaining search rules. The search experience impacted customer retention, loyalty, and revenue growth.
Join this lunch & learn for an interactive chat on how Crate & Barrel partnered with Lucidworks to:
-Improve search and browse by modernizing the technology stack with ML-based personalization and merchandising solutions
-Enhance the experience for both shoppers and merchandisers
-Explore signals to transform the omnichannel shopping experience
Questions? Visit https://lucidworks.com/contact/
Learn how to guide customers to relevant products using eCommerce search, hyper-personalisation, and recommendations in our ‘Best-In-Class Retail Product Discovery’ webinar.
Nowadays, shoppers want their online experience to be engaging, inspirational and fulfilling. They want to find what they’re looking for quickly and easily. If the sought after item isn’t available, they want the next best product or content surfaced to them. They want a website to understand their goals as though they were talking to a sales assistant in person, in-store.
In this webinar, we explore IMRG industry data insights and a best-in-class example of retail product discovery. You’ll learn:
- How AI can drive increased revenue through hyper-personalised experiences
- How user intent can be easily understood and results displayed immediately
- How merchandisers can be empowered to curate results and product placement – all without having to rely on IT.
Presented by:
Dave Hawkins, Principal Sales Engineer - Lucidworks
Matthew Walsh, Director of Data & Retail - IMRG
Connected Experiences Are Personalized Experiences (Lucidworks)
Many companies claim personalization and omnichannel capabilities are top priorities. Few are able to deliver on those experiences.
For a recent Lucidworks-commissioned study, Forrester Consulting surveyed 350+ global business decision-makers to see what gets in the way of achieving these goals. They discovered that inefficient technology, lack of behavioral insights, and failure to tie initiatives to enterprise-wide goals are some of the most frequent blockers to personalization success.
Join guest speaker, Forrester VP and Principal Analyst, Brendan Witcher, and Lucidworks CEO, Will Hayes, to hear the results of the Forrester Consulting study, how to avoid “digital blindness,” and how to apply VoC data in real-time to delight customers with personalized experiences connected across every touchpoint.
In this webinar, you’ll learn:
- Why companies who utilize real-time customer signals report more effective personalization
- How to connect employees and customers in a shared experience through search and browse
- How Lucidworks clients Lenovo, Morgan Stanley and Red Hat fast-tracked improvements in conversion, engagement and customer satisfaction
Featuring
- Will Hayes, CEO, Lucidworks
- Brendan Witcher, VP, Principal Analyst, Forrester
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
Intelligent Policing. Leveraging Data to more effectively Serve Communities.
Policing in the next decade is anticipated to be very different from historical methods. More data driven, more focused on the intricacies of communities they serve and more open and collaborative to make informed recommendations a reality. Whether its social populations, NIBRS or organization improvement that’s the driver, the IT requirement is largely the same. Provide 360 access to large volumes of siloed data to gain a full 360 understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
-The technology needs of an intelligent police force.
-How a Global Search improves an officer's interaction with existing data.
Featuring:
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
Policing in the next decade is anticipated to be very different from historical methods. More data driven, more focused on the intricacies of communities they serve and more open and collaborative to make informed recommendations a reality. Whether its social populations, NIBRS or organization improvement that’s the driver, the IT requirement is largely the same. Provide 360 access to large volumes of siloed data to gain a full 360 understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
The technology needs of an intelligent police force.
How a Global Search improves an officer's interaction with existing data.
Featuring
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
This document provides a framework for prioritizing onsite search problems and key performance indicators (KPIs) to measure for e-commerce search optimization. It recommends prioritizing fixing searches that yield no results, improving relevance of results, and reducing false positives. The most essential KPIs to measure include query latency, throughput, result relevance through click-through rates and NDCG scores. The document also provides tips for self-benchmarking search performance and examples of search performance benchmarks across nine e-commerce sites from various industries.
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
Wish your conversion rates were higher? Can’t figure out how to efficiently and effectively serve all the visitors on your site? Embarrassed by the quality of your product discovery experience? The bar is high and the influx of online shopping over recent months has reminded us that the opportunities are real. We’re all deep in holiday prep, but let’s take a few minutes to think about January 2021 and beyond. How can we position ourselves for success with our customers and against our competition?
Grab your lunch and let’s dive into three strategies that need to be part of your 2021 roadmap. You don’t need an army to get there. But you do need to take action and capitalize on the shoppers abandoning the product discovery journey on your site.
In this session, attendees will find out how to:
-Take control of merchandising at scale;
-Implement hands-free search relevancy; and
-Address personalization challenges.
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
For a personalized search experience, search curation requires robust text interpretation, data enrichment, relevancy tuning and recommendations. In order to achieve this, language and entity identification are crucial.
For teams working on search applications, advanced language packages allow them to achieve greater recall without sacrificing precision.
Join us for a guided tour of our new Advanced Linguistics packages, available in Fusion, thanks to the technology partnership between Lucidworks and Basistech.
We’ll explore the application of language identification and entity extraction in the context of search, along with practical examples of personalizing search and enhancing entity extraction.
In this webinar, we’ll cover:
-How Fusion uses the Rosette Basic Linguistics and Entity Extraction packages
-Tips for improving language identification and treatment as well as data enrichment for personalization
-Speech2 demo modeling Active Recommendation
-Use Rosette’s packages with Fusion Pipelines to build custom entities for specific domain use cases
Featuring:
-Radu Miclaus, Director of Product, AI and Cloud, Lucidworks, Lucidworks
-Robert Lucarini, Senior Software Engineer, Lucidworks
-Nick Belanger, Solutions Engineer, Basis Technology
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
Before COVID-19, almost 80% of the US workforce worked service in jobs that involve in-person interaction with strangers. Now, leaders of service organizations must reshape their offerings during the pandemic and prepare for whatever the new normal turns out to be. Our three panelists will share ideas for adapting their service businesses, now that closer-than-six-feet isn’t an option.
Join Lucidworks as we talk shop with 3 service business leaders, covering:
-Common impacts of the pandemic on service businesses (and what to do about them),
-How service teams can maintain a human touch across virtual channels, and
-Plans for the future, before and after the pandemic subsides.
Featuring
-Sara Nathan, President & CEO, AMIGOS
-Anthony Carruesco, Founder, AC Fly Fishing
-sara bradley, chef and proprietor, freight house
-Justin Sears, VP Product Marketing, Lucidworks
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
The COVID-19 pandemic has forced companies to support far more customers and employees through digital channels than ever before. Many are turning to chatbots to help meet increasing demand, but traditional rules-based approaches can’t keep up. Our new Smart Answers add-on to Lucidworks Fusion makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
Watch our on-demand webinar showcasing Smart Answers on Lucidworks Fusion. This technology makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
In this webinar, we’ll cover off:
-How search and deep learning extend conversational frameworks for improved experiences
-How Smart Answers improves customer care, call deflection, and employee self-service
-A live demo of Smart Answers for multi-channel self-service support
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
In the current climate, it’s now more important than ever to digitally enable your workforce and customers.
Hear from Simon Taylor, VP Global Partners & Alliances, Lucidworks and Matt Aslett, Research Vice President, 451 Research to get the inside scoop on how industry leaders in Europe are developing and executing their digital transformation strategies.
In this webinar, we’ll discuss:
The top challenges and aspirations European business and technology leaders are solving using AI and search technology
Which search and AI use cases are making the biggest impact in industries such as finance, healthcare, retail and energy in Europe
What technology buyers should look for when evaluating AI and search solutions
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
This document introduces Fusion 5.1 and its new capabilities for integrating with data science tools like Tensorflow, Scikit-Learn, and Spacy.
It provides an overview of Fusion's capabilities for understanding content, users, and delivering insights at scale. The document then demonstrates Fusion's Jupyter Notebook integration for reading and writing data and running SQL queries.
Finally, it shows how Fusion integrates with Seldon Core to easily deploy machine learning models with tools like Tensorflow and Scikit-Learn. A live demo is provided of deploying a custom model and using it in Fusion's query and indexing pipelines.
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
In this webinar with 451 Research, you'll understand how retailers are using AI to predict customer intent and learn which key performance metrics are used by more than 120 online retailers in Lucidworks’ 2019 Retail Benchmark Survey.
In this webinar, you’ll learn:
● What trends and opportunities are facing the ecommerce industry in 2020
● Why search is the universal path to understanding customer intent
● How large online retailers apply AI to maximize the effectiveness of their personalization efforts
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
Nordstrom Rack | Hautelook curates and serves customers a wide selection of on-trend apparel, accessories, and shoes at an everyday savings of up to 75 percent off regular prices. With over a million visitors shopping across different platforms every day, and a realization that customers have become accustomed to robust and personalized search interactions, Nordstrom Rack | Hautelook launched an initiative over a year ago to provide data science-driven digital experiences to their customers.
In this session, we’ll discuss Nordstrom Rack | Hautelook’s journey of operationalizing a hefty strategy, optimizing a fickle infrastructure, and rallying troops around a single vision of building an expansible machine-learning driven product discovery engine.
The audience will learn about:
-The key technical challenges and outcomes that come with onboarding a solution
-The lessons learned of creating and executing operational design
-The use of Lucidworks Fusion to plug custom data science models into search and browse applications to understand user intent and deliver personalized experiences
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
Knowledge graphs and machine learning are on the rise as enterprises hunt for more effective ways to connect the dots between the data and the business world. With newer technologies, the digital workplace can dramatically improve employee engagement, data-driven decisions, and actions that serve tangible business objectives.
In this webinar, you will learn
-- Introduction to knowledge graphs and where they fit in the ML landscape
-- How breakthroughs in search affect your business
-- The key features to consider when choosing a data discovery platform
-- Best practices for adopting AI-powered search, with real-world examples
Webinar: Building a Business Case for Enterprise SearchLucidworks
The document discusses building a business case for enterprise search. It notes that 85% of information is unstructured data locked in various locations and applications. Many knowledge workers spend a significant portion of their day searching across multiple systems for information. The rise of unstructured data and AI capabilities can help organizations unlock value from their information assets. Effective enterprise search powered by AI can provide real-time intelligence, personalized information, and more efficient research to help knowledge workers.
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptxFwdays
I will share my personal experience of full-time development on wasm Blazor
What difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, which technology stack and architectural patterns we chose
What conclusions we made and what mistakes we committed
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Snarky Security
How wonderful it is that in our modern age, every bit of our biological data can be digitized, stored, and potentially pilfered by cyber thieves! Isn't it just splendid to think that while scientists are busy pushing the boundaries of biotechnology, hackers could be plotting the next big bio-data heist? This delightful scenario is brought to you by the ever-expanding digital landscape of biology and biotechnology, where the integration of computer science, engineering, and data science transforms our understanding and manipulation of biological systems.
While the fusion of technology and biology offers immense benefits, it also necessitates a careful consideration of the ethical, security, and associated social implications. But let's be honest, in the grand scheme of things, what's a little risk compared to potential scientific achievements? After all, progress in biotechnology waits for no one, and we're just along for the ride in this thrilling, slightly terrifying, adventure.
So, as we continue to navigate this complex landscape, let's not forget the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. After all, what could possibly go wrong?
-------------------------
This document provides a comprehensive analysis of the security implications biological data use. The analysis explores various aspects of biological data security, including the vulnerabilities associated with data access, the potential for misuse by state and non-state actors, and the implications for national and transnational security. Key aspects considered include the impact of technological advancements on data security, the role of international policies in data governance, and the strategies for mitigating risks associated with unauthorized data access.
This view offers valuable insights for security professionals, policymakers, and industry leaders across various sectors, highlighting the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. The analysis serves as a crucial resource for understanding the complex dynamics at the intersection of biotechnology and security, providing actionable recommendations to enhance biosecurity in an digital and interconnected world.
The evolving landscape of biology and biotechnology, significantly influenced by advancements in computer science, engineering, and data science, is reshaping our understanding and manipulation of biological systems. The integration of these disciplines has led to the development of fields such as computational biology and synthetic biology, which utilize computational power and engineering principles to solve complex biological problems and innovate new biotechnological applications. This interdisciplinary approach has not only accelerated research and development but also introduced new capabilities such as gene editing and biomanufact
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Keynote : AI & Future Of Offensive SecurityPriyanka Aash
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
It's your unstructured data: How to get your GenAI app to production (and spe...Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan Ginzburg & Yonik Seeley, Salesforce
2. SolrCloud in Public Cloud: Scaling Compute Independently from Storage
Ilan Ginzburg: Working on Search infra, Architect at Salesforce
Yonik Seeley: Creator of Solr, Lucidworks co-founder, Principal Architect at Salesforce
3. Reduce cost to serve and improve quality of service
by separating search compute and search index storage
Salesforce search traffic lends itself well to such optimizations:
● Multi-tenant
● Unknown access patterns
● Unknown data growth (speed and scale)
● Indexing-heavy
4. What this talk is about
Changes to SolrCloud that allow storing indexes on shared storage and sharing them between multiple nodes, so that cluster size can be adjusted to current traffic.
An Activate 2018 presentation, “Who moved my state? A Blob Storage Solr story”, described the integration of Blob storage with the Salesforce non-SolrCloud Solr cluster.
10. Storage format on Blob
Metadata file listing files in commit point (local name → Blob name):
_1.si → _1.si.957a2cec
_0.cfs → _0.cfs.94639b68
…
_0.cfe → _0.cfe.5d8dbde9
segments_3 → segments_3.241a8cbc
Files on Blob, including the files with actual segment data:
core.metadata.23393bd7
_1.si.957a2cec
....
_0.cfe.5d8dbde9
...
11. Pushing changes to Blob
1. Compare Blob metadata to local commit point.
2. Push new files to Blob (with unique names).
3. Push new version of Blob metadata file describing the new commit point.
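The three steps above can be sketched in a few lines. This is only an illustration, not the code from the Solr patch: the Blob store is modelled as a plain dict, the function name `push_commit_point` and the JSON metadata encoding are assumptions, and the random 8-character suffix merely imitates the unique names shown on the storage format slide.

```python
import json
import uuid

def push_commit_point(local_files, blob_metadata, blob_store):
    """Push a local commit point to a (dict-backed, hypothetical) Blob store.

    local_files:   local file name -> bytes, the files of the local commit point
    blob_metadata: local file name -> Blob name, from the last metadata file
    blob_store:    Blob name -> bytes, stand-in for the real Blob store
    """
    new_metadata = {}
    for name, data in local_files.items():
        if name in blob_metadata:
            # Step 1: file already on Blob (segment files are write-once).
            new_metadata[name] = blob_metadata[name]
        else:
            # Step 2: push the new file under a unique name.
            blob_name = f"{name}.{uuid.uuid4().hex[:8]}"
            blob_store[blob_name] = data
            new_metadata[name] = blob_name
    # Step 3: push a new metadata file describing the new commit point.
    metadata_name = f"core.metadata.{uuid.uuid4().hex[:8]}"
    blob_store[metadata_name] = json.dumps(new_metadata).encode("utf-8")
    return new_metadata, metadata_name
```

Files that left the commit point (for example an older segments_N) are simply absent from the new metadata; deleting them from Blob is outside this sketch.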
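12. Supporting multiple writers
[Diagram: SolrCloud, the Blob store and Zookeeper, with steps numbered 1 to 5. Zookeeper holds the current metadata suffix (xyz, replaced by newSuffix). The Blob store holds the segment files and core.metadata.xyz, then the new segment files and core.metadata.newSuffix.]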
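The order of the numbered steps cannot be recovered from this transcript, so the sketch below is only one plausible reading of the slide: a writer pushes a new core.metadata file under a fresh suffix and then makes it current with a conditional update of the suffix kept in Zookeeper. The conditional update is an assumption, and `SuffixNode` is an in-memory stand-in, not the Zookeeper client API.

```python
class SuffixNode:
    """In-memory stand-in for the Zookeeper node holding the current suffix."""

    def __init__(self, suffix):
        self.suffix = suffix
        self.version = 0

    def read(self):
        return self.suffix, self.version

    def set_if_version(self, new_suffix, expected_version):
        # Succeeds only if nobody else updated the node since it was read.
        if expected_version != self.version:
            return False
        self.suffix = new_suffix
        self.version += 1
        return True


def try_publish(node, blob_store, new_suffix, metadata, seen_version):
    """Push the metadata under a new suffix, then try to make it current."""
    blob_store[f"core.metadata.{new_suffix}"] = metadata
    return node.set_if_version(new_suffix, seen_version)


# Two writers start from the same state; only the first one wins.
node, blob = SuffixNode("xyz"), {}
_, version_a = node.read()
_, version_b = node.read()
won_a = try_publish(node, blob, "aaa", b"metadata from A", version_a)
won_b = try_publish(node, blob, "bbb", b"metadata from B", version_b)
```

Under this reading the losing writer leaves an unreferenced core.metadata.bbb behind, which is harmless because readers only follow the suffix stored in Zookeeper.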
13. Nodes are “stateless”
Local storage on Nodes can disappear at any time
Transaction log is not durable
Pushing to Blob is the only durability option
→ Requires hard commits before ACK
14. SHARED replica type
- Does not forward any indexing from leader (splits excepted)
- Does not replicate
- Pushes to Blob (leader), pulls from Blob (others)
- Leader election not required; best-effort “leader” selection would be sufficient
- Imposes commits
15. Minimum number of replicas
- Replicas no longer used for durability (Blob takes care of that).
- Do we need 2 replicas tracked in ZK for fast failover?
- Can we have only 1?
- Goal is 0...
16. Indexes are loaded when needed
Large number of tenants, not all of them active.
Loading all indexes is not possible (too many).
Only the working set should be in memory.
Some indexes are on the local disk of a Solr Node but not open.
Some indexes are on the Blob store but not on any Solr Node.
17. Indexing Performance Impacts
Each commit is more expensive
● Push of new segments to Blob
● Write to Zookeeper
Commit amplification
● Every update request needs an implicit commit
● Commit amplification causes write amplification
18. Data Loss Scenario (without implicit commits)
[Diagram: a client, Node 1 and Node 2 each hosting Shard 1 on local storage, and the Blob store, with steps numbered 1 to 5; the label “leader” appears twice.]
19. Reducing commit cost
● Put transaction log on blob store
■ Exchanges pushing small segment files for pushing transaction logs
■ Does not resolve commit amplification
■ Adds additional complexity
● Index with multiple threads to compensate for latency
● Flushing to blob store pre-commit
■ Start writing large files early
■ Max-sized segments won’t be merged
■ Directory based implementation may help
20. Reducing commits
● Share commits
■ Concurrent batches could share commits
■ Implement via waitFlush=true with commitWithin
■ Adds efficiency, but increases latency
● Increase batch size
■ Add delay (if possible) to coalesce incremental updates
● “Best effort” indexing flag
■ Requires good client strategy for detecting & fixing missing data
● Client update partitioning
■ Indexing fanout can be largest contributor to commit amplification
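As a small illustration of coalescing incremental updates before they are sent, the sketch below keeps only the last update per document id within a delay window. It assumes whole-document updates where a later update supersedes an earlier one; the function name and the document shape (a dict with an "id" key) are hypothetical.

```python
def coalesce(updates):
    """Keep only the most recent update for each document id.

    updates is an iterable of dicts with an "id" key, in arrival order.
    The result is ordered by the time of each id's last update.
    """
    latest = {}
    for doc in updates:
        latest.pop(doc["id"], None)  # drop the superseded update, if any
        latest[doc["id"]] = doc
    return list(latest.values())


window = [
    {"id": "Doc1", "price": 10},
    {"id": "Doc2", "price": 7},
    {"id": "Doc1", "price": 12},
]
batch = coalesce(window)
```

Here three incoming updates become a two-document batch, so fewer documents (and potentially fewer batches and commits) reach Solr.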
21. Commit Amplification from Fanout
● One high-level request turns into N sub-requests
■ Each one needs an implicit commit
● O(num_shards * num_batches)
● Each request limited by slowest shard
● Many small writes to Blob
[Diagram: two SolrJ clients, one sending Doc1 to Doc4 and the other Doc5 to Doc8, with each batch fanning out across Shard1 to Shard5.]
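The O(num_shards * num_batches) bound is easy to put numbers on. The helper below is a back-of-the-envelope sketch; the figures (100 batches, 5 shards) are made up for illustration.

```python
def implicit_commits(num_batches, shards_touched_per_batch):
    """Number of implicit commits when every sub-request needs its own commit."""
    return num_batches * shards_touched_per_batch


# 100 batches whose documents spread over all 5 shards:
fanned_out = implicit_commits(num_batches=100, shards_touched_per_batch=5)

# The same 100 batches if each batch only touched a single shard:
aligned = implicit_commits(num_batches=100, shards_touched_per_batch=1)
```

With these numbers the fanned-out case costs 500 commits against 100 for shard-aligned batches, and each of the 500 pushes fewer documents to Blob.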
22. Fanout Mitigation: Topology Knowledge
Line up indexing batches with shards
● Use Solr APIs to get sharding
■ Need hash range for each shard
● Hash IDs using CompositeIdRouter
● Make batches that don’t cross shards
● Hash partitioning often desired anyway to avoid version reorders!
For custom sharding
● Easy, just do it!
[Diagram: three SolrJ clients, each sending a batch to a single shard: Doc199, Doc248, Doc3743, Doc4295 to Shard1; Doc87, Doc462, Doc744, Doc2001 to Shard2; Doc322, Doc547, Doc1011, Doc2539 to Shard3.]
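A minimal sketch of lining up batches with shards follows. Two things in it are stand-ins and loudly not the real thing: the hash is CRC32 rather than the MurmurHash3-based hash that Solr's CompositeIdRouter uses, and the two-shard layout in `SHARD_RANGES` is hypothetical, where a real client would read each shard's hash range from the cluster state.

```python
import zlib
from collections import defaultdict

# Hypothetical layout; real hash ranges come from the Solr cluster state.
SHARD_RANGES = {
    "shard1": (0x00000000, 0x7FFFFFFF),
    "shard2": (0x80000000, 0xFFFFFFFF),
}


def stand_in_hash(doc_id):
    # CRC32 is only a placeholder for Solr's document hash.
    return zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF


def shard_for(doc_id, ranges=SHARD_RANGES):
    h = stand_in_hash(doc_id)
    for shard, (low, high) in ranges.items():
        if low <= h <= high:
            return shard
    raise ValueError(f"no shard covers hash {h:#010x}")


def batches_by_shard(doc_ids, batch_size, ranges=SHARD_RANGES):
    """Group ids per shard, then cut each group into batches of batch_size."""
    per_shard = defaultdict(list)
    for doc_id in doc_ids:
        per_shard[shard_for(doc_id, ranges)].append(doc_id)
    batches = []
    for shard, ids in per_shard.items():
        for start in range(0, len(ids), batch_size):
            batches.append((shard, ids[start:start + batch_size]))
    return batches
```

Every batch returned by `batches_by_shard` targets exactly one shard, so a request built from it does not fan out.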
23. Fanout Mitigation: Hash Partitioning
Simpler approach: partition by hash
● Utilizes Solr hash, but not topology
● Scale number of partitions by how much data needs indexing
● Under-partitioning: no harm if not enough data for optimal sized batches anyway
[Diagram: two SolrJ clients indexing into Shard 1 to Shard 6; one sends Doc2, Doc4, Doc5, Doc7 and the other Doc1, Doc3, Doc6, Doc8. The hash ring is split into two client partitions, 0x80000000-0xFFFFFFFF and 0x00000000-0x7FFFFFFF.]
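Partitioning by hash without looking at the topology can be sketched as slicing the 32-bit hash ring into equal ranges. As an assumption for brevity, the hash here is CRC32, a placeholder for the Solr hash the slide refers to, and the number of partitions is required to be a power of two so that the ranges divide the ring exactly.

```python
import zlib


def stand_in_hash(doc_id):
    # Placeholder hash; a real client would use the same hash as Solr.
    return zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF


def partition_range(partition, num_partitions):
    """Inclusive (low, high) hash range covered by one partition."""
    width = (1 << 32) // num_partitions
    return partition * width, (partition + 1) * width - 1


def partition_of(doc_id, num_partitions):
    """Index of the hash-ring partition that a document id falls into."""
    if num_partitions < 1 or num_partitions & (num_partitions - 1):
        raise ValueError("num_partitions must be a power of two")
    width = (1 << 32) // num_partitions
    return stand_in_hash(doc_id) // width
```

With two partitions, `partition_range(0, 2)` and `partition_range(1, 2)` give 0x00000000-0x7FFFFFFF and 0x80000000-0xFFFFFFFF, the two ranges shown on the slide.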
24. Fanout Mitigation: Hash Partitioning 2
Over-partitioning: more client partitions than shards.
● Again, no harm
● Increases parallelism
● Easy to decrease parallelism if desired
■ Keep num partitions the same
■ Use fewer indexing threads
[Diagram: four SolrJ clients indexing into Shard1 and Shard2, with batches Doc2, Doc4, Doc5, Doc7; Doc1, Doc3, Doc6, Doc8; Doc9, Doc14, Doc11, Doc15; and Doc12, Doc13, Doc16, Doc10.]
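Keeping the number of partitions fixed while lowering parallelism amounts to running the same partitions through a smaller thread pool. In the sketch below, `send_batch` is a hypothetical callable standing in for whatever actually sends one partition's documents to Solr.

```python
from concurrent.futures import ThreadPoolExecutor


def index_partitions(partitions, num_threads, send_batch):
    """Index every partition with at most num_threads running concurrently.

    Results come back in partition order, whatever the completion order.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(send_batch, partitions))


def fake_send(batch):
    # Hypothetical sender: a real one would issue an update request.
    return len(batch)


partitions = [
    ["Doc2", "Doc4", "Doc5", "Doc7"],
    ["Doc1", "Doc3", "Doc6", "Doc8"],
    ["Doc9", "Doc14", "Doc11", "Doc15"],
    ["Doc12", "Doc13", "Doc16", "Doc10"],
]
sent = index_partitions(partitions, num_threads=2, send_batch=fake_send)
```

Changing `num_threads` from 4 to 2 halves the parallelism without touching how documents are assigned to partitions.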
25. Hash Partitioning with Composite Ids
● Composite ids like “yonik!email1” do not distribute randomly on hash ring
■ Use splitByPrefix flag for prefix-aware shard splitting
● Determine amount of data to be indexed for collection:
● If small amount, just send it
● If large amount, analyze the different prefixes
■ Tiny prefixes: send them together as a single partition
■ Medium prefixes: use the prefix as a separate partition
■ Large prefixes: create multiple partitions by evenly splitting the prefix range
● Maybe pick number of partitions that is a power of 2
■ Index partitions with any number of threads <= num_partitions
● If partitions > threads, don’t concurrently index partitions next to each other
● Note: beware changing indexing partitions on-the-fly
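The tiny / medium / large rule for composite id prefixes can be sketched as a small planning function. The thresholds `tiny_max` and `target_size` are hypothetical tuning knobs, and choosing the number of slices for a large prefix as the document count divided by `target_size` (rounded up) is an assumption; the slide only says to split the prefix range evenly.

```python
import math


def plan_partitions(prefix_counts, tiny_max, target_size):
    """Turn per-prefix document counts into a list of indexing partitions.

    Each partition is a dict with the prefixes it covers and a
    (slice_index, num_slices) pair describing its share of the prefix range.
    """
    tiny, plan = [], []
    for prefix, count in sorted(prefix_counts.items()):
        if count <= tiny_max:
            tiny.append(prefix)  # tiny: sent together
        elif count <= target_size:
            plan.append({"prefixes": [prefix], "slice": (0, 1)})  # medium
        else:
            num_slices = math.ceil(count / target_size)  # large: split evenly
            plan.extend(
                {"prefixes": [prefix], "slice": (i, num_slices)}
                for i in range(num_slices)
            )
    if tiny:
        plan.insert(0, {"prefixes": tiny, "slice": (0, 1)})
    return plan


plan = plan_partitions(
    {"a": 5, "b": 8, "c": 500, "d": 2500}, tiny_max=10, target_size=1000
)
```

For these made-up counts, prefixes a and b share one partition, c gets its own, and d is split into three slices, for five partitions in total.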
26. Shard splits
Plan for “online” splitting same as for non-blob
■ Forward updates from shard leader to sub-shard leaders
■ Transaction log on sub-shard leader buffers
■ If sub-shard leader dies, split should fail
28. Summary
Elasticity IS THE initial goal.
Blob is more cost-effective than Block: fewer replicas!
Easily shut down servers, and still be able to serve all data.
29. What’s next
Status of the work
Open Source
https://issues.apache.org/jira/browse/SOLR-13101
https://github.com/apache/lucene-solr/pull/864