In these slides we cover the following topics:
an introduction to NoSQL databases and the fundamentals of search engines;
then an introduction to the Elasticsearch search tool, its use cases, overall architecture, and a comparison with similar tools;
adding a text analyzer, and finally integrating it with .NET.
Deep Dive on ElasticSearch Meetup event on 23rd May '15 at www.meetup.com/abctalks
Agenda:
1) Introduction to NOSQL
2) What is ElasticSearch and why is it required
3) ElasticSearch architecture
4) Installation of ElasticSearch
5) Hands on session on ElasticSearch
3. NoSQL doesn’t mean no data
• It’s not anti-SQL, nor the absence of SQL
• N(ot) O(nly) SQL
• Non-relational Databases
Why not SQL?
• Internet scale
• 100s of millions of concurrent users
• Massive data collections –
Terabytes to Petabytes of data
• 24/7 across the globe
What is NoSQL good for? BIG DATA
• High Availability
• High Performance
• Horizontal Scalability
5. CAP theorem (Consistency, Availability and Partition tolerance)
• Consistency - all nodes see the same data at the same time.
• Availability - the system is always on, no downtime.
• Partition tolerance - the system continues to function even if the communication among the servers is unreliable.
• SQL - CA
• NoSQL - AP
6. NoSQL ACID Trade-offs
• Dropping Atomicity lets you shorten the time tables (sets
of data) are locked.
MongoDB, CouchDB.
• Dropping Consistency lets you scale up writes across
cluster nodes.
Riak, Cassandra.
• Dropping Durability lets you respond to write commands
without flushing to disk.
Memcache, Redis.
7. ACID vs. BASE
• ACID:
• Atomic: Every transaction either succeeds completely or is rolled back
• Consistent: Every transaction leaves database in a valid (consistent) state
• Isolation: Transactions don’t interfere with each other
• Durable: Completed transactions persist, even when servers restart
8. ACID vs. BASE
• BASE
• Basic Availability: The data store is available most of the time, even during partial failures
• Soft state: The state of the store may change over time, even without new input, as replicas converge
• Eventual consistency: The data store can have conflicting transactions, but should eventually reach a valid state
9. Different types of NOSQL
Key-Value Store
• A key that refers to a payload
• MemcacheDB, Azure Table Storage, Redis
Graph Store
• Nodes are stored independently, and the relationship between
nodes (edges) are stored with data
• Neo4j
Column Store
• Column data is saved together, as opposed to row data
• Super useful for data analytics
• Cassandra, Hypertable
Document / XML / Object Store
• Key (and possibly other indexes) point at a serialized object
• DB can operate against values in document
• MongoDB, CouchDB, RavenDB, ElasticSearch
10. NoSQL document example
{
"_id": ObjectId,
"description": String,
"total": Number,
"notes": [{
"_id": ObjectId,
"text": String
}],
"exclusions": [{
"_id": ObjectId,
"text": String
}],
"categories": {
"ref1": {
"name": String,
"status": String,
"price": Number
},
"ref2": {
"name": String,
"status": String,
"price": Number
}
}
}
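As a sketch, the schema above might be instantiated and traversed like this in Python. Plain dicts stand in for the BSON types (ObjectId becomes a string, Number becomes a float), and every value is invented for illustration:

```python
# Hypothetical instance of the document schema above; all values are
# illustrative, and plain Python types replace ObjectId/String/Number.
doc = {
    "_id": "a1",
    "description": "Spring catalogue",
    "total": 42.5,
    "notes": [{"_id": "n1", "text": "rush order"}],
    "exclusions": [{"_id": "e1", "text": "no weekend delivery"}],
    "categories": {
        "ref1": {"name": "books", "status": "active", "price": 30.0},
        "ref2": {"name": "media", "status": "inactive", "price": 12.5},
    },
}

# Unlike a relational row, the nested structures are read directly,
# with no joins across tables:
active = [c["name"] for c in doc["categories"].values()
          if c["status"] == "active"]
print(active)
```

This is the trade the document model makes: related data is embedded in one record, so a single lookup replaces a multi-table join.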
ORM (Object-relational mapping)
● Entity Framework (.NET)
● Hibernate (Java)
● Django (Python)
● Sequelize (JS) is an ORM for Node.js and io.js
j = { name : "mongo" }
k = { x : 3 }
db.things.insert( j )
db.things.insert( k )
db.things.find()
{"_id" : ObjectId("4c2209f9f3924d31102bd84a"), "name" : "mongo"}
{"_id" : ObjectId("4c2209fef3924d31102bd84b"), "x" : 3 }
ODM (Object Document Mapper)
● Mongoose (Mongo)
11. Use SQL if:
● Data integrity is essential
● You want standards-based, proven technologies with good developer experience and support
● Logically related, discrete data requirements can be identified up-front
● Prefer SQL
12. Use NoSQL if:
● Data requirements are unrelated, indeterminate or evolving
● Project objectives are simpler or less specific and allow starting to code immediately
● Speed and scalability are imperative
● Prefer NoSQL
16. NoSQL Database Type Comparison
Data Model               Performance  Scalability      Flexibility  Complexity
Key-Value Store          high         high             high         none
Column-Oriented Store    high         high             moderate     low
Document-Oriented Store  high         variable (high)  high         low
Graph Database           variable     variable         high         high
Relational Database      variable     variable         low          moderate
17. Summary
SQL - works great, isn’t scalable for large data 😞
NoSQL - works great, isn’t suitable for everyone 😞
SQL + NoSQL 😊
19. What is a search engine?
• Efficient indexing of data
• On all fields / combination of fields
• Analyzing data
• Text search
• Tokenizing
• Stemming
• Filtering
• Understanding locations
• Date parsing
• Relevance scoring
20. Tokenizing
• Finding word boundaries
• Not just explode(‘ ‘, $text);
• Chinese has no spaces. (Not every single character is a word.)
• Understand patterns:
• URLs
• Emails
• #hashtags
• Twitter @mentions
• Currencies (EUR, €, …)
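The contrast between naive splitting and pattern-aware tokenizing can be sketched like this. The regular expression is a toy assumption for illustration, not Lucene's tokenizer:

```python
import re

text = "Email me at ops@example.com about #elasticsearch, @kim says it costs 10 EUR"

# Naive approach: split on whitespace only (the explode(' ', $text) of the slide).
# Punctuation sticks to words and nothing is normalized.
naive = text.split(" ")

# Pattern-aware tokenizer: keep emails, #hashtags and @mentions as single
# tokens, then fall back to plain word characters.
TOKEN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|[#@]\w+|\w+")
tokens = TOKEN.findall(text)
print(tokens)
```

The naive split yields tokens like "#elasticsearch," with trailing punctuation, while the pattern-aware pass keeps "ops@example.com" and "@kim" intact and drops the comma.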
21. Stemming
• “Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form.”
• Conjugations
• Plurals
• Example:
• Fishing, Fished, Fish, Fisher > Fish
• Better > Good
• Several ways to find the stem:
• Lookup tables
• Suffix-stripping
• Lemmatization
• …
• Different stemmers for every language.
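A suffix-stripping stemmer of the kind listed above can be sketched in a few lines. These rules are a toy assumption, not the Porter algorithm, and irregular pairs like Better > Good need lookup tables or lemmatization instead:

```python
# Toy suffix-stripping rules, tried longest-first (illustrative, not Porter).
SUFFIXES = ["ing", "ed", "er", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([stem(w) for w in ["fishing", "fished", "fisher", "fish"]])
```

All four forms reduce to "fish", matching the slide's example; a production stemmer adds per-language rules and exception lists.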
22. Filtering
• Remove stop words
• Different for every language
• HTML
• If you’re indexing web content, not every character is
meaningful.
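Both filtering steps above (stop words and HTML markup) can be sketched together; the stop list here is a tiny illustrative sample, not a real per-language list:

```python
import re

STOP_WORDS = {"the", "a", "of", "is", "and"}  # tiny illustrative stop list

def filter_tokens(html):
    text = re.sub(r"<[^>]+>", " ", html)       # strip HTML tags, keep the text
    tokens = re.findall(r"\w+", text.lower())  # lowercase word tokens
    return [t for t in tokens if t not in STOP_WORDS]

print(filter_tokens("<p>The power of <b>the</b> index is speed</p>"))
```

Only the meaningful words survive into the index: power, index, speed.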
23. Understanding locations
• Geocoding of locations to longitude & latitude
• Search on location:
• Bounding box searches
• Distance searches
• Searching nearby
• Geo polygons
• Searching a country
• (Note: Relational DBs also have geospatial indices.)
24. Relevance Scoring
• From the matched documents, which ones do you show first?
• Several strategies:
• How many matches in document?
• How many matches in document as percentage of length?
• Custom scoring algorithms
• At index time
• At search time
• … A combination
Think of Google PageRank.
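Two of the scoring strategies listed above, raw match count and match count as a fraction of document length, can be sketched as:

```python
def score(query_terms, doc_tokens):
    """Return (raw match count, count normalized by document length)."""
    matches = sum(1 for t in doc_tokens if t in query_terms)
    return matches, matches / len(doc_tokens)

# Illustrative corpus: a short dense document and a longer diluted one.
docs = {
    "short": ["fish", "market"],
    "long": ["fish", "and", "chips", "near", "the", "fish", "market", "today"],
}
q = {"fish", "market"}
ranked = sorted(docs, key=lambda d: score(q, docs[d])[1], reverse=True)
print(ranked)
```

Under the length-normalized strategy the short, dense document outranks the longer one even though the longer one has more raw matches; real engines blend several such signals.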
25. Apache Lucene
• “Information retrieval software library”
• Free/open source
• Supported by the Apache Foundation
• Created by Doug Cutting
• Initially written in 1999
• The latest version of Lucene is 6.5.0, released on March 27, 2017.
“There’s a Java library for that.”
29. Comparing MS SQL Full Text Search and Lucene
Feature                                               Lucene                                  MS SQL FTS
Index auto update                                     No                                      Yes
Store data in index                                   Yes                                     No
Location in RAM                                       Yes                                     No
Interface                                             API                                     SQL
Querying multiple columns                             Yes                                     Yes
Stop words, synonyms, sounds-like                     Yes                                     Yes
Custom index document structure                       Yes                                     No
Wildcards                                             Yes                                     With restrictions
Spellchecking, hit-highlighting, other extensions     Provided in “contrib” extensions        No
30. Comparing MS SQL Full Text Search and Lucene
Indexing speed, size and single-query execution time:
                          Lucene      MS SQL FTS
Indexing speed            3 MB/sec    1 MB/sec
Index size                10-25%      25-30%
Simple query              <20 ms      <20 ms
Query with custom score   <4 sec      >20 sec

Parallel query executions (10 threads, average execution time per query in ms):
                                   MS SQL FTS   Lucene (File System)   Lucene (RAM)
Cold system, simple query          56           643                    21
Cold system, complex query         19669*       859                    27
Second execution, simple query     14           8                      <5
Second execution, complex query    465          17                     9
32. Elasticsearch
• ElasticSearch is a free and open source distributed inverted index.
• Built on top of Lucene
Lucene is the most popular Java-based full-text search index implementation.
• Created by Shay Banon @kimchy
• Versions
First public release, v0.4 in February 2010
Now stable version at 5.3.0 (March 28, 2017)
• In Java, so inherently cross-platform
• Repository : github.com/elastic/elasticsearch
• Website : www.elastic.co/products/elasticsearch
33. Why ElasticSearch?
Easy to scale (Distributed)
Everything is one JSON call away (RESTful API)
Unleashed power of Lucene under the hood
Excellent Query DSL
Multi-tenancy (multi cluster, node, shards)
Support for advanced search features (Full Text)
Configurable and Extensible
Document Oriented
Schema free
Conflict management (Optimistic Concurrency Control using Versioning)
Active community
34. Easy to Scale (Distributed)
ElasticSearch is built to scale horizontally out of the box. Whenever you need to increase capacity, just add more nodes and let the cluster reorganize itself to take advantage of the extra hardware.
One server can hold one or more parts of one or more indexes, and whenever new nodes are introduced to the cluster they are just being added to the party. Every such index, or part of it, is called a shard, and ElasticSearch shards can be moved around the cluster very easily.
RESTful API
ElasticSearch is API driven. Almost any action can be performed using a simple RESTful API with JSON over HTTP.
Responses are always in JSON format.
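The "one JSON call away" claim can be sketched with Python's standard library alone. The host, index, type and document values below are illustrative assumptions, and the request is only constructed, not sent:

```python
import json
from urllib import request

# Hypothetical local node; the index/type/id URL layout mirrors the
# database/table/row mapping described later in the deck.
host = "http://localhost:9200"
url = f"{host}/products/product/1"
body = json.dumps({"name": "kettle", "price": 25}).encode()

# PUT indexes (or replaces) document 1; the same URL with GET retrieves it.
req = request.Request(url, data=body, method="PUT",
                      headers={"Content-Type": "application/json"})
# request.urlopen(req)  # uncomment against a running cluster; the reply is JSON
print(req.get_method(), req.full_url)
```

No client SDK is required: any language that can speak HTTP and serialize JSON can drive the cluster.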
35. Built on top of Apache Lucene
Apache Lucene is a high-performance, full-featured information retrieval library, written in Java. ElasticSearch uses Lucene internally to build its state-of-the-art distributed search and analytics capabilities.
Since Lucene is a stable, proven technology that continuously gains new features and best practices, it makes an ideal underlying engine to power ElasticSearch.
Excellent Query DSL (Domain Specific Language)
The REST API exposes a very complex and capable query DSL that is very easy to use. Every query is just a JSON object that can practically contain any type of query, or even several of them combined.
Using filtered queries, with some queries expressed as Lucene filters, helps leverage caching and thus speeds up common queries, or complex queries with parts that can be reused.
Faceting, another very common search feature, is simply attached to search results on request and is then ready for you to use.
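A query-DSL body of the kind described above can be sketched as a plain Python dict, written here in the bool/filter form that combines a full-text match with a cacheable filter clause. The field names are illustrative assumptions:

```python
import json

# A full-text match on "title" combined with a filter on "year";
# the filter clause is the cacheable, score-free part of the query.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "elasticsearch"}}],
            "filter": [{"range": {"year": {"gte": 2015}}}],
        }
    }
}
# The whole query is just a JSON object sent in the request body:
payload = json.dumps(query)
print(payload)
```

Because the filter does not affect scoring, its results can be cached and reused across queries that share the same clause.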
36. Multi-tenancy
Multiple indexes can be stored on one ElasticSearch installation - node or cluster. Each index can have multiple "types", which are essentially completely different indexes.
The nice thing is you can query multiple types and multiple indexes with one simple query.
Support for advanced search features (Full Text)
ElasticSearch uses Lucene under the covers to provide the most
powerful full text search capabilities available in any open source
product.
Search comes with multi-language support, a powerful query
language, support for geolocation, context aware did-you-mean
suggestions, autocomplete and search snippets.
Script support in filters and scorers
37. Configurable and Extensible
Many of ElasticSearch's configurations can be changed while ElasticSearch is running, but some require a restart (and in some cases re-indexing). Most configurations can be changed using the REST API too.
ElasticSearch has several extension points - namely site plugins (which let you serve static content from ES, like monitoring JavaScript apps), rivers (for feeding data into ElasticSearch), and plugins that add modules or components within ElasticSearch itself. This allows you to swap out almost every part of ElasticSearch, if you so choose, fairly easily.
Document Oriented
Store complex real-world entities in ElasticSearch as structured
JSON documents. All fields are indexed by default, and all the
indices can be used in a single query, returning results at
breathtaking speed.
Per-operation Persistence
ElasticSearch's primary motto is data safety. Document changes are
recorded in transaction logs on multiple nodes in the cluster to
minimize the chance of any data loss.
38. ElasticSearch allows you to get started easily. Send a
JSON document and it will try to detect the data
structure, index the data and make it searchable.
Schema free
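As a rough illustration of this dynamic, schema-free behavior: ElasticSearch inspects the JSON values of a first-seen field to pick a field type. The toy function below mimics only the simplest cases and is not the real detection logic.

```python
# Toy sketch of dynamic mapping: guess a field type from a JSON value,
# roughly as ElasticSearch does for first-seen fields. Illustrative only.
def guess_field_types(doc):
    def guess(value):
        if isinstance(value, bool):   # bool must be checked before int
            return "boolean"
        if isinstance(value, int):
            return "long"
        if isinstance(value, float):
            return "double"
        return "string"
    return {field: guess(value) for field, value in doc.items()}

print(guess_field_types({"name": "MacBook Pro", "price": 1999, "in_stock": True}))
# -> {'name': 'string', 'price': 'long', 'in_stock': 'boolean'}
```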
Conflict management
Optimistic version control can be used where
needed to ensure that data is never lost due to
conflicting changes from multiple processes.
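The compare-and-set idea behind optimistic version control can be sketched as below; the class and method names are hypothetical, not the real ElasticSearch API, which exposes the same idea through a document `version` value.

```python
# Toy sketch of optimistic version control: an update succeeds only if the
# caller's expected version matches the stored one. Names are illustrative.
class VersionConflict(Exception):
    pass

class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def index(self, doc_id, body, expected_version=None):
        current = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected {expected_version}, found {current}")
        self._docs[doc_id] = (current + 1, body)
        return current + 1

store = VersionedStore()
v1 = store.index("1", {"name": "MacBook"})                          # new doc -> 1
v2 = store.index("1", {"name": "MacBook Pro"}, expected_version=1)  # ok -> 2
print(v1, v2)
```

A second writer still holding version 1 would now get a conflict instead of silently overwriting the newer document.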
Active community
The community, besides creating nice tools and plugins, is very helpful and supportive.
The overall vibe is really great, and this is an important metric for any OSS project.
There are also some books currently being written by community members, and many blog
posts around the net sharing experiences and knowledge.
39. Terminology
SQL                    ElasticSearch
Database               Index
Table                  Type
Row                    Document
Column                 Field
Schema                 Mapping
Index                  Everything is indexed
SQL query              Query DSL
SELECT * FROM table …  GET http://…
UPDATE table SET …     PUT http://…
50. What Does It Add To Lucene?
• RESTful Service
• JSON API over HTTP
• Want to use it from .Net, Python, PHP, …?
• cURL requests, as if you'd make requests to the Facebook Graph API.
• High Availability & Performance
• Clustering
• Distributed system on top of Lucene
• Provides other supporting features like thread pools, queues, node/cluster
monitoring APIs, data monitoring APIs, cluster management, etc.
51. … vs. SOLR
• +
• Also built on Lucene
• So similar feature set
• Also exposes Lucene functionality, like ElasticSearch, so it is easy to
extend.
• A part of Apache Lucene project
• Perfect for Single Server search
• -
• ElasticSearch is easier to use and maintain
• Clustering is there, but it's definitely not as simple as ElasticSearch's
• Fragmented code base. (Lots of branches.)
56. Cluster
logical grouping of multiple nodes
Nodes in a cluster either store data or just help speed up search queries.
Node
An elasticsearch server instance
Usually you should have one node per server
Master – in charge of managing cluster-wide operations
Only one, responsible for distribution/balancing of shards
No bottleneck for queries
Master node is chosen automatically by the cluster
Shard
A shard in elasticsearch is a Lucene index, and a Lucene index is broken down into segments.
low-level worker instance that holds a slice of all data
Each document belongs to a single primary shard
Number of primary shards is fixed at index creation
Determines how documents are distributed across shards
Replica
A copy of a primary shard on a different node (increases failover resilience + [search] performance)
Automatic Master detection + failover
Spreading over nodes => done automatically
Can be created any time
Scalability - Architecture
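The shard/replica split above is configured per index at creation time; a minimal settings body might look like the sketch below (index name and counts are illustrative).

```python
import json

# Settings body for index creation. The shard count is immutable after
# creation; the replica count can be changed at runtime. Values illustrative.
create_index_body = {
    "settings": {
        "number_of_shards": 3,    # fixed at creation, spreads data over nodes
        "number_of_replicas": 1,  # adjustable any time, adds failover + read capacity
    }
}
print(json.dumps(create_index_body))
```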
64. Data Synchronization
ElasticSearch is typically not the primary data store.
Implement a queue, or use rivers or a Windows service.
A river is a pluggable service running within elasticsearch cluster pulling data (or being pushed with data)
that is then indexed into the cluster. (https://github.com/jprante/ElasticSearch-river-jdbc)
Rivers are available for MongoDB, CouchDB, RabbitMQ, Twitter, Wikipedia, MySQL, etc.
The relational data is internally transformed into structured JSON objects for the schema-less
indexing model of ElasticSearch documents.
The plugin can fetch data from different RDBMS sources in parallel, and its multithreaded bulk mode
ensures high throughput when indexing into ElasticSearch.
Typically, a worker role is implemented as a layer within the application to push data/entities to ElasticSearch.
68. Kibana
•Powerful front-end dashboard for visualizing information indexed in an elastic cluster.
•Capable of presenting historical data in the form of graphs, charts, etc.
•Enables real-time search of indexed information.
Data Visualization + Data Discovery
75. Analysis pipeline for "Harry Potter and the Goblet of Fire"
Step 1: Tokenization
Tokenizer → [Harry] [Potter] [and] [the] [Goblet] [of] [Fire]
Step 2: Filtering
Lowercase filter → [harry] [potter] [and] [the] [goblet] [of] [fire]
Stop-words filter → [harry] [potter] [goblet] [fire]
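The two steps above can be re-created as a toy pipeline (word tokenizer, lowercase filter, stop-words filter); the stop-word list here is abbreviated, not the full Lucene default set.

```python
import re

# Abbreviated English stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "or", "the", "to"}

def analyze(text):
    tokens = re.findall(r"\w+", text)                   # Step 1: tokenization
    tokens = [t.lower() for t in tokens]                # Step 2a: lowercase filter
    return [t for t in tokens if t not in STOP_WORDS]   # Step 2b: stop-words filter

print(analyze("Harry Potter and the Goblet of Fire"))
# -> ['harry', 'potter', 'goblet', 'fire']
```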
77. The same analysis chain runs at indexing time and at query time.
Indexing: "Harry Potter and the Goblet of Fire"
Tokenizer → [Harry] [Potter] [and] [the] [Goblet] [of] [Fire]
Lowercase filter → [harry] [potter] [and] [the] [goblet] [of] [fire]
Stop-words filter → [harry] [potter] [goblet] [fire]
Query: "Potter"
Tokenizer → [Potter]
Lowercase filter → [potter]
Stop-words filter → [potter]
78. Analyzers
The quick brown fox jumped over the lazy dog,
bob@hotmail.com 123432.
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob@hotmail.com] [123432]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob] [hotmail] [com]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] [bob] [hotmail]
[com]
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog,]
[bob@hotmail.com] [123432.]
KeywordAnalyzer:
[The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432.]
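Two of the simpler analyzers above are easy to mimic exactly: WhitespaceAnalyzer splits on whitespace only, while SimpleAnalyzer splits on non-letters and lowercases. The sketch below reproduces their token output for the sample sentence; it is a simulation, not the Lucene code.

```python
import re

TEXT = "The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432."

def whitespace_analyzer(text):
    """WhitespaceAnalyzer-style: split on whitespace only, keep case/punctuation."""
    return text.split()

def simple_analyzer(text):
    """SimpleAnalyzer-style: split on anything that is not a letter, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(whitespace_analyzer(TEXT))
print(simple_analyzer(TEXT))
```

Note how the whitespace version keeps "dog," and "bob@hotmail.com" intact, while the simple version breaks the e-mail address apart and drops the number entirely.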
82. What is NEST?
NEST
• All request & response objects are represented
• Strongly typed Query DSL implementation
• Supports fluent syntax
• Uses ElasticSearch.net
ElasticSearch.NET
• Low-level, dependency-free client
• All ES endpoints are available as methods
ElasticSearch RESTFul API
http://nest.azurewebsites.net/
83. NEST – Connection Initialization
• Initialize an ElasticClient:
All actions on the ElasticSearch cluster are performed using the ElasticClient
For example:
• Search
• Index
• DeleteIndex/CreateIndex
• …
Uri node = new Uri("http://192.168.137.73:9200");
ConnectionSettings settings = new ConnectionSettings(node, defaultIndex: "products");
ElasticClient client = new ElasticClient(settings);
84. Index your Content - .NET
• Raw JSON string
• Type based indexation
• Modify out-of-the-box behavior using decorators
client.Raw.Index("products", "product", new JavaScriptSerializer().Serialize(prod));
client.Index(product);
[ElasticType(Name = "Product", IdProperty="id")]
public class Product
{
public int id { get; set; }
[ElasticProperty(Name = "name", Index = FieldIndexOption.Analyzed, Type = FieldType.String, Analyzer = "standard")]
public string name { get; set; }
…
85. Query your content – Query DSL .NET
• Retrieve all products from an index using a MatchAll search
• Retrieve all products by using a term query
• Search on all fields using the _all built-in property
• Search on a combination of fields using boolean operators (see Fiddler result)
result = client.Search<Product>(s => s.MatchAll());
result = client.Search<Product>(s => s.Query(q => q.Term(t => t.name, "macbook")));
result = client.Search<Product>(s => s.Query(q => q.Term("name", "macbook")));
result = client.Search<Product>(s => s.Query(q => q.Term("_all", "macbook")));
result = client.Search<Product>(s => s.Query(q => q.Term("name", "macbook") ||
q.Term("descr","macbook")));
86. Query your content – Query DSL
• Search on a combination of fields using boolean operators and a date range
filter
• Some more advanced query examples:
• Wildcard Query - use wildcards to search for relevant documents
• Span Near - search for word combinations within a certain span in the document
• More like this query - finds documents which are ‘like’ a given set of documents using
representative terms
result = client.Search<Product>(s => s
.Query(q => (q.Term("name", "macbook") || q.Term("descr", "macbook"))
&& q.Range(r => r
.OnField("price")
.Greater(1000)
.LowerOrEquals(2000)
)));
87. Query your content – Fuzzy searches
• Perform a fuzzy search to overcome query string errors
result = client.Search<Product>(s => s
.Query(q => q
.Match(m => m
.Query("makboek")
.OnField("name")
.Fuzziness(10)
.PrefixLength(1)
)));
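Fuzzy matching tolerates small typos by bounding the edit distance between the query term and indexed terms. A minimal Levenshtein implementation (a sketch, not the Lucene automaton ElasticSearch actually uses) shows why "makboek" can still match "macbook":

```python
# Levenshtein edit distance via the classic two-row dynamic program.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("makboek", "macbook"))
# -> 2 (two substitutions: c->k and o->e)
```

Any fuzziness setting of 2 or more would therefore let the misspelled query reach the correct product name.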
88. Query your content - Paging
• Select pages from the full result set using the From & Size parameters
result = client.Search<Product>(s => s
.Query(q => q.Term("name", "macbook") || q.Term("descr", "macbook"))
.From(0)
.Size(1));