The document provides an overview of distributed database architecture and search technologies. It discusses Solr and Elasticsearch, including their history, key features, use cases, and the migration process between them. A presentation covers the basics, current usage, and highlights, followed by questions. Examples are provided of companies using Elasticsearch for applications such as resume recommendations, integration, and searching large collections of documents.
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train... - Edureka!
( ELK Stack Training - https://www.edureka.co/elk-stack-trai... )
This Edureka Elasticsearch Tutorial will help you in understanding the fundamentals of Elasticsearch along with its practical usage and help you in building a strong foundation in ELK Stack. This video helps you to learn following topics:
1. What Is Elasticsearch?
2. Why Elasticsearch?
3. Elasticsearch Advantages
4. Elasticsearch Installation
5. API Conventions
6. Elasticsearch Query DSL
7. Mapping
8. Analysis
9. Modules
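As a taste of the Query DSL topic listed above, a search request body is just structured JSON. The sketch below builds a minimal match query as a Python dict; the index and field names ("title") are invented for illustration, not taken from the tutorial.

```python
import json

# A minimal Elasticsearch Query DSL body: a match query with a size limit.
# The field name "title" is a hypothetical example.
def build_match_query(field, text, size=10):
    return {
        "size": size,
        "query": {
            "match": {
                field: text
            }
        }
    }

body = build_match_query("title", "getting started", size=5)
print(json.dumps(body))
```

In a real deployment this body would be POSTed to an index's `_search` endpoint.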
Presented by David Taieb, Architect, IBM Cloud Data Services
Along with Spark Streaming, Spark SQL, and GraphX, MLlib is one of the four key architectural components of Spark. It provides easy-to-use (even for beginners), powerful machine learning APIs that are designed to work in parallel using Spark RDDs. In this session, we'll introduce the different algorithms available in MLlib, e.g. supervised learning with classification (binary and multi-class) and regression, but also unsupervised learning with clustering (K-means) and recommendation systems. We'll conclude the presentation with a deep dive into a sample machine learning application built with Spark MLlib that predicts whether a scheduled flight will be delayed. This application trains a model using real flight information. The labeled flight data is combined with weather data from the "Insight for Weather" service available on the IBM Bluemix Cloud Platform to form the training, test, and blind data. Even if you are not a black belt in machine learning, you will learn in this session how to leverage the powerful machine learning algorithms available in Spark to build interesting predictive and prescriptive applications.
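The data preparation step described above — joining labeled flight records with weather observations and splitting into training and test sets — can be sketched in miniature. All record fields and values below are invented for illustration; the actual session does this at scale with Spark MLlib and the Bluemix weather service.

```python
import random

# Hypothetical labeled flight records and weather observations,
# keyed by (airport, date).
flights = [
    {"airport": "BOS", "date": "2016-01-01", "delayed": 1},
    {"airport": "BOS", "date": "2016-01-02", "delayed": 0},
    {"airport": "JFK", "date": "2016-01-01", "delayed": 0},
    {"airport": "JFK", "date": "2016-01-02", "delayed": 1},
]
weather = {
    ("BOS", "2016-01-01"): {"temp_c": -5, "wind_kph": 40},
    ("BOS", "2016-01-02"): {"temp_c": 2, "wind_kph": 10},
    ("JFK", "2016-01-01"): {"temp_c": 1, "wind_kph": 15},
    ("JFK", "2016-01-02"): {"temp_c": -3, "wind_kph": 55},
}

# Join each labeled flight with its weather observation to form feature rows.
rows = [{**f, **weather[(f["airport"], f["date"])]} for f in flights]

# Shuffle deterministically and split into training and test sets.
random.Random(42).shuffle(rows)
split = int(len(rows) * 0.75)
train, test = rows[:split], rows[split:]
print(len(train), len(test))
```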
About the Speaker: For the last 4 years, David has been the lead architect for the Watson Core UI & Tooling team based in Littleton, Massachusetts. During that time, he led the design and development of a Unified Tooling Platform to support all the Watson Tools including accuracy analysis, test experiments, corpus ingestion, and training data generation. Before that, he was the lead architect for the Domino Server OSGi team responsible for integrating the eXpeditor J2EE Web Container in Domino and building first class APIs for the developer community. He started with IBM in 1996, working on various globalization technologies and products including Domino Global Workbench (used to develop multilingual Notes/Domino NSF applications) and a multilingual Content Management system for the Websphere Application Server. David enjoys sharing his experience by speaking at conferences. You’ll find him at various events like the Unicode conference, Eclipsecon, and Lotusphere. He’s also passionate about building tools that help improve developer productivity and overall experience.
Site search is one of the core functionalities of any website. This talk provides an overview of the internal workings of CQ5 search, its limitations for implementing site search functionality, and discusses design patterns and challenges for integrating various third-party search providers with CQ5/AEM.
Do you need an external search platform for Adobe Experience Manager? - therealgaston
Experience Manager provides some basic search capabilities out of the box. In this talk, we'll explore an external search platform for implementing an Experience Manager powered, search-driven site. As an example, we will use Apache Solr as a reference implementation and describe best practices for indexing content, exposing non-Experience Manager content via search, delivering search-driven experiences, and deploying the solution in a production setting.
MLOps with a Feature Store: Filling the Gap in ML Infrastructure - Data Science Milan
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain specific languages, providing abstractions tailored to the companies’ feature engineering domains. However, a general purpose Feature Store needs a general purpose feature engineering, feature selection, and feature transformation platform.
In this talk, we describe how we built a general purpose, open-source Feature Store for ML around dataframes and Apache Spark. We will demonstrate how data engineers can transform and engineer features from backend databases and data lakes, while data scientists can use PySpark to select and transform features into train/test data in a file format of choice (.tfrecords, .npy, .petastorm, etc) on a file system of choice (S3, HDFS). Finally, we will show how the Feature Store enables end-to-end ML pipelines to be factored into feature engineering and data science stages that can each run at a different cadence.
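The core feature store operations described above — registering feature groups and joining them into a training dataset — can be sketched with a toy in-memory class. All names here are invented; a real feature store such as Hopsworks backs these operations with Spark and durable storage.

```python
# A toy in-memory "feature store": feature groups are keyed tables that can be
# joined on an entity id to produce a training dataset.
class FeatureStore:
    def __init__(self):
        self.groups = {}

    def register_feature_group(self, name, rows, key):
        self.groups[name] = {"key": key, "rows": {r[key]: r for r in rows}}

    def create_training_dataset(self, group_names, key):
        # Inner-join the groups on entity ids present in all of them.
        ids = set.intersection(*(set(self.groups[g]["rows"]) for g in group_names))
        dataset = []
        for i in sorted(ids):
            row = {key: i}
            for g in group_names:
                row.update({k: v for k, v in self.groups[g]["rows"][i].items()
                            if k != key})
            dataset.append(row)
        return dataset

fs = FeatureStore()
fs.register_feature_group("clicks", [{"user": 1, "clicks_7d": 12},
                                     {"user": 2, "clicks_7d": 3}], key="user")
fs.register_feature_group("profile", [{"user": 1, "age": 31},
                                      {"user": 2, "age": 24}], key="user")
print(fs.create_training_dataset(["clicks", "profile"], key="user"))
```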
Bio:
Fabio Buso is the head of engineering at Logical Clocks AB, where he leads the Feature Store development. Fabio holds a master's degree in cloud computing and services with a focus on data intensive applications, awarded by a joint program between KTH Stockholm and TU Berlin.
Topics: feature store, MLOps.
StreamSQL Feature Store (Apache Pulsar Summit) - Simba Khadder
Input features are the building blocks for machine learning models. You cannot have a great model without great features. By building on top of Apache Pulsar's infinite retention of events, we built infrastructure to serve features in production and to generate training datasets. It allowed our machine learning teams to change, test, and deploy personalization features at an extraordinary rate to 10s of millions of end-users.
This talk will discuss:
- What event-sourcing is and why it's so powerful for machine learning infrastructure.
- How we built the StreamSQL feature store on top of Pulsar, Flink, and Cassandra.
- How a feature store accelerates ML development.
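The event-sourcing idea in the first bullet can be shown in miniature: the source of truth is an append-only event log, and feature values are derived by folding over it. Replaying the same log with a new fold function regenerates features for training without capturing new data. The event shapes below are invented; in StreamSQL the log lives in Pulsar and the folds run in Flink.

```python
# An append-only event log (hypothetical event shapes).
events = [
    {"user": "u1", "type": "view"},
    {"user": "u1", "type": "purchase"},
    {"user": "u2", "type": "view"},
    {"user": "u1", "type": "view"},
]

# Fold the log into a per-user feature: number of "view" events.
def fold_view_counts(log):
    state = {}
    for e in log:
        if e["type"] == "view":
            state[e["user"]] = state.get(e["user"], 0) + 1
    return state

features = fold_view_counts(events)
print(features)  # {'u1': 2, 'u2': 1}
```

Changing the fold (say, counting purchases instead) and replaying the log is all it takes to deploy a new feature — which is what makes event sourcing so powerful for ML infrastructure.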
Consuming External Content and Enriching Content with Apache Camel - therealgaston
This document discusses using Apache Camel as a document processing platform to enrich content from Adobe Experience Manager (AEM) before indexing it into a search engine like Solr. It presents the typical direct integration of AEM and search that has limitations, and proposes using Camel to offload processing and make the integration more fault tolerant. Key aspects covered include using Camel's enterprise integration patterns to extract content from AEM, transform and enrich it through multiple processing stages, and submit it to Solr. The presentation includes examples of how to model content as messages in Camel and build the integration using its Java DSL.
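The route described above — extract content from AEM, push it through enrichment stages, then submit it to Solr — can be mimicked as chained processing functions. Stage names and document fields are illustrative; Camel would express this as a Java DSL route along the lines of `from(...).process(...).to(...)`.

```python
def extract(doc_id):
    # Stand-in for pulling rendered content from AEM.
    return {"id": doc_id, "body": "  Welcome to the site  "}

def normalize(doc):
    # First enrichment stage: clean up the extracted text.
    return {**doc, "body": doc["body"].strip().lower()}

def enrich(doc):
    # Second enrichment stage: add derived metadata.
    return {**doc, "word_count": len(doc["body"].split())}

def submit(doc, index):
    # Stand-in for posting the document to Solr.
    index[doc["id"]] = doc

index = {}
for doc_id in ["page-1", "page-2"]:
    doc = extract(doc_id)
    for stage in (normalize, enrich):
        doc = stage(doc)
    submit(doc, index)
print(index["page-1"])
```

Decoupling the stages this way is what makes the integration fault tolerant: a failed stage can be retried without re-extracting from AEM.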
This document discusses Apache Atlas, an open source metadata management and governance framework for Hadoop ecosystems. It provides an overview of Atlas' features for modeling and classifying metadata, integrating with components like Hive and Ranger, and its architecture using a graph database and Kafka messaging. The document also outlines use cases for lineage tracking, compliance, and data governance as well as the roadmap for additional component integration and metadata export/import capabilities.
Politics Ain’t Beanbag: Using APEX, ML, and GeoCoding In a Modern Election Ca... - Jim Czuprynski
Oracle announced in December 2019 its Spatial and Graph features are now included without additional licensing costs for Oracle databases. This means application developers now have low-cost access to powerful geolocation, routing, and mapping capabilities – a welcome addition for any Application Express (APEX) application that previously shied away from implementing those features. I'll demonstrate a real-life use case – handling the changing demands of a modern election campaign, including managing widely-dispersed volunteers and voters, using geolocation for merchandise distribution, and identifying “flippable” voters with ML and analytics – through a mobile-capable APEX application.
What's Your Super-Power? Mine is Machine Learning with Oracle Autonomous DB. - Jim Czuprynski
The document discusses leveraging machine learning capabilities with Oracle Autonomous Database and other Oracle technologies. It provides credentials for the author and an overview of several Oracle machine learning and analytics tools, including Oracle Machine Learning (OML), Oracle Analytics Cloud (OAC), and Application Express (APEX). Examples are given of building analyses with these tools using sample datasets on topics like voter demographics and electoral data. Useful documentation resources are also referenced.
Hopsworks data engineering melbourne april 2020 - Jim Dowling
This document provides information about Logical Clocks and their open-source Hopsworks platform for data-intensive AI with a feature store. It lists their leadership and offices in Stockholm, London, and Silicon Valley. It then provides details about Hopsworks and how it is used in production for finance, healthcare, and other industries. It describes common feature stores used in production and outlines key feature store concepts like features, feature groups, and training/test datasets. It shows how different types of data are ingested at different cadences into an online and offline feature store. Finally, it demonstrates how to register a feature group, create training datasets, build feature vectors for model prediction, and more using the Hopsworks feature store.
Analytics Metrics Delivery & ML Feature Visualization - Bill Liu
GoPro's data platform evolved from 2016 to 2018 to meet growing analytics needs. In 2016, it focused on batch/streaming ingestion using Spark. In 2017, it transformed to use dynamic elastic clusters, centralized the Hive metastore, and replaced HDFS with S3. In 2018, it added data democratization features like delivering analytics metrics to Slack, added the Druid OLAP database for visualization, and began building out a machine learning infrastructure.
This document discusses a tech talk given by Makoto Yui at Treasure Data on May 14, 2015. It includes an introduction to Hivemall, an open source machine learning library built on Apache Hive. The talk covers how to use Hivemall for tasks like data preparation, feature engineering, model training, and prediction. It also discusses doing real-time prediction by training models offline on Hadoop and performing online predictions using the models on a relational database management system.
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy - Jim Dowling
Spark AI Summit Europe 2019 talk: Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy. How can you do directed search efficiently with Spark? The answer is Maggy - asynchronous directed search on PySpark.
Berlin buzzwords 2020-feature-store-dowling - Jim Dowling
This document provides information about Logical Clocks and their Feature Store product. It discusses key leadership and offices for Logical Clocks. It then provides an overview of Feature Engineering and the Feature Store concepts including feature transformations, feature groups, training/test datasets, and online/offline feature stores. It demonstrates how to register and access feature groups from the feature store to create training datasets. Finally, it discusses online model serving from the online feature store and the Hopsworks platform more broadly.
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S... - Edureka!
( ELK Stack Training - https://www.edureka.co/elk-stack-trai... )
This Edureka tutorial on What Is ELK Stack will help you in understanding the fundamentals of Elasticsearch, Logstash, and Kibana together and help you in building a strong foundation in ELK Stack. Below are the topics covered in this ELK tutorial for beginners:
1. Need for Log Analysis
2. Problems with Log Analysis
3. What is ELK Stack?
4. Features of ELK Stack
5. Companies Using ELK Stack
ElasticSearch in Production: lessons learned - BeyondTrees
ElasticSearch is an open source search and analytics engine that allows for scalable full-text search, structured search, and analytics on textual data. The author discusses her experience using ElasticSearch at Udini to power search capabilities across millions of articles. She shares several lessons learned around indexing, querying, testing, and architecture considerations when using ElasticSearch at scale in production environments.
The document discusses Azure Data Lake and U-SQL. It provides an overview of the Data Lake approach to storing and analyzing data compared to traditional data warehousing. It then describes Azure Data Lake Storage and Azure Data Lake Analytics, which provide scalable data storage and an analytics service built on Apache YARN. U-SQL is introduced as a language that unifies SQL and C# for querying data in Data Lakes and other Azure data sources.
An Autonomous Singularity Approaches: Force Multipliers For Overwhelmed DBAs - Jim Czuprynski
Autonomous Database Services have expanded well beyond their original scope of heavy analytical workloads (ADW) and hybrid transaction processing / reporting workloads (ATP) to include dedicated Cloud-based instances to eliminate contention between “noisy neighbors” in the same region and domain.
I'll explain how Oracle DBAs at any skill level can immediately leverage Autonomous resources as force multipliers to free them from most mundane administration tasks so they can concentrate on mastering the new skills required to become an Enterprise Data Architect - the emerging post-DBA role – and shift their focus towards building better enterprise systems in concert with their organization’s application developers, business analysts, and business units.
Serverless SQL provides a serverless analytics platform that allows users to analyze data stored in object storage without having to manage infrastructure. Key features include seamless elasticity, pay-per-query consumption, and the ability to analyze data directly in object storage without having to move it. The platform includes serverless storage, data ingest, data transformation, analytics, and automation capabilities. It aims to create a sharing economy for analytics by allowing various users like developers, data engineers, and analysts flexible access to data and analytics.
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014 - ALTER WAY
This document discusses Elasticsearch and how it can be used to search, analyze, and make sense of large amounts of data. It provides examples of how Elasticsearch is being used by large companies to handle petabytes of data and gain insights. Implementations in France are highlighted. The document concludes by demonstrating how easily Elasticsearch can be deployed and used to ingest and search sample data.
Solr at zvents 6 years later & still going strong - lucenerevolution
Presented by Amit Nithianandan, Lead Engineer Search/Analytics New Platforms, Zvents/Stubhub
Zvents has been a user of Apache Solr since 2007 when it was very early. Since then, the team has made extensive use of the various features and most recently completed an overhaul of the search engine to Solr 4.0. We'll touch on a variety of development/operational topics including how we manage the build lifecycle of the search application using Maven, release the deployment package using Capistrano and monitor using NewRelic as well as the extensive use of virtual machines to simplify node management. Also, we’ll talk about application level details such as our unique federated search product, and the integration of technologies such as Hypertable, RabbitMQ, and EHCache to power more real-time ranking and filtering based on traffic statistics and ticket inventory.
The ELK Stack workshop covers real-world use cases and works with the participants to implement them. This includes an Elastic overview, Logstash configuration, creation of dashboards in Kibana, guidelines and tips on processing custom log formats, designing a system to scale, choosing hardware, and managing the lifecycle of your logs.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... - Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
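The core mechanic of dataset search — indexing dataset metadata so a data lake becomes searchable — can be shown with a tiny inverted index, the same idea Elasticsearch applies at scale inside the Spark driver. The dataset names and descriptions below are invented for illustration.

```python
# Hypothetical dataset catalog: name -> free-text description.
datasets = {
    "flights_2016": "scheduled flight departures with delay labels",
    "weather_obs": "hourly weather observations per airport",
    "user_clicks": "clickstream events from the web frontend",
}

# Build an inverted index: token -> set of dataset names containing it.
inverted = {}
for name, description in datasets.items():
    for token in description.split():
        inverted.setdefault(token, set()).add(name)

def search(term):
    return sorted(inverted.get(term, set()))

print(search("weather"))
```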
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution - Dmitry Anoshin
This session will cover building the modern Data Warehouse by migration from the traditional DW platform into the cloud, using Amazon Redshift and Cloud ETL Matillion in order to provide Self-Service BI for the business audience. This topic will cover the technical migration path of DW with PL/SQL ETL to the Amazon Redshift via Matillion ETL, with a detailed comparison of modern ETL tools. Moreover, this talk will be focusing on working backward through the process, i.e. starting from the business audience and their needs that drive changes in the old DW. Finally, this talk will cover the idea of self-service BI, and the author will share a step-by-step plan for building an efficient self-service environment using modern BI platform Tableau.
Enabling Self-Service Business Intelligence Using Excel - Alan Koo
This document discusses enabling self-service business intelligence using Excel. It introduces Power BI tools for Excel like Power Query for discovering and combining data from various sources. Power Pivot is for modeling and analyzing data in Excel using DAX. Power View and Power Map enable interactive visualizations. The presentation provides demonstrations of using these tools to clean, model and visualize sample sales data to gain insights. It highlights how Excel users can leverage familiar tools for self-service BI.
How to transform your data into actionable insights - Elasticsearch
Discover the strategic features of the Elastic Stack, including Elasticsearch, a data engine like no other, and Kibana, the window into the Elastic Stack.
In this session, you will learn how to:
ingest data into the Elastic Stack;
store data;
analyze data;
act on data.
See webinar recording of this presentation at: https://resource.alibabacloud.com/webinar/live.htm?&webinarId=67
In this presentation, you will learn all you need to know about Elasticsearch, one of the most widely used open source search platforms in the world. We will walk you through what Elasticsearch is, why you need it, and show common use cases. First, we will introduce Elasticsearch, cover best practices for deploying it, and show some of the salient features of the platform. In the second part of the webinar, we delve into the various use cases for Elasticsearch and show why it is an excellent platform for querying a large dataset. This includes a demo on querying a cluster. Finally, we will show how you can launch an Elasticsearch cluster on Alibaba Cloud and how to use Elasticsearch to query a large dataset for an autocomplete use case.
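For the autocomplete use case mentioned above, one common approach is a `match_phrase_prefix` query. The sketch below builds such a request body and simulates the prefix matching locally; the field name and sample titles are invented, and a real deployment would send the body to an Elasticsearch cluster.

```python
# Build a request body using match_phrase_prefix, one common way to
# implement autocomplete in Elasticsearch. The field "title" is hypothetical.
def autocomplete_query(field, prefix):
    return {"query": {"match_phrase_prefix": {field: {"query": prefix}}}}

# Local simulation of what the prefix match would return over sample titles.
titles = ["elasticsearch basics", "elastic stack overview", "kibana dashboards"]

def simulate(prefix):
    return [t for t in titles if t.startswith(prefix)]

print(autocomplete_query("title", "elas"))
print(simulate("elas"))
```

Elasticsearch also offers a dedicated completion suggester for latency-sensitive autocomplete; the choice depends on the dataset and update patterns.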
Learn more about Alibaba Cloud’s Elasticsearch offering:
https://www.alibabacloud.com/product/elasticsearch
How to turn data into analysis you can act on - Elasticsearch
Discover the strategic feature areas of the Elastic Stack: Elasticsearch, an unmatched data engine, and Kibana, the window into the Elastic Stack.
The session will cover:
Bringing data into the Elastic Stack
Storing data
Analyzing data
Acting on data
This document compares and contrasts microservice architecture (MSA) and service-oriented architecture (SOA). SOA defines application components as loosely coupled services that communicate over a network, while MSA develops applications as suites of small services communicating via lightweight mechanisms like REST. The document also discusses Netflix's transition from a monolithic to a microservices architecture led by Adrian Cockcroft, highlighting benefits like speed, autonomy, and flexibility.
AngularJS 1.x - your first application (problems and solutions) - Igor Talevski
We will talk about all aspects of building a single page application with AngularJS, and we will discuss real examples from day-to-day work. We will also cover a large amount of theory about general web development, best practices, and today's client demands. We will focus on three (3) main points: architecture, security, and real time notification.
Just the Job: Employing Solr for Recruitment Search - Charlie Hull, lucenerevolution
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
Using a case study on a major European executive recruitment company, we will show how we used Apache Lucene/Solr to build powerful, flexible, accurate and scalable search services over tens of millions of CVs and candidate records, allowing the company to completely restructure their IT provision for both local and national offices.
Solr and Elasticsearch, a performance study - Charlie Hull
The document summarizes a performance comparison study conducted between Elasticsearch and SolrCloud. It found that SolrCloud was slightly faster at indexing and querying large datasets, and was able to support a significantly higher queries per second. However, the document notes limitations to the study and concludes that both Elasticsearch and SolrCloud showed acceptable performance, so the best option depends on the specific search application requirements.
SQL Analytics for Search Engineers - Timothy Potter, Lucidworks
This document discusses how SQL can be used in Lucidworks Fusion for various purposes like aggregating signals to compute relevance scores, ingesting and transforming data from various sources using Spark SQL, enabling self-service analytics through tools like Tableau and PowerBI, and running experiments to compare variants. It provides examples of using SQL for tasks like sessionization with window functions, joining multiple data sources, hiding complex logic in user-defined functions, and powering recommendations. The document recommends SQL in Fusion for tasks like analytics, data ingestion, machine learning, and experimentation.
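The sessionization example mentioned above — using window functions to start a new session whenever the gap since a user's previous event exceeds a threshold — can be reproduced locally with SQLite, which has supported `LAG` and windowed `SUM` since version 3.25. The table and values below are invented; Fusion would run the equivalent Spark SQL over signal data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)],
)
# Flag a new session when the gap to the previous event exceeds 1800 s,
# then number sessions per user with a running sum over the flags.
rows = conn.execute("""
    SELECT user_id, ts,
           SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
    FROM (
        SELECT user_id, ts,
               CASE WHEN LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) IS NULL
                      OR ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > 1800
                    THEN 1 ELSE 0 END AS new_session
        FROM clicks
    )
    ORDER BY user_id, ts
""").fetchall()
print(rows)
```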
Oracle Business Intelligence is a product of Oracle Corporation. It is a data warehousing BI tool and is very user friendly. OBIEE has been one of the fastest-emerging reporting tools since Oracle took over Siebel. In the coming days, many existing and new projects will be migrated to OBIEE from their current reporting tools. Above all, OBIEE is easy and fun to learn: you do not need coding skills, just a little logic and familiarity with the tool. The training covers version 11g (11.1.1.6), which is a new release.
Transforming data into actionable insights - Elasticsearch
Learn about the strategic feature areas of the Elastic Stack—Elasticsearch, a data engine like no other, and Kibana, the window into the Elastic Stack.
The session will cover:
Bringing data into the Elastic Stack
Storing data
Analyzing data
Acting on data
Similar to Solr and ElasticSearch demo and speaker feb 2014 (20)
How I helped Rue La La become a one-stop ecommerce boutique - nkabra
1) Rue La La is an online shopping destination that curates fashion, home decor, and other products from boutiques. It uses data science and analytics to build a loyal customer base and lead in e-commerce.
2) In 2018, Rue La La acquired Gilt, another leading online retailer, to form Rue Gilt Groupe and leverage their combined advanced technology platform.
3) For its platform, Rue La La used AWS services like DynamoDB to store customer and product data at scale, API Gateway to access DynamoDB, and CloudFormation for repeatable infrastructure configurations. It also used data science to personalize product recommendations and pages for each customer.
How GeoPhy built a proprietary automated valuation platform for the commerci... - nkabra
GeoPhy built a proprietary automated valuation platform for commercial real estate by using data science and machine learning techniques. They gathered and linked thousands of data sources using data fusion and natural language processing. Models were trained using this extensive data to evaluate properties and provide instant valuations with a median error rate of 5.85%, outperforming traditional appraisals. This platform allows lenders and investors to access more accurate property valuations instantly instead of waiting weeks for an appraisal.
How Fleet Advantage analytics uses Predix engine and IoT with machine learning - nkabra
Fleet Advantage uses Predix data lake engine and IoT analytics to provide turnkey asset management solutions through monitoring individual vehicles and vehicle groups. Their ATLAAS software gives fleet executives pertinent fleet information and data visualizations through an easy-to-use interface to manage their fleet. Key disruptions in the commercial vehicle industry include increased telematics data, autonomous driving, electrification, and automating inspections. Fleet Advantage collects over 2PB of data monthly from 475,000 vehicles to generate health reports and reduce downtime, repairs, and maintenance costs through predictive maintenance.
Building a data science team at Michelin tyres - nkabra
Michelin tires built a data science team to help overcome business challenges. Key achievements included reducing scrap by 15%, reducing mixing cycle time by 2%, and improving demand forecasting accuracy. The team was hired both internally and externally from diverse backgrounds. An agile and transparent approach was used, with a flat structure reporting directly to the CTO. Business problems were identified and the team focused on solving problems with specific key performance indicators. Lessons included needing top management commitment, integrating with the business, managing stakeholder expectations, and recruiting and mentoring talent. The organization was made more data-driven by mapping data use, treating data as important, and creating a culture where decisions are based on data.
In-memory DB Nick Kabra June 2013 discussion at Columbia University - nkabra
The document provides a metric weightage for evaluating various in-memory databases. It includes sub-metrics across various categories like storage, server type, use cases, benchmarks, integration, performance, operation, cost, and security. Each sub-metric is assigned a specific weightage based on its importance. Vendors like Oracle, SAP Hana, Kognitio, VoltDB, GridGain, MemSQL, SQLFire, and Altibase are then rated against each sub-metric to calculate a total score.
Couchbase is a document-oriented NoSQL database that provides a distributed key-value store with optional in-memory caching. It uses JSON documents with a schema-free approach and has built-in replication and high availability. Couchbase supports low-latency applications through its in-memory operations and integration with memcached. It allows flexible scaling through horizontal sharding of data across nodes.
Hadoop comparative scorecard nick kabra sr mgmt 04042014 and stack integrati... - nkabra
This document compares different Hadoop distribution vendors (CDH, HW/HDP, MapR, Pivotal HD) based on various metrics across features, open source philosophy, products/technologies, performance, management capabilities, and ecosystem integration. Key criteria include scalability, multi-tenancy, open source contributions, cloud/mobile products, support offerings, data processing, analytics, SQL capabilities, security, management tools, and partner technologies/connectors. The industry norm is to use two implementations to avoid over-reliance on any one vendor and leverage different feature sets.
The document discusses 5 case studies of how a large bank used big data and Hadoop to address various business challenges:
1. Customer Risk Profiling - Analyzing customer data across siloed systems to build more accurate risk scores and improve risk management.
2. Trade Surveillance and Reporting - Monitoring trading data to detect illegal activity like fraud and ensure compliance with regulations.
3. Online Account Opening - Analyzing structured and unstructured data to validate identities and detect fraud in online account openings.
4. Legacy Migration to Hadoop - Migrating legacy customer data to Hadoop to build more accurate customer risk scores faster and at lower cost.
5. ATM/Mobile Adjustment Data
The document compares several compression formats used in Hadoop: Snappy, LZ4, LZO, bzip2, gzip, and zlib. It provides information on their algorithms, file extensions, Java support, strengths, weaknesses, compression ratios, and speeds on sample data. Snappy is the fastest for compression but slowest for decompression. LZ4 and LZO provide very fast compression and decompression. Bzip2 achieves the highest compression ratio but is the slowest overall.
The document discusses data compression in Hadoop. There are several benefits to compressing data in Hadoop including reduced storage needs, faster data transfers, and less disk I/O. However, compression increases CPU usage. There are different compression algorithms and file formats that can be used including gzip, bzip2, LZO, LZ4, zlib, snappy, Avro, SequenceFiles, RCFiles, ORC, and Parquet. The best options depend on factors like the data, query needs, support in Hadoop distributions, and whether the data schema may evolve. Columnar formats like Parquet provide better query performance but slower write speeds.
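The tradeoffs discussed above can be observed directly with the codecs in the Python standard library: zlib (the deflate algorithm used by gzip), bz2, and lzma. On repetitive text, bz2 and lzma typically compress harder than zlib at a higher CPU cost; the sample input here is synthetic, so the exact numbers are only illustrative.

```python
import bz2
import lzma
import zlib

# Highly repetitive sample data, which all three codecs compress well.
data = b"hadoop compression comparison " * 200

sizes = {
    "zlib": len(zlib.compress(data)),
    "bz2": len(bz2.compress(data)),
    "lzma": len(lzma.compress(data)),
}
for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{name}: {size} bytes (from {len(data)})")

# Round-trip check: decompression restores the original bytes.
assert zlib.decompress(zlib.compress(data)) == data
```

Timing the same loop with `time.perf_counter()` would surface the speed side of the tradeoff, mirroring the ratio-versus-speed comparison in the document.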
Future of big data nick kabra speaker compendium march 2013 - nkabra
1) The largest wave of big data value is still to come as infrastructure is used to create new applications that optimize business processes.
2) As more devices connect to the internet through technologies like IPv6, big data will continue to grow exponentially through the integration of physical and digital worlds.
3) Cloud computing trends like PaaS, DBaaS, and IaaS will continue rising with adoption, allowing data and analytics solutions to move to the center of business operations.
Big data in marketing at harvard business club nick1 june 15 2013nkabra
This document summarizes a presentation about using big data to transform marketing. It discusses the volume, velocity, variety and other characteristics of big data, and provides examples of how big data can be used for applications like investment recommendations, trading and risk management, and regulatory reporting. It also outlines considerations for starting a big data project, including identifying use cases, data sources, hardware and software requirements, and analytics approaches.
Overview of Statistical software such as ODK, surveyCTO,and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
How AI is Revolutionizing Data Collection.pdfPromptCloud
Artificial Intelligence (AI) is transforming the landscape of data collection, making it more efficient, accurate, and insightful than ever before. With AI, businesses can automate the extraction of vast amounts of data from diverse sources, analyze patterns in real-time, and gain deeper insights with minimal human intervention. This revolution in data collection enables companies to make faster, data-driven decisions, enhance their competitive edge, and unlock new opportunities for growth.
AI-powered tools can handle complex and dynamic web content, adapt to changes in website structures, and even understand the context of data through natural language processing. This means that data collection is not only faster but also more precise, reducing the time and effort required for manual data extraction. Furthermore, AI can process unstructured data, such as social media posts and customer reviews, providing valuable insights into customer sentiment and market trends.
Embrace the future of data collection with AI and stay ahead of the curve. Learn more about how PromptCloud’s AI-driven web scraping solutions can transform your data strategy. https://www.promptcloud.com/contact/
Data analytics is a powerful tool that can transform business decision-making across industries. Contact District 11 Solutions, which specializes in data analytics, to make informed decisions and achieve your business goals.
2. Presentation Agenda
• Team Introduction
• Basics and History
• Use Cases & Current Usage
• Highlights
• From the Web
• Migration
• Appendix
DISCLAIMER: This is a knowledge-sharing session and not a recommendation for any specific technology / product.
Distributed Database Architecture 2
4. Basics
• Lucene – a search engine packaged together in a set of jar files
• Used for indexing and searching
• Solr and ES take the Lucene API and build features on top; the API is accessed through a web server
• Think of a smaller version of Google, which has indexed and ranked the web's pages
• Search platform for web sites; search platform for the organization
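The indexing-and-searching idea that Lucene implements (and that Solr and ES expose) can be illustrated with a toy inverted index. This is a conceptual sketch only, not how Lucene is actually implemented:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = {
    1: "Solr is built on Lucene",
    2: "ElasticSearch is also built on Lucene",
    3: "Kibana visualizes data",
}
idx = build_index(docs)
print(sorted(search(idx, "built on lucene")))  # -> [1, 2]
```

A real engine adds tokenization, analyzers, ranking, and on-disk segment files on top of this basic structure.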
5. History
• Solr released in 2008; ES was released in 2010.
• Differences in design and architecture.
• Additional features.
6. Key Players: Solr and ElasticSearch

Solr
• Latest version: Solr 4.6.1, released on Jan 28, 2014
• Collection – main logical structure for Solr
• Architecture
  – Distributed
  – Fault tolerant with automatic replicas
  – Coordination: Apache Solr + ZooKeeper ensemble, so quorum-based
  – Leader per shard
  – Automatic leader election

ElasticSearch (ES)
• Latest version: ElasticSearch 1.0.0, released on Feb 12, 2014
• Index – main logical structure for ES
• Architecture
  – Distributed
  – Fault tolerant with automatic replicas
  – Coordination: only ElasticSearch nodes + Zen discovery; susceptible to split brain
  – Single leader
  – Automatic leader election
7. Resume recommendations - Use Case 1
Challenge
• Company ABC helps other firms hire skilled developers and project managers, empowering customers to find the right job candidate from a database of 8 million profiles.
• Needs fast and predictable performance.
• Must include geo-spatial search.
Opportunity
• Use ES as the search engine, with real-time indexing and nested querying.
Success
• Customers hire using company ABC.
• ABC stores the searches made by customers.
• Identifies candidates, skills, and compensation structures to enhance the customer search experience with better matches.
• Makes recommendations to customers on salaries, future market needs, etc.
• Eliminates duplicate profiles with real-time indexing and percolation.
• Provides an enhanced customer experience and faster responses.
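The duplicate-profile elimination above relies on ES percolation: queries are indexed, and incoming documents are matched against them before being stored. A minimal sketch of that idea follows; the field names, index name "profiles", and the simulated response are illustrative assumptions, not from the deck (the real flow in ES 1.0 registers the query under the special `.percolator` type and calls the `_percolate` endpoint):

```python
# Body for: PUT /profiles/.percolator/dup-jane  (register a saved query)
registered_query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"name": "Jane Doe"}},
                {"match": {"email": "jane@example.com"}},
            ]
        }
    }
}

# Body for: GET /profiles/profile/_percolate  (test an incoming profile)
candidate = {"doc": {"name": "Jane Doe", "email": "jane@example.com"}}

def is_duplicate(percolate_response):
    """Treat the profile as a duplicate if any registered query matched it."""
    return len(percolate_response.get("matches", [])) > 0

# Simulated percolate response for the matching candidate above.
simulated = {"total": 1, "matches": [{"_index": "profiles", "_id": "dup-jane"}]}
print(is_duplicate(simulated))  # True
```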
8. Integration - Use Case 2
THE FULL CIRCLE
• Logstash – takes logs; scrubs, parses, and enriches the data
• ElasticSearch – search and analyze in real time
• Kibana – visualization engine for dynamic dashboards created in real time or on the fly
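The full circle above can be sketched end to end in a few lines: take a raw log line (Logstash's job), parse and enrich it, and emit a JSON event that ElasticSearch could index and Kibana could chart. The log format, field names, and host value here are assumptions for illustration, not a real Logstash grok pattern:

```python
import json
import re

# Assumed log format: "<timestamp> <level> <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>INFO|WARN|ERROR) (?P<message>.*)"
)

def parse_log_line(line, source="webserver-01"):
    """Parse one log line into a dict; return None on lines we can't parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["host"] = source  # enrichment step: tag the originating host
    return event

line = "2014-03-13T10:15:00Z ERROR connection refused"
event = parse_log_line(line)
print(json.dumps(event, sort_keys=True))
```

In the real pipeline, Logstash would ship this JSON event to an ElasticSearch index and Kibana would query it for the dashboard.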
9. Chat agent for 460 million documents - Use Case 3
Challenge
• 6,000 customers from around the world use LiveChat daily to communicate with their own customers, from one-person businesses to international organizations like LG, Apple, and Adobe.
• LiveChat customers conduct 3.6 million queries and 220 million "get" operations per day on 460 million documents, and LiveChat keeps these documents updated with 70 million indexing operations every day.
Solution
• Scalability, indexing, and full-text search allow users to search through chat archives.
• Faceting makes it possible to pull various statistics for LiveChat clients.
• ES acts as a single datastore, with data updates available immediately: each document is now updated in LiveChat an average of 20 to 30 times every 20 to 60 seconds.
Advantage
• Reduced query time from 2 seconds to 100 ms
• Streamlined updating from hours to seconds
• Guaranteed maximum uptime
• Scales to meet the needs of 6,000 customers
• Stores and searches 460 million documents
• Processes 3.6 million queries per day
10. Current Uses
• Use Case 1
• Use Case 2
• Use Case 3
• Use Case 4
• Use Case X
11. Highlights
• Schema and config – solrconfig.xml, elasticsearch.yml; change the number of shards and replicas live
• Scaling – nodes auto-balanced; shard splitting (SOLR-3755) or adding documents
• Nesting (address, users & rights, boolean, parent-children)
• Index = different types of documents and analyzers
• Node discovery and fault detection – ZooKeeper
• Multiple documents per schema and parent-child
• Percolator
• Aggregations + facets in ES / facets in Solr
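On the last point, a hedged sketch of what an ES 1.0 aggregation request looks like next to its Solr facet equivalent: a terms aggregation counting documents per field value. The index layout and the "skill" field are illustrative assumptions, built here as a Python dict:

```python
import json

agg_request = {
    "size": 0,                      # only the buckets are wanted, not hits
    "aggs": {
        "skills": {
            "terms": {"field": "skill"}   # one bucket per distinct skill
        }
    }
}

# The roughly equivalent Solr request uses facet parameters:
#   /select?q=*:*&rows=0&facet=true&facet.field=skill
print(json.dumps(agg_request, sort_keys=True))
```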
12. Highlights (contd.)
1. Auto load balancer and auto-sharding
2. Marvel metrics (as of 03/13/2014)
3. Split-brain problem in ES
4. Structured query DSL and query control
5. Real-time indexing / near-real-time indexing
6. Query routing; SOLR-5816 to be introduced
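On the split-brain problem mentioned above: one common mitigation in ES 1.x (not stated in the deck, so treat this as a sketch) is to require a quorum of master-eligible nodes before a master can be elected, via `discovery.zen.minimum_master_nodes` in elasticsearch.yml. For a cluster with three master-eligible nodes:

```yaml
# elasticsearch.yml (ES 1.x): with 3 master-eligible nodes, require a
# majority (2) before electing a master, so a partitioned minority
# cannot elect its own master and split the cluster's brain.
discovery.zen.minimum_master_nodes: 2
```

The usual rule of thumb is (number of master-eligible nodes / 2) + 1.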
13. ElasticSearch / Solr funnel
• UIMA
• Text analysis debugger, spell check
• Decision tree faceting / drilldown
• Filters for queries across nested documents
• Query handling analyzer and language, term suggester, autocomplete
• Realtime GET with query routing
• Cloudera, MapR, DataStax support Solr
• Hortonworks, Couchbase support ElasticSearch
14. FROM THE WEB
This is only an FYI: we found some customers moving from Solr to ElasticSearch, but could not find any article mentioning clients that moved from ES to Solr.
Caveat: no prejudice, but it would be good to hear what customers say.
Let us also check this site: http://www.ymc.ch/en/why-we-chose-solr-4-0-instead-of-elasticsearch
http://www.mgt-commerce.com/magento-elasticsearch.html
Foursquare: http://engineering.foursquare.com/2012/08/09/foursquare-now-uses-elastic-search-and-on-a-related-note-slashem-also-works-with-elastic-search/
Jetwick: http://karussell.wordpress.com/2011/02/07/why-jetwick-moved-from-solr-to-elasticsearch/
Netricos: http://www.netricos.com/blog/posts/how-we-are-using-elastic-search
StumbleUpon: http://www.elasticsearch.org/case-study/stumbleupon/
UK govt. site: https://gds.blog.gov.uk/2012/08/03/from-solr-to-elasticsearch/
Wikimedia: http://thenextweb.com/insider/2014/01/06/wikimedia-will-replace-search-elasticsearch-beta-users-february-users-march-april/#!xDKnd
15. 2 Parts of a Whole – The Math
Solr
• Solr performs very well on small indexes that don't change very often.
ElasticSearch
• Scalability, auto-sharding, GUI admin, schemaless operation, real-time search, nested queries, routing, and the way indexing and queries are handled give ES faster query execution and better indexing, a distinct advantage.
16. Migration
Step 1
• Use the river plugin to migrate from the existing Solr cluster to ES.
Step 2
• The river pulls the content from the existing Solr cluster and indexes it into ES.
Step 3
• When you decide to switch to Elasticsearch permanently, switch your indexing to index content from your sources directly into Elasticsearch. Keeping Solr in the middle is not a recommended setup.
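Step 2's pull-and-reindex can be sketched as a small transform: take a document as returned by Solr's /select handler and emit the two lines the ES bulk API expects. The target index name "migrated", the type "doc", and the decision to drop Solr-internal fields are assumptions for illustration:

```python
import json

def solr_doc_to_bulk_lines(solr_doc, index="migrated", doc_type="doc"):
    """Turn one Solr result document into an ES bulk-API action + source pair."""
    doc = dict(solr_doc)
    doc_id = doc.pop("id")           # Solr's uniqueKey becomes the ES _id
    doc.pop("_version_", None)       # drop Solr-internal bookkeeping fields
    action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"

solr_doc = {"id": "42", "title": "hello", "_version_": 1463}
print(solr_doc_to_bulk_lines(solr_doc))
```

A real migration would page through Solr with the cursor/start parameters and POST batches of these lines to the ES `_bulk` endpoint.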
17. Conclusion
• If we have a small site and need search features without the distributed bells and whistles, both Solr and ElasticSearch are efficient.
• If we are planning a large installation that requires running distributed search with nesting, scalability, sharding, and real-time features, ElasticSearch can do a better job.
• Both products are trying to catch up based on the other product's capabilities.
18. Where do we go from here?
The best way to define this is:
• Some possible next steps…
• Questions to ask