Rujhaan.com is a news aggregation app that collects trending news and social media discussions from topics of interest to users. It uses various technologies including a crawler to collect data from social media, Apache Solr for search, MongoDB for storage, Redis for caching, and machine learning techniques like classification and clustering. The presenter discussed the technical architecture and challenges of building Rujhaan.com to provide fast, personalized news content to over 16,000 monthly users while scaling to growing traffic levels.
This presentation is useful for anyone who wants to get acquainted with the Apache Spark architecture and its top features, and to see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Was presented on Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah... (Lucidworks)
The document discusses building a large scale SEO/SEM application using Apache Solr. It describes some of the key challenges faced in indexing and searching over 40 billion records in the application's database each month. It discusses techniques used to optimize the data import process, create a distributed index across multiple tables, address out of memory errors, and improve search performance through partitioning, index optimization, and external caching.
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy (Lucidworks)
Gregg Donovan presented on lessons learned from sharding Solr at Etsy over three versions:
1) Initially, Etsy did not shard to avoid problems, but the single node approach did not scale.
2) The first sharding version used local sharding across multiple JVMs per host for better latency and manageability.
3) The current version uses distributed sharding across data centers for further latency gains, but this introduced challenges of partial failures, synchronization, and distributed queries.
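The sharding progression above can be made concrete with a toy sketch (plain Python, not Etsy's actual code): documents are routed to a shard by a stable hash of their id, and a query fans out to every shard, with the merge step being exactly where the partial-failure and synchronization challenges mentioned above appear.

```python
# Minimal sharding sketch: stable hash routing plus scatter-gather search.
import hashlib

NUM_SHARDS = 4

def shard_for(doc_id: str) -> int:
    # A stable hash so every node agrees on the placement of a document.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shards = {i: {} for i in range(NUM_SHARDS)}

def index(doc_id: str, doc: dict) -> None:
    shards[shard_for(doc_id)][doc_id] = doc

def search(predicate) -> list:
    # A distributed query touches every shard; merging the partial
    # results is where partial-failure handling would have to live.
    hits = []
    for shard in shards.values():
        hits.extend(d for d in shard.values() if predicate(d))
    return hits

index("listing-1", {"title": "hand-knit scarf"})
index("listing-2", {"title": "ceramic mug"})
results = search(lambda d: "scarf" in d["title"])
```

With local sharding, all shards live in JVMs on one host; moving to distributed sharding changes only where `shards` live, not the routing or merge logic, which is why the query-side challenges carry over.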
Jaws - Data Warehouse with Spark SQL by Ema Orhian (Spark Summit)
1) Jaws is a highly scalable and resilient data warehouse explorer that allows submitting Spark SQL queries concurrently and asynchronously through a RESTful API.
2) It provides features like persisted query logs, results pagination, and pluggable storage layers. Queries can be run on Spark SQL contexts configured to use data from HDFS, Cassandra, Parquet files on HDFS or Tachyon.
3) The architecture allows Jaws to scale on standalone, Mesos, or YARN clusters by distributing queries across multiple worker nodes, and supports canceling running queries.
Elasticsearch and Spark is a presentation about integrating Elasticsearch and Spark for text searching and analysis. It introduces Elasticsearch and Spark, how they can be used together, and the benefits they provide for full-text searching, indexing, and analyzing large amounts of textual data.
This document outlines a project to capture user location data and send it to a database for real-time analysis using Kafka and Spark streaming. It describes starting Zookeeper and Kafka servers, creating Kafka topics, producing and consuming messages with Java producers and consumers, using the Spark CLI, integrating Kafka and Spark for streaming, creating DataFrames and SQL queries, and saving data to PostgreSQL tables for further processing and analysis. The goal is to demonstrate real-time data streaming and analytics on user location data.
Centralized Log Management with the Elastic Stack (Rich Lee)
Centralized log management is implemented using the Elastic Stack including Filebeat, Logstash, Elasticsearch, and Kibana. Filebeat ships logs to Logstash which transforms and indexes the data into Elasticsearch. Logs can then be queried and visualized in Kibana. For large volumes of logs, Kafka may be used as a buffer between the shipper and indexer. Backups are performed using Elasticsearch snapshots to a shared file system or cloud storage. Logs are indexed into time-based indices and a cron job deletes old indices to control storage usage.
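The time-based retention scheme described above can be sketched as follows (a hypothetical helper, not part of the Elastic Stack API; in practice a tool such as Elasticsearch Curator or index lifecycle management would do this):

```python
# Sketch of the retention cron job: daily indices named "logs-YYYY.MM.DD",
# deleting any older than a retention window.
from datetime import date, timedelta

RETENTION_DAYS = 14

def indices_to_delete(index_names, today, retention_days=RETENTION_DAYS):
    """Return daily indices named 'logs-YYYY.MM.DD' older than the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    stale = []
    for name in index_names:
        if not name.startswith("logs-"):
            continue                      # not one of our time-based indices
        try:
            day = date(*map(int, name[len("logs-"):].split(".")))
        except (ValueError, TypeError):
            continue                      # malformed date suffix
        if day < cutoff:
            stale.append(name)
    return stale

names = ["logs-2024.01.01", "logs-2024.01.20", "kibana-settings"]
stale = indices_to_delete(names, today=date(2024, 1, 21))
```

The real job would then issue a delete-index request to Elasticsearch for each stale name.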
This document discusses Cloudera Search, which provides full-text search capabilities integrated with Apache Hadoop. It summarizes Cloudera Search's architecture, which uses Apache Lucene/Solr for indexing and search, Apache Flume and HBase for near real-time indexing, Apache MapReduce for batch indexing, and Apache Sentry for security. The document also discusses use cases for near real-time and batch search and concludes by encouraging questions.
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs (Lucidworks)
This document discusses Solr distributed indexing at WalmartLabs. It describes customizing an existing MapReduce indexing tool to index large XML files in a distributed manner across multiple servers. Key points covered include using two custom utilities for index generation and merging, experiments showing indexing is CPU-bound while merging is I/O-bound, and lessons learned around data locality and using n-way merging of shards for best performance. Solutions discussed include dedicating an indexing Hadoop cluster to improve I/O speeds for merging indexes.
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
The document discusses Lucidworks' Fusion product, which is a search platform that enhances Apache Solr. It provides connectors to various data sources, integrated ETL pipelines, built-in recommendations, and security features. The document outlines Fusion's architecture, demo use cases for basic and code search, and next steps for integrating additional analysis tools like OpenGrok.
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati... (Spark Summit)
This document summarizes Uber's use of Spark as a data platform to support multi-tenancy and various data applications. Key points include:
- Uber uses Spark on YARN for resource management and isolation between teams/jobs. Parquet is used as the columnar file format for performance and schema support.
- Challenges include sharing infrastructure between many teams with different backgrounds and use cases. Spark provides a common platform.
- An Uber Development Kit (UDK) is used to help users get Spark jobs running quickly on Uber's infrastructure, with templates, defaults, and APIs for common tasks.
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat... (Databricks)
This document discusses best practices for optimizing Apache Spark applications. It covers techniques for speeding up file loading, optimizing file storage and layout, identifying bottlenecks in queries, dealing with many partitions, using datasource tables, managing schema inference, file types and compression, partitioning and bucketing files, managing shuffle partitions with adaptive execution, optimizing unions, using the cost-based optimizer, and leveraging the data skipping index. The presentation aims to help Spark developers apply these techniques to improve performance.
Presented by Mark Miller, Software Developer, Cloudera
Apache Lucene/Solr committer Mark Miller talks about how Solr has been integrated into the Hadoop ecosystem to provide full text search at "Big Data" scale. This talk will give an overview of how Cloudera has tackled integrating Solr into the Hadoop ecosystem and highlights some of the design decisions and future plans. Learn how Solr is getting 'cozy' with Hadoop, which contributions are going to what project, and how you can take advantage of these integrations to use Solr efficiently at "Big Data" scale. Learn how you can run Solr directly on HDFS, build indexes with Map/Reduce, load Solr via Flume in 'Near Realtime' and much more.
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin (Spark Summit)
This document discusses securing Spark applications. It covers encryption, authentication, and authorization. Encryption protects data in transit using SASL or SSL. Authentication uses Kerberos to identify users. Authorization controls data access using Apache Sentry and the Sentry HDFS plugin, which synchronizes HDFS permissions with higher-level abstractions like tables. A future RecordService aims to provide a unified authorization system at the record level for Spark SQL.
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson (Spark Summit)
Streaming Analytics with Spark, Kafka, Cassandra, and Akka discusses rethinking architectures for streaming analytics. The document discusses:
1) The need to build scalable, fault-tolerant systems to handle massive amounts of streaming data from different sources with varying structures.
2) An example use case of profiling cyber threat actors using streaming machine data to detect intrusions and security breaches.
3) Rethinking architectures by moving away from ETL pipelines and dual batch/stream systems like Lambda architecture toward unified stream processing with Spark Streaming, Kafka, Cassandra and Akka. This simplifies analytics and eliminates duplicate code and systems.
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target (Lucidworks)
This document summarizes Target's implementation of Solr as its search platform. It discusses how Target transitioned from Oracle-Endeca to Solr to handle its large scale data and enable more flexible relevancy controls. It describes how Target tested Solr through handling live guest traffic in two sprints and moving its typeahead functionality to the public cloud. Finally, it outlines how Target leverages key Solr capabilities like collection aliases, atomic updates, and configurable facets to synchronize designer and product launches.
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
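The lazy execution model mentioned above can be illustrated with a toy pipeline in plain Python (a conceptual sketch, not the Spark API): transformations only record work, and an action triggers it.

```python
# Conceptual model of Spark's lazy evaluation: transformations build a
# plan; nothing runs until an action like collect() is called.
class LazyDataset:
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # recorded operations, not yet executed

    def map(self, f):                  # transformation: returns a new dataset
        return LazyDataset(self._data, self._ops + [("map", f)])

    def filter(self, f):               # transformation
        return LazyDataset(self._data, self._ops + [("filter", f)])

    def collect(self):                 # action: executes the recorded plan
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

nums = LazyDataset(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = evens_squared.collect()       # work happens only here
```

In real Spark the recorded plan is a DAG over partitioned RDDs executed in parallel and, where possible, in memory, which is where the improvement over MapReduce comes from.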
The document discusses designing robust data architectures for decision making. It advocates for building architectures that can easily add new data sources, improve and expand analytics, standardize metadata and storage for easy data access, discover and recover from mistakes. The key aspects discussed are using Kafka as a data bus to decouple pipelines, retaining all data for recovery and experimentation, treating the filesystem as a database by storing intermediate data, leveraging Spark and Spark Streaming for batch and stream processing, and maintaining schemas for integration and evolution of the system.
ElasticSearch is an open source, distributed, RESTful search and analytics engine. It allows storage and search of documents in near real-time. Documents are indexed and stored across multiple nodes in a cluster. The documents can be queried using a RESTful API or client libraries. ElasticSearch is built on top of Lucene and provides scalability, reliability and availability.
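As a small illustration of the RESTful query side, here is a search request body in Elasticsearch's JSON query DSL (the index and field names are made up for this example); against a live cluster it would be POSTed to an endpoint such as /articles/_search.

```python
# Building an Elasticsearch search request body: a full-text match query
# combined with an exact-value filter inside a bool query.
import json

request_body = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "distributed search"}}],   # scored
            "filter": [{"term": {"status": "published"}}],          # unscored
        }
    },
    "size": 5,
}
payload = json.dumps(request_body)
```

Putting the exact-value condition in `filter` rather than `must` lets Elasticsearch cache it and skip scoring it, a common query-DSL idiom.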
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure such as YARN/Mesos. This talk covers a basic introduction to Apache Spark and its various components like MLlib, Shark, and GraphX, with a few examples.
Real-time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written quickly and easily in both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper, and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
A short presentation for beginners introducing machine learning: what it is, how it works, the popular machine learning techniques and learning models (supervised, unsupervised, semi-supervised, and reinforcement learning), and how they work, with various industry use cases and popular examples.
Internet traffic spikes aren't what they used to be. It is now evident that even the smallest sites can suffer the attention of the global audience. This presentation dives into techniques to avoid collapse under dire circumstances. Looking at some real traffic spikes, we'll pinpoint what part of the architecture is crumbling under the load; then, walk though stop-gaps and complete solutions.
Publish or Perish: Towards a Ranking of Scientists using Bibliographic Data... (Lior Rokach)
This document discusses bibliometrics and metrics for evaluating scientists, including the h-index and impact factor. It provides a quick guide to common bibliometric measures such as citations, the impact factor, and the h-index. It also discusses criticisms of citations as the sole metric and describes several proposed modified h-index metrics that aim to address issues like co-authorship, self-citations, and age of publications.
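The h-index mentioned above is easy to state precisely: it is the largest h such that the author has h papers with at least h citations each. A minimal sketch:

```python
# Computing the h-index: sort citation counts descending and find the
# last rank at which the count still meets or exceeds the rank.
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank        # the rank-th best paper still has >= rank citations
        else:
            break
    return h

example = h_index([10, 8, 5, 4, 3])   # 4 papers with >= 4 citations each
```

The modified h-index variants discussed in the document change the citation counts fed in (e.g. discounting self-citations or dividing by the number of co-authors) rather than this core definition.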
Decision Forest: Twenty Years of Research (Lior Rokach)
A decision tree is a predictive model that recursively partitions the covariate space into subspaces such that each subspace constitutes the basis for a different prediction function. Decision trees can be used for various learning tasks, including classification, regression, and survival analysis. Due to their unique benefits, decision trees have become one of the most powerful and popular approaches in data science. A decision forest aims to improve the predictive performance of a single decision tree by training multiple trees and combining their predictions.
The document discusses recommender systems and describes several techniques used in collaborative filtering recommender systems including k-nearest neighbors (kNN), singular value decomposition (SVD), and similarity weights optimization (SWO). It provides examples of how these techniques work and compares kNN to SWO. The document aims to explain state-of-the-art recommender system methods.
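A user-based kNN recommender of the kind described can be sketched in a few lines (the ratings data here is invented for illustration, and real systems add rating normalization, bias terms, and far larger neighborhoods):

```python
# kNN collaborative filtering: predict a rating as the similarity-weighted
# mean of the k most similar users' ratings for that item.
from math import sqrt

ratings = {  # user -> {item: rating}; toy data
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 3, "m3": 5, "m4": 4},
    "carol": {"m1": 1, "m2": 5, "m4": 2},
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(u[i] ** 2 for i in common)) * sqrt(sum(v[i] ** 2 for i in common))
    return num / den

def predict(user, item, k=2):
    neighbors = [
        (cosine(ratings[user], r), r[item])
        for name, r in ratings.items()
        if name != user and item in r
    ]
    neighbors.sort(reverse=True)          # keep the k most similar raters
    top = neighbors[:k]
    total = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / total if total else None

score = predict("alice", "m4")            # alice has not rated m4
```

SVD-based methods replace the explicit neighborhood with learned latent factors, but the prediction is still a weighted combination of observed ratings.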
When Cyber Security Meets Machine Learning (Lior Rokach)
This document discusses machine learning approaches for cyber security, specifically malware detection. It begins with an introduction to cyber security and machine learning. It then discusses using machine learning for malware detection, including analyzing files through static and dynamic analysis. The document outlines extracting features from files and using text categorization approaches. It evaluates various machine learning classifiers and features for malware detection. Finally, it discusses applying these techniques on Android devices for abnormal state detection.
Introduction to Elasticsearch with Basics of Lucene (Rahul Jain)
Rahul Jain gives an introduction to Elasticsearch and its basic concepts like term frequency, inverse document frequency, and boosting. He describes Lucene as a fast, scalable search library that uses inverted indexes. Elasticsearch is introduced as an open source search platform built on Lucene that provides distributed indexing, replication, and load balancing. Logstash and Kibana are also briefly described as tools for collecting, parsing, and visualizing logs in Elasticsearch.
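The inverted-index idea at the heart of Lucene can be sketched in a few lines of plain Python (terms map to the set of documents containing them; Lucene adds text analysis, relevance scoring such as tf-idf, and compressed posting lists on top):

```python
# A toy inverted index: term -> set of document ids, with AND queries
# answered by intersecting posting lists.
from collections import defaultdict

docs = {
    1: "elasticsearch is built on lucene",
    2: "lucene uses an inverted index",
    3: "kibana visualizes elasticsearch data",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():             # naive whitespace "analyzer"
        index[term].add(doc_id)

def search_and(*terms):
    """Documents containing every query term."""
    postings = [index[t] for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

hits = search_and("elasticsearch", "lucene")
```

Looking up postings by term, rather than scanning every document, is what makes the structure "inverted" and is the reason Lucene-based search scales.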
Making Structured Streaming Ready for Production (Databricks)
In mid-2016, we introduced Structured Streaming, a new stream processing engine built on Spark SQL that revolutionized how developers write stream processing applications without having to reason about streaming. It allows users to express their streaming computations the same way they would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive. It truly unifies batch, streaming, and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as streaming source and sink. Since then, we have put in a lot of work to make it ready for production use. In this talk, Tathagata Das will cover in more detail about the major features we have added, the recipes for using them in production, and the exciting new features we have plans for in future releases. Some of these features are as follows:
- Design and use of the Kafka Source
- Support for watermarks and event-time processing
- Support for more operations and output modes
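Event-time processing with watermarks, one of the features listed above, can be approximated conceptually (a toy model, not the Structured Streaming API): the engine tracks the maximum event time seen so far, and anything older than that maximum minus the allowed lateness is considered finalized and dropped.

```python
# Toy watermarking: count events per minute bucket, dropping events that
# arrive after the watermark (max event time seen minus allowed lateness).
from datetime import datetime, timedelta

LATENESS = timedelta(minutes=10)

class WatermarkedCounter:
    def __init__(self):
        self.max_event_time = datetime.min
        self.counts = {}           # minute bucket -> count
        self.dropped = 0

    def ingest(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - LATENESS
        if event_time < watermark:
            self.dropped += 1      # too late: its window state was finalized
            return
        bucket = event_time.replace(second=0, microsecond=0)
        self.counts[bucket] = self.counts.get(bucket, 0) + 1

counter = WatermarkedCounter()
t0 = datetime(2017, 2, 8, 12, 0)
counter.ingest(t0)
counter.ingest(t0 + timedelta(minutes=15))   # advances the watermark to 12:05
counter.ingest(t0 + timedelta(minutes=2))    # 12:02 is behind the watermark
```

The watermark is what lets the engine bound its state: once a window falls behind it, the window's aggregate can be emitted and its state discarded.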
Speaker: Tathagata Das
This talk was originally presented at Spark Summit East 2017.
Parallelizing Existing R Packages with SparkR (Databricks)
R is the latest language added to Apache Spark, and the SparkR API is slightly different from PySpark. With the release of Spark 2.0, the R API officially supports executing user code on distributed data. This is done through a family of apply() functions. In this talk, Hossein Falaki gives an overview of this new functionality in SparkR. Using this API requires some changes to regular code with dapply(). This talk will focus on how to correctly use this API to parallelize existing R packages. Most important topics of consideration will be performance and correctness when using the apply family of functions in SparkR.
Speaker: Hossein Falaki
This talk was originally presented at Spark Summit East 2017.
This document discusses Apache Kafka, an open-source distributed event streaming platform. It provides an introduction to Kafka's design and capabilities including:
1) Kafka is a distributed publish-subscribe messaging system that can handle high throughput workloads with low latency.
2) It is designed for real-time data pipelines and activity streaming and can be used for transporting logs, metrics collection, and building real-time applications.
3) Kafka supports distributed, scalable, fault-tolerant storage and processing of streaming data across multiple producers and consumers.
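The storage model behind these points can be illustrated with a toy in-memory version of Kafka's log abstraction (not the Kafka client API): a topic is an append-only log, and each consumer keeps its own offset, so many consumers read the same data independently.

```python
# Toy Kafka-style log: producers append records at increasing offsets;
# each consumer tracks its own position, so reads don't consume data.
class Topic:
    def __init__(self):
        self.log = []              # append-only record log

    def produce(self, record):
        self.log.append(record)
        return len(self.log) - 1   # offset of the new record

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0            # per-consumer position in the log

    def poll(self):
        records = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return records

clicks = Topic()
clicks.produce({"user": "u1", "page": "/home"})
clicks.produce({"user": "u2", "page": "/cart"})

c1, c2 = Consumer(clicks), Consumer(clicks)
first = c1.poll()      # c1 sees the full log
again = c1.poll()      # nothing new for c1 yet
```

Real Kafka splits each topic into partitions replicated across brokers, which is what provides the distributed, fault-tolerant properties described above; the per-consumer offset model is the same.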
Exceptions are the Norm: Dealing with Bad Actors in ETL (Databricks)
Stable and robust data pipelines are a critical component of the data infrastructure of enterprises. Most commonly, data pipelines ingest messy data sources with incorrect, incomplete or inconsistent records and produce curated and/or summarized data for consumption by subsequent applications.
In this talk, we go over new and upcoming features in Spark that enabled it to better serve such workloads. Such features include isolation of corrupt input records and files, useful diagnostic feedback to users and improved support for nested type handling which is common in ETL jobs.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
SparkSQL: A Compiler from Queries to RDDs (Databricks)
SparkSQL, a module for processing structured data in Spark, is one of the fastest SQL on Hadoop systems in the world. This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will walk away with a deeper understanding of how Spark analyzes, optimizes, plans and executes a user’s query.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
The document summarizes key concepts in machine learning, including defining learning, types of learning (induction vs discovery, guided learning vs learning from raw data, etc.), generalisation and specialisation, and some simple learning algorithms like Find-S and the candidate elimination algorithm. It discusses how learning can be viewed as searching a generalisation hierarchy to find a hypothesis that covers the examples. The candidate elimination algorithm maintains the version space - the set of hypotheses consistent with the training examples - by updating the general and specific boundaries as new examples are processed.
This document provides an introduction to machine learning. It discusses how machine learning allows computers to learn from experience to improve their performance on tasks. Supervised learning is described, where the goal is to learn a function that maps inputs to outputs from a labeled dataset. Cross-validation techniques like the test set method, leave-one-out cross-validation, and k-fold cross-validation are introduced to evaluate model performance without overfitting. Applications of machine learning like medical diagnosis, recommendation systems, and autonomous driving are briefly outlined.
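The k-fold cross-validation procedure introduced above can be sketched directly (index bookkeeping only; shuffling and stratification, which real implementations add, are omitted):

```python
# k-fold cross-validation: partition the sample indices into k folds,
# hold one fold out for evaluation each round, train on the rest.
def k_fold_splits(n_samples, k):
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    folds[-1].extend(indices[k * fold_size:])   # remainder goes to the last fold
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_splits(10, 5))
```

Leave-one-out cross-validation is the special case k = n_samples, and the test-set method is roughly the case k = 2 with only one round used.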
Machine learning involves developing systems that can learn from data and experience. The document discusses several machine learning techniques including decision tree learning, rule induction, case-based reasoning, supervised and unsupervised learning. It also covers representations, learners, critics and applications of machine learning such as improving search engines and developing intelligent tutoring systems.
This document provides an introduction to machine learning, including:
- It discusses how the human brain learns to classify images and how machine learning systems are programmed to perform similar tasks.
- It provides an example of image classification using machine learning and discusses how machines are trained on sample data and then used to classify new queries.
- It outlines some common applications of machine learning in areas like banking, biomedicine, and computer/internet applications. It also discusses popular machine learning algorithms like Bayes networks, artificial neural networks, PCA, SVM classification, and K-means clustering.
Here are the key calculations:
1) The probability that persons p and q are at the same hotel on a given day d is 10^-2 × 10^-2 × 10^-5 = 10^-9: each person visits some hotel on a given day with probability 1/100, and with 10^5 hotels the chance that they pick the same one is 10^-5.
2) The probability that p and q are at the same hotel on both of two given days d1 and d2 is 10^-9 × 10^-9 = 10^-18, since the two days are independent events.
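Carrying the arithmetic one step further (the population figures here, 10^9 people tracked over 1000 days, are the ones from the classic version of this exercise in Mining of Massive Datasets and are not stated above): even at 10^-18 per pair of people and pair of days, the number of combinations makes coincidental "evil-doer" pairs common.

```python
# Expected number of (pair of people, pair of days) coincidences under
# the independence assumptions worked out above.
from math import comb

p_same_hotel_one_day = (1 / 100) * (1 / 100) * 1e-5   # ~1e-9
p_same_hotel_two_days = p_same_hotel_one_day ** 2     # ~1e-18

people_pairs = comb(10**9, 2)        # ~5 * 10^17 pairs of people
day_pairs = comb(1000, 2)            # ~5 * 10^5 pairs of days
expected_suspicious = people_pairs * day_pairs * p_same_hotel_two_days
```

The result is on the order of 250,000 purely coincidental pairs, which is the Bonferroni-style warning the calculation is meant to deliver: a detector flagging such pairs would be swamped by false positives.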
Traackr evaluated several NoSQL database options to store its heterogeneous, unstructured web data. Document databases were the best fit due to their flexibility to store variable length text like tweets and blog posts without predefined schemas. MongoDB was selected due to its maturity, adoption, and support for ad-hoc queries and batch processing needed by Traackr in early 2010.
A global introduction to Elasticsearch presented at a Big Data meetup.
Use cases, getting started, Rest CRUD API, Mapping, Search API, Query DSL with queries and filters, Analyzers, Analytics with facets and aggregations, Percolator, High Availability, Clients & Integrations, ...
The document provides an overview of Elasticsearch and how it can be used to make software smarter. It discusses how Elasticsearch works and its advantages over other search technologies like SQL and Sphinx. The document also includes case studies of four projects that used Elasticsearch for tasks like search, recommendations, and parsing classified ads. It covers how to install and configure Elasticsearch, as well as how to query an Elasticsearch index through its RESTful API.
NoSQL: Which Way to Go? Presented at DDDMelbourne 2015 (Himanshu Desai)
The document provides an overview of the NoSQL database technologies RavenDB, MongoDB, and DocumentDB. It discusses their features around scalability, querying, indexing, availability of tooling, and performance characteristics. The technologies are compared in terms of how they handle ACID properties, availability and tooling, querying and indexing, and performance considerations. Gotchas or limitations of each technology are also briefly outlined.
Presented in DDD Melbourne on on Sat Aug 8th 2015
Himanshu Desai, Ahmed El-Harouny & Daniel Janczak
DocumentDB, Mongo or RavenDB? If you are starting out on a new project and considering a NoSQL database as an option, which one should you choose? What if the option you choose today does not work out to be the best one for your needs?
Come and join us for this session: we will take you on a journey where we explain each of these databases on its merits, compare them, and also share war stories.
http://dddmelbourne.com
This document provides an overview of NoSQL databases and their characteristics. It discusses the different eras of databases and pressures that led to the rise of NoSQL databases. It then categorizes and describes the different types of NoSQL databases, including key-value stores, document stores, column family stores, and graph databases. Specific examples like MongoDB, Cassandra, HBase, Neo4j are also outlined. The document emphasizes that the type of database chosen should depend on the problem to be solved and characteristics of the data.
Don't miss this opportunity to learn about the advantages of NoSQL. Join our webinar and discover:
What the term NoSQL means
The differences between key-value, wide-column, graph, and document stores
What the term "multi-model" means
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
The document provides an overview of Big Data technology landscape, specifically focusing on NoSQL databases and Hadoop. It defines NoSQL as a non-relational database used for dealing with big data. It describes four main types of NoSQL databases - key-value stores, document databases, column-oriented databases, and graph databases - and provides examples of databases that fall under each type. It also discusses why NoSQL and Hadoop are useful technologies for storing and processing big data, how they work, and how companies are using them.
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra (Caserta)
Businesses are generating and ingesting an unprecedented volume of structured and unstructured data to be analyzed. Needed is a scalable Big Data infrastructure that processes and parses extremely high volume in real-time and calculates aggregations and statistics. Banking trade data where volumes can exceed billions of messages a day is a perfect example.
Firms are fast approaching "the wall" of relational-database scalability. They must stop imposing relational structure on analytics data, and instead map raw trade data to a data model at low latency, persist the mapped data to disk, and handle ad-hoc requests for data analytics.
Joe discusses and introduces NoSQL databases, describing how they are capable of scaling far beyond relational databases while maintaining performance, and shares a real-world case study that details the architecture and technologies needed to ingest high-volume data for real-time analytics.
For more information, visit www.casertaconcepts.com
Alter Way Big Data Seminar - Elasticsearch - October 2014 (ALTER WAY)
This document discusses Elasticsearch and how it can be used to search, analyze, and make sense of large amounts of data. It provides examples of how Elasticsearch is being used by large companies to handle petabytes of data and gain insights. Implementations in France are highlighted. The document concludes by demonstrating how easily Elasticsearch can be deployed and used to ingest and search sample data.
This document provides an overview of MongoDB for Java developers. It discusses what MongoDB is, how it compares to relational databases, common use cases, data modeling approaches, CRUD operations, indexing, aggregation, replication, sharding, and tools for integrating MongoDB with Java applications. The document contains multiple code examples and concludes with a demonstration of building a sample app with MongoDB.
The document discusses NoSQL databases and their advantages compared to SQL databases. It defines NoSQL as any database that is not relational and describes the main categories of NoSQL databases - key-value stores, document databases, wide column stores like BigTable, and graph databases. It also covers common use cases for different NoSQL databases and examples of companies using NoSQL technologies like MongoDB, Cassandra, and HBase.
History of NoSQL and the Azure DocumentDB Feature Set (Soner Altin)
A short history of database systems, from DBMS and RDBMS to NoSQL solutions, followed by an introduction to the SQL query support of Azure DocumentDB and to integrating DocumentDB into a simple Java application from the Maven repository.
Engineering Patterns for Implementing Data Science Models on Big Data Platforms (Hisham Arafat)
A discussion of practically implementing data science models on big data platforms from an engineering perspective; an eye-opener on the engineering factors involved in designing a working solution. We use a simple text-mining example on social media analytics for brand marketing. At first it seems a simple solution; however, if you dig into the implementation aspects of even a simple analytics model, you discover the degree of complexity in each part of the solution. An abstraction of the key Big Data advantages is very helpful for selecting appropriate Big Data technology components out of a very large landscape. Two examples with references are given: one using the Lambda Architecture, and one using an unusual way of doing image processing via the Big Data abstraction provided.
This document provides an overview of Neo4j, a graph database management system. It discusses how Neo4j stores data as nodes and relationships, allowing for fast querying of connected data. Traditional relational databases struggle with complex relationships, while NoSQL databases don't support relationships at all. Neo4j addresses these issues through its native graph storage and processing capabilities. The document highlights key Neo4j features like scalability, high performance, and its Cypher query language.
Why do they call it Linked Data when they want to say...? (Oscar Corcho)
The four Linked Data publishing principles established in 2006 seem to be quite clear and well understood by people inside and outside the core Linked Data and Semantic Web community. However, not only when discussing the goodness of Linked Data with outsiders, but also when reviewing papers for the COLD workshop series, I find myself on many occasions going back to the principles to see whether some approach to Web data publication and consumption is actually Linked Data or not. In this talk we will review some of the current approaches we have for publishing data on the Web, and reflect on why it is sometimes so difficult to reach an agreement on what we understand by Linked Data. Furthermore, we will take the opportunity to describe yet another approach that we have been working on recently at the Center for Open Middleware, a joint technology center between Banco Santander and Universidad Politécnica de Madrid, to facilitate Linked Data consumption.
The document discusses NoSQL technologies including Cassandra, MongoDB, and ElasticSearch. It provides an overview of each technology, describing their data models, key features, and comparing them. Example documents and queries are shown for MongoDB and ElasticSearch. Popular use cases for each are also listed.
Similar to Case study of Rujhaan.com (A social news app)
Flipkart Strategy Analysis and Recommendation (Rahul Jain)
Flipkart is India's largest e-commerce company. It has a 40% market share in India's online retail industry, which was $64 billion in 2020 and is projected to grow to $200 billion by 2027. Flipkart has made several acquisitions to expand into related businesses like online travel, financial services, and logistics. It aims to increase its market share in key categories like mobile, electronics, fashion, and grocery. To achieve this, Flipkart plans to expand its fulfillment center network to smaller cities, focus on private labels, and increase offerings in high-engagement categories. It also aims to leverage its investments in Myntra, PhonePe and Cleartrip to drive profit
This document discusses NoSQL and the CAP theorem. It begins with an introduction of the presenter and an overview of topics to be covered: What is NoSQL and the CAP theorem. It then defines NoSQL, provides examples of major NoSQL categories (document, graph, key-value, and wide-column stores), and explains why NoSQL is used, including to handle large, dynamic, and distributed data. The document also explains the CAP theorem, which states that a distributed data store can only satisfy two of three properties: consistency, availability, and partition tolerance. It provides examples of how to choose availability over consistency or vice versa. Finally, it concludes that both SQL and NoSQL have valid use cases and a combination
This document provides an introduction to Apache Lucene and Solr. It begins with an overview of information retrieval and some basic concepts like term frequency-inverse document frequency. It then describes Lucene as a fast, scalable search library and discusses its inverted index and indexing pipeline. Solr is introduced as an enterprise search platform built on Lucene that provides features like faceting, scalability and real-time indexing. The document concludes with examples of how Lucene and Solr are used in applications and websites for search, analytics, auto-suggestion and more.
A hibernate tutorial for beginners. It describe the hibernate concepts in a lucid manner and and test project(User application with database) to get hands on over the same.
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceQuentin Reul
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and securing a formidable competitive advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
Self-Healing Test Automation Framework - HealeniumKnoldus Inc.
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Zilliz
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
Retrieval Augmented Generation Evaluation with RagasZilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
Demystifying Neural Networks And Building Cybersecurity ApplicationsPriyanka Aash
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
Demystifying Neural Networks And Building Cybersecurity Applications
Case study of Rujhaan.com (A social news app)
1. Case Study of Rujhaan.com
November 2014 Meetup
Rahul Jain
@rahuldausa
2. About Me…
• Big-data/Search consultant based out of Hyderabad, India
• Provides consulting services and solutions for Solr, Elasticsearch and other big data technologies (Apache Hadoop and Spark)
• Organizer of two meetup groups in Hyderabad:
• Hyderabad Apache Solr/Lucene
• Big Data Hyderabad
3. What does it do?
Rujhaan, which means "#interest", is a news app that aggregates trending #News and #trends, along with the #buzz around them, from social media.
It also works as a content discovery platform where users can see information based on their interests (under development).
4. What I am going to talk about
• Introduction
• Software Stack
• Crawler
• Apache Solr
• MongoDB
• Redis
• Machine Learning stack
• Classification
• Clustering
• NER
• POS Tagging
11. High level Flow: Processing
(Flow diagram)
1. Fetch (from the Internet and social media)
2. Managed Cache
3. Parse
4. HTML Cleaner
5. Junk/Spam Cleaner (Text)
6. Language Detection
7. Classification/Clustering
8. Topics Extraction
9. Scoring
10. Summary (most meaningful text of the story)
11. Store in MongoDB and index in Apache Solr
15. Crawler
• A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner.
• Web crawlers are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.
http://www.codeproject.com/Articles/13486/A-Simple-Crawler-Using-C-Sockets
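As a minimal sketch (not Rujhaan's actual crawler), the "methodical, automated" loop reduces to fetching a page, extracting its links, and queueing unvisited ones. The `extract_links` helper below is a hypothetical name, built only on Python's standard library:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A full crawler would wrap this in a loop: pop a URL from a frontier queue, download it (e.g. with `urllib.request`), store the page for indexing, and push any unseen extracted links back onto the queue.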
16. How does it work?
http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/Web
17. Search@ApacheSolr
• Enterprise search platform built on Apache Lucene
• Open source
• Highly reliable, scalable, fault tolerant
• Supports distributed indexing (SolrCloud), replication, and load-balanced querying
• http://lucene.apache.org/solr
18. High level overview
Source: http://www.slideshare.net/erikhatcher/solr-search-at-the-speed-of-light
19. Apache Solr - Features
• full-text search
• faceted search (similar to a GROUP BY clause in an RDBMS)
• scalability
– caching
– replication
– distributed search
• near real-time indexing
• geospatial search
• and many more: highlighting, database integration, rich document (e.g., Word, PDF) handling
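As a hedged illustration of how a faceted query might be composed against Solr's HTTP API (the `articles` collection and field names here are hypothetical, not taken from the presentation), note that Solr accepts repeated parameters, so the query is built from a list of pairs:

```python
from urllib.parse import urlencode

def solr_query_url(base, params):
    """Build a Solr /select URL. params is a list of (name, value)
    pairs because Solr allows repeated parameters such as several
    facet.field entries."""
    return base + "/select?" + urlencode(params)

url = solr_query_url(
    "http://localhost:8983/solr/articles",   # hypothetical collection
    [
        ("q", "title:yelp"),                 # full-text query
        ("facet", "true"),                   # enable faceting
        ("facet.field", "category_label"),   # facet on the category
        ("rows", "10"),
        ("wt", "json"),
    ],
)
```

Sending a GET request to this URL would return matching documents plus facet counts per `category_label` value.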
20. Database: #MongoDB
• Document-oriented NoSQL database
• Dynamic schema
• JSON-based
• Fast reads and writes
• Quite suitable for non-relational data
Stats:
• 2 million tweets
• 70k news articles
• ~25GB of raw HTML (unstructured) data
• ~16GB of structured data
21. Why NoSQL
• Large Volume of Data
• Dynamic Schemas
• Auto-sharding
• Replication
• Horizontally Scalable
* Some of the above operations can be achieved with enterprise-class RDBMS software, but at very high cost
22. Major NoSQL Categories
• Document databases
• pair each key with a complex data structure
known as a document.
• MongoDB
• Graph databases
• store information about networks, such as social
connections
• Neo4j
Contd.
23. Major NoSQL Categories
• Key-value stores
• Every single item in the database is stored as an attribute name (or "key") together with its value
• Riak, Voldemort, Redis
• Wide-column stores
• store data in columns together, instead of rows
• Google’s Bigtable, Cassandra and HBase
24. Sample Record (JSON)
{
"_id" : ObjectId("53f087c69144ca452acadfb0"),
"id" : "7a622c50e95d4debb1376d4f6e2d0a47",
"title" : "Yelp Swings To Profitability In Strong Q2 With $88.8M In Revenue, EPS Of $0.04",
"summary_gs" : "Today after the bell Yelp reported its second-quarter financial performance, including
revenue of $88.79 million, and a profit of $0.04 per share. The company had net income of $2.7 million
in the period, up from a $878,000 loss in the year-ago quarter. Investors had expected Yelp to lose
3 cents per share on revenue of $86.32 million. The company’s revenue tally for its most recent
quarter is up 61 percent on a year-over-year basis. The company also reported strong guidance for its
third quarter, with revenues forecasted to land in the $98 to $99 million range. ",
"link" : "http://techcrunch.com/2014/07/30/yelp-swings-to-profitability-in-strong-q2-with-88-8m-in-revenue-
eps-of-0-04/",
"category_label" : "business",
"image_url" : "http://tctechcrunch2011.files.wordpress.com/2014/04/yelp-earnings.jpg",
"score" : 38.0,
"boost" : 1.0,
"keywords" : ["news", "yelp", "revenue"]
}
25. Cache: #Redis
• Advanced in-memory key-value store
• Insanely fast
• Response times on the order of 5-10ms
• Provides cache behavior (set, get) with advanced data structures like hashes, lists, sets, sorted sets, bitmaps etc.
• http://redis.io/
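To illustrate the set/get-with-expiry semantics described above, here is a tiny in-process stand-in written in plain Python. It mimics Redis's lazy key expiry but is only a sketch of the cache behavior, not Redis itself:

```python
import time

class TinyCache:
    """Illustrates Redis-style SET/GET with an optional TTL using a
    plain dict. Redis additionally offers hashes, lists, sets,
    sorted sets, bitmaps, and much more."""
    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl=None):
        # Record an absolute expiry time if a TTL was given
        expires = time.monotonic() + ttl if ttl else None
        self._store[key] = (value, expires)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and time.monotonic() > expires:
            del self._store[key]   # lazy expiry, as Redis does on access
            return None
        return value
```

In the app, such a cache layer keeps hot items (trending topics, rendered fragments) out of MongoDB/Solr on every request.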
28. Classification
• classify a document into a predefined category
– e.g., news can be classified into business, politics, finance etc.
• documents can be text, images
• A popular one is the Naive Bayes classifier.
• Steps:
– Step 1: Train the program (build a model) using a training set with a category for each document, e.g. sports, cricket, news
– The classifier computes, for each word, the probability that the word makes a document belong to each of the considered categories
– Step 2: Test with a test data set against this model
• http://en.wikipedia.org/wiki/Naive_Bayes_classifier
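The two steps above can be sketched as a compact multinomial Naive Bayes with add-one smoothing. This is a textbook sketch of the technique, not the presenter's production classifier:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        # Step 1: build the model -- per-class word counts and priors
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        # Step 2: score each class as log prior + sum of word log-likelihoods
        best, best_lp = None, float("-inf")
        total_docs = sum(self.class_counts.values())
        for label in self.class_counts:
            lp = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in doc.lower().split():
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

With a handful of labeled articles per category, `fit` learns the per-class word distributions and `predict` assigns a new headline to the most probable category.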
29. Clustering
• clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to objects in other groups
• the groups are not predefined
• e.g., the keywords
– “man’s shoe”
– “women’s shoe”
– “women’s t-shirt”
– “man’s t-shirt”
– can be clustered into 2 categories, “shoe” and “t-shirt”, or “man” and “women”
• Popular algorithms are k-means clustering and hierarchical clustering
30. K-means Clustering
• partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
• http://en.wikipedia.org/wiki/K-means_clustering
http://pypr.sourceforge.net/kmeans.html
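The algorithm can be sketched in a few lines of plain Python (illustrative only; a real system would use an optimized library implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on tuples: assign each point to the nearest
    center, recompute each center as its cluster's mean, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # squared Euclidean distance to each current center
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters
```

On two well-separated groups of 2-D points, the loop converges in a couple of iterations to one center per group.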
31. Summarization
• Finding the most relevant text related to a story/article
• There can be multiple approaches, varying in accuracy.
• Below is our approach (flow diagram):
1. Cleaned text
2. Cluster based on stop words
3. Find low-value clusters
4. Score each cluster
5. Take the highest-scoring cluster
6. Sentence extractor
7. Some more scoring → summary text
* A summary can also be content curated by a computer system, i.e. translating the story into its own sentences (out of scope)
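As a much-simplified stand-in for the pipeline above, the sketch below scores whole sentences by the frequency of their non-stop-words and keeps the top ones; the stop-word list and the scoring rule are illustrative assumptions, not the presentation's actual algorithm:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (a real one is much larger)
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "it", "on"}

def summarize(text, n=1):
    """Return the n highest-scoring sentences, where a sentence's
    score is the mean document-frequency of its content words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    freq = Counter(words)

    def score(sentence):
        toks = [w for w in re.findall(r"[a-z']+", sentence.lower())
                if w not in STOP_WORDS]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    return sorted(sentences, key=score, reverse=True)[:n]
```

Sentences that repeat the story's dominant vocabulary score highest, which is the same intuition behind scoring and picking the best cluster in the flow above.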
32. POS (Part of Speech) Tagging
• the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context
• i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph
• 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection
• “This is a sample sentence” will be output as
• This/DT is/VBZ a/DT sample/NN sentence/NN
• We use the Stanford MaxentTagger
• http://nlp.stanford.edu/software/tagger.shtml
Tag Description
CC Coordinating conjunction
CD Cardinal number
DT Determiner
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VBD Verb, past tense
VBZ Verb, 3rd person singular present
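To show the output format, here is a toy dictionary-based tagger that reproduces the slide's example; a real tagger such as Stanford's MaxentTagger uses a trained statistical model rather than a fixed lookup table:

```python
# Toy lexicon covering only the example sentence; unknown words
# fall back to NN. Purely illustrative.
TOY_LEXICON = {"this": "DT", "is": "VBZ", "a": "DT",
               "sample": "NN", "sentence": "NN"}

def toy_pos_tag(sentence):
    """Emit the conventional word/TAG output format."""
    return " ".join(f"{w}/{TOY_LEXICON.get(w.lower(), 'NN')}"
                    for w in sentence.split())
```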
33. NER
• Identifying named entities such as person names, locations, and organizations in a text
• Needs a pre-built trained model.
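A toy gazetteer lookup to illustrate the task only; the names and labels below are made up, and real NER relies on a pre-built trained model as noted above:

```python
# Hypothetical gazetteer: entity surface form -> entity type
GAZETTEER = {
    "rahul jain": "PERSON",
    "hyderabad": "LOCATION",
    "yelp": "ORGANIZATION",
}

def toy_ner(text):
    """Return (entity, type) pairs found in the text, sorted by name."""
    lower = text.lower()
    return sorted((name, label) for name, label in GAZETTEER.items()
                  if name in lower)
```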
35. We are Hiring!
rockstar@rujhaan.com
Want to make an impact on millions of lives?
Join Us
36. Thanks!
@rahuldausa on twitter and slideshare
http://www.linkedin.com/in/rahuldausa
Join us for Solr, Lucene, Elasticsearch, Machine Learning, IR:
http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
Join us for Hadoop, Spark, Cascading, Scala, NoSQL, crawlers and all cutting-edge technologies:
http://www.meetup.com/Big-Data-Hyderabad/