This document contains an agenda for a presentation on using Hadoop and HBase for social media monitoring. The presentation covers why Hadoop and HBase are suitable technologies, challenges and lessons learned, and resources for getting started. It includes sections on the speaker's background, the social media monitoring process, using coprocessors in HBase, and testing performed on a test cluster.
Near Real Time Processing of Social Media Data with HBase (Christian Gügi)
Monitoring social media, and news media in general, amounts to building a specific kind of search engine: a news agent. Where classical search engines aim to cover all kinds of content, a news agent focuses on news and social media only. This is where content freshness matters: results have to be near real time.
To process several million news items, amounting to hundreds of megabytes per day, the system architecture has to be both reliable and massively scalable. These requirements lead to a distributed architecture and a NoSQL approach.
This talk gives a brief overview of the media monitoring use case, the system architecture based on Hadoop and HBase, and the challenges and lessons learned.
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords - June 2012
1. CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3
2. Agenda (5 June 2012)
• Why Hadoop and HBase?
• Social Media Monitoring
• Prospective Search and Coprocessors
• Challenges & Lessons Learned
• Resources to get started
3. About me
Software Architect @ sentric
Co-founder and organizer of the Swiss HUG
Contact:
christian.guegi@sentric.ch
http://www.sentric.ch
@chrisgugi
4. About sentric
• Spin-off of MeMo News AG, the leading provider for Social Media Monitoring & Analytics in Switzerland
• Big Data expert, focused on Hadoop, HBase and Solr
• Objective: transforming data into insights
6. Why Hadoop and HBase?
Social Media Monitoring Process:
Information Gathering → Information Processing → Analysis & Interpretation → Insight Presentation
7. Why Hadoop and HBase?
Requirements for Social Media Monitoring (SMM):
• Cost effective
• Highly scalable
• Reliable
• Real-time alerting
• Analytical capabilities
9. CC 2.0 by nolifebeforecoffee | http://flic.kr/p/c1UTf
10. Social Media Monitoring: Overview
Downloaded articles are matched ("match?") against the stored Search Agents.
Output: Web-UI, Reports, RT Alerts
Icons by http://dryicons.com
11. Social Media Monitoring: Solution Architecture
n Crawlers → REST → HBase (RowLog, Coprocessor)
HBase feeds the downstream systems: MySQL, Solr, Web-UI, RT Alerts
Icons by http://dryicons.com
12. Short Primer on Coprocessors: Overview
• Inspired by Google Bigtable coprocessors
• Available since HBase version 0.92
• Embed code directly into server processes
• High-level call interface for clients
• Automatic scaling, load balancing, request routing
13. Short Primer on Coprocessors: Observer Classes
• Like a database trigger
• Provides event-based hooks
• Concrete implementations:
  • RegionObserver: CRUD or DML type operations
  • MasterObserver: DDL or metadata operations and cluster administration
  • WALObserver: write-ahead-log appending and restoration
14. Short Primer on Coprocessors: Observer Execution
On the RegionServer, a client Get() runs through the observer chain:
Client:Get() → CP1:preGet() → CP2:preGet() → CP3:preGet() → HRegion:Get() → CP1:postGet() → CP2:postGet() → CP3:postGet() → client response
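The hook chain above can be sketched in plain Java. This is a simulation of the execution order only, not the real HBase API; the `RegionObserver`/`preGet`/`postGet` names simply mirror the slide:

```java
import java.util.ArrayList;
import java.util.List;

// Simulates how a region server threads a Get through every loaded
// coprocessor: all preGet() hooks fire before the region operation,
// all postGet() hooks fire after it, in registration order.
public class ObserverChain {
    public interface RegionObserver {
        void preGet(String row, List<String> trace);
        void postGet(String row, List<String> trace);
    }

    public static class LoggingObserver implements RegionObserver {
        private final String name;
        public LoggingObserver(String name) { this.name = name; }
        public void preGet(String row, List<String> trace)  { trace.add(name + ":preGet");  }
        public void postGet(String row, List<String> trace) { trace.add(name + ":postGet"); }
    }

    // Runs the chain exactly as on the slide: pre hooks, region get, post hooks.
    public static List<String> get(String row, List<RegionObserver> coprocessors) {
        List<String> trace = new ArrayList<>();
        for (RegionObserver cp : coprocessors) cp.preGet(row, trace);
        trace.add("HRegion:get");                 // the actual region operation
        for (RegionObserver cp : coprocessors) cp.postGet(row, trace);
        return trace;
    }

    public static void main(String[] args) {
        List<RegionObserver> cps = new ArrayList<>();
        cps.add(new LoggingObserver("CP1"));
        cps.add(new LoggingObserver("CP2"));
        cps.add(new LoggingObserver("CP3"));
        System.out.println(get("row-1", cps));
        // [CP1:preGet, CP2:preGet, CP3:preGet, HRegion:get,
        //  CP1:postGet, CP2:postGet, CP3:postGet]
    }
}
```

In real HBase the same ordering applies, with priorities deciding the position of each coprocessor in the chain.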
15. Short Primer on Coprocessors: Endpoint Classes
• Comparable to stored procedures
• Custom RPC protocol, used between client and region server
• Loaded in the region server
• Clients call the API over a single row or a row range
• The framework translates row keys to region locations
• Parallel execution
16. Short Primer on Coprocessors: Endpoint Call Routine
Client code (HTable.coprocessorExec() invokes the endpoint once per region in the row range and collects the per-region results):

    Batch.Call<CountProtocol, Integer> call =
        new Batch.Call<CountProtocol, Integer>() {
          public Integer call(CountProtocol p) {
            return p.getRowCount();
          }
        };
    Map<byte[], Integer> countsByRegion =
        table.coprocessorExec(CountProtocol.class, startRow, endRow, call);

Regions hosting the CountProtocol endpoint:
Region Server 1: table,,12345678 | table,bbb,12345678
Region Server 2: table,ccc,12345678 | table,ddd,12345678
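The scatter-gather shape of that call routine can be simulated in plain Java: one endpoint invocation per region, run in parallel, merged into a per-region map (region names and row data below are made up for illustration; in HBase the fan-out is done by coprocessorExec() itself):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simulates the endpoint call routine: the client fans one call out to
// every region in parallel, each "region" returns its local row count,
// and the results are gathered into a map keyed by region.
public class EndpointScatterGather {
    // Stand-in for the per-region CountProtocol.getRowCount() endpoint.
    static int getRowCount(int[] regionRows) { return regionRows.length; }

    public static Map<String, Integer> countsByRegion(Map<String, int[]> regions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regions.size()));
        Map<String, Future<Integer>> pending = new TreeMap<>();
        for (Map.Entry<String, int[]> e : regions.entrySet()) {
            // one parallel endpoint invocation per region
            pending.put(e.getKey(), pool.submit(() -> getRowCount(e.getValue())));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Future<Integer>> e : pending.entrySet()) {
            try {
                counts.put(e.getKey(), e.getValue().get());   // gather
            } catch (InterruptedException | ExecutionException ex) {
                throw new RuntimeException(ex);
            }
        }
        pool.shutdown();
        return counts;
    }

    public static void main(String[] args) {
        Map<String, int[]> regions = new TreeMap<>();
        regions.put("table,,12345678",    new int[]{1, 2, 3});
        regions.put("table,bbb,12345678", new int[]{4, 5});
        System.out.println(countsByRegion(regions));
        // {table,,12345678=3, table,bbb,12345678=2}
    }
}
```

The client-side merge step (summing the per-region counts, for instance) is exactly what the slide's `Map<byte[], Integer> countsByRegion` result enables.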
17. Short Primer on Coprocessors: Use Cases
• HBase security (version 0.94)
• Aggregate operations avg(), sum(): AggregatorProtocol
• HBASE-3529: embedded search
18. Social Media Monitoring: Prospective Search with Coprocessors
Put operations are processed in the HRegion (on the HRegionServer); the Prospective Search coprocessor matches each incoming write and emits RT Alerts.
Icons by http://dryicons.com
19. Social Media Monitoring: Testing Setup
• Standard, virtualized test cluster: 4 RegionServers/DataNodes, 1 HBase Master, 1 NameNode, 3 ZooKeeper nodes
• Test dataset created from 2h of the live index (1 GB)
• Drive load on the RegionServers/DataNodes
20. Social Media Monitoring: Test Results
[Chart: write throughput in writes/sec (0 to 1800) plotted against the number of deployed search agents (0, 10, 50, 100, 200, 400, 800)]
22. Challenges & Lessons Learned: Challenges
• Everyone is still learning
• Some issues only appear at scale
• Production cluster configuration
• Hardware issues
• Tuning cluster configuration to our workloads
• HBase stability
• Monitoring the health of HBase
23. Challenges & Lessons Learned: Lessons
• Be careful with expensive operations in coprocessors
• At scale, nothing works as advertised
• Monitoring/operational tooling is most important
• Play with all the configurations and benchmark for tuning
24. Resources to get started
• https://blogs.apache.org/hbase/entry/coprocessor_introduction
• http://hbase.apache.org/apidocs/index.html
• http://www.lilyproject.org/lily/about/playground/hbaserowlog.html
• http://www.github.com/sentric/HBasePS
25. Questions?
Christian Gügi
christian.guegi@sentric.ch
Berlin Buzzwords 2012
Thank you!