This document summarizes Lars George's presentation on moving from batch to real-time processing with Hadoop. It discusses using Hadoop (HDFS and MapReduce) for batch processing of large amounts of data, and integrating real-time stores such as HBase and stream-processing tools to enable faster querying and analytics. The example architectures shown combine batch and real-time systems by processing streaming data with real-time tools and periodically syncing results to Hadoop and HBase for long-term storage and analysis.
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
1. From Batch to Realtime with Hadoop
Berlin Buzzwords, June 2012
Lars George
lars@cloudera.com
2. About Me
• Solutions Architect @ Cloudera
• Apache HBase & Whirr Committer
• Working with Hadoop & HBase since 2007
• Author of O’Reilly’s “HBase - The Definitive Guide”
3. The Application Stack
• Solve Business Goals
• Rely on Proven Building Blocks
• Rapid Prototyping
‣ Templates, MVC, Reference Implementations
• Evolutionary Innovation Cycles
“Let there be light!”
7. The Dawn of Big Data
• Industry verticals produce a staggering amount of data
• Not only web properties, but also “brick and mortar” businesses
‣ Smart Grid, Bio Informatics, Financial, Telco
• Scalable computation frameworks allow analysis of all the data
‣ No sampling anymore
• Suitable algorithms derive even more data
‣ Machine learning
• “The Unreasonable Effectiveness of Data”
‣ More data is better than smart algorithms
8. Hadoop
• HDFS + MapReduce
• Based on Google Papers
• Distributed Storage and Computation Framework
• Affordable Hardware, Free Software
• Significant Adoption
9. HDFS
• Reliably store petabytes of replicated data across thousands of nodes
• Master/Slave Architecture
• Built on “commodity” hardware
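To make the storage layer concrete, here is a minimal sketch (not from the slides) of writing and reading a file through Hadoop's FileSystem API; the NameNode address and file path are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/data/events/sample.txt");  // illustrative path
    FSDataOutputStream out = fs.create(path);         // blocks are replicated across DataNodes
    out.writeUTF("hello hdfs");
    out.close();

    FSDataInputStream in = fs.open(path);
    System.out.println(in.readUTF());                 // read back from a replica
    in.close();
  }
}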
10. MapReduce
• Distributed programming model to reliably process petabytes of data
• Locality of data to processing is vital
‣ Run code where data resides
• Inspired by map and reduce functions in functional programming
Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
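As an illustration of this flow, here is a minimal word-count job in the standard Hadoop Java API, a sketch rather than code from the talk: the map step emits (word, 1) pairs, the framework copies and sorts them by key, and the reduce step sums each word's counts.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        ctx.write(word, ONE);                 // emit (word, 1); framework sorts by key
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));   // one total per word
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}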
11. From Short to Long Term
Internet
LAM(M)P
• Serves the Client
• Stores Intermediate Data
Hadoop
• Background Batch Processing
• Stores Long-Term Data
12. Batch Processing
• Scale is Unlimited
‣ Bound only by Hardware
• Harness the Power of the Cluster
‣ CPUs, Disks, Memory
• Disks extend Memory
‣ Spills represent Swapping
• Trade Size Limitations with Time
‣ Jobs run for a few minutes to hours, days
13. From Batch to Realtime
• “Time is Money”
• Bridging the gap between batch and “now”
• Realtime often means “faster than batch”
• 80/20 Rule
‣ Hadoop solves the 80% easily
‣ The remaining 20% is taking 80% of the effort
• Go as close as possible, don’t overdo it!
14. Stop Gap Solutions
• In Memory
‣ Memcached
‣ MemBase
‣ GigaSpaces
• Relational Databases
‣ MySQL
‣ PostgreSQL
• NoSQL
‣ Cassandra
‣ HBase
15. Complemental Design #1
Internet ➜ LAM(M)P ➜ Hadoop + HBase
• Keep Backup in HDFS
• MapReduce over HDFS
• Synchronize HBase
‣ Batch Puts
‣ Bulk Import
16. Complemental Design #2
Internet ➜ LAM(M)P ➜ Flume ➜ Hadoop + HBase
• Add Log Support
• Synchronize HBase
‣ Batch Puts (see the sketch below)
‣ Bulk Import
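A sketch of the “Batch Puts” synchronization step shown in both designs, using the 0.92-era HBase client API; the table name, column family, and row layout are made up for illustration.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutSync {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");     // hypothetical table

    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 1000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"),
              Bytes.toBytes("value-" + i));        // family "d", qualifier "payload"
      batch.add(put);
    }
    table.put(batch);                              // one batched round trip per region server
    table.close();
  }
}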
17. Mitigation Planning
• Reliable storage has top priority
• Disaster Recovery
• HBase Backups
‣ Export - but what if HBase is “down”
‣ CopyTable - same issue
‣ Snapshots - not available (yet)
19. Facebook Insights
• > 20B Events per Day
• 1M Counter Updates per Second
‣ 100 Nodes Cluster
‣ 10K OPS per Node
Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
20. Collection Layer
• “Like” button triggers AJAX request
• Event written to log file using Scribe
‣ Handles aggregation, delivery, file roll-over, etc.
‣ Uses HDFS to store files
✓ Use Flume or Scribe
21. Filter Layer
• Ptail “follows” logs written by Scribe
• Aggregates from multiple logs
• Separates into event types
‣ Sharding for future growth
• Facebook internal tool
✓ Use Flume
22. Batching Layer
• Puma batches updates
‣ 1 sec, staggered
• Flush batch, when last is done
• Duration limited by key distribution
• Facebook internal tool
✓ Use Coprocessors (0.92.0)
23. Counters
• Store counters per Domain and per URL
‣ Leverage HBase increment (atomic read-modify-write) feature (sketched below)
• Each row is one specific Domain or URL
• The columns are the counters for specific metrics
• Column families are used to group counters by time range
‣ Set time-to-live on CF level to auto-expire counters by age to save space, e.g., 2 weeks on “Daily Counters” family
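A sketch of what such counters could look like with the HBase client API (all names are illustrative, and the row key is simplified; the real key design with its MD5 prefix follows on the next slide): a “daily” family with a two-week TTL, updated through atomic increments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class InsightsCounters {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // One-time setup: a "daily" family whose counters auto-expire after 14 days.
    HTableDescriptor desc = new HTableDescriptor("insights");
    HColumnDescriptor daily = new HColumnDescriptor("daily");
    daily.setTimeToLive(14 * 24 * 3600);           // TTL in seconds
    desc.addFamily(daily);
    desc.addFamily(new HColumnDescriptor("lifetime"));
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);

    // Per-event update: atomic read-modify-write on the counter cells.
    HTable table = new HTable(conf, "insights");
    byte[] row = Bytes.toBytes("com.example.www"); // simplified reversed-domain key
    table.incrementColumnValue(row, Bytes.toBytes("daily"),
        Bytes.toBytes("1/1:total"), 1L);
    table.incrementColumnValue(row, Bytes.toBytes("lifetime"),
        Bytes.toBytes("total"), 1L);
    table.close();
  }
}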
24. Key Design
• Reversed Domains, e.g., “com.cloudera.www”, “com.cloudera.blog”
‣ Helps keep pages per site close, as HBase efficiently scans blocks of sorted keys
• Domain Row Key = MD5(Reversed Domain) + Reversed Domain
‣ Leading MD5 hash spreads keys randomly across all regions for load balancing reasons
‣ Only hashing the domain groups per site (and per subdomain if needed)
• URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID (see the sketch below)
‣ Unique ID per URL already available, make use of it
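Put together, the key construction might look like the following sketch; the helper names are invented, but the hash-plus-reversed-domain layout follows the slide.

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyDesign {
  // "www.cloudera.com" -> "com.cloudera.www"
  static String reverseDomain(String domain) {
    String[] parts = domain.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      sb.append(parts[i]);
      if (i > 0) sb.append('.');
    }
    return sb.toString();
  }

  static byte[] domainRowKey(String domain) throws Exception {
    byte[] reversed = Bytes.toBytes(reverseDomain(domain));
    byte[] md5 = MessageDigest.getInstance("MD5").digest(reversed);
    return Bytes.add(md5, reversed);               // MD5(Reversed Domain) + Reversed Domain
  }

  static byte[] urlRowKey(String domain, long urlId) throws Exception {
    return Bytes.add(domainRowKey(domain), Bytes.toBytes(urlId)); // ... + URL ID
  }
}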
25. Insights Schema
Row Key: Domain Row Key
Columns:
  Hourly Counters CF:   6pm Total=100, 6pm Male=50, 6pm US=92, 7pm ...=45, ...
  Daily Counters CF:    1/1 Total=1000, 1/1 Male=320, 1/1 US=670, 2/1 ...=990, ...
  Lifetime Counters CF: Total=10000, Male=6780, Female=3220, US=9900, ...
Row Key: URL Row Key
Columns:
  Hourly Counters CF:   6pm Total=10, 6pm Male=5, 6pm US=9, 7pm ...=4, ...
  Daily Counters CF:    1/1 Total=100, 1/1 Male=20, 1/1 US=70, 2/1 ...=99, ...
  Lifetime Counters CF: Total=100, Male=8, Female=92, US=100, ...
27. Batch + Stream
• Currently moves complexity into app layer
‣ Reads need to merge batch and stream results (see the sketch below)
• Stream results can be dropped once data is persisted in batch layer
• Stream might not be 100% correct, but good enough in most cases
‣ Eventual Accuracy
• Latency vs. Throughput - best of both worlds
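As a closing illustration, a sketch (assumed, not from the talk) of the read-side merge described above: the batch layer's count is exact but stale, the stream layer's delta is fresh but only eventually accurate, and a read returns their sum.

import java.util.Map;

public class MergedRead {
  // batchCounts: produced by the last MapReduce run; streamDeltas: live counters
  // accumulated since that run (dropped once the next batch run persists them).
  static long count(String key, Map<String, Long> batchCounts, Map<String, Long> streamDeltas) {
    long batch = batchCounts.getOrDefault(key, 0L);  // exact, but hours old
    long delta = streamDeltas.getOrDefault(key, 0L); // fresh, "eventually accurate"
    return batch + delta;
  }
}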