From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for an upcoming second release of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use-cases, moving on to determining the number of servers needed, and leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Along with Hive Metastore, these table formats try to solve long-standing problems in traditional data lakes with their declared features such as ACID, schema evolution, upsert, time travel, and incremental consumption.
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark (Bo Yang)
The slides explain how shuffle works in Spark and help people understand more details about Spark internals. They show how the major classes are implemented, including: ShuffleManager (SortShuffleManager), ShuffleWriter (SortShuffleWriter, BypassMergeSortShuffleWriter, UnsafeShuffleWriter), ShuffleReader (BlockStoreShuffleReader).
The columnar roadmap: Apache Parquet and Apache Arrow (DataWorks Summit)
The Hadoop ecosystem has standardized on columnar formats: Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can become as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate push downs, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
Parquet Strata/Hadoop World, New York 2013 (Julien Le Dem)
Parquet is a columnar storage format for Hadoop data. It was developed collaboratively by Twitter and Cloudera to address the need for efficient analytics on large datasets. Parquet provides more efficient compression and I/O compared to row-based formats by only reading and decompressing the columns needed by a query. It has been adopted by many companies for analytics workloads involving terabytes to petabytes of data. Parquet is language-independent and supports integration with frameworks like Hive, Pig, and Impala. It provides significant performance improvements and storage savings compared to traditional row-based formats.
The document discusses Hive's new ACID (atomicity, consistency, isolation, durability) functionality which allows for updating and deleting rows in Hive tables. Key points include Hive now supporting SQL commands like INSERT, UPDATE and DELETE; storing changes in delta files and using transaction IDs; and running minor and major compactions to consolidate delta files. Future work may include multi-statement transactions, updating/deleting in streaming ingest, Parquet support, and adding MERGE statements.
Parquet is a column-oriented storage format for Hadoop that supports efficient compression and encoding techniques. It uses a row group structure to store data in columns in a compressed and encoded column chunk format. The schema and metadata are stored in the file footer to allow for efficient reads and scans of selected columns. The format is designed to be extensible through pluggable components for schema conversion, record materialization, and encodings.
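The layout described above (row groups, per-column chunks, and footer metadata) can be illustrated with a toy in-memory model. This is only a sketch of the idea, not the Parquet file format or any Parquet library API: the function names `write_row_groups` and `read_column` are hypothetical, there is no encoding or compression, and the "footer" is simply a list of per-group min/max statistics.

```python
# Toy model of a columnar layout: rows are split into row groups, each column
# of a group is stored as its own chunk, and a footer keeps min/max statistics.
# Hypothetical names; this is not the real Parquet format or API.

def write_row_groups(rows, columns, group_size):
    """Split rows into row groups and store each column of a group separately."""
    groups, footer = [], []
    for start in range(0, len(rows), group_size):
        batch = rows[start:start + group_size]
        chunks = {c: [r[c] for r in batch] for c in columns}
        groups.append(chunks)
        # Footer metadata: per-column (min, max) for this row group.
        footer.append({c: (min(v), max(v)) for c, v in chunks.items()})
    return groups, footer


def read_column(groups, footer, column, predicate_col, lo, hi):
    """Return `column` values for rows where lo <= predicate_col <= hi.

    Only the two columns involved are read, and row groups whose statistics
    rule out a match are skipped entirely.
    """
    out, touched = [], 0
    for chunks, stats in zip(groups, footer):
        cmin, cmax = stats[predicate_col]
        if cmax < lo or cmin > hi:
            continue  # statistics prove no row in this group can match
        touched += 1
        for value, p in zip(chunks[column], chunks[predicate_col]):
            if lo <= p <= hi:
                out.append(value)
    return out, touched


if __name__ == "__main__":
    rows = [{"id": i, "name": f"user{i}", "score": i * 10} for i in range(10)]
    groups, footer = write_row_groups(rows, ["id", "name", "score"], group_size=5)
    names, touched = read_column(groups, footer, "name", "id", 6, 7)
    print(names, "row groups read:", touched)
```

The `score` column is never touched by the query, and the first row group is skipped on statistics alone, which is the effect the abstracts describe as reading only the needed columns.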
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and learn how to tune Spark SQL performance.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin, which runs fast interactive queries over Pulsar using ANSI SQL and lets you join Pulsar data with other data sources. This plugin will soon be renamed to align with the renaming of the PrestoSQL project to Trino. What is the purpose of this rename, and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the Pulsar community's plans to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro (Databricks)
Zstandard is a fast compression algorithm which you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with `memory` medium, but also it maximizes OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard buffer pool feature and its performance gain is impressive, too.
2) Event log compression is another area to save your storage cost on the cloud storage like S3 and to improve the usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work underway to use Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark’s Avro file format in Spark 3.2.0.
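The four use cases above correspond to a handful of Spark settings. The sketch below collects them in a plain Python dictionary and renders them as `spark-submit --conf` arguments. The configuration keys are the ones understood to exist in Spark 3.x, and they are an assumption as far as this talk is concerned, since the abstract only cites JIRA tickets; check them against the configuration reference of your Spark version. The helper name `to_submit_flags` is hypothetical.

```python
# Hedged sketch: Spark settings that switch the areas listed above to Zstandard.
# Keys are assumptions to verify against your Spark version's documentation.
ZSTD_CONF = {
    # 1) Shuffle and spill files on local disk.
    "spark.io.compression.codec": "zstd",
    "spark.io.compression.zstd.bufferPool.enabled": "true",
    # 2) Event log compression.
    "spark.eventLog.enabled": "true",
    "spark.eventLog.compress": "true",
    "spark.eventLog.compression.codec": "zstd",
    # 3) ORC and Parquet data files written by Spark SQL.
    "spark.sql.orc.compression.codec": "zstd",
    "spark.sql.parquet.compression.codec": "zstd",
    # 4) MapStatus serialization.
    "spark.shuffle.mapStatus.compression.codec": "zstd",
}


def to_submit_flags(conf):
    """Render a config dict as a flat list of spark-submit arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags.extend(["--conf", f"{key}={value}"])
    return flags


if __name__ == "__main__":
    print(" ".join(to_submit_flags(ZSTD_CONF)))
```

Keeping the settings in one dictionary makes it easy to apply the same values through `spark-defaults.conf` or a session builder instead of the command line.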
Fluentd is an open source data collector that allows flexible data collection, processing, and output. It supports streaming data from sources like logs and metrics to destinations like databases, search engines, and object stores. Fluentd's plugin-based architecture allows it to support a wide variety of use cases. Recent versions of Fluentd have added features like improved plugin APIs, nanosecond time resolution, and Windows support to make it more suitable for containerized environments and low-latency applications.
The document outlines topics covered in "The Impala Cookbook" published by Cloudera. It discusses physical and schema design best practices for Impala, including recommendations for data types, partition design, file formats, and block size. It also covers estimating and managing Impala's memory usage, and how to identify the cause when queries exceed memory limits.
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc... (Altinity Ltd)
This document summarizes Cloudflare's use of ClickHouse to analyze over 6 million HTTP requests per second. Some key points:
- Cloudflare previously used PostgreSQL, Citus, and Flink but these did not scale sufficiently.
- ClickHouse was chosen as it is fast, scalable, fault tolerant, and Cloudflare had existing expertise in it.
- Cloudflare designed ClickHouse schemas to aggregate HTTP data into totals, breakdowns by category, and unique counts into two tables using different engines.
- Tuning ClickHouse index granularity improved query latency by 50% and throughput by 3x.
- The new ClickHouse pipeline is more scalable, fault tolerant
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics (DataWorks Summit)
Apache Spark’s in-memory capabilities catapulted it to the position of premier processing framework for Hadoop. Apache Ignite and Alluxio, both high-performance, integrated, distributed in-memory platforms, take Apache Spark to the next level by providing an even more powerful, faster and more scalable platform for the most demanding data processing and analytic environments.
Speaker
Irfan Elahi, Consultant, Deloitte
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... (Databricks)
Uber has real needs to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture & use-cases of the second generation of ‘Hudi’, a self-contained Apache Spark library for building large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show how to ingest data into Hudi using Spark Datasource/Streaming APIs and build Notebooks/Dashboards on top using Spark SQL.
Dynamic filtering for presto join optimisation (Ori Reshef)
Roman Zeyde explains how to optimize Presto joins in selective use cases.
Roman is a Talpiot graduate and an ex-Googler, now working as a Presto architect at Varada.
Building large scale transactional data lake using apache hudi (Bill Liu)
Data is critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency, and to help distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then dive deep into how it improves data operations through features such as data versioning and time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa... (StreamNative)
Apache Hudi is an open data lake platform, designed around the streaming data model. At its core, Hudi provides transactions, upserts, and deletes on data lake storage, while also enabling CDC capabilities. Hudi also provides a coherent set of table services, which can clean, compact, cluster and optimize storage layout for better query performance. Finally, Hudi's data services provide out-of-box support for streaming data from event systems into lake storage in near real-time.
In this talk, we will walk through an end-to-end use case for change data capture from a relational database, starting with capturing changes using the Pulsar CDC connector and then demonstrating how you can use the Hudi deltastreamer tool to apply these changes to a table on the data lake. We will discuss various tips for operationalizing and monitoring such pipelines. We will conclude with some guidance on future integrations between the two projects, including a native Hudi/Pulsar connector and Hudi tiered storage.
Building robust CDC pipeline with Apache Hudi and Debezium (Tathastu.ai)
We will cover the need for CDC and the benefits of building a CDC pipeline, and compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail, along with our contributions to the open-source community.
This talk delves into the many ways a user can use HBase in a project. Lars will look at many practical examples based on real applications in production, for example at Facebook and eBay, and at the right approach for those wanting to find their own implementation. He will also discuss advanced concepts, such as counters, coprocessors and schema design.
Hbase schema design and sizing apache-con europe - nov 2012 (Chris Huang)
The document provides an overview of HBase schema design and cluster sizing notes. It discusses HBase architecture including tables, regions, distribution, and compactions. It emphasizes the importance of schema design, including using intelligent keys, denormalization, and duplication to overcome limitations. The document also covers techniques like salting keys, hashing vs sequential keys, and examples of schema design for applications like mail inbox and Facebook insights. It stresses designing for the use case and avoiding hotspotting when sizing clusters.
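Salting, one of the techniques listed above for avoiding hotspots, can be sketched as follows. This is a minimal illustration and not code from the slides: the function names, the `|` separator, and the default of 16 buckets are all assumptions made here. The salt is derived from a hash of the key, so it is deterministic and a reader can recompute it for point lookups.

```python
import hashlib

# Hypothetical helpers illustrating salted row keys; not an HBase API.

def salted_row_key(key: str, buckets: int = 16) -> bytes:
    """Prefix the key with a deterministic salt so sequential keys spread out."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    salt = digest[0] % buckets
    return f"{salt:02d}|".encode("utf-8") + key.encode("utf-8")


def scan_prefixes(buckets: int = 16):
    """A range scan over salted keys has to fan out over every salt prefix."""
    return [f"{b:02d}|".encode("utf-8") for b in range(buckets)]


if __name__ == "__main__":
    # Sequential, timestamp-like keys would otherwise all land in one region.
    for ts in range(1700000000, 1700000005):
        print(salted_row_key(f"event-{ts}"))
```

The trade-off is visible in `scan_prefixes`: writes spread across buckets, but a scan over a key range now needs one scan per bucket.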
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 (larsgeorge)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
With the public confession of Facebook, HBase is on everyone's lips when it comes to the discussion around the new "NoSQL" area of databases. In this talk, Lars will introduce and present a comprehensive overview of HBase. This includes the history of HBase, the underlying architecture, available interfaces, and integration with Hadoop.
HBase is an open-source, distributed, column-oriented database that runs on top of Hadoop. It provides real-time read and write access to large amounts of data across clusters of commodity hardware. HBase scales to billions of rows and millions of columns and is used by companies like Twitter, Adobe, and Yahoo to store large datasets. It uses a master-slave architecture with a single HBaseMaster and multiple RegionServers and stores data in Hadoop's HDFS for high availability.
A basic introduction to Cassandra, its architecture and strategies, covering the big data challenge and what a NoSQL database is.
The Big Data Challenge
The Cassandra Solution
The CAP Theorem
The Architecture of Cassandra
The Data Partition and Replication
HBase is a distributed, column-oriented database built on top of HDFS that can handle large datasets across a cluster. Its data model is a multidimensional sorted map whose contents are distributed across nodes. Data is first written to a write-ahead log and memory, then flushed to disk files and compacted for efficiency. Client applications access HBase programmatically through APIs rather than SQL. MapReduce jobs on HBase use input, mapper, reducer, and output classes to process table data in parallel across regions.
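The write path described above (log first, then memory, then flush to files, then compaction) can be modelled in a few lines. The class below is a deliberately simplified toy and not HBase code: `ToyStore`, its method names, and the flush threshold are assumptions made here, and "files" are just sorted in-memory lists.

```python
# Toy model of a log-structured write path; hypothetical, not HBase's implementation.

class ToyStore:
    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log, written first for durability
        self.memstore = {}   # in-memory buffer holding the most recent writes
        self.files = []      # immutable sorted "files", oldest first
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))   # 1. log the edit
        self.memstore[row] = value      # 2. apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                # 3. spill to a file when the buffer is full

    def flush(self):
        if self.memstore:
            self.files.append(sorted(self.memstore.items()))
            self.memstore = {}
            self.wal.clear()  # simplification: flushed edits no longer need the log

    def compact(self):
        """Merge all files into one; newer values overwrite older ones."""
        merged = {}
        for f in self.files:  # oldest to newest
            merged.update(f)
        self.files = [sorted(merged.items())] if merged else []

    def get(self, row):
        if row in self.memstore:
            return self.memstore[row]
        for f in reversed(self.files):  # newest file first
            for r, v in f:
                if r == row:
                    return v
        return None


if __name__ == "__main__":
    store = ToyStore()
    for row, value in [("a", "1"), ("b", "2"), ("c", "3"), ("a", "9"), ("d", "4"), ("e", "5")]:
        store.put(row, value)
    print(len(store.files), store.get("a"))
    store.compact()
    print(len(store.files), store.get("a"))
```

Reads have to consult the memory buffer and every file until compaction merges them, which is why compaction is described as being done "for efficiency".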
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends (Esther Kundin)
An overview of the history of Big Data, followed by a deep dive into the Hadoop ecosystem. Detailed explanation of how HDFS, MapReduce, and HBase work, followed by a discussion of how to tune HBase performance. Finally, a look at industry trends, including challenges faced and being solved by Bloomberg for using Hadoop for financial data.
The document provides an introduction to NoSQL and HBase. It discusses what NoSQL is, the different types of NoSQL databases, and compares NoSQL to SQL databases. It then focuses on HBase, describing its architecture and components like HMaster, regionservers, Zookeeper. It explains how HBase stores and retrieves data, the write process involving memstores and compaction. It also covers HBase shell commands for creating, inserting, querying and deleting data.
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ... (Fwdays)
We will start by understanding how Real-Time Analytics can be implemented on Enterprise Level Infrastructure, then go into detail and discover how different cases of business intelligence can be used in real time on streaming data. We will cover different Stream Data Processing Architectures and discuss their benefits and disadvantages. I'll show with live demos how to build a Fast Data Platform in Azure Cloud using open source projects: Apache Kafka, Apache Cassandra, Mesos. I'll also show examples and code from real projects.
This is an introduction to relational and non-relational databases and how their performance affects scaling a web application.
This is a recording of a guest lecture I gave at the University of Texas School of Information.
In this talk I address the technologies and tools Gowalla (gowalla.com) uses including memcache, redis and cassandra.
Find more on my blog:
http://schneems.com
The workshop covers the HBase data model, architecture and schema design principles.
Source code demo:
https://github.com/moisieienko-valerii/hbase-workshop
NoSQL databases provide an alternative to traditional relational databases that is well-suited for large datasets, high scalability needs, and flexible, changing schemas. NoSQL databases sacrifice strict consistency for greater scalability and availability. The document model is well-suited for semi-structured data and allows for embedding related data within documents. Key-value stores provide simple lookup of data by key but do not support complex queries. Graph databases effectively represent network-like connections between data elements.
NoSQL databases were developed to address the limitations of relational databases in handling massive, unstructured datasets. NoSQL databases sacrifice ACID properties like consistency in favor of scalability and availability. The CAP theorem states that only two of consistency, availability, and partition tolerance can be achieved at once. Common NoSQL database types include document stores, key-value stores, column-oriented stores, and graph databases. NoSQL is best suited for large datasets that don't require strict consistency or relational structures.
HBase is an open-source, distributed, versioned, non-relational database built on top of Hadoop. It is modeled after Google's BigTable and provides random real-time read/write access to large datasets stored on HDFS. HBase scales to billions of rows and millions of columns and is used by companies like Twitter, Adobe, and Yahoo to store large datasets. It uses a master-slave architecture with an HBase master managing region servers that store the data.
This document provides an overview of NoSQL databases and summarizes key information about several NoSQL databases, including HBase, Redis, Cassandra, MongoDB, and Memcached. It discusses concepts like horizontal scalability, the CAP theorem, eventual consistency, and data models used by different NoSQL databases like key-value, document, columnar, and graph structures.
This document provides an overview of HBase, including:
- HBase is a distributed, scalable, big data store modeled after Google's BigTable. It provides a fault-tolerant way to store large amounts of sparse data.
- HBase is used by large companies to handle scaling and sparse data better than relational databases. It features automatic partitioning, linear scalability, commodity hardware, and fault tolerance.
- The document discusses HBase operations, schema design best practices, hardware recommendations, alerting, backups and more. It provides guidance on designing keys, column families and cluster configuration to optimize performance for read and write workloads.
From: DataWorks Summit Munich 2017 - 20170406
While you might be tempted to assume that data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance, as far as backup and disaster recovery (BDR) is concerned, is not (yet) important. This talk first introduces you to the overarching issue and difficulties of backup and data safety, then looks at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv (larsgeorge)
This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspect and then correlating it to the skill sets of current Hadoop adopters.
HBase Status Report - Hadoop Summit Europe 2014 (larsgeorge)
This document provides a summary of new features and improvements in recent versions of Apache HBase, a distributed, scalable, big data store. It discusses major changes and enhancements in HBase 0.92+, 0.94+, and 0.96+, including new HFile formats, coprocessors, caching improvements, performance tuning, and more. The document is intended to bring readers up to date on the current state and capabilities of HBase.
These are my slides for the 5 minute overview talk I gave during a recent workshop at the European Commission in Brussels, on the topic of "Big Data Skills in Europe".
HBase Applications - Atlanta HUG - May 2014 (larsgeorge)
HBase is good at various workloads, ranging from sequential range scans to purely random access. These access patterns can be translated into application types, usually falling into two major groups: entities and events. This presentation discusses the underlying implications and how to approach those use-cases. Examples taken from Facebook show how this has been tackled in real life.
Parquet is an open-source columnar storage format that provides an efficient data layout for analytical queries. Twitter uses Parquet to store logs and analytics data across multiple large Hadoop clusters, saving petabytes of storage and reducing query times by up to 66% by reading only needed columns. Parquet defines a language-independent file format that stores data by column rather than row to optimize analytical access patterns.
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa (larsgeorge)
Keynote during BiDaTA 2013 in Genoa, a special track of the ADBIS 2013 conference. URL: http://dbdmg.polito.it/bidata2013/index.php/keynote-presentation
The document discusses several key factors for optimizing HBase performance including:
1. Reads and writes compete for disk, network, and thread resources so they can cause bottlenecks.
2. Memory allocation needs to balance space for memstores, block caching, and Java heap usage.
3. The write-ahead log can be a major bottleneck and increasing its size or number of logs can improve write performance.
4. Flushes and compactions need to be tuned to avoid premature flushes causing "compaction storms".
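The memory trade-off in point 2 can be made concrete with a small calculation. The sketch below assumes the heap is split by two fractions, one for memstores and one for the block cache, and that their sum must stay under a ceiling so the rest of the heap remains available for other uses. The defaults of 0.4 each and the 0.8 ceiling reflect commonly cited HBase settings, but treat them as assumptions to verify for your version; the function name `heap_budget` is hypothetical.

```python
# Hedged sketch of a region server heap budget; the default fractions and the
# ceiling are assumptions to check against your HBase version.

def heap_budget(heap_gb, memstore_fraction=0.4, block_cache_fraction=0.4,
                max_combined=0.8):
    """Split a heap of `heap_gb` gigabytes into memstore, block cache, and the rest."""
    combined = memstore_fraction + block_cache_fraction
    if combined > max_combined + 1e-9:
        raise ValueError(
            f"memstore + block cache = {combined:.2f} exceeds {max_combined:.2f} of the heap"
        )
    memstore_gb = heap_gb * memstore_fraction
    block_cache_gb = heap_gb * block_cache_fraction
    return {
        "memstore_gb": memstore_gb,
        "block_cache_gb": block_cache_gb,
        "remaining_gb": heap_gb - memstore_gb - block_cache_gb,
    }


if __name__ == "__main__":
    # A write-heavy profile shifts heap from the block cache to the memstores.
    print(heap_budget(16))
    print(heap_budget(16, memstore_fraction=0.5, block_cache_fraction=0.3))
```

Shifting fractions between the two pools is how a read-heavy or write-heavy workload is favoured without growing the heap.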
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012 (larsgeorge)
This document summarizes Lars George's presentation on moving from batch to real-time processing with Hadoop. It discusses using Hadoop (HDFS and MapReduce) for batch processing of large amounts of data and integrating real-time databases and stream processing tools like HBase and Storm to enable faster querying and analytics. Example architectures shown combine batch and real-time systems by using real-time tools to process streaming data and periodically syncing results to Hadoop and HBase for long-term storage and analysis.
Realtime Analytics with Hadoop and HBase (larsgeorge)
The document discusses realtime analytics using Hadoop and HBase. It begins by introducing the speaker and their experience. It then discusses moving from batch processing with Hadoop to more realtime needs, and how systems like HBase can help bridge that gap. Several designs are presented for using HBase and Hadoop together to enable both realtime and batch analytics on large datasets.
Social Networks and the Richness of Data (larsgeorge)
Social networks by their nature deal with large amounts of user-generated data that must be processed and presented in a time sensitive manner. Much more write intensive than previous generations of websites, social networks have been on the leading edge of non-relational persistence technology adoption. This talk presents how Germany's leading social networks Schuelervz, Studivz and Meinvz are incorporating Redis and Project Voldemort into their platform to run features like activity streams.
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptxFwdays
I will share my personal experience of full-time development on wasm Blazor
What difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, which technology stack and architectural patterns we chose
What conclusions we made and what mistakes we committed
Redefining Cybersecurity with AI Capabilities (Priyanka Aash)
In this comprehensive overview of Cisco's latest innovations in cybersecurity, the focus is squarely on resilience and adaptation in the face of evolving threats. The discussion covers the imperative of tackling Mal information, the increasing sophistication of insider attacks, and the expanding attack surfaces in a hybrid work environment. Emphasizing a shift towards integrated platforms over fragmented tools, Cisco introduces its Security Cloud, designed to provide end-to-end visibility and robust protection across user interactions, cloud environments, and breaches. AI emerges as a pivotal tool, from enhancing user experiences to predicting and defending against cyber threats. The blog underscores Cisco's commitment to simplifying security stacks while ensuring efficacy and economic feasibility, making a compelling case for their platform approach in safeguarding digital landscapes.
Keynote : AI & Future Of Offensive Security (Priyanka Aash)
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottle necks, making a fix and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their application fast and understanding how things work under the hood.
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
Demystifying Neural Networks And Building Cybersecurity ApplicationsPriyanka Aash
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
5. HBase Tables
• From the user's perspective, HBase is similar to a database or spreadsheet
• There are rows and columns, storing values
• By default, asking for a specific row/column combination returns the current value (that is, the last value stored there)
6. HBase Tables
• HBase can have a different schema per row
• Could be called schema-less
• Primary access by the user-given row key and column name
• Sorting of rows and columns by their key (aka names)
7. HBase Tables
• Each row/column coordinate is tagged with a version number, allowing multi-versioned values
• Version is usually the current time (as epoch)
• API lets user ask for versions (specific, by count, or by ranges)
• Up to 2B versions
8. HBase Tables
• Table data is cut into pieces to distribute over cluster
• Regions split table into shards at size boundaries
• Families split within regions to group sets of columns together
• At least one of each is needed
9. Scalability – Regions as Shards
• A region is served by exactly one region server
• Every region server serves many regions
• Table data is spread over servers
• Distribution of I/O
• Assignment is based on configurable logic
• Balancing cluster load
• Clients talk directly to region servers
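As a rough illustration of regions acting as shards, the sketch below maps a row key to the one region (and server) responsible for it. The start keys, server names, and function names are hypothetical; this is a toy model of the idea, not HBase's actual lookup code.

```python
import bisect

# Hypothetical table with four regions; each region covers [start_key, next_start_key).
# The first region starts at the empty key, so it is open-ended at the low end.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]
# Hypothetical assignment: one server per region, a server may hold many regions.
REGION_SERVERS = ["rs1.example.com", "rs2.example.com", "rs1.example.com", "rs3.example.com"]


def locate_region(row_key: bytes) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it holds the row.
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def locate_server(row_key: bytes) -> str:
    """Return the single server currently serving the row's region."""
    return REGION_SERVERS[locate_region(row_key)]
```

Once a client knows the server for a key range, it can send requests for that range straight to it, which is the "clients talk directly to region servers" point.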
10. Column Family-Oriented
• Group multiple columns into physically separated locations
• Apply different properties to each family
• TTL, compression, versions, …
• Useful to separate distinct data sets that are related
• Also useful to separate larger blob from meta data
11. Data Management
• What is available is tracked in three locations
• System catalog table hbase:meta
• Files in HDFS directories
• Open region instances on servers
• System aligns these locations
• Sometimes (very rarely) a repair may be needed using HBase Fsck
• Redundant information is useful to repair corrupt tables
12. HBase really is….
• A distributed Hash Map
• Imagine a complex, concatenated key including the user-given row key and column name, and the timestamp (version)
• Complex key points to actual value, that is, the cell
13. Fold, Store, and Shift
• Logical rows in tables are really stored as flat key-value pairs
• Each carries full coordinates
• Pertinent information can be freely placed in cell to improve lookup
• HBase is a column-family grouped key-value store
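The idea of logical rows folded into flat, fully qualified key-value pairs can be sketched as follows. The tuple layout and helper names are assumptions for illustration only; the real cell format is a serialized byte structure.

```python
from typing import List, Optional, Tuple

# A cell is a flat (key, value) pair; the key carries the full coordinates.
# The timestamp is negated so that newer versions sort first within a column.
CellKey = Tuple[bytes, bytes, bytes, int]  # (row, family, qualifier, -timestamp)


def flatten(row: bytes, columns: dict, ts: int) -> List[Tuple[CellKey, bytes]]:
    """Turn one logical row {(family, qualifier): value} into flat cells."""
    return [((row, fam, qual, -ts), val) for (fam, qual), val in columns.items()]


def latest(cells, row: bytes, fam: bytes, qual: bytes) -> Optional[bytes]:
    """Return the newest value for a row/column coordinate, or None."""
    for (r, f, q, _neg_ts), val in sorted(cells):
        if (r, f, q) == (row, fam, qual):
            return val  # first match is the newest because of the negated timestamp
    return None


cells = []
cells += flatten(b"row1", {(b"cf", b"name"): b"Ann", (b"cf", b"city"): b"Bonn"}, ts=100)
cells += flatten(b"row1", {(b"cf", b"city"): b"Munich"}, ts=200)
```

Asking for `(row1, cf, city)` yields the value written at timestamp 200, matching the default behavior of returning the current value.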
14. HFile Format Information
• All data is stored in a custom (open-source) format, called HFile
• Data is stored in blocks (64KB default)
• Trade-off between lookups and I/O throughput
• Compression, encoding applied _after_ limit check
• Index, filter and meta data is stored in separate blocks
• Fixed trailer allows traversal of file structure
• Newer versions introduce multilayered index and filter structures
• Only load master index and load partial index blocks on demand
• Reading data requires deserialization of block into cells
• Kind of Amdahl’s Law applies
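To make the block and index trade-off concrete, here is a small sketch of a block-structured file with an index of first keys. The 64-byte block size, the names, and the in-memory layout are all assumptions chosen to keep the example tiny; it does not reproduce the HFile format.

```python
import bisect
from typing import List, Optional, Tuple

BLOCK_SIZE = 64  # bytes per block in this toy example (the slide cites 64KB as default)


def build_blocks(sorted_cells: List[Tuple[bytes, bytes]]):
    """Pack sorted (key, value) cells into blocks and index each block's first key."""
    blocks, current, size = [], [], 0
    for key, value in sorted_cells:
        current.append((key, value))
        size += len(key) + len(value)
        # The limit is checked on the raw size, after the cell was added.
        if size >= BLOCK_SIZE:
            blocks.append(current)
            current, size = [], 0
    if current:
        blocks.append(current)
    index = [block[0][0] for block in blocks]
    return blocks, index


def get(blocks, index, key: bytes) -> Optional[bytes]:
    """Use the index to pick one block, then scan only that block."""
    pos = bisect.bisect_right(index, key) - 1
    if pos < 0:
        return None
    for k, v in blocks[pos]:  # the whole block is read to find a single cell
        if k == key:
            return v
    return None


cells = [(f"row{i:03d}".encode(), b"x" * 10) for i in range(20)]
blocks, index = build_blocks(cells)
```

Smaller blocks mean less data to read and decode per point lookup but a larger index; larger blocks favor sequential throughput.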
15. HBase Architecture
• One Master and many Worker servers
• Clients mostly communicate with workers
• Workers store actual data
• Memstore for accruing
• HFile for persistence
• WAL for fail-safety
• Data provided as regions
• HDFS is backing store
• But could be another
17. HBase Architecture (cont.)
• Based on Log-Structured Merge-Trees (LSM-Trees)
• Inserts are done in write-ahead log first
• Data is stored in memory and flushed to disk at regular intervals or based on size
• Small flushes are merged in the background to keep number of files small
• Reads check the memory stores first and the disk-based files second
• Deletes are handled with “tombstone” markers
• Atomicity on row level no matter how many columns
• Keeps locking model easy
18. Merge Reads
• Read Memstore & StoreFiles using separate scanners
• Merge matching cells into single row “view”
• Deletes mask existing data
• Bloom filters help skip StoreFiles
• Reads may have to span many files
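The write path (log, memory, flush, compaction) and the merged read path can be sketched with a toy store. The class `ToyStore`, its flush threshold, and its logical clock are hypothetical; real HBase uses sorted structures and scanners rather than dictionaries.

```python
from typing import Dict, List, Optional, Tuple

TOMBSTONE = None  # marker value standing in for a delete

# key -> (timestamp, value); one dict per memstore or store file
Store = Dict[bytes, Tuple[int, Optional[bytes]]]


class ToyStore:
    """Hypothetical single-family store: one memstore plus immutable store files."""

    def __init__(self, flush_threshold: int = 2):
        self.wal: List[Tuple[int, bytes, Optional[bytes]]] = []
        self.memstore: Store = {}
        self.store_files: List[Store] = []  # newest file last
        self.flush_threshold = flush_threshold
        self.clock = 0  # logical clock used as the version

    def put(self, key: bytes, value: Optional[bytes]) -> None:
        self.clock += 1
        self.wal.append((self.clock, key, value))  # log first, for fail-safety
        self.memstore[key] = (self.clock, value)   # then accrue in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def delete(self, key: bytes) -> None:
        self.put(key, TOMBSTONE)  # a delete is just another write

    def flush(self) -> None:
        self.store_files.append(dict(self.memstore))
        self.memstore = {}

    def get(self, key: bytes) -> Optional[bytes]:
        # Collect candidates from every store, then keep the newest one.
        candidates = [s[key] for s in [self.memstore, *self.store_files] if key in s]
        if not candidates:
            return None
        _ts, value = max(candidates, key=lambda c: c[0])
        return value  # None when the newest entry is a tombstone

    def compact(self) -> None:
        """Merge all store files into one, keeping only the newest entry per key."""
        merged: Store = {}
        for store_file in self.store_files:
            for key, entry in store_file.items():
                if key not in merged or entry[0] > merged[key][0]:
                    merged[key] = entry
        # Tombstones are dropped here only because every file took part in the merge.
        self.store_files = [{k: e for k, e in merged.items() if e[1] is not TOMBSTONE}]


store = ToyStore()
store.put(b"a", b"1")
store.put(b"b", b"2")  # second write reaches the threshold and flushes
store.put(b"a", b"3")
store.delete(b"b")     # flushes again; the tombstone now masks the older value
```

A read has to consult both store files until `compact()` merges them, which is why many small files make reads more expensive.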
20. HBase Clients
• Native Java Client/API
• Non-Java Clients
• REST server
• Thrift server
• Jython, Groovy DSL
• Spark
• TableInputFormat/TableOutputFormat for MapReduce
• HBase as MapReduce source and/or target
• Also available for table snapshots
• HBase Shell
• JRuby shell adding get, put, scan etc. and admin calls
• Phoenix, Impala, Hive, …
21. Java API
From Wikipedia:
• CRUD: “In computer programming, create, read, update, and delete are the four basic functions of persistent storage.”
• Other variations of CRUD include
• BREAD (Browse, Read, Edit, Add, Delete)
• MADS (Modify, Add, Delete, Show)
• DAVE (Delete, Add, View, Edit)
• CRAP (Create, Retrieve, Alter, Purge)
Wait, what?
22. Java API (cont.)
• CRUD
• put: Create and update a row (CU)
• get: Retrieve an entire, or partial row (R)
• delete: Delete a cell, column, columns, or row (D)
• CRUD+SI
• scan: Scan any number of rows (S)
• increment: Increment a column value (I)
• CRUD+SI+CAS
• Atomic compare-and-swap (CAS)
• Combined get, check, and put operation
• Helps to overcome lack of full transactions
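The semantics of the combined get, check, and put operation, and of an atomic increment, can be sketched with an in-memory row guarded by a lock. `ToyRow` and its method names are hypothetical and only mimic the behavior described above; they are not the HBase client API.

```python
import threading
from typing import Dict, Optional


class ToyRow:
    """Hypothetical in-memory row that mimics row-level atomic operations."""

    def __init__(self) -> None:
        self._columns: Dict[bytes, bytes] = {}
        self._lock = threading.Lock()  # stands in for the row lock

    def get(self, column: bytes) -> Optional[bytes]:
        with self._lock:
            return self._columns.get(column)

    def put(self, column: bytes, value: bytes) -> None:
        with self._lock:
            self._columns[column] = value

    def check_and_put(self, column: bytes, expected: Optional[bytes], value: bytes) -> bool:
        """Write value only if the column currently holds expected (None = absent)."""
        with self._lock:  # get, check, and put happen as one atomic step
            if self._columns.get(column) != expected:
                return False
            self._columns[column] = value
            return True

    def increment(self, column: bytes, amount: int = 1) -> int:
        """Atomically add amount to a counter column and return the new value."""
        with self._lock:
            current = int(self._columns.get(column, b"0"))
            self._columns[column] = str(current + amount).encode()
            return current + amount
```

Two clients racing to claim the same column with `check_and_put(column, None, ...)` cannot both succeed, which is how the operation substitutes for a small transaction.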
23. Java API (cont.)
• Batch Operations
• Support Get, Put, and Delete
• Reduce network round-trips
• If possible, batch operation to the server to gain better overall throughput
• Filters
• Can be used with Get and Scan operations
• Server side hinting
• Reduce data transferred to client
• Filters are no guarantee for fast scans
• Still full table scan in worst-case scenario
• Might have to implement your own
• Filters can hint next row key
25. Key Cardinality
• The best performance is gained from using row keys
• Time range bound reads can skip store files
• So can Bloom Filters
• Selecting column families reduces the amount of data to be scanned
• Pure value based access is a full table scan
• Filters often are too, but reduce network traffic
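How a time-range-bound read can skip store files may be sketched like this, assuming each file records the minimum and maximum timestamp of its cells. The `StoreFile` record and the file names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StoreFile:
    """Hypothetical store file summary: name plus the timestamp range of its cells."""
    name: str
    min_ts: int
    max_ts: int


def files_to_read(files: List[StoreFile], time_range: Tuple[int, int]) -> List[str]:
    """Return the files whose timestamp range overlaps [start, end)."""
    start, end = time_range
    return [f.name for f in files if f.max_ts >= start and f.min_ts < end]


files = [
    StoreFile("hfile-1", min_ts=0, max_ts=99),
    StoreFile("hfile-2", min_ts=100, max_ts=199),
    StoreFile("hfile-3", min_ts=200, max_ts=299),
]
```

A read bounded to timestamps 150 to 250 touches only the last two files in this example; the first file is never opened.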
26. Key/Table Design
• Crucial to gain best performance
• Why do I need to know? Well, you also need to know that an RDBMS only works well when columns are indexed and the query plan is OK
• Absence of secondary indexes forces use of row key or column name sorting
• Transfer multiple indexes into one
• Generate large table -> Good since fits architecture and spreads across cluster
• DDI
• Stands for Denormalization, Duplication and Intelligent Keys
• Needed to overcome trade-offs of architecture
• Denormalization -> Replacement for JOINs
• Duplication -> Design for reads
• Intelligent Keys -> Implement indexing and sorting, optimize reads
27. Pre-materialize Everything
• Achieve one read per customer request if possible
• Otherwise keep at lowest number
• Reads between 10ms (cache miss) and 1ms (cache hit)
• Use MapReduce or Spark to compute exacts in batch
• Store and merge updates live
• Use increment() methods
Motto: “Design for Reads”
28. Tall-Narrow vs. Flat-Wide Tables
• Rows do not split
• Might end up with one row per region
• Same storage footprint
• Put more details into the row key
• Sometimes dummy column only
• Make use of partial key scans
• Tall with Scans, Wide with Gets
• Atomicity only on row level
• Examples
• Large graphs, stored as adjacency matrix (narrow)
• Message inbox (wide)
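Putting more details into the row key and reading them back with a partial key scan can be sketched as below. The example uses a hypothetical inbox table laid out tall-narrow (one row per message); the key layout `<user>-<reverse timestamp>-<message id>` and the helper names are assumptions for illustration.

```python
import bisect
from typing import List, Tuple

MAX_TS = 2**63 - 1  # used to reverse timestamps so the newest messages sort first


def inbox_key(user_id: str, ts: int, message_id: str) -> bytes:
    """Tall-narrow layout: one row per message, all details in the row key."""
    # The zero-padded reverse timestamp keeps byte-wise and numeric order aligned.
    return f"{user_id}-{MAX_TS - ts:019d}-{message_id}".encode()


def prefix_scan(sorted_rows: List[Tuple[bytes, bytes]], prefix: bytes) -> List[bytes]:
    """Partial key scan: return values of all rows whose key starts with prefix."""
    keys = [k for k, _ in sorted_rows]
    start = bisect.bisect_left(keys, prefix)
    result = []
    for key, value in sorted_rows[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the first mismatch ends the scan
        result.append(value)
    return result


rows = sorted([
    (inbox_key("alice", 1000, "m1"), b"hello"),
    (inbox_key("alice", 2000, "m2"), b"lunch?"),
    (inbox_key("bob", 1500, "m3"), b"report"),
])
```

Scanning with the prefix `alice-` returns that user's messages newest first without touching any other user's rows.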
29. Sequential Keys
<timestamp><more key>: {CF: {CQ: {TS : Val}}}
• Hotspotting on regions is bad!
• Instead do one of the following:
• Salting
• Prefix <timestamp> with distributed value
• Binning or bucketing rows across regions
• Key field swap/promotion
• Move <more key> before the timestamp (see OpenTSDB)
• Randomization
• Move <timestamp> out of key or prefix with MD5 hash
• Might also be mitigated by overall spread of workloads
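The three alternatives to a purely sequential key can be sketched as key-building functions. The bucket count, field widths, separators, and function names are assumptions, not a prescribed layout.

```python
import hashlib

NUM_BUCKETS = 8  # hypothetical number of salt buckets, e.g. one per presplit region


def salted_key(timestamp: int, more_key: str) -> bytes:
    """Salting: prefix the sequential key with a bucket derived from the key itself."""
    base = f"{timestamp:013d}-{more_key}"
    bucket = int(hashlib.md5(base.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}-{base}".encode()


def swapped_key(timestamp: int, more_key: str) -> bytes:
    """Field swap/promotion: lead with <more key>, keep the timestamp for sorting."""
    return f"{more_key}-{timestamp:013d}".encode()


def randomized_key(timestamp: int, more_key: str) -> bytes:
    """Randomization: use the MD5 hash as the key; time-ordered scans are lost."""
    return hashlib.md5(f"{timestamp:013d}-{more_key}".encode()).hexdigest().encode()
```

Because the salt is computed from the key, a reader can recompute it for point gets, while a time-ordered scan has to fan out over all buckets and merge the results.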
30. Key Design Choices
• Based on access pattern, either use sequential or random keys
• Often a combination of both is needed
• Overcome architectural limitations
• Neither is necessarily bad
• Use bulk import for sequential keys and reads
• Random keys are good for random access patterns
31. Checklist
• Design for Use-Case
• Read, Write, or Both?
• Avoid Hotspotting
• Hash leading key part, or use salting/bucketing
• Use bulk loading where possible
• Monitor your servers!
• Presplit tables
• Try prefix encoding when values are small
• Otherwise use compression (or both)
• For Reads: Restrict yourself
• Specify what you need, i.e. columns, families, time range
• Shift details to appropriate position
• Composite Keys
• Column Qualifiers
34. Cluster Tuning
• First, tune the global settings
• Heap size and GC algorithm
• Memory share for reads and writes
• Enable Block Cache
• Number of RPC handlers
• Load Balancer
• Default flush and compaction strategy
• Thread pools (10+)
• Next, tune the per-table and family settings
• Region sizes
• Block sizes
• Compression and encoding
• Compactions
• …
35. Region Balancer Tuning
• A background process in the HBase Master is tracking load on servers
• The load balancer moves regions occasionally
• Multiple implementations exist
• Simple counts number of regions
• Stochastic determines cost
• Favored Node pins HDFS block replicas
• Can be tuned further
• Cluster-wide setting!
36. RPC Tuning
• Default is one queue for all types of requests
• Can be split into separate queues for reads and writes
• Read queue can be further split into reads and scans
Stricter resource limits, but may avoid cross-starvation
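The effect of splitting the call queues can be estimated with a small calculation. In HBase 1.x the split is controlled by the properties `hbase.ipc.server.callqueue.handler.factor`, `hbase.ipc.server.callqueue.read.ratio`, and `hbase.ipc.server.callqueue.scan.ratio`; the function below is a simplified model with its own rounding rules, which are an assumption and may differ from what the server actually does.

```python
from typing import Dict


def split_queues(handlers: int, handler_factor: float,
                 read_ratio: float = 0.0, scan_ratio: float = 0.0) -> Dict[str, int]:
    """Estimate queue counts per request type (simplified, assumed rounding)."""
    # handler_factor scales the number of queues relative to the handler count.
    queues = max(1, round(handlers * handler_factor))
    if read_ratio <= 0 or queues < 2:
        return {"shared": queues}  # one set of queues for all request types
    # Keep at least one queue each for reads and writes.
    read = min(queues - 1, max(1, round(queues * read_ratio)))
    write = queues - read
    # A share of the read queues is set aside for long-running scans.
    scan = int(read * scan_ratio)
    return {"write": write, "get": read - scan, "scan": scan}
```

With 30 handlers, a factor of 0.1, a read ratio of 0.66, and a scan ratio of 0.5, this model yields one queue each for writes, gets, and scans, so a burst of scans can no longer starve point reads or writes.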
37. Key Tuning
• Design keys to match use-case
• Sequential, salted, or random
• Use sorting to convey meaning
• Colocate related data
• Spread load over all servers
• Clever key design can make use of distribution: aging-out regions
38. Compaction Tuning
• Default compaction settings are aggressive
• Set for update use-case
• For insert use-cases, Blooms are effective
• Allows tuning down compactions
• Saves resources by reducing write amplification
• More store files also enable faster full table scans with time-range-bound scans
• Server can ignore older files
• Large regions may be eligible for advanced compaction strategies
• Stripe or date-tiered compactions
• Reduce rewrites to fraction of region size
40. Placing the Use-Case
• HBase chooses to work best for random access
• You can optimize a table to prefer scans over gets
• Fewer columns with larger payload
• Larger HFile block sizes (maybe even duplicate data in two differently configured column families)
• After that is the realm of hybrid systems
• For fastest scans use brute force HDFS and native query engine with a columnar format
42. Big Data Workloads
[Figure: workloads placed on two axes. Vertical axis: Low latency to Batch. Horizontal axis: Random Access, Short Scan, Full Scan.]
• Systems shown: HBase; HBase + MR/Spark; HBase + Snapshots -> HDFS + MR/Spark; HDFS + SQL; HDFS + MR/Spark (Hive/Pig)
• Workloads shown: Current Metrics; Graph data; Simple Entities; Messages; Hybrid Entity Time series + Rollup serving; Hybrid Entity Time series + Rollup generation; Entity Time series; Index building; Analytic archive
44. Optimizations
Mostly Inserts Use-Cases
• Tune down compactions
• Compaction ratio, max store file size
• Use Bloom Filters
• On by default for row keys
Mostly Update Use-Cases
• Batch updates if possible
Mostly Serial Keys
• Use bulk loading or salting
Mostly Random Keys
• Hash key with MD5 prefix
Mostly Random Reads
• Decrease HFile block size
• Use random keys
Mostly Scans
• Increase HFile (and HDFS) block size
• Reduce columns and increase cell sizes
45. What matters…
• For optimal performance, two things need to be considered:
• Optimize the cluster and table settings
• Choose the matching key schema
• Ensure load is spread over tables and cluster nodes
• HBase works best for random access and bound scans
• HBase can be optimized for larger scans, but its sweet spot is short burst scans (can be parallelized too) and random point gets
• Java heap space limits addressable space
• Play with region sizes, compaction strategies, and key design to maximize result
• Using HBase for a suitable use-case will make for a happy customer…
• Conversely, forcing it into non-suitable use-cases may be cause for trouble