Motivation and goals for off-heap storage
Off-heap features and usage
Implementation overview
Preliminary benchmarks: off-heap vs. heap
Tips and best practices
2. Agenda
• Motivation and goals for off-heap storage
• Off-heap features and usage
• Implementation overview
• Preliminary benchmarks: off-heap vs. heap
• Tips and best practices
4. Why Off-heap
• Increase data density and reduce memory overhead
• 50+ GB user data in one JVM
• 10+ TB user data in one cluster
• Usable out-of-box without extensive GC tuning of JVM
• Maintain existing throughput performance
6. Off-heap: How Do I Use It?
• Set the off-heap memory size for the process
– Using the new property: off-heap-memory-size
• Mark regions whose entry values should be stored off-heap
– Using the new region attribute: off-heap (false | true)
• Adjust the JVM heap memory size down accordingly
– The smaller the better; at least try to keep it below 32G
• Optionally
– Configure Resource Manager
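A minimal programmatic sketch of the same setup, assuming the Java API mirrors the property and region-attribute names used in this deck (the GemFire-era package names and the RegionFactory.setOffHeap call are assumptions, not taken from these slides):

import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;

public class OffHeapSetup {
  public static void main(String[] args) {
    // 1. Set the off-heap memory size for the process via the new property.
    Cache cache = new CacheFactory()
        .set("off-heap-memory-size", "200g")
        .create();

    // 2. Mark a region whose entry values should live off-heap
    //    (corresponds to the region attribute off-heap=true).
    Region<String, byte[]> region = cache
        .<String, byte[]>createRegionFactory(RegionShortcut.PARTITION)
        .setOffHeap(true)
        .create("offHeapRegion");

    region.put("k1", new byte[1024]);

    // 3. Keep the on-heap size modest (ideally below 32G) with the usual
    //    -Xms/-Xmx JVM flags; that is not done through this API.
    cache.close();
  }
}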
7. Off-heap Features
• Startup options
• Interaction with other features
• Resource Manager
• Monitoring & Management
• Limitations
8. Startup Options
• --off-heap-memory-size – specifies the amount of off-heap memory to allocate
• --lock-memory – locks the allocated memory so the OS cannot page it out
• Example:
gfsh start server --initial-heap=10G --max-heap=10G --off-heap-memory-size=200G --lock-memory=true
9. Off-heap Interaction with Other Features
• PDX
– Values currently copied from off-heap to create a PDXInstance
• Deltas: expensive
• Compression: compatible with off-heap
• Querying: more expensive with off-heap
• EntryEvents
– Limited availability of oldValue, newValue
• Indexes
– Functional range indexes not supported (too expensive)
10. Off-heap and Resource Manager
• Out of Memory Semantics
• Eviction and Critical Thresholds
• Resource Manager API
11. Out of Memory occurs when...
• Java heap runs out of memory
– Threads start throwing OutOfMemoryError
• Off-heap runs out of memory
– Threads start throwing OutOfOffHeapMemoryException
• => causing the Geode member to close and disconnect
– Closes the Cache to prevent reading inconsistent data
– Disconnects from the Geode cluster to prevent distribution problems or hangs
12. Eviction and Critical Thresholds for Java Heap
• CriticalHeapPercentage
– triggers LowMemoryException for puts into heap regions
– default is 90%
– critical member informs other members that it is critical
• EvictionHeapPercentage
– triggers eviction of entries in heap regions configured with LRU_HEAP
– default is 90% of CriticalHeapPercentage
13. Eviction and Critical Thresholds for Off-heap
• CriticalOffHeapPercentage
– triggers LowMemoryException for puts into off-heap regions
– default is 90% if --off-heap-memory-size is specified
– critical member informs other members that it is critical
• EvictionOffHeapPercentage
– triggers eviction of entries in off-heap regions configured with LRU_HEAP
– default is 90% of CriticalOffHeapPercentage if --off-heap-memory-size is specified
15. ResourceManager API
• GemFireCache#getResourceManager()
• com.gemstone.gemfire.cache.control.ResourceManager
– exposes getters/setters for all of the heap and off-heap threshold percentages
– Examples:
▪ public void setCriticalOffHeapPercentage(float offHeapPercentage);
▪ public float getCriticalOffHeapPercentage();
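A short usage sketch of these getters and setters; the cache bootstrap and the eviction-threshold setter name are assumptions that follow the naming pattern shown above rather than text from this deck:

import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.control.ResourceManager;

public class OffHeapThresholds {
  public static void main(String[] args) {
    Cache cache = new CacheFactory()
        .set("off-heap-memory-size", "200g")
        .create();

    ResourceManager rm = cache.getResourceManager();

    // Puts into off-heap regions throw LowMemoryException above this usage.
    rm.setCriticalOffHeapPercentage(90.0f);
    // LRU_HEAP off-heap regions start evicting above this usage
    // (setter name assumed to mirror the critical-threshold setter).
    rm.setEvictionOffHeapPercentage(81.0f);

    System.out.println("critical off-heap % = " + rm.getCriticalOffHeapPercentage());
    cache.close();
  }
}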
17. Statistics
• compactions – The total number of times off-heap memory has been compacted.
• compactionTime – The total time spent compacting off-heap memory.
• fragmentation – The percentage of off-heap memory fragmentation. Updated every time a compaction is performed.
• fragments – The number of fragments of free off-heap memory. Updated every time a compaction is done.
• freeMemory – The amount of off-heap memory, in bytes, that is not being used.
• largestFragment – The largest fragment of memory found by the last compaction of off-heap memory. Updated every time a compaction is done.
• maxMemory – The maximum amount of off-heap memory, in bytes. This is the amount of memory allocated at startup and does not change.
• objects – The number of objects stored in off-heap memory.
• reads – The total number of reads of off-heap memory.
• usedMemory – The amount of off-heap memory, in bytes, that is being used to store data.
18. MBeans
MemberMXBean
getOffHeapCompactionTime -- provides the value of the compactionTime statistic
getOffHeapFragmentation -- provides the value of the fragmentation statistic
getOffHeapFreeMemory -- provides the value of the freeMemory statistic
getOffHeapObjects -- provides the value of the objects statistic
getOffHeapUsedMemory -- provides the value of the usedMemory statistic
getOffHeapMaxMemory -- provides the value of freeMemory + usedMemory
RegionMXBean
listRegionAttributes (operation) -- now includes the enableOffHeapMemory attribute (true | false)
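A hedged sketch of reading the off-heap MBean attributes listed above from inside the same JVM; ManagementService as the entry point is an assumption about the management API, while the getter names come from this slide:

import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.management.ManagementService;
import com.gemstone.gemfire.management.MemberMXBean;

public class OffHeapMBeanProbe {
  // Prints the off-heap attributes exposed by the local MemberMXBean.
  static void printOffHeapStats(Cache cache) {
    ManagementService service = ManagementService.getManagementService(cache);
    MemberMXBean member = service.getMemberMXBean();

    System.out.println("off-heap max bytes  : " + member.getOffHeapMaxMemory());
    System.out.println("off-heap used bytes : " + member.getOffHeapUsedMemory());
    System.out.println("off-heap free bytes : " + member.getOffHeapFreeMemory());
    System.out.println("off-heap objects    : " + member.getOffHeapObjects());
    System.out.println("fragmentation (%)   : " + member.getOffHeapFragmentation());
    System.out.println("compaction time     : " + member.getOffHeapCompactionTime());
  }
}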
19. Gfsh Support for Off-heap Memory
• alter disk-store: new option "--off-heap" for setting off-heap for each region in the disk-store
• create region: new option "--off-heap" for setting off-heap
• describe member: now displays the off-heap size
• describe offline-disk-store: now shows if a region is off-heap
• describe region: now displays the off-heap region attribute
• show metrics: now has an offheap category. The offheap metrics are: maxMemory, freeMemory, usedMemory, objects, fragmentation, and compactionTime
• start server: added --lock-memory, --off-heap-memory-size, --critical-off-heap-percentage, and --eviction-off-heap-percentage
20. Off-heap Limitations
• Maximum object size limited to slightly less than 2 GB
• All data nodes must consistently configure a region to be off-heap
• Functional Range Indexes not supported
• Keys, subscription queue entries not stored off-heap
• Fragmentation statistic is only updated during off-heap compactions
22. Off-heap: How are We Doing It?
• Using memory that is separate from the Java heap
– Build our own Memory Manager
– Memory Manager is very finely tuned and specific to our usage
– Avoid GC overhead
▪ Avoid copying of objects for promotion between generations
▪ Garbage Collector is a major performance killer
– Use sun.misc.Unsafe API for performance
• Optimizing code to minimize usage of heap memory
• Using off-heap as primary store instead of overflowing to it
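The slides mention sun.misc.Unsafe; the snippet below is only an illustration of that general technique (allocate, write, read, and free raw memory outside the GC-managed heap), not Geode's memory manager:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeOffHeapDemo {
  public static void main(String[] args) throws Exception {
    // Unsafe is not publicly constructible; grab the singleton reflectively.
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe unsafe = (Unsafe) f.get(null);

    // 1 KB allocated outside the Java heap: the GC never scans or moves it.
    long address = unsafe.allocateMemory(1024);
    try {
      unsafe.putLong(address, 42L);          // write at the raw address
      long value = unsafe.getLong(address);  // read it back
      System.out.println("read back: " + value);
    } finally {
      unsafe.freeMemory(address);            // manual deallocation is mandatory
    }
  }
}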
24. Off-heap Implementation
• Memory allocated in 2GB slabs
– Max data value size: ~2GB
– Object values stored serialized; blobs stored as byte arrays
– Allocation faster for values < 128KB
▪ Controlled by a system property: gemfire.OFF_HEAP_FREE_LIST_COUNT
▪ First try to allocate from the free list; if that fails, allocate from unused memory
▪ Small values (< 8B) inlined (not using any off-heap space)
• Compaction consolidates free memory to minimize fragmentation
– Blocks writes; best to avoid by minimizing fragmentation
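A toy illustration of the allocation order described above (free list first, then previously unused slab memory); the data structures and sizes are simplified and are not Geode's actual implementation:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class ToySlabAllocator {
  private final long slabSize;
  private long nextFreeOffset = 0;                                   // bump pointer into the slab
  private final Map<Long, Deque<Long>> freeLists = new HashMap<>();  // chunk size -> freed offsets

  ToySlabAllocator(long slabSize) { this.slabSize = slabSize; }

  long allocate(long size) {
    Deque<Long> list = freeLists.get(size);
    if (list != null && !list.isEmpty()) {
      return list.pop();                     // 1) reuse a freed chunk of this size
    }
    if (nextFreeOffset + size > slabSize) {  // 2) otherwise carve from unused memory
      throw new IllegalStateException("slab exhausted; eviction or compaction needed");
    }
    long offset = nextFreeOffset;
    nextFreeOffset += size;
    return offset;
  }

  void free(long offset, long size) {
    freeLists.computeIfAbsent(size, s -> new ArrayDeque<>()).push(offset);
  }
}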
25. Off-heap Implementation (cont’d)
• Allocated chunks
– Header
▪ isSerialized
▪ isCompressed
▪ Size
▪ Padding size
• Free chunks
– Header
▪ Size
▪ Address of next chunk in the list
26. What is Stored On-heap vs. Off-heap
• Stored on-heap: Region Meta-Data, Entry Meta-Data, Off-Heap Addresses, Keys, Indexes, Subscription Queue Elements
• Stored off-heap: Values, Reference Counts, Lists of Free Memory Blocks, WAN Queue Elements
28. Off-heap: Initial Testing Results
• 256 GB user data per node across 8 nodes for a total of 2 TB of user data
• Heap-only test worked twice as hard to produce 1/3 the updates of the test using off-heap
– Details on the next slide
• Succeeded in scaling up to much larger in-memory data
• Increased throughput of operations for large data sets
29. Heap vs. Off-Heap Comparison
• creates/sec – Java heap: 30,000; off-heap: 45,000
• updates/sec – Java heap: 17,000 (std dev: 2130); off-heap: 51,000 (std dev: 737)
• Java RSS size – Java heap: 50 GB; off-heap: 32 GB
• CPU load – Java heap: 70% (load avg 10 cpus); off-heap: 32% (load avg 5 cpus)
• JVM GC – Java heap: ConcurrentMarkSweep; off-heap: ConcurrentMarkSweep
• GC ms/sec – Java heap: 777 ms; off-heap: 24 ms
• GC marks (GC pauses) – Java heap: 1 per 30 sec; off-heap: never
31. Off-heap Rules of Thumb
• Avoid fragmentation
– In order to avoid compaction
– Avoid usage patterns that lead to fragmentation
▪ Many updates of varying value size
• Avoid “unfriendly” features
– Deltas
– Functional Range Indexes
– Querying
32. Off-heap Recommendations
• Do use when
– The values are relatively uniform in size
– The values are mostly less than 128K in size
– The usage patterns involve cycles of many creates followed by destroys or clear
– The values do not need to be frequently deserialized
• Configure all data nodes with the same off-heap-memory-size
34. We’d appreciate your thoughts...
• Would you like an API to invoke a compaction?
• Would you like to be able to configure the slab size?
• Would you like to configure the max value size for the most efficient off-heap allocation, or maybe the size increment?
• Anything else?
• Full spec at: https://cwiki.apache.org/confluence/display/GEODE/Off-Heap+Memory+Spec