Presentation by Kadir Ozdemir, Principal Architect - Salesforce, recorded at Distributed SQL Summit on Sept 20, 2019.
https://vimeo.com/362358494
distributedsql.org/
2. Why Phoenix at Salesforce?
● Massive data scale with a familiar interface
● Trusted storage
● Consistent
● Multi-cloud
● Multi-tenancy
3. [Architecture diagram: a Phoenix application issues SQL through the Phoenix client, which runs on top of the HBase client; table scans and mutations travel over RPC to the HBase region servers, where the Phoenix server runs alongside each table region, all backed by HDFS.]
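From the application side, the stack in the diagram is reached through a JDBC connection string of the form jdbc:phoenix:&lt;ZooKeeper quorum&gt;. A minimal sketch in Java, with zk-host as a placeholder for a real quorum address:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixClientExample {
    public static void main(String[] args) throws Exception {
        // The Phoenix client parses the SQL and drives the HBase client;
        // "zk-host" is a placeholder for the cluster's ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
             Statement stmt = conn.createStatement();
             // Any query works here; SYSTEM.CATALOG always exists in Phoenix.
             ResultSet rs = stmt.executeQuery(
                     "SELECT TABLE_NAME FROM SYSTEM.CATALOG LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```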
5. Secondary Indexing

Data Table (primary key: ID)
ID    Name    City
1234  Ashley  Seattle
2345  Kadir   San Francisco

Index Table (secondary key: City)
City           ID    Name
San Francisco  2345  Kadir
Seattle        1234  Ashley
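In Phoenix the index table on this slide is not managed by hand: it is declared once with CREATE INDEX and maintained automatically on every write. A minimal sketch, assuming a hypothetical PEOPLE table matching the slide's columns and the same placeholder JDBC URL as above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SecondaryIndexSetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
             Statement stmt = conn.createStatement()) {
            // Data table: ID is the primary key (the HBase row key).
            stmt.execute("CREATE TABLE IF NOT EXISTS PEOPLE ("
                    + "ID BIGINT NOT NULL PRIMARY KEY, "
                    + "NAME VARCHAR, "
                    + "CITY VARCHAR)");
            // Global index: CITY becomes the leading part of the index row
            // key; NAME is carried along as a covered column so queries by
            // city never have to touch the data table.
            stmt.execute("CREATE INDEX IF NOT EXISTS PEOPLE_CITY_IDX "
                    + "ON PEOPLE (CITY) INCLUDE (NAME)");
        }
    }
}
```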
6. Secondary Indexing - Update

Before the update, the two tables are in sync:

Data Table (primary key: ID)
ID    Name    City
1234  Ashley  Seattle
2345  Kadir   San Francisco

Index Table (secondary key: City)
City           ID    Name
San Francisco  2345  Kadir
Seattle        1234  Ashley
7. Secondary Indexing - Update

The old index row (San Francisco, 2345, Kadir) is deleted first, leaving the tables temporarily inconsistent:

Data Table (primary key: ID)
ID    Name    City
1234  Ashley  Seattle
2345  Kadir   San Francisco

Index Table (secondary key: City)
City     ID    Name
Seattle  1234  Ashley
8. Global Secondary Indexing - Update

After the update, Kadir's city is Seattle in both tables:

Data Table (primary key: ID)
ID    Name    City
1234  Ashley  Seattle
2345  Kadir   Seattle

Index Table (secondary key: City)
City     ID    Name
Seattle  1234  Ashley
Seattle  2345  Kadir
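In SQL terms, the three slides above correspond to a single UPSERT against the data table; Phoenix removes the stale index row and writes the new one on the client's behalf. A sketch continuing the hypothetical PEOPLE table and index from the previous example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SecondaryIndexUpdate {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host")) {
            // Move Kadir from San Francisco to Seattle. The index row
            // (San Francisco, 2345, Kadir) is deleted and (Seattle, 2345,
            // Kadir) is written as part of the same client commit.
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPSERT INTO PEOPLE (ID, NAME, CITY) VALUES (?, ?, ?)")) {
                ps.setLong(1, 2345L);
                ps.setString(2, "Kadir");
                ps.setString(3, "Seattle");
                ps.executeUpdate();
            }
            conn.commit();  // Phoenix connections do not autocommit by default

            // Filtering on CITY and selecting only covered columns lets this
            // query be served entirely from the index table.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT ID, NAME FROM PEOPLE WHERE CITY = ?")) {
                ps.setString(1, "Seattle");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong(1) + " " + rs.getString(2));
                    }
                }
            }
        }
    }
}
```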
9. Current Design Challenges
● Tries to make tables consistent at write time by relying on client retries
○ May not handle correlated failures and may leave the data table inconsistent with its indexes
● Needs external tools to detect and repair inconsistencies
10. Design Objectives
● Secondary indexes should always be in sync with their data tables
● Strong consistency should not result in a significant performance impact
● Strong consistency should not significantly impact scalability
11. Observations
● Data must be consistent at read time
○ An index table row can be repaired from the corresponding data table row at read time
● In HBase, writes are fast
○ We can add an extra write phase without severely impacting write performance
12. Strongly Consistent Design

Read:
1. Read the index rows and check their status
2. Repair any unverified rows from the data table
13. Strongly Consistent Design

Read:
1. Read the index rows and check their status
2. Repair any unverified rows from the data table

Write:
1. Set the status of the existing index rows to unverified and write the new index rows with the unverified status
2. Write the data table rows
3. Delete the existing index rows and set the status of the new rows to verified
14. Strongly Consistent Design

Read:
1. Read the index rows and check their status
2. Repair any unverified rows from the data table

Write:
1. Set the status of the existing index rows to unverified and write the new index rows with the unverified status
2. Write the data table rows
3. Delete the existing index rows and set the status of the new rows to verified

Delete:
1. Set the status of the index table rows to unverified
2. Delete the data table rows
3. Delete the index table rows
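Read as pseudocode, the write path above is a three-phase protocol: make the index pessimistic (unverified) first, then write the data, then verify. The sketch below is illustrative only, not the actual Phoenix coprocessor code; IndexStore, DataStore, and the row types are hypothetical stand-ins for the underlying HBase region operations:

```java
import java.util.List;

public class ThreePhaseIndexWriter {
    // Hypothetical stand-ins for operations on the index and data regions.
    interface IndexStore {
        void markUnverified(List<byte[]> existingIndexRowKeys); // phase 1
        void putUnverified(List<IndexRow> newIndexRows);        // phase 1
        void delete(List<byte[]> existingIndexRowKeys);         // phase 3
        void markVerified(List<IndexRow> newIndexRows);         // phase 3
    }
    interface DataStore {
        void put(List<DataRow> dataRows);                       // phase 2
    }
    record IndexRow(byte[] rowKey) {}
    record DataRow(byte[] rowKey) {}

    void write(IndexStore index, DataStore data,
               List<byte[]> oldIndexKeys, List<IndexRow> newIndexRows,
               List<DataRow> dataRows) {
        // Phase 1: make the index rows unverified before touching the data
        // table, so no failure can leave a verified-but-wrong index row.
        index.markUnverified(oldIndexKeys);
        index.putUnverified(newIndexRows);

        // Phase 2: write the data table rows (memstore + write-ahead log).
        data.put(dataRows);

        // Phase 3: clean up and verify. If the writer dies before this
        // phase, or skips it because of a concurrent update (slide 16),
        // the rows simply stay unverified and are repaired at read time.
        index.delete(oldIndexKeys);
        index.markVerified(newIndexRows);
    }
}
```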
15. Correctness Without Concurrent Row Updates
● A missing index row is not possible
○ An index row is updated before its data row
■ If the index update fails, the data row update is not attempted
○ An index row is deleted only after its data table row is deleted
● A verified index row implies the existence of the corresponding data row
○ The status of an index row is set to verified only after the corresponding data row is written
○ The status of an index row is set to unverified before the corresponding data row is deleted
● Unverified index rows are not used to serve user queries
○ An unverified index row is repaired from its data row during scans
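The last bullet is the read-path half of the protocol: unverified rows are never served as-is; they are re-checked against the data table, which is the source of truth. An illustrative sketch of read-time repair, again using hypothetical stand-in types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class ReadRepairScanner {
    record IndexRow(byte[] indexKey, byte[] dataRowKey, boolean verified) {}
    record DataRow(byte[] rowKey) {}
    interface DataTable {
        Optional<DataRow> get(byte[] rowKey);
    }
    interface IndexTable {
        void rewriteVerifiedFrom(DataRow row); // rebuild the index row
        void delete(byte[] indexKey);          // drop an orphaned index row
    }

    List<IndexRow> scan(List<IndexRow> rawRows, DataTable data, IndexTable index) {
        List<IndexRow> results = new ArrayList<>();
        for (IndexRow row : rawRows) {
            if (row.verified()) {
                results.add(row); // verified rows are trusted as-is
                continue;
            }
            // Repair path: consult the data table, the source of truth.
            Optional<DataRow> dataRow = data.get(row.dataRowKey());
            if (dataRow.isPresent()) {
                // Rebuild the index row from the data row. (A real scanner
                // would also re-check that the rebuilt row still matches
                // the scan's key range before returning it.)
                index.rewriteVerifiedFrom(dataRow.get());
                results.add(new IndexRow(row.indexKey(), row.dataRowKey(), true));
            } else {
                // No data row: the unverified index row is an orphan from a
                // failed write and can be dropped.
                index.delete(row.indexKey());
            }
        }
        return results;
    }
}
```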
16. Correctness With Concurrent Row Updates
● The third phase is skipped for concurrent updates
○ Concurrent updates are detected and their index rows are left in the unverified state
● Two-phase row locking is used to detect concurrent updates on a data row

[Diagram: read the data table → (phase 1) index table update → (phase 2) update the data table → (phase 3) index table update. The data row's key is added to a Pending Rows set when the data table row is read and removed after the phase 2 update.]
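The Pending Rows set in the diagram is what makes concurrency visible: a data row's key sits in the set from the initial read until the data table update completes. A simplified sketch of the idea, using an in-memory concurrent set; the real implementation performs the check under HBase row locks, and both of the conflicting writers skip phase 3:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentUpdateDetector {
    // Shared set of data row keys with an index update in flight.
    private final Set<String> pendingRows = ConcurrentHashMap.newKeySet();

    void update(String dataRowKey) {
        // Registered while reading the data table row (under the HBase row
        // lock in the real implementation). A false return from add() means
        // another update on the same row is already in flight.
        boolean concurrent = !pendingRows.add(dataRowKey);

        phase1WriteUnverifiedIndexRows(dataRowKey);
        phase2WriteDataRow(dataRowKey);
        pendingRows.remove(dataRowKey);

        if (!concurrent) {
            // Phase 3: delete old index rows, mark new ones verified.
            phase3VerifyIndexRows(dataRowKey);
        }
        // Otherwise phase 3 is skipped: the index rows stay unverified and
        // are repaired lazily at read time, as described on slide 15.
    }

    private void phase1WriteUnverifiedIndexRows(String key) { /* elided */ }
    private void phase2WriteDataRow(String key) { /* elided */ }
    private void phase3VerifyIndexRows(String key) { /* elided */ }
}
```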
17. Performance Impact of Strong Consistency
● Setup: a data table with two indexes on a 10-node cluster
○ 1 billion large rows with random primary keys
○ Top-N queries on the indexes, where N is 50
● Less than a 25% increase in write latency
○ Due to setting the row status in phase 3
● No noticeable increase in read latency
○ The number of unverified rows due to pending updates on a given table region is limited by the number of RPC threads and the mutation batch size