Presentation by Kadir Ozdemir, Principal Architect - Salesforce, recorded at Distributed SQL Summit on Sept 20, 2019.
https://vimeo.com/362358494
distributedsql.org/
2. Why Phoenix at Salesforce?
● Massive data scale with a familiar interface
● Trusted storage
● Consistent
● Multi-cloud
● Multi-tenancy
3. [Architecture diagram] A Phoenix application on the application server uses the Phoenix client, layered on the HBase client, to issue SQL; the client turns the SQL into table scans and mutations sent over RPC to the HBase region servers, where the Phoenix server runs against table regions stored on HDFS.
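A minimal usage sketch of this stack, assuming an illustrative ZooKeeper quorum and a hypothetical USERS table: the application issues SQL through the Phoenix JDBC driver, which drives the embedded HBase client to run scans against the region servers.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixClientExample {
    public static void main(String[] args) throws Exception {
        // "localhost" stands in for the ZooKeeper quorum of the HBase cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // The Phoenix client compiles this SQL into HBase scans that are
            // executed by the region servers running the Phoenix server code.
            try (ResultSet rs = stmt.executeQuery("SELECT ID, NAME, CITY FROM USERS LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("ID") + " " + rs.getString("NAME") + " " + rs.getString("CITY"));
                }
            }
        }
    }
}
```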
5. Secondary Indexing

Data Table (primary key: ID)
ID    Name    City
1234  Ashley  Seattle
2345  Kadir   San Francisco

Index Table (secondary key: City)
City           ID    Name
San Francisco  2345  Kadir
Seattle        1234  Ashley
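The data table and index table above can be declared with standard Phoenix DDL. A sketch over the Phoenix JDBC driver; the USERS table and USERS_CITY_IDX index names are assumptions chosen to match the example rows.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateIndexExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // Data table: ID is the primary key (the HBase row key).
            stmt.execute("CREATE TABLE IF NOT EXISTS USERS ("
                    + " ID VARCHAR NOT NULL PRIMARY KEY,"
                    + " NAME VARCHAR,"
                    + " CITY VARCHAR)");
            // Global index: CITY leads the index table's row key (followed by the
            // data table primary key); NAME is stored as an included column.
            stmt.execute("CREATE INDEX IF NOT EXISTS USERS_CITY_IDX ON USERS (CITY) INCLUDE (NAME)");
        }
    }
}
```

A query such as SELECT ID, NAME FROM USERS WHERE CITY = 'Seattle' can then be served entirely from the index table.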
6. Secondary Indexing - Update

Data Table (primary key: ID)
ID    Name    City
1234  Ashley  Seattle
2345  Kadir   San Francisco

Index Table (secondary key: City)
City           ID    Name
San Francisco  2345  Kadir
Seattle        1234  Ashley
7. Secondary Indexing - Update

Data Table (primary key: ID)
ID    Name    City
1234  Ashley  Seattle
2345  Kadir   San Francisco

Index Table (secondary key: City)
City     ID    Name
Seattle  1234  Ashley

(The old index row for San Francisco has been deleted; the data table has not yet been updated.)
8. Global Secondary Indexing - Update

Data Table (primary key: ID)
ID    Name    City
1234  Ashley  Seattle
2345  Kadir   Seattle

Index Table (secondary key: City)
City     ID    Name
Seattle  1234  Ashley
Seattle  2345  Kadir

(Kadir's city has been updated from San Francisco to Seattle in the data table, and the matching index row has been rewritten.)
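The update walked through in slides 6-8 corresponds to a single UPSERT against the data table; Phoenix deletes the old index row and writes the new one on the application's behalf. A sketch reusing the assumed USERS table from the earlier example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UpdateRowExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             PreparedStatement ps = conn.prepareStatement(
                     "UPSERT INTO USERS (ID, NAME, CITY) VALUES (?, ?, ?)")) {
            // Move Kadir from San Francisco to Seattle; the old index row
            // (San Francisco, 2345) is removed and (Seattle, 2345) is written.
            ps.setString(1, "2345");
            ps.setString(2, "Kadir");
            ps.setString(3, "Seattle");
            ps.executeUpdate();
            conn.commit();
        }
    }
}
```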
9. Current Design Challenges
● Tries to make tables consistent at write time by relying on client retries
○ May not handle correlated failures and may leave the data table inconsistent with its indexes
● Needs external tools to detect inconsistencies and repair them
10. Design Objectives
● Secondary indexes should always be in sync with their data tables
● Strong consistency should not result in a significant performance impact
● Strong consistency should not impact scalability significantly
11. Observations
● Data must be consistent at read time
○ An index table row can be repaired from the corresponding data table row at read time
● In HBase, writes are fast
○ We can add an extra write phase without severely impacting write performance
12. Strongly Consistent Design
Operation Strongly Consistent Design
Read
1. Read the index rows and check their status
2. Unverified index rows are repaired from the data table
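As an illustration only (the real logic runs server side in Phoenix coprocessors), a read with repair might look like the following against the HBase client API; the column family and VERIFIED status column are assumed stand-ins for Phoenix's actual row-status encoding.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadRepairSketch {
    private static final byte[] CF = Bytes.toBytes("0");            // assumed column family
    private static final byte[] STATUS = Bytes.toBytes("VERIFIED"); // assumed status column

    // 1. Read the index rows and check their status.
    // 2. Repair unverified rows from the corresponding data table rows.
    static List<Result> readWithRepair(Table indexTable, Table dataTable, Scan scan) throws IOException {
        List<Result> rowsToServe = new ArrayList<>();
        try (ResultScanner scanner = indexTable.getScanner(scan)) {
            for (Result indexRow : scanner) {
                byte[] status = indexRow.getValue(CF, STATUS);
                if (status != null && Bytes.toBoolean(status)) {
                    rowsToServe.add(indexRow);                 // verified: safe to serve
                } else {
                    // Unverified: rebuild the answer from the data table row.
                    byte[] dataRowKey = extractDataRowKey(indexRow.getRow());
                    Result dataRow = dataTable.get(new Get(dataRowKey));
                    if (!dataRow.isEmpty()) {
                        rowsToServe.add(dataRow);              // serve the repaired result
                    }                                          // else: stale index row, skip it
                }
            }
        }
        return rowsToServe;
    }

    // In a Phoenix global index the data table row key is embedded in the index
    // row key; the real decoding is schema-dependent (placeholder here).
    static byte[] extractDataRowKey(byte[] indexRowKey) {
        return indexRowKey;
    }
}
```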
13. Strongly Consistent Design
Operation Strongly Consistent Design
Read
1. Read the index rows and check their status
2. Unverified index rows are repaired from the data table
Write
1. Set the status of existing index rows to unverified and write the new index rows with the unverified status
2. Write the data table rows
3. Delete the existing index rows and set the status of new rows to verified
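A similarly hedged sketch of the three write phases using the HBase client API; again the VERIFIED status column is an assumption, and in Phoenix the phases are coordinated on the server rather than by the application.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ThreePhaseWriteSketch {
    private static final byte[] CF = Bytes.toBytes("0");            // assumed column family
    private static final byte[] STATUS = Bytes.toBytes("VERIFIED"); // assumed status column

    static void write(Table dataTable, Table indexTable,
                      byte[] oldIndexRowKey, Put newIndexRowPut,
                      Put dataRowPut) throws IOException {
        // Phase 1: mark the existing index row unverified and write the new
        // index row, also with unverified status.
        if (oldIndexRowKey != null) {
            indexTable.put(new Put(oldIndexRowKey).addColumn(CF, STATUS, Bytes.toBytes(false)));
        }
        newIndexRowPut.addColumn(CF, STATUS, Bytes.toBytes(false));
        indexTable.put(newIndexRowPut);

        // Phase 2: write the data table row.
        dataTable.put(dataRowPut);

        // Phase 3: delete the old index row and set the new row's status to verified.
        if (oldIndexRowKey != null) {
            indexTable.delete(new Delete(oldIndexRowKey));
        }
        indexTable.put(new Put(newIndexRowPut.getRow()).addColumn(CF, STATUS, Bytes.toBytes(true)));
    }
}
```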
14. Strongly Consistent Design
Operation Strongly Consistent Design
Read
1. Read the index rows and check their status
2. Unverified index rows are repaired from the data table
Write
1. Set the status of existing index rows to unverified and write the new index rows with the unverified status
2. Write the data table rows
3. Delete the existing index rows and set the status of new rows to verified
Delete
1. Set the status of the index table rows to unverified
2. Delete the data table rows
3. Delete index table rows
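The delete path follows the same pattern; a short sketch with the same assumed column family and status column as above:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ThreePhaseDeleteSketch {
    private static final byte[] CF = Bytes.toBytes("0");            // assumed column family
    private static final byte[] STATUS = Bytes.toBytes("VERIFIED"); // assumed status column

    static void delete(Table dataTable, Table indexTable,
                       byte[] dataRowKey, byte[] indexRowKey) throws IOException {
        // Phase 1: mark the index row unverified so readers stop trusting it.
        indexTable.put(new Put(indexRowKey).addColumn(CF, STATUS, Bytes.toBytes(false)));
        // Phase 2: delete the data table row.
        dataTable.delete(new Delete(dataRowKey));
        // Phase 3: delete the index table row.
        indexTable.delete(new Delete(indexRowKey));
    }
}
```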
15. Correctness Without Concurrent Row Updates
● A missing index row is not possible
○ An index row is updated before its data row
■ If the index update fails, the data row update will not be attempted
○ An index row is deleted only after its data table row is deleted
● A verified index row implies the existence of the corresponding data row
○ The status for an index row is set to verified only after the corresponding data row is written
○ The status for an index row is set to unverified before the corresponding data row is deleted
● Unverified index rows are not used for serving user queries
○ An unverified index row is repaired from its data row during scans
16. Correctness With Concurrent Row Updates
● The third phase is skipped for concurrent updates
○ Detect concurrent updates and leave them in the unverified state
● Use two-phase row locking to detect concurrent updates on a data row
[Diagram] Read the data table → (phase 1) index table update → (phase 2) update the data table → (phase 3) index table update; a Pending Rows set tracks in-flight data rows (rows are added before the update and removed once it completes).
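One way to picture the Pending Rows bookkeeping is a small in-memory tracker keyed by data row key, consulted under the row lock; this single-JVM sketch is an assumption-level illustration, not the region server implementation.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified sketch of "Pending Rows" bookkeeping for detecting concurrent
 * updates to the same data row. In the real design this runs inside the
 * region server under HBase row locks.
 */
public class PendingRowsSketch {
    // Row key -> number of in-flight updates for that row.
    private final Map<String, Integer> pendingRows = new HashMap<>();
    // Row key -> whether a concurrent update was observed while the row was pending.
    private final Map<String, Boolean> sawConcurrentUpdate = new HashMap<>();

    /** Called at phase 1, under the row lock: register the update and detect overlap. */
    synchronized boolean beginUpdate(String rowKey) {
        int inFlight = pendingRows.merge(rowKey, 1, Integer::sum);
        boolean concurrent = inFlight > 1;
        if (concurrent) {
            sawConcurrentUpdate.put(rowKey, true);
        }
        return concurrent;
    }

    /** Called after phase 2: unregister the row and report whether phase 3 must be skipped. */
    synchronized boolean endUpdate(String rowKey) {
        boolean concurrent = sawConcurrentUpdate.getOrDefault(rowKey, false);
        if (pendingRows.merge(rowKey, -1, Integer::sum) <= 0) {
            pendingRows.remove(rowKey);
            sawConcurrentUpdate.remove(rowKey);
        }
        return concurrent; // true: leave index rows unverified for read repair
    }
}
```

When endUpdate reports a concurrent update, phase 3 is skipped, the affected index rows remain unverified, and they are repaired from the data table the next time they are read.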
17. Performance Impact of Strong Consistency
● Setup: a data table with two indexes on a 10-node cluster
○ 1 billion large rows with random primary keys
○ Top N queries on indexes where N is 50
● Less than 25% increase in write latency
○ Due to setting row status in phase 3
● No noticeable increase in read latency
○ The number of unverified rows due to pending updates on a given table region is limited by the number of RPC threads and the mutation batch size