Deep dive into Clustered Columnstore structures with information on compression algorithms, compression types, locking and dictionaries, as well as the Batch Processing Mode.
ETL with Clustered Columnstore - PASS Summit 2014, by Niko Neugebauer
You will find some basic information about ways to extract, load & maintain information with Clustered Columnstore Indexes.
You will need some knowledge of Columnstore Index structures to make use of it.
4. Niko Neugebauer
Microsoft Data Platform Professional
OH22 (http://www.oh22.net)
15+ years in IT
SQL Server MVP
Founder of 3 Portuguese PASS Chapters
Blog: http://www.nikoport.com
Twitter: @NikoNeugebauer
LinkedIn: http://pt.linkedin.com/in/webcaravela
Email: info@webcaravela.com
5. So this is supposedly a Deep Dive
My assumptions:
You have heard about Columnstore Indexes
You understand the difference between RowStore and Columnstore
You know about the existence of Dictionaries in Columnstore Indexes
You know how locking & blocking works (at least understand the S, IS, IX and X locks)
You have used the DBCC PAGE functionality
You are crazy enough to believe that this topic could be expanded to this kind of level
7. About Columnstore Indexes:
Reading Fact tables
Reading big Dimension tables
Very low-activity, big OLTP tables which are scanned & processed almost entirely
Data Warehouses
Decision Support Applications
Business Intelligence Applications
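For reference, on SQL Server 2014 these scenarios are served by putting a clustered columnstore index on the table. A minimal sketch, against a hypothetical dbo.FactSales fact table:

-- Replaces the rowstore storage of dbo.FactSales with columnstore storage
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
    ON dbo.FactSales;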
8. Clustered Columnstore in SQL Server 2014
Delta-Stores (open & closed)
Deleted Bitmap
Deletes mark rows in the Deleted Bitmap; Updates work as DELETE + INSERT
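The Delta-Stores and the Deleted Bitmap can be observed through the sys.column_store_row_groups catalog view (new in SQL Server 2014). A minimal sketch, assuming the same hypothetical dbo.FactSales table:

SELECT row_group_id,
       state_description,   -- OPEN / CLOSED Delta-Stores, COMPRESSED row groups
       total_rows,
       deleted_rows         -- rows marked in the Deleted Bitmap
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('dbo.FactSales')
ORDER BY row_group_id;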
10. Batch Mode
A new model of data processing.
Traditional query execution works by calling GetNext(), which delivers data to the CPU. (In its turn, the call goes down the stack and gets physical access to the data.) Every operator behaves this way, which makes GetNext() a virtual function.
In some execution plans you will have hundreds of these function invocations before you get a single actual row.
For OLTP this might be a good idea, since we are working with just a few rows, but if you are working with millions of rows (in BI or DW) you will make billions of such invocations.
Enter Batch Mode, which fetches data for processing not row by row, but in batches of ~900 rows.
This can bring benefits of 10x to 100x.
11. Batch Mode
In Batch Mode every operator down the stack has to play the same game, passing the same amount of data; Row Mode can't interact with Batch Mode.
64 rows vs ~900 rows.
(Programmers: it's like passing an array vs passing parameters one by one.)
Works exclusively with Columnstore Indexes.
Works exclusively in parallel plans, hence MAXDOP >= 2 (see the sketch below).
Think of it as factory processing vs manual processing (19th vs 18th century).
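A minimal sketch of seeing the MAXDOP dependency in practice: run the same aggregation twice against a hypothetical dbo.FactSales table with a clustered columnstore index, and compare the Actual Execution Mode of the Columnstore Index Scan in the two actual execution plans:

-- Forced serial plan: on SQL Server 2014 the scan runs in Row Mode
SELECT SalesYear, SUM(SalesAmount) AS Total
FROM dbo.FactSales
GROUP BY SalesYear
OPTION (MAXDOP 1);

-- Parallel plan: the scan can run in Batch Mode
SELECT SalesYear, SUM(SalesAmount) AS Total
FROM dbo.FactSales
GROUP BY SalesYear
OPTION (MAXDOP 4);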
12. Batch Mode is fragile
Not every operator is implemented in Batch Mode.
Examples of Row Mode operators: Sort, Exchange, Inner Loop Join, Merge Join, ...
Any disturbance in the force will make Batch Execution Mode fall back to Row Execution Mode; a lack of memory, for example.
SQL Server 2014 introduces the so-called "Mixed Mode", where execution plan operators in Row Mode can co-exist with Batch Mode operators.
13. Batch Mode Deep Dive
Optimized for the 64-bit values of the CPU registers.
Late materialization (working on compressed values).
The batch size is tuned so a batch fits into the L2 cache, with the idea of avoiding cache misses.
14. Cache Latency
L1 cache reference: 0.5 ns
L2 cache reference: 7.0 ns (14x slower than L1)
L3 cache reference: 28.0 ns (4x slower than L2)
L3 cache reference (outside the NUMA node): 42.0 ns (6x slower than L2)
Main memory reference: 100 ns (roughly 3.5x slower than L3)
Read 1 MB sequentially from memory: 250,000 ns (500,000x an L1 cache reference)
16. SQL Server 2014 Batch Mode
• All execution improvements are done for both Nonclustered & Clustered Columnstores
• Mixed Mode: Row Mode & Batch Mode operators can co-exist
• OUTER JOIN, UNION ALL, EXISTS, IN, Scalar Aggregates, Distinct Aggregates all work in Batch Mode
• Some TempDB operations for Columnstore Indexes run in Batch Mode (TempDB spills)
26. Run-length compression, more complex scenario

Original data:
Name      Last Name
Mark      Simpson
Mark      Donalds
John      Simpson
Andre     White
Andre     Donalds
Andre     Simpson
Ricardo   Simpson
Mark      Simpson
Charlie   Simpson
Mark      White
Charlie   Donalds

Sorted by Name:
Name      Last Name
Mark      Simpson
Mark      Donalds
Mark      Simpson
Mark      White
John      Simpson
Andre     White
Andre     Donalds
Andre     Simpson
Ricardo   Simpson
Charlie   Simpson
Charlie   Donalds

Run-length encoded (on the data sorted by Name):
Name        Last Name
Mark:4      Simpson:1
John:1      Donalds:1
Andre:3     Simpson:1
Ricardo:1   White:1
Charlie:2   Simpson:1
            White:1
            Donalds:1
            Simpson:3
            Donalds:1
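A minimal T-SQL sketch of the same run-length encoding idea, using the classic gaps-and-islands trick; dbo.People is a hypothetical table with an id column defining the row order:

WITH Ordered AS (
    SELECT id, Name,
           -- consecutive rows with the same Name fall into the same group
           ROW_NUMBER() OVER (ORDER BY id)
         - ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS grp
    FROM dbo.People
)
SELECT Name,
       COUNT(*) AS RunLength   -- e.g. Mark:4, John:1, Andre:3, ...
FROM Ordered
GROUP BY Name, grp
ORDER BY MIN(id);              -- keep the runs in their original order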
27. Run-length compression, more complex scenario, part 2

Original data:
Name      Last Name
Mark      Simpson
Mark      Donalds
John      Simpson
Andre     White
Andre     Donalds
Andre     Simpson
Ricardo   Simpson
Mark      Simpson
Charlie   Simpson
Mark      White
Charlie   Donalds

Sorted by Last Name:
Name      Last Name
Andre     Donalds
Charlie   Donalds
Mark      Donalds
Mark      Simpson
Mark      Simpson
Andre     Simpson
Ricardo   Simpson
John      Simpson
Charlie   Simpson
Andre     White
Mark      White

Run-length encoded (on the data sorted by Last Name):
Name        Last Name
Andre:1     Donalds:3
Charlie:1   Simpson:6
Mark:3      White:2
Andre:1
Ricardo:1
John:1
Charlie:1
Andre:1
Mark:1
29. Huffman encoding
Variable-length bit codes, in contrast to a fixed-width encoding such as ASCII.

Name      Count   Code
Mark      4       001
Andre     3       010
Charlie   2       011
John      1       100
Ricardo   1       101

Fairly efficient: ~O(N log N).
A Huffman code can be designed in linear time if the input probabilities (aka weights) are already sorted.

Input data:
Name      Last Name
Mark      Simpson
Mark      Donalds
Mark      Simpson
Mark      White
John      Simpson
Andre     White
Andre     Donalds
Andre     Simpson
Ricardo   Simpson
Charlie   Simpson
Charlie   Donalds
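The Count column above is simply the value frequency, which supplies the weights from which the Huffman code is built. A minimal sketch of computing the sorted weights in T-SQL, over the same hypothetical dbo.People table:

SELECT Name,
       COUNT(*) AS Weight     -- frequency used as the Huffman weight
FROM dbo.People
GROUP BY Name
ORDER BY Weight DESC;         -- pre-sorted weights allow linear-time code design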
31. Binary Compression
The super-secret VertiPaq (aka xVelocity) compression turns the data into LOBs.
The LOBs are stored using the traditional storage mechanisms (8K pages & extents).
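The outcome of this compression can be inspected per segment through the sys.column_store_segments catalog view; a minimal sketch (the table name is hypothetical):

SELECT p.object_id,
       s.column_id,
       s.segment_id,
       s.encoding_type,   -- which encoding was chosen for the segment
       s.row_count,
       s.on_disk_size     -- compressed size of the segment, in bytes
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
    ON s.partition_id = p.partition_id
WHERE p.object_id = OBJECT_ID('dbo.FactSales');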
33. Columnstore Archival Compression
One more compression level.
Applied on top of the xVelocity compression.
It is a slight modification of LZ77 (aka Zip).
New!
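Archival compression is turned on (and off) per index or per partition through a rebuild; a minimal sketch on a hypothetical index:

-- Switch to archival compression
ALTER INDEX CCI_FactSales ON dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

-- Switch back to the default columnstore compression
ALTER INDEX CCI_FactSales ON dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE);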
34. Compression Recap:
Determining the best algorithm is the principal key to the success of xVelocity. The process includes data shuffling between segments and different methods of compression.
Every segment holds different data, so different algorithms are applied with different degrees of success.
If you are seeing a lot of queries with a predicate on a certain column, try creating a traditional clustered index on it first (sorting), and then create the columnstore (see the sketch below).
Every compression is supported at the partition level.
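A minimal sketch of that pre-sorting technique (all names are hypothetical): first sort the table physically with a rowstore clustered index, then replace it with the clustered columnstore index:

-- Step 1: physically sort the data by the frequently filtered column
CREATE CLUSTERED INDEX CCI_FactSales ON dbo.FactSales (SalesDate);

-- Step 2: swap the rowstore index for a clustered columnstore index;
-- the segments are built over the sorted data
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
    ON dbo.FactSales
    WITH (DROP_EXISTING = ON, MAXDOP = 1);  -- MAXDOP 1 helps preserve the ordering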
38. Dictionaries
Global dictionaries contain entries for each and every one of the existing segments of the same column storage.
Local dictionaries contain entries for 1 or more segments of the same column storage.
Sizes vary from 56 bytes (min) to 16 MB (max).
There is a specialized view which provides information on the dictionaries, such as entry count, size, etc.: sys.column_store_dictionaries (queried in the sketch below).
There is an undocumented feature which potentially allows us to inspect the content of the dictionaries (we will see it later).
Not all columns will use dictionaries.
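A minimal sketch of querying that view (the table name is hypothetical):

SELECT p.object_id,
       d.column_id,
       d.dictionary_id,   -- 0 = global dictionary, otherwise a local one
       d.type,
       d.entry_count,
       d.on_disk_size
FROM sys.column_store_dictionaries AS d
JOIN sys.partitions AS p
    ON d.partition_id = p.partition_id
WHERE p.object_id = OBJECT_ID('dbo.FactSales');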
50. BULK Load
A completely separate process.
102,400 is the magic number of rows that gives you a compressed Segment instead of a Delta-Store.
For data loads, if you order the loaded data into chunks of 1,048,576 rows (the maximum row group size), your Columnstore will be almost perfect (see the sketch below).
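A minimal sketch of a bulk load that respects these thresholds, assuming a hypothetical CSV file and table; batches of at least 102,400 rows bypass the Delta-Store and land directly in compressed segments:

BULK INSERT dbo.FactSales
FROM 'C:\loads\FactSales.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 1048576,   -- full row groups: the almost perfect chunk size
    TABLOCK
);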
52. Memory Management
Columnstore Indexes consume A LOT of memory.
Columnstore Object Pool: a new special memory pool in SQL Server 2012+.
A new Memory Broker divides memory between the Row Store & the Column Store.
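The Columnstore Object Pool can be observed through its memory clerk; a minimal sketch:

SELECT [type],
       name,
       pages_kb   -- memory currently held by the columnstore object pool
FROM sys.dm_os_memory_clerks
WHERE [type] = 'CACHESTORE_COLUMNSTOREOBJECTPOOL';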
53. Memory Management
Memory grant for index creation, in MB = (4.2 * Cols_Count + 68) * DOP + String_Cols * 34 (2012 formula)
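For example, for a hypothetical table of 20 columns, 4 of them strings, built at DOP 4, the formula estimates (4.2 * 20 + 68) * 4 + 4 * 34 = 152 * 4 + 136 = 744 MB.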
When not enough memory is granted, you might need to change the Resource Governor limits for the respective workload group (here setting the maximum memory grant to 50 percent):

ALTER WORKLOAD GROUP [DEFAULT] WITH
    (REQUEST_MAX_MEMORY_GRANT_PERCENT = 50);
GO
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO
Memory management is automatic: when there is not enough memory, the DOP is lowered automatically (down to a minimum of 2) so that memory consumption is reduced.
63. Links:
My blog series on Columnstore Indexes (39+ blog posts): http://www.nikoport.com/columnstore/
Remus Rusanu's introduction to Clustered Columnstore: http://rusanu.com/2013/06/11/sql-server-clustered-columnstore-indexes-at-teched-2013/
White paper on the Clustered Columnstore: http://research.microsoft.com/pubs/193599/Apollo3%20-%20Sigmod%202013%20-%20final.pdf