Here are the key points about projection segmentation in Vertica:
- Projection segmentation splits large projections into multiple segments and distributes those segments across database nodes for improved parallelism and high availability.
- Segmentation distributes the rows of a projection across all available nodes using a hash function, which spreads the load evenly.
- Segmented projections allow Vertica to parallelize queries by enabling each node to work independently on its portion of the data.
- If a node fails, its segments can be recovered from the duplicate segments stored on other live nodes, ensuring the data remains available.
- Segmentation is determined automatically by Vertica based on projection size and the number of nodes.
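A minimal sketch of how this is declared in SQL (table, column, and projection names here are hypothetical, not from this guide): a large table's projection is segmented by hashing a high-cardinality column across all nodes, while a small table's projection is left unsegmented and replicated on every node.

    CREATE TABLE sales (
        sale_id  INT,
        store_id INT,
        amount   NUMERIC(10,2)
    );

    -- Large projection: hash rows across every node in the cluster.
    CREATE PROJECTION sales_super AS
    SELECT * FROM sales
    ORDER BY store_id
    SEGMENTED BY HASH(sale_id) ALL NODES;

    -- Small projection: replicate the whole thing on every node instead.
    CREATE TABLE store_dim (store_id INT, store_name VARCHAR(50));
    CREATE PROJECTION store_dim_super AS
    SELECT * FROM store_dim
    ORDER BY store_id
    UNSEGMENTED ALL NODES;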
3. Identify key features of Vertica
1. Performance Features
1. Column-orientation
2. Aggressive Compression
3. Read-Optimized Storage
4. Ability to exploit multiple sort orders
5. Parallel shared-nothing design on off-the-shelf hardware
6. Bottom Line
2. Administrative and Management Features
1. Vertica Database Designer
2. Recovery and High Availability through K-Safety
3. Continuous Load: Snapshot Isolation and the WOS
4. Monitoring and Administration Tools and APIs
4. The Vertica Analytic Database Architecture
5. ROS Distribution And Tuple Mover
6. Victor Espinosa
Topics:
- Describe High Availability capabilities and describe Vertica's transaction model.
- Identify characteristics and determine features of projections used in Vertica.
7. High Availability. The ability of the database to continue running even if a node goes down.
[Diagram: projections A, B, and C segmented across three nodes, with buddy copies offset onto adjacent nodes]
Buddy Projections: copies of existing projections stored on adjacent nodes.
K-Safety: 0, 1, or 2
8. High Availability and Recovery
- HP Vertica is said to be K-safe: the cluster keeps running as long as no more than K nodes are down.
High Availability with Projections:
- Vertica replicates small, unsegmented projections: for small tables, it creates and stores duplicates of their projections on all nodes.
- For large, segmented projections, HP Vertica creates buddy projections: copies of the segmented projections that are distributed across database nodes.
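A quick way to check this from SQL, assuming the SYSTEM monitoring table available in Vertica 7.x (a sketch, not from the deck):

    -- Designed vs. currently achievable fault tolerance for the cluster.
    SELECT designed_fault_tolerance, current_fault_tolerance
    FROM system;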
9. Features
- Columnar Orientation. Vertica stores data in columns and reads only the columns referenced by the query.
- Advanced Encoding / Compression. Data is compressed and encoded as part of the database design, reducing disk storage; data does not need to be decoded to return a result.
- High Availability.
- Automatic Database Design. Data is transformed into column-based projections; query performance can be enhanced by comparing the data loaded against the most commonly used SQL queries.
- Application Integration. Vertica uses standard SQL.
- Massively Parallel Processing.
[Diagram: Vertica Analytics integrating with ETL, replication, data quality, and reporting tools]
10. Projections
Characteristics and Features:
- A projection is a representation of the columns in the source tables.
- Vertica stores all data in a columnar format called projections.
- Projections are updated automatically as data is loaded into the database.
- Data is sorted and compressed.
- Vertica distributes the data across all nodes.
3 Types of Projections:
- Superprojections. Contain all the data; they are created when data is first loaded into the database.
- Query-Specific Projections. Contain only the columns needed for a specific query.
- Buddy Projections. Copies of projections stored on an adjacent node.
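To see these types on a live cluster, one can query the catalog; this sketch assumes the v_catalog.projections table of Vertica 7.x. Buddy projections appear as copies sharing a basename, conventionally suffixed _b0, _b1, and so on.

    -- List projections and flag superprojections.
    SELECT projection_name, anchor_table_name, is_super_projection
    FROM v_catalog.projections
    ORDER BY anchor_table_name;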
11. Projections with large amounts of data:
For small amounts of data, segmentation is not efficient; Vertica copies the full projection to each node instead.
13. Vertica's Transaction Model
Vertica follows the SQL-92 transaction model.
- DML commands: INSERT, UPDATE, DELETE.
- You don't have to explicitly start a transaction.
- You must use COMMIT, ROLLBACK, or COPY to end a transaction.
In Vertica:
- DELETE doesn't delete data from disk storage; it marks rows as deleted so they can be found by historical queries.
- UPDATE writes two rows: one with the new data and one marked for deletion.
Like COPY, by default the INSERT, UPDATE, and DELETE commands write data to the WOS and, on overflow, write to the ROS. For large INSERTs or UPDATEs, you can use the DIRECT keyword to force HP Vertica to write rows directly to the ROS. Loading a large number of rows as single-row inserts is not recommended for performance reasons; use COPY instead.
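For example, a bulk load with COPY, using DIRECT to skip the WOS (file path and delimiter are hypothetical):

    -- Preferred over many single-row INSERTs; DIRECT writes straight to the ROS.
    COPY sales FROM '/data/sales.csv' DELIMITER ',' DIRECT;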
14. Cristóbal Gómez
Topics:
A1 - Identify key features of Vertica
C1 - Identify benefits of loading data into WOS and directly into ROS
D4 - Distinguish between deleting partitions and deleting records
F1 - Identify situations when a backup is recommended
H1 - Understanding analytics syntax
16. Arely Sandoval
Encoding
Encoding is the process of converting data into a standard format. Vertica uses a number of different encoding strategies, depending on column data type, table cardinality, and sort order.
Compression
Compression is the process of transforming data into a compact format.
Encoding Types
ENCODING AUTO (default)
Lempel-Ziv-Oberhumer-based (LZO) compression is used for CHAR/VARCHAR, BOOLEAN, BINARY/VARBINARY, and FLOAT columns.
ENCODING DELTAVAL
Stores only the differences between sequential data values instead of the values themselves. This encoding type is best used for integer-based columns, but also applies to DATE/TIME/TIMESTAMP/INTERVAL columns. It has no effect on other data types.
ENCODING RLE
Replaces sequences of identical values with a single value and a count; best for sorted, low-cardinality columns (see "Define RLE" below).
17. ENCODING BLOCK_DICT
For each block of storage, Vertica compiles distinct column values into a dictionary and then stores the dictionary and a list of indexes to represent the data block. It is ideal for few-valued, unsorted columns where saving space is more important than encoding speed. BINARY/VARBINARY columns do not support BLOCK_DICT encoding.
ENCODING BLOCKDICT_COMP
This encoding type is similar to BLOCK_DICT except that dictionary indexes are entropy coded. It requires significantly more CPU time to encode and decode and has poorer worst-case performance. However, it can yield space savings if the distribution of values is extremely skewed.
ENCODING DELTARANGE_COMP
Ideal for many-valued FLOAT columns that are either sorted or confined to a range. Do not use it with unsorted columns that contain NULL values, as the storage cost for representing a NULL value is high. It has a high cost for both compression and decompression.
ENCODING COMMONDELTA_COMP
Ideal for sorted FLOAT and INTEGER-based (DATE/TIME/TIMESTAMP/INTERVAL) data columns with predictable sequences and only occasional sequence breaks, such as timestamps recorded at periodic intervals or primary keys.
ENCODING NONE
Do not specify this value. It increases space usage and processing time, and leads to problems.
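A sketch of how these encoding types are declared in DDL, matching the guidance above (table and columns are hypothetical; Vertica carries the declared encodings through to the table's projections):

    CREATE TABLE readings (
        ts        TIMESTAMP ENCODING COMMONDELTA_COMP, -- periodic sequence
        status    VARCHAR(10) ENCODING RLE,            -- sorted, low cardinality
        device_id INT ENCODING DELTAVAL,               -- integer deltas
        value     FLOAT ENCODING DELTARANGE_COMP       -- sorted FLOAT range
    );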
19. ● D6 - Identify the advantages of a group by pipe versus a group by hash
● F3 - Define the Resource Manager's role in query processing
● H3 - Using explain plans and query profiles
20. Juan Carlos Vázquez Tapia
Topics
● Friday, March 20
○ Section: Projection Design
■ B5 - Understanding buddy projections.
● Tuesday, March 24
○ Section: Removing Data Permanently from Vertica and Advanced Projection Design.
■ D2 - Identify the advantages and disadvantages of using delete vectors to identify records marked for deletion.
● Wednesday, March 25
○ Section: Cluster Management in Vertica.
■ E4 - Define local segmentation capability in Vertica.
● Thursday, March 26
○ Section: Monitoring and Troubleshooting Vertica.
■ G4 - Defining, using, and logging into Management Console.
21. Juan Carlos Vázquez Tapia | Understanding Buddy Projections
Projection Design
B5 - Understanding Buddy Projections
Definition:
HP Vertica creates buddy projections: replicas of existing projections that are distributed across database nodes. HP Vertica ensures that projections containing the same data are placed on different nodes, so that if a node goes down, all the data is still available on the remaining nodes. The number of buddy projections is determined by the value of K, as in K-safety.
22. B5 - Understanding Buddy Projections
Requirements:
Two projections must meet the following requirements to be considered "buddies":
● They must contain the same columns.
● They must have the same hash segmentation.
● They must use different node ordering.
Buddy projections can have different sort orders for query performance purposes.
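A sketch of how a buddy is created in DDL, reusing the hypothetical sales projection from earlier: the buddy keeps the same columns and hash segmentation but shifts the node ordering with OFFSET, and the design is then marked K-safe.

    -- Buddy of sales_super: same columns, same hash, node order shifted by one.
    CREATE PROJECTION sales_super_b1 AS
    SELECT * FROM sales
    ORDER BY store_id
    SEGMENTED BY HASH(sale_id) ALL NODES OFFSET 1;

    -- Declare that the design can tolerate the loss of one node.
    SELECT MARK_DESIGN_KSAFE(1);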
23. Juan Neve
B4 - Describe the process of projection segmentation.
D1 - Describe the process used to mark records for deletion.
E3 - Identify the steps of online recovery of a failed node.
G3 - Describe how to disallow user connections while preserving dbadmin connectivity.
24. B4 - Describe the purpose of projection segmentation
● Provides high availability
● Recovery of data
● Optimizes query execution
26. The random distribution of data is very important for segmentation to be effective: it keeps the load on each node to a minimum, so the cluster runs more efficiently.
Replicated projections provide high availability because all of the data is available on each node. This also helps recovery, because there are more copies on the other nodes.
27. Carlos Leal
1. Determining segmentation and partitioning (B6)
2. Identify the process for processing a large delete or update (D3)
3. Distinguish between the items in Vertica Cluster (E5)
4. Administering a cluster using Management Console (F5)
28. Determining Segmentation and Partitioning
Partitioning and segmentation have completely separate functions in Vertica. It is important to clarify the difference because the concepts are similar, and the terms are often used interchangeably for other databases.
29. Segmentation and Partitioning
Segmentation defines how data is spread among cluster nodes, while partitioning specifies how data is organized within the individual nodes. Segmentation is defined by the projection, and partitioning is defined by the table. Logically, the PARTITION BY clause is applied after the SEGMENTED BY clause.
30. Segmentation and Partitioning
Segmentation and partitioning have opposite goals regarding data localization. Partitioning deliberately introduces hot spots within each node, providing a convenient way to drop data and reclaim disk space. Segmentation (by hash) distributes the data evenly across all nodes in a Vertica cluster.
31. Segmentation and Partitioning
Partitioning by year, for example, makes sense if you intend to retain and drop data at the granularity of a year. On the other hand, segmenting the data by year would be an extremely bad choice, as the node holding data for the current year would likely answer far more queries than the other nodes.
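Putting the two together in one hypothetical sketch: partition the table by year for easy dropping, and segment its projection by hash for even distribution.

    CREATE TABLE store_sales (
        sale_date DATE NOT NULL,
        store_id  INT,
        amount    NUMERIC(10,2)
    )
    PARTITION BY EXTRACT(YEAR FROM sale_date);

    CREATE PROJECTION store_sales_super AS
    SELECT * FROM store_sales
    ORDER BY sale_date
    SEGMENTED BY HASH(store_id) ALL NODES;

    -- Retiring a year of data is then a single partition drop:
    SELECT DROP_PARTITION('store_sales', 2014);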
32. Carlos Leal | Identify the process for processing a large delete or update (D3)
● Performance Considerations for Deletes and Updates
A large number of un-purged deleted rows can negatively affect query and recovery performance. To eliminate the rows that have been deleted from the result, a query must do extra processing. It has been observed that if 10% or more of the total rows in a table have been deleted, the performance of a query on the table slows down; however, your experience may vary depending on the size of the table, the table definition, and the query. The same problem can also occur during recovery. To avoid this, the deleted rows need to be purged in Vertica. For more information, see Purge Procedure.
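A purge can be triggered explicitly; this sketch uses the PURGE_TABLE function (table name hypothetical):

    -- Permanently removes deleted rows that are older than the Ancient History Mark.
    SELECT PURGE_TABLE('public.store_sales');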
33. Carlos Leal | Concurrency
Deletes and updates take exclusive locks on the table. Hence, only one delete or update transaction on that table can be in progress at a time, and only when no loads (or INSERTs) are in progress. Deletes and updates on different tables can be run concurrently.
34. Carlos Leal | Optimizing Deletes and Updates for Performance
The process of optimizing a design for deletes and updates is the same. A few simple steps to optimize a projection design or a delete or update statement can increase query performance by tens to hundreds of times. The following section details several proposed optimizations to significantly increase delete and update performance.
35. Topics (Manuel Loza)
● B2 - Define RLE
● C6 - Understanding both WOS and ROS
● E1 - Identify the steps used to add nodes to an existing cluster
● G1 - Define the use of Management Console in monitoring Vertica
36. Define RLE
Run-Length Encoding:
o Is an encoding method.
o Increases performance because there is less disk I/O during query execution.
o Stores more data in less space.
How does it work?
● It replaces sequences of the same data value within a column with a single value and a count.
Typically used when data is:
1. Sorted
2. Low cardinality
3. Any data type
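A small worked illustration of the idea, with made-up values:

    raw column (sorted):  F F F F F F M M M M
    RLE storage:          (F, 6) (M, 4)

Ten stored values collapse into two (value, count) pairs, and a query that counts rows per value can be answered from the counts alone, without expanding the runs.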
38. Understanding both WOS and ROS
Write Optimized Store (WOS)
● Memory-resident
● Used to store INSERT, UPDATE, DELETE, and COPY actions
● Arranged by projection
● Records are stored in the order they are inserted
o Stores data without compression or indexing, which supports very fast load speeds
● A projection is sorted only when queried
o It remains sorted until new data is inserted into it
● Holds both committed and uncommitted transactions
39. Read Optimized Store (ROS)
● Disk storage structure
o Highly optimized
o Read oriented
● Like the WOS, the ROS is arranged by projection
o Projections in the ROS are stored in ROS containers
● Makes optimal use of sorting (indexing) and compression
● COPY...DIRECT and INSERT (with the /*+direct*/ hint) load data directly into the ROS
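For example, the hint form mentioned above (the table is the hypothetical sales table from earlier):

    -- Bypasses the WOS for this statement and writes straight to the ROS.
    INSERT /*+direct*/ INTO sales VALUES (1, 42, 19.99);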
40. Luis Cárdenas
C2 - Define the actions of the moveout and mergeout tasks
D5 - Identify the advantages of merge join versus hash join.
F2 - Features of the Vertica file used for backup and restore
H2 - Using event-based windows, time series, event series join, and pattern matching.
41. Ruben Gonzalez
A. Vertica Architecture (Friday 20)
4. Installation of Vertica.
C. Loading Data into Vertica (Monday 23)
4. Copying data directly to ROS
D. Removing Data Permanently from Vertica and Advanced Projection Design (Tuesday 24)
7. Describe the characteristics of a prejoin projection.
F. Backup/Restore and Resource Management in Vertica (Thursday 26)
4. Describe the differences between MAXCONCURRENCY and PLANNEDCONCURRENCY.
42. Laura López
B3 - Describe ORDER BY importance in projection design
C7 - Distinguishing between moveout and mergeout actions
E2 - Describe the benefits of having identically sorted buddy projections
G2 - Determine methods to troubleshoot spread
43. B3 - Describe ORDER BY importance in projection design
● Specifies the columns to sort the projection on.
● You cannot specify an ascending or descending clause.
● HP Vertica always uses an ascending sort order in physical storage.
● If you do not specify the ORDER BY table-column parameter, HP Vertica uses the order in which columns are specified as the sort order for the projection.
● This is one of the ways projections can be optimized.
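A sketch showing how the sort order is declared, reusing the hypothetical sales table from earlier: the ORDER BY in the projection definition fixes the physical sort.

    -- Queries filtering or grouping on store_id benefit from this sort order.
    CREATE PROJECTION sales_by_store AS
    SELECT store_id, sale_id, amount FROM sales
    ORDER BY store_id, amount
    SEGMENTED BY HASH(sale_id) ALL NODES;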
44. B3 - Describe ORDER BY importance in projection design
45. Identifying characteristics of data file directory
Disk Space Requirements for HP Vertica
In addition to the actual data stored in the database, HP Vertica requires disk space for several data reorganization operations, such as mergeout and managing nodes in the cluster. For best results, HP recommends that disk utilization per node be no more than sixty percent (60%) for a K-Safe=1 database, to allow such operations to proceed.
46. Identifying characteristics of data file directory
In addition, disk space is temporarily required by certain query execution operators, such as hash joins and sorts, when they cannot be completed in memory (RAM). Such operators might be encountered during queries, recovery, refreshing projections, and so on. The amount of disk space needed (known as temp space) depends on the nature of the queries, the amount of data on the node, and the number of concurrent users on the system. By default, any unused disk space on the data disk can be used as temp space; however, HP recommends provisioning temp space separate from data disk space. See Configuring Disk Usage to Optimize Performance.
Prepare the Logical Schema Script
Designing a logical schema for an HP Vertica database is no different from designing one for any other SQL database. Details are described more fully in Designing a Logical Schema. To create your logical schema, prepare a SQL script (a plain text file, typically with an extension of .sql) that creates your schemas, tables, and constraints.
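A minimal sketch of such a script (schema, table, and column names are hypothetical):

    -- logical_schema.sql
    CREATE SCHEMA vmart;

    CREATE TABLE vmart.stock_dimension (
        stock_key INT NOT NULL PRIMARY KEY,
        symbol    VARCHAR(10),
        name      VARCHAR(100)
    );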
47. Identifying characteristics of data file directory
Prepare Data Files
Prepare two sets of data files:
● Test data files. Use test files to test the database after the partial data load. If possible, use part of the actual data files to prepare the test data files.
● Actual data files. Once the database has been tested and optimized, use your data files for your initial bulk load (see Bulk Loading Data).
How to Name Data Files
Name each data file to match the corresponding table in the logical schema. Case does not matter. Use the extension .tbl or whatever you prefer. For example, if a table is named Stock_Dimension, name the corresponding data file stock_dimension.tbl. When using multiple data files, append _nnn (where nnn is a positive integer in the range 001 to 999) to the file name. For example, stock_dimension.tbl_001, stock_dimension.tbl_002, and so on.
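Those files then load with a single COPY statement; a sketch, assuming pipe-delimited data:

    COPY Stock_Dimension
    FROM '/data/stock_dimension.tbl_001', '/data/stock_dimension.tbl_002'
    DELIMITER '|' DIRECT;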
49. Documentation
Core:
● HP Vertica Architecture White Paper (Key Features)
● HP Vertica 7.1 complete
● HP_Vertica_7.1.x_administrators Guide
● HP Vertica's Certification Topic List
● Braindumps
● Built-in Pools
● HP2-N36 Exam Prep Guide
● Vertica Client 7.1.1.032 32-bit
● VNC Portable
● DBeaver
● PuTTY Direct Download
● Host: verticaserver.cloudapp.net Port: 22 User: dbadmin Pass: admin
● To access Vertica > VMart: run the command "/opt/vertica/bin/admintools"
● Tableau (client for data extraction).
● JDBC Driver
50. Documentation pt2
The following files are located inside the install disc:
HP_Vertica_7.1.x_ Administrators Guide
HP_Vertica_7.1.x_ Analyzing Data
HP_Vertica_7.1.x_ Best Practices for OEM Customers
HP_Vertica_7.1.x_ Concepts Guide
HP_Vertica_7.1.x_ Connecting To HP Vertica
HP_Vertica_7.1.x_ Cpp_SDK_API
HP_Vertica_7.1.x_ Distributed_R
HP_Vertica_7.1.x_ Error Messages
HP_Vertica_7.1.x_ Extending HP Vertica
HP_Vertica_7.1.x_ Flex_tables
HP_Vertica_7.1.x_ Flex Canonical CEF Parser
HP_Vertica_7.1.x_ Flextables Quickstart
HP_Vertica_7.1.x_ Getting Started
HP_Vertica_7.1.x_ HP Vertica For SQL On Hadoop
HP_Vertica_7.1.x_ Informatica_plug-in_Guide
HP_Vertica_7.1.x_ Install_Guide
HP_Vertica_7.1.x_ Integrating Apache Hadoop
51. Documentation pt3
The following files are located inside the install disc:
HP_Vertica_7.1.x_ Java_SDK_API
HP_Vertica_7.1.x_ MS_Connectivity_Pack
HP_Vertica_7.1.x_ New_Features
HP_Vertica_7.1.x_ Place
HP_Vertica_7.1.x_ Pulse
HP_Vertica_7.1.x_ SQL_Reference_Manual
HP_Vertica_7.1.x_ Supported_Platforms
HP_Vertica_7.1.x_ Third_Party