The document discusses building a data warehouse in SQL Server. It provides an agenda that covers topics like an overview of data warehousing, data warehouse design, dimension and fact tables, and physical design. It also discusses components of a data warehousing solution like the data warehouse database, ETL processes, and security considerations.
This document discusses key aspects of business intelligence architecture. It covers topics like data modeling, data integration, data warehousing, sizing methodologies, data flows, and new BI architecture trends. Specifically, it provides information on:
- Data modeling approaches including OLTP and OLAP models with star schemas and dimension tables.
- ETL processes like extraction, transformation, and loading of data.
- Types of data warehousing solutions including appliances and SQL databases.
- Methodologies for sizing different components like databases, servers, users.
- Diagrams of data flows from source systems into staging, data warehouse and marts.
- New BI architecture designs that integrate compute and storage.
Discover the origins of big data, discuss existing and new projects, share common use cases for those projects, and explain how you can modernize your architecture using data analytics, data operations, data engineering and data science.
Big Data Fundamentals is your prerequisite to building a modern platform for machine learning and analytics optimized for the cloud.
We’ll close out with a live Q&A with some of our technical experts as well.
Stretch your brain with a packed agenda:
Open source software
Data storage
Data ingestion
Data analytics
Data engineering
IoT and life after Lambda architectures
Data science
Cybersecurity
Cluster management
Big data in the cloud
Success stories
Data Warehouse Tutorial For Beginners | Data Warehouse Concepts | Data Wareho... – Edureka!
This Data Warehouse Tutorial For Beginners will give you an introduction to data warehousing and business intelligence. You will be able to understand basic data warehouse concepts with examples. The following topics have been covered in this tutorial:
1. What Is The Need For BI?
2. What Is Data Warehousing?
3. Key Terminologies Related To Data Warehouse Architecture:
a. OLTP Vs OLAP
b. ETL
c. Data Mart
d. Metadata
4. Data Warehouse Architecture
5. Demo: Creating A Data Warehouse
Data Catalogs Are the Answer – What is the Question? – DATAVERSITY
Organizations with governed metadata made available through their data catalog can answer questions their people have about the organization’s data. These organizations get more value from their data, protect their data better, gain improved ROI from data-centric projects and programs, and have more confidence in their most strategic data.
Join Bob Seiner for this lively webinar where he will talk about the value of a data catalog and how to build the use of the catalog into your stewards’ daily routines. Bob will share how the tool must be positioned for success and viewed as a must-have resource that is a steppingstone and catalyst to governed data across the organization.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
Data Architecture Best Practices for Advanced Analytics – DATAVERSITY
Many organizations are still immature when it comes to data and analytics use. The remedy lies in delivering a greater level of insight from data, straight to the point of need.
There are so many Data Architecture best practices today, accumulated from years of practice. In this webinar, William will look at some Data Architecture best practices that he believes have emerged in the past two years and are not yet worked into many enterprise data programs. These practices are keepers that organizations will need to adopt by one means or another, so it is best to work them into the environment mindfully.
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes... – Dr. Arif Wider
A talk presented by Max Schultze from Zalando and Arif Wider from ThoughtWorks at NDC Oslo 2020.
Abstract:
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
At Zalando, Europe's biggest online fashion retailer, we realised that accessibility and availability at scale can only be guaranteed when moving more responsibilities to those who pick up the data and have the respective domain knowledge – the data owners – while keeping only data governance and metadata information central. Such a decentralized and domain-focused approach has recently been coined a Data Mesh.
The Data Mesh paradigm promotes the concept of Data Products which go beyond sharing of files and towards guarantees of quality and acknowledgement of data ownership.
This talk will take you on a journey of how we went from a centralized Data Lake to embrace a distributed Data Mesh architecture and will outline the ongoing efforts to make creation of data products as simple as applying a template.
Building a Data Strategy Your C-Suite Will Support – Reid Colson
Being a data leader in any industry is an advantage that creates measurable financial benefits. Many studies have shown this – I've seen them from Bain, McKinsey, MIT and more. Since most firms are measured on profit, getting good at making data-driven decisions is key to being competitive. You can't get there without a plan. That is where a data strategy comes in.
In speaking with ~300 firms that indicated their organizations were effective in using data and analytics, McKinsey found that construction of a data strategy was the number one contributing factor to their success. Being good at using data to drive decisions creates a meaningful profit advantage, and the leaders consistently pointed to their data strategy as the primary driver of that advantage.
This presentation will cover what a data strategy is, how to construct one, and how to get buy in from your executive team. The author is a former Fortune 500 Chief Data Officer and has held senior data roles at Capital One and Markel.
Here are a few helpful links for your data journey:
Free Data Investment ROI Template:
https://www.udig.com/digging-in/roi-calculator-for-it-projects/
Real world data use cases:
https://www.udig.com/our-work/?category=data
Contact Me:
https://www.udig.com/contact/
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t... – Anastasija Nikiforova
This presentation was delivered as part of the Data Science Seminar titled “When, Why and How? The Importance of Business Intelligence“ organized by the Institute of Computer Science (University of Tartu) in cooperation with Swedbank.
In this presentation I talked about:
*“Data warehouse vs. data lake – what are they and what is the difference between them?” (structured vs unstructured, static vs dynamic (real-time data), schema-on-write vs schema on-read, ETL vs ELT) with further elaboration on What are their goals and purposes? What is their target audience? What are their pros and cons?
*“Is the Data warehouse the only data repository suitable for BI?” – no, (today) data lakes can also be suitable. And even more, both are considered the key to “a single version of the truth”. Although, if descriptive BI is the only purpose, it might still be better to stay with a data warehouse. But if you want to have predictive BI or use your data for ML (or do not yet have a specific idea of how you want to use the data, but want to be able to explore it effectively and efficiently), then a data warehouse might not be the best option.
*“So, the data lake will save my resources a lot, because I do not have to worry about how to store /allocate the data – just put it in one storage and voila?!” – no, in this case your data lake will turn into a data swamp! And you are forgetting about the data quality you should (must!) be thinking of!
*“But how do you prevent the data lake from becoming a data swamp?” – in short and simple terms – proper data governance & metadata management is the answer (but not as easy as it sounds – do not forget about your data engineer and be friendly with him [always… literally always :D]) and also think about the culture in your organization.
*“So, the use of a data warehouse is the key to high quality data?” – no, it is not! Having ETL does not guarantee the quality of your data (transform & load is not data quality management). Think about data quality regardless of the repository!
*“Are data warehouses and data lakes the only options to consider, or are we missing something?” – we are: the data lakehouse!
*“If a data lakehouse is a combination of benefits of a data warehouse and data lake, is it a silver bullet?” – no, it is not! This is another option (relatively immature) to consider that may be the best fit for you, but not a panacea. Dealing with data is not easy (still)…
In addition, in this talk I also briefly introduced the ongoing research into the integration of the data lake as a data repository with data wrangling, aimed at increased data quality in information systems. In short, this is somewhat like an improved data lakehouse, where we emphasize the need for data governance and data wrangling to be integrated to really get the benefits that data lakehouses promise (although we still call it a data lake, since the data lakehouse is not yet a sufficiently mature concept and has differing definitions).
Data Lakehouse, Data Mesh, and Data Fabric (r1) – James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Data Mesh in Azure using Cloud Scale Analytics (WAF) – Nathan Bijnens
This document discusses moving from a centralized data architecture to a distributed data mesh architecture. It describes how a data mesh shifts data management responsibilities to individual business domains, with each domain acting as both a provider and consumer of data products. Key aspects of the data mesh approach discussed include domain-driven design, domain zones to organize domains, treating data as products, and using this approach to enable analytics at enterprise scale on platforms like Azure.
Business Intelligence & Data Analytics – An Architected Approach – DATAVERSITY
Business intelligence (BI) and data analytics are increasing in popularity as more organizations are looking to become more data-driven. Many tools have powerful visualization techniques that can create dynamic displays of critical information. To ensure that the data displayed on these visualizations is accurate and timely, a strong Data Architecture is needed. Join this webinar to understand how to create a robust Data Architecture for BI and data analytics that takes both business and technology needs into consideration.
The data services marketplace is enabled by a data abstraction layer that supports rapid development of operational applications and single-data-view portals. In this presentation you will learn about the services-based reference architecture and the modality and latency of data access.
- Reference architecture for enterprise data services marketplace
- Modality and latency of data access
- Customer use cases and demo
This presentation is part of the Denodo Educational Seminar, and you can watch the video here: goo.gl/vycYmZ.
The document discusses modern data architectures. It presents conceptual models for data ingestion, storage, processing, and insights/actions. It compares traditional vs modern architectures. The modern architecture uses a data lake for storage and allows for on-demand analysis. It provides an example of how this could be implemented on Microsoft Azure using services like Azure Data Lake Storage, Azure Databricks, and Azure SQL Data Warehouse. It also outlines common data management functions such as data governance, architecture, development, operations, and security.
The Data Driven University - Automating Data Governance and Stewardship in Au... – Pieter De Leenheer
The document discusses implementing data governance and stewardship programs at universities. It provides examples of programs at Stanford University, George Washington University, and in the Flanders region of Belgium. The key aspects covered are:
- Establishing a data governance framework with roles, processes, asset definitions, and an oversight council.
- Implementing data stewardship activities like data quality management, metadata development, and reference data management.
- Stanford's program established foundations for institutional research through data quality and context definitions.
- George Washington runs a centralized program managed by the IT governance office.
- The Flanders program provides research information and services across universities through consistent definitions, roles and collaborative workflows.
This document provides an overview of a data catalog called Amundsen that was created to improve the productivity of data users. Amundsen indexes data resources and powers search based on usage patterns to help users discover, understand, and analyze data. It aims to reduce the time data scientists spend on data discovery, which can consume up to a third of their time, in order to increase their productivity. The tool provides search of metadata from various data sources and displays table details, column metadata stats, and people profiles to help users find and understand corporate data.
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each.
Differentiate Big Data vs Data Warehouse use cases for a cloud solution – James Serra
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
All about Big Data components and the best tools to ingest, process, store and visualize the data.
This is a keynote from the series "by Developer for Developers" powered by eSolutionsGrup.
Building a Data Strategy – Practical Steps for Aligning with Business Goals – DATAVERSITY
Developing a Data Strategy for your organization can seem like a daunting task – but it’s worth the effort. Getting your Data Strategy right can provide significant value, as data drives many of the key initiatives in today’s marketplace – from digital transformation, to marketing, to customer centricity, to population health, and more. This webinar will help demystify Data Strategy and its relationship to Data Architecture and will provide concrete, practical ways to get started.
Microsoft SQL Server Data Warehouses for SQL Server DBAs – Mark Kromer
The document discusses Microsoft SQL Server data warehousing solutions. It provides an agenda for a presentation that includes an overview of Microsoft's data warehousing offerings, how to establish baseline metrics for Fast Track reference configurations, and how to design balanced server and storage configurations for data warehousing workloads. It also discusses software and hardware best practices, such as data striping and storage configuration recommendations. Overall, the document outlines topics and solutions to help customers accelerate their data warehouse deployments using Microsoft SQL Server.
Building an Effective Data Warehouse Architecture – James Serra
Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.
The document provides information about what a data warehouse is and why it is important. A data warehouse is a relational database designed for querying and analysis that contains historical data from transaction systems and other sources. It allows organizations to access, analyze, and report on integrated information to support business processes and decisions.
Best Practices – Extreme Performance with Data Warehousing on Oracle Databa... – Edgar Alejandro Villegas
The document discusses best practices for data warehousing performance on Oracle Database. It covers Oracle Exadata Database Machine capabilities like intelligent storage, hybrid columnar compression, and smart flash cache. It also discusses partitioning, parallelism, monitoring tools, and data loading techniques to maximize warehouse performance.
This document provides information about a webinar on SQL Server 2016 Stretch Database presented by Antonios Chatzipavlis. The webinar covers an introduction to Stretch Database, its limitations and pricing, backup and restore of Stretch databases, and frequently asked questions. Antonios Chatzipavlis has over 30 years of experience working with computers and SQL Server. He is a Microsoft Certified Trainer and SQL Server Evangelist who runs the SQL School Greece training organization.
Department Row Level Security Customization For People Soft General Ledger.Ppt – wonga6
The document summarizes the implementation of department row-level security customization within the General Ledger module at the University of Calgary to comply with privacy laws. The customization restricts access to ledger and journal line records by department for each user. It was implemented using custom tables to associate users with departments, modified PeopleCode, and query security. The customization was later expanded to include new security roles and the ability to grant access without a department. The customization was needed due to privacy laws, standardized chart of accounts, and cultural factors around budget oversight and single-person departments.
Antonios Chatzipavlis is a database architect and SQL Server expert with over 30 years of experience working with SQL Server. The document provides tips for installing and configuring SQL Server correctly, including selecting the appropriate server hardware, installing Windows, configuring disks and storage, installing and configuring SQL Server, and creating user databases. The goal is to optimize performance and reliability based on best practices.
This document provides an overview of auditing data access in SQL Server. It discusses various methods for auditing such as using common criteria, SQL Trace, DML triggers, temporal tables, and implementing SQL Server Audit. SQL Server Audit is described as the primary auditing tool in SQL Server that can track both server and database level events. Considerations for implementing and managing SQL Server Audit are also covered.
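As an illustrative sketch of that workflow (the database, schema, and file path below are placeholders, not taken from the presentation), a server audit plus a database audit specification might look like this:

```sql
-- Create a server audit that writes to a file target, then enable it.
USE master;
GO
CREATE SERVER AUDIT Audit_DataAccess
    TO FILE (FILEPATH = N'D:\SQLAudit\')          -- assumed path
    WITH (ON_FAILURE = CONTINUE);
ALTER SERVER AUDIT Audit_DataAccess WITH (STATE = ON);
GO

-- In a hypothetical user database, track SELECT activity on one schema.
USE SalesDB;
GO
CREATE DATABASE AUDIT SPECIFICATION Audit_SalesReads
    FOR SERVER AUDIT Audit_DataAccess
    ADD (SELECT ON SCHEMA::Sales BY public)
    WITH (STATE = ON);
```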
Row Level Security (RLS) enables implementation of row-level access restrictions in SQL Server. RLS uses predicate functions to define the security logic and filters rows for queries based on that logic. Security predicates bind the predicate functions to tables and are defined as filter predicates to silently filter rows or block predicates to prevent write operations. Best practices include keeping the security logic simple and on a separate schema for maintenance. RLS has some limitations, including incompatibility with FILESTREAM and PolyBase.
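A minimal sketch of that pattern, assuming a hypothetical dbo.Orders table with a TenantId column and a tenant id placed in SESSION_CONTEXT by the application:

```sql
-- Predicate function on its own schema (a stated best practice above).
CREATE SCHEMA Security;
GO
CREATE FUNCTION Security.fn_TenantPredicate (@TenantId INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS fn_result
    WHERE @TenantId = CAST(SESSION_CONTEXT(N'TenantId') AS INT);
GO
-- Bind the predicate to the table: filter reads silently, block non-matching inserts.
CREATE SECURITY POLICY Security.TenantPolicy
    ADD FILTER PREDICATE Security.fn_TenantPredicate(TenantId) ON dbo.Orders,
    ADD BLOCK  PREDICATE Security.fn_TenantPredicate(TenantId) ON dbo.Orders AFTER INSERT
    WITH (STATE = ON);
```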
Implementing Mobile Reports in SQL Server 2016 Reporting Services – Antonios Chatzipavlis
The document provides an overview of implementing mobile reports in SQL Server 2016 Reporting Services. It discusses preparing data for mobile reports, using the SQL Server Mobile Report Publisher tool, and publishing mobile reports. The presenter has extensive experience with SQL Server and provides their qualifications. The presentation also provides information on optimizing reports, formatting time data, using filters and Excel files in reports, and designing reports using navigators and visualizations in the Mobile Report Publisher tool. It demonstrates the tool's interface and capabilities.
The document discusses SQL Server monitoring and troubleshooting. It provides an overview of SQL Server monitoring, including why it is important and common monitoring tools. It also describes the SQL Server threading model, including threads, schedulers, states, the waiter list, and runnable queue. Methods for using wait statistics like the DMVs sys.dm_os_waiting_tasks and sys.dm_os_wait_stats are presented. Extended Events are introduced as an alternative to SQL Trace. The importance of establishing a performance baseline is also noted.
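As an example of the wait-statistics approach, a common starting query over sys.dm_os_wait_stats (the excluded wait types below are a small, illustrative subset of the usual benign waits):

```sql
-- Aggregate waits since the last restart (or last clear), highest total wait time first.
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms,
       wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'SQLTRACE_BUFFER_FLUSH', N'XE_TIMER_EVENT',
                        N'BROKER_TO_FLUSH', N'CLR_AUTO_EVENT')
ORDER BY wait_time_ms DESC;
```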
Leezet Llorance completed a course on developing SQL data models at New Horizons Computer Learning Center on August 24, 2016. The course, numbered 20768, was presented by New Horizons and instructed by Martin Wuesthoff. A certificate of completion was awarded for successfully finishing the SQL data modeling course.
TIQ Solutions - QlikView Data Integration in a Java World – Vizlib Ltd.
The document discusses TIQ Solutions' QlikView data integration landscape. It includes a JDBC connector that connects QlikView with various JDBC data sources like Hadoop, SAP HANA, and Neo4j. It also includes a JSON proxy server that connects QlikView with RESTful APIs and JSON data sources. Additionally, it mentions a QVX and QVD converter that enables creating and converting QlikView files for use in other Java applications and frameworks.
KPerry - 20463 Implementing a Data Warehouse with Microsoft® SQL Server (2) – Kwame M. Perry
The document is a certificate of completion for Kwame M. Perry who completed a course on Implementing a Data Warehouse with Microsoft SQL Server on June 19, 2015. The course was presented by New Horizons Computer Learning Centers and the certificate was signed by the Director of Training, Keith Glass.
QlikView stores data in documents using a two-level organization:
1) Distinct lists of values for each field
2) Tables with pointers corresponding to the list values
The order of values in the lists reflects the load order of data.
Memory usage is important because documents and data will grow over time, leaving less RAM available and degrading performance once thresholds are breached. Proper structure and load order can provide significant memory savings.
Dynamic data masking is a data protection feature in SQL Server 2016 that masks sensitive data in query results without altering the actual data. It can help protect private information by exposing only obfuscated data to unauthorized users. Administrators can configure masking rules for specific columns using various masking functions like default, email, random, or custom string masking. The underlying data remains intact but masked data is returned for users without unmask permissions. It provides data security with minimal performance impact by masking results on-the-fly.
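A small sketch of how masking rules are attached to columns, using a hypothetical Customers table, a hypothetical ReportingUser principal, and the built-in masking functions mentioned above:

```sql
CREATE TABLE dbo.Customers
(
    CustomerId  INT IDENTITY(1, 1) PRIMARY KEY,
    FullName    NVARCHAR(100) MASKED WITH (FUNCTION = 'partial(1, "XXXX", 1)'),
    Email       NVARCHAR(128) MASKED WITH (FUNCTION = 'email()'),
    CreditScore INT           MASKED WITH (FUNCTION = 'random(300, 850)'),
    Phone       VARCHAR(20)   MASKED WITH (FUNCTION = 'default()')
);

-- Users query the table normally; without UNMASK they see obfuscated values.
GRANT SELECT ON dbo.Customers TO ReportingUser;   -- hypothetical database user
-- GRANT UNMASK TO ReportingUser;                 -- would reveal the real data
```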
The document discusses different methods for loading data into QlikView, including loading from files, inline loading of defined data, resident loading from existing tables, incremental loading of new records, binary loading between QlikView files, add loading to append data to tables, and buffer loading to automatically create QVD files. Examples are provided for each loading method to illustrate how they are used within QlikView scripts. The various loading techniques allow for loading data from different sources, transforming data, handling incremental changes, reusing existing data models, and buffering data for future use.
A data warehouse is a collection of integrated data from multiple sources organized to support management decision making. It contains subject-oriented, integrated, time-variant and non-volatile data stored in a way that is optimized for query and analysis. There are different types of data warehouses including data marts, operational data stores and enterprise data warehouses. Key components of a data warehouse include data sources, extraction, loading, a comprehensive database, metadata and middleware tools.
A data warehouse is a collection of data integrated from multiple sources to support decision making. It contains subject-oriented, integrated, time-variant, and non-volatile data stored in a way that makes it readily available for analysis. Data marts can be dependent on the warehouse or independent subsets designed for specific departments. Successful implementation requires identifying data sources and governance, planning data quality and modeling, selecting ETL and database tools, and supporting end users. Key challenges include unrealistic expectations, technical issues, and ensuring ongoing value.
Data Lakehouse, Data Mesh, and Data Fabric (r2) – James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Choosing the Right Business Intelligence Tools for Your Data and Architectura... – Victor Holman
This document discusses various business intelligence tools for data analysis including ETL, OLAP, reporting, and metadata tools. It provides evaluation criteria for selecting tools, such as considering budget, requirements, and technical skills. Popular tools are identified for each category, including Informatica, Cognos, and Oracle Warehouse Builder. Implementation requires determining sources, data volume, and transformations for ETL as well as performance needs and customization for OLAP and reporting.
A data warehouse is a pool of data structured to support decision making. It integrates data from multiple sources and is time-variant and nonvolatile. Data warehouses can take the form of enterprise data warehouses, used across an organization for decision support, or data marts designed for a specific department. The data warehousing process involves extracting data from sources, transforming and loading it into a comprehensive database, and using middleware tools and metadata. Real-time data warehousing allows for information-based decision making using up-to-date data.
- Data warehousing aims to help knowledge workers make better decisions by integrating data from multiple sources and providing historical and aggregated data views. It separates analytical processing from operational processing for improved performance.
- A data warehouse contains subject-oriented, integrated, time-variant, and non-volatile data to support analysis. It is maintained separately from operational databases. Common schemas include star schemas and snowflake schemas.
- Online analytical processing (OLAP) supports ad-hoc querying of data warehouses for analysis. It uses multidimensional views of aggregated measures and dimensions. Relational and multidimensional OLAP are common architectures. Measures are metrics like sales, and dimensions provide context like products and time periods.
The Shifting Landscape of Data Integration – DATAVERSITY
This document discusses the shifting landscape of data integration. It begins with an introduction by William McKnight, who is described as the "#1 Global Influencer in Data Warehousing". The document then discusses how challenges in data integration are shifting from dealing with volume, velocity and variety to dealing with dynamic, distributed and diverse data in the cloud. It also discusses IDC's view that this shift is occurring from the traditional 3Vs to the 3Ds. The rest of the document discusses Matillion, a vendor that provides a modern solution for cloud data integration challenges.
Data warehousing is an architectural model that gathers data from various sources into a single unified data model for analysis purposes. It consists of extracting data from operational systems, transforming it, and loading it into a database optimized for querying and analysis. This allows organizations to integrate data from different sources, provide historical views of data, and perform flexible analysis without impacting transaction systems. While implementation and maintenance of a data warehouse requires significant costs, the benefits include a single access point for all organizational data and optimized systems for analysis and decision making.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture – DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Build the data lake, but avoid ending up with a data swamp! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
The document discusses optimizing a data warehouse by offloading some workloads and data to Hadoop. It identifies common challenges with data warehouses like slow transformations and queries. Hadoop can help by handling large-scale data processing, analytics, and long-term storage more cost effectively. The document provides examples of how customers benefited from offloading workloads to Hadoop. It then outlines a process for assessing an organization's data warehouse ecosystem, prioritizing workloads for migration, and developing an optimization plan.
This document provides tips for optimizing performance in Power BI by focusing on different areas like data sources, the data model, visuals, dashboards, and using trace and log files. Some key recommendations include filtering data early, keeping the data model and queries simple, limiting visual complexity, monitoring resource usage, and leveraging log files to identify specific waits and bottlenecks. An overall approach of focusing on time-based optimization by identifying and addressing the areas contributing most to latency is advocated.
The document provides information about data warehousing including definitions, how it works, types of data warehouses, components, architecture, and the ETL process. Some key points:
- A data warehouse is a system for collecting and managing data from multiple sources to support analysis and decision-making. It contains historical, integrated data organized around important subjects.
- Data flows into a data warehouse from transaction systems and databases. It is processed, transformed, and loaded so users can access it through BI tools. This allows organizations to analyze customers and data more holistically.
- The main components of a data warehouse are the load manager, warehouse manager, query manager, and end-user access tools. The ETL process extracts data from source systems, transforms it, and loads it into the warehouse.
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha... – DATAVERSITY
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We’ll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it’s a bad fit.
Drop the herd mentality. In reality, there is no “one size fits all” right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
This document provides an overview of data warehousing. It defines a data warehouse as a subject-oriented, integrated collection of data used to support management decision making. The benefits of data warehousing include high returns on investment and increased productivity. A data warehouse differs from an OLTP system in its design for analytics rather than transactions. The typical architecture includes data sources, an operational data store, warehouse manager, query manager and end user tools. Key components are extracting, cleaning, transforming and loading data, and managing metadata. Data flows include inflows from sources and upflows of summarized data to users.
The document discusses Microsoft's approach to implementing a data mesh architecture using their Azure Data Fabric. It describes how the Fabric can provide a unified foundation for data governance, security, and compliance while also enabling business units to independently manage their own domain-specific data products and analytics using automated data services. The Fabric aims to overcome issues with centralized data architectures by empowering lines of business and reducing dependencies on central teams. It also discusses how domains, workspaces, and "shortcuts" can help virtualize and share data across business units and data platforms while maintaining appropriate access controls and governance.
Various Applications of Data Warehouse.ppt – RafiulHasan19
The document discusses various applications of data warehousing. It begins by describing problems with traditional transactional systems and how data warehouses address these issues. It then defines key components of a data warehouse including the extraction, transformation, and loading of data from various sources. The document outlines how online analytical processing (OLAP) tools, metadata repositories, and data mining techniques analyze and explore the collected data. Finally, it weighs the benefits of a data warehouse against the costs of implementation and maintenance.
This document provides an overview of data warehousing and related concepts. It defines a data warehouse as a centralized database for analysis and reporting that stores current and historical data from multiple sources. The document describes key elements of data warehousing including Extract-Transform-Load (ETL) processes, multidimensional data models, online analytical processing (OLAP), and data marts. It also outlines advantages such as enhanced access and consistency, and disadvantages like time required for data extraction and loading.
- SolidQ is a data management consulting company founded in 2007 in Italy that has over 1000 customers and 200 consultants worldwide. They specialize in Microsoft data platform solutions.
- Davide Mauri is a Microsoft SQL Server MVP and president of an Italian SQL Server user group with 18 years of experience in data architecture, design, performance, and BI. He is a mentor at SolidQ.
- SolidQ provides data analysis services including SQL Server Analysis Services multidimensional and tabular solutions, Power Pivot, and data mining features in both multidimensional and tabular models using algorithms like clustering, decision trees, and regression.
Watch the companion webinar at: http://embt.co/1FTVdGF
Every year the State of Texas CIO releases the five-year State Strategic Plan with IT initiatives for government organizations to implement. How many of the items from the November 2014 plan update have you planned for or put in place? If you need help aligning with these state objectives, join this session to learn how ER/Studio can enhance your data architecture to meet these goals.
In this age of data policies and protection, Texas State agencies are required to develop controls to ensure confidentiality, integrity and availability of their data. In this webinar, we’ll show a live demonstration of ER/Studio and describe how it addresses key areas of the strategic objectives, including:
+ Data security and privacy classifications
+ Data quality and availability requirements
+ Enterprise planning and collaboration within and across organizations
Similar to Building Data Warehouse in SQL Server (20)
This document provides an overview of using PolyBase for data virtualization in SQL Server. It discusses installing and configuring PolyBase, connecting external data sources like Azure Blob Storage and SQL Server, using PolyBase DMVs for monitoring and troubleshooting, and techniques for optimizing performance like predicate pushdown and creating statistics on external tables. The presentation aims to explain how PolyBase can be leveraged to virtually access and query external data using T-SQL without needing to know the physical data locations or move the data.
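A hedged sketch of the pattern described (SQL Server 2019-style data virtualization over another SQL Server; the server, credential, and table names are illustrative, and a database master key is assumed to exist):

```sql
-- Credential and external data source pointing at a remote SQL Server.
CREATE DATABASE SCOPED CREDENTIAL RemoteSqlCred
    WITH IDENTITY = 'remote_login', SECRET = '********';

CREATE EXTERNAL DATA SOURCE RemoteSql
    WITH (LOCATION = 'sqlserver://remotehost:1433', CREDENTIAL = RemoteSqlCred);

-- External table: queried with plain T-SQL, no data movement required.
CREATE EXTERNAL TABLE dbo.RemoteOrders
(
    OrderId INT NOT NULL,
    Amount  DECIMAL(18, 2) NULL
)
WITH (LOCATION = 'SourceDB.dbo.Orders', DATA_SOURCE = RemoteSql);

-- Predicates can be pushed down to the remote source where possible.
SELECT OrderId, Amount FROM dbo.RemoteOrders WHERE Amount > 100;
```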
Antonios Chatzipavlis presented on SQL Server backup and restore. The presentation covered database architecture basics including data files, transaction log files, and the buffer cache. It also discussed backup types like full, differential, transaction log, copy only and partial backups. Backup strategies and restore processes were explained, including restoring to a point in time and restoring system databases. The internals of how SQL Server performs backups using buffers and I/O threads was also summarized.
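A compact sketch of the full/log backup and point-in-time restore flow discussed (the database name, file paths, and STOPAT timestamp are placeholders):

```sql
-- Full and transaction log backups.
BACKUP DATABASE SalesDB
    TO DISK = N'D:\Backup\SalesDB_full.bak'
    WITH COMPRESSION, CHECKSUM;
BACKUP LOG SalesDB
    TO DISK = N'D:\Backup\SalesDB_log.trn';

-- Point-in-time restore: full backup first (no recovery), then the log up to a timestamp.
RESTORE DATABASE SalesDB
    FROM DISK = N'D:\Backup\SalesDB_full.bak'
    WITH NORECOVERY, REPLACE;
RESTORE LOG SalesDB
    FROM DISK = N'D:\Backup\SalesDB_log.trn'
    WITH STOPAT = '2015-06-01T10:30:00', RECOVERY;
```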
Antonios Chatzipavlis presented on migrating SQL workloads to Azure. He discussed modernizing data platforms by discovering, assessing, planning, transforming, optimizing, testing and remediating. Key migration considerations include remaining, rehosting, refactoring, rearchitecting, rebuilding or replacing workloads. Tools for migrating data include Microsoft Assessment and Planning Toolkit, Data Migration Assistant, Database Experimentation Assistant, SQL Server Migration Assistant, and Azure Database Migration Service. Workloads can be migrated to Azure VMs, Azure SQL Databases or Azure SQL Managed Instances.
This document summarizes a webinar presentation about workload management in SQL Server 2019. It discusses how SQL Server's Resource Governor feature can be used to provide multitenancy, predictable performance, and isolation for multiple workloads running on a single SQL Server instance. Key concepts covered include resource pools, workload groups, and classification functions to assign sessions to different pools and groups. The presentation also reviews best practices for using lookup tables in classification functions and shows some DMVs for monitoring Resource Governor configuration and statistics.
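To make the resource pool / workload group / classifier relationship concrete, here is a minimal sketch (the login name and limits are illustrative; the classifier function must live in master):

```sql
USE master;
GO
CREATE RESOURCE POOL ReportPool
    WITH (MAX_CPU_PERCENT = 40, MAX_MEMORY_PERCENT = 40);
CREATE WORKLOAD GROUP ReportGroup
    WITH (MAX_DOP = 4)
    USING ReportPool;
GO
-- Classifier: route a hypothetical reporting login to the reporting group.
CREATE FUNCTION dbo.fn_RG_Classifier()
RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    RETURN CASE WHEN SUSER_SNAME() = N'report_login'
                THEN N'ReportGroup'
                ELSE N'default'
           END;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_RG_Classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```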
This document provides an overview of loading data into Azure SQL DW (Synapse Analytics). It discusses extracting source data into text files, landing the data into Azure Data Lake Store Gen2, preparing the data for loading into staging tables using PolyBase or COPY commands, transforming the data, and inserting it into production tables. It also compares ETL vs ELT approaches and SSIS vs Azure Data Factory for data integration. The presenter then demonstrates loading data in Synapse SQL pool and invites any questions.
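A sketch of the COPY-based load into a staging table (the storage account, container, and table names are placeholders):

```sql
-- Land delimited files from ADLS Gen2 directly into a staging table.
COPY INTO dbo.StageSales
FROM 'https://mystorageaccount.dfs.core.windows.net/landing/sales/*.csv'
WITH (
    FILE_TYPE       = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '0x0A',
    FIRSTROW        = 2,
    CREDENTIAL      = (IDENTITY = 'Managed Identity')
);

-- Next step: transform and publish into production tables, e.g. with CTAS or INSERT...SELECT.
```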
The document provides an overview of the DAX language. It discusses that DAX is the programming language used in Power BI, Power Pivot, and Analysis Services for data modeling, reporting, and analytics. It describes the basic components of a DAX data model including tables, columns, relationships, measures, and hierarchies. It also covers DAX syntax, functions, operators, and how context and filter context work in DAX calculations and queries.
The document introduces Dynamic Management Views (DMVs) and Dynamic Management Functions (DMFs) in SQL Server. It discusses that DMVs and DMFs return server state information and can be used to monitor server health, diagnose problems, and tune performance. It provides examples of common DMVs and DMFs used for query execution and the query plan cache. Finally, it notes that the presentation will demonstrate troubleshooting with DMVs and DMFs.
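One typical plan-cache query in that spirit, combining a DMV and a DMF (the statement-offset arithmetic is the standard pattern for extracting the individual statement text):

```sql
-- Top cached statements by total CPU time.
SELECT TOP (10)
       qs.total_worker_time / 1000 AS total_cpu_ms,
       qs.execution_count,
       SUBSTRING(st.text,
                 (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```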
This document summarizes common T-SQL anti-patterns that can negatively impact query performance, including using SELECT *, functions in predicates, OR operators, implicit conversions, unnecessary sorts, correlated subqueries, and dynamic SQL execution. The presentation provides explanations of why each anti-pattern hurts performance and recommendations for more optimized alternatives such as using indexes, temporary tables, parameterization, and execution plan analysis.
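One illustrative rewrite in the spirit of those recommendations (table and column names are hypothetical): applying a function to a column in the predicate prevents index seeks, so the filter is rewritten as a sargable range:

```sql
-- Anti-pattern: the function on OrderDate forces a scan and hides any index on it.
SELECT OrderId, Amount
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2024;

-- Rewrite: a sargable range predicate lets an index on OrderDate be used for a seek.
SELECT OrderId, Amount
FROM dbo.Orders
WHERE OrderDate >= '20240101'
  AND OrderDate <  '20250101';
```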
This document discusses designing a modern data warehouse in Azure. It provides an overview of traditional vs. self-service data warehouses and their limitations. It also outlines challenges with current data warehouses around timeliness, flexibility, quality and findability. The document then discusses why organizations need a modern data warehouse based on criteria like customer experience, quality assurance and operational efficiency. It covers various approaches to ingesting, storing, preparing and modeling data in Azure. Finally, it discusses architectures like the lambda architecture and common data models.
Modernizing Your Database with SQL Server 2019 discusses SQL Server 2019 features that can help modernize a database, including:
- The Hybrid Buffer Pool which supports persistent memory to improve performance on read-heavy workloads.
- Memory-Optimized TempDB Metadata which stores TempDB metadata in memory-optimized tables to avoid certain blocking issues.
- Intelligent Query Processing features like Adaptive Query Processing, Batch Mode processing on rowstores, and Scalar UDF Inlining which improve query performance.
- Approximate Count Distinct, a new function that provides an estimated count of distinct values in a column faster than a precise count (a sketch follows this list).
- Lightweight profiling, enabled by default, which provides query plan and per-operator runtime statistics with minimal overhead.
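A minimal illustration of the approximate distinct count, assuming a hypothetical dbo.FactSales table with a CustomerKey column:

```sql
-- Compare the exact and approximate distinct counts on a hypothetical fact table.
-- APPROX_COUNT_DISTINCT trades a small error margin for much lower memory use.
SELECT COUNT(DISTINCT CustomerKey)        AS ExactDistinctCustomers,
       APPROX_COUNT_DISTINCT(CustomerKey) AS ApproxDistinctCustomers
FROM dbo.FactSales;
```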
This document discusses designing a modern data warehouse in Azure. It provides an overview of traditional vs. self-service data warehouses and their limitations. It also outlines challenges with current data warehouses around timeliness, flexibility, quality and findability. The document then discusses why organizations need a modern data warehouse based on criteria like customer experience, quality assurance and operational efficiency. It covers various approaches to ingesting, storing, preparing, modeling and serving data on Azure. Finally, it discusses architectures like the lambda architecture and common data models.
The document provides details about an SQL expert's background and certifications. It summarizes the expert's career starting in 1982 working with computers and 1988 starting in the computer industry. In 1996, they started working with SQL Server 6.0 and have since earned multiple Microsoft certifications. The expert now provides training and consultation services, and created an online school called SQL School Greece to teach SQL Server.
Azure SQL Database for the SQL Server DBA - Azure Bootcamp Athens 2018 – Antonios Chatzipavlis
Azure SQL Database is a managed database service hosted in Microsoft's Azure cloud. Some key differences from SQL Server include: the service is paid by the hour based on the selected service tier; users can dynamically scale resources up or down; backups and high availability are managed by the service provider; and common administration tasks are handled by the provider rather than the user. The service offers automatic backups, point-in-time restore, and geo-restore capabilities along with built-in high availability through replication across three copies in the primary region.
The document discusses technologies within the Microsoft SQL family and Azure SQL that can help organizations address requirements of the General Data Protection Regulation (GDPR). It covers features for discovering and classifying personal data, managing access and controlling how data is used, and protecting data through encryption, auditing and other security controls. Built-in technologies like dynamic data masking, row-level security, authentication options, and transparent data encryption are described as ways SQL Server and Azure SQL Database can help organizations comply with GDPR.
The document provides biographical information about Antonios Chatzipavlis, a SQL Server expert and evangelist. It then summarizes his presentation on statistics and index internals in SQL Server, which covers topics like cardinality estimation, inspecting and updating statistics, index structure and types, and identifying missing indexes. The presentation includes demonstrations of analyzing cardinality estimation and picking the right index key.
This document provides an introduction and overview of Azure Data Lake. It describes Azure Data Lake as a single store of all data ranging from raw to processed that can be used for reporting, analytics and machine learning. It discusses key Azure Data Lake components like Data Lake Store, Data Lake Analytics, HDInsight and the U-SQL language. It compares Data Lakes to data warehouses and explains how Azure Data Lake Store, Analytics and U-SQL process and transform data at scale.
This document provides an overview of Azure SQL Data Warehouse. It discusses what Azure SQL Data Warehouse is, how it is provisioned and scaled, best practices for designing tables in Azure SQL DW including distribution keys and data types, and methods for loading and querying data including PolyBase and labeling queries for monitoring. The presentation also covers tuning aspects like statistics, indexing, and resource classes.
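As a hedged illustration of the distribution-key and table-design points (the names and choices below are illustrative, not taken from the deck):

```sql
-- Large fact table: hash-distributed on a high-cardinality join key, columnstore by default.
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    OrderDate   DATE           NOT NULL,
    Amount      DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

-- Small dimension: replicated to every distribution to avoid data movement on joins.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (CustomerKey));
```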
This document provides an introduction and overview of Azure DocumentDB. It discusses how DocumentDB is a fully managed NoSQL database service that provides fast and predictable performance for JSON data through SQL querying capabilities. It also describes how DocumentDB offers features like elastic scaling, high availability, global distribution and ease of development. The document then provides information on starting with DocumentDB, writing queries, and programming capabilities within DocumentDB like stored procedures and triggers.
How UiPath Discovery Suite supports identification of Agentic Process Automat... – DianaGray10
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
Generative AI technology is a fascinating field that focuses on creating comp... – Nohoax Kanont
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore the performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottlenecks, making a fix, and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their applications fast and understanding how things work under the hood.
Redefining Cybersecurity with AI CapabilitiesPriyanka Aash
In this comprehensive overview of Cisco's latest innovations in cybersecurity, the focus is squarely on resilience and adaptation in the face of evolving threats. The discussion covers the imperative of tackling Mal information, the increasing sophistication of insider attacks, and the expanding attack surfaces in a hybrid work environment. Emphasizing a shift towards integrated platforms over fragmented tools, Cisco introduces its Security Cloud, designed to provide end-to-end visibility and robust protection across user interactions, cloud environments, and breaches. AI emerges as a pivotal tool, from enhancing user experiences to predicting and defending against cyber threats. The blog underscores Cisco's commitment to simplifying security stacks while ensuring efficacy and economic feasibility, making a compelling case for their platform approach in safeguarding digital landscapes.
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Demystifying Neural Networks And Building Cybersecurity ApplicationsPriyanka Aash
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
Retrieval Augmented Generation Evaluation with RagasZilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an... – Zilliz
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
It's your unstructured data: How to get your GenAI app to production (and spe... – Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
2. Antonios Chatzipavlis
Database Architect
SQL Server Evangelist
MCT, MCSE, MCITP, MCPD, MCSD, MCDBA, MCSA, MCTS, MCAD, MCP, OCA, ITIL-F
Career timeline (1982, 1988, 1996, 1998, 2010, 2012, 2013):
• I started with computers.
• I started my professional career in the computer industry.
• I started to work with SQL Server version 6.0.
• I earned my first Microsoft certification as a Microsoft Certified Solution Developer (3rd in Greece) and started my career as a Microsoft Certified Trainer (MCT), with more than 20,000 hours of training to date.
• I became a Microsoft MVP on SQL Server for the first time.
• I created SQL School Greece (www.sqlschool.gr).
• I became an MCT Regional Lead in the Microsoft Learning program.
• I was certified as MCSE: Data Platform and MCSE: Business Intelligence.
3. Follow us on social media
Twitter @antoniosch / @sqlschool
Facebook fb/sqlschoolgr
YouTube yt/user/achatzipavlis
LinkedIn SQL School Greece group
Pinterest pi/SQLschool/
5. Stay Involved!
• Sign up for a free membership today at sqlpass.org
• Linked In: http://www.sqlpass.org/linkedin
• Facebook: http://www.sqlpass.org/facebook
• Twitter: @SQLPASS
• PASS: http://www.sqlpass.org
6. Whatever your data passion – there’s a Virtual Chapter for you!
www.sqlpass.org/vc
7. Planning on attending PASS Summit 2015? Start saving today!
• The world’s largest gathering of SQL Server & BI professionals
• Take your SQL Server skills to the next level by learning from the world’s top SQL Server experts, in over 190 technical sessions
• Over 5000 registrations, representing 2000 companies, from 52 countries, ready to network & learn
$1795 until July 12th, 2015 – save $150 right now using discount code LC15CPJ8
8. Don’t miss your chance to vote in the 2015 PASS
elections-update your myPASS profile by June 1!
In order to vote for the 2015 PASS Nomination Committee & the Board of Directors, you need
to complete all mandatory fields in your myPASS profile by 11:59 PM PDT June 1, 2015.
• PASS members will be reminded to review & complete their profiles
• Members will receive instructions for updating profiles and deleting duplicate profiles
• Eligible voters will receive information about key election dates and the voting process after June 1
Head to sqlpass.org/myPASS today!
For more info on elections, visit sqlpass.org/elections
9. • Overview of Data Warehousing
• Data Warehouse Solution
• Data Warehouse Infrastructure
• Data Warehouse Hardware
• Data Warehouse Design Overview
• Designing Dimension Tables
• Designing Fact Tables
• Data Warehouse Physical Design
Agenda
11. • There are many definitions for the term “data warehouse,”
and disagreements over specific implementation details.
• It is generally agreed that a data warehouse is a centralized
store of business data that can be used for reporting and
analysis to inform key decisions.
• A data warehouse provides a solution to the problem of
distributed data that prevents effective business decision-
making.
What is a Data Warehouse?
12. The single organizational repository of enterprise-wide data across many or all lines of business and subject areas.
Contains massive and integrated data.
Represents the complete organizational view of
information needed to run and understand the
business.
Definition of Data Warehouse
13. • Contains a large volume of data that relates to historical
business transactions.
• Is optimized for read operations that support querying the
data.
• Is loaded with new or updated data at regular intervals.
• Provides the basis for enterprise BI applications.
Data Warehouse characteristics
14. • Finding the information required for business decisions is time-consuming and error-prone.
• Key business data is distributed across multiple systems.
• This makes it hard to collate all the information necessary for a particular business decision.
• Fundamental business questions are hard to answer.
• Most business decisions require a knowledge of fundamental facts.
• The distribution of data throughout multiple systems in a typical organization can make these questions difficult, or even impossible, to answer.
What makes a Data Warehouse useful?
15. • A specific, subject-oriented, or departmental view of information from the organization.
• Data marts are generally built to satisfy user requirements for information.
What is a Data Mart?
16. Data Warehouse vs Data Mart
Scope
• Data Warehouse: application independent; centralized or enterprise; planned
• Data Mart: specific application; decentralized by group; organic, but may be planned
Data
• Data Warehouse: historical, detailed, summary; some denormalization
• Data Mart: some history, detailed, summary; high denormalization
Subjects
• Data Warehouse: multiple subjects
• Data Mart: single central subject area
Sources
• Data Warehouse: many internal and external sources
• Data Mart: few internal and external sources
Other
• Data Warehouse: flexible; data oriented; long life; single complex structure
• Data Mart: restrictive; project oriented; short life; multiple simple structures
17. • Centralized Data Warehouse
• Departmental Data Mart
• Hub and Spoke
Data Warehouse Architectures
21. Components of a Data Warehousing Solution
• Data Sources
• ETL
• Data Cleansing
• Master Data Management
• Data Warehouse
• Data Models
• Reporting and Analysis
22. 1. Start by identifying the business questions that the data warehousing
solution must answer
2. Determine the data that is required to answer these questions
3. Identify data sources for the required data
4. Assess the value of each question to key business objectives versus
the feasibility of answering it from the available data
For large enterprise-level projects, an incremental approach can be effective:
• Break the project down into multiple sub-projects
• Each sub-project deals with a particular subject area in the data warehouse
Starting a Data Warehouse Project
23. Core Data Warehousing
• SQL Server Database Engine
• SQL Server Integration Services
• SQL Server Master Data Services
• SQL Server Data Quality Services
Enterprise BI
• SQL Server Analysis Services
• SQL Server Reporting Services
• Microsoft SharePoint Server
• Microsoft Office
Self-Service BI
• Excel Add-ins (PowerPivot, Power Query, Power View, Power Map)
• Microsoft Office 365 Power BI
Big Data Analysis
• Windows Azure HDInsight
SQL Server as a Data Warehousing Platform
25. A data warehouse
is a relational database that
is optimized for reading data
for analysis and reporting.
Keep in mind
26. • Logical:
• Is typically designed to denormalize data into a structure that minimizes the
number of join operations required in the queries used to retrieve and
aggregate data.
• A common approach is to design a star schema
• Physical:
• Affects the performance and manageability of the data warehouse
Logical and Physical Database schema
27. • Query processing requirements, including anticipated peak
memory and CPU utilization.
• Storage volume and disk input/output requirements.
• Network connectivity and bandwidth.
• Component redundancy for high availability.
Hardware selection
28. • Failover time requirements.
• Configuration and management complexity.
• The volume of data in the data warehouse.
• The frequency of changes to data in the data warehouse.
• The effect of the backup process on data warehouse
performance.
• The time to recover the database in the event of a failure.
High availability and Disaster Recovery
29. • The authentication mechanisms that you must support to
provide access to the data warehouse.
• The permissions that the various users who access the data
warehouse will require.
• The connections over which data is accessed.
• The physical security of the database and backup media.
Security
30. • Data Source Connection Types
• Credentials and Permissions
• Data Formats
• Data Acquisition Windows
Data sources
31. • Staging:
• What data must be staged?
• Staging data format
• Required transformations:
• Transformations during extraction versus data flow transformations
• Incremental ETL:
• Identifying data changes for extraction
• Inserting or updating when loading
ETL Processes
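The incremental load described on this slide is often implemented as an insert-or-update (“upsert”) against the target table. A minimal sketch, assuming hypothetical Staging.Product and dbo.DimProduct tables (not code from this deck):

-- Insert new rows and update changed rows in a single statement
MERGE dbo.DimProduct AS tgt
USING Staging.Product AS src
    ON tgt.ProductAltKey = src.ProductAltKey
WHEN MATCHED AND (tgt.ProductName <> src.ProductName OR tgt.Color <> src.Color) THEN
    UPDATE SET tgt.ProductName = src.ProductName,   -- update changed attributes
               tgt.Color       = src.Color          -- (NULL handling omitted for brevity)
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductAltKey, ProductName, Color)      -- insert new products
    VALUES (src.ProductAltKey, src.ProductName, src.Color);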
32. • Data quality:
• Cleansing data:
• Validating data values
• Ensuring data consistency
• Identifying missing values
• Deduplicating data
• Master data management:
• Ensuring consistent business entity definitions across multiple systems
• Applying business rules to ensure data validity
Data Quality and Master Data Management
34. Data volume
• The amount of data that the data warehouse must store
• The size and frequency of incremental loads of new data.
• The primary consideration is the number of rows in fact tables
• But don’t forget dimension data, indexes, and data models
that are stored on disk.
System Sizing Factors
35. Analysis and Reporting Complexity
• This includes the number, complexity, and predictability of the
queries that will be used to analyze the data or produce reports.
• Typically, BI solutions must support a mix of the following query
types:
• Simple. Relatively straightforward SELECT statements.
• Medium. Repeatedly executed queries that include aggregations or many joins.
• Complex. Unpredictable queries with complex aggregations, joins, and
calculations.
System Sizing Factors
36. Number of Users
• This is the total number of information workers who will
access the system, and how many of them will do so
concurrently.
Availability Requirements
• These include when the system will need to be used, and
what planned or unplanned downtime the business can
tolerate.
System Sizing Factors
37. Typical System Categorization
Small
• Data Volume: 100s of GBs to 1 TB
• Analysis and Reporting Complexity: over 50% simple, 30% medium, less than 10% complex
• Number of Users: 100 total, 10 to 20 concurrent
• Availability Requirements: business hours
Medium
• Data Volume: 1 to 10 TB
• Analysis and Reporting Complexity: 50% simple, 30-35% medium, 10-15% complex
• Number of Users: 1,000 total, 100 to 200 concurrent
• Availability Requirements: 1 hour of downtime per night
Large
• Data Volume: 10 TB to 100s of TBs
• Analysis and Reporting Complexity: 30-35% simple, 40% medium, 20-25% complex
• Number of Users: 1,000s of concurrent users
• Availability Requirements: 24/7 operations
38. Data Warehouse Workloads
ETL
• Control flow tasks
• Data query and insert
• Network data transfer
• In-memory data pipeline
• SSIS Catalog or MSDB I/O
Reporting
• Client requests
• Data source queries
• Report rendering
• Caching
• Snapshot execution
• Subscription processing
• Report Server Catalog I/O
Operations and Maintenance
• OS activity
• Logging
• SQL Server Agent jobs
• SSIS packages
• Indexes
• Backups
Cubes
• Processing
• Aggregation storage (multidimensional on disk, tabular in memory)
• Query execution
39. Typical Server Topologies for a BI Solution
A BI solution can range from a single-server architecture (few servers) to a distributed architecture (many servers). As the number of servers grows, hardware costs, software license costs, configuration complexity, scalability and performance, and flexibility all increase.
40. Scaling-out a BI Solution
Data Warehouse
• Partition the data across multiple database servers
• SQL Server Parallel Data Warehouse edition
Analysis Services
• Create a read-only copy of a multidimensional database and connect to it from multiple Analysis Services query servers
Integration Services
• Use multiple SSIS servers to perform a subset of the ETL processes in parallel
Reporting Services
• Install the Reporting Services database on a single database server, then install the Reporting Services report server service on multiple servers that all connect to the same Reporting Services database
41. Planning for High Availability
Plan high availability for each component of the solution (the data warehouse, Analysis Services, Integration Services, and Reporting Services), using options such as:
• AlwaysOn Failover Cluster
• AlwaysOn Availability Group
• RAID storage
• NLB report servers
43. • A DW usually has longer-running queries
• A DW has higher read activity than write activity
• The data in DW is usually more static
• In a DW it is much more important to be able to process a large amount of data quickly than it is to support a high number of I/O operations per second
Keep in mind
44. • Determine initial data volume
• Number of fact table rows x row size
• Use 100 bytes per row as an estimate if unknown
• Add 30-40% for dimensions and indexes
• Project data growth
• Number of new fact rows per month
• Factor in compression
• Typically 3:1
Determining Storage Requirements
Other storage requirements
• Configuration databases
• Log files
• TempDB
• Staging tables
• Backups
• Analysis Services models
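As a purely illustrative example of this estimate (hypothetical numbers): 500 million fact rows at 100 bytes per row is roughly 50 GB; adding 35% for dimensions and indexes gives about 67.5 GB; applying 3:1 compression brings the core data down to roughly 22.5 GB, before allowing for logs, TempDB, staging tables, backups, and Analysis Services models.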
45. • Use a larger number of smaller disks instead of a few larger disks
• Use the fastest disks you can afford
• Consider solid state disks especially for random I/O
• Use RAID 10, or minimally RAID 5
• Consider a dedicated storage area network for manageability
and extensibility
• Balance I/O across enclosures, storage processors, and disk groups
Considerations for Storage Hardware
47. • Determine core MCR
• Apply formula to estimate required number of cores:
Estimating CPU Requirements
CPUs = ( (Average query size in MB ÷ MCR) × Concurrent users ) ÷ Target response time in seconds
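For example, with hypothetical values of a 2,000 MB average query size, an MCR of 200 MB/s per core, 10 concurrent users, and a 20-second target response time, the estimate is ((2,000 ÷ 200) × 10) ÷ 20 = 5 cores.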
48. • This metric measures the maximum SQL Server data processing rate for a
standard query and data set for a specific server and CPU combination.
• This is provided as a per-core rate, and it is measured as a query-based scan
from memory cache.
• MCR is the initial starting point for Fast Track system design.
• It represents an estimated maximum required I/O bandwidth for the server, CPU,
and workload.
• MCR is useful as an initial design guide because it requires only minimal local
storage and database schema to estimate potential throughput for a given CPU.
• It is not a measure of system performance.
Maximum Consumption Rate (MCR)
49. • Create a reference dataset based on the TPC-H line item table or similar data set.
• The table should be of a size that it can be entirely cached in the SQL Server buffer pool yet still
maintain a minimum one-second execution time for the query provided here.
• For FTDW the following query is used:
SELECT sum([integer field]) FROM [table]
WHERE [restrict to appropriate data volume]
GROUP BY [col].
• Ensure that Resource Governor settings are at default values.
• Ensure that the query is executing from the buffer cache.
• Executing the query once should put the pages into the buffer, and subsequent executions should
read fully from buffer. Validate that there are no physical reads in the query statistics output.
Calculate MCR
50. • Set STATISTICS IO and STATISTICS TIME to ON to output results.
• Run the query multiple times, at MAXDOP = 1.
• Record the number of logical reads and CPU time from the statistics
output for each query execution.
• Calculate the MCR in MB/s using the formula:
( [Logical reads] / [CPU time in seconds] ) * 8KB / 1024
• A consistent range of values (+/- 5%) should appear over a
minimum of five query executions.
• Significant outliers (+/- 20% or more) may indicate configuration issues. The
average of at least 5 calculated results is the FTDW MCR.
Calculate MCR
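A minimal sketch of such a measurement run, assuming a hypothetical TPC-H-style dbo.lineitem table (the exact query and restriction depend on your reference data set):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Query-based scan that should resolve entirely from the buffer cache
SELECT SUM(l_quantity)
FROM dbo.lineitem
WHERE l_shipdate < '1998-01-01'
GROUP BY l_returnflag
OPTION (MAXDOP 1);   -- single core, per the FTDW method

-- Hypothetical result: 300,000 logical reads and 8 seconds of CPU time give
-- (300,000 / 8) * 8 KB / 1024, which is roughly 293 MB/s per core.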
51. • Special SQL Server Edition only available in hardware appliances
• Massively parallel processing
• Shared-nothing architecture
• Dedicated control nodes, compute nodes, and storage nodes
SQL Server Parallel Data Warehouse
Appliance components: control node, database servers (compute nodes), storage arrays (dual Fibre Channel), InfiniBand interconnect, cluster management servers, landing zone (ETL interface), and backup nodes.
54. • Identify the grain
• Select the required dimensions
• Identify the facts
Dimensional Modeling
55. • The grain of a dimensional model is the lowest level of detail
at which you can aggregate the measures.
• It is important to choose a level of grain that will support the most granular reporting and analytical requirements
• Typically the lowest level possible from the source data is the
best option.
Identify the grain
56. • Determine which of the dimensions related to the business
process should be included in the model
• The selection of dimensions depends on the reporting and
analytical requirements, specifically on the business entities by
which users need to aggregate the measures
• Almost all dimensional models include a time-based
dimension
Select the required dimensions
57. • Identify the facts that you want to include as measures.
• Measures are numeric values that can be expressed at the
level of the grain chosen earlier and aggregated across the
selected dimensions.
• Depending on the grain you choose for the dimensional
model and the grain of the source data, you might need to
allocate measures from a higher level of grain across multiple
fact rows.
Identify the facts
58. Documenting Dimensional Models
Example: a Sales Order dimensional model
Fact: Sales Order
• Measures: Item Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, Shipping Cost
Dimensions:
• Time (Order Date and Ship Date): Calendar Year, Month, Date; Fiscal Year, Fiscal Quarter, Month, Date
• Salesperson: Region, Country, Territory, Manager, Name
• Customer: Name, Country, State or Province, City, Age, Marital Status, Gender
• Product: Category, Subcategory, Product Name, Color, Size
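A minimal sketch of how this model might be expressed as a star schema in T-SQL (hypothetical names and data types; the Time and Salesperson dimensions follow the same pattern and are omitted for brevity):

CREATE TABLE dbo.DimProduct
(
    ProductKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    Category    nvarchar(50)  NOT NULL,
    Subcategory nvarchar(50)  NOT NULL,
    ProductName nvarchar(100) NOT NULL,
    Color       nvarchar(20)  NULL,
    Size        nvarchar(10)  NULL
);

CREATE TABLE dbo.DimCustomer
(
    CustomerKey     int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Name            nvarchar(100) NOT NULL,
    Country         nvarchar(50)  NOT NULL,
    StateOrProvince nvarchar(50)  NOT NULL,
    City            nvarchar(50)  NOT NULL,
    Age             int           NULL,
    MaritalStatus   nchar(1)      NULL,
    Gender          nchar(1)      NULL
);

CREATE TABLE dbo.FactSalesOrder
(
    OrderDateKey int   NOT NULL,   -- would reference the Time dimension
    ShipDateKey  int   NOT NULL,   -- would reference the Time dimension
    CustomerKey  int   NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    ProductKey   int   NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    ItemQuantity int   NOT NULL,
    UnitCost     money NOT NULL,
    TotalCost    money NOT NULL,
    UnitPrice    money NOT NULL,
    SalesAmount  money NOT NULL,
    ShippingCost money NOT NULL
);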
60. Each row
in a dimension table represents
an instance of a business entity
by which the measures
in the fact table
can be aggregated
Keep in mind
61. • A key column uniquely identifies each row in the dimension table.
• Usually the dimension data is obtained from a source system in which a key is already assigned; this is the “business key”.
• It is standard practice to define a new “surrogate key” that uses an
integer value to identify each row.
• A surrogate key is recommended for the following reasons:
• The data warehouse might use dimension data from multiple source systems, so it is possible
that business keys are not unique.
• Some source systems use non-numeric keys, such as a globally unique identifier (GUID), or
natural keys, such as an email address, to uniquely identify data entities. Integer keys are smaller
and more efficient to use in joins from fact tables.
• If the dimension table supports “Type 2” slowly-changing dimensions, a single business key can map to multiple rows, so it cannot uniquely identify them.
Dimension keys
62. Dimension keys
ProductKey ProductAltKey ProductName Color Size
1 MB1-B-32 MB1 Mountain Bike Blue 32
2 MB1-R-32 MB1 Mountain Bike Red 32
CustomerKey CustomerAltKey Name
1 1002 Amy Alberts
2 1005 Neil Black
(ProductKey and CustomerKey are surrogate keys; ProductAltKey and CustomerAltKey are business, or alternate, keys.)
63. • Hierarchies
• Multiple attributes can be combined to form hierarchies that enable users to drill
down into deeper levels of detail.
• Business users can view aggregated fact data at each level
• Slicers
• Attributes do not need to form hierarchies to be useful in analysis and reporting.
• Business users can group or filter data based on single-level hierarchies to create
analytical sub-groupings of data.
• Drill-through detail
• Some attributes have little value as slicers or members of a hierarchy.
• It can be useful to include entity-specific attributes to facilitate drill-through
functionality in reports or analytical applications.
Dimension Attributes and Hierarchies
64. Dimension Attributes and Hierarchies
CustKey CustAltKey Name Country State City Phone Gender
1 1002 Amy Alberts Canada BC Vancouver 555 123 F
2 1005 Neil Black USA CA Irvine 555 321 M
3 1006 Ye Xu USA NY New York 555 222 M
(Country, State, and City form a hierarchy; attributes such as Gender can serve as slicers; attributes such as Phone provide drill-through detail.)
65. • Identify the semantic meaning of NULL
• Unknown or None?
• Do not assume NULL equality
• Use ISNULL( )
Unknown and None
Source table:
OrderNo Discount DiscountType
1000 1.20 Bulk Discount
1001 0.00 N/A
1002 2.00 (NULL)
1003 0.50 Promotion
1004 2.50 Other
1005 0.00 N/A
1006 1.50 (NULL)
Dimension table:
DiscKey DiscAltKey DiscountType
-1 Unknown Unknown
0 N/A None
1 Bulk Discount Bulk Discount
2 Promotion Promotion
3 Other Other
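A minimal sketch of an ETL lookup that follows this guidance, assuming hypothetical Staging.Orders and dbo.DimDiscountType tables: rows whose DiscountType is NULL fail the join (NULL is never equal to anything), so ISNULL maps them to the Unknown member.

SELECT  s.OrderNo,
        s.Discount,
        ISNULL(d.DiscKey, -1) AS DiscKey    -- unmatched (NULL) values map to Unknown (-1)
FROM    Staging.Orders AS s
LEFT JOIN dbo.DimDiscountType AS d
        ON d.DiscAltKey = s.DiscountType;   -- 'N/A' matches the None member (key 0)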
66. • The simplest type of SCD to implement.
• Attribute values are updated directly in the existing dimension table
row and no history is maintained.
• Suitable for attributes that are used to provide drill-through details
• Unsuitable for analytical slicers or hierarchy members where historic
comparisons must reflect the attribute values as they were at the
time of the fact event.
Slowly Changing Dimensions – Type 1
CustKey CustAltKey Name Phone
1 1002 Amy Alberts 555 123
CustKey CustAltKey Name Phone
1 1002 Amy Alberts 555 222
67. • These changes involve the creation of a fresh version of the dimension entity in the form of a new
row.
• Typically, a bit column in the dimension table is used as a flag to indicate which version of the
dimension row is the current one.
• Additionally, datetime columns are often used to indicate the start and end of the period for which a
version of the row was (or is) current.
• Maintaining start and end dates makes it easier to assign the appropriate foreign key value to fact rows as they are
loaded so they are related to the version of the dimension entity that was current at the time the fact occurred.
Slowly Changing Dimensions – Type 2
CustKey CustAltKey Name City Current Start End
1 1002 Amy Alberts Vancouver Yes 1/1/2000
CustKey CustAltKey Name City Current Start End
1 1002 Amy Alberts Vancouver No 1/1/2000 1/1/2012
4 1002 Amy Alberts Toronto Yes 1/1/2012
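A minimal sketch of a Type 2 load for the City attribute shown above, assuming hypothetical Staging.Customer and dbo.DimCustomer tables where CustKey is an IDENTITY surrogate key and Current is a bit flag:

DECLARE @LoadDate date = CAST(SYSDATETIME() AS date);

-- Step 1: expire the current row when a tracked attribute has changed
UPDATE dim
SET    dim.[Current] = 0,
       dim.[End]     = @LoadDate
FROM   dbo.DimCustomer AS dim
JOIN   Staging.Customer AS src
       ON src.CustAltKey = dim.CustAltKey
WHERE  dim.[Current] = 1
  AND  dim.City <> src.City;

-- Step 2: insert a new current row for changed and brand-new customers
INSERT INTO dbo.DimCustomer (CustAltKey, Name, City, [Current], [Start], [End])
SELECT src.CustAltKey, src.Name, src.City, 1, @LoadDate, NULL
FROM   Staging.Customer AS src
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS dim
                   WHERE dim.CustAltKey = src.CustAltKey
                     AND dim.[Current] = 1);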
68. • Rarely used
• The previous value (or a complete history of previous values) is
maintained in the dimension table row.
• This requires modifying the dimension table schema to
accommodate new values for each tracked attribute, and can result
in a complex dimension table that is difficult to manage.
Slowly Changing Dimensions – Type 3
CustKey CustAltKey Name Cars
1 1002 Amy Alberts 0
CustKey CustAltKey Name Prior Cars Current Cars
1 1002 Amy Alberts 0 1
69. • Surrogate key
• Granularity
• Range
Time Dimension
DateKey DateAltKey MonthDay WeekDay Day MonthNo Month Year
00000000 01-01-1753 NULL NULL NULL NULL NULL NULL
20130101 01-01-2013 1 3 Tue 01 Jan 2013
20130102 01-02-2013 2 4 Wed 01 Jan 2013
20130103 01-03-2013 3 5 Thu 01 Jan 2013
20130104 01-04-2013 4 6 Fri 01 Jan 2013
• Attributes and hierarchies
• Multiple calendars
• Unknown values
70. • Create a Transact-SQL script
• Use Microsoft Excel
• Use a BI tool to autogenerate a time dimension table
Populating a Time Dimension Table
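A minimal sketch of the Transact-SQL option, populating the hypothetical dbo.DimDate table shown on the previous slide for a single year:

DECLARE @Date date = '2013-01-01', @EndDate date = '2013-12-31';

-- Unknown member
INSERT INTO dbo.DimDate (DateKey, DateAltKey, MonthDay, [WeekDay], [Day], MonthNo, [Month], [Year])
VALUES (0, '1753-01-01', NULL, NULL, NULL, NULL, NULL, NULL);

WHILE @Date <= @EndDate
BEGIN
    INSERT INTO dbo.DimDate (DateKey, DateAltKey, MonthDay, [WeekDay], [Day], MonthNo, [Month], [Year])
    VALUES (
        CONVERT(int, CONVERT(char(8), @Date, 112)),   -- e.g. 20130101
        @Date,
        DAY(@Date),
        DATEPART(weekday, @Date),
        LEFT(DATENAME(weekday, @Date), 3),            -- Tue, Wed, ...
        MONTH(@Date),
        LEFT(DATENAME(month, @Date), 3),              -- Jan, Feb, ...
        YEAR(@Date)
    );
    SET @Date = DATEADD(day, 1, @Date);
END;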
71. • A common requirement in a data warehouse is to support
dimensions with parent-child hierarchies
• Typically, parent-child hierarchies are implemented as self-
referencing tables, in which a column in each row is used as a
foreign-key reference to a primary-key value in the same
table
Self-Referencing Dimension
EmployeeKey EmployeeAltKey EmployeeName ManagerKey
1 1000 Kim Abercrombie NULL
2 1001 Kamil Amireh 1
3 1002 Cesar Garcia 1
4 1003 Jeff Hay 2
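A minimal sketch of such a self-referencing table in T-SQL (hypothetical names, matching the employee example above):

CREATE TABLE dbo.DimEmployee
(
    EmployeeKey    int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- surrogate key
    EmployeeAltKey int NOT NULL,                              -- business key
    EmployeeName   nvarchar(100) NOT NULL,
    ManagerKey     int NULL
        REFERENCES dbo.DimEmployee (EmployeeKey)              -- self-referencing foreign key
);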
72. • Combine low-cardinality attributes that don’t belong in
existing dimensions into a junk dimension
• Avoids creating many small dimension tables
Junk Dimensions
JunkKey OutOfStockFlag FreeShippingFlag CreditOrDebit
1 1 1 Credit
2 1 1 Debit
3 1 0 Credit
4 1 0 Debit
5 0 1 Credit
6 0 1 Debit
7 0 0 Credit
8 0 0 Debit
78. Understanding DW Components Activity
ETL Loads
• Bulk inserts
• Some lookups and updates
Data Warehouse
• Large fact tables
• Star joins to dimension tables
Data Model Processing
• Mostly table/index scans
Report Processing
• Predictable queries
• Many rows with range-based query filters
Self-Service BI
• Potentially unpredictable queries
79. • Create files with an initial size
• Based on the eventual size of the objects that will be stored on them
• This pre-allocates sequential disk blocks and helps avoid fragmentation.
• Disable autogrowth
• If you begin to run out of space in a data file, it is more efficient to explicitly
increase the file size by a large amount rather than rely on incremental
autogrowth.
Data files guidelines
80. • Create at least one filegroup in addition to the primary one, and
then set it as the default filegroup so you can separate data tables
from system tables.
• Create dedicated filegroups for extremely large fact tables, and use them to place those fact tables on their own logical disks.
• If some tables in the data warehouse are loaded on a different
schedule from others, consider using filegroups to separate the
tables into groups that can be backed up independently.
• If you intend to partition a large fact table, create a filegroup for each partition so that older, stable partitions can be backed up and then set as read-only.
Filegroups guidelines
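A minimal sketch applying both the data file and filegroup guidelines above (hypothetical database name, file paths, and sizes):

CREATE DATABASE DW
ON PRIMARY
    (NAME = DW_sys,  FILENAME = 'E:\DW\DW_sys.mdf',  SIZE = 1GB,   FILEGROWTH = 0),
FILEGROUP [DATA]
    (NAME = DW_data, FILENAME = 'F:\DW\DW_data.ndf', SIZE = 500GB, FILEGROWTH = 0)
LOG ON
    (NAME = DW_log,  FILENAME = 'G:\DW\DW_log.ldf',  SIZE = 50GB,  FILEGROWTH = 10GB);

-- Make the new filegroup the default so user tables stay out of PRIMARY
ALTER DATABASE DW MODIFY FILEGROUP [DATA] DEFAULT;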
81. Staging tables can be created in a separate staging database or in the data warehouse database itself:
• Separate staging database:
• Create it on a logical disk distinct from the data warehouse files.
• Staging tables in the data warehouse database:
• Create a file and filegroup for them on a logical disk separate from the fact and dimension tables.
• An exception to the previous guideline is made for staging tables that will be switched with partitions to perform fast loads:
• These must be created on the same filegroup as the partition with which they will be switched.
Staging tables
82. • To avoid fragmentation of data files
• Place it on a dedicated logical disk
• Set its initial size based on how much it is likely to be used.
• Set the growth increment to be quite large to ensure that
performance is not interrupted by frequent growth of
TempDB.
• Create multiple files for TempDB to help minimize contention during page free space (PFS) scans as temporary objects are created and dropped.
TempDB
83. • Set the recovery model of the data warehouse, the staging database, and TempDB to Simple
• This avoids having to back up and truncate transaction logs manually
• Additionally, most of the inserts in a data warehouse are typically performed as bulk load operations, which are minimally logged.
• To avoid disk resource conflicts between data warehouse I/O
and logging, place the transaction log files for all databases
on a dedicated logical disk.
Transaction logs
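A minimal sketch (hypothetical database names); note that TempDB always uses the simple recovery model:

ALTER DATABASE DW      SET RECOVERY SIMPLE;
ALTER DATABASE Staging SET RECOVERY SIMPLE;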
84. • SQL Server Enterprise edition supports data compression at
both page and row level.
• Data compression benefits in a data warehouse
• Reduced storage requirements.
• Improved query performance
• Best practices for data compression in a data warehouse
• Use page compression on all dimension tables and fact table partitions.
• If performance is CPU-bound, revert to row compression on frequently-
accessed partitions.
Data Compression
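A minimal sketch of these best practices (hypothetical table names and partition number):

-- Page compression on a dimension table
ALTER TABLE dbo.DimCustomer REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Page compression on all partitions of a fact table
ALTER TABLE dbo.FactSales REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);

-- If CPU-bound, revert a frequently-accessed partition to row compression
ALTER TABLE dbo.FactSales REBUILD PARTITION = 12 WITH (DATA_COMPRESSION = ROW);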
85. • Improved query performance
• More granular manageability
• Improved data load performance
• Best practices for partitioning in a DW
• Partition Large Fact Tables
• Partition on an incrementing date key
• Design the partition scheme for ETL and manageability.
• Maintain an empty partition at the start and end of the table
Table Partitioning
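A minimal sketch of partitioning a fact table by month on an incrementing integer date key (hypothetical names; in practice each partition would usually map to its own filegroup rather than PRIMARY):

CREATE PARTITION FUNCTION pfOrderDate (int)
    AS RANGE RIGHT FOR VALUES (20130101, 20130201, 20130301, 20130401);
    -- with RANGE RIGHT, the first partition (< 20130101) and the last (>= 20130401)
    -- can be kept empty to support fast loads and sliding windows

CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSales
(
    OrderDateKey int   NOT NULL,
    ProductKey   int   NOT NULL,
    CustomerKey  int   NOT NULL,
    SalesAmount  money NOT NULL
) ON psOrderDate (OrderDateKey);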
86. • Indexes maximize query performance
• Planning indexes is one of the most important parts of the database design process
• Some inexperienced BI professionals are tempted to create many indexes on all tables to support queries, but too many indexes slow down data loads and add little value to the scan-heavy queries that are typical of a data warehouse.
Indexes in DW
87. • Create a clustered index on the surrogate key column.
• This column is used to join the dimension table to fact tables, and a clustered
index will help the query optimizer minimize the number of reads required to filter
fact rows.
• Create a non-clustered index on the alternate key column and
include the SCD current flag, start date, and end date columns.
• This index will improve the performance of lookup operations during ETL data
loads that need to handle slowly-changing dimensions.
• Create non-clustered indexes on frequently searched attributes, and
consider including all members of a hierarchy in a single index.
Dimension table indexes
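A minimal sketch of these dimension table indexes (hypothetical names, matching the DimCustomer examples used earlier):

-- Clustered index on the surrogate key (often created as the clustered primary key)
CREATE CLUSTERED INDEX CIX_DimCustomer ON dbo.DimCustomer (CustKey);

-- Non-clustered index on the alternate key, covering the SCD columns used by ETL lookups
CREATE NONCLUSTERED INDEX IX_DimCustomer_AltKey
    ON dbo.DimCustomer (CustAltKey)
    INCLUDE ([Current], [Start], [End]);

-- Non-clustered index covering a frequently searched hierarchy
CREATE NONCLUSTERED INDEX IX_DimCustomer_Geography
    ON dbo.DimCustomer (Country, [State], City);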
88. • Create a clustered index on the most commonly-searched
date key.
• Date ranges are the most common filtering criteria in most data warehouse
workloads, so a clustered index on this key should be particularly effective in
improving overall query performance.
• Create non-clustered indexes on other, frequently-searched
dimension keys.
• Consider a columnstore index that includes all of the columns in the fact table
Fact table indexes
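A minimal sketch of these fact table indexes (hypothetical names; the nonclustered columnstore index matches the SQL Server 2012/2014 feature set assumed by this deck):

-- Clustered index on the most commonly searched date key
CREATE CLUSTERED INDEX CIX_FactSales ON dbo.FactSales (OrderDateKey);

-- Non-clustered indexes on other frequently searched dimension keys
CREATE NONCLUSTERED INDEX IX_FactSales_Product ON dbo.FactSales (ProductKey);

-- Columnstore index covering all columns of the fact table
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
    ON dbo.FactSales (OrderDateKey, ProductKey, CustomerKey, SalesAmount);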
89. • Create a view for each dimension and fact table with
NOLOCK query hint in the view definition
• Create views with user-friendly view and column names
• Do not include metadata columns in views
• Create views to combine snowflake dimension tables
• Partition-align indexed views
• Use the SCHEMABINDING option
• Security
Using Views in a DW
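A minimal sketch of such a view (hypothetical names): user-friendly column names, a NOLOCK hint, and no metadata columns.

CREATE VIEW dbo.Customer
AS
SELECT  CustKey AS [Customer Key],
        Name    AS [Customer Name],
        Country,
        [State],
        City
FROM    dbo.DimCustomer WITH (NOLOCK);

If the view is to be indexed (for example, to partition-align it), it must be created WITH SCHEMABINDING and the table hint removed, because indexed views cannot contain table hints.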
91. • Overview of Data Warehousing
• Data Warehouse Solution
• Data Warehouse Infrastructure
• Data Warehouse Hardware
• Data Warehouse Design Overview
• Designing Dimension Tables
• Designing Fact Tables
• Data Warehouse Physical Design
Summary