This document provides guidelines for building cloud BI project architectures. It discusses considerations for architectural design such as data sources, volumes, model complexity and sharing needs. It then presents four common architecture templates - Hulk, Iron Man, Thor and Hawkeye - tailored to different needs around reporting demand, data volume and complexity. Key aspects of architectures like sources, transportation, processing, storage, live calculation, data access and orchestration are examined. Finally, it compares features of technologies that can fulfill different functional roles.
2. Executive summary
This document is intended to provide guidelines for building architectures on cloud BI projects.
Considerations
To define an architecture for your project, we suggest you look at these criteria:
- Source: where your data is located
- ETL complexity: what kind of business rules and transformations you need to support
- Data volumes: the sheer size of the data
- Model complexity: the business problem you are representing and the kind of KPIs you'll need to support
- Sharing needs: whether the data is only used for this project or if it needs to integrate with core data assets
- Reporting demand: the expected rendering speed, and for how many users
Templates
We’ve defined four templates based on common needs patterns which can be reused as-is or slightly modified to suit your particular case.
- Hulk: when you need pure muscle-power. Ex: CDB Reporting, Datalens
- Thor: for simple reporting over large data. Ex: Radarly, Digital Dashboard
- Iron Man: for complex reporting over light data. Ex: Budgit
- Hawkeye: quick and agile project for POCs or short-lived needs. Ex: Weezevent
3. Architectural considerations
The criteria you should consider when planning an architecture:

Sources
Cloud data sources are simple to capture. On-premises data sources can imply a form of gateway (or IR), a push from the local infrastructure, or a VPN access linking cloud resources to local networks.

Data volumes
Small data volumes can generally be processed in memory all at once and fit within the 1 GB data limitation in Power BI. Medium data volumes can be processed with a single machine, whereas large data volumes require cluster-based, parallelized processing.

Data interests
Local data interests can be managed in a fully autonomous way, isolated from other projects and stakeholders. Global data interests intend for their results to be reused by other projects and teams. As such, these projects have more complex integration phases and more advanced security features to manage them.
4. Architectural considerations
ETL complexity
Simple ETL involves only light transformations and data type casting. Medium ETL transforms an incoming landing model into a fully-fledged star schema. Complex ETL involves proactive data quality management, advanced dimensional models and/or intricate business rules.

Model complexity
Simple models use additive measures over a single star schema or a flat dataset. Medium models include advanced DAX with semi-additive measures and/or calculations over multiple star schemas. Complex models require performance-hindering features such as row-level security, bi-directional cross-filtering or very advanced DAX calculations.

Reporting demand
Low demands mean that it is acceptable to have longer response times (5-15 s). Medium demands require snappy response times (<100 ms) for a small number of concurrent users. High demands involve snappy response times for a large number of concurrent users.
5. Functional phases
An architecture is divided into functional workloads. A single technology can support multiple workloads, and a single workload can sometimes be shared between different technologies.
- Sources: where the original data lives
- Transporting: what moves the data from the source to the platform
- Orchestrating: what coordinates the different services
- Processing: what cleans and transforms the data from its raw state to its usable form
- Storage: where data lives in its cold form
- Live calculation: where reporting calculations are made for the end users
- Data access: how the end users access the data
6. Sources
Sources come in two main categories: cloud sources and on-premises sources. Generally speaking, cloud sources are relatively simple to manage, whereas on-prems sources have to deal with the added complexity of networking.

Cloud: in this category, we find object storage like AWS S3 or Azure Blob, API calls and user documents (ex: Excel files) stored on SharePoint Online.
- Object storage is straightforward and is handled with an ID/secret mechanism.
- API calls can be a bit more complex, especially depending on the authentication mechanism, but often offer a good amount of flexibility in what is returned. They are often capped in terms of data size per call and require more custom logic to handle.
- Documents stored on SPO allow users to give direct input into the solution but come with the perils of poorly formed Excel files. Whenever possible, we recommend capturing user inputs through a small web application or a PowerApp.

On-prems: these sources are highly valuable (they often form the core of information systems) but can be tricky to access from a cloud service. A few options are available to handle this situation:
- Joining the cloud resource to the internal network through a VPN
- Exposing part of the source (or extracts) in a DMZ. This may not be possible if the data is sensitive
- Having an on-premises ETL push the data to the cloud rather than having cloud services fetch the data
- Using a gateway like Azure Data Factory's Integration Runtime to act as a bridge between the on-prems resources and the cloud service. This tends to be the easiest scenario.
7. Transportation
Transportation refers to the Extract-and-Load workloads (without the transformations). Throughput, connectivity, parametrization, and monitoring are the key aspects in choosing the right transportation solution.

Power BI Dataflows: when data volumes are small (less than 1 GB), transformations are simple and the end destination is solely meant to be used in Power BI, its Dataflow/Power Query engine can be used. It supports a large array of connectors and decent parametrization possibilities.

Python: ideally run in a serverless environment (AWS SageMaker, Airflow or Azure Functions), managed code can adapt to a wide variety of data sources, shapes and destinations… provided you have the skill and time to code the E-L solution. This is more adapted to small-to-medium data volumes since the code is usually limited to a single machine (and often to a single CPU thread). It is fully DevOps compatible.
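As a rough illustration of this Python E-L pattern, here is a minimal sketch (hypothetical API endpoint, container and connection string; it assumes the requests and azure-storage-blob packages) that extracts a JSON payload from an API and lands it, untransformed, in a blob/ADLS landing container. The same function could run inside an Azure Function or an Airflow task.

```python
import datetime
import json

import requests
from azure.storage.blob import BlobServiceClient

# Hypothetical endpoint and storage account; replace with your own.
API_URL = "https://api.example.com/v1/orders"
CONN_STR = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"
CONTAINER = "landing"


def extract_and_load():
    # Extract: one paged API call (real APIs often cap the rows per call).
    response = requests.get(API_URL, params={"page_size": 1000}, timeout=60)
    response.raise_for_status()
    payload = response.json()

    # Load: land the raw payload as-is in the landing zone, partitioned by date.
    blob_path = f"orders/{datetime.date.today():%Y/%m/%d}/orders.json"
    blob_service = BlobServiceClient.from_connection_string(CONN_STR)
    blob_client = blob_service.get_blob_client(container=CONTAINER, blob=blob_path)
    blob_client.upload_blob(json.dumps(payload), overwrite=True)


if __name__ == "__main__":
    extract_and_load()
```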
Azure Data Factory: the go-to solution for E-L workloads in Azure, ADF is capable of handling large workloads with excellent throughput (especially if landing to ADLS) and is fully DevOps compatible. ADF's main downside is that while it can read from many different sources, it typically writes only to Microsoft destinations (with a few exceptions).
OUR VERDICT
- Power BI Dataflows: for self-service projects
- Azure Data Factory: the go-to solution, even more so with on-prems sources
- Python: your can opener for complex files and API calls
8. Processing
The processing layer is where the data quality and business logic are applied. Some projects require very light transformations whereas others completely change the model and apply complex business logic.

PySpark: a managed code framework that can scale very well to big data scenarios. Spark-based solutions can be easily implemented in a PaaS format through Databricks, with fully managed notebooks, simple industrialization of code and good monitoring capabilities. For maintainability and support, we recommend using PySpark as the main Spark language.
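To give a flavour of the kind of work this layer does, here is a minimal PySpark sketch, assuming hypothetical landing and curated ADLS paths and an orders dataset; it illustrates the pattern (casting, filtering, deriving columns, writing a curated output), not any specific project pipeline.

```python
from pyspark.sql import SparkSession, functions as F

# On Databricks a SparkSession already exists as `spark`; getOrCreate() reuses it.
spark = SparkSession.builder.appName("orders-processing").getOrCreate()

# Hypothetical landing and curated paths on ADLS; adjust to your lake layout.
landing_path = "abfss://landing@mydatalake.dfs.core.windows.net/orders/"
curated_path = "abfss://curated@mydatalake.dfs.core.windows.net/fact_orders/"

orders = spark.read.json(landing_path)

# Light data quality and business logic: type casting, filtering, derived columns.
fact_orders = (
    orders
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .filter(F.col("status") != "cancelled")
    .withColumn("order_year", F.year("order_date"))
)

# Write the curated output partitioned by year (use format("delta") on Databricks).
fact_orders.write.mode("overwrite").partitionBy("order_year").parquet(curated_path)
```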
SQL procedures: transformations can take place within the database itself, limiting data movement. Azure SQL DB supports a DevOps-compatible, fully-fledged language (T-SQL) and is suitable for small-to-medium data volumes. Snowflake's SQL programmatic objects are less developed, but the platform can handle very large data volumes. Azure SQL DWH offers big-data levels of volume using SQL stored procedures but needs to be managed more actively (manual cluster starting/scaling/stopping).
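The transformation logic itself is T-SQL, but triggering it from an orchestration script is straightforward; below is a minimal, hypothetical sketch using pyodbc to invoke a stored procedure (the procedure name, connection string and parameter are illustrative) so the heavy lifting stays inside the database.

```python
import pyodbc

# Hypothetical Azure SQL DB connection string; in a real pipeline, pull the
# credentials from Key Vault or use a managed identity instead.
conn_str = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myproject.database.windows.net,1433;"
    "Database=DataMart;UID=etl_user;PWD=<secret>"
)

# The transformation lives in the database as a stored procedure
# (dbo.usp_BuildStarSchema is a made-up name); the caller only triggers it.
with pyodbc.connect(conn_str, autocommit=True) as cnxn:
    cursor = cnxn.cursor()
    cursor.execute("EXEC dbo.usp_BuildStarSchema @LoadDate = ?", "2024-01-31")
```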
Azure Data Factory: ADF offers a GUI-based dataflow engine powered by Spark. It can handle very large data volumes, but may be somewhat more limited for very complex transformations. Simple to medium ETL complexity can be handled without problems.

Python: Python's processing profile is similar to its transportation profile: very flexible but potentially longer to develop, and best used for smaller data volumes.

PBI Dataflow: Power BI offers a simple GUI-based (codeless) interface for light ETL workloads. It can be used to develop simple transformations over low data volumes very quickly and has a large array of source connectors. Its output is limited to Power BI and the Common Data Model in ADLS.
OUR VERDICT
- PBI Dataflow: for self-service projects
- ADF: for point-and-click ETL in the cloud
- Python: for complex flows and ML
- SQL procedures: for SQL professionals
- PySpark (Databricks): for big data projects
9. ETL key feature comparisons
Here are the key differentiators for typical ETL solutions.
[Feature comparison matrix: PBI Dataflow, PySpark (Databricks), ADF + Dataflow, ADF + SQL Procs (Azure SQL DB) and Python (Airflow), rated on cloud sources, on-prems sources, handling of semi-structured data, handling of Excel files, destinations, data volume, transformation capabilities, machine learning capabilities, CI/CD capabilities, alerting and monitoring, and ease of development.]
10. Storage
The storage layer is where cold data rests. Its main concerns are data throughput (in and out), data management capabilities (RLS, data masking, Active Directory authentication, etc.) and DevOps compatibility.

Azure Data Lake Storage: ADLS is an object-based storage system with an HDFS-compatible interface. It has excellent throughput but is limited to file-level security (not row-level or column-level). As such, it is best used during massive import/export or where 100% of the file needs to be read.

Snowflake: soon to be present globally on Azure, Snowflake is a true data warehouse solution in the cloud. As a pure storage layer, it doesn't have the same data management or DevOps capabilities as Azure SQL DB, but it supports an impressive per-project compute costing model over the same data. Its throughput is similar to Azure SQL DWH.

Azure SQL DB: Azure's main fully-featured DBMS, SQL DB has excellent data management capabilities and Active Directory support. It supports a declarative development approach which offers many DevOps opportunities. However, its throughput is not great and massive data volumes should be loaded using an incremental approach for best performance. When hosting medium data volumes or more, consider an S3 service tier or higher to access advanced features like columnstore storage.

Azure SQL DWH: a massively parallel processing (MPP) version of Azure SQL DB. By default, data is spread across 60 distributions, themselves spread between 1 and 60 nodes, which offers very high throughput when required. Azure SQL DWH supports programmatic objects such as stored procedures but has a slightly different writing style to Azure SQL DB in order to take advantage of its MPP capabilities.
BEST FOR
- ADLS: landing native files
- Azure SQL DB: small structured data
- Azure SQL DWH: big data ETL
- Snowflake: big data ad hoc use
11. Database key feature comparisons
The cloud now offers multiple solutions for hosting relational data. While these have more or less feature parity on all core functionalities (they can all perform relatively well), key differences do exist between them.
[Feature comparison matrix: Snowflake, Azure SQL DB and Azure SQL DWH, rated on scale time, compute/storage isolation, semi-structured data, PBI integration, Azure Active Directory integration, DevOps & CI/CD support, temporal tables, data cloning, DB programming, cost when used as an ETL engine, cost when used as a reporting engine, and ease of budget forecasting.]
12. Live calculation
Reporting calculation engines perform the real-time data crunching when users consume reports. This tends to be a high-demand, 24/7 service due to the group's international nature. Model complexity and reporting demands are key drivers when choosing an appropriate technology for this layer.

Power BI models: if data volumes are small (less than 1 GB of compressed data), Power BI native models, especially when used on a Premium capacity, give the most fully-featured capabilities for this workload on Power BI. Their only real drawback is the lack of a real developer experience like source control or CI/CD capabilities. Premium workload data size limits will soon be increased to 10 GB… for a fee.

Composite models: composite models allow Power BI to natively keep some of its data (like dimensions or aggregated tables) in-memory and the rest in DirectQuery mode. This reduces the need for computation at the reporting layer. It may not be adequate for complex models, since the DAX used by DirectQuery still has limitations, and it may involve uneven performance depending on whether a query can hit an in-memory PBI table or needs to revert back to DirectQuery. Composite models are best suited for dashboarding scenarios (limited interactivity) over large data.

Azure Analysis Services: AAS is the scale-out version of Power BI native models. It can perform the same complex calculations over very large datasets at a cost-efficient level (when compared to Power BI Premium). AAS has a slower release cycle than Power BI, and thus tends to lack the latest features supported by PBI. It does however support a real developer experience with source control and development environments.

DirectQuery: often misunderstood, this mode makes every visualization on a report send one (or many) SQL queries to the underlying source for every change on the page (ex: a filter applied). Performance will be slower than in-memory (although it may still be acceptable depending on data volume and back-end compute power). Other issues include DAX limitations and some hard limits on the data volume being returned. For these reasons, limit DirectQuery to exploratory scenarios, near-time scenarios or dashboarding scenarios with limited complexity and interactivity. Depending on the source, there may also be significant pressure on data gateways.
OUR VERDICT
- Power BI models: when it fits!
- Azure Analysis Services: when it doesn't!
- Composite models: simple reports over big data
- DirectQuery: near-time, dashboards and exploration
13. Live calculation feature comparison
This workload comparison is somewhat more complicated because of the relationship between data volumes and calculation complexity. Regardless, here are some general trends we can observe.
[Feature comparison matrix: DirectQuery (on Snowflake), Composite models (on Snowflake), AAS and PBI models (non-Premium), rated on cost, volume, model and KPI complexity, refreshing, CI/CD, row-level security, data mashing, and calculation speed.]
14. Data Access
Data access refers to how the users are able to get to the data. This is done through one of the following methods:

Power BI Embedded: preconfigured reports and dashboards for a topic are accessible key-in-hand from the data portal.

Direct access – data models: if there's a need to create a local version of the report and enhance the model with additional KPIs, it is possible to connect directly to the data model through Excel or Power BI. This allows the report maker to start from pre-validated dimensional data and KPIs and focus on his/her own additional KPIs and visuals. This method, however, doesn't allow the report maker to mash additional data into the current model.

Direct access – curated data: if a BI developer wants to create his/her own model, it is possible to access the curated dimensional data directly from the Core DB. This significantly lowers the cost of a project by diminishing ETL costs and tapping into pre-validated dimensional data. KPIs and any additional data will still need to be developed and tested before use.

Direct access – data lake: a BI developer or a data scientist may wish to have access to raw data files for their own project. This may or may not be possible, depending on security needs, on a per-dataset basis.
USE FOR
- Power BI Embedded: key-in-hand reports
- Direct access to data models: build your own reports
- Direct access to curated data: import for your own data warehouse
- Direct access to the data lake: datasets for data science
15. Where to connect
What you get depending on where you connect:
- Data lake: raw data
- Core DB: curated dimensional data, business rules applied and tested, ETL already done, one location for multiple data sources
- Analysis Services: KPIs built and tested, calculation engine for self-service
- Power BI: data visualization based on common needs
16. Orchestrating
Orchestrating is a key feature of cloud architecture. Not only does the orchestrator need to launch and manage jobs, it should also be able to interact with the cloud fabric (resource scaling, creation/deletion, etc.). Key features include native connectivity to various APIs, programmatic flows (if, loops, branches, error-handling, etc.), trigger-based launching, DevOps compatibility and GUIs for development and monitoring.

Logic Apps: the de facto orchestration tool in the Azure stack, Logic Apps includes all the key features required in such a tool. It offers GUI-based experiences that can be scaled to a full DevOps pipeline.

Azure Data Factory: ADF includes its own simple scheduling and orchestrating tool. While not as developed as Logic Apps, it can be a valid choice when the orchestration is limited to simple scheduling, core data activities (ADF pipelines, Databricks jobs, SQL procs, etc.) and basic REST calls.

Airflow: Airflow is a code-oriented scheduler, allowing people to create complex workflows with many tasks. It is mainly used to execute Python or Spark tasks. It is seamlessly integrated in the Data Portal, so it is a good choice if you keep your data in the data lake and want a single entry point to monitor both your data and your processes.
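As an illustration of this code-oriented style, here is a minimal Airflow DAG sketch; the task bodies, DAG id and schedule are placeholders rather than a prescribed setup.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def land_from_source():
    # Placeholder for an extract-and-load task (e.g. S3 or an API -> data lake).
    ...


def run_spark_etl():
    # Placeholder for triggering the Databricks / Spark transformation job.
    ...


with DAG(
    dag_id="daily_bi_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # every day at 02:00
    catchup=False,
) as dag:
    land = PythonOperator(task_id="land_from_source", python_callable=land_from_source)
    transform = PythonOperator(task_id="run_spark_etl", python_callable=run_spark_etl)

    # Control flow: the landing task must finish before the transformation starts.
    land >> transform
```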
OUR VERDICT
- Logic Apps: the go-to solution in Azure
- ADF: simple scheduling needs with ADF
- Airflow: data science projects in AWS
17. Feature comparisons for orchestration
Here are the key differentiators for orchestration solutions.
[Feature comparison matrix: ADF, Airflow and Logic Apps, rated on CI/CD, scheduling and triggering, native connectors, debugging, ease of development, alerting & monitoring, parameterization, control flow, interfacing with on-prems assets, and secrets management.]
18. Architecture templates
Whilst not the only considerations to take into account, architectures can be broadly segmented by the volume of data they handle and the complexity of ETL and model they must support (data volume vs project complexity). Based on this, we've defined template architectures to guide you through the design process.
- Hulk: when you need pure muscle-power. Ex: CDB Reporting, Datalens. Any sources, large data volumes, global interest; complex ETL, complex model, high demand.
- Thor: simple reporting over large data. Ex: Radarly, Digital Dashboard. Any sources, large data volumes, global interest; complex ETL, simple model, medium demand.
- Iron Man: complex reporting over light data. Ex: Budgit. Any sources, small data volumes, local interest; complex ETL, complex model, medium demand.
- Hawkeye: quick and agile project for POCs or short-lived needs. Ex: Weezevent. Any sources, small data volumes, local interest; simple ETL, medium model, medium demand.
19. Hulk
When you need pure muscle-power.
Best used for: any sources, large data volumes, global interest; complex ETL, complex model, high demand.
[Architecture diagram: sources → ADF (transporting) → Databricks (pySpark) / ADF dataflow (processing) → ADLS, Core DB, data mart (storage) → AAS (live calculation) → PBI Embedded (data access); orchestrated by Logic Apps; components split between project resources and shared resources.]
20. Hulk (Step 1)
Data is landed from S3 to ADLS Gen2 via an ADF pipeline. This ensures a fast, bottleneck-free landing phase. Due to the volume, an incremental loading approach is highly recommended to limit the impact on an on-prems IR gateway and the throughput to the SQL DB. We could have used ADF scheduling in simple scenarios; however, for uniformity's sake, and to benefit from extra alerting capabilities, Logic Apps is preferred as the overall scheduler.
21. Hulk (Step 2)
Databricks or ADF is used to perform the complex ETL over a large dataset by leveraging the Spark SQL engine from Python. It fetches the landed data from ADLS and enriches it with curated data from the Core DB to perform the complex ETL.
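A hedged PySpark sketch of what this step could look like, with made-up paths, table names and credential handling; the point is the pattern of reading the landed files, joining in a curated Core DB dimension over JDBC, and producing the enriched output.

```python
from pyspark.sql import SparkSession, functions as F

# On Databricks a SparkSession already exists as `spark`; getOrCreate() reuses it.
spark = SparkSession.builder.appName("hulk-step2").getOrCreate()

# Hypothetical locations: landed raw data on ADLS and curated data in the Core DB.
landing_path = "abfss://landing@mydatalake.dfs.core.windows.net/sales/"
core_db_url = "jdbc:sqlserver://coredb.database.windows.net:1433;database=CoreDB"

raw_sales = spark.read.parquet(landing_path)

# Pull a curated dimension from the Core DB to enrich the raw facts.
dim_product = (
    spark.read.format("jdbc")
    .option("url", core_db_url)
    .option("dbtable", "dbo.DimProduct")
    .option("user", "etl_user")        # in practice, fetch secrets from Key Vault
    .option("password", "<secret>")
    .load()
)

# Complex ETL step (heavily simplified): enrich, clean and aggregate.
enriched = (
    raw_sales.join(dim_product, on="product_code", how="left")
    .filter(F.col("quantity") > 0)
    .groupBy("product_key", "order_date")
    .agg(F.sum("amount").alias("sales_amount"))
)
```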
22. Hulk (Step 3)
Any group data that can be used in an overall curated dimensional model useful for other projects is pushed back to the Core DB. This database is thus enriched, project by project, with easy-to-use, vouched-for datasets. Data that is purely report-specific is pushed to a project data mart.
23. Hulk (Step 4)
Due to the size of the reporting dataset, the complexity of its model and KPIs, and the expected reporting performance, an AAS cube is used as the main reporting engine. The final report is built on top of this cube and exposed in the Data Portal through Power BI Embedded.
24. Hulk (Step 5)
Subsidiaries that wish to do so can access the AAS cube to make their own custom reports and/or fetch the dimensional data through the Core DB by using custom views for security purposes.
26. Thor (Step 1)
Simple reporting over large data.
The overall architecture resembles a Hulk-like scenario due to the volume of data. Transformations are performed in Databricks or ADF even though the ETL is rather simple, because the data volumes can overwhelm a single-machine architecture and/or the throughput of the target database.
[Architecture diagram: sources → ADF (transporting) → Databricks / ADF dataflow (processing) → ADLS, Core DB, data mart (storage) → composite models (live calculation) → PBI Embedded (data access); orchestrated by Logic Apps.]
27. Thor (Step 2)
Two options are available for the reporting calculations. The simplest one is to use an AAS cube; this is potentially more expensive in terms of software, but simpler in development. The alternative is to use PBI models with aggregations, if the KPIs are simple enough to hit the aggregations on a regular basis. However, this complexifies the data model (and thus development time), can hurt performance when the query is passed through to the source, and incurs costs on the data mart layer (using Snowflake because of its per-query costing model).
29. Iron Man
Complex reporting over light data.
Best used for: any sources, medium data volumes, local interest; complex ETL, complex model, medium demand.
[Architecture diagram: sources → ADF dataflow (transporting) → SQL procedures (processing) → data mart (storage) → PBI model (live calculation) → PBI Embedded (data access); orchestrated by Logic Apps.]
30. Iron Man (Step 1)
This architecture is designed for "tactical projects" where data sharing is not paramount and data volumes are low, but which may still require a fair amount of business rules and data cleansing. The low volume means we can write raw data directly to the database without a file-based landing in a data lake.
31. Iron Man (Step 2)
The complex ETL can then be implemented in SQL stored procedures within the database itself. For simplicity's sake, the orchestration in Logic Apps launches ADF, and ADF launches the procedures after the landing. This reduces the complexity of the Logic Apps code needed to handle long-running procedures.
33. Iron Man - AWS
Complex reporting over light data.
Best used for: cloud sources, small data volumes, local interest; complex ETL, complex model, medium demand.
[Architecture diagram: Amazon S3 (sources), Airflow (orchestrating), Python operators and Databricks (pySpark) for transport and processing, Snowflake (storage), PBI Embedded in import mode (live calculation and data access).]
34. Hawkeye
Quick project for POCs or short-lived needs.
Best used for: any sources, small data volumes, local interest; simple ETL, medium model, medium demand.
[Architecture diagram: sources → PBI dataflow (transporting and processing) → PBI dataflow storage → PBI model (live calculation) → PBI Embedded (data access); orchestrated by PBI scheduling.]
35. Hawkeye (Step 1)
A fully self-service Power BI stack is possible on small data volumes and low ETL complexity. This architecture should be kept to proofs of concept or temporary projects where time-to-market is paramount and maintainability is not required. While PBI dataflows are able to handle larger and larger data volumes with Premium capacities, current price-performance ratios are highly suboptimal on larger data volumes.
36. Hawkeye (Step 2)
Power Query and the M language are capable of handling low-to-medium levels of complexity in the ETL. However, they currently lack the life-cycle tooling (version control, automated deployments, etc.) that is required for professional development.
37. Hawkeye (Step 3)
A major roadblock preventing this architecture from being deployed beyond POCs and temporary projects is the current lack of integration with external storage systems. The only current possibilities are tightly integrated with the Common Data Model initiative in ADLS, which has yet to prove its viability beyond Dynamics 365.
38. Hawkeye (Step 4)
The calculation and data access are of course handled in Power BI directly. Here again, keep data volumes to a minimum. While Premium capacities can accommodate larger and larger volumes, the price-performance ratios are downright disastrous compared to alternatives (AAS cubes and composite models).
39. Hawkeye - AWS
Quick project for POCs or short-lived needs.
Best used for: cloud sources, small data volumes, local interest; simple ETL, simple model, medium demand.
[Architecture diagram: Amazon S3 (sources), Airflow (orchestrating), Python operator (transporting and processing), Snowflake (storage), Superset (live calculation and data access).]
40. Where?
The Core DB is present in the Hulk and Thor templates. It is a single database used by several projects.
[Diagram: the Core DB sits as a shared resource in the storage layer of both the Thor and Hulk architectures, alongside ADLS and the project data marts.]
41. What is Core DB?
Core DB contains global information that can be used by all affiliates and across several projects. It is a central repository that is gradually built from widely used business data (e-commerce, prisma, websites, consumer activities, etc.).
[Diagram: sources → data lake → Core DB (ref/MDM data, data quality, 3rd normal form, conformed dimensions; the EDW) → project data marts (star schemas).]
This widely used pattern allows:
- Consistency across projects
- Better quality and overall data management
- Lower project costs through reuse of validated assets
42. What is Core DB? (continued)
Core DB contains global information that can be used by all affiliates and across several projects:
- Common dimensions and referentials: products, entities, contacts, geography, ...
- Widely used business events: e-commerce orders, prisma data, website views, consumer activities, ...
43. What is Core DB? (continued)
The model is business oriented, not source oriented. It has its own IDs to allow cross-source identification: for the same business item (for example a contact) it can ingest data from several sources (CDB, client database, employee database...). Therefore, the model and schemas do not depend on the sources.
[Diagram: several sources feeding a single Core DB contact record with columns Contact ID, CDB_id, Email, City, Score.]
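To make the cross-source identification concrete, here is a hedged PySpark sketch with hypothetical source extracts, column names and a simplistic dedup rule, showing how contacts from two systems could be conformed into a single Core DB dimension that carries its own contact ID.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("core-db-contacts").getOrCreate()

# Hypothetical source extracts, already landed in the data lake.
cdb_contacts = spark.read.parquet("abfss://landing@lake.dfs.core.windows.net/cdb/contacts/")
crm_contacts = spark.read.parquet("abfss://landing@lake.dfs.core.windows.net/crm/contacts/")

# Map each source to the business-oriented Core DB shape; the source key is kept
# only as a lineage column, never as the model's identifier.
unified = (
    cdb_contacts.select(F.col("cdb_id").alias("source_id"), "email", "city")
    .withColumn("source_system", F.lit("CDB"))
    .unionByName(
        crm_contacts.select(F.col("crm_key").alias("source_id"), "email", "city")
        .withColumn("source_system", F.lit("CRM"))
    )
)

# Deduplicate on a business key (email, as a stand-in) and mint a Core DB contact
# ID, so downstream models never depend on any single source's identifiers.
dim_contact = (
    unified.dropDuplicates(["email"])
    .withColumn("contact_id", F.monotonically_increasing_id())
)
```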
44. Because of it’s multi-project nature, CoreDB has special requirements in terms of data
management and development practices.
Core DB requirements
- The data model is managed by a
central data architect
- Changes must be handled by pull
request on a central repository
- Permissions have to be managed
granularly per user and asset
- Access must be granted through
objects (views, procs) which can
part of a automated testing
pipeline
PROD TEST
DEV BRANCHES
Daily rebuild
& sanitizing
Pull request
Automated Testing
Branching
Pull request
Project-based DB development
On-demand CI/CD build
45. Data lake vs Core DB
Core DB approach:
- Strong cost of input (ETL)
- Small cost of output
- Structured data
- Business event oriented
- Recommendation: only common data

Data lake approach:
- Small cost of input
- Strong cost of output (data prep)
- Miscellaneous data
- "Find signal in the noise"
- Recommendation: all data