Discovery Day 2019 Sofia - Big data clusters

Powered by SQL Server 2019 Big Data Clusters
Rozalina Zaharieva & Dimitar Zahariev

SQL Server Big Data Cluster Layout
[Diagram: a SQL Server big data cluster running on a Kubernetes cluster, with each component in a Kubernetes pod on nodes backed by persistent storage. Control plane: controller and SQL Server master instance. Compute plane: compute pools made of SQL compute nodes. Storage plane: a data pool of SQL data nodes and a storage pool of HDFS data nodes, each with Spark and SQL Server, which read directly from HDFS. Inputs: IoT data and external data sources. Consumers: analytics, custom apps, BI.]

Architecture dissection
• Kubernetes (K8s) concepts
• SQL Server 2019 big data cluster (BDC) components

What is Kubernetes and what does it do?
• Kubernetes is a container orchestrator. It is responsible for:
  • Running a cluster of hosts
  • Scheduling containers to run on different hosts
  • Facilitating the communication between the containers
  • Providing and controlling access to/from the outside world
  • Tracking and optimizing the resource usage
• Similar solutions
  • Docker Swarm, Mesos Marathon, Amazon ECS, HashiCorp Nomad

K8s architecture overview
[Diagram: three master nodes, each running a scheduler, a controller, an api-server and a key-value store, manage the worker nodes Node1 ... NodeK. Every worker node runs kubelet and kube-proxy and hosts pods (Pod1 ... PodN).]

Master nodes
• Responsible for managing the cluster
• Typically more than one is installed
• In HA mode one master node is the leader
• Can be reached via CLI (kubectl), APIs, or Dashboard
Components of each master node:
• Scheduler: schedules the work on different nodes
• Controller: takes care of control loops and the desired state
• api-server: performs administrative tasks and stores the cluster state
• Key-value store: etcd is used; it can be part of the master or installed externally

(Worker) nodes
• Initially called minions
• Container runtime
  • containerd, rkt, lxd
• Kubelet
  • Communicates with the master
  • Uses CRI shims
• kube-proxy
  • Network proxy
[Diagram: a node running kube-proxy, kubelet and a container runtime, hosting Pod 1 and Pod 2.]

Pods (1)
• Smallest unit of scheduling
• Contains one or more containers
• Containers share the pod environment
• Scheduled on nodes
• Created via manifest files
[Diagram: a pod with a main container and supporting containers that share one environment (net, mount, ...).]
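The last bullet mentions manifest files. As a minimal sketch, a pod manifest can be built as a plain Python dictionary and written out as JSON, which kubectl accepts alongside YAML. The pod name, labels and container images below are hypothetical and not taken from the slides.

```python
import json


def build_pod_manifest(name, labels, containers):
    """Return a Kubernetes Pod manifest as a plain dictionary.

    `containers` is a list of (container_name, image) tuples; the first
    one plays the role of the main container, the rest are supporting ones.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "containers": [
                {"name": c_name, "image": image} for c_name, image in containers
            ]
        },
    }


# Hypothetical pod: one main container plus one supporting container.
manifest = build_pod_manifest(
    name="demo-pod",
    labels={"app": "myapp", "version": "v01"},
    containers=[("main", "nginx:1.17"), ("log-shipper", "busybox:1.31")],
)

# Serialize to JSON; the file could then be applied with `kubectl apply -f`.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Both containers land in the same pod, so they are scheduled together and share its environment.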

Pods (2)
• Each pod has a unique IP address
• Inter-pod communication is via a pod network
• Intra-pod communication is via localhost and port
[Diagram: Pod 1 (10.10.20.20) and Pod 2 (10.10.20.21) connected through the pod network; containers inside a pod talk over localhost.]
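Intra-pod communication over localhost can be imitated on a single machine. In this sketch, which is only an analogy, two threads stand in for two containers of the same pod: because they share one network namespace, the second reaches the first through 127.0.0.1 and a port. The message text and the upper-casing "service" are invented for illustration.

```python
import socket
import threading


def serve_once(server_sock):
    """Accept one connection and send the request back in upper case."""
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


# "Main container": listens on localhost. Port 0 lets the OS pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
worker = threading.Thread(target=serve_once, args=(server,))
worker.start()

# "Supporting container": reaches the main container via localhost and the port.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello from the sidecar")
    reply = client.recv(1024)

worker.join()
server.close()
print(reply.decode())
```

Containers in different pods cannot use localhost this way; they have to address each other by pod IP over the pod network.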

Replication controllers
• Higher-level workload
• Looks after a pod or a set of pods
• Scales pods up/down
• Sets the desired state
[Diagram: a replication controller wrapping a pod.]
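The desired-state idea can be sketched as a single reconciliation pass: compare the number of running pods with the desired replica count and create or delete pods until they match. This is a toy model under simplifying assumptions, not how Kubernetes is implemented; the function name and pod names are hypothetical.

```python
import itertools


def reconcile(desired_replicas, running_pods, name_counter):
    """One pass of a control loop: move the current state toward the desired state.

    Returns the new list of pod names after creating or deleting pods.
    """
    pods = list(running_pods)
    while len(pods) < desired_replicas:   # too few pods: create more
        pods.append(f"pod-{next(name_counter)}")
    while len(pods) > desired_replicas:   # too many pods: delete the newest
        pods.pop()
    return pods


counter = itertools.count(1)

initial = reconcile(3, [], counter)              # desired state: 3 replicas
print(initial)                                   # ['pod-1', 'pod-2', 'pod-3']

survivors = [p for p in initial if p != "pod-2"]  # simulate a pod failure
healed = reconcile(3, survivors, counter)        # the lost pod is replaced
print(healed)                                    # ['pod-1', 'pod-3', 'pod-4']

scaled_down = reconcile(1, healed, counter)      # scale down to 1 replica
print(scaled_down)                               # ['pod-1']
```

The same function handles failure recovery and scaling, which is the point of declaring a desired state instead of issuing individual start and stop commands.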

Deployments
• Even higher-level workload
• Simplifies updates and rollbacks
• Declarative and imperative approach
• Self-documenting
• Suitable for versioning
[Diagram: a Deployment wrapping a ReplicaSet, which wraps a pod.]
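To make "updates and rollbacks" concrete, here is a deliberately simplified model of a revision history. It is an assumption-laden toy, not the Kubernetes API: the class name, its methods and the image tags are hypothetical, and a real Deployment manages ReplicaSets rather than a list of strings.

```python
class ToyDeployment:
    """Toy model of a deployment's revision history (not the Kubernetes API)."""

    def __init__(self, image):
        self.history = [image]  # the first revision is the initial image

    @property
    def current(self):
        """The image of the newest revision."""
        return self.history[-1]

    def update(self, image):
        """Roll out a new image; it becomes the newest revision."""
        if image != self.current:
            self.history.append(image)

    def rollback(self):
        """Return to the previous revision, if there is one."""
        if len(self.history) > 1:
            self.history.pop()


deployment = ToyDeployment("myapp:v01")   # hypothetical image tags
deployment.update("myapp:v02")
after_update = deployment.current
deployment.rollback()
after_rollback = deployment.current
print(after_update, after_rollback)
```

Because every change is recorded as a revision, the history doubles as documentation and can be kept under version control together with the manifest.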

Services (1)
• Provide a reliable network endpoint
  • IP address
  • DNS name
  • Port
• Expose pods to the outside world
  • NodePort (cluster-wide port)
  • LoadBalancer (cloud-based)
• Use an Endpoints object to track pods
[Diagram: a Service (IP = 10.10.10.1, DNS = demo-svc, Port = 32000) with an Endpoints object listing Pod A IP, Pod B IP, ...; Pod A (10.10.20.21) runs on Node 1 and Pod B (10.10.20.22) runs on Node 2.]
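A NodePort service like the one in the diagram could be described by a manifest along the following lines. This is a sketch: the name demo-svc and port 32000 come from the slide, while the selector label, the service port 80 and the container port 8080 are assumptions made for illustration.

```python
import json

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "demo-svc"},      # also becomes the DNS name
    "spec": {
        "type": "NodePort",                # opens the same port on every node
        "selector": {"app": "myapp"},      # assumed pod label
        "ports": [
            {
                "port": 80,                # assumed in-cluster service port
                "targetPort": 8080,        # assumed container port
                "nodePort": 32000,         # cluster-wide port from the slide
            }
        ],
    },
}

service_json = json.dumps(service, indent=2)
print(service_json)
```

The value 32000 falls inside the default NodePort range of 30000 to 32767, so no extra cluster configuration would be needed for it.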

Services (2)
• Services use label selectors to do their magic
[Diagram sequence in four steps:
1. The Service selects version=v01, app=myapp; two pods labeled version=v01, app=myapp match.
2. Two pods labeled version=v02, app=myapp are added; the Service still selects version=v01.
3. The Service selector changes to version=v02, app=myapp and now matches the two v02 pods.
4. The v01 pods are removed; only the two v02 pods remain.]
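The selector matching behind this sequence is simple enough to sketch: a pod is selected when its labels contain every key/value pair of the selector. The function and pod names below are hypothetical; the labels mirror the ones on the slides.

```python
def select_pods(selector, pods):
    """Return the names of pods whose labels contain every selector pair."""
    return [
        name
        for name, labels in pods.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]


# State in the middle of the sequence: two pods of each version.
pods = {
    "pod-a": {"app": "myapp", "version": "v01"},
    "pod-b": {"app": "myapp", "version": "v02"},
    "pod-c": {"app": "myapp", "version": "v02"},
    "pod-d": {"app": "myapp", "version": "v01"},
}

before = select_pods({"app": "myapp", "version": "v01"}, pods)
after = select_pods({"app": "myapp", "version": "v02"}, pods)
print(before)  # ['pod-a', 'pod-d']
print(after)   # ['pod-b', 'pod-c']
```

Changing only the selector moves traffic from the v01 pods to the v02 pods without touching the pods themselves, which is why the old version can be removed afterwards.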

SQL Server 2019 big data cluster (BDC) components

Base node configuration
Applies to nodes across all planes. Services:
• kubelet – K8s local agent
• kube-proxy – network config and forwarding
• supervisord – process monitor and control
• fluentd – node logging
• flanneld – software-defined network
• collectd – OS and application data collection
• SQL Big Data watchdog – config sync, watchdog, data collector (DMVs, etc.)
[Diagram: a Kubernetes node running watchdog, kubelet, kube-proxy, supervisord, fluentd, flanneld and collectd.]

Control plane
External endpoints:
• Kubernetes (REST)
• Aris Control Service (REST)
• Knox Gateway (REST gateway for Hadoop APIs)
• SQL Server Master (TDS gateway for data marts and SQL Master Service)
Services:
• etcd
• Kubernetes Master Services Controller
• SQL Master instance
• SQL Big Data Admin Portal
• Knox Gateway
• HDFS Name Service
• YARN Master
• Hive Metastore
• InfluxDB (metrics store)
• Livy (REST interface for Spark)
• Spark Driver
[Diagram: the control plane spread over three Kubernetes nodes. Node 1: base node services + etcd, K8s Master service, Spark driver, SQL Big Data Admin portal, InfluxDB, Grafana. Node 2: Controller, Proxy, SQL Master, HDFS Name Node, Kibana. Node 3: Livy, Knox, Elasticsearch, Hive Metastore, YARN Master.]

Controller
• External REST/HTTPS endpoint
• Bootstrap and build-out
• Manage capacity
• Configure high availability and recover from failure (AGs)
• Security (authN, authZ, certificate rotation)
• Lifecycle (upgrade/downgrade/rollback)
• Configuration management
• Monitoring – capacity, health, metrics, logs
• Troubleshooting – performance, failures
• Cluster Admin Portal
[Diagram: the controller service, covering build-out, upgrade/rollback, add/remove capacity, central AuthZ/AuthN, Cluster Admin Portal and troubleshooting, backed by the controller metadata store.]

SQL master instance
• TDS endpoint into the cluster
• High-value data
• OLTP server
• Data connectors
• Machine learning & extensibility
• Scalable query engine
[Diagram: the master instance availability group with one primary and two readable secondaries.]

Compute plane
• Hosts one or more SQL compute pools
• A compute pool is a group of instances that forms a data, security, and resource boundary
• A compute pool processes complex distributed queries against the data plane
• Local storage is used for shuffling data if necessary
[Diagram: four compute pool nodes, each running the base node services and a SQL engine.]

Data plane
Storage pool:
• Data ingestion through Spark (batch and streaming)
• Data storage in HDFS
• Data access through HDFS and SQL endpoints; the SQL engine reads files in HDFS directly
Data pool:
• Partitioned, in-memory cache for external data
• Scale-out data storage for append-only data sets
• Data ingestion through Spark
• Provides persistent SQL Server storage for the cluster
[Diagram: two storage pool nodes, each running base node services, SQL engine, HDFS and Spark, and one data pool node running base node services and SQL engine.]

Installation, configurations and tools
Installation methods:
• Cloud – a platform such as Azure Kubernetes Service (AKS)
• On-premises – VMs, bare metal
• Localhost – using minikube (to be used only for training and testing)
Configurations:
• All-in-one single node and different multi-node options
Tools:
• mssqlctl, kubectl, Azure Data Studio, SQL Server 2019 extension, Azure CLI (for AKS), mssql-cli, sqlcmd, curl
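Of the tools listed, kubectl is the one most often scripted. The sketch below builds a `kubectl get pods` command line and summarizes its JSON output; it runs without a cluster because it parses a made-up, heavily trimmed sample. The namespace and pod names are hypothetical, not values from the slides.

```python
import json


def kubectl_get_pods_command(namespace):
    """Build the kubectl command line that lists pods in a namespace as JSON."""
    return ["kubectl", "get", "pods", "--namespace", namespace, "--output", "json"]


def pod_phases(kubectl_json):
    """Map pod name -> phase from the JSON printed by `kubectl get pods -o json`."""
    doc = json.loads(kubectl_json)
    return {item["metadata"]["name"]: item["status"]["phase"] for item in doc["items"]}


# Hypothetical sample output; a real cluster returns many more fields per pod.
sample_output = json.dumps(
    {
        "kind": "List",
        "items": [
            {"metadata": {"name": "master-0"}, "status": {"phase": "Running"}},
            {"metadata": {"name": "compute-0-0"}, "status": {"phase": "Pending"}},
        ],
    }
)

command = kubectl_get_pods_command("mssql-cluster")  # hypothetical namespace
phases = pod_phases(sample_output)
print(" ".join(command))
print(phases)
```

Against a live cluster, the command list could be passed to `subprocess.run(command, capture_output=True, text=True)` and its `stdout` fed to `pod_phases`.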
