Data Ingest Self Service and Management using Nifi and Kafka

Data Ingest Self-Service and Management
using NiFi and Kafka
Imran Amjad, Principal Engineer
Dave Torok, Principal Architect
June 14, 2017

XFINITY TV
XFINITY Internet
XFINITY Voice
XFINITY Home
Digital & OtherOther
*Minority interest and/or non-controlling interest.
Slide is not comprehensive of all Comcast NBCUniversal assets
Updated: December 22, 2015

Introduction and Background
• Customer Experience UI with 30,000 unique internal users per month
• Ingesting about 2 Billion Events / Month
• Typical “Big Data Analytics” Pipeline
• Data ETL, land in a data lake (e.g. HBase)
• API / several channels of consumers / 14 million requests per day
• Grew from a few dozen to 150+ data sources / feeds in about a year
• Pipeline of 5-10 new data feeds per two week sprint
Data Ingestion Self-Service and Management using NiFi and Kafka3

High Level Architecture
Streaming Compute Pipeline
UI
and
Other
Consumers
HTTP
Gateway
BATCH
Data
Sources
Analytics DB
Event Storage
DB
Rules
Enrich
Standardize
Detect
Aggregate
Real
Time
Data
Sources
“Pull”
Kafka
Bridge
(NiFi)
Store
REST API
Filestore
Apache
Flink
Apache
NiFi
Kafka
(Event
Bus)
NiFi
Kafka (Event Bus)

UI
and
Other
Consumers
HTTP
Gateway
BATCH
Data
Sources
Analytics DB
Event Storage
DB
Rules
Enrich
Standardize
Detect
Aggregate
Real
Time
Data
Sources
“Pull”
Kafka
Bridge
(NiFi)
Store
REST API
Filestore
Apache
Flink
Apache
NiFi
Kafka
(Event
Bus)
NiFi
Kafka (Event Bus)
High Level Architecture
Data Ingestion (“ETL”)
Analytics &
Active
Decisioning
UI / Services

Problem Statement and Motivation for Self Service
Time-to-market
VP: “I'd like to be able to add a new
event stream in 10 minutes.”
Manual Processes
Code Deployment

Dimensions of Ingest Variability
Transport Protocol
Kafka
Kinesis
HTTP/S
Files
(S)FTP
Format
JSON
XML
AVRO
CSV / Delimited
Custom
Timing
[Near] Real-Time
Streaming
Batch / Periodic
Ingest Control
Pull from Source
Push by Producer

Data Source Onboarding – Before Self-Service

Data Source Onboarding – Before Self-Service
Manual
Process
Code

Self-Service Architecture Principles
Metadata
Driven
Data Ingestion,
Processing,
and Rendering
Driven by
Metadata
Automation
Orchestrated
Deployment for
New Data
Feeds
Rapid
Onboarding
Portal for Data
Source
Management
Light Data
Governance
Schema-
backed Data,
Schema
Registry
Monitoring and
Metrics
Ingestion, Data
Quality, and
Operational
Status

UI
and
Other
Consumers
HTTP
Gateway
BATCH
Data
Sources
Analytics DB
Event Storage
DB
Rules
Enrich
Standardize
Detect
Aggregate
Real
Time
Data
Sources
“Pull”
Kafka
Bridge
(NiFi)
Store
REST API
Self-Serve
Metadata +
Content
Management
DB
Self
Service
API
Filestore
Apache
Flink
Apache
NiFi
Kafka
(Event
Bus)
NiFi
Self-Service
UI
Kafka (Event Bus)
High Level Architecture with Self-Service
Analytics &
Active
Decisioning
UI / Services
Self-Service

UI
and
Other
Consumers
HTTP
Gateway
BATCH
Data
Sources
Analytics DB
Event Storage
DB
Rules
Enrich
Standardize
Detect
Aggregate
Real
Time
Data
Sources
“Pull”
Kafka
Bridge
(NiFi)
Store
REST API
Self-Serve
Metadata +
Content
Management
DB
Self
Service
API
Filestore
Apache
Flink
Apache
NiFi
Kafka
(Event
Bus)
NiFi
Self-Service
UI
Kafka (Event Bus)
Analytics &
Active
Decisioning
UI / Services
Self-Service
High Level Architecture with Self-Service

Metadata and Data Governance

Models and Metadata to enable Self-Service
Self Service Metadata Management
CORE METADATA
Data Model and Data Dictionary
INGEST
And
ETL
Metadata
PROCESSING Metadata
Lookups, Enrichment,
Aggregation, Expressions
UI / RENDERING
METADATA
BUSINESS CONTENT
Enrichment and Notification
Templates and Lookups

Metadata Management – Example “Tags” on individual data fields
Display Field (true)
Category (“Product”)
Icon File (“x1_tv.svg”)
Icon Color (“green”)
UI Rendering
Sensitivity (none / encrypt)
Encryption KeyId (“SomeKeyId”)
Security
ShortName (“x1view”)
Description (“X1 View Program”)
Format (“json”)
Data Source
Information
Ingest Handling (Ingest /Drop)
Field Name (“viewcode”)
Field JsonPath
(“$.EVT.VALUE.CODE”)
Source Field
Information
Target Domain Object (“error”)
Field JsonPath
(“$.data.error.viewCode)
Target Field
Information

Why Data Governance?
Validate Types against Schema
Detect Structure Changes
Backwards/Forwards Compatibility
Universally Required Information
Data Quality Data Rationalization
Standard Syntax and Semantics
Domains (“Customer”, “Device”)
Standard Field Names and Types
Message Data Format
Support Integration / Correlation
Data Curation
Metadata Management
Data Source Registry
Schema Repository
Support Lineage Traceability
Additional Properties (e.g. Security)

Lightweight Data Governance
Why JSON Schema over Avro for our use case?
• Data sources / producers generally aren’t using Avro
• Database Storage, UI, REST API is not Avro
• Tolerate data change: Detect, Accept, Notify, Correct (Later)
• Don’t “drop data on the floor” – AVRO ignores unknown fields
• [Some] AWS Services – JSON friendly but not Avro friendly
AVRO Schema
vs.
JSON Schema
JSON + JSON Schema AVRO + AVRO Schema
Validation Validation
JSON Serialization Framework
Optimized for Data Storage
Upfront Data Governance

Versioned Data and Schema
• Allow ingestion of multiple versions (particularly from different sources)
Data Quality
• Invalid Data Types
• Non-parseable Payloads
• Missing Required Fields
Data Change Tolerance
• Detect Additional Data / “Unknown Fields”
• Store alongside schema-modeled data
• Allow for display in UI only after updating Schema
Quality Feedback loop to data producers

Reduced schema review and approval process
Generate JSON Schema Artifacts from Field Metadata
Lightweight Metadata Repository
• Considering Apache Atlas and Hortonworks Schema Registry
Pre-defined Core Schema / Domain Objects
• 5-15 (or so) domain object schema to be included in an event schema
• E.g. Customer, Device, Geolocation

Portal for Data Source Onboarding

Data Source Onboarding UI

Platform Automation for Kafka and NiFi

Data Source Onboarding – With Self-Service
Create Schema
and Metadata
Register the
Schema to
generate the
artifacts
Test & Validate
with Sample
data
Publish & Start
Ingesting Live
Data
Data
Producer

Generating NiFi Flows from Metadata

Finding Data with JsonPath (Example from http://petstore.swagger.io/ )
{
"id": 0,
"category": {
"id": 0,
"name": “dogs" },
"name": “Fido",
"photoUrls": [ “http://myimage.com" ],
"tags": [
{ "id": 0, "name": “friendly" },
{“id”:1, “name” : “housebroken”} ],
"status": "available"
}
$.id = 0
$.category.name = “dogs”
$.tags[?(@.id == 0)] =
{id = 0, “name =“friendly”}

Example Source Data Payload
{
"PV": "1.1",
"APP": {
"APP_NAME": "XRE",
"APP_VER": "X1"
},
"DEV": {
"DEVICE_TYPE": "Xi3",
},
"ACNT": {
"BILL_ID": " 1234321234 "
},
"LOC": {
"UTC_OFF": "-05"
},
"EVT": {
"ETS": 1487880967000,
"NAME": "program",
"VALUE": {
"TYPE": "XRE",
"CODE": "XRE-12345",
"DESCRIPTION": "Customer started a
program"
}
},
"PTS": 1468876861829
}
$.ACNT.BILL_ID
$.LOC.UTC_OFF
$.EVT.ETS

Example Target Schema Payload
{
"schema" : "comcast/xreinfo/jsonschema/1-0-0",
"billingId":"1234321234",
"eventSource":"xre",
"eventType":"xreinfo",
"timestamp":1487880967000,
"data" : {
"device" : {
"deviceType" : "Xi3"
},
"customerAccount" : {
"billingId" : "1234321234",
"accountType" : "residential"
},
“timeInformation" : {
“utcOffset" : “-5"
},
"info" : {
"infoCode" : "XRE-12345"
}
},
"unknownData" : {
"EVT" : {
“INFO": {
"ABXXY" : “MORE DATA"
}
}
}
}

Simple Jsonpath Transformation
FROM $.ACNT.BILL_ID
TO $.billingId
TO $.data.customerAccount.billingId)
FROM $.EVT.ETS
TO $.data.timestamp
FROM $.LOC.UTC_OFF
TO $.data.timeInformation.utcOffset
{
"schema" : "comcast/xreinfo/jsonschema/1-0-0",
"billingId":"1234321234",
"eventSource":"xre",
"eventType":"xreinfo",
"timestamp":1487880967000,
"data" : {
"device" : {
"deviceType" : “Xi3"
},
"customerAccount" : {
"billingId" : "1234321234"
},
“timeInformation" : {
“utcOffset" : “-5"
},
"info" : {
"infoCode" : "XRE-12345"
}
},

Advanced JSON Transformation using JOLT
Rich DSL for transforming and manipulating JSON
• Transform Structure (“Shift”)
• Defaults
• Sort
• Remove
• Expressions
https://github.com/bazaarvoice/jolt
Apache 2.0 License
NiFi “JoltTransformJSON” Processor

JOLT Transformation for X1 Event

Advanced Transform Examples (“JOLT” Processor)
INPUT: {"ChangedFields":["JobStateCd","DispatcherStatusCd"]}
OUTPUT:{“ChangedFields_0”:“JobStateCd”,
“ChangedFields_1”:“DispatcherStatusCd”,
“ChangedReason”, “JobStateCd|DispatcherStatusCd||||” }

"spec": { "ChangedFields_0": "=elementAt(@(1,ChangedField),0)",
"ChangedFields_1": "=elementAt(@(1,ChangedField),1)",
"ChangedFields_5": "=elementAt(@(1,ChangedField),5)" } },
"spec": { "ChangedReason":
"=concat(@(1,ChangedFields_0),'|',@(1,ChangedFields_1),'|',@(1,ChangedFields
_2),'|',@(1,ChangedFields_3),'|',@(1,ChangedFields_4),'|',@(1,ChangedFields_
5))" }

"spec": { "ChangedFields_0": "=elementAt(@(1,ChangedField),0)",
"ChangedFields_5": "=elementAt(@(1,ChangedField),5)" } },
"spec": { "ChangedReason":
"=concat(@(1,ChangedFields_0),'|',@(1,ChangedFields_1),'|',@(1,ChangedFields
_2),'|',@(1,ChangedFields_3),'|',@(1,ChangedFields_4),'|',@(1,ChangedFields_
5))" }

“Devices” : {
“array” : [
{“fulfillmentID”: “12345”,
“Device”: {“schema.core.Device": {
“model” : { “string”: “AX013ANM” }}}},
{“fulfillmentID”: “23456”,
“Device”: {“schema": {
“model” : { “string” :“PXD01ANI” }}}}]}
"Devices": {
"array": {
"*": { "fulfillmentID": { "string": "events[0].data.fid_&2"},
"device": { “schema.core.Device": { “model": { "string": "events[0].data.model_&4"
“events” : [ {
“data” : {
“fid_0" : “12345",
"model_0" : "AX013ANM ",
"fid_1" : “23456",
“model_1" : "PXD01ANI” } } ]

NiFi REST API – Create a Process Group
POST http://localhost:8080/nifi-api/process-groups/root/process-groups
{
"revision": {
"version" : 0
},
"component": {
"name" : "x1info Process Group"
}
}

NiFi REST API – Create a ConsumeKafka Processor
POST http://localhost:8080/nifi-api/process-groups/{ID}/processors
{
"revision": {
"version": 0
},
"component": {
"config": {
"properties": {
"bootstrap.servers": "localhost:9092",
"topic": "raw.mystream.x1info",
"group.id": "nifi-stage-0522",
"auto.offset.reset": "latest“
}},
"name": "ConsumeKafka - x1info",
"type": "org.apache.nifi.processors.kafka.pubsub.ConsumeKafka“
}}

Generated NiFi Flow

Monitoring and Metrics
• Dashboard of Data Quality
• Ingestion Rate Monitoring (and possibly alerting with anomaly detection)
• Alerting to producers (Data Quality)

Ingestion Status Dashboard
Data Center 1 Data Center 2
Event_type1
Event_type2
MyEvent
WeekendEvent

Anomaly Detection

Self-Service Lessons Learned
Design with automation and configuration in mind
Metadata-driven design reduces code deployments and custom solutions
Make Simple things Simple – But allow hard things to be “code” and not UI-driven
Let Data Producers be accountable for Data Quality
NiFi and JOLT = Powerful Toolkit

Data Ingest Self Service and Management using Nifi and Kafka

Data Ingest Self Service and Management using Nifi and Kafka

More Related Content

What's hot

What's hot (20)

Similar to Data Ingest Self Service and Management using Nifi and Kafka

Similar to Data Ingest Self Service and Management using Nifi and Kafka (20)

More from DataWorks Summit

More from DataWorks Summit (20)

Recently uploaded

Recently uploaded (20)

Data Ingest Self Service and Management using Nifi and Kafka