Incorta Spark Integration
Dylan Wan
Solution Architect at Incorta
Agenda
• Spark Overview
• Incorta and Spark
• Installation and Configuration
• Create your first MV in Incorta
• Demo
Spark Overview
Why Spark?
• General-purpose framework for parallel processing on a cluster
• Functional programming available in
• Scala
• Python
• Java
• Spark SQL and Dataframe
• Using the same framework for
• Data Streaming
• Machine Learning
• Graph processing
[Diagram: Spark stack. Spark SQL, Spark Streaming, Machine Learning, and Graph Processing on top of Spark Core; cluster managers: Standalone Scheduler, YARN, Mesos]
[Diagram: Incorta Server and Spark Server]
Spark Execution Flow
[Diagram: Driver Program (Spark Context), Spark Master (Cluster Manager), and two Worker Nodes, each with an Executor, a Cache, and Tasks]
http://spark.apache.org/docs/latest/cluster-overview.html
Spark Concepts
• Spark Context: like a JDBC connection that holds a session to a database, it is the application's connection to the Spark cluster
• The Master and each Worker run in their own JVM process and have their own listener port
• The Master and each Worker have their own Web UI for displaying progress
• Application code is sent to the cluster and assigned to executors
• Executors read, write, and process the data
• Memory can be controlled at the worker level and is allocated to individual executors
Spark Dataframe
• Organizes data into named columns, like a database table
• A DataFrame can be created from a Parquet file
• A DataFrame can be written to and stored as a Parquet file
• A DataFrame can be processed via the DataFrame API
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame
• A DataFrame can be registered as a table and processed with SQL
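The DataFrame operations listed above can be sketched in PySpark roughly as follows. This is a minimal sketch against the Spark 1.6 API, assuming an SQLContext already exists (the bin/pyspark shell provides one named sqlContext). The function name, the paths, and the amount and region columns are hypothetical.

```python
def dataframe_tour(sql_context, in_path, out_path):
    """Walk through the DataFrame operations on one Parquet data set.

    sql_context is a pyspark.sql.SQLContext (sqlContext in the pyspark shell).
    Paths and column names are hypothetical placeholders.
    """
    # Create a DataFrame from a Parquet file.
    df = sql_context.read.parquet(in_path)

    # Process it with the DataFrame API (filter accepts a SQL expression string).
    positive = df.filter("amount > 0")

    # Register it as a table and process it with SQL (Spark 1.6 API).
    positive.registerTempTable("sales")
    by_region = sql_context.sql(
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )

    # Write the result back out as a Parquet file.
    by_region.write.parquet(out_path)
    return by_region
```

In the pyspark shell this could be called as dataframe_tour(sqlContext, "/tmp/sales.parquet", "/tmp/sales_by_region.parquet"), with both paths replaced by real locations.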
Incorta and Spark
Incorta Data Load
[Diagram: data sources (web service, *.csv, *.xlsx, SQL DB); load stages: Extract, Load, Backup, Restart; Formula Column, Reference, etc.]
Materialized View in Incorta
• An object created within Incorta, not loaded from a data source
• Based on other tables already loaded in Incorta
• Once an MV is loaded, it works like any other table
• It can join from and to another table or MV
• Formula columns can be created in an MV
• Aliases can be created against an MV
• An MV can be defined via Spark Python or Spark SQL
• The Spark Python or Spark SQL code is executed as part of the regular Loader jobs
Incorta Data Load
[Diagram: data sources (web service, *.csv, *.xlsx, SQL DB); load stages: Extract, Load, Backup, Restart; Formula Column, Reference, etc.; plus Read and Save]
Installation and Configuration
Spark Installation
• Download Spark
• http://spark.apache.org/downloads.html
• Select the package type "Pre-built for Hadoop XX"
• Select Spark 1.6.2 for Incorta Release 2.8 or earlier
• Download the tarball, or copy the download link from the browser
• Run wget <download URL> on the server machine
• Extract the tarball
• tar -xzvf spark-1.6.2-bin-hadoop2.6.tgz
• Spark is ready to use! Try this:
• bin/pyspark
• exit()
Spark Configurations
• Edit spark-env.sh in <Spark Home>/conf
• Change the Web UI ports if there is a conflict (optional)
• If the default ports are not open or not reachable from the browser, you can specify
• SPARK_MASTER_WEBUI_PORT
• SPARK_WORKER_WEBUI_PORT
• Specify the external IP for monitoring Spark jobs (optional)
• If the server machine is behind a firewall and its external and internal IPs are different
• Set SPARK_PUBLIC_DNS to the external IP
• Limit the memory used by Spark jobs (optional)
• SPARK_WORKER_MEMORY
• Controls the total memory available to the worker, not the amount assigned to each executor
Spark Configurations
• Enable Logging – Useful for investigating issues (recommended)
• Create a directory for holding the log files
• cd <spark home>
• mkdir eventlogs
• Edit the spark-defaults.conf in <spark home>/conf
• spark.eventLog.enabled true
• spark.eventLog.dir <spark home>/eventlogs
• Enable History Server (recommended)
• Edit <spark home>/conf/spark-env.sh
• SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=<spark home>/eventlogs"
• Start the history server
• ./sbin/start-history-server.sh
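The event log directory appears in two files, and the history server only finds the logs if both entries point at the same place. As a rough sketch, the two sets of entries can be generated from one value. The function name is hypothetical, and the install path is the one used for spark.home on the Incorta Configuration slide.

```python
def event_log_settings(spark_home):
    """Return the logging entries for spark-defaults.conf and spark-env.sh.

    Both files are derived from the same directory so that the history
    server reads the event logs from where the jobs write them.
    """
    log_dir = spark_home.rstrip("/") + "/eventlogs"
    defaults_conf = [
        "spark.eventLog.enabled true",
        "spark.eventLog.dir " + log_dir,
    ]
    env_sh = [
        'SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=' + log_dir + '"',
    ]
    return defaults_conf, env_sh


if __name__ == "__main__":
    # Install location taken from the spark.home example in this deck.
    conf_lines, env_lines = event_log_settings(
        "/home/incorta/spark-1.6.2-bin-hadoop2.6"
    )
    print("\n".join(conf_lines))
    print("\n".join(env_lines))
```

The printed lines can then be pasted into <spark home>/conf/spark-defaults.conf and <spark home>/conf/spark-env.sh respectively.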
Spark Configuration – Hive metastore DB
• Hive metadata is stored in the Hive metastore
• The Hive metastore requires a database
• Create hive-site.xml in <spark home>/conf
• Edit <Spark Home>/conf/spark-env.sh
• SPARK_HIVE=true
• SPARK_SUBMIT_CLASSPATH
• SPARK_CLASSPATH
• Make sure the JDBC driver is available to Spark
hive-site.xml for MySQL
[incorta@clorox2-poc spark-1.6.2-bin-hadoop2.6]$ cat ~/spark-1.6.2-bin-hadoop2.6/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3307/mydb?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mysql_root</value>
<description>password for connecting to mysql server</description>
</property>
</configuration>
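Before starting Spark, it can help to confirm that hive-site.xml is well-formed and contains the four connection properties shown above. The following is a minimal sketch using only the Python standard library. The function names are hypothetical and the check is not part of Spark or Incorta.

```python
import xml.etree.ElementTree as ET

# The four metastore connection properties used in the file above.
REQUIRED_KEYS = [
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
]


def read_hive_site(xml_data):
    """Return the name/value pairs of a hive-site.xml document as a dict."""
    root = ET.fromstring(xml_data)
    settings = {}
    for prop in root.findall("property"):
        settings[prop.findtext("name")] = prop.findtext("value")
    return settings


def missing_metastore_keys(xml_data):
    """List the required connection properties that are absent or empty."""
    settings = read_hive_site(xml_data)
    return [key for key in REQUIRED_KEYS if not settings.get(key)]
```

For example, missing_metastore_keys(open("conf/hive-site.xml", "rb").read()) returns an empty list when all four properties are present.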
Starting Spark Master and Worker
• Go to <Spark Home> and start the Spark master process
• sbin/start-master.sh
• Check the log file to get the master Web UI URL
• Open the Web UI in a browser
• Start the Spark slave (worker) process
• sbin/start-slave.sh spark://<spark master host>:7077
• Check the log file to ensure that the worker started properly
• Refresh the browser page to check the worker processes
• Start the history server (optional, but recommended)
• sbin/start-history-server.sh
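Picking the master URL out of the log file by eye is error-prone, so a small helper can do it. This is a sketch that assumes only that the log contains the URL in the form spark://<host>:<port>; it does not rely on the exact wording of the log line. The function name is hypothetical.

```python
import re

# Matches a standalone master URL such as spark://some-host:7077.
MASTER_URL_PATTERN = re.compile(r"spark://[A-Za-z0-9._-]+:\d+")


def find_master_url(log_text):
    """Return the first spark://host:port URL found in the log text, or None."""
    match = MASTER_URL_PATTERN.search(log_text)
    return match.group(0) if match else None
```

The returned value is what goes after sbin/start-slave.sh when starting a worker.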
Spark Processes
Incorta Configuration
• Edit <Incorta Home>/incorta/server.properties
• spark.home=/home/incorta/spark-1.6.2-bin-hadoop2.6
• spark.master.url=spark://clorox2-poc:7077
• Make sure that spark.master.url is set to the URL written to the log file when you launch the Spark master.
• You can also find it in the Spark Master Web UI
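A mismatch between spark.master.url and the URL the master actually reports is an easy mistake to make. The sketch below checks the two Spark entries in server.properties. It assumes simple key=value lines, and the function names and messages are hypothetical rather than part of Incorta.

```python
import re


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_spark_settings(text, expected_master_url):
    """Return a list of problems found in the Spark-related properties."""
    props = parse_properties(text)
    problems = []
    if not props.get("spark.home"):
        problems.append("spark.home is not set")
    url = props.get("spark.master.url", "")
    if not re.match(r"^spark://[^:/]+:\d+$", url):
        problems.append("spark.master.url is not of the form spark://<host>:<port>")
    elif url != expected_master_url:
        problems.append("spark.master.url differs from the URL reported by the Spark master")
    return problems
```

An empty list means both properties are present and the URL matches the one copied from the master log or Web UI.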
Monitoring
• Spark Master WebUI
• Check whether the job was submitted to the Spark master
• Check whether the worker has allocated resources to execute the job
• Check the DAG to optimize performance
• Incorta Log
• Use tail -f <incorta home>/server/logs/incorta/<tenant>/incorta-…out
• Look for runtime errors
Create your first MV in Incorta
Understand read() and save()
• read("schema.table"): gets the data from Incorta as a DataFrame
• save(dataframe): stores the data from the DataFrame as an MV
• These are Incorta functions; internally they call
• sqlContext.read.parquet
• df.write.mode("overwrite").parquet
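A first MV built from read() and save() might look roughly like the sketch below. It is written as a function that receives read and save as parameters so that it can be tried outside Incorta; inside an MV definition the two functions are supplied by Incorta and the body would be written directly. The schema, table, and column names are hypothetical.

```python
def first_mv(read, save):
    """Body of a simple MV: read a table, aggregate it, save the result.

    read and save stand for the Incorta functions of the same name.
    SALES.ORDERS, STATUS, CUSTOMER_ID, and AMOUNT are hypothetical names.
    """
    # Get the data of an already loaded Incorta table as a DataFrame.
    orders = read("SALES.ORDERS")

    # Transform it with the Spark DataFrame API.
    open_orders = orders.filter("STATUS = 'OPEN'")
    totals = open_orders.groupBy("CUSTOMER_ID").sum("AMOUNT")

    # Store the resulting DataFrame as the MV.
    save(totals)
    return totals
```

Because save() is the last step, everything the MV should contain has to be in the DataFrame passed to it.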
Demo
Audits Of Complaints Against the PPD Report_2022.pdf
 

Incorta spark integration

  • 2. Incorta Spark Integration Dylan Wan Solution Architect at Incorta
  • 3. Agenda • Spark Overview • Incorta and Spark • Installation and Configuration • Create your first MV in Incorta • Demo
  • 5. Why Spark? • General-purpose framework for parallel processing in a cluster • Functional programming available in • Scala • Python • Java • Spark SQL and DataFrame • Using the same framework for • Data Streaming • Machine Learning • Graph processing • Diagram labels: Spark Core; Spark SQL, Spark Streaming, Machine Learning, Graph Processing; Standalone Scheduler, YARN, Mesos
  • 6. Spark Execution Flow • Diagram labels: Incorta Server, Spark Server; Driver Program with Spark Context; Spark Master (Cluster Manager); two Worker Nodes, each with an Executor, a Cache, and Tasks • http://spark.apache.org/docs/latest/cluster-overview.html
  • 7. Spark Concepts • Spark Context – like a JDBC connection that holds a session to a database; it is the connection to the Spark cluster • The Master and each Worker run in their own JVM process and have their own listener port • The Master and each Worker have their own Web UI for displaying progress • Application code is sent and assigned to executors • Executors read, write, and process the data • Memory can be controlled at the worker level and is allocated to individual executors
  • 8. Spark DataFrame • Organizes data into named columns, like a database table • A dataframe can be created from a parquet file • A dataframe can be written to and stored as a parquet file • A dataframe can be processed via the DataFrame API https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame • A dataframe can be registered as a table and processed by SQL
  • 10. Incorta Data Load • Diagram labels: web service, *.csv, *.xlsx, SQL DB; Extract, Load, Backup, Restart; Formula Column, Reference, etc.
  • 11. Materialized View in Incorta • An object created within Incorta, not loaded from data sources • Created from other tables already loaded in Incorta • Once an MV is loaded, it works like other tables. • It can be joined from and to another table or MV • Formula columns can be created in an MV • Aliases can be created against an MV • An MV can be defined via Spark Python or Spark SQL • The Spark Python or Spark SQL is executed as part of the regular Loader jobs
  • 12. Incorta Data Load • Diagram labels: web service, *.csv, *.xlsx, SQL DB; Extract, Load, Backup, Restart; Formula Column, Reference, etc.; Read, Save
  • 14. Spark Installation • Download Spark • http://spark.apache.org/downloads.html • Select the package type "Prebuilt for Hadoop XX" • Select Spark 1.6.2 for Incorta Release 2.8 or earlier • Download the tarball, or copy the download link from the browser • Run wget <download URL> on the server machine • Extract the compressed tar file • tar -xzvf spark-1.6.2-bin-hadoop2.6.tgz • Spark is ready to use. Try this: • bin/pyspark • exit()
  • 15. Spark Configurations • Edit spark-env.sh in <Spark Home>/conf • Change the Web UI ports if there is a conflict (optional) • If not all ports are open or reachable from the browser, you can specify • SPARK_MASTER_WEBUI_PORT • SPARK_WORKER_WEBUI_PORT • Specify the external IP for monitoring Spark jobs (optional) • Needed if the server machine sits behind a firewall and its external and internal IPs differ • Set SPARK_PUBLIC_DNS to the external IP • Limit the memory used by Spark jobs (optional) • SPARK_WORKER_MEMORY • Controls the total memory available to the worker, not the amount assigned to individual executors
  • 16. Spark Configurations • Enable Logging – Useful for investigating issues (recommended) • Create a directory for holding the log files • cd <spark home> • mkdir eventlogs • Edit spark-defaults.conf in <spark home>/conf • spark.eventLog.enabled true • spark.eventLog.dir <spark home>/eventlogs • Enable History Server (recommended) • Edit <spark home>/conf/spark-env.sh • SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=<spark home>/eventlogs" • Start the history server • ./sbin/start-history-server.sh
  • 17. Spark Configuration – Hive metastore DB • Hive metadata is stored in the Hive metastore • The Hive metastore requires a database • Create hive-site.xml in <spark home>/conf • Edit <Spark Home>/conf/spark-env.sh • SPARK_HIVE=true • SPARK_SUBMIT_CLASSPATH • SPARK_CLASSPATH • Make sure the JDBC driver is available to Spark
  • 18. hive-site.xml for MySQL
    [incorta@clorox2-poc spark-1.6.2-bin-hadoop2.6]$ cat ~/spark-1.6.2-bin-hadoop2.6/conf/hive-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3307/mydb?createDatabaseIfNotExist=true</value>
        <description>metadata is stored in a MySQL server</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>MySQL JDBC driver class</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>user name for connecting to mysql server</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>mysql_root</value>
        <description>password for connecting to mysql server</description>
      </property>
    </configuration>
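A quick way to catch a typo in hive-site.xml before starting Spark is to parse the file and confirm that the four `javax.jdo.option.*` connection properties from the slide are present. The following is only a minimal sketch using the Python standard library; the helper names `read_hive_site` and `missing_metastore_settings` are hypothetical and are not part of Spark, Hive, or Incorta.

```python
import xml.etree.ElementTree as ET

# The four connection properties that appear in the slide's hive-site.xml.
REQUIRED = (
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
)


def read_hive_site(xml_text):
    """Return {name: value} for every <property> in a hive-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_metastore_settings(xml_text):
    """List the required connection properties that are absent or empty."""
    props = read_hive_site(xml_text)
    return [name for name in REQUIRED if not props.get(name)]
```

An empty list from `missing_metastore_settings` only means the properties exist; it does not prove that the database is reachable or that the JDBC driver is on the classpath.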
  • 19. Starting Spark Master and Worker • Go to <Spark Home> and start the Spark master process • sbin/start-master.sh • Check the log file to get the master Web UI URL • Open the Web UI from a browser • Start the Spark slave (worker) processes • sbin/start-slave.sh spark://<spark master host>:7077 • Check the log file to ensure that the worker started properly • Refresh the browser page to check the worker processes • Start the history server (optional, but recommended) • sbin/start-history-server.sh
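Besides reading the log file, one way to confirm that the master process is accepting connections is to try opening a TCP connection to its port (7077 in the slide's example). This is a minimal sketch with the standard `socket` module; the function name `port_is_open` is hypothetical, and a successful connection shows only that something is listening on that port, not that it is a healthy Spark master.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

For example, `port_is_open("<spark master host>", 7077)` could be run from the Incorta server, with the placeholder replaced by the real host name.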
  • 21. Incorta Configuration • Edit <Incorta Home>/incorta/server.properties • spark.home=/home/incorta/spark-1.6.2-bin-hadoop2.6 • spark.master.url=spark://clorox2-poc:7077 • Ensure that spark.master.url is set to the URL written to the log file when you launch the Spark master. • You can also see it in the Spark Master Web UI
  • 22. Monitoring • Spark Master Web UI • Check whether the job was submitted to the Spark master • Check whether the worker has allocated the resources to execute the job • Check the DAG to optimize performance • Incorta Log • Use tail -f <incorta home>/server/logs/incorta/<tenant>/incorta-…out • See runtime errors
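Since a mistyped `spark.master.url` is an easy mistake to make, a small script can check the two Spark entries in server.properties before Incorta is restarted. This is a minimal sketch, assuming the file uses simple `key=value` lines as in the slide; the helper names are hypothetical, and the `spark://<host>:<port>` pattern is taken from the slide's example rather than from any Incorta specification.

```python
import re

# Shape of the master URL shown in the slide: spark://<host>:<port>
MASTER_URL = re.compile(r"^spark://([^:/\s]+):(\d+)$")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def spark_settings_problems(text):
    """Return a list of human-readable problems with the Spark settings."""
    props = parse_properties(text)
    problems = []
    if not props.get("spark.home"):
        problems.append("spark.home is not set")
    if not MASTER_URL.match(props.get("spark.master.url", "")):
        problems.append("spark.master.url should look like spark://<host>:<port>")
    return problems
```

An empty result means the two entries are present and well formed; whether the URL matches the one printed by the running master still has to be compared by hand.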
  • 23. Create your first MV in Incorta
  • 24. Understand read() and save() • read("schema.table") – gets the data from Incorta as a dataframe • save(dataframe) – stores the dataframe's data as the MV • These are Incorta functions; internally they call • sqlContext.read.parquet • df.write.mode("overwrite").parquet
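The read/transform/save pattern described above can be sketched as follows. This is an illustration under stated assumptions, not Incorta's documented script format: the function name `build_order_totals_mv`, the table `SALES.ORDERS`, and the columns `customer_id` and `amount` are all hypothetical, and `read`, `save`, and the SQL context are passed in as parameters so the sketch does not assume how Incorta exposes them to the script. `registerTempTable` and `sql` are the Spark 1.6 DataFrame and SQLContext calls for registering a dataframe as a table and querying it.

```python
def build_order_totals_mv(read, save, sql_context):
    """Sketch of an MV: read a loaded table, aggregate it with SQL, save it."""
    # Hypothetical schema.table name; read() returns a dataframe.
    orders = read("SALES.ORDERS")

    # Register the dataframe so it can be queried with Spark SQL (Spark 1.6 API).
    orders.registerTempTable("orders")

    # Hypothetical columns; any Spark SQL query over the registered table works.
    totals = sql_context.sql(
        "SELECT customer_id, SUM(amount) AS total_amount "
        "FROM orders GROUP BY customer_id"
    )

    # Hand the result back to Incorta, which stores it as the MV's data.
    save(totals)
    return totals
```

Because `save` overwrites the stored parquet data (per the `mode("overwrite")` call noted above), each load replaces the previous contents of the MV.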
  • 25. Demo