Incorta allows users to create materialized views (MVs) using Spark. It provides functions to read data from Incorta tables and save Spark DataFrames as MVs. The document discusses Spark integration with Incorta, including installing and configuring Spark, and creating the first MV using Spark Python APIs. It demonstrates reading data from Incorta and saving a DataFrame as a new MV.
5. Why Spark?
• General-purpose framework for parallel processing in a cluster
• Functional programming available in
• Scala
• Python
• Java
• Spark SQL and DataFrame
• Using the same framework for
• Data Streaming
• Machine Learning
• Graph processing
[Diagram: the Spark stack. Spark Core with the Spark SQL, Spark Streaming, Machine Learning, and Graph Processing libraries on top; runs on the Standalone Scheduler, YARN, or Mesos]
6. Spark Execution Flow
[Diagram: the Driver Program with its Spark Context runs on the Incorta Server and connects to the Spark Master (Cluster Manager) on the Spark Server; the master assigns Tasks to Executors (each with a Cache) on the Worker Nodes]
http://spark.apache.org/docs/latest/cluster-overview.html
7. Spark Concepts
• Spark Context – like a JDBC connection that holds a DB session to a database; it is the connection to the Spark cluster
• The Master and each Worker have their own JVM process and listener port
• The Master and Workers each have a Web UI for displaying progress
• Application code is sent to and assigned to executors
• Executors read, write, and process the data
• Memory can be controlled at the worker level and is allocated individually to executors
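To make these concepts concrete, here is a minimal sketch (Spark 1.6-style Python API, as used throughout this deck) of a driver program opening its Spark Context; the application name and master URL are placeholders, not values from the deck.

# Minimal sketch: the Spark Context is the driver's connection to the cluster.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Placeholder master URL – use the spark://host:7077 URL shown by your Spark Master.
conf = SparkConf().setAppName("incorta-mv-example") \
                  .setMaster("spark://spark-master-host:7077")
sc = SparkContext(conf=conf)   # holds the session to the Spark cluster
sqlContext = SQLContext(sc)    # entry point for DataFrames and SQL

print(sc.version)              # confirm the connection works
sc.stop()                      # release the cluster resources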
8. Spark DataFrame
• Organizes data into named columns, like a database table
• A DataFrame can be created from a parquet file
• A DataFrame can be written out and stored as a parquet file
• A DataFrame can be processed via the DataFrame API
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame
• A DataFrame can be registered as a table and processed by SQL
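The bullets above map directly onto a few DataFrame calls. Below is a short illustrative sketch (Spark 1.6 APIs); the parquet paths, table name, and column names are assumptions, not from the deck.

# Illustrative only – paths and columns are made up for this sketch.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dataframe-example")
sqlContext = SQLContext(sc)

# Create a DataFrame from a parquet file
df = sqlContext.read.parquet("/data/sales_orders.parquet")

# Process it via the DataFrame API
big_orders = df.filter(df["AMOUNT"] > 1000).select("ORDER_ID", "AMOUNT")

# Register it as a table and process it with SQL
df.registerTempTable("sales_orders")
totals = sqlContext.sql(
    "SELECT CUSTOMER_ID, SUM(AMOUNT) AS TOTAL FROM sales_orders GROUP BY CUSTOMER_ID")

# Write the result out as a parquet file
totals.write.mode("overwrite").parquet("/data/sales_totals.parquet")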
11. Materialized View in Incorta
• An object created within Incorta, not loaded from data sources
• Created based on other tables already loaded into Incorta
• Once an MV is loaded, it works like other tables:
• It can join from and to another table or MV
• Formula columns can be created in an MV
• Aliases can be created against an MV
• An MV can be defined via Spark Python or Spark SQL
• The Spark Python or Spark SQL code is executed as part of the regular Loader jobs
14. Spark Installation
• Download Spark
• http://spark.apache.org/downloads.html
• Select the package type “Prebuilt for Hadoop XX”
• Select Spark 1.6.2 for Incorta Release 2.8 or earlier
• Download the tarball file or copy the download link from the browser
• Run wget <download URL> from the server machine
• Unzip and uncompress the tar file
• tar -xzvf spark-1.6.2-bin-hadoop2.6.tgz
• Spark is now ready to use! Try this:
• bin/pyspark
• exit()
15. Spark Configurations
• Edit spark-env.sh in <Spark Home>/conf
• Change the Web UI ports if there is any conflict (optional)
• If not all ports are open or reachable from a browser, you can specify:
• SPARK_MASTER_WEBUI_PORT
• SPARK_WORKER_WEBUI_PORT
• Specify the external IP for monitoring Spark jobs (optional)
• If the server machine runs behind a firewall and the external and internal IPs are different, set SPARK_PUBLIC_DNS to the external IP
• Limit the memory used by Spark jobs (optional)
• SPARK_WORKER_MEMORY
• This controls the total memory available on the worker, not the individual assignment to each executor
16. Spark Configurations
• Enable Logging – Useful for investigating issues (recommended)
• Create a directory for holding the log files
• cd <spark home>
• mkdir eventlogs
• Edit the spark-defaults.conf in <spark home>/conf
• spark.eventLog.enabled true
• spark.eventLog.dir <spark home>/eventlogs
• Enable History Server (recommended)
• Edit <spark home>/conf/spark-env.sh
• SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=<spark home>/eventlogs"
• Start the history server
• ./sbin/start-history-server.sh
17. Spark Configuration – Hive metastore DB
• Hive metadata is stored in the Hive metastore
• The Hive metastore requires a database
• Create hive-site.xml in <spark home>/conf
• Edit <Spark Home>/conf/spark-env.sh
• SPARK_HIVE=true
• SPARK_SUBMIT_CLASSPATH
• SPARK_CLASSPATH
• Make sure the JDBC driver is available to Spark
18. hive-site.xml for mySQL
[incorta@clorox2-poc spark-1.6.2-bin-hadoop2.6]$ cat ~/spark-1.6.2-bin-hadoop2.6/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3307/mydb?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mysql_root</value>
<description>password for connecting to mysql server</description>
</property>
</configuration>
19. Starting Spark Master and Worker
• Go to <Spark Home> and start the spark master process
• sbin/start-master.sh
• Check the log file to get the master WebUI URL
• Open the webUI from a browser
• Start the spark slave processes
• sbin/start-slave.sh spark://<spark master host>:7077
• Check the log file to ensure that the worker started properly
• Refresh the browser page to check worker processes
• Start history server (optional, but recommended)
• sbin/start-history-server.sh
21. Incorta Configuration
• Edit <Incorta Home>/incorta/server.properties
• spark.home=/home/incorta/spark-1.6.2-bin-hadoop2.6
• spark.master.url=spark://clorox2-poc:7077
• Please ensure that spark.master.url is set to the URL shown in the log file when you launch the Spark master
• You can also see it in the Spark Master Web UI
22. Monitoring
• Spark Master WebUI
• Check if the job is submitted to Spark master
• Check if the worker has allocated the resources to execute the job
• Check DAG for optimizing the performance
• Incorta Log
• Use tail -f <incorta home>/server/logs/incorta/<tenant>/incorta-…out
• See runtime errors
24. Understand read() and save()
• read("schema.table") – gets the data from Incorta
• save(dataframe) – creates the data from the dataframe as an MV
• These are Incorta functions; internally they call:
• sqlContext.read.parquet
• df.write.mode("overwrite").parquet
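Putting the two functions together, a minimal MV script might look like the sketch below. Only read() and save() come from this deck; the schema/table name, columns, and aggregation are illustrative assumptions.

# Hypothetical MV script – schema, table, and column names are assumptions.
# read() loads an Incorta table as a Spark DataFrame
# (internally: sqlContext.read.parquet on the table's parquet files).
df = read("SALES.ORDERS")

# Transform the data with the DataFrame API.
mv_df = (df.filter(df["STATUS"] == "BOOKED")
           .groupBy("CUSTOMER_ID")
           .sum("AMOUNT"))

# save() stores the DataFrame as the materialized view
# (internally: df.write.mode("overwrite").parquet).
save(mv_df)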