Big Data Processing with Spark and .NET - Microsoft Ignite 2019

 Apache Spark is an OSS fast analytics engine for big data and machine
learning
 Improves efficiency through:
 General computation graphs beyond map/reduce
 In-memory computing primitives
 Allows developers to scale out their user code & write in their language of
choice
 Rich APIs in Java, Scala, Python, R, SparkSQL etc.
 Batch processing, streaming and interactive shell
 Available on Azure via
Azure Synapse Azure Databricks
Azure HDInsight IaaS/Kubernetes

.NET Developers 💖 Apache Spark…
A lot of big data-usable business logic (millions
of lines of code) is written in .NET!
Expensive and difficult to translate into
Python/Scala/Java!
Locked out from big data processing due to
lack of .NET support in OSS big data solutions
In a recently conducted .NET Developer survey (> 1000 developers), more than 70%
expressed interest in Apache Spark!
Would like to tap into OSS eco-system for: Code libraries, support, hiring

Goal: .NET for Apache Spark is aimed at providing
.NET developers a first-class experience when
working with Apache Spark.
Non-Goal: Converting existing Scala/Python/Java
Spark developers.

We are developing it in the open!
Contributions to foundational OSS projects:
• Apache Spark Core: SPARK-28271, SPARK-28278, SPARK-28283, SPARK-28282, SPARK-28284,
SPARK-28319, SPARK-28238, SPARK-28856, SPARK-28970, SPARK-29279, SPARK-29373
• Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737,
ARROW-4543, ARROW-4435, ARROW-4503, ARROW-4717, ARROW-4337, ARROW-5887,
ARROW-5908, ARROW-6314, ARROW-6682
• Pyrolite (Pickling Library): Improve pickling/unpickling performance, Add a Strong Name to
Pyrolite, Improve Pickling Performance, Hash set handling, Improve unpickling performance
.NET for Apache Spark is open source
• Website: https://dot.net/spark
• GitHub: https://github.com/dotnet/spark
• Version 0.6 released Oct 2019
Spark project improvement proposals:
• Interop support for Spark language extensions: SPARK-26257
• .NET bindings for Apache Spark: SPARK-27006

.NET provides full-spectrum Spark support
Spark DataFrames
with SparkSQL
Works with
Spark v2.3.x/v2.4.x
and includes
~300 SparkSQL
functions
Grouped Map
Delta Lake
.NET Spark UDFs
Batch &
streaming
Including
Spark Structured
Streaming and all
Spark-supported data
sources
.NET Standard 2.0
Works with
.NET Framework v4.6.1+
and .NET Core v2.1/v3.x
and includes C#/F#
support
.NET
Standard
Data Science
Including access to
ML.NET
Interactive Notebook
with C# REPL
Speed &
productivity
Performance optimized
interop, as fast or faster
than pySpark,
Support for HW
Vectorization
https://github.com/dotnet/spark/examples

0.6
8
DataStreamWriter.PartitionBy()
RelationalGroupedDataset.Mean(),Max(),Avg(),Min(),Agg(),Count()
SparkSession.*Session(),Range(),Conf()
UDF with Row as a parameter
Delta Lake’s DeltaTable
SparkSession.Catalog
UDF with Array.Map as a return type
UDF debugging
Vector & GroupedMap UDFspark.yarn.archives support
Compatibility check for Microsoft.Spark.Worker
AssemblyLoader enhancement for loading UDFs
Resolver signer fix
Arrow & Pickling perf improvement
Arcade build infrastructure
TPC-H update with Arrow
DataStreamWriter.Trigger
ComplexTypes.MapType
Support for Spark 2.3.*, Spark 2.4.[1/2/4]
Worker binaries for MacOS
UDF with dependent types
DataFrameReader.Load() Source link for Nuget packageSparkFile
.NET for Apache Spark

Language comparison: TPC-H Query 2
val europe = region.filter($"r_name" === "EUROPE")
.join(nation, $"r_regionkey" === nation("n_regionkey"))
.join(supplier, $"n_nationkey" === supplier("s_nationkey"))
.join(partsupp,
supplier("s_suppkey") === partsupp("ps_suppkey"))
val brass = part.filter(part("p_size") === 15
&& part("p_type").endsWith("BRASS"))
.join(europe, europe("ps_partkey") === $"p_partkey")
val minCost = brass.groupBy(brass("ps_partkey"))
.agg(min("ps_supplycost").as("min"))
brass.join(minCost, brass("ps_partkey") === minCost("ps_partkey"))
.filter(brass("ps_supplycost") === minCost("min"))
.select("s_acctbal", "s_name", "n_name",
"p_partkey", "p_mfgr", "s_address",
"s_phone", "s_comment")
.sort($"s_acctbal".desc,
$"n_name", $"s_name", $"p_partkey")
.limit(100)
.show()
var europe = region.Filter(Col("r_name") == "EUROPE")
.Join(nation, Col("r_regionkey") == nation["n_regionkey"])
.Join(supplier, Col("n_nationkey") == supplier["s_nationkey"])
.Join(partsupp,
supplier["s_suppkey"] == partsupp["ps_suppkey"]);
var brass = part.Filter(part["p_size"] == 15
& part["p_type"].EndsWith("BRASS"))
.Join(europe, europe["ps_partkey"] == Col("p_partkey"));
var minCost = brass.GroupBy(brass["ps_partkey"])
.Agg(Min("ps_supplycost").As("min"));
brass.Join(minCost, brass["ps_partkey"] == minCost["ps_partkey"])
.Filter(brass["ps_supplycost"] == minCost["min"])
.Select("s_acctbal", "s_name", "n_name",
"p_partkey", "p_mfgr", "s_address",
"s_phone", "s_comment")
.Sort(Col("s_acctbal").Desc(),
Col("n_name"), Col("s_name"), Col("p_partkey"))
.Limit(100)
.Show();
Similar syntax – dangerously copy/paste friendly!
$”col_name” vs. Col(“col_name”) Capitalization
Scala C#
C# vs Scala (e.g., == vs ===)

Demo 2: Twitter analysis in the Cloud

What is happening when you write .NET Spark code?
DataFrame
SparkSQL
.NET for
Apache
Spark
.NET
Program
Did you
define
a .NET
UDF?
Regular execution path
(no .NET runtime during execution)
Same Speed as with Scala Spark
Interop between Spark and .NET
Faster than with PySpark
No
Yes
Spark
operation tree

Works everywhere!
Cross platform
Cross Cloud
Windows Ubuntu
Azure & AWS
Databricks
macOS
AWS EMR
Spark
Azure HDI
Spark
Installed out of
the box
Azure
Synapse
Installation docs
on Github

More
programming
experiences in
.NET
(UDAF, UDT
support, multi-
language UDFs)
What’s next?
Spark data
connectors in
.NET
(e.g., Apache Kafka,
Azure Blob Store,
Azure Data Lake)
Tooling
experiences
(e.g., Jupyter, VS
Code, Visual
Studio, others?)
Idiomatic
experiences
for C# and F#
(LINQ, Type
Provider)
Go to https://github.com/dotnet/spark and let us know what is important to you!
Out-of-Box
Experiences
(Azure Synapse,
Azure HDInsight,
Azure Databricks,
Cosmos DB
Spark, SQL 2019
BDC, …)

Call to action: Engage, use & guide us!
Useful links:
• http://github.com/dotnet/spark
• https://www.nuget.org/packages/Microsoft.Spark
https://aka.ms/GoDotNetForSpark
• https://docs.microsoft.com/dotnet/spark
Website:
• https://dot.net/spark (Request a Demo!)
Starter Videos .NET for Apache Spark 101:
• Watch on YouTube
• Watch on Channel 9
Available out-of-box on
Azure Synapse & Azure HDInsight Spark
Running .NET for Spark anywhere—
https://aka.ms/InstallDotNetForSpark
You & .NET

@MikeDoesBigData, mrys@microsoft.com
#DotNetForSpark #MSIgnite #THR3110
I will be available at the Connect with Expert and/or the Azure Synapse booth

Big Data Processing with Spark and .NET - Microsoft Ignite 2019

Related slideshows

More Related Content

What's hot

What's hot (20)

Similar to Big Data Processing with Spark and .NET - Microsoft Ignite 2019

Similar to Big Data Processing with Spark and .NET - Microsoft Ignite 2019 (20)

More from Michael Rys

More from Michael Rys (20)

Recently uploaded

Recently uploaded (20)

Big Data Processing with Spark and .NET - Microsoft Ignite 2019

Editor's Notes