The document discusses a company's migration from their in-house computation engine to Apache Spark. It describes five key issues encountered during the migration process: 1) difficulty adapting to Spark's low-level RDD API, 2) limitations of DataSource predicates, 3) incomplete Spark SQL functionality, 4) performance issues with round trips between Spark and other systems, and 5) OutOfMemory errors due to large result sizes. Lessons learned include being aware of new Spark features and data formats, and designing architectures and data structures to minimize data movement between systems.
This presentation focuses on the value proposition of Azure Databricks for data science. First, the talk gives an overview of the merits of Azure Databricks and Spark. Second, it demos data science on Azure Databricks. Finally, it offers some ideas for taking data science to production.
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q... (Databricks)
Near real-time analytics has become a common requirement for many data teams as the technology has caught up to the demand. One of the hardest aspects of enabling near real-time analytics is making sure the source data is ingested and deduplicated often enough to be useful to analysts, while writing the data in a format that is usable by your analytics query engine. This is usually the domain of many tools, since there are three different aspects of the problem: streaming ingestion of data, deduplication using an ETL process, and interactive analytics. With Spark, this can be done with one tool. This talk will walk you through how to use Spark Streaming to ingest change-log data, use Spark batch jobs to perform major and minor compaction, and query the results with Spark SQL. At the end of this talk you will know what is required to set up near real-time analytics at your organization, the common gotchas including file formats and distributed file systems, and how to handle the unique data integrity issues that arise from near real-time analytics.
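The talk's own code isn't reproduced here; as a rough sketch of the pipeline shape it describes (broker, topic, and paths are invented, and the Spark 1.x streaming-Kafka integration is assumed):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ChangelogIngest {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("changelog-ingest"), Seconds(30))

    // Ingest change-log records from Kafka in 30-second micro-batches
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker:9092"), Set("changelog"))

    // Append each batch to a raw area; periodic Spark batch jobs would then
    // compact/deduplicate this into a table that Spark SQL queries directly
    stream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      rdd.map(_._2).toDF("record").write.mode("append").parquet("/data/changelog_raw")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}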
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc... (Databricks)
Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta. In the real world, however, we often face dynamically changing schemas, which are a big challenge to incorporate without downtime.
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark (Databricks)
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially for those in a rapidly evolving organization
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ... (Databricks)
In rapidly changing conditions, many companies build ETL pipelines using an ad-hoc strategy. Such an approach makes automated testing for data reliability almost impossible and leads to ineffective and time-consuming manual ETL monitoring.
Machine Learning Data Lineage with MLflow and Delta Lake (Databricks)
This document discusses machine learning data lineage using Delta Lake. It introduces Richard Zang and Denny Lee, then outlines the machine learning lifecycle and challenges of model management. It describes how MLflow Model Registry can track model versions, stages, and metadata. It also discusses how Delta Lake allows data to be processed continuously and incrementally in a data lake. Delta Lake uses a transaction log and file format to provide ACID transactions and allow optimistic concurrency control for conflicts.
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De... (Databricks)
Columbia is a data-driven enterprise, integrating data from all line-of-business-systems to manage its wholesale and retail businesses. This includes integrating real-time and batch data to better manage purchase orders and generate accurate consumer demand forecasts.
A Big Data Lake Based on Spark for BBVA Bank (Oscar Mendez, STRATIO) - Spark Summit
This document describes BBVA's implementation of a Big Data Lake using Apache Spark for log collection, storage, and analytics. It discusses:
1) Using Syslog-ng for log collection from over 2,000 applications and devices, distributing logs to Kafka.
2) Storing normalized logs in HDFS and performing analytics using Spark, with outputs to analytics, compliance, and indexing systems.
3) Choosing Spark because it allows interactive, batch, and stream processing with one system using RDDs, SQL, streaming, and machine learning.
Realtime streaming architecture in INFINARIO (Jozo Kovac)
About our experience with real-time analyses on a never-ending stream of user events. Discusses the Lambda architecture, Kappa, Apache Kafka and our own approach.
Operationalizing Big Data Pipelines At Scale (Databricks)
Running a global, world-class business with data-driven decision making requires ingesting and processing diverse sets of data at tremendous scale. How does a company achieve this while ensuring quality and honoring their commitment as responsible stewards of data? This session will detail how Starbucks has embraced big data, building robust, high-quality pipelines for faster insights to drive world-class customer experiences.
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... (Databricks)
Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach, where data plays a central theme in our everyday lives.
As the volume and variety of data garnered from myriad data sources continue to grow at an astronomical scale and as cloud computing offers cheap computing and data storage resources at scale, the data platforms have to match in their abilities to process, analyze, and visualize at scale and speed and with ease — this involves data paradigm shifts in processing and storing and in providing programming frameworks to developers to access and work with these data platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they facilitate the future data scientists to start quickly.
In particular, we will examine in detail two open-source tools MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage for structured and unstructured data).
Other emerging tools such as Koalas help data scientists do exploratory data analysis at scale in a language and framework they are familiar with. We will also touch on emerging data + AI trends in 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
Data Engineer's Lunch #55: Get Started in Data Engineering (Anant Corporation)
In Data Engineer's Lunch #55, CEO of Anant, Rahul Singh, will cover 10 resources every data engineer needs to get started or master their game.
Griffin is a data quality platform built by eBay on Hadoop and Spark to provide a unified process for detecting data quality issues in both real-time and batch data across multiple systems. It defines common data quality dimensions and metrics and calculates measurement values and quality scores, storing results and generating trending reports. Griffin provides a centralized data quality service for eBay and has been deployed processing over 1.2PB of data and 800M daily records using 100+ metrics. It is open source and contributions are welcome.
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri... (Databricks)
Deploying machine learning models seems like it should be a relatively easy task. Take your model and pass it some features in production. The reality is that the code written during the prototyping phase of model development doesn’t always work when applied at scale or on “real” data. This talk will explore 1) common problems at the intersection of data science and data engineering 2) how you can structure your code so there is minimal friction between prototyping and production, and 3) how you can use Apache Spark to run predictions on your models in batch or streaming contexts.
You will take away how to address some of the productionizing issues that data scientists and data engineers face while deploying machine learning models at scale, and a better understanding of how to work collaboratively to minimize the disparity between prototyping and productizing.
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ... (Databricks)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
Building Data Intensive Analytic Application on Top of Delta Lakes (Databricks)
Why build your own analytics application on top of Delta Lake:
– Every enterprise is building a data lake. However, these data lakes are plagued by low user adoption and poor data quality, and result in lower ROI.
– BI tools may not be enough for your use case, especially when you want to build a data-driven analytical web application such as Paysa.
– Delta's ACID guarantees allow you to build a real-time reporting app that displays consistent and reliable data.
In this talk we will learn:
– how to build your own analytics app on top of Delta Lake
– how Delta Lake helps you build a pristine data lake, with several ways to expose data to end users
– how an analytics web application can be backed by a custom query layer that executes Spark SQL on a remote Databricks cluster
We'll explore various options for building an analytics application using various backend technologies, and the architecture patterns, components and frameworks that can be used to build a custom analytics platform in no time. We'll also see how to leverage machine learning to build advanced analytics applications. Demo: an analytics application built on the Play Framework (back end) and React (front end), with Structured Streaming ingesting data from a Delta table, live query analytics on real-time data, and ML predictions based on the analytics data.
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole (Vasu S)
This ebook deep dives into Apache Spark optimizations that improve performance, reduce costs and deliver unmatched scale
https://www.qubole.com/resources/ebooks/accelerating-time-to-value-of-big-data-of-apache-spark
The story of one project's architecture evolution from zero to a Lambda Architecture, including how we scaled the cluster once the architecture was set up. Contains nice performance charts after every architecture change.
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa... (NoSQLmatters)
Apache Spark is a general data processing framework which allows you to perform map-reduce tasks (but not only) in memory. Apache Cassandra is a highly available and massively scalable NoSQL data store. By combining Spark's flexible API and Cassandra's performance, we get an interesting alternative to the Hadoop ecosystem for both real-time and batch processing. During this talk we will highlight the tight integration between Spark and Cassandra and demonstrate some usages with a live code demo.
AWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika (Serhiy Batyuk)
The document provides an overview of AWS Simple Workflow (SWF) presented by Serhiy Batyuk. Some key points:
- SWF is a fully managed AWS service for coordinating work across distributed application components through the use of workflows and activities.
- It allows building scalable applications by coordinating work across components through asynchronous calls using workflows and tasks.
- The presentation demonstrates how to build a sample application in Java using the SWF APIs and SDK to coordinate preparation tasks for attending a conference.
- Key concepts covered include workflows, activities, deciders, retries, scalability, and replay of workflow executions for reliability.
This document discusses optimizing performance for high-load projects. It summarizes the delivery loads and technologies used for several projects including mGage, mobclix and XXXX. It then discusses optimizations made to improve performance, including using Solr for search, Redis for real-time data, Hadoop for reporting, and various Java optimizations in moving to Java 7. Specific optimizations discussed include reducing garbage collection, improving random number generation, and minimizing I/O operations.
The workshop tells about HBase data model, architecture and schema design principles.
Source code demo:
https://github.com/moisieienko-valerii/hbase-workshop
This document provides an overview of React, Flux, and Redux. It discusses the history of React and how it aims to solve issues with directly manipulating the DOM by using a virtual DOM. It also explains that React focuses on building reusable components with unidirectional data flow. Flux is then introduced as an architecture based on this one-way data flow, but it has issues with boilerplate code and complex store dependencies. Redux is presented as an improved implementation of Flux that uses a single immutable state tree and pure reducer functions to update the state, providing a more predictable state management approach. Demos are provided and useful links listed for further exploring each topic.
This document summarizes some key aspects of the Marionette library:
- Marionette provides common design patterns for building large-scale Backbone applications with features like nested views, view rendering on model changes, and region-based view management.
- The library includes classes like ItemView, CollectionView, and CompositeView that automatically render views based on model or collection data. It also has a messaging bus for application-level events.
- The messaging bus includes an event aggregator for pub/sub messaging, commands for triggering actions, and a request/response system for requesting data without tight coupling between components.
This presentation will be useful to those who would like to get acquainted with Apache Spark's architecture and top features and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Was presented on Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
How Spark Fits into Baidu's Scale (James Peng, Baidu) - Spark Summit
This document summarizes Baidu's use of Apache Spark and Tachyon for interactive data analytics. It describes Baidu's need for faster interactive queries to analyze petabytes of data within 30 seconds. Baidu implemented a query architecture using Spark, SQL and UDFs over Tachyon for hot data and HDFS storage. This provided over 50x acceleration compared to MapReduce, enabling 95% of queries to finish within 30 seconds. While Spark has improved performance, MapReduce still handles high throughput batch workloads at larger scale. Future opportunities include hardware acceleration and continued Spark optimizations for memory management and query planning.
Anastasiia Kornilova has over 3 years of experience in data science. She has an MS in Applied Mathematics and runs two blogs. Her interests include recommendation systems, natural language processing, and scalable data solutions. The agenda of her presentation includes defining data science, who data scientists are and what they do, and how to start a career in data science. She discusses the wide availability of data, how data science makes sense of and provides feedback on data, common data science applications, and who employs data scientists. The presentation outlines the typical data science workflow and skills required, including domain knowledge, math/statistics, programming, communication/visualization, and how these skills can be obtained. It provides examples of data science
In this era of ever growing data, the need for analyzing it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, Storm etc. Spark, however, is unique in providing batch as well as streaming capabilities, thus making it a preferred choice for lightning fast Big Data analysis platforms.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
From common errors seen in running Spark applications, e.g., OutOfMemory, NoClassFound, disk IO bottlenecks, History Server crash, cluster under-utilization to advanced settings used to resolve large-scale Spark SQL workloads such as HDFS blocksize vs Parquet blocksize, how best to run HDFS Balancer to re-distribute file blocks, etc. you will get all the scoop in this information-packed presentation.
This document discusses data ingestion with Spark. It provides an overview of Spark, which is a unified analytics engine that can handle batch processing, streaming, SQL queries, machine learning and graph processing. Spark improves on MapReduce by keeping data in-memory between jobs for faster processing. The document contrasts data collection, which occurs where data originates, with data ingestion, which receives and routes data, sometimes coupled with storage.
Spark is going to replace Apache Hadoop! Know Why? (Edureka!)
The document discusses how Spark is emerging to replace Hadoop for big data processing. It notes that Hadoop MapReduce is limited to batch processing and is not fast enough for real-time processing needs. In contrast, Spark is up to 100 times faster than Hadoop MapReduce, supports both batch and real-time processing, and stores data in memory for faster analysis. A survey is cited showing increasing adoption of Spark over Hadoop in industries handling large volumes of data. The document concludes that while Hadoop will still be used, Spark will replace Hadoop MapReduce as the primary framework for big data applications due to its ability to support real-time processing demands.
Databricks Meetup @ Los Angeles Apache Spark User Group (Paco Nathan)
This document summarizes a presentation on Apache Spark and Spark Streaming. It provides an overview of Spark, describing it as an in-memory cluster computing framework. It then discusses Spark Streaming, explaining that it runs streaming computations as small batch jobs to provide low latency processing. Several use cases for Spark Streaming are presented, including from companies like Stratio, Pearson, Ooyala, and Sharethrough. The presentation concludes with a demonstration of Python Spark Streaming code.
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana... (Lillian Pierson)
In this one-hour webinar, you will be introduced to Spark, the data engineering that supports it, and the data science advances that it has spurred. You'll discover the interesting story of its academic origins and then get an overview of the organizations who are using the technology. After being briefed on some impressive Spark case studies, you'll come to know of the next-generation Spark 2.0 (to be released in just a few months). We will also tell you about the tremendous impact that learning Spark can have upon your current salary, and the best ways to get trained in this ground-breaking new technology.
Presented at IDEAS SoCal on Oct 20, 2018. I discuss main approaches of deploying data science engines to production and provide sample code for the comprehensive approach of real time scoring with MLeap and Spark ML.
Presentation on Presto (http://prestodb.io) basics, design and Teradata's open source involvement. Presented on Sept 24th 2015 by Wojciech Biela and Łukasz Osipiuk at the #20 Warsaw Hadoop User Group meetup http://www.meetup.com/warsaw-hug/events/224872317
Graph analytics can be used to analyze a social graph constructed from email messages on the Spark user mailing list. Key metrics like PageRank, in-degrees, and strongly connected components can be computed using the GraphX API in Spark. For example, PageRank was computed on the 4Q2014 email graph, identifying the top contributors to the mailing list.
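As a flavor of what that looks like (a minimal sketch, not the talk's code; the input file and its from/to pair format are assumptions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(new SparkConf().setAppName("mailing-list-graph"))

// One "from<TAB>to" pair per reply on the list (hypothetical input format)
val edges = sc.textFile("mail_pairs.tsv").map { line =>
  val Array(from, to) = line.split("\t")
  Edge(from.hashCode.toLong, to.hashCode.toLong, 1)
}

val graph = Graph.fromEdges(edges, defaultValue = "")

// PageRank until convergence; in-degrees are a one-liner on the same graph
val ranks = graph.pageRank(tol = 0.0001).vertices
graph.inDegrees.top(10)(Ordering.by(_._2)).foreach(println)
ranks.top(10)(Ordering.by(_._2)).foreach(println)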
End-to-End Data Pipelines with Apache Spark (Burak Yavuz)
This presentation is about building a data product backed by Apache Spark. The source code for the demo can be found at http://brkyvz.github.io/spark-pipeline
This document summarizes Spark's growth and development in 2015 and outlines its future direction. It discusses how Spark has become the most active open source big data project, with growing community and contributor numbers. Spark is now used across diverse industries for applications like log processing, recommendations, and business intelligence. The document highlights how Spark supports diverse runtime environments beyond Hadoop and how its user base has expanded beyond data engineers. It outlines upcoming features like the Dataset API and streaming DataFrames that will provide more optimized and easier to use APIs. The goal is for Spark to serve as a unified engine for all data workloads through continued optimization and support for new technologies like 3D XPoint memory.
This document discusses new directions for Apache Spark in 2015, including improved interfaces for data science, external data sources, and machine learning pipelines. It also summarizes Spark's growth in 2014 with over 500 contributors, 370,000 lines of code, and 500 production deployments. The author proposes that Spark will become a unified engine for all data sources, workloads, and environments.
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr... (DataStax Academy)
In this in-depth workshop you will gain hands-on experience with using Spark and Cassandra inside the DataStax Enterprise Platform. The focus of the workshop will be working through data analytics exercises to understand the major developer considerations. You will also gain an understanding of the internals behind the integration that allow for large-scale data loading and analysis. It will also review some of the major machine learning libraries in Spark as an example of data analysis.
The workshop will start with a review of the basics of how Spark and Cassandra are integrated. Then we will work through a series of exercises that will show how to perform large-scale data analytics with Spark and Cassandra. A major part of the workshop will be to understand effective data modeling techniques in Cassandra that allow for fast parallel loading of the data into Spark to perform large-scale analytics on that data. The exercises will also look at how to use the open source Spark Notebook to run interactive data analytics with the DataStax Enterprise Platform.
Microservices, Events, and Breaking the Data Monolith with Kafka (VMware Tanzu)
One of the trickiest problems with microservices is dealing with data as it becomes spread across many different bounded contexts. An event architecture and event-streaming platform like Kafka provide a respite to this problem. Event-first thinking has a plethora of other advantages too, pulling in concepts from event sourcing, stream processing, and domain-driven design.
In this talk, Ben and Cornelia will tackle how to do the following:
● Transform the data monolith to microservices
● Manage bounded contexts for data fields that overlap
● Use event architectures that apply streaming technologies like Kafka to address the challenges of distributed data
Speakers:
Cornelia Davis, Author & VP, Technology, Pivotal
Ben Stopford, Author & Technologist, Office of CTO, Confluent
Jump Start with Apache Spark 2.0 on Databricks (Anyscale)
This document provides an agenda for a 3+ hour workshop on Apache Spark 2.x on Databricks. It includes introductions to Databricks, Spark fundamentals and architecture, new features in Spark 2.0 like unified APIs, and workshops on DataFrames/Datasets, Spark SQL, and structured streaming concepts. The agenda covers lunch and breaks and is divided into hour and half hour segments.
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... (Edureka!)
This Edureka Spark tutorial will help you to understand all the basics of Apache Spark. This Spark tutorial is ideal both for beginners and for professionals who want to learn or brush up on Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase (DataWorks Summit)
As one of the few closed-loop payment platforms, PayPal is uniquely positioned to provide merchants with insights aimed to identify opportunities to help grow and manage their business. PayPal processes billions of data events every day around our users, risk, payments, web behavior and identity. We are motivated to use this data to enable solutions to help our merchants maximize the number of successful transactions (checkout-conversion), better understand who their customers are and find additional opportunities to grow and attract new customers.
As part of the Merchant Data Analytics, we have built a platform that serves low latency, scalable analytics and insights by leveraging some of the established and emerging platforms to best realize returns on the many business objectives at PayPal.
Join us to learn more about how we leveraged platforms and technologies like Spark, Hive, Druid, Elastic Search and HBase to process large scale data for enabling impactful merchant solutions. We’ll share the architecture of our data pipelines, some real dashboards and the challenges involved.
Speakers
Kasiviswanathan Natarajan, Member of Technical Staff, PayPal
Deepika Khera, Senior Manager - Merchant Data Analytics, PayPal
Hadoop or Spark: is it an either-or proposition? (Slim Baltagi)
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
GraphX: Graph analytics for insights about developer communities (Paco Nathan)
The document provides an overview of Graph Analytics in Spark. It discusses Spark components and key distinctions from MapReduce. It also covers GraphX terminology and examples of composing node and edge RDDs into a graph. The document provides examples of simple traversals and routing problems on graphs. It discusses using GraphX for topic modeling with LDA and provides further reading resources on GraphX, algebraic graph theory, and graph analysis tools and frameworks.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop... (DataKitchen)
The main objective of this workshop is to give the audience hands-on experience with several Hadoop technologies and jump-start their Hadoop journey. In this workshop, you will load data and submit queries using Hadoop! Before jumping in to the technology, the Founders of DataKitchen review Hadoop and some of its technologies (MapReduce, Hive, Pig, Impala and Spark), look at performance, and present a rubric for choosing which technology to use when.
NOTE: To complete the hands-on portion in the time allotted, attendees should come with a newly created AWS (Amazon Web Services) account and complete the other prerequisites found in the DataKitchen blog.
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More (Paco Nathan)
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Modified O-RAN 5G Edge Reference Architecture using RNN (ijwmn)
Paper Title
Modified O-RAN 5G Edge Reference Architecture using RNN
Authors
M.V.S Phani Narasimham1 and Y.V.S Sai Pragathi2, 1Wipro Technologies, India, 2Stanley College of Engineering & Technology for Women (Autonomous), India
Abstract
This paper explores the implementation of 6G/5G standards by network providers using cloud-native technologies such as Kubernetes. The primary focus is on proposing algorithms to improve the quality of user parameters for advanced networks like car as cloud and automated guided vehicle. The study involves a survey of AI algorithm modifications suggested by researchers to enhance the 5G and 6G core. Additionally, the paper introduces a modified edge architecture that seamlessly integrates the RNN technologies into O-RAN, aiming to provide end users with optimal performance experiences. The authors propose a selection of cutting-edge technologies to facilitate easy implementation of these modifications by developers.
Keywords
5G O-RAN, 5G-Core, AI Modelling, RNN, Tensor Flow, MEC Host, Edge Applications.
Volume URL: https://airccse.org/journal/jwmn_current24.html
Abstract URL: https://aircconline.com/abstract/ijwmn/v16n3/16324ijwmn01.html
Youtube URL: https://youtu.be/rIYGvf478Oc
Pdf URL: https://aircconline.com/ijwmn/V16N3/16324ijwmn01.pdf
The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R... (gerogepatton)
This paper summarizes the most cogent advantages and risks associated with Artificial Intelligence from an in-depth review of the literature. Then the authors synthesize the salient risk-related models currently being used in AI, technology and business-related scenarios. Next, in view of an updated context of AI along with theories and models reviewed and expanded constructs, the writers propose a new framework called "The Transformation Risk-Benefit Model of Artificial Intelligence" to address the increasing fears and levels of AI risk. Using the model characteristics, the article emphasizes practical and innovative solutions where benefits outweigh risks, and three use cases in healthcare, climate change/environment and cyber security to illustrate the unique interplay of principles, dimensions and processes of this powerful AI transformational model.
Vijay Engineering and Machinery Company (VEMC) is a leading company in the field of electromechanical engineering products and services, with over 70 years of experience.
2. About me
Roman Chukh
11+ years of experience
Java / PHP / Ruby / etc.
~1 year with Apache Spark
Interested in
Data Storage / Data Flow
Monitoring
Provisioning Tools
3. Agenda
Why Spark?
Our Migration to Spark
Issues
… and solutions
… or workarounds
… or at least the lessons learnt
13. Migrating To Spark
The Product
Cloud-based analytics application
Won the Big Data Startup Challenge
In-house computation engine
14. Migrating To Spark
Reasons
More data
More granular data
Support various data backends
Support Machine Learning algorithms
15. Migrating To Spark
Use Cases
❏ supplement Graph database used to store/query big dimensions
❏ supplement RDBMS for querying of high volumes of data
❏ represent existing computation graph as flow of Spark-based operations
16. Migrating To Spark
Star Schema
[Diagram: star-schema data flow — Dimension and Metric inputs pass through Process / Filter steps and are combined in the Data Processing stage into a Result]
23. Issue #1: Low-Level API
RDD: Issues
❏ Functional transformations (e.g. map/reduce) are not as intuitive
❏ Manual memory management
❏ High (dev) maintenance cost
24. Issue #1: Low-Level API
DataFrame: Overview
❏ (Semi-) Structured data
❏ Columnar Storage
❏ Graph mutation
❏ Code generation
    ❏ "on" by default in 1.5+
    ❏ "always on" in latest master
25. Issue #1: Low-Level API
DataFrame: Example
lines.json
{"line":"some"}
{"line":"lines"}
{"line":"for"}
{"line":"test"}
26. Issue #1: Low-Level API
DataFrame vs RDD
Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
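The linked slide contrasts the two APIs; an illustrative average-per-key in both styles (not the deck's own example, assuming an RDD of (key, value) pairs and an equivalent DataFrame df) shows why the DataFrame version is both shorter and optimizable:

// RDD: hand-rolled aggregation over tuples, opaque to the engine
val avgByKeyRdd = pairs                                   // RDD[(String, Double)]
  .map { case (k, v) => (k, (v, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .map { case (k, (sum, count)) => (k, sum / count) }

// DataFrame: declarative, planned and code-generated by Catalyst
import org.apache.spark.sql.functions.avg
val avgByKeyDf = df.groupBy("key").agg(avg("value"))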
27. Issue #1: Low-Level API
DataFrame: Graph Mutation
Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
28. Issue #1: Low-Level API
Lessons Learnt
❏ Be aware of the new features
❏ … especially why they were introduced
❏ Low-Level API != Better Performance
30. "The fastest way to process big data is to never read it"
Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
31. Issue #2: DataSource Predicates
Use Cases
SQL:
SELECT *
FROM Table
WHERE x > 0
[Diagram: Spark flow; the WHERE x > 0 predicate is pushed down to the RDBMS, which returns the filtered Result]
32. Issue #2: DataSource Predicates
Use Cases
SQL:
SELECT *
FROM Table
WHERE x > 0
AND y < 10
[Diagram: Spark flow; both predicates of the AND group (x > 0, y < 10) are pushed down to the RDBMS]
33.-34. Issue #2: DataSource Predicates
Use Cases
SQL:
SELECT *
FROM Table
WHERE x > 0
OR y < 10
[Diagram, repeated across two slides: Spark flow with the predicates x > 0 and y < 10 combined by OR]
35. Issue #2: DataSource Predicates
JDBC … is at a very early stage
❏ Only simple predicates: <, <=, >, >=, =
❏ Only 'AND' predicate groups (no OR support)
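To ground these bullets, a hedged sketch over an invented JDBC table: a simple comparison is compiled into the generated WHERE clause on the database side, while an OR group is not pushed and gets evaluated by Spark after the scan:

val facts = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://db:5432/analytics",
  "dbtable" -> "facts")).load()

facts.filter("x > 0")            // pushed down: the database sees WHERE x > 0
facts.filter("x > 0 OR y < 10")  // not pushed down: Spark filters after a full scan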
36. Issue #2: DataSource Predicates
Apache Parquet … is buggy
❏ Parquet < 1.7
    ❏ PARQUET-136 - NPE if all column values are null
❏ Parquet 1.7
    ❏ PARQUET-251 - Possible incorrect results for String/Decimal/Binary columns
37. Issue #2: DataSource Predicates
Lessons Learnt
❏ Know your data format / data storage features
❏ ... and issues
❏ It's hard to check predicate pushdown behavior
❏ SPARK-11390: Pushdown information
❏ Simple aggregation operations are not supported
❏ Check out the talk “The Pushdown of Everything”
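Per the SPARK-11390 bullet, later releases surface the pushed filters in the physical plan, which makes the check a one-liner (reusing the facts DataFrame from the earlier sketch; the exact output shape varies by version):

facts.filter("x > 0").explain()
// == Physical Plan ==
// ... Scan JDBCRelation(facts) ... PushedFilters: [GreaterThan(x,0)] ...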
39. Issue #3: Spark (sort of) SQL
Missing Functionality
❏ Window functions (e.g. row_number)
    ❏ Introduced for HiveContext in 1.4
    ❏ Introduced for SQLContext in 1.5
❏ Subquery (e.g. not exists) support is still missing
    ❏ Can sometimes be replaced with left semi join
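A sketch of the workaround from the last bullet, with invented table names: an EXISTS-style subquery expressed as a LEFT SEMI JOIN, which Spark SQL's HiveQL-derived grammar did accept:

// Intended (unsupported grammar at the time):
//   SELECT * FROM orders o
//   WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)
val result = sqlContext.sql(
  """SELECT o.*
    |FROM orders o
    |LEFT SEMI JOIN customers c ON c.id = o.customer_id""".stripMargin)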
40. Issue #3: Spark (sort of) SQL
Lessons Learnt
❏ Know your use-case
❏ Spark SQL is still quite young
❏ SQL grammar is incomplete
❏ … but actively extended
42.-43. Issue #4: Round Trips
Background
[Diagram, repeated across two slides: Metric and Dimension inputs flow through Process / Filter steps into the Data Processing stage and a Result; Dimension ids are first resolved through an internal API]
44. Issue #4: Round Trips
Resolving Dimensions
Get ID for the 'Year 2015'
[Diagram: one query against the Dimension store, WHERE key = '2015', returning the Result]
45. Issue #4: Round Trips
Resolving Dimensions
Get IDs of all passed months of the current year
[Diagram: the dimension id of '2015' (WHERE key = '2015') feeds a second query, WHERE parent = 2015 and level = month, returning the Result]
46. Issue #4: Round Trips
Resolving Dimensions
Get IDs of all passed months of the current year AND their siblings from the previous year
[Diagram: three chained round trips; WHERE key = '2015' yields the dimension id, WHERE parent = 2015 and level = month yields Jan, Feb, …, and WHERE sibling_id = sibling_id - 1 yields the previous-year siblings in the Result]
47. Issue #4: Round Trips
Lessons Learnt
❏ Spark is better suited for a single complex request
❏ … though not too complex yet
❏ Invest time in architecture analysis and data flow
❏ It might be better to replace it with a more high-level API
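To make the lesson concrete, the three lookups from slides 44-46 could be folded into one self-joined Spark SQL statement, so the engine sees a single request instead of three round trips (a hypothetical schema, not the deck's code):

val monthIds = sqlContext.sql(
  """SELECT m.id, prev.id AS previous_year_sibling
    |FROM dimension y
    |JOIN dimension m ON m.parent = y.id AND m.level = 'month'
    |LEFT JOIN dimension prev ON prev.sibling_id = m.sibling_id - 1
    |WHERE y.key = '2015'""".stripMargin)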
49. "RAM's cheap, but not that cheap"
Source: http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-everything-to-ram-and-run-it-from-there
50. Issue #5: OOM
Background
❏ Receive request
❏ Select / Filter / Process data (on Spark)
❏ Collect results
❏ … Out Of Memory
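The failure is at the collect step: the whole result is materialized in driver memory at once. Besides shrinking the objects (next slides), a hedged aside: partitions can be streamed to the driver one at a time instead (resultRdd and writeToResponse are hypothetical names):

val all = resultRdd.collect()        // materializes ~1M objects on the driver => OOM

// toLocalIterator fetches one partition at a time, bounding driver memory
resultRdd.toLocalIterator.foreach(row => writeToResponse(row))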
51. Issue #5: OOM
Workaround: Requirements
❏ Same data as before
❏ Same external API
52. Issue #5: OOM
Workaround: Before
❏ Result holds ~ 1M objects
❏ (Average) Object size 928 bytes
❏ Result size ~880 MB
53. Issue #5: OOM
Workaround: After
❏ Result holds ~ 1M objects
❏ (Average) Object size 272 bytes
❏ Result size ~261 MB
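The deck doesn't show the restructured classes; a shrink of this order (928 down to 272 bytes per object) typically comes from replacing boxed fields, Strings and collections with primitives, ids and flat arrays. A purely hypothetical illustration:

// Before: boxed numbers and Strings, each carrying object-header overhead
case class ResultRowBefore(id: java.lang.Long, dimensionKey: String, values: List[java.lang.Double])

// After: primitives and one flat array, a single header for all values
case class ResultRowAfter(id: Long, dimensionId: Int, values: Array[Double])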
54. Issue #5: OOM
Lessons Learnt
❏ Invest (more) time in data structures
❏ Some java performance tips: http://java-performance.com/
❏ Know your serializer
❏ E.g. Kryo (v2.2.1) prepares object for deserialization by using default constructor.
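A minimal sketch of wiring Kryo into Spark (reusing the hypothetical ResultRowAfter from the previous sketch); per the slide's caveat, serialized classes need a reachable no-arg constructor:

val conf = new SparkConf()
  .setAppName("migration")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes avoids writing full class names into the stream
  .registerKryoClasses(Array(classOf[ResultRowAfter]))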
56. "The fact that there is a highway to hell and only a stairway to heaven says a lot about the traffic trends"
Source: https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_that_there_is_a_highway_to_hell_and_only