The document provides an overview of big data concepts and frameworks. It discusses the dimensions of big data including volume, velocity, variety, veracity, value and variability. It then describes the traditional approach to data processing and its limitations in dealing with large, complex data. Hadoop and its core components HDFS and YARN are introduced as the solution. Spark is presented as a faster alternative to Hadoop for processing large datasets in memory. Other frameworks like Hive, Pig and Presto are also briefly mentioned.
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It provides the freedom to query data at scale using either serverless or dedicated options. Azure HDInsight allows the use of open source frameworks like Hadoop, Spark, Hive, and Kafka for processing large volumes of data. Azure Databricks offers environments for SQL, data science/engineering, and machine learning. The Azure IoT Hub enables scalable IoT solutions by allowing bidirectional communication between IoT applications and connected devices.
Apache Spark is a fast, general-purpose cluster computing system that allows processing of large datasets in parallel across clusters. It can be used for batch processing, streaming, and interactive queries. Spark improves on Hadoop MapReduce by using an in-memory computing model that is faster than disk-based approaches. It includes APIs for Java, Scala, Python and supports machine learning algorithms, SQL queries, streaming, and graph processing.
This document discusses MySQL and Hadoop integration. It covers structured versus unstructured data and the capabilities and limitations of relational databases, NoSQL, and Hadoop. It also describes several tools for integrating MySQL and Hadoop, including Sqoop for data transfers, MySQL Applier for streaming changes to Hadoop, and MySQL NoSQL interfaces. The document outlines the typical life cycle of big data with MySQL playing a role in data acquisition, organization, analysis, and decisions.
The document outlines Oracle's Big Data Appliance product. It discusses how businesses can use big data to gain insights and make better decisions. It then provides an overview of big data technologies like Hadoop and NoSQL databases. The rest of the document details the hardware, software, and applications that come pre-installed on Oracle's Big Data Appliance - including Hadoop, Oracle NoSQL Database, Oracle Data Integrator, and tools for loading and analyzing data. The summary states that the Big Data Appliance provides a complete, optimized solution for storing and analyzing less structured data, and integrates with Oracle Exadata for combined analysis of all data sources.
Big Data visualization with Apache Spark and Zeppelin
This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin (incubating). Zeppelin is an open-source tool for data discovery, exploration, and visualization. It supports REPLs for shell, Spark SQL, Spark (Scala), Python, and Angular. This presentation was given on Big Data Day at the Great Indian Developer Summit, Bangalore, April 2015.
Big Data Retrospective - STL Big Data IDEA Jan 2019
Slides from the STL Big Data IDEA meeting from January 2019. The presenters discussed technologies to continue using, stop using, and start using in 2019.
Big Data Developers Moscow Meetup 1 - sql on hadoop
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
Hadoop and SQL: Delivery Analytics Across the Organization
This document summarizes a presentation given by Nicholas Berg of Seagate and Adriana Zubiri of IBM on delivering analytics across organizations using Hadoop and SQL. Some key points discussed include Seagate's plans to use Hadoop to enable deeper analysis of factory and field data, the evolving Hadoop landscape and rise of SQL, and a performance comparison showing IBM's Big SQL outperforming Spark SQL, especially at scale. The document provides an overview of Seagate and IBM's strategies and experiences with Hadoop.
This document discusses building a data analytics platform and summarizes various technologies that can be used. It begins by outlining reasons for analyzing data like reporting, monitoring, and exploratory analysis. It then discusses using relational databases, parallel databases, Hadoop, and columnar storage to store and process large volumes of data. Streaming technologies like Storm, Kafka, and services like Redshift, BigQuery, and Treasure Data are also summarized as options for a complete analytics platform.
An engine to process big data in a faster (than MapReduce), easier, and extremely scalable way. An open-source, parallel, in-memory, cluster computing framework; a solution for loading, processing, and end-to-end analysis of large-scale data. Iterative and interactive, with Scala, Java, Python, and R APIs and a command-line interface.
A summarized version of a presentation regarding Big Data architecture, covering from Big Data concept to Hadoop and tools like Hive, Pig and Cassandra
A talk given by Ted Dunning in February 2013 on Apache Drill, an open-source, community-driven project providing easy, dependable, fast, and flexible ad hoc query capabilities.
In this webinar, we'll see how to use Spark to process data from various sources in R and Python and how new tools like Spark SQL and data frames make it easy to perform structured data processing.
This session covers how to work with the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations. It covers working with different data sources, applying transformations, and Python best practices for developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
The document provides an agenda and overview for a Big Data Warehousing meetup hosted by Caserta Concepts. The meetup agenda includes an introduction to SparkSQL with a deep dive on SparkSQL and a demo. Elliott Cordo from Caserta Concepts will provide an introduction and overview of Spark as well as a demo of SparkSQL. The meetup aims to share stories in the rapidly changing big data landscape and provide networking opportunities for data professionals.
This document discusses big data analytics platforms and techniques. It describes various open-source projects like Hadoop, Spark, and Mahout that can perform analytics on large datasets. It also discusses commercial analytics platforms from vendors like SAS, Alpine, and Revolution Analytics. Spark is highlighted as gaining rapid adoption for its speed and expanding machine learning capabilities. Key questions are raised about which open-source projects and commercial offerings will emerge as leaders in their categories.
Apache Spark is an open-source unified analytics engine for large-scale data processing. It was developed in 2009 to address some limitations of Hadoop, such as slow performance and inability to process data in real-time. Spark stores data in-memory for faster processing, supports real-time processing, multiple programming languages, and can be used for tasks like ETL, machine learning, and graph processing. Spark's architecture includes a cluster manager, driver, executors, and RDDs (Resilient Distributed Datasets) that represent data across a cluster.
Rapid Cluster Computing with Apache Spark 2016 (Zohar Elkayam)
This is the presentation I used for Oracle Week 2016 session about Apache Spark.
In the agenda:
- The Big Data problem and possible solutions
- Basic Spark Core
- Working with RDDs
- Working with Spark Cluster and Parallel programming
- Spark modules: Spark SQL and Spark Streaming
- Performance and Troubleshooting
How we implemented "Exactly Once" semantics in our database ... (javier ramirez)
Distributed systems are hard. High-performance distributed systems, even more so. Network latencies, unacknowledged messages, server restarts, hardware failures, software bugs, problematic releases, timeouts... there are plenty of reasons why it is very difficult to know whether a message you sent was received and processed correctly at its destination. So, to be safe, you send the message again... and again... and cross your fingers hoping the system on the other side can tolerate duplicates.
QuestDB is an open-source database designed for high performance. We wanted to make sure we could offer "exactly once" guarantees by deduplicating messages at ingestion time. In this talk, I explain how we designed and implemented the DEDUP keyword in QuestDB, enabling deduplication as well as upserts on real-time data, while adding only 8% of processing time, even on streams with millions of inserts per second.
I will also explain our parallel, multithreaded write-ahead log (WAL) architecture. Of course, all of this comes with demos, so you can see how it works in practice.
[D2T2S04] Generative AI Foundation Model Training and Tuning with SageMaker (Donghwan Lee)
This session presents approaches for pre-training or fine-tuning foundation models using SageMaker Training Jobs / SageMaker JumpStart. It introduces the following three topics:
1. Training a foundation model from scratch
2. Pre-training a foundation model using open-source models
3. Fine-tuning a model for a specific domain
Presenters:
Miron Perel, Principal ML GTM Specialist, AWS
Kristine Pearce, Principal ML BD, AWS
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second and manage petabytes of data, extending relational database workloads in Aurora beyond the limits of a single Aurora writer instance, without creating custom application logic or managing multiple databases.
2. About me
• I'm Vishal Periyasamy Rajendran
• Senior Data Engineer
• Focused on architecting and developing big data solutions on the AWS cloud.
• 8x AWS certifications, plus other certifications on Azure, Snowflake, etc.
• You can find me on
  • LinkedIn: https://www.linkedin.com/in/vishal-p-2703a9131/
  • Medium: https://medium.com/@vishalrv1904
3. Agenda
• Big data overview
• Dimensions of big data
• Traditional approach and limitations
• Hadoop overview
• Spark overview
• Hive overview
• Other big data frameworks
5. What is Big data?
• Smartphone users generate approximately 40 exabytes of data every month.
• According to Forbes, 2.5 quintillion bytes of data are created every day.
6. What is Big data?
• A collection of data so huge and complex that no traditional data management tool can store or process it.
8. 6 V's of Big data
• Volume
  • The scale of data.
• Velocity
  • The speed of data.
• Variety
  • The diversity of data.
• Veracity
  • The accuracy of data.
• Value
  • The insights gained from data.
• Variability
  • How often data can change.
10. Big Data Phases
• Data collection
• Data cleansing / validation
• Data transformation
• Data storage
• Data visualization
Different pipelines:
• ETL (Extract, Transform, Load)
• ELT (Extract, Load, Transform)
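The phases above can be sketched as a toy ETL pipeline in plain Python (an illustration only, with hard-coded records standing in for a real source; production pipelines would use a framework such as Spark):

```python
# Toy ETL pipeline: each phase is a small, composable function.

def extract():
    # Extract: pull raw records from a source (hard-coded here).
    return ["  Alice,30 ", "Bob,25", "  ,17", "Carol,41"]

def transform(records):
    # Cleanse/validate: drop records with a blank name.
    # Transform: parse each record into a typed dictionary.
    cleaned = []
    for rec in records:
        name, age = rec.strip().split(",")
        if name:  # validation step
            cleaned.append({"name": name, "age": int(age)})
    return cleaned

def load(rows, store):
    # Load: write the transformed rows into the target store.
    store.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
# [{'name': 'Alice', 'age': 30}, {'name': 'Bob', 'age': 25}, {'name': 'Carol', 'age': 41}]
```

An ELT pipeline would simply swap the last two steps: load the raw records first, then transform them inside the target store.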
12. Traditional Approach
• An enterprise would use a single computer to store and process big data.
• Limitations:
  • The processor becomes a bottleneck when processing the data.
  • Dealing with huge amounts of ever-growing data.
13. Traditional Approach
• Google's solution:
  • Solved the processor problem using an algorithm called MapReduce.
  • MapReduce divides a task into small parts and assigns them to many computers.
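The divide-and-combine idea can be sketched in plain Python (a toy word count, not Google's implementation): map each input line to key/value pairs, shuffle the pairs by key, then reduce each group to a final value.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In a real cluster, the map and reduce calls run in parallel on many machines; only the shuffle moves data between them.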
15. Hadoop Overview
• Using the solution provided by Google, Doug Cutting and his team developed an open-source project called HADOOP.
16. Hadoop Overview
• MapReduce
  • Framework for distributed data processing.
  • Maps data to key/value pairs, then reduces intermediate results to final output.
  • Largely supplanted by Spark these days.
• YARN (Yet Another Resource Negotiator)
  • Manages cluster resources for multiple data processing frameworks.
• HDFS (Hadoop Distributed File System)
  • Distributes data blocks across the cluster in a redundant manner.
18. Spark Overview
• Hadoop MapReduce must persist data back to disk after every Map or Reduce action, which slows processing.
• Spark is a distributed processing framework for big data.
• Apache Spark is popular for its speed: because it processes data in memory (RAM), it can run up to 100 times faster in memory and about 10 times faster on disk than Hadoop MapReduce.
• Supports Java, Scala, Python, and R.
20. How Spark Works
• Spark applications run as independent sets of processes on a cluster.
• Executors run computations and store data.
• The Spark context sends application code and tasks to the executors.
• A cluster manager (e.g., YARN) allocates resources.
21. Spark Context vs SQL Context vs Hive Context vs Spark Session
• Spark 1.x introduced three entry points:
• Spark Context:
  • The entry point of every Spark application.
  • The first step to use RDDs and connect to a Spark cluster.
• SQL Context:
  • Used for Spark SQL execution and structured data processing.
• Hive Context:
  • Used by applications to communicate with Hive.
22. Spark Context vs SQL Context vs Hive Context vs Spark Session
• Spark 2.x introduced the Spark session:
• Spark Session:
  • A unified entry point combining the Spark context, SQL context, and Hive context.
23. Resilient Distributed Dataset (RDD) & DataFrame
• RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark.
• A DataFrame is organized into named columns.
• DataFrames support APIs such as select, agg, sum, avg, etc.
• DataFrames support Spark SQL.
• The Catalyst Optimizer is available for DataFrames.
• Both are fault-tolerant, immutable distributed collections of objects: once created, they cannot be changed.
24. Different Types of Evaluation
• Eager evaluation:
  • The evaluation strategy you are most probably familiar with, used in most programming languages.
• Lazy evaluation:
  • An evaluation strategy that delays the evaluation of an expression until its value is needed.
  • Lazy evaluation means you can apply as many TRANSFORMATIONs as you want, but Spark will not start executing the process until an ACTION is called.
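The same idea can be illustrated with plain Python generators (an analogy, not Spark itself): a generator expression is lazy like a Spark transformation, while consuming it forces evaluation like an action.

```python
calls = []

def expensive(x):
    # Record every invocation so we can see when work actually happens.
    calls.append(x)
    return x * 2

data = range(5)

# "Transformation": building the generator does no work yet (lazy).
doubled = (expensive(x) for x in data)
assert calls == []  # nothing has been evaluated so far

# "Action": consuming the generator triggers the computation (eager).
result = list(doubled)
assert result == [0, 2, 4, 6, 8]
assert calls == [0, 1, 2, 3, 4]  # all the work happened here
```

Deferring work this way lets Spark see the whole chain of transformations before running anything, which is what makes the optimizations on the following slides possible.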
25. Transformations & Actions
• Transformations are the instructions you use to modify the DataFrame in the way you want; they are lazily executed.
  • Narrow transformations: select, filter, withColumn.
  • Wide transformations: groupBy, repartition.
• Actions are statements that ask for a value to be computed immediately; they are eager.
  • Examples: show, collect, save, count.
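Why the narrow/wide distinction matters can be sketched in plain Python, with lists standing in for partitions (a toy model, not Spark's implementation): a narrow transformation like filter works on each partition independently, while a wide one like group-by must move rows between partitions (a shuffle).

```python
from collections import defaultdict

# Two "partitions" of (state, cases) rows.
partitions = [
    [("KA", 10), ("TN", 5), ("KA", 7)],
    [("TN", 3), ("KL", 8)],
]

# Narrow transformation: each output partition depends on exactly one
# input partition, so no data movement between machines is needed.
filtered = [[row for row in part if row[1] >= 5] for part in partitions]

# Wide transformation: grouping by key needs rows from *all* partitions,
# so the data must be shuffled across the cluster first.
shuffled = defaultdict(list)
for part in filtered:
    for state, cases in part:
        shuffled[state].append(cases)

totals = {state: sum(cases) for state, cases in shuffled.items()}
print(totals)  # {'KA': 17, 'TN': 5, 'KL': 8}
```

The shuffle step is the expensive part in a real cluster, which is why minimizing wide transformations is a common Spark tuning goal.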
26. Spark's Catalyst Optimizer
• When you apply transformations, Spark stores them in a Directed Acyclic Graph (DAG).
• Once the DAG is constructed, Spark's Catalyst optimizer performs a set of rule-based and cost-based optimizations to determine a logical and then a physical plan of execution.
• The Catalyst optimizer groups operations together, reducing the number of passes over the data and improving performance.
28. Spark Assignment
• Input:
  • COVID data CSV file
• Expected outputs:
  • Convert all state names to lowercase.
  • Find the day with the greatest number of COVID cases.
  • Find the state with the second-largest number of COVID cases.
  • Find the Union Territory with the least number of deaths.
  • Find the state with the lowest death to total-confirmed-cases ratio.
  • Find the month with the most newly recovered cases; display month names in full (e.g., 02 as February).
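A minimal sketch of the first two tasks in plain Python (the column names `Date`, `State`, and `Confirmed` and the sample rows are assumptions about the CSV, not the actual assignment data; a full solution would use PySpark DataFrames on the cluster):

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the COVID CSV file.
raw = """Date,State,Confirmed
2020-07-01,Kerala,120
2020-07-01,Delhi,310
2020-07-02,Kerala,150
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Task 1: convert all state names to lowercase.
for row in rows:
    row["State"] = row["State"].lower()

# Task 2: find the day with the greatest number of cases.
per_day = defaultdict(int)
for row in rows:
    per_day[row["Date"]] += int(row["Confirmed"])
busiest_day = max(per_day, key=per_day.get)

print(sorted({row["State"] for row in rows}))  # ['delhi', 'kerala']
print(busiest_day)  # 2020-07-01
```

The remaining tasks follow the same pattern: aggregate by a key (state, Union Territory, or month), then rank or divide the aggregates.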
30. Apache Hive
• Uses familiar SQL syntax (HiveQL).
• Scalable: works with "big data" on a cluster.
• Most appropriate for data warehouse applications.
• Easy OLAP queries: far easier than writing MapReduce in Java.
• Interactive and highly optimized.
32. Other Big Data Frameworks
• Apache Pig:
  • Introduces Pig Latin, a scripting language that lets you use SQL-like syntax to define your map and reduce steps.
• Apache HBase:
  • A non-relational, petabyte-scale database.
  • In-memory, based on Google's Bigtable, built on top of HDFS.
• Presto:
  • Can connect to many different "big data" databases and data stores at once, and query across them.
  • Interactive queries at petabyte scale.
• Apache Zeppelin:
  • Interactively run scripts/code against your data.