This document discusses Apache Spark, an open-source cluster computing framework. It provides an overview of Spark, including its main concepts such as RDDs (Resilient Distributed Datasets) and transformations. Spark is presented as a faster alternative to Hadoop for iterative jobs and machine learning because of its ability to keep data in memory. Example code illustrates Spark's programming model in Scala, Python, and Java. The document concludes that Spark offers a rich API that makes data analytics fast, achieving speedups of up to 100x over Hadoop in real applications.
2. Agenda
• Big Data
• Overview of Spark
• Main Concepts
– RDD
– Transformations
– Programming Model
• Observation
3. What is Apache Spark?
• Fast and general cluster computing system, interoperable with Hadoop.
• Improves efficiency through:
  – In-memory computing primitives
  – General computational graph
• Improves usability through:
  – Rich APIs in Scala, Java, Python
  – Interactive shell
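To make the efficiency and usability points concrete, here is a minimal sketch (not from the deck; the app name and file path are placeholders) that caches a dataset in memory and reuses it across two jobs:

import org.apache.spark.{SparkConf, SparkContext}

object QuickTour {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("QuickTour").setMaster("local[*]"))

    // Load a text file and keep it in memory so later jobs skip the disk read.
    val lines = sc.textFile("data.txt").cache() // placeholder path

    // Two separate jobs reuse the cached data instead of re-reading it.
    val total = lines.count()
    val errors = lines.filter(_.contains("ERROR")).count()

    println(s"$errors of $total lines contain ERROR")
    sc.stop()
  }
}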
4. Big Data: Hadoop Ecosystem
6. Comparison with Hadoop

Hadoop                              Spark
----------------------------------  -------------------------------------------
MapReduce framework                 Generalized computation
Data is usually on disk (HDFS)      Data on disk or in memory
Not ideal for iterative work        Data can be cached in memory; great for
                                    iterative work
Batch processing                    Real-time streaming or batch

Spark advantages:
• Up to 10x faster when data is on disk
• Up to 100x faster when data is in memory
• 2-5x less code to write
• Supports Scala, Java and Python
• Code re-use across modules
• Interactive shell for ad-hoc exploration
• Library support: GraphX, Machine Learning, SQL, R, Streaming, …
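To illustrate the "less code to write" row, a minimal word count in Spark's Scala API (a sketch; input and output paths are placeholders, and an existing SparkContext `sc` is assumed):

// Classic word count in a handful of lines.
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

counts.saveAsTextFile("counts") // placeholder output path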
9. System performance degrades gracefully with less RAM

[Bar chart — execution time (s) vs. % of working set in cache:
 cache disabled 69 s, 25% 58 s, 50% 41 s, 75% 30 s, fully cached 12 s]
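The degradation above is governed by how much of the working set stays in memory; a minimal sketch (assumed, not from the deck; the input path is a placeholder and an existing SparkContext `sc` is assumed) of pinning a working set with an explicit storage level:

import org.apache.spark.storage.StorageLevel

val ratings = sc.textFile("ratings.csv") // placeholder path

// Keep the parsed working set in memory across iterations.
val parsed = ratings.map(_.split(",")).persist(StorageLevel.MEMORY_ONLY)

// Each pass of an iterative job now reads from RAM rather than disk.
for (i <- 1 to 10) {
  val n = parsed.filter(_.length > 2).count()
  println(s"iteration $i: $n rows")
}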
10. Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on a cluster
  – Mesos, YARN or standalone mode
• Accesses storage systems via the Hadoop InputFormat API
  – Can use HBase, HDFS, S3, …

[Diagram: your application holds a SparkContext; local threads or a cluster
 manager dispatch work to workers, each running a Spark executor, on top of
 HDFS or other storage]
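A minimal sketch (host names and ports are placeholders) of how the deployment mode is selected through the master URL when the SparkContext is created:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ComponentsDemo")
  .setMaster("local[4]")              // run tasks in 4 local threads
  // .setMaster("spark://host:7077")  // standalone cluster manager
  // .setMaster("mesos://host:5050")  // Mesos cluster manager
  // (on YARN, the master is usually supplied via spark-submit instead)

val sc = new SparkContext(conf)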
12. Key Concept: RDDs (Resilient Distributed Datasets)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations:
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)

Write programs in terms of operations on distributed datasets.
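A small sketch (assumed data; an existing SparkContext `sc` is assumed) contrasting lazy transformations with the actions that actually trigger a job:

val nums = sc.parallelize(1 to 1000)

// Transformations only build up lineage; nothing runs yet.
val evens = nums.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Actions force evaluation and return or save results.
println(squares.count())                 // number of elements
println(squares.take(5).toList)          // first few results on the driver
squares.saveAsTextFile("/tmp/squares")   // placeholder output path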
13. Fault Recovery
RDDs track the series of transformations used to build them (their lineage)
and use it to re-compute lost data; no data is replicated over the wire.

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Python:
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Lineage: HDFS file --filter(func = startsWith(...))--> filtered RDD
 --map(func = split(...))--> mapped RDD]
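The lineage Spark records can be inspected directly; a minimal sketch (the input path is a placeholder, and an existing SparkContext `sc` is assumed) using the RDD's toDebugString:

val lines = sc.textFile("app.log") // placeholder path
val errors = lines.filter(_.contains("ERROR"))
val fields = errors.map(_.split("\t")(2))

// Prints the dependency chain Spark would replay to rebuild lost partitions,
// along the lines of: map <- filter <- textFile.
println(fields.toDebugString)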
14. Language Support
• Standalone programs: Python, Scala & Java
• Interactive shells: Python & Scala
• Performance: Java & Scala are faster due to static typing, but Python is
  often fine.

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
15. Interactive Shell
• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster…
• Or can run locally
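A short assumed session (the README.md path is a placeholder): bin/spark-shell starts the Scala shell and bin/pyspark the Python one, each with a ready-made SparkContext bound to sc:

scala> val lines = sc.textFile("README.md")
scala> lines.count()
scala> lines.filter(_.contains("Spark")).first()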
22. Conclusion
• Spark offers a rich API to make data analytics fast: less code to write
  and fast to run.
• Achieves up to 100x speedups in real applications.
• Growing community.
23. Observations
• A lot of data, of many different kinds, is generated ever faster and needs
  to be analyzed in real time.
• All* products are data products.
• More complicated analytic algorithms are being applied to commercial
  products and services.
• Not all data analysis requires the same accuracy.
• Expectations on service delivery keep increasing.
24. References
• AMPLab at UC Berkeley
• Databricks
• UC BerkeleyX
  – CS100.1x Introduction to Big Data with Apache Spark, starts 23 Feb 2015, 5 weeks
  – CS190.1x Scalable Machine Learning, starts 14 Apr 2015, 5 weeks
• Spark Summit 2014 Training
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
• An Architecture for Fast and General Data Processing on Large Clusters
• Richard's Study Notes
  – Self Study AMPCamp
  – Hortonworks HDP 2.2 Study
The barrier to entry for working with the Spark API is minimal: the classic
word count simply pairs each word as (word, 1L) and combines the counts with
reduceByKey(_ + _).

From http://spark.apache.org/docs/latest/streaming-programming-guide.html:
/**
 * Usage: NetworkWordCount <hostname> <port>
 * To run this on your local machine, you need to first run a Netcat server
 *   `$ nc -lk 9999`
 * and then run the example
 *   `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
 */
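For completeness, a minimal sketch of the program body this usage header describes, modeled on the streaming guide's word count (the actual example in the Spark source may differ in details):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount")
    // Treat the stream as a series of 1-second micro-batches.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Text received over a TCP socket, e.g. from `nc -lk 9999`.
    val lines = ssc.socketTextStream(args(0), args(1).toInt)

    // The word-count pattern from above: (word, 1L), then reduceByKey(_ + _).
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1L))
                      .reduceByKey(_ + _)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}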