The document provides an overview of big data concepts and architectures. It discusses key topics like Hadoop, HDFS, MapReduce, NoSQL databases, and MPP relational databases. It also covers network design considerations for big data, common traffic patterns in Hadoop, and how to optimize performance through techniques like data locality and quality of service policies.
Hoodie: How (And Why) We Built an Analytical Datastore on Spark - Vinoth Chandar
This talk explores a specific problem, ingesting petabytes of data at Uber, and why the team ended up building an analytical datastore from scratch using Spark. It then discusses the design choices and implementation approaches taken in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS.
https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/
SF Big Analytics 2020-07-28
An anecdotal history of the data lake and its various popular implementation frameworks: why certain tradeoffs were made to solve problems such as cloud storage, incremental processing, streaming and batch unification, mutable tables, and more.
Spark AI Summit, Oct 17 2019 - Kim Hammar, Jim Dowling
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and data scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself. Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducibility for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model. We will also discuss the next steps needed to build on this work. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... - Databricks
Uber has a real need to provide faster, fresher data to data consumers and products that run hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture and use cases of the second generation of Hudi, a self-contained Apache Spark library for building large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage while supporting fast ingestion and queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs and operational experience. We will also show how to ingest data into Hudi using the Spark Datasource/Streaming APIs and build notebooks/dashboards on top using Spark SQL.
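Hudi's core primitive, as described in the abstract, is the keyed upsert: a new batch of records replaces existing records that share a record key instead of being appended as duplicates. The pure-Python sketch below illustrates only the semantics; real Hudi applies this at scale on distributed storage via Spark, and the field names here are made up for the example.

```python
# Illustrative simulation of upsert semantics (last-writer-wins per record key).
# This is NOT Hudi's implementation, just the merge behavior it provides.

def upsert(table, batch, key_field="id"):
    """Merge a batch of records into a table: update rows whose key already
    exists, insert rows whose key is new."""
    merged = {row[key_field]: row for row in table}
    for row in batch:
        merged[row[key_field]] = row  # update if key exists, else insert
    return list(merged.values())

table = [{"id": 1, "fare": 10.0}, {"id": 2, "fare": 7.5}]
batch = [{"id": 2, "fare": 8.0}, {"id": 3, "fare": 12.0}]
table = upsert(table, batch)
# id 2 is updated in place, id 3 is inserted; no duplicate rows for id 2
```

Append-only ingestion would have left two rows for id 2; the upsert keeps exactly one row per key, which is what makes fast mutation of analytical tables possible.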
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splice Machine - Chicago Hadoop Users Group
John Leach, Co-Founder and CTO of Splice Machine, with 15+ years of software development and machine learning experience, will discuss how to use HBase co-processors to build an ANSI-99 SQL database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that were traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint transactions are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement for traditional RDBMS solutions.
To view the accompanying slide deck: http://www.slideshare.net/ChicagoHUG/
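The snapshot-isolation model the talk builds on (after Percolator and OMID) can be sketched with a tiny multi-version key-value store: each write commits at a timestamp, and a transaction reads the newest version committed at or before its own snapshot timestamp. This is an illustrative toy, not Splice Machine's actual code; all names are made up.

```python
# Toy multi-version store demonstrating snapshot-isolation reads.

class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value)
        self.clock = 0      # logical timestamp oracle

    def begin(self):
        """Start a transaction; it will read at this snapshot timestamp."""
        self.clock += 1
        return self.clock

    def write(self, key, value):
        """Commit a write at a fresh timestamp (auto-commit for simplicity)."""
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before the snapshot."""
        visible = [(ts, v) for ts, v in self.versions.get(key, [])
                   if ts <= snapshot_ts]
        return max(visible)[1] if visible else None

store = MVCCStore()
store.write("balance", 100)
snap = store.begin()        # reader takes a snapshot
store.write("balance", 50)  # a concurrent writer commits after the snapshot
# the reader at `snap` still sees 100; a later snapshot sees 50
```

The key property is that a reader never sees writes committed after its snapshot, which is what lets SQL queries run consistently over HBase regions without locking out concurrent writers.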
This document provides an overview and deep dive into Robinhood's RDS Data Lake architecture for ingesting data from their RDS databases into an S3 data lake. It discusses their prior daily snapshotting approach, and how they implemented a faster change data capture pipeline using Debezium to capture database changes and ingest them incrementally into a Hudi data lake. It also covers lessons learned around change data capture setup and configuration, initial table bootstrapping, data serialization formats, and scaling the ingestion process. Future work areas discussed include orchestrating thousands of pipelines and improving downstream query performance.
The document discusses managing Hadoop, HBase and Storm clusters at Yahoo scale. It describes Yahoo's grid infrastructure which includes 3 data centers with over 45k nodes across 18 Hadoop clusters, 9 HBase clusters and 13 Storm clusters. It then provides details on the rolling upgrade processes for HDFS, YARN, HBase and Storm which involve minimizing downtime, upgrading components independently and verifying upgrades. CI/CD processes are used to automate software deployment and upgrades.
This document discusses using Sqoop to transfer data between relational databases and Hadoop. It begins by providing context on big data and Hadoop. It then introduces Sqoop as a tool for efficiently importing and exporting large amounts of structured data between databases and Hadoop. The document explains that Sqoop allows importing data from databases into HDFS for analysis and exporting summarized data back to databases. It also outlines how Sqoop works, including providing a pluggable connector mechanism and allowing scheduling of jobs.
The document discusses Rocana Search, a system built by Rocana to enable large scale real-time collection, processing, and analysis of event data. It aims to provide higher indexing throughput and better horizontal scaling than general purpose search systems like Solr. Key features include fully parallelized ingest and query, dynamic partitioning of data, and assigning partitions to nodes to maximize parallelism and locality. Initial benchmarks show Rocana Search can index over 3 times as many events per second as Solr.
This document discusses loading data from Hadoop into Oracle databases using Oracle connectors. It describes how the Oracle Loader for Hadoop and Oracle SQL Connector for HDFS can load data from HDFS into Oracle tables much faster than traditional methods like Sqoop by leveraging parallel processing in Hadoop. The connectors optimize the loading process by automatically partitioning, sorting, and formatting the data into Oracle blocks to achieve high performance loads. Measuring the CPU time needed per gigabyte loaded allows estimating how long full loads will take based on available resources.
Hadoop Infrastructure @Uber: Past, Present and Future - DataWorks Summit
Uber’s mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop plays a critical role in the data infrastructure. This talk covers the journey of Hadoop at Uber and future plans for scaling to billions of trips. We will talk about the most unique use cases Uber has and how the Hadoop ecosystem we built helped us on this journey. We will talk about how we scaled from 10 to 2,000 nodes, and in the future to tens of thousands of nodes. We will talk about our mistakes, learnings and wins, and how we process billions of events per day. We will talk about the unique challenges and real-world use cases, and how we will co-locate Uber’s service architecture with batch workloads (e.g. data pipelines, machine learning and analytical workloads). Uber has made a lot of improvements to the current Hadoop ecosystem and has uniquely solved some problems in ways that have not been done before. This presentation will give the audience an example to follow and encourage them to enhance the ecosystem themselves, growing the community around these projects and the big data space overall. The audience is anybody working on big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes. This talk will help them understand the Hadoop ecosystem and how to use it efficiently, and will also introduce some of the technologies the Uber team is building in the big data space.
Splice Machine is a SQL relational database management system built on Hadoop. It aims to provide the scalability, flexibility and cost-effectiveness of Hadoop with the transactional consistency, SQL support and real-time capabilities of a traditional RDBMS. Key features include ANSI SQL support, horizontal scaling on commodity hardware, distributed transactions using multi-version concurrency control, and massively parallel query processing by pushing computations down to individual HBase regions. It combines Apache Derby for SQL parsing and processing with HBase/HDFS for storage and distribution. This allows it to elastically scale out while supporting rich SQL, transactions, analytics and real-time updates on large datasets.
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0 - Adam Muise
The document discusses Hadoop 2.2.0 and new features in YARN and MapReduce. Key points include: YARN introduces a new application framework and resource management system that replaces the jobtracker, allowing multiple data processing engines besides MapReduce; MapReduce is now a library that runs on YARN; Tez is introduced as a new data processing framework to improve performance beyond MapReduce.
Unified Batch & Stream Processing with Apache Samza - DataWorks Summit
The traditional lambda architecture has been a popular solution for joining offline batch operations with real time operations. This setup incurs a lot of developer and operational overhead since it involves maintaining code that produces the same result in two, potentially different distributed systems. In order to alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experiences running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) Pluggable data sources and sinks; b) A deployment model supporting different execution environments such as Yarn or VMs; c) A unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of lambda architecture. We will use some real production use-cases to elaborate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
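The "unified processing API" idea from this abstract can be illustrated in plain Python (this is not Samza's actual API): the same processing logic runs unchanged over a bounded batch source and an unbounded stream source, instead of being maintained twice as in a lambda architecture. The sources and event names below are invented for the example.

```python
# One shared processing function, applied to both batch and stream inputs.

def count_clicks(events):
    """Shared logic: count events per user; works on any iterable source."""
    counts = {}
    for user in events:
        counts[user] = counts.get(user, 0) + 1
    return counts

batch_source = ["alice", "bob", "alice"]  # e.g. a day of data from HDFS

def stream_source():                       # e.g. a window of a Kafka topic
    yield from ["bob", "alice"]

# One code path instead of two divergent lambda-architecture implementations
batch_counts = count_clicks(batch_source)
stream_counts = count_clicks(stream_source())
```

Because the logic only depends on an iterable of events, the choice of source (bounded file vs. unbounded stream) becomes a deployment detail rather than a second codebase, which is the overhead the lambda architecture forces you to carry.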
This document summarizes Syncsort's high performance data integration solutions for Hadoop contexts. Syncsort has over 40 years of experience innovating performance solutions. Their DMExpress product provides high-speed connectivity to Hadoop and accelerates ETL workflows. It uses partitioning and parallelization to load data into HDFS 6x faster than native methods. DMExpress also enhances usability with a graphical interface and accelerates MapReduce jobs by replacing sort functions. Customers report TCO reductions of 50-75% and ROI within 12 months by using DMExpress to optimize their Hadoop deployments.
Hadoop 2 introduces the YARN framework to provide a common platform for multiple data processing paradigms beyond just MapReduce. YARN splits cluster resource management from application execution, allowing different applications like MapReduce, Spark, Storm and others to run on the same Hadoop cluster. HDFS 2 improves HDFS with features like high availability, federation and snapshots. Apache Tez provides a new data processing engine that enables pipelining of jobs to improve performance over traditional MapReduce.
Optimizing Delta/Parquet Data Lakes for Apache Spark - Databricks
Matthew Powers gave a talk on optimizing data lakes for Apache Spark. He discussed community goals like standardizing method signatures. He advocated for using Spark helper libraries like spark-daria and spark-fast-tests. Powers explained how to build better data lakes using techniques like partitioning data on relevant fields to skip data and speed up queries significantly. He also covered modern Scala libraries, incremental updates, compacting small files, and using Delta Lakes to more easily update partitioned data lakes over time.
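The partitioning technique mentioned above (laying out a data lake as one directory per partition value so queries can skip irrelevant data) can be sketched as follows. The paths and field names are made up for illustration; real engines like Spark perform this pruning from table metadata.

```python
# Illustrative partition pruning over Hive-style partitioned paths.

files = [
    "lake/country=US/part-0.parquet",
    "lake/country=US/part-1.parquet",
    "lake/country=CA/part-0.parquet",
    "lake/country=BR/part-0.parquet",
]

def prune(files, field, value):
    """Keep only files whose partition directory matches the predicate."""
    token = f"{field}={value}"
    return [f for f in files if token in f.split("/")]

# A query like WHERE country = 'CA' now touches one file instead of four.
ca_files = prune(files, "country", "CA")
```

The speed-up comes entirely from the layout: because the partition value is encoded in the directory name, the engine can discard whole directories without opening a single file.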
This document discusses optimizing a data warehouse by using Hadoop to handle large and changing datasets more efficiently. It outlines challenges with traditional data warehousing as data volumes grow. Requirements for an optimized solution include unlimited scalability, handling all data types, and supporting agile methodologies. The document then describes a process flow for offloading ELT and loading to Hadoop. It provides an example use case of updating large datasets on Hadoop more efficiently using partitioning and temporary tables to minimize impact. A demo is referenced to illustrate the approach.
Introduction to Hadoop Ecosystem was presented to Lansing Java User Group on 2/17/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of HDP 2.2 and AWS cloud.
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around - Reynold Xin
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
The document discusses big data and distributed computing. It provides examples of the large amounts of data generated daily by organizations like the New York Stock Exchange and Facebook. It explains how distributed computing frameworks like Hadoop use multiple computers connected via a network to process large datasets in parallel. Hadoop's MapReduce programming model and HDFS distributed file system allow users to write distributed applications that process petabytes of data across commodity hardware clusters.
Apache Hadoop, HDFS and MapReduce Overview - Nisanth Simon
This document provides an overview of Apache Hadoop, HDFS, and MapReduce. It describes how Hadoop uses a distributed file system (HDFS) to store large amounts of data across commodity hardware. It also explains how MapReduce allows distributed processing of that data by allocating map and reduce tasks across nodes. Key components discussed include the HDFS architecture with NameNodes and DataNodes, data replication for fault tolerance, and how the MapReduce engine works with a JobTracker and TaskTrackers to parallelize jobs.
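The map and reduce task allocation described above follows the canonical word-count flow: mappers emit key-value pairs, a shuffle groups them by key, and reducers aggregate each group. This minimal in-memory sketch shows only the data movement; real Hadoop distributes each phase across TaskTrackers.

```python
# In-memory simulation of the map -> shuffle -> reduce flow (word count).
from collections import defaultdict

def map_phase(docs):
    """Mapper: emit (word, 1) for every word in every input split."""
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data on hadoop", "hadoop stores big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["hadoop"] == 2, counts["big"] == 2
```

In a real cluster the JobTracker schedules many mappers and reducers in parallel and the shuffle moves data over the network, but the per-record contract is exactly this simple.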
This presentation will give you information about:
1. HDFS Overview and Architecture
2. Configuring HDFS
3. Interacting With HDFS
4. HDFS Permissions and Security
5. Additional HDFS Tasks
6. HDFS Installation
7. Hadoop File System Shell
8. File System Java API
This document discusses distributed computing and Hadoop. It begins by explaining distributed computing and how it divides programs across several computers. It then introduces Hadoop, an open-source Java framework for distributed processing of large data sets across clusters of computers. Key aspects of Hadoop include its scalable distributed file system (HDFS), MapReduce programming model, and ability to reliably process petabytes of data on thousands of nodes. Common use cases and challenges of using Hadoop are also outlined.
This presentation provides information about Hadoop: what Hadoop is, how Hadoop overcomes the disadvantages of distributed systems, and an example MapReduce program.
The document describes the Hadoop ecosystem and its core components. It discusses HDFS, which stores large files across clusters and is made up of a NameNode and DataNodes. It also discusses MapReduce, which allows distributed processing of large datasets using a map and reduce function. Other components discussed include Hive, Pig, Impala, and Sqoop.
Hadoop is a Java software framework that supports data-intensive distributed applications and is developed under an open source license. It enables applications to work with thousands of nodes and petabytes of data.
This document provides an introduction to big data and related technologies. It defines big data as datasets that are too large to be processed by traditional methods. The motivation for big data is the massive growth in data volume and variety. Technologies like Hadoop and Spark were developed to process this data across clusters of commodity servers. Hadoop uses HDFS for storage and MapReduce for processing. Spark improves on MapReduce with its use of resilient distributed datasets (RDDs) and lazy evaluation. The document outlines several big data use cases and projects involving areas like radio astronomy, particle physics, and engine sensor data. It also discusses when Hadoop and Spark are suitable technologies.
Big data refers to large volumes of unstructured or semi-structured data that is difficult to process using traditional databases and analysis tools. The amount of data generated daily is growing exponentially due to factors like increased internet usage and data collection by organizations. Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses HDFS for reliable storage and MapReduce as a programming model to process data in parallel across nodes.
EclipseCon Keynote: Apache Hadoop - An Introduction - Cloudera, Inc.
Todd Lipcon explains why you should be interested in Apache Hadoop, what it is, and how it works. Todd also brings to light the Hadoop ecosystem and real business use cases that evolve around Hadoop and the ecosystem.
Improving Apache Spark by Taking Advantage of Disaggregated Architecture - Databricks
Shuffle in Apache Spark is an intermediate phase that redistributes data across computing units, with one important primitive: shuffle data is persisted on local disks. This architecture suffers from some scalability and reliability issues. Moreover, the assumption of collocated storage does not always hold in today’s data centers. The hardware trend is moving to a disaggregated storage and compute architecture for better cost efficiency and scalability.
To address the issues of Spark shuffle and support disaggregated storage and compute architecture, we implemented a new remote Spark shuffle manager. This new architecture writes shuffle data to a remote cluster with different Hadoop-compatible filesystem backends.
Firstly, the failure of compute nodes will no longer cause shuffle data recomputation. Spark executors can also be allocated and recycled dynamically which results in better resource utilization.
Secondly, for most customers currently running Spark with collocated storage, it is usually challenging to upgrade the disks on every node to the latest hardware, like NVMe SSDs and persistent memory, because of cost considerations and system compatibility. With this new shuffle manager, they are free to build a separate cluster for storing and serving the shuffle data, leveraging the latest hardware to improve performance and reliability.
Thirdly, in the HPC world, more customers are trying Spark as their high-performance data analytics tool, while storage and compute in HPC clusters are typically disaggregated. This work will make their lives easier.
In this talk, we will present an overview of the issues of the current Spark shuffle implementation, the design of new remote shuffle manager, and a performance study of the work.
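The remote shuffle idea described above can be sketched as follows: instead of each executor writing shuffle blocks to its local disk, map tasks hash-partition their output and write each partition to a shared remote store keyed by (map_id, reduce_id), so reducers can fetch blocks even after the mapper's node is gone. The dict below stands in for a Hadoop-compatible filesystem; this is a toy model, not the talk's actual shuffle manager.

```python
# Toy model of a remote shuffle: map side writes partitions to a shared
# store; reduce side fetches its partition from every map output.

def shuffle_write(remote_store, map_id, records, num_reducers):
    """Map side: hash-partition records, write each partition remotely."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:
        partitions[hash(key) % num_reducers].append((key, value))
    for reduce_id, part in enumerate(partitions):
        remote_store[(map_id, reduce_id)] = part

def shuffle_read(remote_store, reduce_id, num_maps):
    """Reduce side: gather this reducer's partition from all map outputs."""
    out = []
    for map_id in range(num_maps):
        out.extend(remote_store.get((map_id, reduce_id), []))
    return out

store = {}  # stands in for a remote Hadoop-compatible filesystem
shuffle_write(store, 0, [("a", 1), ("b", 2)], num_reducers=2)
shuffle_write(store, 1, [("a", 3)], num_reducers=2)
# all records for a given key land with exactly one reducer
```

Because the blocks live in the remote store rather than on the mapper's disk, losing a compute node no longer forces shuffle recomputation, which is the first benefit the abstract lists.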
Storage and computation are getting cheaper and easily accessible on demand in the cloud. We now collect and store some really large data sets, e.g. user activity logs, genome sequencing, sensory data, etc. Hadoop and the ecosystem of projects built around it present simple and easy-to-use tools for storing and analyzing such large data collections on commodity hardware.
Topics Covered
* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce Jobs (using Hadoop Streaming).
* Introduce Pig Latin, an easy-to-use data processing language.
Speaker Profile: Mahesh Reddy is an entrepreneur, chasing dreams. He works on large-scale crawl and extraction of structured data from the web. He is a graduate from IIT Kanpur (2000-05) and previously worked at Yahoo! Labs as a Research Engineer/Tech Lead on search and advertising products.
Big Data - Introduction (What is Big Data) - AmanCSE050
Contents
Big Data Characteristics
Explosion in Quantity of Data
Importance of Big Data
Usage Example in Big Data
Challenges in Big Data
Hadoop Ecosystem
The document discusses Hadoop, its components, and how they work together. It covers HDFS, which stores and manages large files across commodity servers; MapReduce, which processes large datasets in parallel; and other tools like Pig and Hive that provide interfaces for Hadoop. Key points are that Hadoop is designed for large datasets and hardware failures, HDFS replicates data for reliability, and MapReduce moves computation instead of data for efficiency.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses a simple programming model called MapReduce that automatically parallelizes and distributes work across nodes. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and MapReduce execution engine for processing. HDFS stores data as blocks replicated across nodes for fault tolerance. MapReduce jobs are split into map and reduce tasks that process key-value pairs in parallel. Hadoop is well-suited for large-scale data analytics as it scales to petabytes of data and thousands of machines with commodity hardware.
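The block storage model described above (a file split into fixed-size blocks, each replicated across DataNodes for fault tolerance) can be sketched numerically. Sizes here are scaled down from the real defaults (128 MB blocks, replication factor 3), and the round-robin placement is a stand-in for HDFS's rack-aware placement policy.

```python
# Illustrative model of HDFS block splitting and replica placement.

def place_blocks(file_size, block_size, datanodes, replication=3):
    """Return {block_index: [datanodes holding a replica]}."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # round-robin stand-in for the real rack-aware placement policy
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(file_size=350, block_size=128, datanodes=nodes)
# ceil(350 / 128) -> 3 blocks, each with 3 replicas on distinct DataNodes
```

Losing any single DataNode leaves at least two replicas of every block, which is why HDFS tolerates the hardware failures the summary mentions without losing data.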
The document discusses big data and distributed computing. It explains that big data refers to large, unstructured datasets that are too large for traditional databases. Distributed computing uses multiple computers connected via a network to process large datasets in parallel. Hadoop is an open-source framework for distributed computing that uses MapReduce and HDFS for parallel processing and storage across clusters. HDFS stores data redundantly across nodes for fault tolerance.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of computers. It addresses problems like hardware failure and combining data after analysis. The core components are HDFS for distributed storage and MapReduce for distributed processing. HDFS stores data as blocks across nodes and handles replication for reliability. The Namenode manages the file system namespace and metadata, while Datanodes store and retrieve blocks. Hadoop supports reliable analysis of large datasets in a distributed manner through its scalable architecture.
Similar to Cisco Connect Toronto 2015 Big Data - Sean McKeown
Cisco Connect Montreal 2018 - Network Slicing: Horizontal Virtualization - Cisco Canada
The document discusses network slicing, which is the next step in virtualization for 4G/5G mobile networks. Network slicing allows the core network to be partitioned into multiple logical networks or "slices", each with its own network functions to support the requirements of different services. This approach enables network resources and functions to be allocated to specific services or customer segments in a flexible manner. It reduces complexity compared to existing networks that must support many different services and customers on a single common infrastructure. The key benefits of network slicing include improved network agility and the ability to support diverse service requirements.
The document summarizes a Cisco presentation on next-generation datacenter security. It discusses how the majority of security teams' time is spent securing servers and data in the datacenter. It then covers challenges such as budget constraints, product overload, and complexity of threats. The presentation introduces Cisco's architectural approach to datacenter security focusing on threat prevention, visibility, segmentation, threat intelligence, automation, and analytics. It provides examples of Cisco solutions that integrate to deliver firewall, access control, analytics, and other capabilities.
Cisco Connect Montreal 2018 - Global Vision, Local Analysis - Cisco Canada
The document discusses Cisco's multi-cloud strategy and products. It introduces Cisco Container Platform (CCP) as a solution that automates deploying, running, and operating containers on physical or virtual machines. CCP is based on Kubernetes and provides integrated networking, management, security and analytics capabilities while allowing containers to run in hybrid cloud environments across VM, bare metal, Cisco HyperFlex, ACI and public clouds.
Cisco Connect Montreal 2018 - Security: Securing Your Mobility with Cisco - Cisco Canada
The document discusses Cisco's solutions for securing mobility, including Meraki SM, Cisco AMP for Endpoint, Cisco Umbrella, Cisco Cloudlock, Cisco Cloud Email Security, Cisco Threat Response, Identity Service Engine, and Cisco DUO Security. Representatives from Cisco provide overviews of each solution for securing users, data, and applications across SaaS, PaaS, and IaaS environments.
Cisco Connect Montreal 2018 - Collaboration: Hybrid Webex Services - Cisco Canada
Cisco Connect Montreal provided information on Cisco's Webex Hybrid Services which allow for integration between on-premises and cloud collaboration solutions. The key services discussed included Hybrid Directory Service for user synchronization, Hybrid Calendar Service for calendaring integration, Hybrid Call Service for calling capabilities, Hybrid Message Service for messaging interoperability, and the new Cisco Webex Edge service for enhanced audio, video mesh, and media experiences.
Cisco and Microsoft Integration - Cisco Connect Montreal 2018 - Cisco Canada
The document discusses Cisco and Microsoft integrations for collaboration. It describes major areas of integration including calling, messaging, meetings, email/calendar, content management, and instant messaging. It provides details on Cisco and Microsoft integrations for meetings, with examples of joining internal and external participants. The document also discusses Cisco Spark and Webex capabilities for open collaboration across organizations and platforms.
Cisco Connect Montreal 2018 - Model-Driven Programmability for IOS XR - Cisco Canada
This document summarizes a presentation on model-driven programmability for Cisco IOS XR. The presentation covers data models, management protocols like NETCONF and gRPC, the YANG Development Kit (YDK) SDK, and telemetry. It defines key concepts like model-driven manageability, native and open data models, protocol operations, and the benefits of the YDK for simplifying application development through model-driven abstractions. Example code demonstrates basic YDK usage and a potential peering configuration use case is outlined. Resources for further information are also provided.
Cisco Connect Montreal 2018 - SD-WAN: Delivering Intent-Based Networking to th... - Cisco Canada
The document discusses Cisco SD-WAN and its advantages over traditional and legacy WAN architectures. It highlights how Cisco SD-WAN uses a centralized control plane and software-defined intelligence to provide automated, predictive, and intent-based networking. This allows for flexible, scalable, and secure connectivity across hybrid WAN transports in a way that is simpler to manage and operate than hardware-centric WAN solutions.
Cisco Connect Toronto 2018 - DNA Automation: The Evolution to Intent-Based Net... - Cisco Canada
The document discusses Cisco's DNA Center and its capabilities for automating network management. It covers:
- Why intent-based networking is needed to reduce costs and errors from manual network changes
- How DNA Center supports intent-based networking by allowing administrators to define policies and have them automatically implemented across the network
- Key automation use cases DNA Center addresses like onboarding new devices, managing software upgrades, creating configuration templates, and deploying wireless networks
- Demonstrations of DNA Center's capabilities for plug-and-play deployment, software management, template configuration, and wireless provisioning
Cisco Connect Toronto 2018 - An Introduction to Cisco Kinetic - Cisco Canada
Robert Barton from Cisco presented on Cisco Kinetic, an IoT analytics platform. Cisco Kinetic consists of three modules: the Gateway Management Module for onboarding and managing IoT gateways at scale, the Edge and Fog Processing Module for analyzing IoT data in real-time at the edge, and the Data Control Module for securely routing IoT data between edge, fog, and cloud according to data policies. Cisco Kinetic aims to enable end-to-end IoT analytics across the entire network from device to cloud.
Cisco Connect Toronto 2018 - DevNet Overview - Cisco Canada
Hank Preston, a Cisco engineer, gave a presentation on DevNet and how it is helping developers. He discussed how DevNet has grown significantly, now with over 100,000 members and 500,000 learning labs completed. DevNet provides resources like APIs, sandboxes, and training to help developers build applications and automate networks. Preston emphasized that networks are becoming more programmable and automated through DevNet tools and platforms.
Cisco Connect Toronto 2018 - DNA Assurance - Cisco Canada
The document discusses Cisco's DNA Assurance solution. It provides an agenda that covers business requirements, context, learning, user requirements, technology requirements, and the various components of DNA Assurance including client assurance, network assurance, application assurance, and machine learning. It discusses challenges around network operations including time spent troubleshooting and replicating issues. It also covers how DNA Assurance uses concepts like context, learning, and design thinking to provide insights and automate remediation.
Cisco Connect Toronto 2018 - Network Slicing - Cisco Canada
The document discusses network slicing, which is the partitioning of network resources and functions to run selected applications, services, or connections in isolation from each other for specific business purposes. This allows mobile operators to offer virtual private networks on a common infrastructure through network slicing on an end-to-end basis across access, transport, and core networks. Slicing enables new revenue opportunities through network slices optimized for different vertical industries while simplifying service delivery and management.
Cisco Connect Toronto 2018 the intelligent network with cisco merakiCisco Canada
The document discusses Cisco Meraki's intelligent network and SD-WAN capabilities. It highlights that Meraki has over 14,000 customers using its SD-WAN, it has a renewal rate over 95%, and its newest product is WAN assurance. The presentation provides an overview of Meraki's cloud-managed solutions for wireless, switching, security, and other IT functions. It demonstrates Meraki's network monitoring and troubleshooting tools through examples and a demo of its capabilities.
Cisco Connect Toronto 2018 sixty to zeroCisco Canada
The document discusses automating security tasks through various solutions from Cisco. It introduces the Cisco Advanced Malware Protection (AMP) solution, which uses machine learning to detect known and unknown malware across endpoints, networks, and email. It also introduces Cisco Cognitive Threat Analytics, which analyzes web traffic using machine learning to detect anomalous and malicious activity inside organizations. The document provides examples of how these solutions can automate tasks like hunting for threats, detecting anomalies, and attributing suspicious activity to specific entities. It includes demos of the AMP and Cognitive Intelligence user interfaces.
1. Big Data Architecture and Deployment
Sean McKeown – Technical Solutions Architect
In partnership with:
2. Housekeeping Notes
Thank you for attending Cisco Connect Toronto 2015. A few housekeeping notes to ensure we all enjoy the session today:
§ Please ensure your cellphones/laptops are set to silent so no one is disturbed during the session
§ A power bar is available under each desk in case you need to charge your laptop (Labs only)
3. Agenda
§ Big Data Concepts and Overview
– Enterprise data management and big data
– Infrastructure attributes
– Hadoop, NoSQL and MPP architecture concepts
§ Hadoop and the Network
– Network behaviour, FAQs
§ Cisco UCS for Big Data
– Building a big data cluster with the UCS Common Platform Architecture (CPA)
– UCS networking, management, and scaling for big data
§ Q & A
5. “More data usually beats better algorithms.”
-Anand Rajaraman, SVP @WalmartLabs
6. The Explosion of Unstructured Data
• 1.8 trillion gigabytes of data was created in 2011…
• More than 90% is unstructured data
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
• Most unstructured data is neither stored nor analysed!
[Chart: GB of data (in billions), 2005–2015, structured vs. unstructured data]
Source: Cloudera
7. What is Big Data?
When the size of the data itself is part of the problem.
8. What isn’t Big Data?
• Usually not blade servers (not enough local storage)
• Usually not virtualised (hypervisor only adds overhead)
• Usually not highly oversubscribed (significant east-west traffic)
• Usually not SAN/NAS
9. Classic NAS/SAN vs. New Scale-out DAS
• Traditional – separate compute from storage (bottlenecks, $$$)
• New – move the compute to the storage: a low-cost, DAS-based, scale-out clustered filesystem
11. Three Common Big Data Architectures
• NoSQL – fast key-value store/retrieve in real time
• Hadoop – distributed batch, query, and processing platform
• MPP Relational Database – scale-out BI/DW
12. Hadoop: A Closer Look
§ Hadoop is a distributed, fault-tolerant framework for storing and analysing data
§ Its two primary components are the Hadoop Distributed File System (HDFS) and the MapReduce application engine
[Diagram: Pig, Hive, and HBase running on MapReduce atop the Hadoop Distributed File System (HDFS)]
13. Hadoop: A Closer Look
§ Hadoop 2.0 (with YARN) adds the ability to run additional distributed application engines concurrently on the same underlying filesystem
[Diagram: MapReduce (with Pig and Hive), HBase, Impala, and Spark running on YARN (Resource Negotiator) atop HDFS]
14. Hadoop MapReduce Example: Word Count
Input → Map → Shuffle & Sort → Reduce → Output (analogous to: cat | grep | sort | uniq)
• Input splits: “the quick brown fox”, “the fox ate the mouse”, “how now brown cow”
• Map: each mapper emits (word, 1) pairs, e.g. (the, 1), (quick, 1), (brown, 1), (fox, 1)
• Shuffle & Sort: pairs are grouped by key and routed to reducers
• Reduce: each reducer sums the counts for its keys
• Output: ate, 1; brown, 2; cow, 1; fox, 2; how, 1; mouse, 1; now, 1; quick, 1; the, 3
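The Map → Shuffle & Sort → Reduce pipeline above can be sketched in a few lines of plain Python (a toy illustration of the data flow, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle & sort: group all values by key, as the framework does
    between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

inputs = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = reduce_phase(shuffle(map_phase(inputs)))
# reproduces the slide's output, e.g. counts["the"] == 3
```

Each stage maps onto one step of the Unix analogy: map ≈ cat/grep, shuffle ≈ sort, reduce ≈ uniq -c.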
15. Hadoop Distributed File System
• Scalable & fault tolerant
• Filesystem is distributed, stored across all data nodes in the cluster
• Files are divided into multiple large blocks – 64MB default, typically 128MB – 512MB
• Data is stored reliably; each block is replicated 3 times by default
• Types of nodes:
– Name Node – manages HDFS
– Job Tracker – manages MapReduce jobs
– Data Node/Task Tracker – stores blocks/does work
[Diagram: a file split into blocks 1–6, replicated across data nodes 1–13 in three racks, each rack behind a ToR FEX/switch, alongside a dedicated Name Node and Job Tracker]
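The block-and-replica bookkeeping described above can be illustrated with a toy model (a sketch only; the real Name Node's placement policy is rack-aware and considers free space, and all names and sizes here are illustrative):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS block size
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file of `file_size` bytes occupies (ceiling division)."""
    return (file_size + block_size - 1) // block_size

def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy placement: spread each block's replicas round-robin over the nodes.
    (The real Name Node also accounts for rack topology and disk usage.)"""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [data_nodes[(b + r) % len(data_nodes)]
                        for r in range(replication)]
    return placement

nodes = [f"datanode{i}" for i in range(1, 6)]
blocks = split_into_blocks(700 * 1024 * 1024)  # a 700 MB file -> 6 blocks
layout = place_replicas(blocks, nodes)
```

The point of the sketch is the arithmetic: a file is ceil(size / block_size) blocks, and every block exists on `replication` distinct nodes.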
17. “Failure is the defining difference between distributed and local programming”
- Ken Arnold, CORBA designer
18. HDFS Architecture
• The Name Node holds the filesystem namespace and file-to-block mappings, e.g.:
/usr/sean/foo.txt: blk_1, blk_2
/usr/jacob/bar.txt: blk_3, blk_4
• It also tracks which data nodes hold each block, e.g.:
Data node 1: blk_1
Data node 2: blk_2, blk_3
Data node 3: blk_3
[Diagram: blocks 1–4 replicated across data nodes 1–15 in three racks, each rack behind a ToR FEX/switch, all reachable from the Name Node via a switch]
23. Hadoop Network Design
• The network is the fabric – the ‘bus’ – of the ‘supercomputer’
• Big data clusters often create high east-west, any-to-any traffic flows compared to traditional DC networks
• Hadoop networks are typically isolated/dedicated; simple leaf-spine designs are ideal
• 10GE typical from server to ToR, low oversubscription from ToR to spine
• With Hadoop 2.0, clusters will likely have heterogeneous, multi-workload behaviour
24. Hadoop Network Traffic Types
• Small flows/messaging (admin related, heart-beats, keep-alive, delay-sensitive application messaging)
• Small – medium incast (Hadoop shuffle)
• Large flows (HDFS egress)
• Large pipeline (Hadoop replication)
26. Typical Hadoop Job Patterns
Different workloads can have widely varying network impact. The ratios below compare data read in the Map phase to data written in the Reduce phase, with the Shuffle crossing the network in between:
• Analyse (1:0.25)
• Transform (1:1)
• Explode (1:1.2)
27. Analyse Workload
Wordcount on 200K copies of the complete works of Shakespeare
[Graph: all traffic received on a single node (80 node run). The red line is the total traffic received by hpc064; the other symbols represent individual nodes sending traffic to hpc064. Markers show Maps Start, Reducers Start, Maps Finish, and Job Complete.]
Note: Due to the combination of the length of the Map phase and the reduced data set being shuffled, the network is utilised throughout the job, but by a limited amount.
28. Transform Workload (1TB Terasort)
[Graph: all traffic received on a single node (80 node run). The red line is the total traffic received by hpc064; the other symbols represent individual nodes sending traffic to hpc064. Markers show Maps Start, Reducers Start, Maps Finish, and Job Complete.]
29. Transform Workload (With Output Replication)
§ Replication of 3 enabled (1 copy stored locally, 2 stored remotely)
§ Each reduce output is now replicated, instead of just stored locally
[Graph: all traffic received on a single node (80 node run)]
Note: If output replication is enabled, then at the end of the job HDFS must store additional copies. For a 1TB sort, an additional 2TB will need to be replicated across the network.
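The note's arithmetic generalises: with one replica written locally, every extra replica of the job's output crosses the network. A small sketch (function name is illustrative):

```python
def remote_replication_bytes(output_bytes, replication=3):
    """Bytes that must cross the network when job output is replicated.
    One copy is written locally; the remaining replicas traverse the network."""
    return output_bytes * (replication - 1)

TB = 10 ** 12
extra = remote_replication_bytes(1 * TB, replication=3)
# 1 TB of reduce output with replication 3 -> 2 TB of network traffic
```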
30. Job Patterns – Summary
Job patterns have varying impact on network utilisation:
• Analyse – simulated with Shakespeare wordcount
• Extract Transform Load (ETL) – simulated with Yahoo TeraSort
• Extract Transform Load (ETL) with output replication – simulated with Yahoo TeraSort with output replication
31. Data Locality in Hadoop
The ability to process data where it is locally stored.
[Graph: traffic received by a single node over the life of a job, with markers for Maps Start, Reducers Start, Maps Finish, and Job Complete, and an initial spike in RX traffic during the Map phase]
Observations:
§ Notice the initial spike in RX traffic occurs before the Reducers kick in
§ It represents data that a map task needs but that is not local – sometimes a task is scheduled on a node that does not have the data available locally
§ Looking at the spike, it is mainly data from only a few nodes
33. Can Hadoop Really Use 10GE?
Definitely, so tune for it!
• Analytic workloads tend to be lighter on the network
• Transform workloads tend to be heavier on the network
• Hadoop has numerous parameters which affect the network. Take advantage of 10GE:
– mapred.reduce.slowstart.completed.maps
– dfs.balance.bandwidthPerSec
– mapred.reduce.parallel.copies
– mapred.reduce.tasks
– mapred.tasktracker.reduce.tasks.maximum
– mapred.compress.map.output
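As a hedged illustration, several of the parameters above live in mapred-site.xml. The values below are illustrative starting points only, not recommendations – the right settings depend on cluster size and workload:

```xml
<!-- mapred-site.xml fragment: illustrative values only; tune per cluster -->
<configuration>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.80</value> <!-- hold reducers back until 80% of maps finish -->
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>   <!-- more parallel shuffle fetches to fill a 10GE pipe -->
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value> <!-- compress map output to cut shuffle bytes -->
  </property>
</configuration>
```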
34. Can QoS Help?
An example with HBase
[Diagram: MapReduce traffic (Map 1–N shuffling to Reducer 1–N, plus HDFS output replication) sharing the network with HBase traffic (clients issuing reads/writes and updates to Region Servers, plus major compaction between Region Servers and HDFS)]
36. ACI Fabric Load Balancing
Flowlet Switching
• Flowlet switching routes bursts of packets from the same flow independently, based on measured congestion of both external wires and internal ASICs
• Allows packets from the same flow to take different paths
• Maintains packet ordering
• Better path utilisation
• Transparent – nothing to modify at the host/app level
[Diagram: a TCP flow from H1 to H2 split into flowlets taking different paths through the fabric]
37. ACI Fabric Load Balancing
Dynamic Packet Prioritisation
Real traffic is a mix of large (elephant) and small (mice) flows.
• Standard (single priority): large flows severely impact performance (latency & loss) for small flows
• Dynamic Flow Prioritisation: the fabric automatically gives a higher priority to small flows
Key idea: the fabric detects the initial few flowlets of each flow and assigns them to a high-priority class.
38. Dynamic Packet Prioritisation
Helping heterogeneous workloads
§ 80-node test cluster
§ MemSQL used to generate heavy numbers of small flows – mice
§ Large file copy workload unleashes elephant flows that trample MemSQL performance
§ DPP enabled, helping to “protect” the mice from the elephants
[Chart: read queries/sec (in millions) for MemSQL only vs. MemSQL + Hadoop on a traditional network vs. MemSQL + Hadoop + Dynamic Packet Prioritization – a 2x improvement in reads per second with DPP]
39. Network Summary
• The network is the “system bus” of the Hadoop “supercomputer”
• Analytic- and ETL-style workloads can behave very differently on the network
• Minimise oversubscription, leverage QoS and DPP, and tune Hadoop to take advantage of 10GE
41. “Life is unfair, and the unfairness is distributed unfairly.”
-Russian proverb
42. Hadoop Server Hardware Evolving
Typical 2009 Hadoop node:
• 1RU server
• 4 x 1TB 3.5” spindles
• 2 x 4-core CPU
• 1 x GE
• 24 GB RAM
• Single PSU
• Running Apache
• $
Typical 2015 Hadoop node:
• 2RU server
• 12 x 4TB 3.5” or 24 x 1TB 2.5” spindles
• 2 x 6-12 core CPU
• 2 x 10GE
• 128-256 GB RAM
• Dual PSU
• Running commercial/licensed distribution
• $$$
Economics favour “fat” nodes:
• 6x-9x more data/node
• 3x-6x more IOPS/node
• Saturated gigabit, 10GE on the rise
• Fewer total nodes lowers licensing/support costs
• Increased significance of node and switch failure
43. Cisco UCS Common Platform Architecture
Building blocks for big data:
• UCS 6200 Series Fabric Interconnects
• Nexus 2232 Fabric Extenders (optional)
• UCS Manager
• UCS C220/C240 M4 Servers
• LAN, SAN, Management
44. UCS Reference Configurations for Big Data
Quarter-Rack UCS Solution for MPP, NoSQL – High Performance:
• 2 x UCS 6248
• 8 x C220 M4 (SFF)
• 2 x E5-2680v3
• 256GB
• 6 x 400-GB SAS SSD
Full Rack UCS Solution for Hadoop – Capacity-Optimised:
• 2 x UCS 6296
• 16 x C240 M4 (LFF)
• 2 x E5-2620v3
• 128GB
• 12 x 4TB 7.2K SATA
Full Rack UCS Solution for Hadoop, NoSQL – Balanced:
• 2 x UCS 6296
• 16 x C240 M4 (SFF)
• 2 x E5-2680v3
• 256GB
• 24 x 1.2TB 10K SAS
46. Hadoop and JBOD
Why not use RAID-5?
• It hurts performance:
– RAID-5 turns parallel sequential reads into slower random reads
– RAID-5 means speed is limited to the slowest device in the group
• It’s wasteful: Hadoop already replicates data, so there is no need for more replication
– Hadoop block copies serve two purposes: 1) redundancy and 2) performance (more copies available increases the data locality % for map tasks)
[Diagram: four parallel sequential reads across independent JBOD disks vs. reads striped across a RAID-5 group]
47. Can I Virtualise?
Yes you can (easy with UCS), but should you?
• Hadoop and most big data architectures can run virtualised
• However, this is typically not recommended for performance reasons:
– Virtualised data nodes will contend for storage and network I/O
– The hypervisor adds overhead, typically without benefit
• Some customers are running master/admin nodes (e.g. Name Node, Job Tracker, Zookeeper, gateways, etc.) in VMs, but consider the single point of failure
• UCS is ideal for virtualisation if you go this route
48. Does HDFS Support Storage Tiering?
An archiving example with Hortonworks
• HDP 2.2 uses the concept of storage types – SSD, DISK, ARCHIVE (other distros have similar features)
• The flag is set at a volume level
• Three associated storage policies control file placement:
– HOT: all replicas on DISK
– WARM: one replica on DISK, others on ARCHIVE
– COLD: all replicas on ARCHIVE
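As an illustration, these policies can be applied per path with the `hdfs storagepolicies` subcommand (available in Apache Hadoop 2.6+; the path below is hypothetical):

```shell
# List the storage policies the cluster supports
hdfs storagepolicies -listPolicies

# Mark an illustrative path as COLD: all replicas go to ARCHIVE volumes
hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD

# Confirm the policy now in effect on that path
hdfs storagepolicies -getStoragePolicy -path /data/archive
```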
50. UCS C3160 Dense Storage Rack Server
Up to 360TB in 4RU
• Server node: 2 x E5-2600 v2 CPUs, 128/256GB RAM, 1GB/4GB RAID cache
• 4 rows of hot-swappable, top-load 4TB/6TB HDD – 56 drives total
• Optional disk expansion: 4 x hot-swappable, rear-load LFF 4TB/6TB HDD
• Two 120GB SSDs (OS/Boot)
52. Cisco UCS: Physical Architecture
[Diagram: a clustered pair of 6200 fabric interconnects (Fabric A and Fabric B) with OOB management, uplink ports to SAN A/B and ETH 1/2, and server ports down to a blade chassis (half/full-width B200 blades with virtualised adapters (VICs) behind FEX A/B) and a rack-mount C240 with VIC attached via optional FEX A/B for scalability]
53. CPA: Single-connect Topology
Single wire for data and management, no oversubscription
• 2 x 10GE links per server for all traffic, data and management
• New (cheaper) bare-metal port licensing available
54. CPA: FEX Topology (Optional, For Scalability)
Single wire for data and management
• 2 x 10GE links per server for all traffic, data and management
• 8 x 10GE uplinks per FEX = 2:1 oversubscription (16 servers/rack), no port-channel (static pinning)
55. CPA Recommended FEX Connectivity
• The 2232 FEX has 4 buffer groups: ports 1-8, 9-16, 17-24, 25-32
• Distribute servers across port groups to maximise buffer performance and predictably distribute static pinning on uplinks
56. Virtualise the Physical Network Pipe
[Diagram: what you see – a physical 10GE cable from the server adapter through FEX A to fabric interconnect 6200-A; what you get – virtual cables (VN-Tag) presenting vNIC 1 as vEth 1 and vHBA 1 as vFC 1 on the fabric interconnect, defined in the server’s Service Profile]
ü Dynamic, rapid provisioning
ü State abstraction
ü Location independence
ü Blade or rack
57. “NIC bonding is one of Cloudera’s highest case drivers for misconfigurations.”
http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
58. UCS Fabric Failover
• The fabric provides NIC failover capabilities, chosen when defining a service profile
• Avoids traditional NIC bonding in the OS
• Provides failover for both unicast and multicast traffic
• Works for any OS on bare metal
• (Also works for any hypervisor-based servers)
[Diagram: a Cisco VIC 1225 presenting vNIC 1 to the OS/hypervisor/VM over a physical 10GE cable, with a virtual cable to vEth 1 on 6200-A and a standby vEth 1 on 6200-B via redundant FEXes and the L1/L2 cluster links]
59. UCS Networking with Hadoop
• VNIC 1 on Fabric A with fabric failover to B (internal cluster traffic)
• VNIC 2 on Fabric B with fabric failover to A (external data ingress/egress)
• No OS bonding required
• VNIC 0 (management) wiring not shown for clarity (primary on Fabric B, failover to A)
Note: cluster traffic will flow northbound, through the upstream L2/L3 switching, in the event of a VNIC 1 failover. Ensure appropriate bandwidth/topology.
[Diagram: Data Node 1 and Data Node 2 each dual-homed to fabric interconnects 6200 A and 6200 B in end-host mode (EHM), with data ingress/egress via upstream L2/L3 switching]
60. Create QoS Policies
Leverage the simplicity of UCS Service Profiles:
• Best Effort policy for the management VLAN
• Platinum policy for the cluster VLAN
61. Enable Jumbo Frames for the Cluster VLAN
1. Select the LAN tab in the left pane of the UCSM GUI.
2. Select LAN Cloud > QoS System Class.
3. In the right pane, select the General tab.
4. In the Platinum row, enter 9000 for MTU.
5. Check the Enabled check box next to Platinum.
6. Click Save Changes.
7. Click OK.
63. Cluster Scalability
A general characteristic of an optimally configured cluster is a linear relationship between data set sizes and job completion times.
64. Sizing
Part science, part art
• Start with the current storage requirement:
– Factor in replication (typically 3x) and compression (varies by data set)
– Factor in 20-30% free space for temp (Hadoop) or up to 50% for some NoSQL systems
– Factor in the average daily/weekly data ingest rate
– Factor in the expected growth rate (i.e. increase in ingest rate over time)
• If the I/O requirement is known, use the next table for guidance
• Most big data architectures are very linear, so more nodes = more capacity and better performance
• Strike a balance between the price/performance of individual nodes vs. the total number of nodes
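The checklist above reduces to simple arithmetic. The sketch below is one way to combine the factors; the function name, parameter defaults, and example numbers are all illustrative assumptions, not Cisco sizing guidance:

```python
def raw_capacity_needed(usable_tb, replication=3, compression_ratio=1.0,
                        temp_fraction=0.25, yearly_growth=0.5, years=2):
    """Rough raw-capacity estimate (TB) from the sizing checklist.
    usable_tb         -- current data set size, in TB
    compression_ratio -- e.g. 2.0 means data compresses to half its size
    temp_fraction     -- scratch-space reserve (20-30% typical for Hadoop)
    yearly_growth     -- expected ingest growth rate per year
    years             -- planning horizon"""
    future = usable_tb * (1 + yearly_growth) ** years   # grow the data set
    after_compression = future / compression_ratio      # shrink by compression
    replicated = after_compression * replication        # 3 copies by default
    return replicated * (1 + temp_fraction)             # add temp headroom

# e.g. 100 TB today, 2:1 compression, 3x replication, 25% temp, 50%/yr for 2 yrs
tb = raw_capacity_needed(100, replication=3, compression_ratio=2.0)
```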
66. CPA Sizing and Application Guidelines

Server:                  Best Performance <--------------------> Best Price/TB
CPU                      2 x E5-2680v3    2 x E5-2680v3       2 x E5-2620v3
Memory (GB)              256              256                 128
Disk Drives              6 x 400GB SSD    24 x 1.2TB 10K SFF  12 x 4TB 7.2K LFF
IO Bandwidth (GB/Sec)    2.6              2.6                 1.1

Rack-Level (32 x C220 or 16 x C240):
Cores                    768              384                 192
Memory (TB)              8                4                   2
Capacity (TB)            64               460                 768
IO Bandwidth (GB/Sec)    192              42                  16
Applications             MPP DB, NoSQL    Hadoop, NoSQL       Hadoop
67. Scaling the CPA
• Single Rack: 16 servers
• Single Domain: up to 10 racks, 160 servers
• Multiple Domains: interconnected via L2/L3 switching
68. Scaling via Nexus 9K Validated Design
• Use Nexus 9000 with ACI to scale out multiple UCS CPA domains (1000’s of nodes) and/or to connect them to other application systems
• Enable ACI’s Dynamic Packet Prioritisation and Dynamic Load Balancing to optimise multi-workload traffic flows
69. Scaling the Common Platform Architecture
Consider intra- and inter-domain bandwidth:

Servers per domain       Available north-   Southbound         Northbound         Intra-domain         Inter-domain
(pair of Fabric          bound 10GE ports   oversubscription   oversubscription   server-to-server     server-to-server
Interconnects)           (per fabric)       (per fabric)       (per fabric)       bandwidth (Gbit/s)   bandwidth (Gbit/s)
160                      16                 2:1 (FEX)          5:1                5                    1
128                      32                 2:1 (FEX)          2:1                5                    2.5
80                       16                 1:1 (no FEX)       5:1                10                   2
64                       32                 1:1 (no FEX)       2:1                10                   5
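The per-server bandwidth columns in the table follow directly from the oversubscription ratios: each server has one 10GE link per fabric, divided by the southbound ratio for intra-domain traffic and by both ratios for inter-domain traffic. A small sketch (function name is illustrative):

```python
LINK_GBPS = 10  # one 10GE server link per fabric

def per_server_bandwidth(southbound_oversub, northbound_oversub):
    """Per-fabric server-to-server bandwidth (Gbit/s) implied by the
    oversubscription ratios: southbound alone for intra-domain traffic,
    southbound * northbound for traffic leaving the domain."""
    intra = LINK_GBPS / southbound_oversub
    inter = LINK_GBPS / (southbound_oversub * northbound_oversub)
    return intra, inter

# 160-server domain: 2:1 FEX southbound, 5:1 northbound
intra, inter = per_server_bandwidth(2, 5)  # reproduces the table's 5 and 1
```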
70. Rack Awareness
• Rack Awareness provides Hadoop the optional ability to group nodes together in logical “racks”
• Logical “racks” may or may not correspond to physical data centre racks
• Distributes blocks across different “racks” to avoid the failure domain of a single “rack”
• It can also lessen block movement between “racks”
• Can be useful to control block placement and movement in UCSM-integrated environments
[Diagram: blocks 1–4 replicated across data nodes 1–15 grouped into logical “Rack” 1, “Rack” 2, and “Rack” 3]
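Hadoop learns which logical “rack” a node belongs to from a user-supplied topology script (configured via `net.topology.script.file.name` in Hadoop 2; Hadoop invokes it with one or more host IPs and expects one rack path per input on stdout). The subnet-to-rack mapping below is hypothetical, simply to show the shape of such a script:

```python
#!/usr/bin/env python
# Hypothetical Hadoop rack-topology script: prints one rack path
# per host IP passed on the command line.
import sys

# Illustrative mapping: one logical "rack" per UCS domain
RACK_BY_SUBNET = {
    "10.1.1": "/ucs-domain1",
    "10.1.2": "/ucs-domain2",
    "10.1.3": "/ucs-domain3",
}

def rack_for(host_ip):
    """Map a host IP to its logical rack by /24 subnet prefix."""
    subnet = ".".join(host_ip.split(".")[:3])
    return RACK_BY_SUBNET.get(subnet, "/default-rack")

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Grouping by subnet works when each UCS domain sits on its own subnet; any deterministic host-to-rack mapping will do.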
71. Recommendations: UCS Domains and Racks
Single domain – turn off, or enable at the physical rack level:
• For simplicity and ease of use, leave Rack Awareness off
• Consider turning it on to limit the physical-rack-level fault domain (e.g. localised failures due to physical data centre issues – water, power, cooling, etc.)
Multi domain – create one Hadoop rack per UCS Domain:
• With multiple domains, enable Rack Awareness such that each UCS Domain is its own Hadoop rack
• Provides HDFS data protection across domains
• Helps minimise cross-domain traffic
72. “The future is here, it’s just not evenly distributed.”
-William Gibson, author
73. Summary
Leverage UCS and Nexus to integrate big data into your data centre operations:
• Think of big data clusters as a single “supercomputer”
• Think of the network as the “system bus” of the supercomputer
• Strive for consistency in your deployments
• The goal is an even distribution of load – distribute fairly
• Cisco Nexus and UCS Common Platform Architecture for Big Data can help!
74. dCloud
Customers now get the full dCloud experience!
§ Cisco dCloud is a self-service platform that can be accessed via a browser, a high-speed Internet connection, and a cisco.com account
§ Customers will have direct access to a subset of dCloud demos and labs
§ Restricted content must be brokered by an authorised user (Cisco or Partner) and then shared with the customers (cisco.com user)
§ Go to dcloud.cisco.com, select the location closest to you, and log in with your cisco.com credentials
§ Review the getting started videos and try Cisco dCloud today: https://dcloud-cms.cisco.com/help