This document discusses improving the reliability and availability of Hadoop clusters. It notes that while Hadoop is taking on more database-like features, the uptime of many Hadoop clusters is still an afterthought and SLAs are often lacking. It proposes separating compute and storage to improve availability, as cloud Hadoop offerings do. It also suggests building KPIs and monitoring around Hadoop clusters, similar to how many companies monitor data warehouses. Centralizing Hadoop infrastructure management into a "Big Data as a Service" model is presented as another way to improve reliability.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
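As one hedged example of monitoring such metrics, the sketch below polls the NameNode's JMX servlet over HTTP; the host, port (9870 is the default HTTP port in Hadoop 3.x), and the particular bean attributes are assumptions that vary by Hadoop version and deployment:

```python
import json
import urllib.request

# The NameNode exposes metrics as JSON via its JMX servlet.
# Host/port and the bean/attribute names below are assumptions;
# check your own cluster's version and configuration.
NAMENODE_JMX = "http://namenode.example.com:9870/jmx"

def fetch_bean(query):
    """Fetch a single JMX bean as a dict."""
    with urllib.request.urlopen(f"{NAMENODE_JMX}?qry={query}") as resp:
        return json.load(resp)["beans"][0]

fs = fetch_bean("Hadoop:service=NameNode,name=FSNamesystem")
print("Under-replicated blocks:", fs.get("UnderReplicatedBlocks"))
print("Total load (xceivers):  ", fs.get("TotalLoad"))
```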
This document discusses hybrid analytics and the movement of data between on-premise and cloud environments. It notes that IT infrastructures now require both traditional and cloud-native approaches. Moving forward, hybrid analytics using both on-premise and cloud resources will become more common. Specifically, automatically moving data between on-premise storage and the cloud based on policies will help break down data silos and enable greater analysis. The document also explores using Dell EMC's OneFS software to enable this type of policy-based data movement to and from the cloud.
Dancing elephants - efficiently working with object stores from Apache Spark ... (DataWorks Summit)
As Hadoop applications move into cloud deployments, object stores become more and more the source and destination of data. But object stores are not filesystems: sometimes they are slower, security is different, and they break assumptions that filesystem code relies on.
What are the secret settings to get maximum performance from queries against data living in cloud object stores? They lie at the filesystem client, the file format, and the query engine layers; it's even about how you lay out the files: the directory structure and the names you give them.
We know these things from our work in all these layers, from the benchmarking we've done, and from the support calls we get when people have problems. And now we'll show you.
This talk will start from the ground up with the "why isn't an object store a filesystem?" issue, showing how that breaks fundamental assumptions in code, and so causes performance issues which you don't get when working with HDFS. We'll look at ways to get Apache Hive and Spark to work better, looking at optimizations which have been done to enable this, and at what work is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
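To make the "not a filesystem" point concrete, here is a minimal sketch assuming boto3 and S3: object stores have no rename primitive, so a "rename" is a copy plus a delete per object, which is slow and non-atomic (bucket and key names are placeholders):

```python
import boto3

# S3 has no rename: a "rename" is a server-side copy followed by a
# delete. A filesystem client renaming a directory must do this once
# per object: O(files * bytes), and never atomic.
s3 = boto3.client("s3")

def fake_rename(bucket, src_key, dst_key):
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": src_key},
                   Key=dst_key)
    s3.delete_object(Bucket=bucket, Key=src_key)

fake_rename("my-bucket", "tmp/part-0000", "output/part-0000")
```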
Practice of large Hadoop cluster in China Mobile (DataWorks Summit)
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster with scale more than 1600 nodes, on which we collect data from dozens of distributed clusters and make analysis for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and experience of constructing and tuning this large scale Hadoop cluster. Key points are as follows:
1. About Ambari: we improve Ambari with features like supporting HDFS Federation and Ambari HA, improving its performance and enabling it to support up to 1600 nodes.
2. About HDFS: we build a large HDFS cluster with data up to 60PB, using federation, ViewFS, FairCallQueue. Our best practice of cluster operation and management will also be included.
3. About Flume: we use a modified Flume to collect as much as 200TB of data per day.
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
Supporting Apache HBase: Troubleshooting and Supportability Improvements (DataWorks Summit)
HBase has been in production in hundreds of clusters across the CDH/HDP customer base, and Cloudera/Hortonworks have supported it for many years.
In this talk, based on our support experience, we aim to introduce useful information to troubleshoot HBase clusters efficiently. First off, we (Daisuke at Cloudera support) are going to talk about typical log messages and web UI info which we can use for troubleshooting (especially when struggling with performance issues). Since their meanings have been changing across versions, we would like to show the differences and improvements as well (e.g. HBASE-20232 for memstore flush, HBASE-16972 for slow scanner, HBASE-18469 for request counter, and also HBASE-21207 for sorting in web UI). We (Toshihiro at Cloudera, a former Hortonworks employee) will also cover some new tools (e.g. HBASE-21926 Profiler Servlet, HBASE-11062 htop, etc.), which should also be useful for performance troubleshooting.
While you might be tempted to assume that data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
This document discusses enterprise-grade big data solutions from HPE. It outlines HPE's reference architecture for big data workloads including components like data lakes, data warehouses, archival storage, event processing, and in-memory analytics. It also discusses HPE's investments in Hortonworks and collaboration to optimize Hadoop for performance. The document promotes attending an HPE session at the Hadoop Summit on modernizing data warehouses and visiting the HPE booth for demos and a trivia game.
Apache Hadoop YARN is the modern Distributed Operating System. It enables the Hadoop compute layer to be a common resource-management platform that can host a wide variety of applications. Multiple organizations are able to leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues etc.
In this talk, we'll first hit the ground with the current status of Apache Hadoop YARN: how it is faring today in deployments large and small. We will cover different types of YARN deployments, in different environments and at different scales.
We'll then move on to the exciting present and future of YARN: features that are further strengthening YARN as the first-class resource-management platform for datacenters running enterprise Hadoop. We'll discuss the current status as well as the future promise of features and initiatives such as 10x scheduler throughput improvements, Docker container support on YARN, native support for long-running services (alongside applications) without any changes, seamless application upgrades, fine-grained isolation for multi-tenancy using CGroups on disk and network resources, powerful scheduling features like application priorities and intra-queue preemption across applications, and operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.
This document discusses an advanced visualization tool for Spark and Flink jobs. It collects fine-grained data about task execution, including data characteristics and block fetch information. This information is exposed through a REST API and used to visualize the physical execution plan, detect issues like data skew, and help developers optimize their applications. The tool aims to help understand distributed data processing systems and guide testing of adaptive partitioning techniques. It has been extended to support Flink visualization as well. Future plans include open-sourcing the framework and adding more visualization features and metrics.
High Speed Continuous & Reliable Data Ingest into Hadoop (DataWorks Summit)
This talk will explore the area of real-time data ingest into Hadoop and present the architectural trade-offs, as well as demonstrate alternative implementations that strike the appropriate balance across the following common challenges (a sketch of one such building block follows the list):
• Decentralized writes (multiple data centers and collectors)
• Continuous availability, high reliability
• No loss of data
• Elasticity of introducing more writers
• Bursts in speed per syslog emitter
• Continuous, real-time collection
• Flexible write targets (local FS, HDFS, etc.)
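A minimal sketch of one common building block for these requirements is a replicated log in front of Hadoop; the broker addresses and topic name below are assumptions, and kafka-python stands in here for whichever ingest layer a given implementation actually uses:

```python
from kafka import KafkaProducer

# acks="all" makes the broker confirm replication before acking,
# trading latency for "no loss of data"; retries absorb transient
# broker failures so bursty syslog emitters are not dropped.
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092"],  # assumed hosts
    acks="all",
    retries=5,
    linger_ms=50,  # small batching window to smooth bursts
)

for line in open("/var/log/syslog", "rb"):
    producer.send("syslog-ingest", line)  # assumed topic name
producer.flush()
```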
This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.
Hadoop clusters are operated on an ephemeral basis in the cloud by Qubole, processing over 300 petabytes of data per month across over 100 customers. Qubole addresses challenges of ephemeral clusters through auto-scaling of resources using YARN, optimizing performance for cloud storage, and storing job history remotely. Volatile low-cost nodes are leveraged through policies that ensure data replication despite potential node failures.
Flexible and Real-Time Stream Processing with Apache Flink (DataWorks Summit)
This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.
This document discusses streaming data ingestion and processing options. It provides an overview of common streaming architectures including Kafka as an ingestion hub and various streaming engines. Spark Streaming is highlighted as a popular and full-featured option for processing streaming data due to its support for SQL, machine learning, and ease of transition from batch workflows. The document also briefly profiles StreamSets Data Collector as a higher-level tool for building streaming data pipelines.
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad... (DataWorks Summit)
Businesses often have to interact with different data sources to get a unified view of the business or to resolve discrepancies. These EDW data repositories are often large and complex, are business critical, and cannot afford downtime. This session will share best practices and lessons learned for building a Data Fabric on Spark / Hadoop / Hive / NoSQL that provides a unified view, enables simplified access to the data repositories, resolves technical challenges, and adds business value.
The document discusses Apache Hive and Apache Druid for fast SQL on big data. It provides performance benchmarks showing Hive LLAP is faster than Presto and Spark SQL for TPC-DS queries. It describes features of Hive LLAP including in-memory caching, query result caching, and metadata caching. It also discusses new Hive 3 features like materialized views and optimizer improvements. The document then provides an overview of Apache Druid's capabilities for real-time ingestion and querying of streaming data before discussing how Hive and Druid can work together, with Hive able to push down queries to Druid.
The document summarizes recommendations for efficiently and effectively managing Apache Hadoop based on observations from analyzing over 1,000 customer bundles. It covers common operational mistakes like inconsistent operating system configurations involving locale, transparent huge pages, NTP, and legacy kernel issues. It also provides recommendations for optimizing configurations involving HDFS name node and data node settings, YARN resource manager and node manager memory settings, and YARN ATS timeline storage. The presentation encourages adopting recommendations built into the SmartSense analytics product to improve cluster operations and prevent issues.
1. The document discusses Microsoft's SCOPE analytics platform running on Apache Tez and YARN. It describes how Graphene was designed to integrate SCOPE with Tez to enable SCOPE jobs to run as Tez DAGs on YARN clusters.
2. Key components of Graphene include a DAG converter, Application Master, and tooling integration. The Application Master manages task execution and communicates with SCOPE engines running in containers.
3. Initial experience running SCOPE on Tez has been positive though challenges remain around scaling to very large workloads with over 15,000 parallel tasks and optimizing for opportunistic containers and Application Master recovery.
The document discusses the Stinger Initiative from Hortonworks to improve the performance and capabilities of interactive queries in Hive. The initiative takes a two-pronged approach, focusing on improvements to the query engine and the introduction of a new optimized column store file format called ORCFile. A new Tez execution engine is also introduced to avoid bottlenecks in MapReduce and enable lower latency queries. The goal is to extend Hive's ability to handle interactive queries with response times measured in seconds rather than minutes.
HDFS Tiered Storage: Mounting Object Stores in HDFS (DataWorks Summit)
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, as in Azure HDInsight and Amazon EMR. In these settings, but also in more traditional, on-premises deployments, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year's Summit in San Jose. Lots of progress has been made since then and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
Presentation from physical to virtual to cloud emc (xKinAnx)
The document discusses three paradigm shifts in information technology: 1) From physical to virtual computing as virtualization becomes mainstream, 2) The network becoming the computer through network-centric architectures, 3) Storage evolving from a server-centric to a virtual, flexible model. These shifts are creating an industrialized "cloud computing" platform for intelligent, on-demand delivery of IT services.
ENGAGE2015: How is EMC Transforming Employee Communications? - Kevin Close, EMC (GuideSpark)
ENGAGE2015: Kevin Close, Senior Vice President of Compensation and Benefits at EMC, on approaching and innovating employee communications.
For more on the conference, visit http://www.guidespark.com/engage2015/. Follow GuideSpark on Twitter, LinkedIn and Facebook for up-to-date news.
This document summarizes EMC's position in the data storage market and strategies for capitalizing on emerging trends in big data and cloud computing. EMC is a global leader in data storage and management, with over $21 billion in annual revenues and over 50,000 employees. It has established leadership positions in various storage segments through both organic research and acquisitions. EMC is pursuing a dual innovation strategy of internal R&D and technology acquisitions to expand its portfolio and lead the transition to cloud and big data solutions. This includes over $24 billion invested in R&D and acquisitions from 2003 to present. EMC aims to help customers address the challenges of rapidly growing data volumes and IT budget constraints by transforming their infrastructure for cloud and big data.
The document summarizes how the BI team at Big Fish Games pitched their initial investment, structured their team, and approached their initial build out of BI capabilities. To pitch the initial investment, they focused on compelling business deliverables and iterating over key business problems. For their team structure, they brought in experienced engineers, paired people, and learned through real projects. In their initial build out, they focused on incremental delivery through business projects, gradually transitioned users, and leveraged their vendor(s).
This document provides a beginner's guide to contributing to open source projects. It discusses why people contribute (e.g. to expand knowledge), what organizations gain from contributions (e.g. business enablement), and how to get started. The guide recommends starting with documentation, answering questions, reporting bugs precisely, and eventually writing code as skills are built. Contributing helps individuals and moves projects forward for the benefit of all.
The document discusses EMC's Elastic Cloud Storage (ECS) product. It notes that unstructured data is growing rapidly and new applications require scalable, geo-distributed storage with cloud-like access. ECS provides hyper-scale object storage that can scale to billions of objects, with secure access from anywhere using multiple protocols. It offers an active-active geo-distributed architecture and compelling economics for both appliance and software-only deployments.
EMC Documentum - xCP 2.x Installation and Deployment (Haytham Ghandour)
This document provides guidance on installing and deploying the EMC xCP application. It outlines the key components that must be installed, such as the JDK, Content Server, xPlore, and xMS agent. It also describes how to configure the application server and set up the xMS environment, including importing templates, creating hosts and services, and synchronizing the environment. Finally, it discusses some common deployment issues like incompatible versions of xPlore, Tomcat role configuration errors, and repository name issues. Logs and performance tuning tips are also presented to help troubleshoot failures.
This document provides an overview of the past, present, and future of Apache Hadoop YARN. It discusses how YARN has evolved from Apache Hadoop 2.6/2.7 to now support 2.8 with features like dynamic resource configuration, container resizing, and Docker support. Upcoming work includes support for arbitrary resource types, federation of multiple YARN clusters, and a new ResourceManager UI. The future of YARN scheduling may include distributed scheduling, intra-queue preemption, and scheduling based on actual resource usage.
Centrica implemented a Hadoop data platform to gain insights from large and diverse data sources. This provided a single customer view and enabled new applications and dashboards to improve customer service. The previous data infrastructure was complicated and could not scale to handle growing IoT and smart meter data. The Hadoop implementation followed agile and DevOps practices and has been successful, winning industry awards. Centrica aims to further collaboration and leverage cloud to reduce costs as big data adoption continues.
This document discusses best practices for running Spark in production. It begins with introductions from the presenters and an overview of Spark deployment modes on YARN. The main topics covered are Spark security using Kerberos authentication and authorization, communication channels and encryption in YARN cluster mode, common issues, and performance tuning. For performance, it recommends choosing executor and task sizes to balance efficiency and overhead, and increasing task parallelism to mitigate data skew problems. The goal is to understand workload patterns and monitor behavior to effectively tune Spark for different situations.
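To make the executor-sizing and parallelism advice concrete, here is a minimal PySpark sketch; the specific values are placeholders showing which knobs the talk refers to, not recommendations:

```python
from pyspark.sql import SparkSession

# Mid-sized executors balance GC overhead (too big) against
# per-executor overhead (too small); extra shuffle partitions
# raise task parallelism, which helps dilute skewed keys.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    .config("spark.executor.memory", "8g")        # placeholder value
    .config("spark.executor.cores", "4")          # placeholder value
    .config("spark.executor.instances", "20")     # placeholder value
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)
```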
This document discusses Symantec's journey towards enabling self-service analytics clusters using Cloudbreak and Ambari. It describes how Symantec built a self-service analytics platform using Ambari to automate the deployment of Hadoop clusters on their private OpenStack cloud. However, they later needed a solution that could deploy clusters across different cloud providers. They adopted Cloudbreak to deploy clusters on AWS and contributed extensions like Keystone v3 support to enable Cloudbreak to work with their OpenStack cloud as well. This allows them to deploy analytics clusters across different clouds through a single tool and interface.
Impetus provides expert consulting services around Hadoop implementations, including R&D, assessment, deployment (on private and public clouds), and optimizations for enhanced static shared data implementations.
This presentation covers Advanced Hadoop Tuning and Optimisation.
The document discusses EMC's Elastic Cloud Storage (ECS) product. It provides examples of how ECS has been used by customers for applications such as global content repositories, modern application platforms, geo-scale big data analytics, cold archives, internet of things storage platforms, and analytics requiring data in place. It also outlines new features and integrations for ECS around monitoring, availability, performance, and deployment simplicity.
MT47 Modernize infrastructure for a modern data center (Dell EMC World)
Today's businesses need speed, efficiency and agility to deliver services back to their stakeholders, all at an affordable price. In the Modern Data Center, Flash, along with Scale-out, software-defined solutions, help to automate a modern infrastructure, the foundation of the modern data center. This session will show you how Dell EMC's industry leading storage portfolio can transform your company's infrastructure and drive your success. In addition, learn how to protect your modern data center with Dell EMC's comprehensive data protection portfolio.
Follow us at @DellEMCStorage
Learn more about Dell EMC All-Flash Solutions at DellEMC.com/All-flash.
This document discusses EMC Isilon scale-out NAS storage solutions. It provides an overview of EMC Isilon's market leadership in scale-out NAS, key trends in unstructured data growth, and how Isilon addresses next-generation workloads. The document also outlines Isilon's hardware and software features like its OneFS operating system, data protection and management tools, and product family which scales from high transactional to high density platforms.
This document discusses Dell's solutions for big data and analytics workloads. It describes Dell's portfolio for unstructured analytics including storage, servers, and reference architectures. It also outlines Dell's vision for a unified streaming and batch analytics platform called Project Nautilus that would integrate Isilon storage with real-time stream processing.
EMC Symmetrix VMAX: An Introduction to Enterprise Storage: Brian Boyd, Varrow... (Brian Boyd)
This session gives an overview of the EMC Symmetrix VMAX enterprise storage array. We will discuss the appropriate time to start looking at enterprise storage in your datacenter, the benefits and differences in technology between VMAX and other storage arrays, and give specific examples of how VMAX has helped our customers in their environments.
The document discusses using IBM Flash and solutions to gain enhanced business insights from data. It describes how unstructured data is growing exponentially and how analytics is critical for businesses to gain insights. It then outlines IBM's flash storage portfolio, including all-flash arrays like FlashSystem and DeepFlash, a new class of flash optimized for big data workloads. It also discusses data protection schemes, shared storage versus shared-nothing architectures, and IBM tools for analytics, data management and security like Spectrum Scale, Spectrum Control and the Security Key Lifecycle Manager.
ITsubbotnik Spring 2017: Dmitriy Yatsyuk "A turnkey, integrated infrastructure..." (epamspb)
This document discusses a confidential proposal from EPAM to provide big data solutions and services for a client. It outlines EPAM's experience with Hadoop, AWS, data engineering, ETL, analytics dashboards, and security implementations. The proposal describes setting up production and staging environments with Hadoop, Zabbix, Jenkins, Chef, Tableau, and integrating them with the client's existing infrastructure. It highlights EPAM's big data competency center and capabilities in data strategy, architecture, analytics, and platform support.
The document discusses disaggregated Hadoop stacks and data storage models. It compares tightly coupled and disaggregated models, with disaggregated providing more flexibility and variable costs. Examples are given showing how disaggregation can reduce data center space and costs when scaling to petabytes and exabytes of data. Performance tests show disaggregated storage on EMC Isilon outperforming direct-attached storage for Hadoop workloads. The document argues disaggregation allows choice of tools, data availability across locations, and flexibility.
This document discusses EMC's Isilon scale-out NAS architecture and how it can be used as the data lake foundation for analytics workloads like Hadoop. It provides an overview of how Isilon implements the HDFS protocol to allow analytics jobs to run directly against data stored in the Isilon cluster. It also highlights the performance benefits of using shared storage on Isilon versus local disks, including up to 4x faster ingest and 1.5x faster job runtimes. Finally, it discusses how Isilon supports features like data tiering, multi-tier architectures with different node types, and integration with object storage for archiving.
The document discusses IBM Spectrum Scale, a software-defined storage solution from IBM. It provides:
1) A family of software-defined storage products including IBM Spectrum Control, IBM Spectrum Protect, IBM Spectrum Archive, IBM Spectrum Virtualize, IBM Spectrum Accelerate, and IBM Spectrum Scale.
2) IBM Spectrum Scale allows storing data everywhere and running applications anywhere. It provides highly scalable, high-performance storage for files, objects, and analytics workloads.
3) The document provides an overview of the IBM Spectrum Scale product and its capabilities for optimizing storage costs, improving data protection, enabling global collaboration, and ensuring data availability, integrity and security.
This document summarizes a presentation about scale-out converged solutions for analytics. The presentation covers the history of analytic infrastructure, why scale-out converged solutions are beneficial, an analytic workflow enabled by EMC Isilon storage and Hadoop, test results showing performance benefits, customer use cases, and next steps. It includes an agenda, diagrams demonstrating analytic workflows, performance comparisons, and descriptions of enterprise features provided by using EMC Isilon with Hadoop.
This document provides an overview of IBM's Hadoop solution on Power Systems, including:
- The basic architecture of IBM's Hadoop solution using Power Systems servers and GPFS storage.
- Considerations for sizing a Hadoop cluster, such as compression rates and space for shuffle/sort data.
- The IBM Solution for Hadoop POWER System edition and IBM Data Engine for Analytics solutions.
- Networking recommendations for Hadoop clusters including appropriate switches and cabling.
EMC presented an overview of SQL Server 2012 and how it can help organizations unlock insights from data, improve performance of mission critical applications, and create business solutions across on-premises and cloud environments. EMC positions itself as the leader in mission critical infrastructure and discusses how its storage solutions like VNX, VMAX, and FAST cache can boost the performance of SQL Server workloads by 3-4x while improving reliability, availability, backup speeds and reducing storage needs. The presentation provides best practices for optimizing SQL Server deployments and highlights EMC's management and data protection tools for SQL Server.
How the Development Bank of Singapore solves on-prem compute capacity challen... (Alluxio, Inc.)
The Development Bank of Singapore (DBS) has evolved its data platforms over three generations to address big data challenges and the explosion of data. It now uses a hybrid cloud model with Alluxio to provide a unified namespace across on-prem and cloud storage for analytics workloads. Alluxio enables "zero-copy" cloud bursting by caching hot data and orchestrating analytics jobs between on-prem and cloud resources like AWS EMR and Google Dataproc. This provides dynamic scaling of compute capacity while retaining data locality. Alluxio also offers intelligent data tiering and policy-driven data migration to cloud storage over time for cost efficiency and management.
The Transformation of your Data in modern IT (Presented by DellEMC) (Cloudera, Inc.)
Organizations have a wealth of data contained within their existing infrastructures. At DellEMC we're helping customers remove the barriers of legacy datastores and transforming the customer experience in the modern datacentre. Learn how to unshackle the valuable data inside your existing data warehouse and leverage new techniques, applications, and technology to enhance the financial impact of all your data sources.
Highly Available, Highly Scalable: Enterprise Manager 12c for Large Enterprises discusses using Oracle Enterprise Manager (EM) 12c to monitor a large enterprise environment with thousands of database instances, application servers, and other targets across multiple platforms and versions. It describes how EM 12c provides highly available monitoring with redundancy and disaster recovery, and how it addresses challenges of managing and reporting at large scale. Key points covered include building a highly available EM infrastructure, managing targets and alerts in bulk, leveraging the metric framework and reporting capabilities, and performing regular maintenance tasks to keep the EM environment healthy.
This document discusses Dell EMC ScaleIO software-defined block storage. It provides an overview of ScaleIO and its benefits, including massive scalability from 3 to over 1,000 nodes, extreme performance with tens of millions of IOPS, unparalleled flexibility to deploy on any hardware and choice of configurations, supreme elasticity to scale on the fly without downtime, and compelling economics with lower TCO. Case studies show how ScaleIO has helped customers drastically reduce costs, improve performance, and scale their storage infrastructure elastically.
The document provides an overview of EMC's big data solutions. It discusses the challenges of big data for IT in terms of complexity from multiple Hadoop distributions, costs of acquisition and operations, and security and governance challenges. It then introduces EMC's Hadoop starter kit which provides a simple and cost-effective way for customers to get started with Hadoop deployments on their existing EMC infrastructure. The starter kit includes deployment guides for various Hadoop distributions including Cloudera, Hortonworks, PivotalHD and Apache. It has seen over 1500 deployments worldwide.
ADV Slides: Platforming Your Data for Success - Databases, Hadoop, Managed Ha... (DATAVERSITY)
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We'll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it's a bad fit.
Drop the herd mentality. In reality, there is no "one size fits all" right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
The document discusses trends in data and analytics, including the growth of digital data and devices. It summarizes predictions that by 2020 there will be over 30 billion connected devices, 7 billion people, and over 1 million new businesses. The document also discusses how analytics is converging databases and Hadoop to enable querying both structured and unstructured data, and how this will impact industries and skills. It focuses on trends like machine learning and the increasing importance of outcomes over specific technologies like Hadoop.
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
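As a hedged illustration of the authentication and encryption switches such a talk covers, the configuration keys below exist in recent Spark releases, though which ones apply depends on your deployment (Kerberos credentials are normally supplied via spark-submit and are omitted here):

```python
from pyspark.sql import SparkSession

# Illustrative security switches only; verify key names and
# semantics against the Spark version actually deployed.
spark = (
    SparkSession.builder
    .config("spark.authenticate", "true")            # SASL auth between daemons
    .config("spark.network.crypto.enabled", "true")  # encrypt RPC channels
    .config("spark.io.encryption.enabled", "true")   # encrypt shuffle spills
    .getOrCreate()
)
```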
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year, with the goal of an open, collaborative ecosystem around shared metadata.
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
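A pipeline of the kind described (tokenization, feature extraction, then a model) can be sketched with Spark MLlib's standard components; the toy corpus and column names here are invented for illustration:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-mining-sketch").getOrCreate()

# Toy labeled corpus; a real job would read from HDFS or object storage.
df = spark.createDataFrame(
    [("spark makes big data simple", 1.0),
     ("the weather is nice today", 0.0)],
    ["text", "label"],
)

# Chain tokenization -> term frequencies -> IDF weighting -> classifier.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(df)
```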
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
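The linear regression example mentioned above can be sketched in a few lines of PySpark; the toy data and column names are invented for illustration:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("linreg-sketch").getOrCreate()

# Toy data where y is roughly 2*x; real input would come from a DataFrame source.
df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "y"])
train = (VectorAssembler(inputCols=["x"], outputCol="features")
         .transform(df)
         .withColumnRenamed("y", "label"))

lr_model = LinearRegression(maxIter=10).fit(train)
print(lr_model.coefficients, lr_model.intercept)
```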
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations currently process various types of data in different formats. Most often this data is in free form. As the consumers of this data grow, it's imperative that this free-flowing data adhere to a schema. It helps data consumers to have an expectation about the type of data they are getting, and it lets them avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume that schema without any impact if the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, etc.
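The register/consume flow is easy to sketch against such a REST API. The following is illustrative only: the endpoint path, port, and payload shape are assumptions modeled on a typical schema-registry API, not the documented interface of any particular registry version:

```python
import json
import urllib.request

# Hypothetical registry endpoint; path and payload shape are
# assumptions and may differ from your registry's actual API.
REGISTRY = "http://registry.example.com:9090/api/v1/schemaregistry"

avro_schema = {
    "type": "record",
    "name": "TruckEvent",
    "fields": [{"name": "driver_id", "type": "long"}],
}

# Register a new version of the (assumed) "truck-events" schema.
req = urllib.request.Request(
    f"{REGISTRY}/schemas/truck-events/versions",
    data=json.dumps({"schemaText": json.dumps(avro_schema)}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(urllib.request.urlopen(req).status)
```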
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
DeepLearning is not just hype - it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed, using DeepLearning4J on top of different big data engines like Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning - a domain that current deep learning research treats step-motherly. As we can see in this demo, LSTM networks can learn very complex system behavior - in this case, data coming from a physical model simulating bearing vibration data. One drawback of deep learning is that normally a very large labeled training data set is required. This is particularly interesting because we can show how unsupervised machine learning can be used in conjunction with deep learning - no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open source. Only open source components are used.
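The talk's implementation uses DeepLearning4J; as a plainly swapped-in Python illustration of the same unsupervised idea, here is a minimal Keras sketch of an LSTM autoencoder that flags windows it reconstructs poorly (shapes, data, and threshold are placeholders):

```python
import numpy as np
from tensorflow import keras

# Unsupervised idea: train an LSTM autoencoder on "normal" sensor
# windows only; high reconstruction error at inference time is
# treated as an anomaly. All sizes below are placeholders.
timesteps, features = 50, 3
x_normal = np.random.randn(1000, timesteps, features).astype("float32")

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(timesteps, features)),
    keras.layers.RepeatVector(timesteps),
    keras.layers.LSTM(32, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(features)),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_normal, x_normal, epochs=5, batch_size=64)

def is_anomaly(window, threshold=1.0):
    """Flag a (timesteps, features) window whose reconstruction error is high."""
    err = np.mean((model.predict(window[None]) - window) ** 2)
    return err > threshold
```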
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases, cutting across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, the rate of false positives among reported defects is higher, and these are generally wasteful.
At Hortonworks, we've designed and implemented an Automated Log Analysis System (Mool), using statistical data science and ML. The work in progress currently has a batch data pipeline followed by an ensemble ML pipeline that feeds into the recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records, identifying the root cause of errors across multiple components. The system works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file new or reopen past tickets and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how "The Natixis Pack" has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
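As one concrete instance of "intelligent key design", a common pattern is salting: prefixing row keys with a hash bucket so monotonically increasing keys spread across regions instead of hot-spotting one server. A minimal sketch, with the bucket count and key layout as assumptions:

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; roughly match the number of regions

def salted_key(natural_key: str) -> bytes:
    """Prefix a stable hash bucket so sequential keys (timestamps,
    counters) scatter across regions instead of hitting one server."""
    digest = hashlib.md5(natural_key.encode()).digest()
    bucket = digest[0] % NUM_BUCKETS
    return f"{bucket:02d}|{natural_key}".encode()

# e.g. events keyed by timestamp land in 16 different key ranges:
print(salted_key("2016-06-30T12:00:01|sensor-42"))
```

The trade-off is that simple range scans over the natural key now require one scan per bucket, which is why the bucket count should stay small and stable.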
There has been an explosion of data digitising our physical world: from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. This means that companies now have new ways to transform their businesses, both operationally and through their products and services, by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is "no" in most cases.
In this session, we'll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
The Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure where HDFS stores all of an organization's data. The new use case presents a new set of challenges to the original HDFS architecture. One challenge is to scale the storage management of HDFS: the centralized scheme within the NameNode becomes a main bottleneck which limits the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient at handling large amounts of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
Self-Healing Test Automation Framework - Healenium (Knoldus Inc.)
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
Increase Quality with User Access Policies - July 2024 (Peter Caitens)
Increase Quality with User Access Policies, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about "User Access Policies" and how they can help you onboard users faster with greater quality.
Top 12 AI Technology Trends For 2024.pdf (Marrie Morris)
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
Discovery Series - Zero to Hero - Task Mining Session 1 (DianaGray10)
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations (webbyacad software)
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
It's your unstructured data: How to get your GenAI app to production (and spe... (Zilliz)
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists scale their app to thousands or even millions of users using corporate unstructured data. Aparavi's AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment, and never having to share private corporate data with anyone - not even Aparavi.
The Challenge of Interpretability in Generative AI Models.pdf (Sara Kroft)
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
Demystifying Neural Networks And Building Cybersecurity ApplicationsPriyanka Aash
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
š Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
š” Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
š How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
š Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
š Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
š® Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. š§ š¼āØ
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
Retrieval Augmented Generation Evaluation with RagasZilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ā.
š Agenda:
12:30 Welcome Coffee/Light Lunch ā
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges weāve faced, and the best practices weāve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites š»
2. Welcome!
Dr. Stefan Radtke
CTO Isilon, EMEA
EMC Emerging Technology Division
- 1995-2011: 17 years at IBM in various technical roles
- 2011: Joined EMC
- 2012-today: CTO, EMEA for EMC Isilon
Phone: +49-176-34434460
E-Mail: Stefan.Radtke@emc.com
Linkedin: http://de.linkedin.com/in/drstefanradtke
Blog: http://stefanradtke.blogspot.com
3. System Availability
Uptime Downtime (per year)
99.999% (AKA 5 nines) 5.26 minutes
99.99% (AKA 4 nines) 52.6 minutes
99.5% 1.83 days
99% (AKA 2 nines) 3.65 days
95% 18.25 days
What is your Data Warehouse's uptime SLA?
What is your Hadoop uptime SLA?
Why are they different?
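Each row is simple arithmetic on a 365-day year, which makes the gap between "five nines" and 95% concrete. A minimal Python sketch that reproduces the table:

```python
# Downtime per year implied by an availability percentage (365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (99.999, 99.99, 99.5, 99.0, 95.0):
    downtime_min = MINUTES_PER_YEAR * (1 - availability / 100)
    if downtime_min < 120:
        print(f"{availability}% uptime -> {downtime_min:.2f} minutes of downtime/year")
    else:
        print(f"{availability}% uptime -> {downtime_min / (24 * 60):.2f} days of downtime/year")
```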
4. We have good Hadoop Outcomes
• Smart Grid
  – Fraud / broken devices & grid traffic projections
• Fraud
• Healthcare research
  – Genomes and healthcare (BRCA)
• Connected Car – Tesla
5. Hadoop takes on DB-like Features
• Newly added features in Hadoop 3.0
  – Erasure Coding (HDFS-EC / HDFS-7485) is being introduced to Hadoop
  – Additional standby NameNodes for increased resiliency (HDFS-6440)
• Future features
  – Random read support from an indexed NameNode (HDFS-8555)
  – Disaster Recovery (HDFS-5442)
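The appeal of erasure coding is the storage math. A minimal sketch comparing the raw-capacity overhead of classic 3x replication against a Reed-Solomon (6,3) layout (one commonly cited HDFS-EC policy; the 100 TB figure is just an example):

```python
# Raw capacity consumed per byte of user data, under replication vs. erasure coding.
def replication_overhead(replicas: int) -> float:
    return float(replicas)

def ec_overhead(data_blocks: int, parity_blocks: int) -> float:
    return (data_blocks + parity_blocks) / data_blocks

user_tb = 100
print(f"3x replication: {user_tb * replication_overhead(3):.0f} TB raw for {user_tb} TB of data")
print(f"RS(6,3) EC:     {user_tb * ec_overhead(6, 3):.0f} TB raw for {user_tb} TB of data")
```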
6. So...
• IF Hadoop is the Modern Database
AND
• IF Hadoop is taking on more Modern Database features
AND
• Successful outcomes are becoming more prolific...
Why do Hadoop operations and uptime SLAs seem like such an afterthought on most clusters?
7. KPIs
• Why do companies with VERY successful Data Warehouses, ETL processes, and KPI dashboards have so few of those for the Hadoop instance that now generates all of their Machine Learning and Data & Analytics?
8. What can go wrong?
• Forbes: "...haven't taken into account some long-term or ongoing cost associated with the project..."
• Information Week: "...unanticipated problems beyond the big data technology..."
• Computerworld: "...there are enterprises that underestimated the paradigm shift..."
9. An Intervention
• Why does the concept of 99.99% availability seem bad for a production Hadoop system?
• Why do solid KPIs around data collection and capture sound absurd?
• Since when did a backup copy of your primary analytics data become unnecessary?
• Is this just because Hadoop is about standing up cheap hardware?
• Why do companies need a catalyst before these things seem common again?
10. Why wouldn't you want:
• Two fully addressable clusters with data replication, located in separate geographies
• Data re-silvering when additional capacity is added
• Complete fault tolerance in the environment, not just data/node redundancy, to allow four-nines availability
• Operational scale that allows 24x7 support
[Diagram: full and empty nodes re-silver to a balanced state after capacity is added]
11. What is my Idea - 1
• Separation of compute and storage
  – Why do you think cloud Hadoop can offer better SLAs than on-premise Hadoop? It isn't because of a ton of single-point-of-failure compute boxes: they separate compute and storage.
• Look at Infrastructure / Big Data as a Service centralization
  – Instead of trying to staff 25 Hadoop clusters for 24x7 support, centralize the team and provide QoS back to the applications
12. Data Gravity
• Data sets get bigger over time, and moving them becomes increasingly difficult
  – This leads to switching costs & lock-in
• Data is a strategic asset to enterprises with digital strategies
• Data becomes central: build around it
  – Applications tend to migrate toward the data
  – Apply advanced analytics to the data "in place"
14. THE PROBLEM OF DATA MOVEMENT
• To get statistically relevant results, a typical minimal required data set is about 100 TB.
• That's also the recommended minimal Hadoop cluster size.
• To copy 100 TB over a dedicated 10 GbE link takes about 24 hours.
You need a Data Lake that understands POSIX/Windows and HDFS to avoid data movement (= in-place analytics).
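The 24-hour figure is back-of-the-envelope arithmetic. A minimal sketch, assuming an ideal, fully saturated link with no protocol overhead:

```python
# Hours needed to move a data set across a dedicated network link at line rate.
def transfer_hours(data_tb: float, link_gbps: float) -> float:
    bits = data_tb * 1e12 * 8            # terabytes -> bits
    return bits / (link_gbps * 1e9) / 3600

print(f"{transfer_hours(100, 10):.1f} hours")  # ~22 h ideal; ~24 h with real-world overhead
```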
15. EMC DATA LAKE
[Diagram: departmental server silos (Finance, Marketing, Operations, Sales) running ERP, CRM, and SCM applications consolidate onto an Isilon-based Data Lake, which also serves analytics and mobile applications]
17. Isilon Data Lake Architecture
[Diagram: clients connect over a GB/10GB Ethernet LAN to a scale-out Data Lake of Isilon nodes; the nodes use SAS-attached disk and are interconnected via InfiniBand]
• OneFS integrates RAID, volume manager, and filesystem
• Uses internal disk and spans a single filesystem across disks
• Development started in the 2000s
• Extremely mature, based on FreeBSD
• Supports many access protocols
18. HDFS Implementation as a Protocol
• Multi-threaded daemon runs on all nodes
  – Services both NN and DN protocols
  – Translates HDFS RPCs to POSIX system calls
  – Stateless; the underlying FS handles coherency
[Diagram: on a OneFS node, the isi_hdfs_d daemon handles each request on a thread, issues a syscall through the VFS into OneFS, and sends the response back]
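To make "translates HDFS RPCs to POSIX system calls" concrete, here is a hypothetical Python sketch of the dispatch idea. The real isi_hdfs_d is a native daemon inside OneFS; the operation names and mappings below are illustrative only:

```python
import os

# Hypothetical mapping of a few HDFS NameNode RPCs onto POSIX calls.
# Illustrative only: the real daemon is stateless native code inside OneFS.
def handle_rpc(op: str, path: str):
    if op == "getFileInfo":
        st = os.stat(path)                 # HDFS getFileInfo -> stat(2)
        return {"length": st.st_size, "mtime": st.st_mtime}
    if op == "getListing":
        return os.listdir(path)            # HDFS getListing -> readdir(3)
    if op == "mkdirs":
        os.makedirs(path, exist_ok=True)   # HDFS mkdirs -> mkdir(2)
        return True
    if op == "delete":
        os.unlink(path)                    # HDFS delete -> unlink(2)
        return True
    raise NotImplementedError(op)

print(handle_rpc("getListing", "/tmp"))
```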
19. HDFS IMPLEMENTED LIKE A NAS PROTOCOL
OneFS runs a daemon that speaks NameNode and DataNode natively.
[Diagram: each node in the OneFS clustered filesystem presents both a NameNode and a DataNode; a Hadoop node's DFSClient 1) sends Request("/file") to any node, 2) receives block locations in the response, and 3) issues GetBlock(block) to read the data]
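The three-step exchange in the diagram can be sketched end to end. Everything below is a hypothetical in-memory stand-in (the real client is Hadoop's DFSClient speaking its RPC and streaming protocols); it only illustrates the message order and the fact that every node can play both roles:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for OneFS nodes; every node answers both
# NameNode-style metadata requests and DataNode-style block reads.
@dataclass
class Block:
    block_id: int
    payload: bytes

class OneFSNode:
    def __init__(self, blocks):
        self.blocks = {b.block_id: b for b in blocks}

    def get_block_locations(self, path):       # NameNode role (steps 1 and 2)
        return [(self, bid) for bid in sorted(self.blocks)]

    def get_block(self, block_id):             # DataNode role (step 3)
        return self.blocks[block_id].payload

def dfs_read(any_node, path):
    data = b""
    for node, block_id in any_node.get_block_locations(path):
        data += node.get_block(block_id)
    return data

node = OneFSNode([Block(0, b"hello "), Block(1, b"world")])
print(dfs_read(node, "/file"))  # b'hello world'
```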
23. CLOUDPOOLS
[Diagram: a cloud-enabled Data Lake in which CloudPools tiers data between the data center and a cloud provider based on access time, transparently to apps & users]
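CloudPools is policy-driven tiering, with access time as the trigger in this diagram. A minimal sketch of such a policy (hypothetical code, not the OneFS implementation; the 90-day threshold is an arbitrary example):

```python
import os
import time

# Hypothetical access-time policy: files untouched for `days` get tiered out.
def files_to_tier(root: str, days: int = 90):
    cutoff = time.time() - days * 86400
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_atime < cutoff:
                yield path  # candidate for movement to the cloud tier

for path in files_to_tier("/data"):
    print("tier to cloud:", path)
```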
24. Parallel Replication
• Designed ground-up for scale-out storage
• Aggregate throughput scales with capacity
• Maintain consistent RPO over growing data sets
• Underlying FS knowledge
  – Snapshot integration
  – Block-level deltas
  – Rich metadata transfer
• Automated data failover/failback
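"Maintain consistent RPO over growing data sets" is the key claim: if aggregate replication throughput scales with the cluster, the time to ship a day's worth of change stays flat as nodes are added. A quick illustration (all figures hypothetical):

```python
# Hypothetical: per-node replication throughput and per-node daily churn are
# both constant, so catch-up time (and hence RPO) stays flat as nodes grow.
PER_NODE_GBPS = 2.0            # replication throughput contributed per node
DAILY_CHANGE_TB_PER_NODE = 1   # data churn per node per day

for nodes in (4, 16, 64):
    delta_bits = nodes * DAILY_CHANGE_TB_PER_NODE * 1e12 * 8
    throughput = nodes * PER_NODE_GBPS * 1e9
    print(f"{nodes} nodes: {delta_bits / throughput / 3600:.1f} h to replicate the daily delta")
```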
25. Storage Considerations

STANDARD HADOOP CLUSTER
• 100 nodes, compute + DAS, 24 TB per node
• Divide by 3 for Hadoop copies: 800 TB usable, but rarely achieved
• 5+ cabinets
• Spill space needed for ingestion and extraction

HADOOP USING EMC ISILON DATA LAKE
• 20 compute nodes + 800 TB Isilon
• Single copy with erasure coding: 800 TB usable
• 1 cabinet
• It is NAS
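The capacity arithmetic behind the comparison, as a minimal sketch using the node counts and sizes from the slide:

```python
# Standard cluster: raw DAS capacity divided by the 3x replication factor.
nodes, tb_per_node, replicas = 100, 24, 3
raw = nodes * tb_per_node
print(f"DAS cluster: {raw} TB raw / {replicas} copies = {raw / replicas:.0f} TB usable")

# Isilon: one erasure-coded copy, so usable capacity tracks provisioned capacity.
print("Isilon data lake: 800 TB provisioned -> 800 TB usable")
```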
26. What is my Idea - 2
• Build a fully functioning cost model that includes all the items you think are "free" but whose costs stop when you change the architecture.
  – Project-based funding is great until you want to centralize. Centralization models (BDaaS) work when you consider all the sundry costs typically excluded by project-based funding (i.e., 24x7 support for each cluster, all-in costs that appear free but are sunk).
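A toy version of such a cost model. Every figure below is a placeholder, loudly so; the structural point is that per-cluster 24x7 staffing scales linearly with cluster count while a centralized BDaaS team does not:

```python
# Hypothetical all-in cost model: N independent clusters vs. one centralized
# BDaaS team. All dollar figures are placeholders for illustration.
CLUSTERS = 25
SUPPORT_PER_CLUSTER = 500_000   # 24x7 on-call staffing per cluster, per year
CENTRAL_TEAM = 3_000_000        # one centralized 24x7 team, per year
HW_PER_CLUSTER = 400_000        # assume hardware cost is the same either way

distributed = CLUSTERS * (SUPPORT_PER_CLUSTER + HW_PER_CLUSTER)
centralized = CENTRAL_TEAM + CLUSTERS * HW_PER_CLUSTER
print(f"project-funded clusters: ${distributed:,}/yr")
print(f"centralized BDaaS:       ${centralized:,}/yr")
```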
27. What is my Idea - 3
• Think about "build it all yourself" vs. "buy"
• Focus on analytics rather than infrastructure implementation, software dependencies, testing, etc.
• That has all been done already with EMC Big Data Systems and Big Data Solutions
• Using pre-validated, installed, and tested solutions reduces complexity and increases reliability.
28. EMC BIG DATA PORTFOLIO
• Data Lake
• Data Lake Extensions
• Cloud Enabled
• Vblock
• VxRack
• VxRail
• Federation Business Data Lake
29. HIGH PERFORMANCE, PREDICTABLE LOW LATENCY
[Diagram: I/O path comparison. Traditional Hadoop walks HDFS through the filesystem, buffer cache, device driver, and SATA controller to disk at roughly 10 ms per HDD access; a PCIe SSD behind the same kernel stack lands at roughly 1000-2000 µs; the DSSD Hadoop plugin bypasses the kernel stack and accesses flash directly over PCIe at under 100 µs]
• 10X throughput
• 1/13th latency
• No application changes required
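Those headline numbers follow from the latencies in the diagram; a quick check (taking a PCIe SSD midpoint, since "1/13th" lines up with roughly 1300 µs):

```python
# Latency ratios implied by the diagram (microseconds).
hdd_us, pcie_ssd_us, dssd_us = 10_000, 1_300, 100
print(f"vs HDD:      {hdd_us / dssd_us:.0f}x lower latency")       # 100x
print(f"vs PCIe SSD: {pcie_ssd_us / dssd_us:.0f}x lower latency")  # ~13x
```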
30. EMC Business Data Lake
[Diagram: the Pivotal Big Data Suite running on VMware vCloud Suite, over an EMC Data Lake Foundation of Isilon + ECS with VCE Vblock | XtremIO | Data Domain. Layers include an open analytics toolbox, a data and analytics catalog, advanced analytics applications at scale, and data processing with Greenplum Database, HAWQ, Spring XD, Pivotal HD, Spark, Redis, RabbitMQ, and GemFire; BDS on Pivotal Cloud Foundry; Hadoop; plus Platform Manager, Data Governor, Data Manager, Ingest Manager, and Analytics Manager components]
See demos at http://www.fbdldemo.com/
31. Thursday, April 14th, 15:00 UTC
Watch out for:
• Hadoop Everywhere: Geo-Distributed Storage for Big Data
Presenters:
• Nikhil Joshi, EMC
• Vishrut Shah, EMC
33. A Remark on Data Locality
• U.C. Berkeley's AMPLab declared data locality dead in 2011
• Cloudera has declared data locality dead in Hadoop 3.0 with HDFS-EC
• Gartner has declared Hadoop dead due to its limits
• Yet Hadoop will only grow, and more will depend on it going forward
• A catalyst may mean that the next time I see you, uptime for Hadoop is your main concern
34. Isilon Scale-Out NAS
Simple to manage
• Single file system, single volume, global namespace
Massively scalable
• Scales from 16 TB to over 50 PB in a single cluster
• 200 GB/s throughput, 3.75M IOPS
Unmatched efficiency
• Over 80% storage utilization, automated tiering, and SmartDedupe
Enterprise data protection
• Efficient backup and disaster recovery, and N+1 thru N+4 redundancy
Robust security and compliance options
• RBAC, Access Zones, WORM data security, File System Auditing
• Data At Rest Encryption with SEDs, STIG hardening
• CAC/PIV smartcard authentication, FIPS OpenSSL support
Operational flexibility
• Multi-protocol support including NFS, SMB, HTTP, FTP, and HDFS
• Object and cloud computing including OpenStack Swift
35. Elastic Cloud Storage (ECS)
Geo-scale
• Geo-replicated and distributed to multiple locations
Massively scalable
• Scales to billions of objects in a single namespace
Support for all file sizes
• Supports individual files of any size
Multi-tenant
HDFS compatible
• Hortonworks-certified HDFS-compatible file system
Swift compatible
• Natively supports OpenStack storage
Native cloud interface
• Natively works with existing cloud protocols like S3 and Azure