This document discusses configuring a secure, multitenant cluster for an enterprise. It covers setting up authentication using Kerberos and LDAP, authorization with HDFS permissions, Apache Sentry, and encryption. It also discusses auditing with Cloudera Navigator, resource isolation through static and dynamic partitioning of HDFS, HBase, Impala and YARN, and admission control for Impala. The goal is to enable multiple groups within an organization to securely share cluster resources.
This document discusses security features in Apache Kafka including SSL for encryption, SASL/Kerberos for authentication, authorization controls using an authorizer, and securing Zookeeper. It provides details on how these security components work, such as how SSL establishes an encrypted channel and SASL performs authentication. The authorizer implementation stores ACLs in Zookeeper and caches them for performance. Securing Zookeeper involves setting ACLs on Zookeeper nodes and migrating security configurations. Future plans include moving more functionality to the broker side and adding new authorization features.
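The SSL-plus-SASL setup described above boils down to a handful of client settings. The sketch below uses the parameter names of the third-party kafka-python client as an assumption (check your client's documentation for exact names); hostnames and file paths are placeholders.

```python
# Sketch of a secured Kafka consumer configuration. Parameter names
# follow the kafka-python client (an assumption); broker hostname and
# CA file path are placeholders.

def secure_consumer_config(bootstrap, cafile):
    """Return a config dict for SASL/Kerberos auth over an SSL channel."""
    return {
        "bootstrap_servers": bootstrap,
        # SASL_SSL = Kerberos (GSSAPI) authentication inside an
        # SSL-encrypted channel, matching the description above.
        "security_protocol": "SASL_SSL",
        "ssl_cafile": cafile,               # CA that signed the broker certs
        "ssl_check_hostname": True,         # verify the broker's identity
        "sasl_mechanism": "GSSAPI",
        "sasl_kerberos_service_name": "kafka",
    }

cfg = secure_consumer_config(["broker1.example.com:9093"], "/etc/pki/ca.pem")
```

In a real deployment this dict would be passed to the consumer constructor; the broker side needs matching listener and keystore settings.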
Unprotected data stores are prone to data breaches. In this talk, I'll explain how to implement security on Hadoop. The talk covers basic elements such as firewalls, high availability (HA), backups, Kerberos, and data encryption (both at rest and in transit).
I also shed light on how Cloudera handles security vulnerability reports, and a little on the partner product certification process.
Intel and Cloudera: Accelerating Enterprise Big Data Success (Cloudera, Inc.)
The data center has gone through several inflection points in the past decades: adoption of Linux, migration from physical infrastructure to virtualization and Cloud, and now large-scale data analytics with Big Data and Hadoop.
Please join us to learn about how Cloudera and Intel are jointly innovating through open source software to enable Hadoop to run best on IA (Intel Architecture) and to foster the evolution of a vibrant Big Data ecosystem.
Hadoop Security and Compliance - StampedeCon 2016 (StampedeCon)
As Hadoop becomes a mainstream data platform across organizations, securing a vast and growing volume of critical information, especially financial and healthcare data, is more essential than ever. In this presentation, Derek will explain how to leverage Big Data technologies without sacrificing security and compliance, focusing especially on the comprehensive security mechanisms that should be put in place to secure a production-ready Hadoop environment. The presentation will also highlight technologies such as encryption in motion and at rest for Hadoop services, as well as the compliance processes required to meet the strictest regulatory requirements and standards.
A deep dive into running data analytic workloads in the cloud (Cloudera, Inc.)
This document discusses running data analytic workloads in the cloud using Cloudera Altus. It introduces Altus, which provides a platform-as-a-service for analyzing and processing data at scale in public clouds. The document outlines Altus features like low cost per-hour pricing, end-user focus, and cloud-native deployment. It then describes hands-on examples using Altus Data Engineering for ETL and the Altus Analytic Database for exploration and analytics. Workload analytics capabilities are also introduced for troubleshooting and optimizing jobs.
The document discusses running Hadoop clusters in the cloud and the challenges that presents. It introduces CloudFarmer, a tool that allows defining roles for VMs and dynamically allocating VMs to roles. This allows building agile Hadoop clusters in the cloud that can adapt as needs change without static configurations. CloudFarmer provides a web UI to manage roles and hosts.
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin (Spark Summit)
This document discusses securing Spark applications. It covers encryption, authentication, and authorization. Encryption protects data in transit using SASL or SSL. Authentication uses Kerberos to identify users. Authorization controls data access using Apache Sentry and the Sentry HDFS plugin, which synchronizes HDFS permissions with higher-level abstractions like tables. A future RecordService aims to provide a unified authorization system at the record level for Spark SQL.
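The Sentry HDFS synchronization idea above can be illustrated with a toy sketch (this is not the actual plugin): a table-level grant is translated into an HDFS ACL entry on the table's warehouse directory, so file-level access matches the SQL-level policy. The warehouse path and permission mapping are illustrative assumptions.

```python
# Toy sketch of synchronizing a table-level privilege down to HDFS.
# The warehouse path convention and privilege->permission mapping are
# assumptions for illustration, not the real Sentry plugin logic.

def table_grant_to_hdfs_acl(role_groups, table, privilege):
    """Map a Sentry-style table privilege to POSIX-ACL-like entries."""
    perms = {"SELECT": "r-x", "ALL": "rwx"}[privilege]
    path = f"/user/hive/warehouse/{table}"       # assumed warehouse layout
    return [(path, f"group:{g}:{perms}") for g in role_groups]

acls = table_grant_to_hdfs_acl(["analysts"], "sales", "SELECT")
# [('/user/hive/warehouse/sales', 'group:analysts:r-x')]
```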
Data Protection in Hybrid Enterprise Data Lake Environment (DataWorks Summit)
This document discusses data protection in hybrid data lake environments using Cloudera's Data Lifecycle Manager (DLM) service. It provides an overview of DLM's capabilities for replicating data between on-premises and cloud environments, including HDFS, Hive, Ranger policies, and metadata. Key features highlighted are incremental replication of Hive metadata and data, HDFS snapshot-based replication between Hadoop clusters, and replication to cloud storage providers like AWS S3, Azure ADLS, and GCP. The document also demonstrates DLM's user interface and replication of data and security policies between on-prem and cloud clusters.
How to build leakproof stream processing pipelines with Apache Kafka and Apac... (Cloudera, Inc.)
This document discusses building leakproof stream processing pipelines with Apache Kafka and Apache Spark. It provides an overview of offset management in Spark Streaming from Kafka, including storing offsets in external data stores like ZooKeeper, Kafka, and HBase. The document also covers Spark Streaming Kafka consumer types and workflows, and addressing issues like maintaining offsets during planned and unplanned maintenance or application errors.
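The "store offsets in an external data store" pattern described above can be sketched in a few lines. Here a plain dict stands in for the external store (ZooKeeper, Kafka, or HBase in the talk); the point is that results and the next offset are committed together, so replaying a batch after a failure does not duplicate work.

```python
# Toy sketch of manual offset management for a streaming consumer.
# `external_store` stands in for ZooKeeper/HBase; records are
# (offset, value) pairs as they would arrive from a Kafka partition.

external_store = {"offsets": {}, "results": []}

def process_batch(partition, records, store):
    """Process a batch, then commit results and next-offset together."""
    start = store["offsets"].get(partition, 0)
    todo = [(o, v) for o, v in records if o >= start]   # skip replayed records
    store["results"].extend(v.upper() for _, v in todo) # "processing" step
    if todo:
        store["offsets"][partition] = todo[-1][0] + 1   # commit next offset

batch = [(0, "a"), (1, "b"), (2, "c")]
process_batch("topic-0", batch, external_store)
# replaying the same batch after a crash is a no-op:
process_batch("topic-0", batch, external_store)
```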
Spark is a fast and general engine for large-scale data processing. It provides APIs in Java, Scala, and Python and an interactive shell. Spark applications operate on resilient distributed datasets (RDDs) that can be cached in memory for faster performance. RDDs are immutable and fault-tolerant via lineage graphs. Transformations create new RDDs from existing ones while actions return values to the driver program. Spark's execution model involves a driver program that coordinates tasks on executor machines. RDD caching and lineage graphs allow Spark to efficiently run jobs across clusters.
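The lazy-transformation and lineage ideas above can be shown with a toy, pure-Python stand-in (this is not Spark itself): transformations only record lineage, and an action such as `collect()` walks the lineage graph to compute results.

```python
# Toy illustration (not Spark) of lazy transformations and lineage:
# map/filter build new datasets without computing anything; collect()
# is the action that triggers evaluation along the lineage chain.

class ToyRDD:
    def __init__(self, source, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn

    def map(self, f):                         # transformation: records lineage
        return ToyRDD(None, parent=self, fn=lambda xs: [f(x) for x in xs])

    def filter(self, pred):                   # transformation: records lineage
        return ToyRDD(None, parent=self,
                      fn=lambda xs: [x for x in xs if pred(x)])

    def collect(self):                        # action: walks the lineage graph
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.collect())

rdd = ToyRDD(range(5)).map(lambda x: x * 2).filter(lambda x: x > 2)
rdd.collect()  # [4, 6, 8]
```

Because lineage is retained, a lost partition could be recomputed by re-running the recorded functions, which is the essence of Spark's fault tolerance.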
This document discusses running Spark applications on YARN and managing Spark clusters. It covers challenges like predictable job execution times and optimal cluster utilization. Spark on YARN is introduced as a way to leverage YARN's resource management. Techniques like dynamic allocation, locality-aware scheduling, and resource queues help improve cluster sharing and utilization for multi-tenant workloads. Security considerations for shared clusters running sensitive data are also addressed.
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ... (DataWorks Summit)
The document discusses Hive LLAP (Live Long and Process) as a high performance and cost-effective alternative to traditional Massively Parallel Processing (MPP) databases for querying large datasets on Hadoop. It describes Walmart's implementation of Hive LLAP on their data lake to improve query performance for business users. A proof-of-concept found Hive LLAP queries were up to 50% faster when using 15 nodes instead of 10, and it performed comparably or better than two MPP databases with similar or larger infrastructures. Walmart plans to further evaluate Hive LLAP on newer Hadoop distributions and technologies to improve availability and workload management.
Deciding on a deployment model is critical when enterprises adopt Hadoop. Initially, the bare-metal model (an on-premise cluster with physical servers) was popular, to avoid I/O overhead in virtualized environments. These days, however, the cloud is also a contending option, with compelling cost savings and ease of operation. To aid in assessing the deployment options, Accenture Technology Labs developed the Accenture Data Platform Benchmark suite and a total cost of ownership (TCO) model, and tuned and compared the performance of bare-metal Hadoop clusters and a Hadoop cloud service. Interestingly, the study discovered that the price/performance ratio is not a critical factor in making a Hadoop deployment decision. Employing empirical and systematic analyses, the study found comparable price/performance ratios for bare-metal Hadoop clusters and Hadoop-as-a-service. Moreover, cheaper purchasing options (e.g., long-term contracts) provide a better ratio than bare metal in many cases. This result debunks the idea that the cloud is unsuitable for Hadoop MapReduce workloads because of their heavy I/O requirements. Furthermore, the study finds that the Hadoop default configuration leaves ample headroom for performance tuning, and that cloud infrastructure enables even further tuning opportunities.
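The price/performance metric used in such a comparison is simple arithmetic: cluster cost per unit of benchmark throughput. The figures below are hypothetical placeholders, not the study's numbers; they merely show how a discounted long-term contract can beat bare metal on this metric even when the on-demand cloud price matches it.

```python
# Illustrative price/performance arithmetic. All dollar and throughput
# figures are hypothetical, chosen only to demonstrate the metric.

def price_performance(cost_per_hour, jobs_per_hour):
    """Dollars per completed benchmark job (lower is better)."""
    return cost_per_hour / jobs_per_hour

bare_metal     = price_performance(cost_per_hour=40.0, jobs_per_hour=100)
cloud_ondemand = price_performance(cost_per_hour=52.0, jobs_per_hour=130)
cloud_reserved = price_performance(cost_per_hour=39.0, jobs_per_hour=130)
# bare metal and on-demand cloud come out comparable; the reserved
# (long-term contract) option comes out cheaper per job.
```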
The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu (Jeremy Beard)
This document discusses building near-real-time analytics pipelines using Apache Spark Streaming and Apache Kudu on the Cloudera platform. It defines near-real-time analytics, describes the relevant components of the Cloudera stack (Kafka, Spark, Kudu, Impala), and how they can work together. The document then outlines the typical stages involved in implementing a Spark Streaming to Kudu pipeline, including sourcing from a queue, translating data, deriving storage records, planning mutations, and storing the data. It provides performance considerations and introduces Envelope, a Spark Streaming application on Cloudera Labs that implements these stages through configurable pipelines.
This document provides an introduction to Apache Kudu, a storage layer for Apache Hadoop designed for fast analytics on fast data. It discusses Kudu's motivations of filling gaps in HDFS and HBase capabilities, its design goals of high throughput scans and low latency reads/writes, and how its columnar storage and integration with tools like Spark and Impala enable it to meet these goals. Example use cases like time series and real-time analytics are presented. The document also covers Kudu's architecture of tables and tablets, its replication and fault tolerance model using Raft consensus, and performance comparisons that show it outperforming other storage systems.
This presentation is from BigData November Bangalore MeetUp by Varun Vasudev.
technology.inmobi.com/events/bigdata-meetup
Talk Outline:
- Overview of YARN
- New YARN Innovation in Hadoop 2.6
- Rolling upgrades
- Added fault tolerance
- CPU scheduling in Capacity Scheduler
- C-Group isolation
- Node labels
- Support for long running services
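Enabling the CPU-scheduling and cgroup-isolation items above typically comes down to a few YARN properties. The property names below are to the best of our knowledge for Hadoop 2.6; verify them against your distribution's documentation before use.

```python
# Hedged sketch of yarn-site / capacity-scheduler properties for the
# features listed above. Property names are believed correct for
# Hadoop 2.6 but should be checked against your distro's docs.

yarn_props = {
    # CPU scheduling in the Capacity Scheduler: consider both memory
    # and vcores via the dominant-resource calculator.
    "yarn.scheduler.capacity.resource-calculator":
        "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",
    # Cgroup isolation: run containers under the Linux container
    # executor with the cgroups resource handler.
    "yarn.nodemanager.container-executor.class":
        "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor",
    "yarn.nodemanager.linux-container-executor.resources-handler.class":
        "org.apache.hadoop.yarn.server.nodemanager.util."
        "CgroupsLCEResourcesHandler",
}
```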
Cloudera Navigator provides integrated data governance and security for Hadoop. It includes features for metadata management, auditing, data lineage, encryption, and policy-based data governance. KeyTrustee is Cloudera's key management server that integrates with hardware security modules to securely manage encryption keys. Together, Navigator and KeyTrustee allow users to classify data, audit usage, and encrypt data at rest and in transit to meet security and compliance needs.
This document provides an overview of Cloudera's SQL on Hadoop technologies including Hive, Spark SQL, and Impala. It discusses the features and capabilities of each technology, how they differ, and when each would be best suited for different use cases. Key points covered include Hive being optimized for batch processing while Impala and Spark SQL enable lower latency queries. The document also reviews columnar data formats like Parquet that can improve performance.
Multi-Tenant Operations with Cloudera 5.7 & BT (Cloudera, Inc.)
One benefit of Apache Hadoop is the ability to power multiple workloads, across many different users and departments, all within a single, shared cluster. Hear how BT is doing this today and learn about new features in Cloudera Manager to provide better visibility for multi-tenant operations.
The document discusses managing a multi-tenant data lake at Comcast over time. It began as an experiment in 2013 with 10 nodes and has grown significantly to over 1500 nodes currently. Governance was instituted to manage the diverse user community and workloads. Tools like the Command Center were developed to provide monitoring, alerting and visualization of the large Hadoop environment. SLA management, support processes, and ongoing training are needed to effectively operate the multi-tenant data lake at scale.
This document discusses using event streams as the system of record for data, rather than traditional databases. It argues that streams can serve as the single source of truth for data, providing benefits like data lineage, auditing, and integrity. It also describes how healthcare company Liaison uses a streaming platform from MapR to power their data integration platform, gaining the advantages of streams while meeting various compliance requirements.
Treasure Data provides a big data analytics platform that runs on Hadoop in the cloud. It aims to simplify big data and make it accessible for more users ("Big Data for the Rest of Us"). Treasure Data collects and stores data from various sources in its cloud-based columnar datastore and allows querying and analysis of data through SQL, REST APIs and other tools. It handles all the operational complexities of Hadoop and provides a simple interface for users.
This document discusses how to build a successful data lake by focusing on the right data, platform, and interface. It emphasizes the importance of saving raw data to analyze later, organizing the data lake into zones with different governance levels, and providing self-service tools to find, understand, provision, prepare, and analyze data. It promotes the use of a smart data catalog like Waterline Data to automate metadata tagging, enable data discovery and collaboration, and maximize business value from the data lake.
The document summarizes new features in SQL Server 2016 SP1, organized into three categories: performance enhancements, security improvements, and hybrid data capabilities. It highlights key features such as in-memory technologies for faster queries, always encrypted for data security, and PolyBase for querying relational and non-relational data. New editions like Express and Standard provide more built-in capabilities. The document also reviews SQL Server 2016 SP1 features by edition, showing advanced features are now more accessible across more editions.
Hadoop in the Cloud: Common Architectural Patterns (DataWorks Summit)
The document discusses how companies are using Microsoft Azure services like HDInsight, Data Factory, Machine Learning, and others to gain insights from large volumes of data. Specifically, it provides examples of:
1) A large computer manufacturer/retailer analyzing clickstream data with HDInsight to understand customer behavior and provide real-time recommendations to increase online conversions.
2) An industrial automation company partnering with an oil company to use IoT sensors and analytics to monitor LNG fueling stations for proactive maintenance based on sensor data analyzed with HDInsight, Data Factory, and Machine Learning.
3) How data from various industries like retail, oil and gas, manufacturing, and others can be analyzed
This document discusses strategies for filling a data lake by improving the process of data onboarding. It advocates using a template-based approach to streamline data ingestion from various sources and reduce dependence on hardcoded procedures. The key aspects are managing ELT templates and metadata through automated metadata extraction. This allows generating integration jobs dynamically based on metadata passed at runtime, providing flexibility to handle different source data with one template. It emphasizes reducing the risks associated with large data onboarding projects by maintaining a standardized and organized data lake.
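The metadata-driven approach described above can be sketched in miniature: one ELT template plus a per-source metadata record yields a concrete ingestion job, instead of a hand-coded job per source. The template text, table names, and paths below are illustrative assumptions.

```python
# Toy sketch of template-based job generation from metadata. The SQL-ish
# template and all names/paths are illustrative placeholders.
from string import Template

job_template = Template(
    "LOAD DATA FROM '$source_path' "
    "INTO TABLE $target_table PARTITIONED BY ($partition_col);"
)

def generate_job(metadata):
    """Render one ingestion job from a metadata record."""
    return job_template.substitute(metadata)

job = generate_job({
    "source_path": "/landing/sales/2016-11-01",
    "target_table": "lake.sales_raw",
    "partition_col": "load_date",
})
```

Onboarding a new source then means adding a metadata record, not writing a new hardcoded procedure.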
Relational databases vs Non-relational databases (James Serra)
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
Should I move my database to the cloud? (James Serra)
So you have been running on-prem SQL Server for a while now. Maybe you have taken the step to move it from bare metal to a VM, and have seen some nice benefits. Ready to see a TON more benefits? If you said “YES!”, then this is the session for you, as I will go over the many benefits gained by moving your on-prem SQL Server to an Azure VM (IaaS). Then I will really blow your mind by showing you even more benefits of moving to Azure SQL Database (PaaS/DBaaS). And for those of you with a large data warehouse, I also have you covered with Azure SQL Data Warehouse. Along the way I will talk about the many hybrid approaches, so you can take a gradual approach to moving to the cloud. If you are interested in cost savings, additional features, ease of use, quick scaling, improved reliability, and ending the days of upgrading hardware, this is the session for you!
Machine learning allows us to build predictive analytics solutions of tomorrow - these solutions allow us to better diagnose and treat patients, correctly recommend interesting books or movies, and even make the self-driving car a reality. Microsoft Azure Machine Learning (Azure ML) is a fully-managed Platform-as-a-Service (PaaS) for building these predictive analytics solutions. It is very easy to build solutions with it, helping to overcome the challenges most businesses have in deploying and using machine learning. In this presentation, we will take a look at how to create ML models with Azure ML Studio and deploy those models to production in minutes.
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products from collecting data, transforming it, storing it, to visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your companies big data solution.
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat... (Hortonworks)
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and convert all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
Big data architectures and the data lake (James Serra)
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (
A record 2016 for trade between Italy and Germany (Joerg Buck)
After the already positive results of 2015, the volume of trade between Italy and Germany rose again in 2016, reaching an all-time record of 112.1 billion euros (+3.5% over 2015). According to Istat data, Italian exports to Germany reached 52.7 billion euros last year (+3.9% over 2015), while imports totaled 59.4 billion euros (+3.3% over 2015).
Controlling Technical Debt with Continuous Delivery (walkmod)
This document summarizes a presentation about controlling technical debt with continuous delivery. It discusses using tools for continuous inspection of code to detect debt, automated code fixes to reduce debt incrementally, and integrating fixes into the continuous delivery pipeline to continuously pay down debt over time. Key aspects covered include metrics and tools to measure debt, automated fixes for common code issues, code transformation techniques to fix issues safely, and a WalkMod pipeline API to integrate fixes into the delivery process.
Highlighted articles in ACCIONA Reports 65 analyze our contract to provide clean energy to Google Chile, "the art of business" in the Middle East, and the "Luz en Casa" program run by ACCIONA Microenergía. Find out more at #ACCIONAReports
This document provides an overview of Apache Hadoop security, both historically and what is currently available and planned for the future. It discusses how Hadoop security is different due to benefits like combining previously siloed data and tools. The four areas of enterprise security - perimeter, access, visibility, and data protection - are reviewed. Specific security capabilities like Kerberos authentication, Apache Sentry role-based access control, Cloudera Navigator auditing and encryption, and HDFS encryption are summarized. Planned future enhancements are also mentioned like attribute-based access controls and improved encryption capabilities.
The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
Comprehensive Security for the Enterprise II: Guarding the Perimeter and Cont... (Cloudera, Inc.)
One of the benefits of Hadoop is that it easily allows for multiple entry points both for data flow and user access. Here we discuss how Cloudera allows you to preserve the agility of having multiple entry points while also providing strong, easy to manage authentication. Additionally, we discuss how Cloudera provides unified authorization to easily control access for multiple data processing engines.
This document discusses the APIs and extensibility features of Cloudera Manager. It provides an overview of the Cloudera Manager API introduced in version 4.0, which allows programmatic access to cluster operations and monitoring data. It also discusses how the API has been used by various customers and partners for tasks like installation/deployment, monitoring, and alerting integration. The document outlines Cloudera Manager's monitoring capabilities using the tsquery language and provides examples. Finally, it covers new service extensibility features introduced in Cloudera Manager 5.
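A tsquery is submitted to Cloudera Manager's REST time-series endpoint as a URL query parameter. The sketch below only builds the request URL; the API version, port, and tsquery string are illustrative assumptions, so check the CM API documentation for your release before relying on them.

```python
# Sketch of building a Cloudera Manager time-series request URL from a
# tsquery. API version, TLS port (7183), and the sample query are
# assumptions for illustration; nothing is sent over the network here.
from urllib.parse import urlencode

def timeseries_url(cm_host, tsquery, api_version="v11"):
    """Build the REST URL for a tsquery against Cloudera Manager."""
    qs = urlencode({"query": tsquery})
    return f"https://{cm_host}:7183/api/{api_version}/timeseries?{qs}"

url = timeseries_url("cm.example.com",
                     "select cpu_user_rate where roleType = DATANODE")
```

In practice the URL would be fetched with an authenticated HTTP GET, returning JSON time-series points for charting or alerting integrations.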
Securing Data in Hybrid on-premise and Cloud Environments Using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o... (Big Data Spain)
This document discusses securing big data at rest using encryption for Hadoop, Cassandra, and MongoDB on Red Hat. It provides an overview of these NoSQL databases and Hadoop, describes common use cases for big data, and demonstrates how to use encryption solutions like dm-crypt, eCryptfs, and Cloudera Navigator Encrypt to encrypt data for these platforms. It includes steps for profiling processes, adding ACLs, and encrypting data directories for Hadoop, Cassandra, and MongoDB. Performance costs for encryption are typically around 5-10%.
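For the dm-crypt option mentioned above, preparing an encrypted data directory follows a standard sequence: format the block device with a LUKS header, open it to a device-mapper name, create a filesystem, and mount it. The sketch below only assembles those shell steps as strings (nothing is executed); the device path, mapper name, and mount point are placeholders.

```python
# Sketch of the dm-crypt (LUKS) steps for an encrypted data directory.
# Commands are assembled as strings for illustration, not executed;
# /dev/sdb1, hdfs_data, and /data/dfs are placeholder names.

def luks_setup_commands(device, mapper_name, mount_point):
    """Return the shell steps to format, open, and mount a LUKS volume."""
    return [
        f"cryptsetup luksFormat {device}",              # write LUKS header
        f"cryptsetup luksOpen {device} {mapper_name}",  # unlock to /dev/mapper
        f"mkfs.ext4 /dev/mapper/{mapper_name}",         # filesystem on mapping
        f"mount /dev/mapper/{mapper_name} {mount_point}",
    ]

steps = luks_setup_commands("/dev/sdb1", "hdfs_data", "/data/dfs")
```

The service's data directories (e.g., HDFS DataNode dirs) would then be configured to live under the encrypted mount, with the modest overhead the talk cites.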
The document discusses new features in Apache Hadoop 3, including HDFS erasure coding which reduces storage overhead, YARN federation which improves scalability, and the Application Timeline Server which provides improved visibility into application performance. It also covers HDFS multi standby NameNodes which enhances high availability, and the future directions of Hadoop including object storage with Ozone and running HDFS on cloud infrastructure.
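The storage savings behind erasure coding are quick arithmetic: 3x replication stores two extra full copies of the data (200% overhead), while a Reed-Solomon (6,3) layout stores three parity blocks per six data blocks (50% overhead).

```python
# Arithmetic behind the erasure-coding storage savings in Hadoop 3:
# extra storage expressed as a fraction of the raw data size.

def storage_overhead(data_units, redundant_units):
    """Extra storage as a fraction of the raw data size."""
    return redundant_units / data_units

replication_3x = storage_overhead(1, 2)   # 2 extra copies  -> 200% overhead
rs_6_3 = storage_overhead(6, 3)           # 3 parity per 6  -> 50% overhead
```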
The document discusses security implementation on Hadoop clusters. It outlines various security measures that can be implemented including identity management using Kerberos or LDAP, authorization controls using access control lists, auditing, encryption of data in transit and at rest, key management, and vulnerability response processes. The document provides system diagrams of example implementations and discusses performance considerations of encryption.
Risk Management for Data: Secured and Governed (Cloudera, Inc.)
Cloudera Tech Day Presentation by Eddie Garcia, Chief Security Architect, Cloudera. Protecting enterprise data is an increasingly complex challenge given the diversity and sophistication of threat actors and their cyber-tactics. In this session, participants will hear a comprehensive introduction to Hadoop Security, including the “three A’s” for secure operating environments: Authentication, Authorization, and Audit. In addition, the presenter will cover strategies to orchestrate data security, encryption, and compliance, and will explain the Cloudera Security Maturity Model for Hadoop. Attendees will leave with a greater understanding of how effective INFOSEC relies on an enterprise big data governance and risk management approach.
The document discusses Cloudera Manager's APIs and extensibility features. It describes how the CM API introduced in version 4 allows programmatic access to cluster operations and monitoring data. It provides examples of how the API has been used to integrate CM with installation/deployment tools and for monitoring and alerting. The document also discusses CM's support for custom metrics charts using tsquery and how service extensibility introduced in version 5 allows for non-CDH services and ISV applications to be managed through CM.
This deck covers key considerations and provides advice for enterprises looking to run production-scale Cloudera on AWS. We touch on everything from security to governance to selecting the right instance type for your Hadoop workload (Spark, Impala, Search, etc).
Big Data is an increasingly powerful enterprise asset, and this talk will explore the relationship between big data and cyber security: how we preserve privacy while exploiting the advantages of data collection and processing. Big Data technologies give both governments and corporations powerful tools to offer more efficient and personalized services, and their rapid adoption has created tremendous social benefits. Unfortunately, an unwanted side effect is the rich pickings available to those with malicious intentions. Increasingly, the sophisticated cyber attacker is able to exploit the rich array of public data to build detailed profiles of their adversaries in support of those intentions.
Cloudera GoDataFest Security and Governance (GoDataDriven)
The document discusses Cloudera's security and governance solutions for Hadoop. It describes how Cloudera provides comprehensive security through authentication, authorization, auditing, and compliance features. It also covers how Cloudera helps with data visibility and governance through tools that report on data usage and lineage. The overall goal is to help customers securely manage and govern their data on Hadoop clusters.
This presentation answers many of your questions about PostgreSQL and the Red Hat Cluster Suite.
It reviews how you can create failover/standby capabilities with the following activities:
- General PostgreSQL clustering options
- Overview of Red Hat Cluster Service
- Identification of candidate databases for clustering
- Identification of hardware for clustering
- Analysis of uptime requirements and data latency
- Implementation of clustering
- Testing of clustering
- PostgreSQL installation tips for RHCS
This document discusses the challenges of trust, visibility and governance in Apache Hadoop and how Cloudera Navigator addresses them. It describes how Navigator provides an integrated data management and governance platform for Hadoop by collecting and integrating technical metadata, business metadata, lineage, policies and audit logs. This platform enables self-service discovery and analytics for data scientists and BI users, usage-driven optimization for Hadoop administrators and compliance capabilities for security teams. The document provides examples of the types of metadata, lineage and audit logs collected in Hadoop and their limitations, and argues that Navigator is needed to make this information actionable through policies and a governance framework.
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you'll see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Cloudera User Group Chicago - Cloudera Manager: APIs & ExtensibilityClouderaUserGroups
This document provides an overview of Cloudera Manager APIs and extensibility. It discusses how the Cloudera Manager API, introduced in version 4.0, allows programmatic access to cluster operations and monitoring information. It provides examples of integration with the API for installation/deployment and monitoring/alerting. It also covers the tsquery language for custom metrics and monitoring, and new capabilities in Cloudera Manager 5 for user-defined triggers/alarms and service extensibility.
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera, Inc.
Cloud environments are increasingly becoming a popular deployment option for Hadoop. Enterprises can take advantage of the added flexibility and elasticity of the cloud for long-running clusters, temporary deployments, and spiky workloads. However, as more and more users choose cloud environments for critical Hadoop workloads, they are often forced to compromise on key aspects of their data platform.
Cloudera Director enables the full fidelity of the Enterprise Data Hub in the cloud, without compromises. Announced with the recent 5.2 release, Cloudera Director is the simple, reliable way to deploy and scale Hadoop in the cloud, while maintaining an open and neutral platform with enterprise-grade capabilities.
During this webinar, Tushar Shanbhag, Director of Product Management, will look at why Hadoop cloud environments are becoming so popular and some of the challenges around Hadoop in the cloud. He will then provide an in-depth overview of Cloudera Director, its key features, and how it alleviates these common challenges. Finally, he will discuss some key use cases and provide insight into what’s next for Cloudera and Hadoop in the cloud.
The Future of Hadoop Security - Hadoop Summit 2014Cloudera, Inc.
Hadoop deployments are rapidly moving from pilots to production, enabling unprecedented opportunity to build big data applications that deliver faster access to more information to more users than ever before possible. Yet without the ability to address data security and compliance regulations, Hadoop will be limited to another data silo.
In this talk, Matt Brandwein and David Tishgart discuss the requirements for securing Hadoop and how Cloudera (now with Gazzang) and Intel are collaborating in the open to deliver comprehensive, transparent, compliance-ready security to unlock the potential of the Hadoop ecosystem and enable innovation without compromise.
Similar to Configuring a Secure, Multitenant Cluster for the Enterprise (20)
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most cutting-edge data projects and represent innovation and leadership in their respective industries.
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to ad-hoc exploration of all data that optimizes business processes, and on to unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers, such as:
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you'll see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
The document discusses the benefits and trends of modernizing a data warehouse. It outlines how a modern data warehouse can provide deeper business insights at extreme speed and scale while controlling resources and costs. Examples are provided of companies that have improved fraud detection, customer retention, and machine performance by implementing a modern data warehouse that can handle large volumes and varieties of data from many sources.
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
Cloudera SDX is by no means restricted to just the platform; it extends well beyond it. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
In this webinar, we’ll show you how Cloudera SDX reduces the complexity in your data management environment and lets you deliver diverse analytics with consistent security, governance, and lifecycle management against a shared data catalog.
Sharing Data
Single repository for all data
Organisation-wide view of data gives better insight
Effective sharing of datasets when permitted, isolation of datasets when not
Sharing Compute
Allocation of resources is dynamic, optimised, and just-in-time
Leading to better utilisation of cluster resources and better performance for individual requests (bursting)
Across workloads (batch processing, interactive SQL, enterprise search, and advanced analytics)
Consolidated Operations
Amortise (spread) administrative overhead
Reduce cost and complexity
Multiple groups (departments, projects, users)
Common set of resources (storage and compute)
Security constraints (e.g. data protection policy)
LDAP Integration
Typically done at the OS level; HDFS uses shell-based group mapping
User accounts propagated to all hosts
PAM_LDAP
SSSD
Centrify
VAS/QAS
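Whichever integration route is used, the OS and the NameNode should resolve the same group list for a user. A quick consistency check (the user name `alice` is illustrative):

```shell
# Groups as the OS (PAM_LDAP / SSSD / Centrify / VAS) resolves them
id -Gn alice
# Groups as the NameNode resolves them via shell-based group mapping;
# the two lists should match if LDAP integration is healthy on all hosts
hdfs groups alice
```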
POSIX Access Control Lists (Hadoop 2.4)
An ACL provides a way to set different permissions for specific named users or groups, beyond the file's owner and owning group
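As a sketch with the HDFS CLI (the group and path names are illustrative):

```shell
# Grant a named group read+execute beyond the owner/group/other bits
hdfs dfs -setfacl -m group:analysts:r-x /data/sales
# Default ACL: entries inherited by new files and subdirectories
hdfs dfs -setfacl -m default:group:analysts:r-x /data/sales
# Inspect the effective ACL
hdfs dfs -getfacl /data/sales
```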
Encryption
Compliance regulations: EU Data Protection Directive
Gazzang
OS-level encryption
Enterprise-grade key management (Navigator Key Trustee)
Navigator Encrypt
Encrypts the DataNode (DN) data directory in the Linux file system
Provided by kernel module
Process-based ACLs (i.e. only DN can access encrypted directory)
Project Rhino
HDFS-level encryption (Encryption Zones)
Integrated with Navigator Key Trustee
Better suited to multitenant environments
Hardware-accelerated
Encryption is unusable if it carries a significant performance penalty
Uses AES instruction set available on Intel processors
HDFS will be able to provide access to encrypted data with minimal performance impact
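A sketch of setting up an HDFS encryption zone with the CLI (key and path names are illustrative; assumes a KMS, e.g. one backed by Navigator Key Trustee):

```shell
# Create a key in the KMS, then mark an (empty) directory as an encryption zone
hadoop key create salesKey
hdfs dfs -mkdir -p /secure/sales
hdfs crypto -createZone -keyName salesKey -path /secure/sales
# Files written under /secure/sales are now transparently encrypted at rest
hdfs crypto -listZones
```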
Restrict tenants' disk usage
Prevent users from accidentally or maliciously consuming too much disk space within the cluster
Disk space quotas: disk space limits on a per directory basis
Name quotas: limits the number of files and subdirectories within a particular directory. Helps administrators control the NN metadata
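Both quota types are set per directory with dfsadmin (limits and path are illustrative):

```shell
# Disk space quota: 10 TB of raw usage (replication counts against the limit)
hdfs dfsadmin -setSpaceQuota 10t /user/tenantA
# Name quota: at most 100,000 files and directories, protecting NN metadata
hdfs dfsadmin -setQuota 100000 /user/tenantA
# Show current usage against both quotas
hdfs dfs -count -q /user/tenantA
```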
Analogy
Messi, Neymar, and Suarez pop out for pizza after training;
They do not have enough money to buy a large pizza each, so they pool their money to buy one;
Once they have the pizza, they agree on a policy to share it;
Because the pizza has 10 slices, they agree that Messi can eat 4 slices, and Neymar and Suarez can eat 3 each;
They can eat in parallel, but each can only eat one slice at a time.
I.e. route tenants' users based on their AD group membership
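The pizza policy maps directly onto YARN Fair Scheduler queue weights, and AD-group routing onto a queue placement policy. A minimal sketch of an allocation file (the queue names and 4:3:3 split mirror the analogy and are illustrative):

```shell
# Illustrative fair-scheduler.xml: weighted sharing plus placement of users
# into the queue matching their (AD-synced) primary OS group
cat > fair-scheduler.xml <<'EOF'
<allocations>
  <queue name="messi"><weight>4.0</weight></queue>
  <queue name="neymar"><weight>3.0</weight></queue>
  <queue name="suarez"><weight>3.0</weight></queue>
  <queuePlacementPolicy>
    <!-- place the job in the queue named after the user's primary group;
         create="false" prevents undeclared queues from appearing -->
    <rule name="primaryGroup" create="false"/>
    <rule name="default"/>
  </queuePlacementPolicy>
</allocations>
EOF
```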
Impala
Incoming queries are executed, queued, or rejected
Queue if too many queries or not enough memory
Reject if queue is full
Disabling undeclared pools
Applies when the user does not specify a pool
When undeclared pools are allowed, a pool is created on the fly with the name of the user that submitted the request
When undeclared pools are disabled, the default pool is used instead
Enabling the default pool
Applies when the user specifies a pool that doesn't exist
When the default pool is disabled, the requested pool is created on the fly with the default settings
When the default pool is enabled, the default pool is used instead
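For a single (default) pool, this admission-control behaviour can be tuned via impalad startup flags. A sketch (values are illustrative; flag names are from CDH-era Impala, so verify them against your version):

```shell
# Illustrative impalad flags for admission control on the default pool:
# run at most 20 queries concurrently, queue up to 50 more (reject beyond
# that), cap the pool's aggregate memory at 64 GB, and fail queries that
# sit in the queue for more than 60 seconds.
impalad \
  --default_pool_max_requests=20 \
  --default_pool_max_queued=50 \
  --default_pool_mem_limit=64g \
  --queue_wait_timeout_ms=60000
```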