Hadoop Operations –
Best Practices from the Field
June 11, 2015
Chris Nauroth
email: cnauroth@hortonworks.com
twitter: @cnauroth
Suresh Srinivas
email: suresh@hortonworks.com
twitter: @suresh_m_s
© Hortonworks Inc. 2011
About Me
Chris Nauroth
• Member of Technical Staff, Hortonworks
– Apache Hadoop committer, PMC member, and Apache Software Foundation member
– Major contributor to HDFS ACLs, Windows compatibility, and operability improvements
• Hadoop user since 2010
– Prior employment experience deploying, maintaining and using Hadoop clusters
Architecting the Future of Big Data
Agenda
• Analysis of Hadoop Support Cases
– Support case trends
– Configuration
– Software Improvements
• Key Learnings and Best Practices
– HDFS ACLs
– HDFS Snapshots
– Reporting DataNode Volume Failures
Support Case Trends – Proportional Cases per Month
[Chart: proportion of support cases per month (y-axis 0 to 1, months 1 to 43), by component: HDFS, MapReduce, YARN, Other (37 components)]

Support Case Trends – Root Cause
[Chart: number of support cases (y-axis 0 to 1200) by root cause category: Customer Environment (Non HDP), Documentation Defect, Documentation Gap, Documentation Not Utilized, Education - Configuration, Needs Training, Product Defect; series: YARN, MapReduce, HDFS]
Support Case Trends
• Core Hadoop components (HDFS, YARN and MapReduce) are used across all deployments, and therefore receive proportionally more support cases than other ecosystem components.
• Misconfiguration is the dominant root cause.
• Documentation is a close second.
• We are constantly improving the code to eliminate operational issues, help with diagnosis and provide increased visibility.
• Best practices get incorporated into Apache Ambari for improved defaults, simplified configuration and deeper monitoring.
Configuration
Configuration - Hardware and Cluster Sizing
• Considerations
– Larger clusters heal faster after a node or disk failure
– Machines with huge storage take longer to recover
– More racks give more failure domains
• Recommendations
– Get good-quality commodity hardware
– Buy at the pricing sweet spot: 3 TB disks, 96 GB RAM, 8-12 cores
– More memory is better: real-time workloads are memory hungry!
– Before considering fatter machines (1U with 6 disks vs. 2U with 12 disks), get to 30-40 machines or 3-4 racks
– Use a pilot cluster to learn about load patterns
– Balance hardware for I/O-, compute-, or memory-bound workloads
– More details: http://tinyurl.com/hwx-hadoop-hw
Configuration – JVM Tuning
• Avoid JVM issues
– Use a 64-bit JVM for all daemons
– Compressed OOPs are enabled by default (Java 6u23 and later)
– Java heap size
– Set the same maximum and starting heap size: Xmx == Xms
– Avoid Java defaults: configure NewSize and MaxNewSize
– Use 1/8 to 1/6 of the max heap size for JVMs larger than 4 GB
– Configure -XX:PermSize=128m, -XX:MaxPermSize=256m (applies to Java 7 and earlier)
– Use a low-latency GC collector
– -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads=<N>
– High <N> on NameNode and JobTracker or ResourceManager
– Important JVM configs to help debugging
– -verbose:gc -Xloggc:<file> -XX:+PrintGCDetails
– -XX:ErrorFile=<file>
– -XX:+HeapDumpOnOutOfMemoryError
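The JVM settings above could be collected in hadoop-env.sh roughly as follows. This is a minimal sketch, assuming a Hadoop 2.x layout in which HADOOP_NAMENODE_OPTS is read from hadoop-env.sh. The helper name namenode_jvm_opts, the 8 GB heap, the GC thread count, and the log directory are illustrative assumptions, not values from the slides.

```shell
# Build a NameNode JVM option string from a heap size in MB.
# Hypothetical helper; heap size, GC thread count, and log directory are placeholders.
namenode_jvm_opts() {
  heap_mb=$1
  gc_threads=$2
  log_dir=$3
  # Young generation sized at 1/8 of the heap (the guidance above is 1/8 to 1/6).
  new_mb=$((heap_mb / 8))
  echo "-Xms${heap_mb}m -Xmx${heap_mb}m" \
       "-XX:NewSize=${new_mb}m -XX:MaxNewSize=${new_mb}m" \
       "-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=${gc_threads}" \
       "-verbose:gc -Xloggc:${log_dir}/gc.log -XX:+PrintGCDetails" \
       "-XX:ErrorFile=${log_dir}/hs_err_pid%p.log" \
       "-XX:+HeapDumpOnOutOfMemoryError"
}

# In hadoop-env.sh (assumed values):
export HADOOP_NAMENODE_OPTS="$(namenode_jvm_opts 8192 8 /var/log/hadoop/hdfs) ${HADOOP_NAMENODE_OPTS}"
```

Keeping the option string in one function makes it easy to review in version control and to reuse with different heap sizes per daemon.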
Configuration
• Deploy with QuorumJournalManager for high availability
• Raise the open file descriptor limit (ulimit -n)
– The default of 1024 is too low
– 16K for DataNodes, 64K for master nodes
• Use version control for configuration!
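The file descriptor guidance above can be verified with a small check run as the user that starts each daemon. This is a sketch only: check_fd_limit is a hypothetical helper, and the thresholds 16384 and 65536 are one reading of the "16K" and "64K" figures.

```shell
# Compare the soft open-file limit against a required minimum.
# check_fd_limit is a hypothetical helper, not part of Hadoop.
check_fd_limit() {
  current=$1
  required=$2
  if [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]; then
    echo "ok"
  else
    echo "too low: $current < $required"
  fi
}

# Run as the daemon user, for example hdfs on a DataNode host:
check_fd_limit "$(ulimit -n)" 16384
```

On Linux systems that use pam_limits, the limit is commonly raised with an entry such as `hdfs - nofile 65536` in a file under /etc/security/limits.d; the exact mechanism depends on how the daemons are started.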
Configuration
• Use disk fail in place for datanodes: dfs.datanode.failed.volumes.tolerated
– Disk failure is no longer datanode failure
– Especially important for large density nodes
• Set dfs.namenode.name.dir.restore to true
– Restores NN storage directory during checkpointing
• Take periodic backups of namenode metadata
– Make copies of the entire storage directory
• Set aside a lot of disk space for NN logs
– The NN log is verbose: set aside multiple GBs
– Many installs configure this too small
– NN logs then roll within minutes, which makes issues hard to debug
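The settings above can be sanity-checked and the metadata backup scripted along these lines. This is a sketch under assumptions: validate_tolerated and backup_fsimage are hypothetical helpers, the backup path is a placeholder, and the rule that the tolerated count must be smaller than the number of data directories reflects Hadoop 2.x behavior.

```shell
# Count comma-separated entries in a dfs.datanode.data.dir value.
count_data_dirs() {
  echo "$1" | tr ',' '\n' | grep -c .
}

# The tolerated-failure count must be smaller than the number of data directories.
# validate_tolerated is a hypothetical helper, not a Hadoop command.
validate_tolerated() {
  dirs=$(count_data_dirs "$1")
  tolerated=$2
  if [ "$tolerated" -lt "$dirs" ]; then
    echo "ok: tolerating $tolerated of $dirs volumes"
  else
    echo "invalid: tolerated=$tolerated, volumes=$dirs"
  fi
}

# Download the latest fsimage from the NameNode into a dated local directory.
backup_fsimage() {
  dest="$1/$(date +%Y-%m-%d)"
  mkdir -p "$dest" && hdfs dfsadmin -fetchImage "$dest"
}

# On a DataNode host:
# validate_tolerated "$(hdfs getconf -confKey dfs.datanode.data.dir)" \
#                    "$(hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated)"
# On a backup host (path is an assumption):
# backup_fsimage /backup/namenode
```

Note that `hdfs dfsadmin -fetchImage` retrieves only the most recent fsimage, so it complements a copy of the entire storage directory rather than replacing it.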
Configuration – Monitoring Usage
• As cluster storage, nodes, files, and blocks grow
– Update NN heap, handler count, and number of DN xceivers
– Tweak other related configuration periodically
• Monitor the hardware usage for your work load
– Disk I/O, network I/O, CPU and memory usage
– Use this information when expanding cluster capacity
• Monitor usage with Hadoop metrics
– JVM metrics – GC times, Memory used, Thread Status
– RPC metrics – especially latency to track slowdowns
– HDFS metrics
– Used storage, # of files and blocks, total load on the cluster
– File System operations
– MapReduce Metrics
– Slot utilization and Job status
• Tweak configurations during upgrades/maintenance on an ongoing basis
Install & Configure: Ambari Guided Configuration
Guide configuration and provide recommendations for the most common settings. (HBase example shown here)
Software Improvements
Real Incidents and Software Improvements to Address Them
Don’t edit the metadata files!
• Editing can corrupt the cluster state
– Might result in loss of data
• Real incident
– NN misconfigured to point to another NN’s metadata
– DNs can’t register due to namespace ID mismatch
– System detected the problem correctly
– Safety net ignored by the admin!
– Admin edits the NameNode VERSION file to match IDs
Mass deletion of unknown blocks that do not exist in that namespace
Improvement
• Pause deletion of blocks when the namenode starts up
– https://issues.apache.org/jira/browse/HDFS-6186
– Supports configurable delay of block deletions after NameNode startup
– Gives an admin extra time to diagnose before deletions begin
• Show when block deletion will start after NameNode startup in WebUI
– https://issues.apache.org/jira/browse/HDFS-6385
– The web UI already displayed the number of pending block deletions
– This enhanced the display to indicate when actual deletion will begin
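The configurable delay described above can be inspected from the command line. This is a minimal sketch, assuming the delay is exposed as dfs.namenode.startup.delay.block.deletion.sec (seconds, with 0 meaning no delay); confirm the key against hdfs-default.xml for your version. describe_deletion_delay is a hypothetical helper.

```shell
# Describe the configured delay before block deletion begins.
# describe_deletion_delay is a hypothetical helper; it takes the delay in seconds.
describe_deletion_delay() {
  delay_sec=$1
  if [ "$delay_sec" -gt 0 ]; then
    echo "block deletion paused for $((delay_sec / 60)) minutes after NameNode startup"
  else
    echo "block deletion starts immediately after NameNode startup"
  fi
}

# On a cluster node (the key name is an assumption to verify for your release):
# describe_deletion_delay "$(hdfs getconf -confKey dfs.namenode.startup.delay.block.deletion.sec)"
```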
Block Deletion Start Time
[Screenshot: NameNode web UI with the new block deletion start time]
Guard Against Accidental Deletion
• rm -r deletes the data at the speed of Hadoop!
– Ctrl-C of the command does not stop the deletion!
– Undeleting files on DataNodes is hard and time consuming
– Immediately shut down the NN and unmount disks on the DataNodes
– Recover the deleted files
– Start the NameNode without the delete operation in the edits
• Enable Trash
• Real Incident
– Customer is running a distro of Hadoop with trash not enabled
– Customer deletes a large dir (100 TB) and shuts down the NN immediately
– Support person asks for the NN to be restarted to see if trash is enabled!
Blocks start deleting
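Whether trash is enabled can be read from configuration without restarting any daemon. The following is a sketch: trash_status is a hypothetical helper and the paths are placeholders. fs.trash.interval is expressed in minutes, and 0 disables trash. `hdfs getconf` reads the configuration files on the host where it runs, so check the hosts whose settings matter.

```shell
# Interpret an fs.trash.interval value (minutes; 0 means trash is disabled).
# trash_status is a hypothetical helper, not a Hadoop command.
trash_status() {
  minutes=$1
  if [ "$minutes" -gt 0 ]; then
    echo "trash enabled: deleted files are kept for $minutes minutes"
  else
    echo "trash disabled: rm deletes immediately"
  fi
}

# Check the effective setting:
# trash_status "$(hdfs getconf -confKey fs.trash.interval)"
#
# With trash enabled, a delete is a move into the user's trash directory and can be undone:
# hdfs dfs -rm -r /data/old-feeds
# hdfs dfs -mv "/user/$USER/.Trash/Current/data/old-feeds" /data/old-feeds
```

Be aware that `-skipTrash` on the rm command bypasses this protection even when trash is enabled.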
Improvement
• HDFS Snapshots
– https://issues.apache.org/jira/browse/HDFS-2802
– A snapshot is a read-only point-in-time image of part of the file system
– A snapshot created before a deletion can be used to restore deleted data
– More coverage of snapshots later in the presentation
• HDFS ACLs
– https://issues.apache.org/jira/browse/HDFS-4685
– Finer-grained control of file permissions can help prevent an accidental deletion
– More coverage of ACLs later in the presentation
Unexpected error during HA HDFS upgrade
• Background: HDFS HA Architecture
– http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
• Real Incident
– During upgrade, NameNode calls every JournalNode to request backup of metadata directory, which renames “current” directory to “previous.tmp”.
– Permissions incorrect on metadata directory for 1 out of 3 JournalNodes.
– The hdfs user is not authorized to rename. Backup fails for that JournalNode, so upgrade process aborts with error.
Root cause not easily identifiable, long time to recover
Improvement
• Improve diagnostics on storage directory rename operations by using native code.
– https://issues.apache.org/jira/browse/HDFS-7118
– Logs additional root cause information for rename failure, for example EACCES
• Split error checks into separate conditions to improve diagnostics.
– https://issues.apache.org/jira/browse/HDFS-7119
– Splits a log message about failure to delete or rename into separate log messages to clarify which specific action failed
• When aborting NameNode or JournalNode, write the contents of the metadata directories and permissions to logs.
– https://issues.apache.org/jira/browse/HDFS-7120
– Usually the first information asked of the user, so we can automate this
• For JournalNode operations that must succeed on all nodes, execute a pre-check to verify that the operation can succeed.
– https://issues.apache.org/jira/browse/HDFS-7121
– Prevents need for manual cleanup on 2 out of 3 JournalNodes where backup succeeded
Key Learnings and Best Practices
Features that Help Improve Production Operations
HDFS ACLs
• Existing HDFS POSIX permissions are good, but not flexible enough
– Permission requirements may differ from the natural organizational hierarchy of users and groups.
• HDFS ACLs augment the existing HDFS POSIX permissions model by implementing the POSIX ACL model.
– An ACL (Access Control List) provides a way to set different permissions for specific named users or named groups, not only the file’s owner and file’s group.
HDFS File Permissions Example
• Authorization requirements:
– In a sales department, a single user, Maya (Department Manager), should control all modifications to sales data
– Other members of the sales department need to view the data, but can’t modify it.
– Everyone else in the company must not be allowed to view the data.
• Can be implemented via the following:
[Diagram: file with sales data. User: read/write permission for user maya. Group: read permission for group sales.]

HDFS ACLs
• Problem
– It is no longer feasible for Maya to control all modifications to the file
– New requirement: Maya, Diane and Clark are allowed to make modifications
– New requirement: a new group called executives should be able to read the sales data
– The current permissions model only allows permissions for one group and one user
• Solution: HDFS ACLs
– Now assign different permissions to different users and groups
[Diagram: an HDFS directory with rwx permission entries for Owner, Group, and Others, plus ACL entries for Group D, Group F, and User Y]
HDFS ACLs
New Tools for ACL Management (setfacl, getfacl)
– hdfs dfs -setfacl -m group:execs:r-- /sales-data
– hdfs dfs -getfacl /sales-data
# file: /sales-data
# owner: maya
# group: sales
user::rw-
group::r--
group:execs:r--
mask::r--
other::---
– How do you know if a directory has ACLs set? The permission string in the listing ends with "+":
– hdfs dfs -ls /sales-data
Found 1 items
-rw-r-----+ 3 maya sales 0 2014-03-04 16:31 /sales-data
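The new requirements from the sales example could be met with one setfacl call that adds several entries at once. This is a sketch: grant_sales_acls and revoke_user_acl are hypothetical wrappers, and the user names diane and clark are lower-cased forms of the names in the example. It also assumes ACL support is switched on in the NameNode (dfs.namenode.acls.enabled), which is off by default in Hadoop 2.x.

```shell
# Grant the access described in the new requirements.
# grant_sales_acls is a hypothetical wrapper around real hdfs commands.
grant_sales_acls() {
  path=$1
  # Named users who may modify the data, plus read-only access for the executives group.
  hdfs dfs -setfacl -m "user:diane:rw-,user:clark:rw-,group:execs:r--" "$path"
  # Show the resulting ACL for review.
  hdfs dfs -getfacl "$path"
}

# Remove a single named user entry again.
revoke_user_acl() {
  hdfs dfs -setfacl -x "user:$1" "$2"
}

# grant_sales_acls /sales-data
# revoke_user_acl clark /sales-data
```

Maya needs no ACL entry here because she already holds read/write permission as the file owner.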
HDFS ACLs Best Practices
• Start with traditional HDFS permissions to implement most permission requirements.
• Define a smaller number of ACLs to handle exceptional cases.
• A file with an ACL incurs an additional cost in memory in the NameNode compared to a file that has only traditional permissions.
HDFS Snapshots
• HDFS Snapshots
– A snapshot is a read-only point-in-time image of part of the file system
– Performance: snapshot creation is instantaneous, regardless of data size or subtree depth
– Reliability: snapshot creation is atomic
– Scalability: snapshots do not create extra copies of data blocks
– Useful for protecting against accidental deletion of data
• Example: Daily Feeds
hdfs dfs -ls /daily-feeds
Found 5 items
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-13
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-14
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-15
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-16
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-17
HDFS Snapshots
• Create a snapshot after each daily load
hdfs dfsadmin -allowSnapshot /daily-feeds
Allowing snaphot on /daily-feeds succeeded
hdfs dfs -createSnapshot /daily-feeds snapshot-to-2014-10-17
Created snapshot /daily-feeds/.snapshot/snapshot-to-2014-10-17
• User accidentally deletes data for 2014-10-16
hdfs dfs -ls /daily-feeds
Found 4 items
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-13
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-14
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-15
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-17
HDFS Snapshots
• Snapshots to the rescue: the data is still in the snapshot
hdfs dfs -ls /daily-feeds/.snapshot/snapshot-to-2014-10-17
Found 5 items
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-13
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-14
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-15
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-16
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-17
• Restore data from 2014-10-16
hdfs dfs -cp /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-16 /daily-feeds
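The daily-feeds workflow above could be wrapped in a few small functions for a scheduled job. This is a sketch: the function names are hypothetical, the snapshot naming follows the example, and it assumes a release where `hdfs snapshotDiff` and `hdfs dfs -cp -p` are available.

```shell
# Take a dated snapshot after each daily load.
snapshot_daily_feeds() {
  hdfs dfs -createSnapshot /daily-feeds "snapshot-to-$1"
}

# List what changed between a snapshot and the current state ("." means current).
changes_since() {
  hdfs snapshotDiff /daily-feeds "snapshot-to-$1" .
}

# Restore one day's directory from a snapshot, preserving ownership, permissions, and timestamps.
restore_day() {
  hdfs dfs -cp -p "/daily-feeds/.snapshot/snapshot-to-$1/$2" /daily-feeds
}

# snapshot_daily_feeds "$(date +%Y-%m-%d)"
# changes_since 2014-10-17
# restore_day 2014-10-17 2014-10-16
```

Running the diff before restoring shows exactly which paths were deleted, which helps confirm that the restore targets the right directory.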
Reporting DataNode Volume Failures
• Configuring dfs.datanode.failed.volumes.tolerated > 0 enables a DataNode to keep running after volume failures
• DataNode is still running, but capacity is degraded
• HDFS already provided a count of failed volumes for each DataNode, but no further details
• Apache Hadoop 2.7.0 provides more information: failed path, estimated lost capacity and failure date/time
• An administrator can use this information to prioritize cluster maintenance work
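For scripted monitoring, cluster-wide failure totals can be pulled from the NameNode JMX servlet. This is a sketch under assumptions: jmx_value is a hypothetical helper that relies on the servlet printing one attribute per line, the host name is a placeholder, 50070 is the Hadoop 2.x default NameNode HTTP port, and the attribute names VolumeFailuresTotal and EstimatedCapacityLostTotal should be confirmed against the metrics documentation for your release.

```shell
# Extract a numeric attribute from JMX JSON read on standard input.
# jmx_value is a hypothetical helper; it assumes one attribute per line.
jmx_value() {
  grep "\"$1\"" | sed 's/[^0-9]*\([0-9][0-9]*\).*/\1/'
}

# Against a running NameNode (host name and port are assumptions):
# curl -s 'http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState' \
#   | jmx_value VolumeFailuresTotal
```

A non-zero total is a cue to look up the per-DataNode details (failed path, lost capacity, failure time) before scheduling disk replacement.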
Reporting DataNode Volume Failures
[Screenshot: new volume failure reporting]

Reporting DataNode Volume Failures
• Everything in the web UI is sourced from standardized Hadoop metrics
– Each DataNode publishes its own metrics
– NameNode publishes aggregate information from every DataNode
• Metrics accessible through JMX or the HTTP /jmx URI
• Integrated in Ambari
• Can be integrated into your preferred management tools and ops dashboards
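As a rough illustration of pulling these metrics into your own tooling, the sketch below reads the NameNode's aggregate view from the HTTP /jmx URI and lists DataNodes with failed volumes, largest estimated capacity loss first. The bean name (`Hadoop:service=NameNode,name=NameNodeInfo`), the `LiveNodes` attribute, and the per-node field names (`volfails`, `failedStorageLocations`, `estimatedCapacityLostTotal`, `lastVolumeFailureDate`) are assumptions that should be checked against the /jmx output of your own cluster. The host names and sample values are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed bean and field names; confirm them against your own /jmx output.
NAMENODE_INFO_BEAN = "Hadoop:service=NameNode,name=NameNodeInfo"


def fetch_beans(base_url, bean_name=NAMENODE_INFO_BEAN):
    """Fetch matching beans from a Hadoop daemon's HTTP /jmx endpoint."""
    with urlopen(f"{base_url}/jmx?qry={bean_name}") as response:
        return json.load(response)["beans"]


def volume_failure_report(beans):
    """List DataNodes with failed volumes, largest estimated capacity loss first."""
    report = []
    for bean in beans:
        # LiveNodes is assumed to be a JSON document encoded as a string.
        live_nodes = json.loads(bean.get("LiveNodes", "{}"))
        for node, info in live_nodes.items():
            if info.get("volfails", 0) > 0:
                report.append({
                    "node": node,
                    "failed_volumes": info["volfails"],
                    "failed_paths": info.get("failedStorageLocations", []),
                    "capacity_lost_bytes": info.get("estimatedCapacityLostTotal", 0),
                    "last_failure_ms": info.get("lastVolumeFailureDate", 0),
                })
    report.sort(key=lambda row: row["capacity_lost_bytes"], reverse=True)
    return report


# Illustrative payload only; every value here is made up.
SAMPLE_BEANS = [{
    "name": NAMENODE_INFO_BEAN,
    "LiveNodes": json.dumps({
        "dn1.example.com:50010": {"volfails": 0},
        "dn2.example.com:50010": {
            "volfails": 1,
            "failedStorageLocations": ["/data/2/dfs/dn"],
            "estimatedCapacityLostTotal": 4000000000000,
            "lastVolumeFailureDate": 1433980800000,
        },
        "dn3.example.com:50010": {
            "volfails": 2,
            "failedStorageLocations": ["/data/1/dfs/dn", "/data/3/dfs/dn"],
            "estimatedCapacityLostTotal": 8000000000000,
            "lastVolumeFailureDate": 1433894400000,
        },
    }),
}]

if __name__ == "__main__":
    for row in volume_failure_report(SAMPLE_BEANS):
        print(row["node"], row["failed_volumes"], row["failed_paths"])
```

Against a live cluster, the same report could be built with `volume_failure_report(fetch_beans("http://namenode.example.com:50070"))`, where the host name is a placeholder and 50070 is the default NameNode HTTP port in Hadoop 2.x. Keeping the parsing separate from the HTTP call makes it easy to feed the same records into an existing dashboard or alerting tool.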
New System to Manage the Health of Hadoop Clusters
• Ambari Alerts are installed and configured by default
• Health Alerts and Metrics managed via Ambari Web

Summary
• Configuration
– Prevent garbage collection issues
– Configure for redundancy
– Retune configuration in response to metrics
• HDFS ACLs
– Implement fine-grained authorization rules on files
– Can protect against accidental file manipulations
• HDFS Snapshots
– Point-in-time image of part of the filesystem
– Useful for restoring to a prior state after accidental file manipulation
• Reporting DataNode Volume Failures
– Metrics and web UI exposing information about volume failures on DataNodes
– Useful for planning cluster maintenance work
• Use Ambari
– Helps install, configure, monitor and manage Hadoop clusters
– Incorporates the latest best practices
Thank you, Q&A
Resource: Location
• Hardware Recommendations for Apache Hadoop: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
• HDFS operational and debuggability improvements: https://issues.apache.org/jira/browse/HDFS-6185
• HDFS ACLs Blog Post: http://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/
• HDFS Snapshots Blog Post: http://hortonworks.com/blog/protecting-your-enterprise-data-with-hdfs-snapshots/
Learn more
Contact me with your operations questions and suggestions
Chris Nauroth – cnauroth@hortonworks.com

More Related Content

What's hot

Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer Guide
Deon Huang
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table Snapshots
Cloudera, Inc.
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
GetInData
 
Manage Add-On Services with Apache Ambari
Manage Add-On Services with Apache AmbariManage Add-On Services with Apache Ambari
Manage Add-On Services with Apache Ambari
DataWorks Summit
 
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAse
enissoz
 
Scaling HBase for Big Data
Scaling HBase for Big DataScaling HBase for Big Data
Scaling HBase for Big Data
Salesforce Engineering
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Hortonworks
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
Rommel Garcia
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
DataWorks Summit
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
larsgeorge
 
Kafka Security
Kafka SecurityKafka Security
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
DataWorks Summit
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Thejas Nair
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
DataWorks Summit
 

What's hot (20)

Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer Guide
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table Snapshots
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Manage Add-On Services with Apache Ambari
Manage Add-On Services with Apache AmbariManage Add-On Services with Apache Ambari
Manage Add-On Services with Apache Ambari
 
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAse
 
Scaling HBase for Big Data
Scaling HBase for Big DataScaling HBase for Big Data
Scaling HBase for Big Data
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
Kafka Security
Kafka SecurityKafka Security
Kafka Security
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 

Viewers also liked

Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learned
tcurdt
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
Subhas Kumar Ghosh
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
Konstantin V. Shvachko
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
Takrim Ul Islam Laskar
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit/Hadoop Summit
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
Konstantin V. Shvachko
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
DataWorks Summit/Hadoop Summit
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
sudhakara st
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
Bart Vandewoestyne
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
Rahul Jain
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Hadoop
HadoopHadoop
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 

Viewers also liked (13)

Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learned
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Hadoop
HadoopHadoop
Hadoop
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 

Similar to Hadoop Operations - Best Practices from the Field

Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
Chris Nauroth
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
DataWorks Summit
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 
Interactive Hadoop via Flash and Memory
Interactive Hadoop via Flash and MemoryInteractive Hadoop via Flash and Memory
Interactive Hadoop via Flash and Memory
Chris Nauroth
 
HDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and SupportabilityHDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and Supportability
DataWorks Summit/Hadoop Summit
 
Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1
Chris Nauroth
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale
Perforce
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Democratizing Memory Storage
Democratizing Memory StorageDemocratizing Memory Storage
Democratizing Memory Storage
DataWorks Summit
 
Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
Hortonworks
 
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
SpringPeople
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
Chris Nauroth
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for Hadoop
Mike Pittaro
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Etu Solution
 
Establishing Environment Best Practices T12 Brendan Law
Establishing Environment Best Practices T12 Brendan LawEstablishing Environment Best Practices T12 Brendan Law
Establishing Environment Best Practices T12 Brendan Law
Flamer
 
Best And Worst Practices Deploying IBM Connections
Best And Worst Practices Deploying IBM ConnectionsBest And Worst Practices Deploying IBM Connections
Best And Worst Practices Deploying IBM Connections
LetsConnect
 
Backup netezza-tsm-v1403c-140330170451-phpapp01
Backup netezza-tsm-v1403c-140330170451-phpapp01Backup netezza-tsm-v1403c-140330170451-phpapp01
Backup netezza-tsm-v1403c-140330170451-phpapp01
Arunkumar Shanmugam
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
srikanthhadoop
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big DataHPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY
 

Similar to Hadoop Operations - Best Practices from the Field (20)

Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
 
Interactive Hadoop via Flash and Memory
Interactive Hadoop via Flash and MemoryInteractive Hadoop via Flash and Memory
Interactive Hadoop via Flash and Memory
 
HDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and SupportabilityHDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and Supportability
 
Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
 
Democratizing Memory Storage
Democratizing Memory StorageDemocratizing Memory Storage
Democratizing Memory Storage
 
Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
 
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for Hadoop
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
Establishing Environment Best Practices T12 Brendan Law
Establishing Environment Best Practices T12 Brendan LawEstablishing Environment Best Practices T12 Brendan Law
Establishing Environment Best Practices T12 Brendan Law
 
Best And Worst Practices Deploying IBM Connections
Best And Worst Practices Deploying IBM ConnectionsBest And Worst Practices Deploying IBM Connections
Best And Worst Practices Deploying IBM Connections
 
Backup netezza-tsm-v1403c-140330170451-phpapp01
Backup netezza-tsm-v1403c-140330170451-phpapp01Backup netezza-tsm-v1403c-140330170451-phpapp01
Backup netezza-tsm-v1403c-140330170451-phpapp01
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big DataHPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big Data
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...



Hadoop Operations - Best Practices from the Field

  • 1. Hadoop Operations – Best Practices from the Field June 11, 2015 Chris Nauroth email: cnauroth@hortonworks.com twitter: @cnauroth Suresh Srinivas email: suresh@hortonworks.com twitter: @suresh_m_s
  • 2. © Hortonworks Inc. 2011 About Me Chris Nauroth • Member of Technical Staff, Hortonworks – Apache Hadoop committer, PMC member, and Apache Software Foundation member – Major contributor to HDFS ACLs, Windows compatibility, and operability improvements • Hadoop user since 2010 – Prior employment experience deploying, maintaining and using Hadoop clusters Page 2 Architecting the Future of Big Data
  • 3. Agenda • Analysis of Hadoop Support Cases – Support case trends – Configuration – Software Improvements • Key Learnings and Best Practices – HDFS ACLs – HDFS Snapshots – Reporting DataNode Volume Failures
  • 4. Support Case Trends – Proportional Cases per Month [Chart: proportion of support cases per month, with months 1 to 43 on the x-axis and proportion 0 to 1 on the y-axis; series: HDFS, Map Reduce, YARN, Other (37 components)]
  • 5. Support Case Trends – Root Cause [Chart: number of support cases (axis 0 to 1200) by root cause category for YARN, Map Reduce and HDFS; categories: Customer Environment (Non HDP), Documentation Defect, Documentation Gap, Documentation Not Utilized, Education - Configuration, Needs Training, Product Defect]
  • 6. Support Case Trends • Core Hadoop components (HDFS, YARN and MapReduce) are used across all deployments, and therefore receive proportionally more support cases than other ecosystem components. • Misconfiguration is the dominant root cause. • Documentation is a close second. • We are constantly improving the code to eliminate operational issues, help with diagnosis and provide increased visibility. • Best practices get incorporated into Apache Ambari for improved defaults, simplified configuration and deeper monitoring.
  • 8. Configuration - Hardware and Cluster Sizing • Considerations – Larger clusters heal faster on node or disk failure – Machines with huge storage take longer to recover – More racks give more failure domains • Recommendations – Get good-quality commodity hardware – Buy the sweet spot in pricing: 3TB disk, 96GB, 8-12 cores – More memory is better – real time is memory hungry! – Before considering fatter machines (1U 6 disks vs. 2U 12 disks), get to 30-40 machines or 3-4 racks – Use pilot cluster to learn about load patterns – Balanced hardware for I/O, compute or memory bound – More details - http://tinyurl.com/hwx-hadoop-hw
  • 9. Configuration – JVM Tuning • Avoid JVM issues – Use 64-bit JVM for all daemons – Compressed OOPs enabled by default (Java 6u23 and later) – Java heap size – Set same max and starting heap size, Xmx == Xms – Avoid Java defaults – configure NewSize and MaxNewSize – Use 1/8 to 1/6 of max size for JVMs larger than 4G – Configure -XX:PermSize=128m, -XX:MaxPermSize=256m – Use low-latency GC collector – -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads=<N> – High <N> on NameNode and JobTracker or ResourceManager – Important JVM configs to help debugging – -verbose:gc -Xloggc:<file> -XX:+PrintGCDetails – -XX:ErrorFile=<file> – -XX:+HeapDumpOnOutOfMemoryError
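The JVM guidance on this slide can be sketched as a hadoop-env.sh style fragment. This is a minimal sketch under stated assumptions, not a recommended production setting: the 8 GB heap, the 8 GC threads, the log directory and the log file names are illustrative. The PermGen flags from the slide are left out because they only apply to Java 7 and earlier.

```shell
#!/usr/bin/env bash
# Sketch of NameNode JVM options following the slide's guidance.
# HEAP_MB, GC_THREADS and LOG_DIR are illustrative assumptions.
HEAP_MB=8192
GC_THREADS=8
LOG_DIR=/var/log/hadoop/hdfs

# Young generation sized at 1/8 of the max heap (the slide suggests 1/8 to 1/6).
NEW_MB=$(( HEAP_MB / 8 ))

# Same starting and max heap size (Xms == Xmx).
NN_OPTS="-Xms${HEAP_MB}m -Xmx${HEAP_MB}m"
NN_OPTS="$NN_OPTS -XX:NewSize=${NEW_MB}m -XX:MaxNewSize=${NEW_MB}m"
# Low-latency collector with an explicit GC thread count.
NN_OPTS="$NN_OPTS -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=${GC_THREADS}"
# Debugging aids: GC log, fatal error file, heap dump on OutOfMemoryError.
NN_OPTS="$NN_OPTS -verbose:gc -Xloggc:${LOG_DIR}/nn-gc.log -XX:+PrintGCDetails"
NN_OPTS="$NN_OPTS -XX:ErrorFile=${LOG_DIR}/nn-hs_err_%p.log"
NN_OPTS="$NN_OPTS -XX:+HeapDumpOnOutOfMemoryError"

# Prepend to whatever the environment already defines.
export HADOOP_NAMENODE_OPTS="$NN_OPTS ${HADOOP_NAMENODE_OPTS:-}"
echo "$HADOOP_NAMENODE_OPTS"
```

Keeping the sizes in variables makes it easier to retune the heap as the namespace grows, which the later monitoring slide recommends doing periodically.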
  • 10. Configuration • Deploy with QuorumJournalManager for high availability • Configure open fd ulimit – Default 1024 is too low – 16K for datanodes, 64K for Master nodes • Use version control for configuration!
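The open file descriptor recommendation can be checked with a few lines of shell. The sketch below is only illustrative: `fd_limit_ok` is a hypothetical helper, not part of Hadoop, and the numeric targets simply restate the slide's "16K" and "64K" figures.

```shell
#!/usr/bin/env bash
# Sketch: compare an open-file-descriptor limit with the slide's targets.
# fd_limit_ok is a hypothetical helper, not part of Hadoop.
fd_limit_ok() {
  local required=$1
  local current=${2:-$(ulimit -n)}   # default: limit of the current shell
  [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]
}

DATANODE_FDS=16384   # "16K for datanodes"
MASTER_FDS=65536     # "64K for Master nodes"

for target in "$DATANODE_FDS" "$MASTER_FDS"; do
  if fd_limit_ok "$target"; then
    echo "open fd limit $(ulimit -n) meets target $target"
  else
    echo "open fd limit $(ulimit -n) is below target $target"
  fi
done
```

The check is only meaningful when run as the account that starts the daemon, since limits are applied per user and per session.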
  • 11. Configuration • Use disk fail in place for datanodes: dfs.datanode.failed.volumes.tolerated – Disk failure is no longer datanode failure – Especially important for large density nodes • Set dfs.namenode.name.dir.restore to true – Restores NN storage directory during checkpointing • Take periodic backups of namenode metadata – Make copies of the entire storage directory • Set aside a lot of disk space for NN logs – It is verbose – set aside multiple GBs – Many installs configure this too small – NN logs roll within minutes – hard to debug issues
  • 12. Configuration – Monitoring Usage • Cluster storage, nodes, files and blocks grow – Update NN heap, handler count, number of DN xceivers – Tweak other related config periodically • Monitor the hardware usage for your workload – Disk I/O, network I/O, CPU and memory usage – Use this information when expanding cluster capacity • Monitor the usage with Hadoop metrics – JVM metrics – GC times, Memory used, Thread Status – RPC metrics – especially latency to track slowdowns – HDFS metrics – Used storage, # of files and blocks, total load on the cluster – File System operations – MapReduce Metrics – Slot utilization and Job status • Tweak configurations during upgrades/maintenance on an ongoing basis
  • 13. HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION Install & Configure: Ambari Guided Configuration Guide configuration and provide recommendations for the most common settings. (HBase Example Shown here)
  • 14. Software Improvements Real Incidents and Software Improvements to Address Them
  • 15. Don’t edit the metadata files! • Editing can corrupt the cluster state – Might result in loss of data • Real incident – NN misconfigured to point to another NN’s metadata – DNs can’t register due to namespace ID mismatch – System detected the problem correctly – Safety net ignored by the admin! – Admin edits the namenode VERSION file to match IDs • Result: mass deletion of unknown blocks that do not exist in that namespace
  • 16. Improvement • Pause deletion of blocks when the namenode starts up – https://issues.apache.org/jira/browse/HDFS-6186 – Supports configurable delay of block deletions after NameNode startup – Gives an admin extra time to diagnose before deletions begin • Show when block deletion will start after NameNode startup in WebUI – https://issues.apache.org/jira/browse/HDFS-6385 – The web UI already displayed the number of pending block deletions – This enhanced the display to indicate when actual deletion will begin
  • 17. Block Deletion Start Time [Screenshot: the new block deletion start time display]
  • 18. Guard Against Accidental Deletion • rm -r deletes the data at the speed of Hadoop! – ctrl-c of the command does not stop deletion! – Undeleting files on datanodes is hard & time consuming – Immediately shut down NN, unmount disks on datanodes – Recover deleted files – Start namenode without the delete operation in edits • Enable Trash • Real Incident – Customer is running a distro of Hadoop with trash not enabled – Deletes a large dir (100 TB) and shuts down NN immediately – Support person asks NN to be restarted to see if trash is enabled! • Result: blocks start deleting
  • 19. Improvement • HDFS Snapshots – https://issues.apache.org/jira/browse/HDFS-2802 – A snapshot is a read-only point-in-time image of part of the file system – A snapshot created before a deletion can be used to restore deleted data – More coverage of snapshots later in the presentation • HDFS ACLs – https://issues.apache.org/jira/browse/HDFS-4685 – Finer-grained control of file permissions can help prevent an accidental deletion – More coverage of ACLs later in the presentation
  • 20. Unexpected error during HA HDFS upgrade • Background: HDFS HA Architecture – http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html • Real Incident – During upgrade, NameNode calls every JournalNode to request backup of metadata directory, which renames “current” directory to “previous.tmp”. – Permissions incorrect on metadata directory for 1 out of 3 JournalNodes. – The hdfs user is not authorized to rename. Backup fails for that JournalNode, so upgrade process aborts with error. • Result: root cause not easily identifiable, long time to recover
  • 21. Improvement • Improve diagnostics on storage directory rename operations by using native code. – https://issues.apache.org/jira/browse/HDFS-7118 – Logs additional root cause information for rename failure, for example EACCES • Split error checks into separate conditions to improve diagnostics. – https://issues.apache.org/jira/browse/HDFS-7119 – Splits a log message about failure to delete or rename into separate log messages to clarify which specific action failed • When aborting NameNode or JournalNode, write the contents of the metadata directories and permissions to logs. – https://issues.apache.org/jira/browse/HDFS-7120 – Usually the first information asked of the user, so we can automate this • For JournalNode operations that must succeed on all nodes, execute a pre-check to verify that the operation can succeed. – https://issues.apache.org/jira/browse/HDFS-7121 – Prevents need for manual cleanup on 2 out of 3 JournalNodes where backup succeeded
  • 22. Key Learnings and Best Practices Features that Help Improve Production Operations
  • 23. HDFS ACLs • Existing HDFS POSIX permissions good, but not flexible enough – Permission requirements may differ from the natural organizational hierarchy of users and groups. • HDFS ACLs augment the existing HDFS POSIX permissions model by implementing the POSIX ACL model. – An ACL (Access Control List) provides a way to set different permissions for specific named users or named groups, not only the file’s owner and file’s group.
  • 24. HDFS File Permissions Example • Authorization requirements: – In a sales department, they would like a single user Maya (Department Manager) to control all modifications to sales data – Other members of sales department need to view the data, but can’t modify it. – Everyone else in the company must not be allowed to view the data. • Can be implemented via the following: [Diagram: file with sales data; user permission: read/write for user maya; group permission: read for group sales]
  • 25. HDFS ACLs • Problem – No longer feasible for Maya to control all modifications to the file – New Requirement: Maya, Diane and Clark are allowed to make modifications – New Requirement: New group called executives should be able to read the sales data – Current permissions model only allows permissions at 1 group and 1 user • Solution: HDFS ACLs – Now assign different permissions to different users and groups [Diagram: an HDFS directory with rwx entries for Owner, Group and Others, plus additional rwx entries for Group D, Group F and User Y]
  • 26. HDFS ACLs New Tools for ACL Management (setfacl, getfacl) – hdfs dfs -setfacl -m group:execs:r-- /sales-data – hdfs dfs -getfacl /sales-data # file: /sales-data # owner: maya # group: sales user::rw- group::r-- group:execs:r-- mask::r-- other::--- – How do you know if a directory has ACLs set? Look for the trailing + in the permission string: – hdfs dfs -ls /sales-data Found 1 items -rw-r-----+ 3 maya sales 0 2014-03-04 16:31 /sales-data
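The new requirements from the previous slide (Diane and Clark may modify, executives may read) can be expressed as one ACL spec passed to `setfacl`. This is a sketch under assumptions: the lowercase account names `diane` and `clark` are hypothetical, and the `run` and `acl_spec` helpers and the `DRY_RUN` switch are illustrative wrappers, not Hadoop features. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: grant the access described in the sales-data example with one ACL spec.
# The user names diane and clark and the DRY_RUN switch are assumptions.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Join ACL entries into the comma-separated form that setfacl accepts.
acl_spec() {
  local IFS=,
  echo "$*"
}

SPEC=$(acl_spec user:diane:rw- user:clark:rw- group:execs:r--)
run hdfs dfs -setfacl -m "$SPEC" /sales-data
run hdfs dfs -getfacl /sales-data
```

Maya needs no entry of her own because she is the file owner. On Hadoop 2 releases ACL support is switched off unless `dfs.namenode.acls.enabled` is set to true on the NameNode, so check that setting before running with `DRY_RUN=0`.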
  • 27. HDFS ACLs Best Practices • Start with traditional HDFS permissions to implement most permission requirements. • Define a smaller number of ACLs to handle exceptional cases. • A file with an ACL incurs an additional cost in memory in the NameNode compared to a file that has only traditional permissions.
  • 28. HDFS Snapshots • HDFS Snapshots – A snapshot is a read-only point-in-time image of part of the file system – Performance: snapshot creation is instantaneous, regardless of data size or subtree depth – Reliability: snapshot creation is atomic – Scalability: snapshots do not create extra copies of data blocks – Useful for protecting against accidental deletion of data • Example: Daily Feeds hdfs dfs -ls /daily-feeds Found 5 items drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-13 drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-14 drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-15 drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-16 drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-17
  • 29. HDFS Snapshots • Create a snapshot after each daily load hdfs dfsadmin -allowSnapshot /daily-feeds Allowing snaphot on /daily-feeds succeeded hdfs dfs -createSnapshot /daily-feeds snapshot-to-2014-10-17 Created snapshot /daily-feeds/.snapshot/snapshot-to-2014-10-17 • User accidentally deletes data for 2014-10-16 hdfs dfs -ls /daily-feeds Found 4 items drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-13 drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-14 drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-15 drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-17
  • 30. HDFS Snapshots • Snapshots to the rescue: the data is still in the snapshot hdfs dfs -ls /daily-feeds/.snapshot/snapshot-to-2014-10-17 Found 5 items drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-13 drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-14 drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-15 drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-16 drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-17 • Restore data from 2014-10-16 hdfs dfs -cp /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-16 /daily-feeds
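The snapshot workflow from the last three slides (allow snapshots once, snapshot after each load, copy a directory back after an accidental deletion) can be wrapped in a small script. This is a sketch only: the helper names, the `FEED_DIR` variable and the `DRY_RUN` switch are illustrative assumptions, and the `hdfs` commands are the ones shown on the slides. By default it prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: snapshot after each daily load, then restore one day's directory.
# FEED_DIR, the helper names and the DRY_RUN switch are illustrative assumptions.
DRY_RUN=${DRY_RUN:-1}
FEED_DIR=/daily-feeds

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Naming convention used on the slides: snapshot-to-<date of last load>.
snapshot_name() {
  echo "snapshot-to-$1"
}

snapshot_after_load() {
  run hdfs dfs -createSnapshot "$FEED_DIR" "$(snapshot_name "$1")"
}

# $1 = date of the snapshot to read from, $2 = day directory to restore.
restore_day() {
  run hdfs dfs -cp "$FEED_DIR/.snapshot/$(snapshot_name "$1")/$2" "$FEED_DIR"
}

run hdfs dfsadmin -allowSnapshot "$FEED_DIR"   # one-time step for an administrator
snapshot_after_load 2014-10-17
restore_day 2014-10-17 2014-10-16
```

Naming snapshots after the last loaded date keeps the restore step mechanical: the operator only needs the date of the most recent snapshot taken before the deletion.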
  • 31. Reporting DataNode Volume Failures • Configuring dfs.datanode.failed.volumes.tolerated > 0 enables a DataNode to keep running after volume failures • DataNode is still running, but capacity is degraded • HDFS already provided a count of failed volumes for each DataNode, but no further details • Apache Hadoop 2.7.0 provides more information: failed path, estimated lost capacity and failure date/time • An administrator can use this information to prioritize cluster maintenance work
  • 32. Reporting DataNode Volume Failures [Screenshot with the new display highlighted]
  • 33. Reporting DataNode Volume Failures
  • 34. Reporting DataNode Volume Failures
  • 35. Reporting DataNode Volume Failures • Everything in the web UI is sourced from standardized Hadoop metrics – Each DataNode publishes its own metrics – NameNode publishes aggregate information from every DataNode • Metrics accessible through JMX or the HTTP /jmx URI • Integrated in Ambari • Can be integrated into your preferred management tools and ops dashboards
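Reading these metrics over the HTTP /jmx URI can be sketched with curl and grep. Several details here are assumptions to verify against your Hadoop version: the host name is a placeholder, 50070 is the usual Hadoop 2 NameNode HTTP port, and the bean name `FSNamesystemState` with the attributes `VolumeFailuresTotal` and `EstimatedCapacityLostTotal` is my best understanding of where the aggregate values live. `jmx_number` is a hypothetical helper, and the network call only runs when `FETCH=1` is set.

```shell
#!/usr/bin/env bash
# Sketch: pull volume-failure counters from the NameNode /jmx endpoint.
# NN_HOST, port 50070, and the bean and attribute names are assumptions to
# verify against your Hadoop version.
NN_HOST=${NN_HOST:-namenode.example.com}
JMX_URL="http://${NN_HOST}:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"

# Extract one numeric attribute from JMX JSON read on stdin.
jmx_number() {
  grep -o "\"$1\" *: *[0-9][0-9]*" | grep -o '[0-9][0-9]*$'
}

# Only touch the network when explicitly asked to.
if [ "${FETCH:-0}" = "1" ]; then
  curl -s "$JMX_URL" | jmx_number VolumeFailuresTotal
fi
```

A grep-based extractor keeps the sketch dependency-free; a dashboard integration would more likely use a proper JSON parser.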
  • 36. New System to Manage the Health of Hadoop Clusters • Ambari Alerts are installed and configured by default • Health Alerts and Metrics managed via Ambari Web
  • 37. Summary • Configuration – Prevent garbage collection issues – Configure for redundancy – Retune configuration in response to metrics • HDFS ACLs – Implement fine-grained authorization rules on files – Can protect against accidental file manipulations • HDFS Snapshots – Point-in-time image of part of the filesystem – Useful for restoring to a prior state after accidental file manipulation • Reporting DataNode Volume Failures – Metrics and web UI exposing information about volume failures on DataNodes – Useful for planning cluster maintenance work • Use Ambari – Helps install, configure, monitor and manage Hadoop clusters – Incorporates the latest best practices
  • 38. Thank you, Q&A • Resources – Hardware Recommendations for Apache Hadoop: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html – HDFS operational and debuggability improvements: https://issues.apache.org/jira/browse/HDFS-6185 – HDFS ACLs Blog Post: http://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/ – HDFS Snapshots Blog Post: http://hortonworks.com/blog/protecting-your-enterprise-data-with-hdfs-snapshots/ • Learn more – Contact me with your operations questions and suggestions: Chris Nauroth – cnauroth@hortonworks.com

Editor's Notes

  1. First, a quick introduction. My name is Chris Nauroth. I’m a software engineer on the HDFS team at Hortonworks. I’m an Apache Hadoop committer and PMC member. I’m also an Apache Software Foundation member. Some of my major contributions include HDFS ACLs, Windows compatibility and various operability improvements. Prior to Hortonworks, I worked for Disney and did an initial deployment of Hadoop there. As part of that job, I worked very closely with the systems engineering team responsible for maintaining those Hadoop clusters, so I tend to think back to that team and get excited about things I can do now as a software engineer to help make that team’s job easier. I’m also here with Suresh Srinivas, one of the founders of Hortonworks, and a long-time Hadoop committer and PMC member. He has a lot of experience supporting some of the world’s largest clusters at Yahoo and elsewhere. Together with Suresh, we have experience supporting Hadoop clusters since 2008.
  2. For today’s agenda, I’d like to start by sharing some analysis that we’ve done of support case trends. In that analysis, we’re going to see that some common patterns emerge, and that’s going to lead into a discussion of configuration best practices and software improvements. In the second half of the talk, we’ll move into a discussion of key learnings and best practices around how recent HDFS features can help prevent problems or manage day-to-day maintenance.
  3. Let’s dive into the support case analysis. The data source for this chart is the entire history of support cases at Hortonworks. The x-axis is month and the y-axis is the proportion of support cases reported against a specific component. The chart focuses on 3 components that we define as the core of Hadoop: HDFS, YARN and MapReduce. All other components in the ecosystem are collapsed into a single line. Here we see a trend stabilizing around 30% of support cases driven from those core components. It also makes sense intuitively that a large proportion of support cases are driven from those core components, because every deployment uses them. As you rise up the stack, deployments start to vary in the components they choose to deploy. For example, a deployment may or may not deploy HBase depending on its use cases.
  4. The second chart shows an analysis of root cause category in each of those 3 core components. The source data contains many additional root cause categories. I’ve chosen to prune this down to the most significant ones to simplify the chart. The pattern that we see here is that a lot of support cases are driven by configuration issues or documentation problems. On an interesting side note, I gave a version of this presentation last year at Strata, and since then I’ve refreshed these charts with current data. Something I noticed is that documentation, configuration and software defects are proportionally a little bit smaller than last time. We’ve been investing a lot of energy in these areas, so it was satisfying to see the data showing that those efforts have been somewhat successful.
  5. Investment in operations at the core helps the most users.
  6. With that, let’s move into a discussion of common configuration issues that we continue to see.
  7. A cluster with fewer nodes is less resilient than one with many nodes. Failure of a DataNode that’s heavier on storage causes more re-replication activity. MapReduce jobs may need to rerun more tasks. Commodity != poor quality.
  8. Compressed ordinary object pointers are a technique used in the JVM to represent managed pointers as 32-bit offsets from a 64-bit base heap address. This saves on the space taken by 64-bit native pointers. We used to have a recommendation to pass a JVM argument to turn this on. Recent JVM versions just use it by default. Setting Xmx different from Xms can cause a big, expensive malloc later, with surprising results when you run out of memory late in the process lifetime, such as a visit from the OOM killer. For ParallelGCThreads, N=8 typically.
  9. NameNode high availability was a very hot topic a few years ago. At this point, the recommended HA architecture is to use QuorumJournalManager, which sets up an active-standby pair of NameNodes and offloads edit logging to a separate set of daemons called the JournalNodes. On a side note, version control for configuration is a good thing. It can be helpful to look back on the history of changes or restore to a last known good state.
  10. The DataNode has a feature called disk-fail-in-place that allows it to keep running even if individual volumes have failed. This is off by default, but you can turn it on by editing hdfs-site.xml and setting property dfs.datanode.failed.volumes.tolerated to the number of volumes that you tolerate failing before shutting down the entire DataNode. This is useful for high-density nodes, meaning nodes that have a lot of disks. If you have a node with 16 disks, and 2 disks fail, you’d probably prefer to keep that DataNode running with 14 disks available to serve clients instead of shutting down the whole thing. dfs.namenode.name.dir.restore is a property that controls whether or not the NameNode should attempt to bring back into service metadata storage directories that previously failed. By turning this on, you have the ability to repair a failed directory online and bring it back into service without restarting the NameNode process. We recommend taking periodic backups of the NameNode metadata. Copy the entire storage directory. Also plan on reserving a lot of disk for NameNode logs. A common pitfall is choosing too little space for logs, which then forces you to configure Log4J to roll logs very rapidly, and this can make debugging harder.
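
The disk-fail-in-place decision from slide 10 can be modeled in a few lines. This is a simplified sketch rather than the DataNode's actual implementation; the function names are hypothetical, and only the property name dfs.datanode.failed.volumes.tolerated comes from the notes.

```python
# Simplified model of the disk-fail-in-place decision.
# failed_volumes_tolerated stands in for the value of
# dfs.datanode.failed.volumes.tolerated in hdfs-site.xml.


def datanode_keeps_running(failed_volumes, failed_volumes_tolerated):
    """True while the failure count stays within the tolerated number."""
    return failed_volumes <= failed_volumes_tolerated


def volumes_still_serving(total_volumes, failed_volumes):
    """Volumes left to serve clients after the failures."""
    return total_volumes - failed_volumes


if __name__ == "__main__":
    # The 16-disk example from the notes: two failures, two tolerated.
    print(datanode_keeps_running(2, 2), volumes_still_serving(16, 2))
    # With a tolerance of 0, a single failure stops the DataNode.
    print(datanode_keeps_running(1, 0))
```

With a tolerance of 2, the 16-disk node in the example keeps serving from its remaining 14 disks; a third failure would exceed the tolerance.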
  11. Something to keep in mind is that usage patterns on a cluster tend to change over time as use cases change. Configuration may need to change in reaction to changing usage patterns. If you have a major upgrade or maintenance planned, then that’s a good opportunity to review configurations and see if anything else needs to change.
  12. Increasingly, we’re pushing configuration best practices into the implementation of Ambari. This takes the burden off of administrators to remember these best practices during deployments. For those who don’t know, Apache Ambari is an open source cluster deployment and management tool. For a little variety, I chose to pull a screenshot related to HBase. Here we can see that Ambari starts by recommending some good defaults, but still gives administrators the option to tune settings to match their specific needs.
  13. Next, I’d like to discuss a few software improvements that were prompted by our experiences in support cases. We’ve found that often very small code changes can have a big impact on preventing problems or recovering from them. I’m going to discuss some real incidents that we’ve seen and how they led us to make those code changes.
  14. First, a public service announcement: don’t edit the metadata files. The NameNode metadata files are crucial for maintaining the state of the file system, so editing them can corrupt cluster state and result in loss of data. Don’t edit them. Now that I’ve said that, let’s talk about editing the metadata files. This is a real incident. A NameNode was misconfigured to point to the metadata from a different NameNode. An important note here is that part of the NameNode metadata is a namespace ID, which uniquely identifies that file system namespace. When DataNodes register with a NameNode for the first time, they also acquire that namespace ID and persist it locally. On subsequent DataNode restarts, the NameNode has a check that the DataNode attempting to register with it is presenting the same namespace ID. After NameNode restart, the DataNodes could not register with the NameNode because of the namespace ID mismatch. The system detected the problem correctly, and so far everything is working as designed. However, the admin thought an appropriate fix would be to manually edit the VERSION file, which is the part of the metadata containing the namespace ID, and change it to match what the DataNodes were reporting. “What happens next?” The problem is that the NameNode’s fsimage also persists the block IDs that are known for each file. When these DataNodes from a different cluster started sending their block reports, the NameNode replied by saying these blocks do not exist in my namespace, and therefore they should be deleted.
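
The sequence of events in slide 14 can be replayed with a toy model. This is a deliberately simplified sketch with hypothetical class, method, and ID values; real NameNode registration and block report handling involve far more state than shown here.

```python
# Toy model of the namespace ID check and block report handling.
# All names and ID values are made up for illustration.


class ToyNameNode:
    def __init__(self, namespace_id, known_blocks):
        self.namespace_id = namespace_id       # stored in the VERSION file
        self.known_blocks = set(known_blocks)  # block IDs from the fsimage

    def accepts_registration(self, datanode_namespace_id):
        """DataNodes must present the same namespace ID as the NameNode."""
        return datanode_namespace_id == self.namespace_id

    def blocks_to_delete(self, reported_blocks):
        """Reported blocks that are unknown to this namespace."""
        return sorted(set(reported_blocks) - self.known_blocks)


if __name__ == "__main__":
    datanode_namespace_id = 222
    datanode_blocks = [1, 2, 3]

    # NameNode started against metadata from a different NameNode.
    nn = ToyNameNode(namespace_id=111, known_blocks=[10, 11])
    print(nn.accepts_registration(datanode_namespace_id))  # check fails

    # Manually editing VERSION changes the ID but not the fsimage.
    nn.namespace_id = 222
    print(nn.accepts_registration(datanode_namespace_id))  # check passes
    print(nn.blocks_to_delete(datanode_blocks))            # every block
```

In the model, editing the namespace ID only removes the safety check. The block IDs in the fsimage still belong to the other namespace, so every block the DataNodes report looks foreign.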
  15. This is the HDFS web UI, now with a small enhancement to show the time when block deletions will start.
  16. HDFS is known for being a scalable system. One of the things it’s really awesome at is scaling deletes! This can be a scary situation if someone deletes the wrong thing, because attempting to recover by undeleting block files is error-prone and time-consuming work across all DataNodes. We recommend enabling the HDFS trash feature as a safety net, which essentially changes deletes into renames, and the NameNode can then reap the trash files at a later time. However, I’m going to talk about a real incident in which trash was not enabled. There was a large directory deleted, and the admin realized this was a mistake and chose to shut down the NameNode immediately. The support engineer taking the case naturally figured we could restore from trash, so advised restarting the NameNode. “What happens next?”
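
The idea from slide 16 that trash turns a delete into a rename can be sketched against an in-memory namespace. This is an illustration only: the dictionary stands in for the file system, the function names are hypothetical, and the `/user/<user>/.Trash/Current` layout is assumed from HDFS's usual convention rather than taken from the notes.

```python
import posixpath

# In-memory illustration of "delete becomes rename" with trash enabled.
# Assumption: trash lives under /user/<user>/.Trash/Current.


def trash_path(user, path):
    """Location a deleted path is renamed to when trash is enabled."""
    return posixpath.join("/user", user, ".Trash", "Current", path.lstrip("/"))


def delete(namespace, user, path, trash_enabled):
    """Remove path from the namespace dict, keeping it in trash if enabled."""
    contents = namespace.pop(path)
    if trash_enabled:
        namespace[trash_path(user, path)] = contents


if __name__ == "__main__":
    fs = {"/data/reports": "important"}
    delete(fs, "alice", "/data/reports", trash_enabled=True)
    print(fs)  # the data is still reachable under the trash path

    fs = {"/data/reports": "important"}
    delete(fs, "alice", "/data/reports", trash_enabled=False)
    print(fs)  # nothing left to restore
```

With trash enabled the contents survive under a new path until they are reaped; without it the namespace entry is simply gone, which is the situation in the incident.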
  17. This incident really points out the importance of protecting data against accidental deletion. HDFS snapshots and HDFS ACLs are two features that I think help with this. I’ll have more coverage of these features later in the presentation.
  18. “What happens next?”
  19. If you’ve used POSIX ACLs on a Linux file system, then you already know how it works in HDFS too.
  20. By convention, snapshots can be referenced as a file system path under sub-directory “.snapshot”.
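
The path convention in slide 20 can be shown with a small helper. This is a sketch with a hypothetical function name and example paths; only the “.snapshot” sub-directory name comes from the notes.

```python
import posixpath

# Builds the path at which a file is visible inside a named snapshot.
# The directory, snapshot name, and file below are made-up examples.


def snapshot_path(snapshotted_dir, snapshot_name, relative_path=""):
    """Path of relative_path as captured in the given snapshot."""
    parts = [snapshotted_dir, ".snapshot", snapshot_name]
    if relative_path:
        parts.append(relative_path.lstrip("/"))
    return posixpath.join(*parts)


if __name__ == "__main__":
    print(snapshot_path("/data", "s1"))
    print(snapshot_path("/data", "s1", "logs/app.log"))
```

Because a snapshot is addressed as an ordinary path, restoring an accidentally deleted file amounts to copying it from the snapshot path back to its original location.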
  21. Here is a screenshot pointing out a change in the HDFS web UI: Total Datanode Volume Failures is a hyperlink. Clicking that jumps to…
  22. …this new screen listing the volume failures in detail. We can see the path of each failed storage location, and an estimate of the capacity that was lost. I think of this screen being used by a system engineer as a to-do list as part of regular cluster maintenance.
  23. Here is what it looks like when there are no volume failures. I included this picture, because this is what we all want it to look like. Of course, it won’t always be that way.