Ozone and HDFS’s evolution

© Hortonworks Inc. 2011–2018. All rights reserved
HDFS Scalability and Evolution:
HDDS and Ozone
Sanjay Radia,
Founder, Chief Architect, Hortonworks

About the Speaker
• Sanjay Radia
• Chief Architect, Founder, Hortonworks
• Apache Hadoop PMC and Committer
• Part of the original Hadoop team at Yahoo! since 2007
• Chief Architect of Hadoop Core at Yahoo!
• Prior
• Data center automation, virtualization, Java, HA, OSs, File Systems
• Startup, Sun Microsystems, INRIA…
• Ph.D., University of Waterloo

HDFS – What It Does Well and Not So Well
HDFS does well
• Scaling – IO + PBs + clients
• Horizontal scaling – IO + PBs
• Fast IO – scans and writes
• Number of concurrent clients 60K++
• Low latency metadata operations
• Fault tolerant storage layer
• Locality
• Replicas/Reliability and parallelism
• Layering – Namespace layer and storage layer
• Security
But scaling Namespace is limited to 500M files (192G Heap)
• Scaling Namespace – 500M FILES
• Scaling Block space
• Scaling Block reports
• Scaling DN’s block management
• Need further scaling of client/RPC 150K++
Ironically, Namespace in memory is both a strength and a weakness
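
A rough back-of-the-envelope check of the 500M files / 192G heap figure. This sketch assumes that 192G means 192 GiB and charges the whole heap to files; both are assumptions, since blocks and other NameNode structures share that heap in practice.

```python
# Rough heap cost per file implied by the slide's numbers.
# ASSUMPTIONS (not from the slide): 192G = 192 GiB, and the entire heap is
# attributed to files, although blocks and other objects share it.
HEAP_BYTES = 192 * 2**30
MAX_FILES = 500_000_000

bytes_per_file = HEAP_BYTES / MAX_FILES
print(f"{bytes_per_file:.0f} bytes of heap per file")


def heap_gib_for(files: int) -> float:
    """Heap (GiB) needed for `files` at the same per-file cost."""
    return files * bytes_per_file / 2**30
```

At that per-file cost, a namespace of 10 billion files would need roughly 3,840 GiB of heap, which illustrates why holding the entire namespace in memory limits scaling.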

Proof Points of Scaling Data, IO, Clients/RPC
• Proof points of large data and large clusters
• Single Organizations have over 600PB in HDFS
• Single clusters with over 200PB using federation
• Large clusters over 4K multi-core nodes bombarding a single NN
• Federation is the current scaling solution (both Namespace & Operations)
• In deployment at Twitter, Yahoo, FB, and elsewhere
Metadata in memory is the strength of the original GFS and HDFS design,
but it is also its weakness in scaling the number of files and blocks

Scaling HDFS with HDDS and Ozone

HDFS Layering
[Diagram: two layers over common storage]
• Namespace: NS1 … NS k, served by NN-1 … NN-k
• Block Storage: Block Management Layer with Block Pool 1 … Block Pool k
• Common Storage: DN 1, DN 2, … DN m

Solutions to Scaling Files, Blocks, Clients/RPC
Scale Namespace
• Hierarchical file system
– Cache only working set of namespace in memory
– Partition:
- Distributed namespace (transparent automatic partitioning)
- Volumes (static partitioning)
• Flat Key-Value store
– Cache only working set of namespace in memory
– Partition/Shard the space (easy to hash)
Scale Metadata Clients/RPC
• Multi-thread namespace manager
• Partitioning/Sharding
Slow NN startup
• Cache only working set in memory
• Shard/partition namespace
Scale Block Management
• Containers of blocks (2GB-16GB+)
• Will significantly reduce BlockMap
• Reduce Number of Block/Container reports
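
The "cache only working set of namespace in memory" idea can be sketched as a small cache in front of a persistent store. This is a minimal illustration, not the NameNode or Ozone implementation: the class name `WorkingSetCache`, the dict standing in for the on-disk store, and the LRU eviction policy are all assumptions (the slides do not name a policy).

```python
from collections import OrderedDict


class WorkingSetCache:
    """LRU cache over a persistent namespace store (a dict stands in here)."""

    def __init__(self, backing_store: dict, capacity: int):
        self.backing_store = backing_store  # full namespace, e.g. on disk
        self.capacity = capacity            # max entries held in memory
        self.cache = OrderedDict()          # hot entries, oldest first
        self.misses = 0

    def lookup(self, key: str):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            return self.cache[key]
        self.misses += 1
        value = self.backing_store[key]     # slow path: read from the store
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value


store = {f"/vol/bucket/key{i}": [i] for i in range(1000)}
ns = WorkingSetCache(store, capacity=2)
ns.lookup("/vol/bucket/key1")
ns.lookup("/vol/bucket/key2")
ns.lookup("/vol/bucket/key1")   # served from memory
ns.lookup("/vol/bucket/key3")   # evicts key2, the least recently used
```

Memory use is bounded by `capacity` rather than by the total number of keys, at the price of a slower path for keys outside the working set.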

Scaling HDFS
Must Scale both the Namespace and the Block Layer
• Scaling one is not sufficient
Scalable Block layer: Hadoop Distributed Data Storage (HDDS)
• Containers of blocks
• Replicated as a group
• Reduces Block Map
Scale Namespace: Several approaches (not exclusive)
• Partial namespace in memory
• Shard namespace
• Use flat namespace (KV namespace) – easier to implement and scale – Ozone

Evolution Towards New HDFS
• Scale Storage Layer: Container of Blocks – HDDS
• Flat KV Namespace: Ozone
• OzoneFS: Hadoop Compatible FS
• Hierarchical Namespace: New Scalable NN
• Result: New HDFS

HDFS, Ozone and Quadra on Same Cluster/Storage:
Shared Storage Servers and Shared Physical Storage
• HDFS: Scalable FS with Hierarchical Namespace – Hadoop Compatible FS API (FileSystem or FileContext)
• Ozone: Highly Scalable KV Object Store with Flat Namespace – S3 API
• Quadra: Raw Storage Volumes – Raw Storage API (Lun/EBS like, SCSI), used by a Linux FS
• Data Nodes: Shared Storage Servers for HDFS-Blocks and Ozone/Quadra Blocks
• Shared Physical Storage

How It All Fits Together
Existing HDFS
• Old HDFS NN: all namespace in memory
• File = Bid[]
• BlockMap (Bid -> IPAddress of DN), fed by Block Reports
• HDFS Block storage on DataNodes (Bid -> Data)
New
• Ozone Master: K-V Flat Namespace
• File (Object) = Bid[]
• Bid = Cid + LocalId
• New HDFS NN (scalable): Hierarchical Namespace
• File = Bid[]
• Bid = Cid + LocalId
• HDDS – Clean Separation of Block layer
• Container Management & Cluster Membership
• ContainerMap (CId -> IPAddress of DN), fed by Container Reports
• HDDS Container Storage on DataNodes (Bid -> Data, but blocks grouped in containers)
Physical Storage – DataNodes and physical storage shared between Old HDFS and HDDS

Ozone FS
Ozone/HDDS Can Be Used Separately, or also with HDFS
• Initially HDFS is the default FS
• Has many features
• so cannot be replaced by OzoneFS on day one
• Ozone FS sits on side as additional namespace, sharing DNs
• For applications that work with a Hadoop Compatible FS on a K-V Store – Hive, Spark …
• How is Ozone FS accessed?
• Use direct URIs for either HDFS or OzoneFS
• Mount in HDFS or in ViewFS
[Diagram: HDFS as the Default FS]

Scalable Block Layer:
Hadoop Distributed Data Storage (HDDS)
Container: Containers of blocks (2GB-16GB+)
• Replicated as a group
• Each Container has a unique ContainerId
– Every block within a container has a block id
» BlockId = ContainerId, LocalId
Data Nodes – HDFS and HDDS can share DNs
• DataNodes contain a set of containers (just like they used to contain blocks)
• DataNodes send Container-reports (like block reports) to CM (Container Manager)
HDDS: Separate layer from namespace layer (strictly separate, not almost)
CM – Container manager
• Cluster membership
• Receives container reports from DNs
• Manages container replication
• Maintains Container Map (Cid -> IPAddr)
Block Pools
• Just like blocks were in block pools, containers are also in container pools
• This allows independent namespaces to carve out their block space
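
The relationship between BlockId, ContainerId and the Container Map can be sketched as below. This is an illustrative model only: the class and method names are hypothetical, and treating each container report as a full replacement of a DataNode's previous report is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BlockId:
    container_id: int   # identifies the container
    local_id: int       # identifies the block inside that container


@dataclass
class ContainerManager:
    """Tracks which DataNodes hold each container (Cid -> DN addresses)."""
    container_map: dict = field(default_factory=dict)

    def on_container_report(self, dn_addr: str, container_ids: list) -> None:
        # ASSUMPTION: a report is a full snapshot, so drop stale entries
        # for this DN before recording what it reports now.
        for holders in self.container_map.values():
            holders.discard(dn_addr)
        for cid in container_ids:
            self.container_map.setdefault(cid, set()).add(dn_addr)

    def locate(self, block: BlockId) -> set:
        # Location is resolved per container, not per block.
        return self.container_map.get(block.container_id, set())


cm = ContainerManager()
cm.on_container_report("10.0.0.1", [7, 8])
cm.on_container_report("10.0.0.2", [7])
locations = cm.locate(BlockId(container_id=7, local_id=42))
```

Because the map is keyed by container rather than by block, its size grows with the number of containers, which is the source of the block-map reduction.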

Key Ozone Characteristics – Compare with HDFS
• Scale Block Management
• Containers of blocks (2 GB to 16 GB)
• 2-4 GB block containers initially => 40-80x reduction in BR and CM block map
• Reduce BR on DNs, Masters, Network
• Scale Namespace
• Key Space Manager caches only working set in memory
• Future scaling:
• Flat namespace is easy to shard (Buckets are natural sharding points)
• Scale Number of Metadata Clients/RPC
• No single global lock like NN
• Metadata operations are simpler
• Sharding will help further
• Fault Tolerance
– Blocks – inherits HDFS’s block-layer FT
– Namespace – uses Raft rather than Journal Nodes
• HA is easier
• Manageability
– GC/Overloaded Master is no longer an issue
• caches working set
– Journal nodes disappear – Raft is used
– Faster and more predictable failover
– Fast start up
• Faster upgrades
• Faster failover
• Retains HDFS Semantics & Performance
– Strong consistency, locality, fast scans, …
• Other:
– OM can run on DNs – beneficial for small clusters or embedded systems
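
The 40-80x reduction quoted under Scale Block Management can be related to block size: tracking containers instead of blocks shrinks the map by roughly the number of blocks per container. The sketch below back-derives an average block size of about 51 MiB from the slide's own figures; that average is an inference, not a number stated in the source.

```python
# Entries a master must track: one per block (HDFS) vs one per container (HDDS).
# ASSUMPTION: an average block size of 51.2 MiB, back-derived from the slide's
# "2-4 GB containers => 40-80x" figures; it is not stated in the source.
MIB = 2**20
GIB = 2**30
AVG_BLOCK_BYTES = 51.2 * MIB


def reduction_factor(container_bytes: int) -> float:
    """Blocks per full container, i.e. the shrink factor of the map."""
    return container_bytes / AVG_BLOCK_BYTES


for size_gib in (2, 4):
    factor = reduction_factor(size_gib * GIB)
    print(f"{size_gib} GiB containers -> {factor:.0f}x fewer entries")
```

With larger average blocks the factor would be smaller, and with many small blocks it would be larger.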

Will OzoneFS’s Key-Value Store Work with Hadoop Apps?
• Two years ago – NO!
• Today - Yes!
• Hive, Spark and others are making sure they work on Cloud K-V Object Stores via HCFS
• Even customers are ensuring that their apps work on Cloud K-V Object Stores via HCFS
• Lack of real directories and their ACLs: Fake directories + Buckets ACLs
• Lack of strong consistency in S3 (it is only eventually consistent) is being worked around – S3Guard (Note: OzoneFS is consistent)
• Lack of rename in S3 is being worked around
• Various direct output committers (early versions had issues)
• Netflix Direct Committer; being replaced by Iceberg
• Via Metastore (Databricks has proprietary version, Hive’s approach)

Details of HDDS

Container Structure (Using RocksDB)
[Diagram: a Container consists of a Container Index and several Chunk data files]
• Container Index: LSM store (LevelDB/RocksDB) with entries Key 1 … Key N
• Each entry: Chunk Data File Name, Offset, Length
• An embedded LSM/KVStore (RocksDB)
• BlockId is the key,
• filename of local chunk file is value
• Optimizations
• Small blocks (< 1MB) can be stored directly in RocksDB
• Compaction for block data to avoid lots of files
• But this can be evolved over time
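
The container layout above can be sketched as an index plus chunk files. This is a simplified model under stated assumptions: plain dicts stand in for RocksDB and for files on local disk, the index is keyed by the block's local id, each large block gets its own chunk file, and all names (`Container`, `ChunkRef`, `put_block`, `get_block`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Union

SMALL_BLOCK_LIMIT = 1 * 2**20  # blocks under 1 MB are stored inline


@dataclass(frozen=True)
class ChunkRef:
    file_name: str  # local chunk data file
    offset: int
    length: int


class Container:
    def __init__(self, container_id: int):
        self.container_id = container_id
        self.index = {}        # stand-in for RocksDB: local block id -> value
        self.chunk_files = {}  # stand-in for chunk files on local disk

    def put_block(self, local_id: int, data: bytes) -> None:
        if len(data) < SMALL_BLOCK_LIMIT:
            self.index[local_id] = data          # small block: inline value
            return
        name = f"{self.container_id}_{local_id}.chunk"
        self.chunk_files[name] = data
        self.index[local_id] = ChunkRef(name, 0, len(data))

    def get_block(self, local_id: int) -> bytes:
        value: Union[bytes, ChunkRef] = self.index[local_id]
        if isinstance(value, bytes):
            return value                          # served from the index
        blob = self.chunk_files[value.file_name]
        return blob[value.offset:value.offset + value.length]


c = Container(7)
c.put_block(1, b"tiny")
c.put_block(2, bytes(2 * 2**20))
```

Storing an offset and length alongside the file name leaves room for several blocks to share one chunk file, which is one way to avoid lots of small files.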

Replication of Container
• Use RAFT replication instead of data pipeline, for both data and metadata
• Proven to be correct
• Raft is traditionally used for small updates and transactions, so it fits well for metadata
• Performance considerations
• When writing the metadata into the Raft journal, put the data directly in container storage
• Raft journal on a separate disk – fast contiguous writes without seeking
• Data spread across the other disks
• Client uses Raft protocol to write data to the DNs storing the container

Open and Closed Containers
Open – active writers
• Need at least (NumSpindles * DataNodes) open active containers
• Clients can get locality on writes
• Data is spread across all data nodes
• Improved IO and better chance of getting locality
• Keep DNs and ALL spindles busy
Closed – typically when full or had a failure in the past
• Why close a container on failures?
• We originally considered keeping it open and bringing in a new DN
• Wait for the data to copy?
• Decided to close it, and have it replicated
• Can open later or can merge with other closed container – under design
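
The open/closed lifecycle can be sketched as a small state machine. This is a minimal sketch under assumptions: `NumSpindles` is read as spindles per DataNode, a container closes exactly when it reaches capacity, and the class and method names are hypothetical.

```python
from enum import Enum


class State(Enum):
    OPEN = "open"
    CLOSED = "closed"


def min_open_containers(num_spindles_per_dn: int, num_datanodes: int) -> int:
    """Lower bound from the slide: NumSpindles * DataNodes."""
    return num_spindles_per_dn * num_datanodes


class ContainerState:
    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.state = State.OPEN

    def write(self, nbytes: int) -> bool:
        """Accept a write only while open and with room; close when full."""
        if self.state is State.CLOSED:
            return False
        if self.used_bytes + nbytes > self.capacity_bytes:
            return False
        self.used_bytes += nbytes
        if self.used_bytes == self.capacity_bytes:
            self.state = State.CLOSED
        return True

    def on_replica_failure(self) -> None:
        # Close rather than wait for a new DN to copy the data; the closed
        # container is then re-replicated as a unit.
        self.state = State.CLOSED


full = ContainerState(capacity_bytes=10)
full.write(10)              # fills the container, which closes it
failed = ContainerState(capacity_bytes=10)
failed.on_replica_failure()
```

For example, 100 DataNodes with 12 spindles each would call for at least 1,200 open containers to keep every spindle busy.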

Details of Ozone

Ozone Master
[Diagram: Client, Ozone Master, CM and DataNodes DN1, DN2, … DNn]
• Ozone Master: K-V Namespace, stored in RocksDB
• File (Object) = Bid[]
• Bid = Cid + LocalId
• CM: ContainerMap (CId -> IPAddress of DN)
• Client calls: bId[] = Open(Key, ..), GetBlockLocations(Bid), then Read, Write, …
• $$$ – Container Map Cache
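
The lookup flow above can be sketched as follows. This is a simplified model, not the Ozone RPC interface: the Python method names mirror the diagram labels, block ids are plain `(container_id, local_id)` tuples, and placing the container map cache on the client side is an assumption (the diagram only marks cache locations with `$$$`).

```python
class OzoneMaster:
    def __init__(self):
        self.namespace = {}  # key -> list of (container_id, local_id)

    def open(self, key):
        return self.namespace[key]


class ContainerManager:
    def __init__(self):
        self.container_map = {}  # container_id -> list of DN addresses
        self.lookups = 0

    def get_block_locations(self, bid):
        self.lookups += 1
        container_id, _local_id = bid
        return self.container_map[container_id]


class Client:
    def __init__(self, om, cm):
        self.om, self.cm = om, cm
        self.container_cache = {}  # ASSUMED location of the "$$$" cache

    def locate(self, key):
        """Return [(bid, dn_addresses), ...] for every block of `key`."""
        result = []
        for bid in self.om.open(key):
            cid = bid[0]
            if cid not in self.container_cache:
                self.container_cache[cid] = self.cm.get_block_locations(bid)
            result.append((bid, self.container_cache[cid]))
        return result


om, cm = OzoneMaster(), ContainerManager()
om.namespace["/vol/bucket/key"] = [(7, 1), (7, 2), (8, 1)]
cm.container_map = {7: ["dn1", "dn2", "dn3"], 8: ["dn2", "dn3", "dn4"]}
plan = Client(om, cm).locate("/vol/bucket/key")
```

Three blocks in two containers cost only two lookups against the CM, since locations are cached per container rather than per block.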

Ozone APIs
• Key: /VolumeName/BucketId/ObjectKey (e.g. /Home/John/foo/bar/zoo)
• ACLs at Volume and Bucket level (the other dirs are fake)
• Future sharding at bucket level
• => Ozone is Consistent (unlike S3)
Ozone Object API (RPC)
S3 Connector
Hadoop FileSystem and Hadoop FileContext Connectors
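
The key layout can be illustrated with a small parser. This is a sketch, not Ozone client code: `parse_key` and `shard_for` are hypothetical names, and hashing the volume and bucket to pick a shard is only one possible reading of "future sharding at bucket level".

```python
import hashlib
from typing import NamedTuple


class OzoneKey(NamedTuple):
    volume: str
    bucket: str
    object_key: str  # may contain "/" (the "fake" directories)


def parse_key(path: str) -> OzoneKey:
    parts = path.lstrip("/").split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected /volume/bucket/key, got {path!r}")
    return OzoneKey(*parts)


def shard_for(key: OzoneKey, num_shards: int) -> int:
    """Hypothetical bucket-level sharding: same bucket -> same shard."""
    digest = hashlib.sha256(f"{key.volume}/{key.bucket}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


key = parse_key("/Home/John/foo/bar/zoo")
```

Here `Home` is the volume, `John` is the bucket, and `foo/bar/zoo` is a single flat object key, so only the first two components carry ACLs.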

Where does the Ozone Master run?
Which Node?
• On a separate node with large enough memory for caching the working set
• Caching the working set is important for large number of concurrent clients
• This option would give predictable performance for large clusters
• On the Datanodes
• How much memory is available for caching?
• Note: tasks and other services run on DN since they are typically also compute nodes
Where is Storage for the Ozone KV Metadata?
• Local disk
• If on DN then is it dedicated disk or shared with DN?
• Use the container storage (it's using RocksDB anyway)
• Spread Ozone volumes across containers to gain performance,
• but this may limit volume size & force more Ozone volumes than Admin wants

Quadra – Lun-like Raw-Block Storage
Used for Creating Mountable Disk FS Volume

Quadra: Raw-Block Storage Volume (Lun)
Lun-like storage service where the blocks are stored on HDDS
• Volume: A raw-block device that can be used to create a mountable disk on Linux.
• Raw-Blocks - those of the native FS that will use the Lun Volume
• Raw-block size is dictated by the native fs like ext4 (4K)
• Raw-Blocks are unit of IO operations by native file systems.
• Raw-Block is the unit of read/write/update to HDDS
• Ozone and Quadra share HDDS as a common storage backend
• Current prototype: 1 raw-block = 1 HDDS block (but this will change later)
Can be used in Kubernetes for container state
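
The raw-block model can be sketched as a byte-addressed volume whose 4K raw blocks each map to one stored block, as in the current prototype. This is an illustration only: a dict stands in for HDDS, and the names `Volume` and `raw_blocks_for` are hypothetical.

```python
RAW_BLOCK_SIZE = 4096  # e.g. the 4K block size of ext4


def raw_blocks_for(offset: int, length: int):
    """Yield (raw_block_index, start_in_block, nbytes) covering a byte range."""
    end = offset + length
    while offset < end:
        index, start = divmod(offset, RAW_BLOCK_SIZE)
        nbytes = min(RAW_BLOCK_SIZE - start, end - offset)
        yield index, start, nbytes
        offset += nbytes


class Volume:
    """Raw-block device backed by a block store (a dict stands in for HDDS)."""

    def __init__(self):
        self.hdds = {}  # raw block index -> 4K of bytes (1 raw-block = 1 block)

    def write(self, offset: int, data: bytes) -> None:
        pos = 0
        for index, start, nbytes in raw_blocks_for(offset, len(data)):
            block = bytearray(self.hdds.get(index, bytes(RAW_BLOCK_SIZE)))
            block[start:start + nbytes] = data[pos:pos + nbytes]
            self.hdds[index] = bytes(block)   # read-modify-write of one block
            pos += nbytes

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        for index, start, nbytes in raw_blocks_for(offset, length):
            block = self.hdds.get(index, bytes(RAW_BLOCK_SIZE))
            out += block[start:start + nbytes]
        return bytes(out)


vol = Volume()
vol.write(4090, b"0123456789")   # straddles raw blocks 0 and 1
```

A write that straddles a block boundary touches two raw blocks, each updated as a whole unit, which matches the raw block being the unit of read/write/update.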

Status
HDDS: Block Container
• 2-4 GB block containers initially
• Reduction of 40-80x in BR and block map
• Reduce BR pressure on NN/OzoneMaster
• Initial version to scale to 10s billions of blocks
Ozone Master
• Implemented using RocksDB (just like the HDDS in DNs)
• Initial version to scale to 10 billion objects
Current Status and Steps to GA
• Stabilize HDDS and Ozone
• Measure and improve performance
• Add HA for Ozone Master and Container Manager
• Add security – Security design completed and published
After GA
• Further stabilization and performance improvements
• Transparent encryption
• Erasure codes
• Snapshots (or their equivalent)
• ..

Summary
• HDFS scale proven in real production systems
• 4K+ node clusters
• Raw Storage >200PB in single federated NN cluster and >30PB in non-federated clusters
• Scales to 60K+ concurrent clients bombarding the NN
• But very large number of small files is a challenge (500M files)
• HDDS + Ozone: Scalable Hadoop Storage
• Retains
• HDFS block storage Fault-tolerance
• HDFS Horizontal scaling for Storage, IO
• HDFS’s move computation to Storage
• HDDS: Block containers:
• Initially scale to 10B blocks, later to 100B+ blocks (HDFS-7240)
• Ozone – Flat KV namespace + Hadoop Compatible FS (OzoneFS)
• initially scale to 10B files (HDFS-13074)
• Community working on a Hierarchical Namespace on HDDS (HDFS-10419)

Thank You
Q&A
