Abstract
November 2014
Copyright 2015 EMC Corporation. All Rights Reserved.
For the most up-to-date listing of EMC product names, see EMC
Corporation Trademarks on EMC.com.
Figure 1: OneFS Combines File System, Volume Manager and Data Protection
into One Single Intelligent, Distributed System.
This is the core innovation that directly enables enterprises to successfully utilize scale-out NAS in their environments today. It adheres to the key principles of scale-out: intelligent software, commodity hardware, and a distributed architecture. OneFS is not only the operating system but also the underlying file system that drives and stores data in the Isilon scale-out NAS cluster.
Isilon nodes
OneFS works exclusively with the Isilon scale-out NAS nodes, which together are referred to as a cluster. A single Isilon cluster consists of multiple nodes, which are constructed as rack-mountable enterprise appliances containing memory, CPU, networking, Non-Volatile Random Access Memory (NVRAM), low-latency InfiniBand interconnects, disk controllers, and storage media. Each node in the distributed cluster thus has compute (processing) capabilities as well as storage (capacity) capabilities.
An Isilon cluster starts with as few as three nodes, and currently scales to 144 nodes (governed by the largest, 144-port InfiniBand switch that Isilon has qualified). There are many different types of nodes, all of which can be incorporated into a single cluster, where different nodes provide different ratios of capacity to throughput or Input/Output operations per second (IOPS).
OneFS has no built-in limitation in terms of the number of nodes that can be included
in a single system. Each node added to a cluster increases aggregate disk, cache,
CPU, and network capacity. OneFS leverages each of the hardware building blocks, so
that the whole becomes greater than the sum of the parts. The RAM is grouped
together into a single coherent cache, allowing I/O on any part of the cluster to
benefit from data cached anywhere. NVRAM is grouped together to allow for high-
throughput writes that are safe across power failures. Spindles and CPU are combined
to increase throughput, capacity and IOPS as the cluster grows, for access to one file
or for multiple files. A cluster's storage capacity can range from a minimum of 18 terabytes (TB) to a maximum of greater than 30 petabytes (PB). The maximum capacity will continue to increase as disk drives continue to get denser.
Network
There are two types of networks associated with a cluster: internal and external.
Back-end network
All intra-node communication in a cluster is performed using a proprietary, unicast
(node to node) protocol. Communication uses an extremely fast low-latency,
InfiniBand (IB) network. This back-end network, which is configured with redundant
switches for high availability, acts as the backplane for the cluster, enabling each
node to act as a contributor in the cluster and isolating node-to-node communication
to a private, high-speed, low-latency network. This back-end network utilizes Internet
Protocol (IP) over IB for node-to-node communication.
Front-end network
Clients connect to the cluster using Ethernet connections (1GbE or 10GbE) that are
available on all nodes. Because each node provides its own Ethernet ports, the
amount of network bandwidth available to the cluster scales linearly with performance
and capacity. The Isilon cluster supports standard network communication protocols
to a customer network, including NFS, SMB, HTTP, FTP, HDFS, and OpenStack Swift.
Figure 2 depicts the complete architecture: software, hardware, and network all working together in your environment with servers to provide a completely distributed single file system that can scale dynamically as workload, capacity, or throughput needs change in a scale-out environment.
Client services
The front-end protocols that the clients can use to interact with OneFS are referred to
as client services. Please refer to the Supported Protocols section for a detailed list of
supported protocols. To understand how OneFS communicates with clients, we split the I/O subsystem into two halves: the top half, or Initiator, and the bottom half, or Participant. Every node in the cluster is a Participant for a particular I/O operation. The node that the client connects to is the Initiator, and that node acts as the captain for the entire I/O operation. The read and write operations are detailed in later sections.
Cluster operations
In a clustered architecture, there are cluster jobs that are responsible for taking care of the health and maintenance of the cluster itself; these jobs are all managed by the OneFS job engine. The Job Engine runs across the entire cluster and is responsible for dividing and conquering large storage management and protection tasks. To achieve this, it reduces a task into smaller work items and distributes those items across multiple worker threads on every node in the cluster.
Job Engine includes a comprehensive check-pointing system which allows jobs to be paused
and resumed, in addition to stopped and started. The Job Engine framework also includes an
adaptive impact management system.
The Job Engine typically executes jobs as background tasks across the cluster, using spare or
especially reserved capacity and resources. The jobs themselves can be categorized into three
primary classes:
1. File system maintenance jobs. These jobs perform background file system maintenance, and typically require access to all nodes. These jobs are required to run in default configurations, and often in degraded cluster conditions. Examples include file system protection and drive rebuilds.
2. Feature support jobs. The feature support jobs perform work that facilitates some extended storage management function, and typically only run when the feature has been configured. Examples include deduplication and anti-virus scanning.
3. User action jobs. These jobs are run directly by the storage administrator to accomplish some data management goal. Examples include parallel tree deletes and permissions maintenance.
The table below provides a comprehensive list of the exposed Job Engine jobs, the operations
they perform, and their respective file system access methods:
Although the file system maintenance jobs are run by default, either on a schedule or in
reaction to a particular file system event, any Job Engine job can be managed by configuring
both its priority-level (in relation to other jobs) and its impact policy.
An impact policy can consist of one or many impact intervals, which are blocks of time
within a given week. Each impact interval can be configured to use a single pre-
defined impact-level which specifies the amount of cluster resources to use for a
particular cluster operation. Available job engine impact-levels are:
Paused
Low
Medium
High
Additionally, Job Engine jobs are prioritized on a scale of one to ten, with a lower value
signifying a higher priority. This is similar in concept to the UNIX scheduling utility, nice.
The Job Engine allows up to three jobs to be run simultaneously. This concurrent job execution is governed by the following criteria, illustrated in the sketch after this list:
Job Priority
Exclusion Sets - jobs which cannot run together (for example, FlexProtect and AutoBalance)
Cluster health - most jobs cannot run when the cluster is in a degraded state.
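As a rough illustration of these scheduling rules, the following Python sketch models a scheduler that starts at most three jobs at a time, prefers lower priority values, and never co-schedules jobs that share an exclusion set. The job names and the single exclusion set shown are illustrative examples, not the Job Engine's actual job list or internals.

```python
from dataclasses import dataclass, field

MAX_CONCURRENT_JOBS = 3

# Example exclusion set: restripe-style jobs that cannot run simultaneously.
EXCLUSION_SETS = [{"FlexProtect", "AutoBalance"}]

@dataclass(order=True)
class Job:
    priority: int                       # 1 (highest) .. 10 (lowest), like UNIX nice
    name: str = field(compare=False)

def conflicts(candidate: Job, running: list) -> bool:
    """True if the candidate shares an exclusion set with any running job."""
    for exclusion in EXCLUSION_SETS:
        if candidate.name in exclusion and any(j.name in exclusion for j in running):
            return True
    return False

def schedule(queued: list, running: list) -> list:
    """Pick queued jobs to start, respecting priority, concurrency and exclusion."""
    started = []
    for job in sorted(queued):          # lowest priority value runs first
        if len(running) + len(started) >= MAX_CONCURRENT_JOBS:
            break
        if not conflicts(job, running + started):
            started.append(job)
    return started

if __name__ == "__main__":
    queued = [Job(2, "FlexProtect"), Job(4, "AutoBalance"),
              Job(6, "SmartPools"), Job(8, "Dedupe")]
    print([j.name for j in schedule(queued, running=[])])
    # ['FlexProtect', 'SmartPools', 'Dedupe'] -- AutoBalance is held back by the exclusion set
```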
OneFS is truly a single file system with one namespace. Data and metadata are
striped across the nodes for redundancy and availability. The storage has been
completely virtualized for the users and administrator. The file tree can grow
organically without requiring planning or oversight about how the tree grows or how
users use it. No special thought has to be applied by the administrator about tiering
files to the appropriate disk, because Isilon SmartPools will handle that automatically
without disrupting the single tree. No special consideration needs to be given to how
one might replicate such a large tree, because the Isilon SyncIQ™ service
automatically parallelizes the transfer of the file tree to one or more alternate
clusters, without regard to the shape or depth of the file tree.
This design should be compared with namespace aggregation, which is a commonly used technology to make traditional NAS appear to have a single namespace. With namespace aggregation, files are still managed in separate volumes, and a thin veneer layer simply presents those separate volumes as a single tree, so each underlying volume must still be sized, managed, and protected on its own.
Data layout
OneFS uses physical pointers and extents for metadata and stores file and directory
metadata in inodes. B-trees are used extensively in the file system, allowing
scalability to billions of objects and near-instant lookups of data or metadata. OneFS
is a completely symmetric and highly distributed file system. Data and metadata are
always redundant across multiple hardware devices. Data is protected using erasure
coding across the nodes in the cluster; this creates a highly efficient cluster, allowing 80% or better raw-to-usable capacity on clusters of five nodes or more. Metadata
(which makes up generally less than 1% of the system) is mirrored in the cluster for
performance and availability. As OneFS is not reliant on RAID, the amount of
redundancy is selectable by the administrator, at the file- or directory-level beyond
the defaults of the cluster. Metadata access and locking tasks are managed by all
nodes collectively and equally in a peer-to-peer architecture. This symmetry is key to
the simplicity and resiliency of the architecture. There is no single metadata server,
lock manager or gateway node.
Because OneFS must access blocks from several devices simultaneously, the
addressing scheme used for data and metadata is indexed at the physical-level by a
tuple of {node, drive, offset}. For example, if 12345 were the address of a block that lived on disk 2 of node 3, it would read {3,2,12345}. All metadata within
the cluster is multiply mirrored for data protection, at least to the level of redundancy
of the associated file. For example, if a file were at an erasure-code protection of
+2n, implying the file could withstand two simultaneous failures, then all metadata
needed to access that file would be 3x mirrored, so it too could withstand two
failures. The file system inherently allows for any structure to use any and all blocks
on any nodes in the cluster.
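As a small illustration, the following Python sketch models the {node, drive, offset} tuple and the rule that metadata is mirrored at least one level beyond the failures the associated file can tolerate. The type and function names are illustrative, not OneFS internals.

```python
from typing import NamedTuple

class BlockAddress(NamedTuple):
    node: int    # node number in the cluster
    drive: int   # drive within that node
    offset: int  # block offset on that drive

def metadata_mirror_count(n_failures_tolerated: int) -> int:
    """Metadata is mirrored at least to the redundancy level of the file it
    describes: surviving N failures requires N + 1 copies."""
    return n_failures_tolerated + 1

# The example from the text: block 12345 on disk 2 of node 3.
addr = BlockAddress(node=3, drive=2, offset=12345)
print(tuple(addr))                 # (3, 2, 12345)
print(metadata_mirror_count(2))    # a +2n protected file needs 3x mirrored metadata
```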
Other storage systems send data through RAID and volume management layers,
introducing inefficiencies in data layout and providing non-optimized block access.
Isilon OneFS controls the placement of files directly, down to the sector-level on any
drive anywhere in the cluster. This allows for optimized data placement and I/O
patterns and avoids unnecessary read-modify-write operations. By laying data on
disks in a file-by-file manner, OneFS is able to flexibly control the type of striping as
well as the redundancy level of the storage system at the system, directory, and even
file-levels. Traditional storage systems would require that an entire RAID volume be
dedicated to a particular performance type and protection setting. For example, a set
of disks might be arranged in a RAID 1+0 protection for a database. This makes it difficult to optimize spindle use over the entire storage estate, since idle spindles in one volume cannot be used by another.
File writes
The OneFS software runs on all nodes equally - creating a single file system that runs
across every node. No one node controls or masters the cluster; all nodes are true
peers.
If we were to look at all the components within every node of a cluster that are
involved in I/O from a high level, it would look like Figure 6 above. We have split the
stack into a top layer, called the Initiator, and a bottom layer, called the
Participant. This division is used as a logical model for the analysis of any one given
read or write. At a physical-level, CPUs and RAM cache in the nodes are
simultaneously handling Initiator and Participant tasks for I/O taking place throughout
the cluster. There are caches and a distributed lock manager that are excluded from
the diagram above to keep it simple. They will be covered in later sections of the
paper.
When a client connects to a node to write a file, it is connecting to the top half or
Initiator of that node. Files are broken into smaller logical chunks called stripes before
being written to the bottom half or Participant of a node (disk). Failure-safe buffering
using a write coalescer is used to ensure that writes are efficient and read-modify-
write operations are avoided. The size of each file chunk is referred to as the stripe
unit size.
OneFS uses the InfiniBand back-end network to allocate and stripe data across all
nodes in the cluster automatically, so no additional processing is required. As data is
being written, it is being protected at the specified level.
When writes take place, OneFS divides data out into atomic units called protection
groups. Redundancy is built into protection groups, such that if every protection group is safe, then the entire file is safe.
Every node that owns blocks in a particular write is involved in a two-phase commit.
The mechanism relies on NVRAM for journaling all the transactions that are occurring
across every node in the storage cluster. Using multiple NVRAMs in parallel allows for
high-throughput writes while maintaining data safety against all manner of failures,
including power failures. In the event that a node should fail mid-transaction, the
transaction is restarted instantly without that node involved. When the node returns,
the only required actions are for the node to replay its journal from NVRAM (which takes seconds or minutes) and, occasionally, for AutoBalance to rebalance files that
were involved in the transaction. No expensive fsck or disk-check processes are
ever required. No drawn-out resynchronization ever needs to take place. Writes are
never blocked due to a failure. The patented transaction system is one of the ways
that OneFS eliminates single, and even multiple, points of failure.
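The following Python sketch illustrates the general two-phase commit pattern described above, with each participant journaling to a stand-in for its NVRAM before the initiator commits. It is a simplified illustration of the pattern, not the patented OneFS transaction system; all names and record formats are invented for the example.

```python
class Participant:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.journal = []          # stands in for the battery-backed NVRAM journal

    def prepare(self, txn_id: str, blocks: list) -> bool:
        """Phase 1: journal the pending blocks so they survive a power failure."""
        self.journal.append(("prepare", txn_id, blocks))
        return True

    def commit(self, txn_id: str) -> None:
        """Phase 2: mark the transaction committed; blocks flush to disk later."""
        self.journal.append(("commit", txn_id))

    def abort(self, txn_id: str) -> None:
        self.journal.append(("abort", txn_id))

def two_phase_commit(participants: list, txn_id: str, layout: dict) -> bool:
    """The initiator ("captain") drives the transaction for the nodes owning blocks."""
    prepared = []
    for p in participants:
        if p.prepare(txn_id, layout.get(p.node_id, [])):
            prepared.append(p)
        else:
            for q in prepared:     # any failure rolls the prepared nodes back
                q.abort(txn_id)
            return False
    for p in prepared:
        p.commit(txn_id)
    return True

nodes = [Participant(i) for i in range(1, 4)]
ok = two_phase_commit(nodes, "txn-42", {1: ["b0"], 2: ["b1"], 3: ["parity"]})
print(ok)   # True: all three journals now hold the prepared and committed records
```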
In a write operation, the initiator captains or orchestrates the layout of data and
metadata; the creation of erasure codes; and the normal operations of lock
management and permissions control. An administrator can, at any point, use the web administration interface or the CLI to optimize the layout decisions made by OneFS to better suit the workflow. The administrator can choose from the following access patterns at a per-file or per-directory level:
Concurrency: Optimizes for current load on the cluster, featuring many
simultaneous clients. This setting provides the best behavior for mixed workloads.
Streaming: Optimizes for high-speed streaming of a single file, for example to
enable very fast reading with a single client.
OneFS Caching
The OneFS caching infrastructure design is predicated on aggregating the cache present on
each node in a cluster into one globally accessible pool of memory. To do this, Isilon uses an
efficient messaging system, similar to non-uniform memory access (NUMA). This allows all the
nodes' memory cache to be available to each and every node in the cluster. Remote memory is
accessed over an internal interconnect, and has much lower latency than accessing hard disk
drives.
For remote memory access, OneFS utilizes the Sockets Direct Protocol (SDP) over an InfiniBand (IB) back-end interconnect on the cluster, which is essentially a distributed system bus. SDP
provides an efficient, socket-like interface between nodes which, by using a switched star
topology, ensures that remote memory addresses are only ever one IB hop away. While not as
fast as local memory, remote memory access is still very fast due to the low latency of IB.
The OneFS caching subsystem is coherent across the cluster. This means that if the same
content exists in the private caches of multiple nodes, this cached data is consistent across all
instances. OneFS utilizes the MESI Protocol to maintain cache coherency. This protocol
implements an invalidate-on-write policy to ensure that all data is consistent across the entire
shared cache.
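A toy Python model of invalidate-on-write coherency may help illustrate the behavior: when one node writes a block, every other node's cached copy of that block is dropped, so a subsequent read anywhere in the cluster re-fetches current data. The class names and structure here are assumptions for illustration only, not the OneFS implementation.

```python
class NodeCache:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.blocks = {}

class CoherentCache:
    def __init__(self, node_count: int):
        self.nodes = [NodeCache(i) for i in range(node_count)]

    def read(self, node_id: int, block: int, backing: dict) -> bytes:
        cache = self.nodes[node_id].blocks
        if block not in cache:                 # miss: fill from the backing store
            cache[block] = backing[block]
        return cache[block]

    def write(self, node_id: int, block: int, data: bytes, backing: dict) -> None:
        backing[block] = data
        self.nodes[node_id].blocks[block] = data
        for other in self.nodes:               # invalidate-on-write everywhere else
            if other.node_id != node_id:
                other.blocks.pop(block, None)

disk = {7: b"old"}
cluster = CoherentCache(node_count=3)
cluster.read(0, 7, disk)
cluster.write(1, 7, b"new", disk)
print(cluster.read(0, 7, disk))   # b'new' -- node 0 re-reads, never sees stale data
```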
OneFS uses up to three levels of read cache, plus an NVRAM-backed write cache, or coalescer.
These, and their high-level interaction, are illustrated in the following diagram.
The first two types of read cache, level 1 (L1) and level 2 (L2), are memory (RAM) based, and
analogous to the cache used in processors (CPUs). These two cache layers are present in all
Isilon storage nodes.
OneFS dictates that a file is written across multiple nodes in the cluster, and possibly multiple
drives within a node, so all read requests involve reading remote (and possibly local) data.
When a read request arrives from a client, OneFS determines whether the requested data is in
local cache. Any data resident in local cache is read immediately. If data requested is not in
local cache, it is read from disk. For data not on the local node, a request is made from the
remote nodes on which it resides. On each of the other nodes, another cache lookup is
performed. Any data in the cache is returned immediately, and any data not in the cache is
retrieved from disk.
When the data has been retrieved from local and remote cache (and possibly disk), it is returned to the client.
The high-level steps for fulfilling a read request on both a local and remote node are as follows (a code sketch follows these steps):
1. Determine whether part of the requested data is in the local L1 cache. If so, return to client.
2. If not in the local cache, request data from the remote node(s).
On remote nodes:
1. Determine whether requested data is in the local L2 or L3 cache. If so, return to the
requesting node.
2. If not in the local cache, read from disk and return to the requesting node.
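These steps can be summarized in a short Python sketch; the function and class names are illustrative, and the block-to-node mapping is deliberately simplistic.

```python
def read_block(block: int, l1: dict, remote_nodes: list) -> bytes:
    """Initiator-side lookup: local L1 first, then the remote node owning the block."""
    if block in l1:
        return l1[block]                              # 1. local L1 hit
    owner = remote_nodes[block % len(remote_nodes)]   # 2. ask the owning node
    data = owner.read(block)
    l1[block] = data                                  # populate L1 for future reads
    return data

class RemoteNode:
    def __init__(self, l2: dict, disk: dict):
        self.l2 = l2
        self.disk = disk

    def read(self, block: int) -> bytes:
        if block in self.l2:                          # remote L2/L3 hit
            return self.l2[block]
        data = self.disk[block]                       # otherwise fall back to disk
        self.l2[block] = data
        return data

nodes = [RemoteNode(l2={}, disk={0: b"a", 2: b"c"}), RemoteNode(l2={1: b"b"}, disk={})]
l1_cache = {}
print(read_block(1, l1_cache, nodes))   # b'b' -- served from the remote node's L2
print(read_block(1, l1_cache, nodes))   # b'b' -- now a local L1 hit
```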
File reads
In an Isilon cluster, data, metadata and inodes are all distributed on multiple nodes,
and even across multiple drives within nodes. When reading or writing to the cluster,
the node a client attaches to acts as the captain for the operation.
In a read operation, the captain node gathers all of the data from the various nodes
in the cluster and presents it in a cohesive way to the requestor.
Due to the use of cost-optimized industry standard hardware, the Isilon cluster
provides a high ratio of cache to disk (multiple GB per node) that is dynamically
allocated for read and write operations as needed. This RAM-based cache is unified
and coherent across all nodes in the cluster, allowing a client read request on one
node to benefit from I/O already transacted on another node. These cached blocks
can be quickly accessed from any node across the low-latency InfiniBand backplane,
allowing for a large, efficient RAM cache, which greatly accelerates read performance.
As the cluster grows larger, the cache benefit increases. For this reason, the amount
of I/O to disk on an Isilon cluster is generally substantially lower than it is on
traditional platforms, allowing for reduced latencies and a better user experience.
For files marked with an access pattern of concurrent or streaming, OneFS can take
advantage of pre-fetching of data based on heuristics used by the Isilon SmartRead
component. SmartRead can create a data pipeline from L2 cache, prefetching into a
local L1 cache on the captain node. This greatly improves sequential-read
performance across all protocols, and means that reads come directly from RAM.
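A minimal sketch of the read-ahead idea follows, assuming a fixed, hypothetical prefetch window rather than SmartRead's actual heuristics: blocks ahead of the current read are pulled into the local L1 cache so subsequent sequential reads are served from RAM.

```python
PREFETCH_WINDOW = 8   # hypothetical number of blocks to read ahead

def read_with_prefetch(block: int, l1: dict, fetch_remote) -> bytes:
    if block not in l1:
        l1[block] = fetch_remote(block)
    # Warm the cache for the blocks a streaming reader is expected to want next.
    for ahead in range(block + 1, block + 1 + PREFETCH_WINDOW):
        if ahead not in l1:
            l1[ahead] = fetch_remote(ahead)
    return l1[block]

l1 = {}
data = read_with_prefetch(0, l1, fetch_remote=lambda b: f"block-{b}".encode())
print(data, len(l1))   # b'block-0' 9 -- block 0 plus eight prefetched blocks
```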
Multi-threaded IO
With the growing use of large NFS datastores for server virtualization and enterprise
application support comes the need for high throughput and low latency to large files.
To accommodate this, OneFS Multi-writer supports multiple threads concurrently
writing to individual files.
In the above example, concurrent write access to a large file can become limited by
the exclusive locking mechanism, applied at the whole file level. In order to avoid this
potential bottleneck, OneFS Multi-writer provides more granular write locking by sub-dividing the file into separate regions and granting exclusive write locks to individual
regions, as opposed to the entire file. As such, multiple clients can simultaneously
write to different portions of the same file.
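The region-locking concept can be sketched as follows; the region size, class name, and the use of one lock per fixed-size region are illustrative assumptions, not the Multi-writer implementation.

```python
import threading

class RegionLockedFile:
    def __init__(self, size: int, region_size: int = 1 << 20):
        self.region_size = region_size
        n_regions = (size + region_size - 1) // region_size
        self.locks = [threading.Lock() for _ in range(n_regions)]
        self.data = bytearray(size)

    def write(self, offset: int, buf: bytes) -> None:
        first = offset // self.region_size
        last = (offset + len(buf) - 1) // self.region_size
        for r in range(first, last + 1):      # lock only the regions touched
            self.locks[r].acquire()
        try:
            self.data[offset:offset + len(buf)] = buf
        finally:
            for r in range(first, last + 1):
                self.locks[r].release()

f = RegionLockedFile(size=4 << 20)
# Two writers touching different regions can proceed concurrently.
t1 = threading.Thread(target=f.write, args=(0, b"x" * 100))
t2 = threading.Thread(target=f.write, args=(2 << 20, b"y" * 100))
t1.start(); t2.start(); t1.join(); t2.join()
print(bytes(f.data[:3]), bytes(f.data[2 << 20:(2 << 20) + 3]))   # b'xxx' b'yyy'
```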
Data protection
Power loss
A file system journal, which stores information about changes to the file system, is
designed to enable fast, consistent recoveries after system failures or crashes, such
as power loss. The file system replays the journal entries after a node or cluster
recovers from a power loss or other outage. Without a journal, a file system would
need to examine and review every potential change individually after a failure (an
fsck or chkdsk operation); in a large file system, this operation can take a long
time.
OneFS is a journaled file system in which each node contains a battery-backed
NVRAM card used for protecting uncommitted writes to the file system. The NVRAM
card battery charge lasts many days without requiring a recharge. When a node boots
up, it checks its journal and selectively replays transactions to disk where the
journaling system deems it necessary.
OneFS will mount only if it can guarantee that all transactions not already in the
system have been recorded. For example, if proper shutdown procedures were not
followed, and the NVRAM battery discharged, transactions might have been lost; to
prevent any potential problems, the node will not mount the file system.
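A simplified sketch of this mount-time behavior follows; the journal record format and function names are invented for illustration and are not the OneFS journaling system.

```python
def mount(journal: list, journal_valid: bool, disk: dict) -> bool:
    """Replay uncommitted journal entries; refuse to mount if the journal is lost."""
    if not journal_valid:
        # e.g. NVRAM battery discharged after an unclean shutdown: transactions
        # may be missing, so refuse to mount rather than risk inconsistency.
        return False
    for record in journal:
        if not record["committed_to_disk"]:
            disk[record["block"]] = record["data"]    # selectively replay
            record["committed_to_disk"] = True
    return True

disk = {}
journal = [{"block": 10, "data": b"pending", "committed_to_disk": False}]
print(mount(journal, journal_valid=True, disk=disk), disk)   # True {10: b'pending'}
print(mount([], journal_valid=False, disk=disk))             # False: refuse to mount
```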
Scalable rebuild
OneFS does not rely on hardware RAID either for data allocation or for reconstruction of data after failures. Instead, OneFS manages protection of file data directly, and when a failure occurs, it rebuilds data in a parallelized fashion. OneFS is able to determine which files are affected by a failure in constant time, by reading inode data in a linear manner, directly off disk. The set of affected files is assigned to a set of worker threads that are distributed among the cluster nodes by the job engine. The workers repair the files in parallel. This implies that as cluster size increases, the time taken to rebuild from a failure decreases, because more nodes participate in the repair.
Smaller clusters can be protected with +1n protection, but this implies that while a
single drive or node could be recovered, two drives in two different nodes could not.
Drive failures are orders of magnitude more likely than node failures. For clusters with
large drives, it is desirable to provide protection for multiple drive failures, though
single-node recoverability is acceptable.
To provide for a situation where we wish to have double-disk redundancy and single-node redundancy, we can build up double or triple width protection groups.
These double or triple width protection groups will wrap once or twice over the
same set of nodes, as they are laid out. Since each protection group contains exactly
two disks' worth of redundancy, this mechanism will allow a cluster to sustain either a
two or three drive failure or a full node failure, without any data unavailability.
Most important for small clusters, this method of striping is highly efficient, with an on-disk efficiency of 1 - M/(N+M). For example, on a cluster of five nodes with double-failure protection, were we to use N=3, M=2, we would obtain a 3+2 protection group with an efficiency of 1 - 2/5, or 60%. Using the same 5-node cluster but with each protection group laid out over 2 stripes, N would now be 8 and M=2, so we could obtain 1 - 2/(8+2), or 80%, efficiency on disk, retaining our double-drive failure protection and sacrificing only double-node failure protection.
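The arithmetic can be checked with a few lines of Python; on-disk efficiency 1 - M/(N+M) is equivalently N/(N+M), where N data stripe units and M protection units make up each protection group.

```python
def efficiency(n_data: int, m_protection: int) -> float:
    """On-disk efficiency of an N+M protection group."""
    return n_data / (n_data + m_protection)

# Five-node cluster, double-failure protection, single stripe: 3 + 2.
print(f"{efficiency(3, 2):.0%}")   # 60%

# Same five nodes, protection group wrapped over two stripes: 8 + 2.
print(f"{efficiency(8, 2):.0%}")   # 80%
```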
Protection Level    Description
+1n                 Tolerate failure of 1 drive OR 1 node
OneFS enables an administrator to modify the protection policy in real time, while
clients are attached and are reading and writing data. Note that increasing a cluster's
protection level may increase the amount of space consumed by the data on the
cluster.
Note: OneFS 7.2 introduces under-protection alerting for new installations of OneFS
7.2. If the cluster is under-protected, the cluster event logging system (CELOG) will
generate alerts, warning the cluster administrator of the protection deficiency and
recommending a change to the appropriate protection level for that particular
cluster's configuration.
Automatic partitioning
Data tiering and management in OneFS is handled by the SmartPools framework.
From a data protection and layout efficiency point of view, SmartPools facilitates the
subdivision of large numbers of high-capacity, homogeneous nodes into smaller, more
Mean Time to Data Loss (MTTDL)-friendly disk pools. For example, an 80-node
nearline (NL) cluster would typically run at a +4n protection level. However,
partitioning it into four twenty-node disk pools would allow each pool to run at
+2d:1n, thereby lowering the protection overhead and improving space utilization,
without any net increase in management overhead.
In keeping with the goal of storage management simplicity, OneFS will automatically
calculate and partition the cluster into pools of disks, or node pools, which are
optimized for both MTTDL and efficient space utilization. This means that protection
level decisions, such as the 80-node cluster example above, are not left to the
customer - unless desired.
With Automatic Provisioning, every set of equivalent node hardware is automatically divided into node pools comprising up to forty nodes and six drives per node.
Certain similar, but non-identical, node types can be provisioned to an existing node pool either
by node equivalence or compatibility. OneFS requires that a node pool contain a minimum of three nodes.
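A rough sketch of this partitioning behavior follows, assuming a simple grouping by equivalence class; the real provisioning logic also accounts for drive layout, compatibility, and pool balancing, so the code below is illustrative only.

```python
from collections import defaultdict

MAX_POOL_NODES = 40   # node pools comprise up to forty nodes
MIN_POOL_NODES = 3    # a node pool requires at least three nodes

def provision(nodes: list) -> list:
    """Group equivalent nodes, then split each group into pools of at most forty."""
    by_class = defaultdict(list)
    for node in nodes:
        by_class[node["equivalence_class"]].append(node)

    pools = []
    for members in by_class.values():
        for i in range(0, len(members), MAX_POOL_NODES):
            pool = members[i:i + MAX_POOL_NODES]
            if len(pool) >= MIN_POOL_NODES:       # too few nodes: no pool yet
                pools.append(pool)
    return pools

nodes = [{"id": i, "equivalence_class": "X400"} for i in range(50)]
print([len(p) for p in provision(nodes)])   # [40, 10] -- 50 equivalent nodes, two pools
```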
Prior to OneFS 7.2, only identical, equivalence-class nodes with identical drive and memory
configurations could co-exist within the same node pool. With OneFS 7.2, the equivalent
memory requirement has been relaxed slightly to allow S210 or X410 nodes with slightly larger
memory footprints to coexist in a pool with previous generation S200 or X400 nodes.
Figure 16: Node Compatibilities in OneFS 7.2 for S210 and X410 Nodes.
Supported protocols
Clients with adequate credentials and privileges can create, modify, and read data
using one of the standard supported methods for communicating with the cluster:
NFS (Network File System)
SMB/CIFS (Server Message Block/Common Internet File System)
FTP (File Transfer Protocol)
HTTP (Hypertext Transfer Protocol)
HDFS (Hadoop Distributed File System)
REST API (Representational State Transfer Application Programming Interface)
OpenStack Swift (Object Storage API)
By default, only the SMB/CIFS and NFS protocols are enabled in the Isilon cluster.
The file system root for all data in the cluster is /ifs (the Isilon OneFS file system).
This is presented via the SMB/CIFS protocol as an ifs share (\\<cluster_name>\ifs), and
via the NFS protocol as an /ifs export (<cluster_name>:/ifs).
Note: Data is common between all protocols, so changes made to file content via one
access protocol are instantly viewable from all others.
Interfaces
Administrators can use multiple interfaces to administer an Isilon storage cluster in
their environments:
Web Administration User Interface (WebUI)
Command Line Interface via SSH network access or RS232 serial connection
LCD Panel on the nodes themselves for simple add/remove functions
RESTful Platform API for programmatic control and automation of cluster
configuration and management.
Active Directory
Active Directory, a Microsoft implementation of LDAP, is a directory service that can
store information about the network resources. While Active Directory can serve
many functions, the primary reason for joining the cluster to the domain is to perform
user and group authentication.
You can configure and manage a cluster's Active Directory settings from the Web
Administration interface or the command-line interface; however, it is recommended
that you use Web Administration whenever possible.
Each node in the cluster shares the same Active Directory machine account, making it
very easy to administer and manage.
LDAP
The Lightweight Directory Access Protocol (LDAP) is a networking protocol used for
defining, querying, and modifying services and resources. A primary advantage of
LDAP is the open nature of the directory services and the ability to use LDAP across
many platforms. The Isilon clustered storage system can use LDAP to authenticate
users and groups in order to grant them access to the cluster.
NIS
The Network Information Service (NIS), designed by Sun Microsystems, is a directory
services protocol that can be used by the Isilon system to authenticate users and
groups when accessing the cluster. NIS, sometimes referred to as Yellow Pages (YP),
is different from NIS+, which the Isilon cluster does not support.
Local users
The Isilon clustered storage system supports local user and group authentication. You
can create local user and group accounts directly on the cluster, using the WebUI
interface. Local authentication can be useful when directory services (Active Directory, LDAP, or NIS) are not used, or when a specific user or application needs to
access the cluster.
OneFS Auditing
OneFS provides the ability to audit system configuration and NFS and SMB protocol
activity on an Isilon cluster. This allows organizations to satisfy various data
governance and regulatory compliance mandates that they may be bound to.
All audit data is stored and protected within the cluster file system, and is organized
by audit topic. From here, audit data can be exported to third party applications, like
Varonis DatAdvantage, via the EMC Common Event Enabler (CEE) framework. OneFS
Protocol auditing can be enabled per Access Zone, allowing granular control across
the cluster.
Simultaneous upgrade
A simultaneous upgrade installs the new operating system and reboots all nodes in
the cluster at the same time. A simultaneous upgrade requires a temporary, sub-two-minute interruption of service during the upgrade process while the nodes are
restarted.
Rolling upgrade
A rolling upgrade individually upgrades and restarts each node in the cluster
sequentially. During a rolling upgrade, the cluster remains online and continues
serving data to clients with no interruption in service. A rolling upgrade can only be
performed within a OneFS code version family and not between OneFS major code
version revisions.
SmartQuotas (Data Management): Assign and manage quotas that seamlessly partition and thin provision storage into easily managed segments.
Isilon for vCenter (Data Management): Manage Isilon functions from vCenter.
SmartDedupe (Data Deduplication): Maximize storage efficiency by scanning the cluster for identical blocks and then eliminating the duplicates, decreasing the amount of physical storage required.
Please refer to product documentation for details on all of the above Isilon software
products.
Conclusion
With OneFS, organizations and administrators can scale from 18 TB to more than 50
PB within a single file system, single volume, with a single point of administration.
OneFS delivers high-performance, high-throughput, or both, without adding
management complexity.
Next-generation data centers must be built for sustainable scalability. They will
harness the power of automation, leverage the commoditization of hardware, ensure
the full consumption of the network fabric, and provide maximum flexibility for
organizations intent on satisfying an ever-changing set of requirements.
OneFS is the next-generation file system designed to meet these challenges. OneFS
provides:
Fully distributed single file system
High-performance, fully symmetric cluster