TECHNICAL REPORT
4th Revision
NetApp, a pioneer and industry leader in data storage technology, helps organizations understand and
meet complex technical challenges with advanced storage solutions and global data management
strategies.
Abstract
This guide introduces the NetApp deduplication for FAS technology and describes in detail how to
implement and utilize it.
It should prove useful for customers requiring assistance in understanding and architecting solutions
with deduplication for FAS and NetApp storage systems.
NetApp, Inc.
NetApp, Inc.
Technical Report: 16 April 2008
NetApp Deduplication for FAS TR-3505
Deployment and Implementation Guide 4th Revision
Table of Contents
1 Introduction
1.1 Intended Audience
1.2 Purpose
1.3 Prerequisites and Assumptions
1.4 Document Conventions
2 Overview
2.1 NetApp Deduplication Technologies
2.1.1 SnapVault for NetBackup™
2.1.2 NetApp Deduplication for FAS
2.2 Dense Volumes
2.3 Deduplication Features and Functions
2.3.1 General Deduplication Operational Considerations
3 Configuration and Operation
3.1 Requirements Overview
3.2 Installing and Licensing Deduplication
3.2.1 Deduplication Licensing in a Clustered Environment
3.3 Command Summary
3.4 Deduplication Quick Start Guide
3.5 Monitoring Deduplication Status
3.6 End-to-End Deduplication Configuration Example
3.7 Configuring Deduplication Schedules
4 Operating Characteristics
4.1 Deduplication Target Environment
4.2 Deduplication Performance
4.3 Deduplication Storage Savings
4.4 Additional Deduplication Considerations
4.4.1 Number of Deduplication Processes
4.4.2 Deduplication and Active/Active Configuration
4.4.3 Deduplication and Space Savings on Existing Data
4.4.4 Deduplication Best Practices
1 Introduction
1.2 Purpose
This paper is a guide to implementing NetApp deduplication for FAS. It provides step-by-step
configuration examples, introduces known caveats and recommendations to assist the reader in
designing optimal solutions, and prepares the audience for deploying the technology in customer
environments.
Its use is threefold:
Provide detailed information to all interested parties.
Educate prior to performing deployments.
Serve as a reference for resolving issues that could arise.
This document is not:
A sales guide (although some high-level thoughts are covered in the “Solutions Overview”
section)
A competitive comparison
A complete product design document
2 Overview
This section provides a quick overview of deduplication in general and then introduces what NetApp
deduplication for FAS is and how it works at a high level.
While all these technologies offer the benefit of reducing the amount of required storage, in the
marketplace they are often not considered “deduplication” technologies when compared to solutions
offered by other vendors. That sentiment, while not entirely accurate, is understood, and NetApp
continues to expand its portfolio with several technologies for further deduplication of data. The
following subsections cover two of the solutions that are available as of the writing of this paper;
additional deduplication technologies are coming in both the short term and the more distant future.
Before delving into technical solutions, it makes sense to understand the value of deduplication to
customers. The primary advantage of data deduplication is that it conserves physical disk space
when storing data on disk. The average UNIX® or Windows® disk volume contains thousands of
duplicate data strings. Traditionally, when copies of these volumes are created, every duplicate data
string is also copied, resulting in an inefficient use of secondary storage. Deduplication helps to
remove this inefficiency and yields a more effective cost per gigabyte in the data center.
To keep track of the many indirect blocks (“IND” in Figure 2) that are pointing to it, each data block
has a reference count kept in the volume metadata. As additional indirect blocks point to it or
existing ones stop pointing to it, this value is incremented or decremented accordingly. When no
indirect blocks point to a data block, it is released.
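The reference counting described above can be illustrated with a minimal sketch. The class and method names (BlockStore, add_pointer, remove_pointer) are hypothetical and do not reflect how Data ONTAP stores its metadata; the sketch only shows the increment, decrement, and release behavior.

```python
class BlockStore:
    """Toy model of shared data blocks with a per-block reference count."""

    def __init__(self):
        # block id -> number of indirect blocks currently pointing at it
        self.refcount = {}

    def add_pointer(self, block_id):
        # A new indirect block points at the data block: increment.
        self.refcount[block_id] = self.refcount.get(block_id, 0) + 1

    def remove_pointer(self, block_id):
        # An indirect block stops pointing at the data block: decrement.
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            # No indirect blocks point at it any more: release the block.
            del self.refcount[block_id]


store = BlockStore()
store.add_pointer("blk7")
store.add_pointer("blk7")      # a second indirect block shares blk7
store.remove_pointer("blk7")
still_allocated = "blk7" in store.refcount   # one pointer remains
store.remove_pointer("blk7")
released = "blk7" not in store.refcount      # last pointer gone
```

A shared block therefore stays allocated until the last indirect block that references it goes away.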
Deduplication uses dense volume technology to allow duplicate blocks anywhere in the flexible
volume to be deleted.
Essentially, deduplication only stores unique blocks in the flexible volume and creates a small amount
of additional metadata in the process. Notable features of NetApp deduplication for FAS include:
Works with a high degree of granularity, at the block level.
Operates on the active file system of the flexible volume. Snapshot copies created after
running deduplication enjoy the same storage savings benefits.
Is a background process that can be configured to run automatically, run on a schedule, or be
started manually through the command-line interface.
Is application transparent and therefore can be used for deduplication of data originating from
anywhere in the data center.
Is enabled and managed using a simple command-line interface.
Can be enabled on flexible volumes that already contain data, and can deduplicate those existing blocks.
The remainder of this document goes into great detail on the operation of deduplication, but in
general the following occurs:
Newly saved data on the NearStore is stored in blocks as usual by Data ONTAP. Each block
of data has a digital fingerprint, which is compared to all other fingerprints in the flexible
volume. If two fingerprints are found to be the same, a byte-for-byte comparison is done of all
bytes in the block, and, if there is an exact match between the new block and the existing
block on the flexible volume, the duplicate block is discarded and its disk space is reclaimed.
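The fingerprint-then-verify flow described above might be sketched as follows. This is an illustration only: the 4096-byte block size and the use of SHA-256 as the fingerprint are assumptions made for the example, not the fingerprint Data ONTAP actually uses, and the sketch handles blocks in a single pass rather than as a background process.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def deduplicate(blocks):
    """Return (unique_blocks, pointers); pointers[i] indexes unique_blocks."""
    by_fingerprint = {}   # fingerprint -> indices into unique_blocks
    unique_blocks = []
    pointers = []
    for block in blocks:
        # Stand-in fingerprint (assumption), used only to find candidates.
        fp = hashlib.sha256(block).digest()
        match = None
        for idx in by_fingerprint.get(fp, []):
            # Fingerprints match: confirm with a byte-for-byte comparison.
            if unique_blocks[idx] == block:
                match = idx
                break
        if match is None:
            # No identical block stored yet: keep this one.
            unique_blocks.append(block)
            match = len(unique_blocks) - 1
            by_fingerprint.setdefault(fp, []).append(match)
        # Duplicates are not stored again; they only get a pointer.
        pointers.append(match)
    return unique_blocks, pointers


data = [b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"A" * BLOCK_SIZE]
unique, ptrs = deduplicate(data)
```

The byte-for-byte check is what guards against two different blocks that happen to share a fingerprint.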
sis status [-l] <vol>   Returns the current status of deduplication for the specified
                        flexible volume. The -l option causes a long listing to be
                        displayed.
df -s <vol>             Returns the value of deduplication space savings in the active
                        file system for the specified flexible volume.
sis check <vol>         Verifies and updates the fingerprint database for the specified
                        flexible volume and includes purging stale fingerprints.
sis stat <vol>          Displays the statistics of flexible volumes that have
                        deduplication enabled.
Create, Modify, Delete      Delete or modify the default deduplication schedule that was
Schedules (if not doing     configured when deduplication was first enabled on the
manually)                   flexible volume, or create the desired schedule.
                            sis config [-s sched] <vol>
Manually Run Deduplication  sis start <vol>
(if not using schedules)
Monitor Status of           sis status <vol>
Deduplication
Below, from the sis man page, you see the various State, Status, and Progress messages that can
be returned when running sis status. Note that if you don’t provide a flexible volume name, the
status for all flexible volumes that have deduplication enabled will be displayed.
toaster> sis status
Path State Status Progress
/vol/dvol_1 Enabled Idle Idle for 10:45:23
/vol/dvol_2 Enabled Pending Idle for 15:23:41
/vol/dvol_3 Disabled Idle Idle for 37:12:34
/vol/dvol_4 Enabled Active 25 GB Scanned
/vol/dvol_5 Enabled Active 25 MB Searched
/vol/dvol_6 Enabled Active 40 MB (20%) Done
/vol/dvol_7 Enabled Active 30 MB Verified
/vol/dvol_8 Enabled Active 10% Merged
The status shown for each flexible volume is interpreted as follows:
dvol_1 is Idle. The last deduplication operation on the flexible volume was finished 10:45:23
ago.
dvol_2 is Pending because of a resource limitation. The deduplication operation on the flexible
volume will become Active when the resource is available.
dvol_3 is Idle because the deduplication operation is disabled on the flexible volume.
dvol_4 is Active. The deduplication operation is scanning the whole flexible volume
(initiated with “sis start -s”). So far, it has scanned 25GB of data.
dvol_5 is Active. The operation is searching for duplicate data, and 25MB of data has already
been searched.
dvol_6 is also Active. The operation has saved 40MB of data. This is 20% of the total
duplicate data found in the searching stage.
dvol_7 is Active. It is verifying the metadata of processed data blocks. This process will
remove unused metadata.
dvol_8 is Active. Verified metadata are being merged. This process will merge together all
verified metadata of processed data blocks to an internal format that supports fast sis
operation.
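For scripted monitoring, the tabular sis status output can be split into fields. The sketch below is a hypothetical helper, not a NetApp tool; it assumes the four-column layout shown in the sample above, with everything after the third column treated as the Progress text.

```python
SAMPLE = """\
Path State Status Progress
/vol/dvol_1 Enabled Idle Idle for 10:45:23
/vol/dvol_4 Enabled Active 25 GB Scanned
/vol/dvol_6 Enabled Active 40 MB (20%) Done
"""


def parse_sis_status(text):
    """Turn tabular 'sis status' output into a list of dictionaries."""
    rows = []
    for line in text.splitlines()[1:]:      # skip the header line
        if not line.strip():
            continue
        # Progress may contain spaces, so split at most three times.
        path, state, status, progress = line.split(None, 3)
        rows.append({"path": path, "state": state,
                     "status": status, "progress": progress})
    return rows


rows = parse_sis_status(SAMPLE)
active = [r["path"] for r in rows if r["status"] == "Active"]
```

Here active lists the volumes on which a deduplication operation is currently running.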
The general flow of the phases deduplication goes through and the correlating sis status
messages when actively running on a flexible volume are shown in Figure 4.
For additional information, the -l option will display detailed status, as shown below.
toaster> sis status -l /vol/dvol_6
Path: /vol/dvol_6
State: Enabled
Status: Active
Progress: 41020 KB (20%) Done
Type: Regular
Schedule: sun-sat@0
Last Operation Begin: Thu Mar 24 13:30:00 PST 2005
Last Operation End: Fri Mar 25 00:34:16 PST 2005
Last Operation Size: 4732932 KB
Last Operation Error: -
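The long listing is a set of "name: value" lines, so it can be read into a dictionary and used, for example, to work out how long the last operation ran. The helper names below are hypothetical. The sketch assumes both timestamps use the same time zone and simply drops the zone token before parsing.

```python
from datetime import datetime

SAMPLE_LONG = """\
Path: /vol/dvol_6
State: Enabled
Status: Active
Progress: 41020 KB (20%) Done
Type: Regular
Schedule: sun-sat@0
Last Operation Begin: Thu Mar 24 13:30:00 PST 2005
Last Operation End: Fri Mar 25 00:34:16 PST 2005
Last Operation Size: 4732932 KB
Last Operation Error: -
"""


def parse_long_status(text):
    """Map each field name of 'sis status -l' output to its value."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only; timestamps contain colons too.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def _parse_time(stamp):
    # Drop the time-zone token (assumed identical for both timestamps).
    parts = stamp.split()
    del parts[4]
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


info = parse_long_status(SAMPLE_LONG)
duration = (_parse_time(info["Last Operation End"])
            - _parse_time(info["Last Operation Begin"]))
```

For the sample listing, duration comes to 11 hours, 4 minutes, and 16 seconds.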
1. Begin by creating a flexible volume (keeping in mind the maximum allowable volume size for the
platform, as specified in the requirements table at the beginning of this section).
r200-rtp01*> vol create VolPST aggr0 200g
Creation of volume 'VolPST' with size 200g on containing aggregate
'aggr0' has completed.
2. Now, as a best practice, we’ll disable scheduled Snapshot copies. An alternative to what’s shown
below would be to use the command “snap sched VolPST 0 0 0”.
r200-rtp01*> vol status VolPST
Volume State Status Options
VolPST online raid_dp, flex
Containing aggregate: 'aggr0'
r200-rtp01*> vol options VolPST nosnap true
r200-rtp01*> vol status VolPST
Volume State Status Options
VolPST online raid_dp, flex nosnap=on
Containing aggregate: 'aggr0'
3. Now we’ll enable deduplication on the flexible volume and verify that it’s turned on. The vol
status command will show a sis attribute for flexible volumes that have deduplication turned
on. (It can be a bit confusing, since sis is also indicated for those flexible volumes that have
been written to by SnapVault for NetBackup.)
Note that there needs to be space available in the flexible volume for the sis on command to
complete successfully. That is, if the sis on command were attempted on a flexible volume that
already had data and was completely full, it would fail (since there is no room to create the
required metadata).
Note that after turning deduplication on, Data ONTAP lets you know that if this were an existing
flexible volume that already contained data prior to deduplication being enabled, you would want
to run sis start -s; in this example it’s a brand-new flexible volume, so that’s not necessary.
4. Another way to verify that deduplication is enabled on the flexible volume is to just check the
output from running sis status on the flexible volume.
r200-rtp01*> sis status /vol/VolPST
Path State Status Progress
/vol/VolPST Enabled Idle Idle for 00:00:20
5. Next we’ll turn off the default deduplication schedule. Since in this example the administrators will
be moving large quantities of PST files in as time permits, we’ll want to let them run deduplication
manually at opportune times.
r200-rtp01*> sis config /vol/VolPST
Path Schedule
/vol/VolPST sun-sat@0
r200-rtp01*> sis config -s - /vol/VolPST
r200-rtp01*> sis config /vol/VolPST
Path Schedule
/vol/VolPST -
At this point, in our example, the administrator NFS-mounted the flexible volume to /testPSTs on a
Solaris™ host, sunv240-rtp01, and copied lots of PST files from their users’ directories into our
new PST archive directory flexible volume. The result from the host perspective is shown below.
(Obviously the same sort of thing could be accomplished by mapping a CIFS share to a Windows
host.)
root@sunv240-rtp01 # pwd
/testPSTs
root@sunv240-rtp01 # df -k .
Filesystem kbytes used avail capacity Mounted on
r200-rtp01:/vol/VolPST
167772160 33388384 134383776 20% /testPSTs
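The host-side figures can be checked against the 200g volume created in step 1. The short calculation below assumes a 20% Snapshot reserve on the volume; the reserve is not stated in this example and is inferred only because it makes the numbers agree.

```python
KB_PER_GB = 1024 * 1024

volume_kb = 200 * KB_PER_GB            # size given to "vol create"
snap_reserve_pct = 20                  # assumption: not stated in the example

# Space visible to the host once the Snapshot reserve is set aside.
usable_kb = volume_kb * (100 - snap_reserve_pct) // 100

used_kb = 33388384                     # "used" column from df -k
avail_kb = usable_kb - used_kb
capacity_pct = round(100 * used_kb / usable_kb)
```

Under that assumption, usable_kb, avail_kb, and capacity_pct reproduce the kbytes, avail, and capacity columns reported by df -k.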
The example continues with examining the flexible volume, running deduplication, and monitoring the
status.
6. Use df -s to examine the storage consumed and the space savings provided. Note that no space
savings have been achieved by simply copying data to the flexible volume even though
deduplication is turned on. What has happened is that all the blocks that have been written to this
flexible volume since deduplication was turned on have had their fingerprints written to the
change log file.
r200-rtp01*> df -s /vol/VolPST
Filesystem used saved %saved
/vol/VolPST/ 33388384 0 0%
7. Start deduplication running on the flexible volume. This causes the change log to be processed,
fingerprints to be sorted and merged, and duplicate blocks to be found.
r200-rtp01*> sis start /vol/VolPST
9. Once sis status indicates the flexible volume is once again in the Idle state, deduplication has
finished running, and we can now check the space savings it provided in the flexible volume.
r200-rtp01*> df -s /vol/VolPST
Filesystem used saved %saved
/vol/VolPST/ 24072140 9316052 28%
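The %saved column is consistent with the ratio of saved space to the space the data would occupy without deduplication (used plus saved). The helper below is a hypothetical sketch based on that reading of the output, not a documented formula from this guide.

```python
def percent_saved(used_kb, saved_kb):
    """Percentage of space saved, relative to used + saved."""
    total_kb = used_kb + saved_kb
    if total_kb == 0:
        return 0
    return round(100 * saved_kb / total_kb)


before = percent_saved(33388384, 0)          # df -s output from step 6
after = percent_saved(24072140, 9316052)     # df -s output from step 9
```

With the values from this example, the calculation yields 0 before deduplication runs and 28 afterward, matching the two df -s listings.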
Run with no arguments, sis config will return the schedules for all flexible volumes that have
deduplication enabled. The example below shows the four different formats the reported schedules
can have.
toaster> sis config
Path Schedule
/vol/dvol_1 -
/vol/dvol_2 23@sun-fri
/vol/dvol_3 auto
/vol/dvol_4 sat@6
When the -s option is specified, the command will set up or modify the schedule on the specified
flexible volume. The schedule parameter can be specified in one of four ways:
[day_list][@hour_list]
[hour_list][@day_list]
-
auto
The day_list specifies which days of the week deduplication should run. It is a comma-separated
list of the first three letters of the day: sun, mon, tue, wed, thu, fri, sat. The names are not case
sensitive. Day ranges such as mon-fri can also be given. The default day_list is sun-sat.
The hour_list specifies which hours of the day deduplication should run on each scheduled day.
The hour_list is a comma-separated list of the integers from 0 to 23. Hour ranges such as 8-17
are allowed.
Step values can be used in conjunction with ranges. For example, 0-23/2 means "every two hours."
The default hour_list is 0 (that is, midnight on the morning of each scheduled day).
If "-" is specified, there won't be a scheduled deduplication operation on the flexible volume.
The “auto” schedule causes deduplication to run on that flexible volume whenever there are 20% new
fingerprints in the change log. This check is done in a background process and occurs every minute.
When deduplication is enabled on a flexible volume the first time, an initial schedule is assigned to
the flexible volume. This initial schedule is sun-sat@0, which means "once every day at midnight."
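The schedule syntax described above can be made concrete with a small parser. This is a sketch of the documented format only, not the parser Data ONTAP uses; the function names are hypothetical, and ranges that wrap around the end of the week (such as fri-mon) are not handled.

```python
DAYS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"]


def _expand(spec, to_index):
    """Expand a comma-separated list of values, ranges, and range/step."""
    result = []
    for item in spec.split(","):
        rng, _, step = item.partition("/")
        first, _, last = rng.partition("-")
        start = to_index(first)
        end = to_index(last) if last else start
        stride = int(step) if step else 1
        result.extend(range(start, end + 1, stride))
    return sorted(set(result))


def parse_schedule(sched):
    """Return '-', 'auto', or a dict with the scheduled days and hours."""
    if sched in ("-", "auto"):
        return sched
    days, hours = list(range(7)), [0]       # defaults: sun-sat, hour 0
    for part in sched.split("@"):
        if not part:
            continue
        if part[0].isdigit():               # hour_list starts with a digit
            hours = _expand(part, int)
        else:                               # day_list, not case sensitive
            days = _expand(part.lower(), DAYS.index)
    return {"days": [DAYS[d] for d in days], "hours": hours}


nightly = parse_schedule("sun-sat@0")
late = parse_schedule("23@sun-fri")
every_two_hours = parse_schedule("0-23/2")
```

Because the day and hour lists can appear in either order, the sketch tells them apart by whether the part begins with a digit.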
To configure the schedules shown earlier in this section, the following commands would be issued:
toaster> sis config -s - /vol/dvol_1
toaster> sis config -s 23@sun-fri /vol/dvol_2
toaster> sis config -s auto /vol/dvol_3
toaster> sis config -s sat@6 /vol/dvol_4
4 Operating Characteristics
This section discusses where deduplication makes sense and the behavior that you can expect.
If there is very little new data, run deduplication infrequently, because it doesn't make sense
to unnecessarily consume CPU resources. How often you run it will depend on the change
rate of the data in the flexible volume.
The best options are:
Use the auto mode so that deduplication only runs when significant additional data
has been written to each particular flexible volume (this will tend to naturally spread
out when deduplication runs).
Stagger deduplication schedules for the flexible volumes so that it runs on alternate
days.
Run deduplication manually.
Run deduplication before creating Snapshot copies, as this will ensure no undeduplicated
data gets locked in Snapshot copies. If a Snapshot copy is created on a flexible volume
before deduplication has a chance to run/complete on that flexible volume, this could result in
lower space savings.
The Snapshot reserve should be greater than 0 if Snapshot copies are to be used. (An
exception to this might be in a SAN environment, where often it is set to zero for thin
provisioning of LUNs.)
There must be some free space in the flexible volume to allow deduplication to operate and
create the metadata it requires. As necessary, flexible volumes can be resized, with no
impact to data access, to accommodate this.
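As an illustration of the staggering recommendation above, the following sketch generates sis config commands that alternate two day lists across a set of flexible volumes. The volume names, the choice of day lists, and the function name are hypothetical; only the sis config -s syntax comes from this guide.

```python
# Two alternating day lists (an arbitrary choice for this example).
DAY_SETS = ["sun,tue,thu", "mon,wed,fri"]


def staggered_commands(volumes, hour=0):
    """Build one 'sis config -s' command per volume, alternating days."""
    commands = []
    for i, vol in enumerate(volumes):
        days = DAY_SETS[i % len(DAY_SETS)]
        commands.append(f"sis config -s {days}@{hour} {vol}")
    return commands


commands = staggered_commands(["/vol/dvol_1", "/vol/dvol_2", "/vol/dvol_3"])
```

Neighboring volumes in the list end up on different days, so their deduplication runs do not start at the same time.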
5.1 Licensing
Make sure deduplication is properly licensed and, if the platform is not an R200, make sure the
NearStore option is also properly licensed:
fas3070-rtp01*> license
…
a_sis <license>
nearstore_option <license>
…
Also note that there needs to be free space available in the flexible volume for the “sis on”
command to complete successfully. If a flexible volume is full, deduplication will not run. However, as
noted earlier, flexible volumes can be resized with no impact to data access to accommodate this.
Note that the undo option of the sis command is only available in the diag mode, accessed using
the command “priv set diag”.
Note that if sis undo starts processing and then runs out of space to undeduplicate, it will stop,
report a message about insufficient space, and leave the flexible volume dense. All data is still
accessible, but some block sharing is still occurring. Use “df -s” to understand how much free
space you really have and then either grow the flexible volume or delete data or Snapshot copies to
provide the needed free space.
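A rough pre-check for this situation might look like the sketch below. It assumes that undoing deduplication needs about as much free space as the "saved" value reported by df -s. That is a simplification made for illustration, not a rule stated in this guide, and the function name is hypothetical.

```python
def undo_shortfall_kb(avail_kb, saved_kb):
    """KB that would still need to be freed before an undo could fit.

    Assumes the undo needs roughly 'saved_kb' of free space; returns 0
    when the available space already covers that amount.
    """
    return max(0, saved_kb - avail_kb)


tight = undo_shortfall_kb(avail_kb=5_000_000, saved_kb=9_316_052)
roomy = undo_shortfall_kb(avail_kb=134_383_776, saved_kb=9_316_052)
```

A nonzero result suggests growing the flexible volume, or deleting data or Snapshot copies, before attempting the undo.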
© 2008 NetApp, Inc. All rights reserved. Specifications subject to change without notice. NetApp, the NetApp logo, Data ONTAP, FlexClone, FlexVol,
NearStore, SnapMirror, SnapVault, and WAFL are registered trademarks and NetApp and Snapshot are trademarks of NetApp, Inc. in the U.S. and
other countries. Solaris is a trademark of Sun Microsystems, Inc. Windows and Microsoft are registered trademarks of Microsoft Corporation. UNIX is
a registered trademark of The Open Group. NetBackup is a trademark of Symantec Corporation or its affiliates in the U.S. and other countries. All
other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such.