
http://webgnuru.com/linux/rsync_incremental.php
Rsync Date Stamped, Snapshot Style, Incremental Backups
Andrew J. Nelson
Published: 2 August 2011
Introduction
The objective is to create a series of daily, snapshot style, incremental backups, each with the following properties: it will be date stamped; it will reflect the changes to the file structure since the prior backup; and it will appear as a complete file structure. Additionally, the entire set of backups will occupy only slightly more disk space than a single complete backup, even though each daily backup gives the appearance of a full backup. Finally, the entire backup process will be automated.
Credit
The incremental snapshot backup method with rsync was first published by Mike Rubel. It was
an innovative concept, and he deserves all the credit for it. This document expands and refines
the methods originated by Mike Rubel.
Rsync And Backup Concepts
Terminology
When I was learning rsync myself, one of the more confusing aspects was that many online tutorials used inconsistent terminology. We will define keywords now and use them consistently throughout the document.
Source Directory
The directory that is being backed up; the original data being copied.
Target Directory
The directory to which data is backed up; the destination directory.
File Structure
A directory tree; a group of files and directories with a common root directory.
Rsync
"Rsync is a fast and extraordinarily versatile file copying tool." Rysnc can copy files either
locally (files stay on the same computer) or remotely (files are copied from one computer to
another computer via a network; the network can be a LAN or the internet). Rsync has a
tremendous number of options that allow the user to control every aspect of the copy process.
The beauty of rsync is that it uses an algorithm to compare the source directory against the target
directory, and only copies to the target what is different in the source. It doesn't even copy over

an entire file if only a part of it has changed; it copies only those bytes which are different. This
dramatically reduces your overhead for transferring files.
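If you want to watch this in action, rsync's --stats output reports how much literal versus matched data was sent. A minimal sketch follows; the paths are hypothetical, and note that for purely local copies rsync implies --whole-file, so --no-whole-file is needed to force the delta algorithm:

rsync -av --no-whole-file --stats /path/to/source/ /path/to/target/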
Incremental Backups
Assume a working directory called Alpha. On day one you create a full backup of Alpha called
Alpha_Full. On day two you would create an incremental backup called Alpha_2. This backup
only contains those files from the working directory Alpha which are different from the full
backup, Alpha_Full. Because only what has changed is copied over, it is much faster and uses far
less disk space than a full backup.
On day three, another backup is made named Alpha_3. This time, only what has changed in
Alpha compared to the prior day's incremental backup (Alpha_2) is copied over, instead of what
has changed in Alpha compared to the full backup (Alpha_Full). This is the key concept in
incremental backups: only what has changed since the prior day is backed up.
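As a rough sketch, the layout after day three might look like this (the /backup prefix is only an assumed location for illustration):

/backup/Alpha_Full   (day 1: complete copy of Alpha)
/backup/Alpha_2      (day 2: only the files that changed since Alpha_Full)
/backup/Alpha_3      (day 3: only the files that changed since Alpha_2)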
Incremental Vs. Differential Backups
Each day, a differential backup copies over what is different in the source directory as compared against the original full backup. As the amount of time since the full backup increases, the number of changes between the working source directory and the full backup correspondingly increases; therefore, the size of the daily backup grows.
Continuing from our prior example, if we were making differential backups, Alpha_10 would contain every change made to the source directory since the full backup was created. This tutorial utilizes the incremental approach instead of the differential, but it is best to understand both forms.
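For contrast only, a differential-style run can be sketched with rsync's --compare-dest option, which evaluates each day against the original full backup; the paths here are hypothetical and this option is not used again in this tutorial:

rsync -avh --compare-dest=/backup/Alpha_Full /Alpha/ /backup/Alpha_10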
Snapshot Backups, or Incremental Backups Appearing As Full File Structures
It is easiest for most people to utilize a backup if it looks like the full file structure, instead of just a few changed files sitting alone in a directory. Moreover, if the backup process accounts for deleted files, it is hard to tell that a file has been deleted without the context of the full file structure, and harder still to tell whether a file appears in the backup because it was deleted or because it was changed.
The most natural way to use a backup is for a user to see the backup as a snapshot: a picture of
the way the file structure looked at a certain point in time.
We accomplish this with rsync by hard linking all unchanged files into the incremental backup. The hard link creates the appearance of the unchanged file existing in the incremental backup; however, no data is duplicated: the entry is just another link to the same file. Thus, the appearance of a full daily backup is created without the overhead in time, network resources, or disk space.
The concept of hard links is integral to this process. The next section delves into hard links in detail, so that we may grok it fully before continuing.

Hard Links
This section will explain what a hard link is and walk through a couple of exercises to reinforce the concept. I know that I intellectually understood it, but didn't really get it (or grok it, if you will) until I had played around with some commands and looked at the results.
An inode is an index of a file's attributes, such as file permissions, owner, group, file size, number of hard links, and times of access, modification, and change. (This is not an exhaustive list.) Typically, an inode is associated with exactly one directory entry. However, it is possible to associate an inode with more than one directory entry by creating a hard link.
Indeed, a file name is not the file itself, but a hard link to the file. So, let's play with hard links for a bit. Create a new file:

touch foo.txt

We can now use the stat command to examine the properties of that file. Run:

stat foo.txt

The output will show the inode number and the number of links to the inode. Now, let's create a
hard link to foo.txt using the ln command:

ln foo.txt bar.txt

Now run stat on each filename. You will see that the output is exactly the same; they share the
same inode number and have an equal number of links. You can also use ls -i to view just the
inode number; foo.txt and bar.txt are the same thing.
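If you prefer to see just those fields side by side, GNU coreutils stat accepts a format string; the inode number shown below is made up and will differ on your system:

stat -c 'inode: %i  links: %h  name: %n' foo.txt bar.txt

inode: 524301  links: 2  name: foo.txt
inode: 524301  links: 2  name: bar.txt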
So what happens if you operate on them? If you make changes to one, the same changes are
made to the other. If you modify ownership or any other attribute of one, the same modifications
are made to the other. However, if you use rm on one, you do not delete both. The rm command
actually removes the hard link to the file; the other link will remain. The file itself will not be
deleted until the number of links to it reaches zero.
Go ahead and edit the file, change its permissions, etc. After each change, run stat on each file
until you are convinced in your soul that foo.txt and bar.txt are the same thing. Then use rm on
one, stat the remainder, then rm it.
We make use of this with rsync to create the illusion of a full copy of a file structure, when in fact all we have done is create hard links. All the data is accessible as if the full structure were there, without the hard disk and network overhead of doing a full copy. But I'm getting ahead of myself; more on that later.
Using Rsync

The Basics
The basic usage of rsync is very simple:

rsync [options] [source directory] [target directory]

This compares the source directory against the target directory, and copies over everything that is
different in the source to the target.
There is one point of syntax here that trips up a lot of new rsync users. Given a source directory bravo containing file widget.txt (breaking convention by not using foo and bar!), and a target directory charlie, then

rsync -a /bravo /charlie

is not the same thing as

rsync -a /bravo/ /charlie

The first command copies the directory bravo, and its contents, into charlie, resulting in
/charlie/bravo/widget.txt. The second command copies only the contents of bravo into charlie,
resulting in /charlie/widget.txt. So remember, a trailing slash on the source directory means that
the source directory's contents are copied, not the directory itself.
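A quick way to see the difference for yourself is to run both forms against scratch directories; the /tmp paths below are only an assumption for the experiment:

mkdir -p /tmp/bravo /tmp/charlie1 /tmp/charlie2
touch /tmp/bravo/widget.txt

rsync -a /tmp/bravo /tmp/charlie1

rsync -a /tmp/bravo/ /tmp/charlie2

The first run leaves you with /tmp/charlie1/bravo/widget.txt, the second with /tmp/charlie2/widget.txt.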
Rsync Options
Obviously, detailing the massive number of options that rsync has available is beyond the scope
of this document. We will talk about the specific ones used in our method.
The first option we use is -a, which stands for archive mode. Archive mode is a compilation of
other switches, which amount to preserving almost everything.
Rsync Switches Inherent To Archive Mode

r: operate recursively
l: copy symbolic links as symbolic links
p: preserve permissions
t: preserve modification times
g: preserve groups
o: preserve owner
D: preserve device files and special files

So you can see that by using archive mode, the files are copied in such a way that they could be
restored to the source directory transparently; everything vital to their use is maintained.
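In other words, the following two commands should behave identically; the paths are placeholders:

rsync -a /source/ /target/

rsync -rlptgoD /source/ /target/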

The second option we utilize is -v, which as usual requests verbosity in reporting. It's always nice to see what is going on, and to export it to a log if you so choose.
The third option is -h, which requests that transfer amounts are expressed in human readable
format. Instead of 1256842136 bytes transferred, you will see 1.257 GB, or something like that.
The fourth option is --delete, which deletes files from the target directory that have been deleted in the source directory. So not only are files which are different in the source copied to the target, but files which no longer exist in the source yet still exist in the target are also deleted from the target.
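A tiny illustration of --delete, using a hypothetical mirror directory and file name that are not part of the running example:

rsync -avh /Alpha/ /Alpha_mirror

rm /Alpha/old_report.txt

rsync -avh --delete /Alpha/ /Alpha_mirror

After the second rsync run, old_report.txt is gone from /Alpha_mirror as well.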
Our last option is --link-dest=[link directory], and this takes a bit of explaining.
Rsync typically evaluates against the target directory. The --link-dest option instructs rsync to use
the specified link destination directory as an additional file structure to evaluate against; files that
are unchanged in either the target directory or the link destination directory are not sent.
Additionally, unchanged files are hard linked from the link destination directory to the target
directory.
Let's break that down, continuing our example. Our working directory, Alpha, has a full backup
called Alpha_Full created on day one. On day two, we want to create our incremental backup in
a new directory called Alpha_2. If we created a new empty directory called Alpha_2 and ran:

rsync -avh --delete /Alpha/ /Alpha_2

A new full backup of Alpha would be made in Alpha_2. That isn't incremental, and doesn't do us
any good. By using the --link-dest option like this:

rsync -avh --delete --link-dest=/Alpha_Full /Alpha/ /Alpha_2

We are instructing rsync to back up the contents of Alpha into Alpha_2, evaluating for differences against Alpha_Full. It is as if Alpha_2 were Alpha_Full for the purpose of checking for changes. Then, after it transfers those files that have changed, it hard links everything else from Alpha_Full to Alpha_2. The hard links create the appearance of the files existing twice, once in Alpha_Full and once in Alpha_2, but there is actually only one copy of each file. Finally, since the --delete option was used, files that were deleted in Alpha will have their hard links deleted in Alpha_2.
Thus, we have created a snapshot, incremental backup. The next day we would do the same
thing, creating Alpha_3, but instead of evaluating against Alpha_Full for changes, we would
check against Alpha_2.
Rsync and Hard Links

A concern you may have is: what happens to a file that is originally stored in Alpha_Full, then on
day 10 is changed? If Alpha_10 only has a hard link to the original file, and that file is changed,
does rsync change the original, thus ruining the snapshot effect?
The answer is no. Rsync is "wicked smart", as they say in my neck of the woods. When
evaluating for changes against a hard link, if the source file is different, the destination file is
unlinked before transfer. A new version of the file is created in the target directory, and the
historical version of the file is preserved in earlier backups. You have to give explicit options for
rsync to do otherwise, and the man pages hint rather broadly that it might be a bad idea.
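If you want to verify this behavior yourself, comparing inode numbers is enough; the file name report.txt and the backup paths below are hypothetical:

ls -i /backup/Alpha_Full/report.txt /backup/Alpha_10/report.txt

Before the file changes, both names report the same inode (one shared file). After the file is modified in the source and the next backup runs, the newest backup directory shows a different inode, while the older backups keep the original, unmodified version.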
Secure Transmission With Rsync
All of our examples thus far have used rsync to copy files locally. Rsync can be used to back up to another computer across a network, whether that network is a LAN or the internet. You can specify which remote shell to use with -e, followed by the shell name. The syntax for the source directory remains the same. The syntax for the target directory works much like scp: username@domain.name:/path.
Continuing our example, if we were making our backups of Alpha to a remote computer at a domain named pluto.org, it would look like this:

rsync -avh -e ssh --delete /Alpha/ username@pluto.org:/Alpha_2

Substitute username with an account that has the appropriate rights. You would then be prompted
for the password. You can obviate the need for passwords by using preshared keys, or setting up
an rsync daemon server. Both of those options fall outside the scope of this document.
The remainder of the tutorial assumes that the backups are made to a network share or a separate
partition on the same computer, not a remote machine. Even so, it is important to understand that
rsync has this capability.
Date Stamping With A Shell Script
We now understand the methods and theory behind creating a snapshot style, incremental backup
with rsync. The next step is implementation. Other methods use a backup naming convention
such as backup1, backup2, backup3. The script used to automate the method then deletes
backup3, renames backup2 to backup3, and likewise renames backup1 to backup2. It then
creates a new backup1, using backup2 as the link destination directory.
That successfully creates a rotating backup structure. However, I found the steps required to do
the process somewhat convoluted, and the number of steps increases with the number of backups
you want to keep on hand.
I thought it would be simpler if the script could recognize yesterday's backup by name. I also thought it would be easier to recognize and find the backup you need by date, rather than by an abstract identifier. We can accomplish both of these goals by using date stamps for the backup names.
Let's walk through the steps we would want our automated script to do.
1. Create a new directory for the daily backup; the directory's name should be today's date.
2. Backup the source directory to the target directory, using the directory named with
yesterday's date as the link destination.
3. (optional) Delete a directory that is a certain number of days old.
So the first step in making this work is getting a shell script to be able to find today's date,
yesterday's date, and a third date from x number of days ago. If you want to keep backups for 28
days, x would equal 29. I'll talk about why step three is optional later.
How To Date Stamp
When we talk about date stamping, we mean using a date format such as 2011-07-24 to represent
July 24, 2011. This format is unambiguous, universally recognizable, and guarantees that files
named in that format will accurately sort by date ascending or descending.
The man page documentation for date is varied, and sadly, often incomplete. I am going to show
some of the hard ways of doing things before showing the easier ways. (I am a firm believer that
you never learn anything by doing it right.)
If we want the system to output today's date in date stamp format, one way of doing it is to use the date command with format modifiers that explicitly state the format:

date +%Y-%m-%d

This is also known as an ISO-8601 compliant format. What is not well known about the date command is that it has a switch, -I, to output in ISO-8601. So, this accomplishes the same thing with far fewer symbols:

date -I

To use this in our script, we can create a variable and assign it the output of the date command by running it within backticks.

DAY0=`date -I`

We now have a variable which will always be equal to today's date in date stamp format. Useful
to our objective, no? Next, we need a variable equal to yesterday's date in date stamp format.
I went through quite the process on this before I found a simpler method. The hard way was that I converted today's date to epoch time (the number of seconds since midnight, January 1, 1970) and assigned that value to a variable. I then subtracted one day's worth of seconds from that variable and assigned the result to a new variable. I then converted the new variable back into date stamp format. All of this required some esoteric commands.
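For the curious, here is a sketch of that hard way using GNU date; the variable names are assumptions and these are not necessarily the exact commands I used at the time:

NOW=`date +%s`

YESTERDAY=$((NOW - 86400))

DAY1=`date -d @$YESTERDAY +%Y-%m-%d`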
Now for the easy way. The date command also has an under-documented function where it will tell you a date in the past or future with simple syntax. For instance, if today's date is 2011-07-27, then the following code will output 2011-07-26 and assign it to a variable:

DAY1=`date -I -d "1 day ago"`

The Backup Script


Let's flesh out the rest of the script. Make these assumptions: we are backing up a website to a directory named /backup located on a separate partition. The full initial backup was made manually to a directory named 2011-07-01. We will set variables equal to our source directory, our target directory, our link destination directory, and our options. We will then execute the rsync command. Create a script called website_bak.sh. (Make sure you make it executable.)

#!/bin/bash

#Website Backup Script

#Today's date in ISO-8601 format:
DAY0=`date -I`

#Yesterday's date in ISO-8601 format:
DAY1=`date -I -d "1 day ago"`

#The source directory:
SRC="/var/www/htdocs/"

#The target directory:
TRG="/backup/website/$DAY0"

#The link destination directory:
LNK="/backup/website/$DAY1"

#The rsync options:
OPT="-avh --delete --link-dest=$LNK"

#Execute the backup
rsync $OPT $SRC $TRG

If the initial full backup is made on 2011-07-01, you can have this script executed as a cron job starting on 2011-07-02. Each day it will make an incremental, date stamped backup, using the prior day's backup as the link destination directory. There is no need to rename backups or move them around. You will also note that the rsync command itself creates the target directory. This works for one directory level only; if the parent directory /backup/website did not exist, so that rsync had to create both /backup/website and the $DAY0 directory inside it, the backup would fail and rsync would exit with an error code.
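A minimal safeguard against that failure, using the same variables as the script above, is to create the parent path before the rsync call; this is an optional addition, not part of the original script:

#Make sure the parent path exists before the backup runs
mkdir -p /backup/website

#Execute the backup
rsync $OPT $SRC $TRG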
The last optional step is to remove older backups. The reason I say this is optional is because it
depends on how much data you are working with and the degree of change it experiences. This
method is very efficient at reducing overhead; you could conceivably keep incremental backups
for years without running out of space if your data doesn't change drastically.
However, let's say that you want to keep only the last 28 days of data. Add the following lines to the script.

#29 days ago in ISO-8601 format
DAY29=`date -I -d "29 days ago"`

#Delete the backup from 29 days ago, if it exists
if [ -d /backup/website/$DAY29 ]
then
	rm -rf /backup/website/$DAY29
fi
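Note that the check above removes only the single directory dated exactly 29 days ago; if the script ever skips a day, that older directory is never cleaned up. Because ISO-8601 date stamps sort chronologically, one alternative sketch (an assumption of mine, not part of the original script, and it presumes only date stamped directories live under /backup/website) is to delete everything whose name sorts before the cutoff:

#Delete every backup directory older than the cutoff
for d in /backup/website/*/ ; do
	name=`basename $d`
	if [[ "$name" < "$DAY29" ]]
	then
		rm -rf /backup/website/$name
	fi
done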

Voila, a script that does everything we need for a backup. Our final step is to automate its use.
Cron
Linux has a scheduling utility called cron which will run operations at a fixed time. Going into
how cron works and editing its settings is beyond the scope of this document. However, if you
just want a script to run daily, it is simply a matter of placing it in the correct directory.
So, copy the backup script to the cron daily directory. In Slackware, this is:

cp -v website_bak.sh /etc/cron.daily/

Conclusion



After a long journey, we have reached our goal of having automated, snapshot style incremental
backups using rsync. Along the way we learned the difference between incremental and
differential backups, what hard links are, how rsync works, and how to pull it all together with a
shell script. I hope that this article helps you in your endeavors.
