Вы находитесь на странице: 1из 7

Every DBA will see corruption at some point in his or her career.

Hopefully when you do, you'll


have a comprehensive set of backups that let you perform a fast, targeted set of restores with
minimal downtime and data loss. Alternatively, if the most important business requirement is to
minimize downtime, you might fail over to a synchronously updated, redundant system.

But what if you can't do either? What if your backups are corrupt, your backups have been
failing for months and no one noticed, or the piecemeal restore that you want to perform isn’t
possible because you’re using the SIMPLE recovery model? There have been many articles and
blog posts written about using backups to recover from corruptions (see my blog category
Backup/Restore for lots of useful links), but very few about what to do when you don’t have
backups or when your restore operations fail. In this article I’m going to walk you through an
example of using the repair functionality built into DBCC CHECKDB, plus older backups, to
attempt to salvage some data.

One thing to keep in mind is that there are an infinite number of scenarios involving database -
corruption, and at every stage in the repair process something could crop up that you haven’t
seen before. It’s not possible to cover every eventuality in this article, but it will give a good
grounding in the techniques used when performing a repair operation.

Scenario Setup

To practice these repair techniques, I created a simple example database that you can download.
The database is called SQLMag2010 and simulates a simple sales database. I created the
database, performed a full database backup, corrupted the database, and then performed another
full database backup. The corruption is on one of the data pages in the sales table’s clustered
index. Figure 1 shows the timeline of this scenario.

Figure 1: Timeline of a disaster scenario

The WinZip archive contains the two database backups. To set up the scenario, simply download
and unzip the archive and restore the corrupt version of the database using the following code:

RESTORE DATABASE SQLMag2010


FROM DISK = N'C:\SQLskills\
CorruptSQLMag2010.bak'
WITH REPLACE;
GO

This backup restores without errors because the backup wasn't performed using the WITH
CHECKSUM option. This option (introduced in SQL Server 2005) lets backup and restore
operations test page checksums on pages in the database being backed up and restored, but
without this option, no checks are performed. In production, it’s advisable to use this option
when performing backups.

Also note that the database was created using SQL Server 2005, but you can restore it on SQL
Server 2008 R2 and SQL Server 2008, and everything I describe in this article works the same in
all three versions.

Using Database Repair

After you restore the database, run the following


query against the sales table to expose the corruption:

SELECT SUM (salesAmount)


FROM SQLMag2010.dbo.sales
WHERE salesAmount < $1.00;
GO

The error that's returned is shown in Web Figure 1.

Web Figure 1

In the case of a real corruption in your database, you should have some kind of monitoring in
place so that you’re alerted when a user or application hits a high-severity error similar to the one
shown in Web Figure 1. For example, you can create a SQL Agent alert based on the error
severity. (See "Easy Monitoring of High-Severity Errors: Create Agent Alerts," for more -
information about how to do so.) You don’t want to rely on users notifying you or having to
periodically look through the SQL Server error log or Windows event log to see these errors.
Also, the sooner you know that you have corruption, the more likely you can recover with
minimal downtime and data loss.

Generally, the first thing to do when you encounter corruption is determine how much corruption
you’re dealing with. Only one tool will do that for you: DBCC CHECKDB. You need to
determine the extent of the corruption so that you can figure out what your options are in terms
of restore or repair (or potentially just fail over and deal with the corruption offline).

In any disaster recovery situation, you want only information about the corruptions, and you
want to know about all the corruptions, so instruct DBCC to provide all this information using
the following command:

DBCC CHECKDB (SQLMag2010) WITH ALL_ ERRORMSGS,


NO_INFOMSGS;
GO

The results from this command are shown in Web Figure 2. (Note that I removed some of the
output to save space in the article. When you run it, you’ll see the full output.)
Web Figure 2

From the errors, it looks like a single page is damaged. At this point, you might choose to
examine the page using the DBCC PAGE command (which, although undocumented, is
perfectly safe on production systems—I wrote that code, and I told customers the same thing
when I worked at Microsoft). Use the following code to produce a dump of the whole page:

DBCC TRACEON (3604);


DBCC PAGE (SQLMag2010, 1, 158, 2);
GO

The trace flag is necessary to return the DBCC output to your current connection; otherwise, it
goes to the error log. You can read more about using the DBCC PAGE command in many posts
on my blog.

The best way to repair this corruption with minimal downtime and zero data loss is to perform a
single-page piecemeal restore (introduced in SQL Server 2005), starting with the most recent full
database backup and then using the most recent working differential backup and all required
transaction log backups to bring that page up to the same time as the rest of the database.

In this contrived scenario, however, you don't have any backups other than a single full database
backup that was done prior to many other changes (e.g., it’s last week’s backup). As a result,
there’s no way to perform a piecemeal restore because you won’t be able to bring the older page
version from that backup up to the same time as the rest of the database without transaction log
backups. This means you’re going to have to use database repair. As long as the corruptions that
DBCC CHECKDB discovered are repairable, the last line of output will tell you what repair
option you need to use.

In this example, you’re going to need to use REPAIR_ALLOW_DATA_LOSS. This means that
the corruption(s) can be repaired only by what I call “delete-what’s-broken-and-fix-up-all-the-
structural-linkages.” When I wrote the repair code for SQL Server 2005, as a part of rewriting all
of DBCC CHECKDB, I had to come up with the simplest, fastest, and 100 percent correct way
of repairing the various types of corruptions. In most cases, this turned out to be to delete the
corrupt structure in the database and then ensure that all the other structures in the database were
corrected to take account of the deletion.

By specifying the REPAIR_ALLOW_DATA_LOSS option, you’re acknowledging that DBCC


CHECKDB will have to delete some data to perform the repair. There’s no point trying to use
REPAIR_FAST or REPAIR_REBUILD if the DBCC CHECKDB output says you need
REPAIR_ALLOW_DATA_LOSS—it won’t be able to fix the corruptions if you specify a lower
repair level.

At this point, you have no choice but to perform a repair operation. The following are the steps
you should take to perform this operation:
1. Back up the corrupt database so that you have a fall-back option in case anything goes
wrong. In this scenario we’ve already got that backup, but in production people often
forget to back up the database at this stage just in case.
2. Put the database into single-user mode (a requirement for using any repair option with
DBCC CHECKDB).
3. Run the repair operation.
4. Verify that the corruption has been completely repaired by running DBCC CHECKDB
again to ensure there are no further corruption messages.
5. Take any further recovery steps to validate the data in the database. These steps are
discussed in more detail later in the article.
6. Once you’re satisfied with the state of the database, return it to multiuser mode and let
the workload resume.

You might find there are some corruptions that DBCC CHECKDB can’t repair. (For more -
information, see "CHECKDB From Every Angle: Can CHECKDB repair everything?"). You
can tell if this is the case by examining the end of the DBCC CHECKDB output: If the stated
number of corruptions fixed is equal to the stated number of corruptions found, then DBCC
CHECKDB repaired all the errors. If you find there are some corruptions that can’t be repaired,
it might be possible to manually repair them, but the details are beyond the scope of this article.
See my blog category Repair at SQLskills.com for examples.

Step 4 is necessary because there might be more corruptions that DBCC CHECKDB couldn’t
previously find. This can happen if a particular corruption prevents DBCC CHECKDB from
deeply examining a portion of the database—once the corruption is repaired, a subsequent deeper
examination might further reveal corruptions that must be repaired. I call this phenomenon
corruption masking.

For our scenario, the following is the code used to perform the repair operation:

ALTER DATABASE SQLMag2010 SET SINGLE_USER;


GO
DBCC CHECKDB (SQLMag2010, REPAIR_ALLOW_DATA_LOSS)
WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

And this produces the output shown in Web Figure 3 (again, curtailed slightly to save space).
From the output in Web Figure 3, you can see that DBCC CHECKDB fixed all the corruptions it
knew about, and if you run DBCC CHECKDB again, you’ll see that there were no masked
corruptions.

Web Figure 3

Step 5 in the list is to perform any further recovery actions because after running any
REPAIR_ALLOW_DATA_LOSS operation, you need to be aware that some data was deleted.
There are three substeps you might need to take at this point:
1. Check the consistency of any foreign key relationships that involved the table(s) affected
by the repair operation. DBCC CHECKDB doesn’t take constraints into account when
performing repairs. You can use the DBCC CHECKCONSTRAINTS command to check
your defined foreign key relationships. You’ll also want to verify the logic of your
triggers to see if any additional modifications should be made. In general, you need to
check all of your business logic surrounding the data that was deleted.
2. Reinitialize any replication subscribers that were subscribing to publications from the
table(s) affected by the repair operation. All repairs are fully logged, but are specifically
marked not for replication because of the physical nature of the operations being
performed, so the transactional replication log reader job won’t replicate them. Any data
deleted (or rows inserted into nonclustered indexes) by the repair won’t cause the same
actions to occur on the replication subscribers, so they must be reinitialized. Similarly,
merge replication triggers won’t fire because of repair operations, so merge replication
subscribers must be reinitialized.
3. Determine which data was deleted by the repair operations, if any.

Depending on the complexity of the database being repaired, it might be possible to retrieve
some data from an older backup, but you need to know what has been lost. DBCC CHECKDB
doesn’t tell you which data was deleted (mostly because the deleted data was on corrupt database
pages that couldn’t be processed), so you’ll need some way of comparing the postrepair data
with a known good copy of the prerepair data. The tablediff tool that ships with SQL Server can
do this for each table (see MSDN's tablediff Utility web page for more information).

You might also be able to programmatically determine what has been deleted. For instance, if the
table that was repaired has an identity column, and table rows aren’t deleted, then any missing
block of rows should represent those deleted by the repair operation. It just so happens that this is
exactly the case in our example scenario.

To identify the missing block of rows in this example, you can use the code in Listing 1. You can
then restore an older backup and determine if any of the missing rows are present in the restored,
older copy of the database. If so, you can make a business decision to copy them into the
production database using an INSERT SELECT statement, being sure to bear in mind, of course,
that there’s no guarantee that the salvaged rows from the older database are up-to-date because
additional modifications could have occurred against these rows after the backup was performed.
However, at least you might be able to recover some of the deleted data and have a base on
which to investigate what other changes might have occurred.

-- Start of the missing range is when a row does not have a plus-1 neighbor.
SELECT MIN (A.salesID + 1)
FROM SQLMag2010.dbo.sales AS A
WHERE NOT EXISTS (
SELECT B.salesID
FROM SQLMag2010.dbo.sales AS B
WHERE B.salesID = A.salesID + 1);
GO

-- End of the missing range is when a row does not have a minus-1 neighbor
SELECT MAX (A.salesID - 1)
FROM SQLMag2010.dbo.sales AS A
WHERE NOT EXISTS (
SELECT B.salesID
FROM SQLMag2010.dbo.sales AS B
WHERE B.salesID = A.salesID - 1);
GO

The code in Listing 2 can be used to perform the data copy into the production database (of our
example), replacing **MIN** and **MAX** with the values found from running the code in
Listing 2.

SET IDENTITY_INSERT SQLMag2010.dbo.sales ON;


SET NOCOUNT OFF;
GO

INSERT INTO SQLMag2010.dbo.sales


(salesID, customerID, salesDate, salesAmount)
SELECT copy.*
FROM SQLMag2010Copy.dbo.sales AS copy
WHERE copy.salesID > **MIN** - 1
AND copy.salesID < **MAX** + 1;
GO

SET IDENTITY_INSERT SQLMag2010.dbo.sales OFF;


GO

As you can see, using database repair is by no means as straightforward as using a sequence of
RESTORE operations, but it’s not a difficult task, either—as long as you’ve practiced in
advance.

Preparing for a Disaster

It's critical to have a comprehensive SQL Server disaster recovery strategy and to have it
integrated into the business-wide IT disaster recovery strategy. It’s also critical that the strategy
encompasses as many eventualities as possible and that it’s tested as often as possible. All easier
said than done—see "Importance of having a good disaster recovery plan” and “Importance of
testing your disaster recovery plan."

At the very least, you should practice performing common restore operations regularly so that
you aren’t struggling with the syntax of the RESTORE command during a real disaster and you
have some level of confidence that your backups do actually restore. But to be really prepared
for a disaster, and to be able to stay calm when backups don’t work, practice dealing with corrupt
databases using tools other than RESTORE. It will do wonders for your confidence level and
could end up helping to save you and your business from a lot of problems when your first line
of defense against disaster fails.

Over the past few years I’ve created a collection of corrupt databases you can use to practice
performing repair operations. See my blog post "Conference corruption demo scripts and
example corrupt databases" for more details. Lastly, if you ever find yourself having to use
database repair, make sure that when the dust settles and the crisis is over, you go back and
revisit your backup strategy to work out why it failed in the first place. As always, if you have
any questions, drop me a line at paul@SQLskills.com.

Вам также может понравиться