Overview
In my years as a DBA I've seen (and even made) many common mistakes when
reviewing the SQL queries that run against the systems I maintain. From this experience I've
found that there are some general guidelines that should be followed when writing queries
and also when designing a database schema. In this tutorial we will take a look at a few
different areas where these common mistakes are made and what can be done to fix them.
These areas include:
Query writing
Indexing
Schema design
Explanation
In each section of this tutorial we will take a look at specific examples that illustrate
things that should be avoided when it comes to performance in SQL Server. For each of
these items I will provide a solution or alternative that provides better performance.
Please keep in mind that these are general guidelines and there will be exceptions to these
examples, but in general following these basic principles should get you off to a fast start
performance-wise.
The specific topics that will be covered in this tip are as follows:
Query writing
Indexing:
o Use WHERE, JOIN, ORDER BY, SELECT Column Order When Creating Indexes
Schema design:
o Use DELETE CASCADE Option to Handle Child Key Removal in Foreign Key Relationships
-- table creation (the original CREATE TABLE statements were truncated; the
-- definition below is reconstructed from the data loads and queries in this
-- tip, so the exact types and sizes are assumptions)
CREATE TABLE [dbo].[Child](
   [ChildID] [bigint] NOT NULL,
   [ParentID] [bigint] NULL,
   [IntDataColumn] [bigint] NULL,
   [VarcharDataColumn] [varchar](100) NULL,
   CONSTRAINT [PK_Child] PRIMARY KEY CLUSTERED
   ([ChildID] ASC)
)
GO
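-- The remaining CREATE TABLE statements were also truncated; these are minimal
-- sketches of the other three tables, with columns inferred from the data loads
-- and queries in this tip (exact types and sizes are assumptions)
CREATE TABLE [dbo].[Parent](
   [ParentID] [bigint] NOT NULL,
   [IntDataColumn] [bigint] NULL,
   [VarcharDataColumn] [varchar](1000) NULL,
   [DateDataColumn] [datetime] NULL,
   CONSTRAINT [PK_Parent] PRIMARY KEY CLUSTERED
   ([ParentID] ASC)
)
GO
CREATE TABLE [dbo].[ChildDetail](
   [ChildDetailID] [bigint] NOT NULL,
   [ChildID] [bigint] NOT NULL,
   [ExtraDataColumn] [bigint] NULL,
   CONSTRAINT [PK_ChildDetail] PRIMARY KEY CLUSTERED
   ([ChildDetailID] ASC,[ChildID] ASC)
)
GO
CREATE TABLE [dbo].[Small](
   [SmallID] [bigint] NOT NULL,
   [IntDataColumn] [bigint] NULL,
   [VarcharDataColumn] [varchar](100) NULL,
   CONSTRAINT [PK_Small] PRIMARY KEY CLUSTERED
   ([SmallID] ASC)
)
GO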
-- foreign key constraint
ALTER TABLE [dbo].[Child] WITH CHECK
ADD CONSTRAINT [FK_Child_Parent] FOREIGN KEY([ParentID])
REFERENCES [dbo].[Parent] ([ParentID])
ON DELETE CASCADE
GO
-- data load: one Parent row per iteration, 19 Child rows per parent,
-- and 4 ChildDetail rows per child
DECLARE @val BIGINT
DECLARE @val2 BIGINT
SELECT @val=1
WHILE @val < 100000
BEGIN
   INSERT INTO dbo.[Parent] VALUES(@val,@val,'TEST' + CAST(@val AS VARCHAR),getdate()-(@val/24.0))
   SELECT @val2=1
   WHILE @val2 < 20
   BEGIN
      INSERT INTO dbo.[Child] VALUES ((@val*100000)+@val2,@val,@val,'TEST' + CAST(@val AS VARCHAR))
      INSERT INTO dbo.[ChildDetail] VALUES (1,(@val*100000)+@val2,9999)
      INSERT INTO dbo.[ChildDetail] VALUES (2,(@val*100000)+@val2,1111)
      INSERT INTO dbo.[ChildDetail] VALUES (3,(@val*100000)+@val2,3333)
      INSERT INTO dbo.[ChildDetail] VALUES (4,(@val*100000)+@val2,7777)
      SELECT @val2=@val2+1
   END
   SELECT @val=@val+1
END
GO
-- data load
INSERT INTO dbo.[Small] VALUES(50,80,'TEST5080')
INSERT INTO dbo.[Small] VALUES(510,810,'TEST510810')
INSERT INTO dbo.[Small] VALUES(7001,9030,'TEST70019030')
INSERT INTO dbo.[Small] VALUES(12093,10093,'TEST1209310093')
INSERT INTO dbo.[Small] VALUES(48756,39843,'TEST4875639843')
INSERT INTO dbo.[Small] VALUES(829870,57463,'TEST82987057463')
GO
-- cleanup statements
--DROP TABLE [dbo].[Small]
--DROP TABLE [dbo].[ChildDetail]
--DROP TABLE [dbo].[Child]
--DROP TABLE [dbo].[Parent]
-- cleanup statements
--DROP INDEX Child.idxChild_ParentID
Since in most cases this issue arises when queries become really complex and
the optimizer has a lot of possible plans to evaluate (e.g., multiple table joins), we'll use
the FORCE ORDER hint with a simple query to illustrate the point more clearly. Here is
the code to illustrate our poor join order.
SELECT P.ParentID,C.ChildID,S.SmallID
FROM [dbo].[Parent] P INNER JOIN
[dbo].[Child] C ON C.ParentID=P.ParentID INNER JOIN
[dbo].[Small] S ON S.SmallID=C.ParentID
OPTION (FORCE ORDER)
Looking at the explain plan for this query we can see that the Parent and Child tables are
joined first, resulting in 1,899,980 rows, which are then joined to the Small table, reducing
the final recordset to 95 rows.
And now let's join them in the proper order so the smallest table is joined first. Here is the
SQL statement.
SELECT P.ParentID,C.ChildID,S.SmallID
FROM [dbo].[Small] S INNER JOIN
[dbo].[Parent] P ON S.SmallID=P.ParentID INNER JOIN
[dbo].[Child] C ON P.ParentID=C.ParentID
Looking at the explain plan for this query we see that the Parent table is first joined to the
Small table, resulting in 5 rows, which are then joined to the Child table to produce the
final recordset of 95 rows (as above).
Just looking at the explain plans should be enough information for us to see that the second
query will perform better, but let's take a look at the SQL Profiler statistics just to confirm. As
we can see below, joining the Small table first significantly reduces the amount of data the
query has to process, thereby reducing the resources required to execute the query.
Join Order      CPU   Reads   Writes   Duration
Poor (forced)   265   5935             309
Optimal         35
Additional Information
-- cleanup statements
--DROP INDEX Child.idxChild_ParentID
--DROP FUNCTION fn_getParentDate
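The definition of fn_getParentDate itself is not shown, but based on how it is used below (it takes a ParentID and looks up the date value from the Parent table) it would be a scalar user-defined function along these lines; treat this as a sketch rather than the original code.
CREATE FUNCTION dbo.fn_getParentDate(@ParentID BIGINT)
RETURNS DATETIME
AS
BEGIN
   -- scalar lookup against the Parent table, executed once per outer row
   RETURN (SELECT DateDataColumn FROM [dbo].[Parent] WHERE ParentID=@ParentID)
END
GO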
Now we can write a simple query that calls this function to do a lookup in the
Parent table for each row. Here is the statement.
SELECT dbo.fn_getParentDate(ParentID),ChildID
FROM [dbo].[Child]
Looking at the explain plan for this query we can see it's going to do a scan of the Child table,
which makes sense since there is no WHERE clause, and for each row returned it uses the
index on the Parent table to do a seek for the lookup.
Now let's rewrite this query and instead of using the function call let's just join the Parent
table in our query and add the DateDataColumn to our SELECT list. Here is the statement.
SELECT P.DateDataColumn,ChildID
FROM [dbo].[Parent] P INNER JOIN
[dbo].[Child] C ON P.ParentID=C.ParentID
Looking at the explain plan for this query we can see it only has to access the Parent table
once, but it now has to do a scan of this table before it performs a merge join.
It's not entirely clear from just looking at the above explain plans which statement will
perform better. The index seek in the query with the function might lead you to believe that
it would be faster, but let's run the statements and take a look at the SQL Profiler results
below. We can see from these results that the query without the function in fact ran more
than twice as fast and used considerably fewer resources than the one that uses the function
call.
              CPU     Reads     Writes   Duration
Function      14985   5705126            25982
No Function   578     5933               11964
Additional Information
-- cleanup statements
--DROP INDEX Child.idxChild_IntDataColumn
Now let's look at a simple query which would return all the records where IntDataColumn
<> 60000. Here is what that would look like.
SELECT P.ParentID,C.ChildID,C.IntDataColumn
FROM [dbo].[Parent] P INNER JOIN
[dbo].[Child] C ON P.ParentID=C.ParentID
WHERE C.IntDataColumn <> 60000
Looking at the explain plan for this query we see something really interesting. Since
the optimizer has some statistics on the data in this column it has rewritten the query to use
separate < and > clauses. We can see this in the details of the Index Seek under the Seek
Predicate heading.
Now let's see what happens if we have two <> clauses as follows.
SELECT P.ParentID,C.ChildID,C.IntDataColumn
FROM [dbo].[Parent] P INNER JOIN
[dbo].[Child] C ON P.ParentID=C.ParentID
WHERE C.IntDataColumn <> 60000 and C.IntDataColumn <> 5564
Looking at the explain plan for this query we also see that the optimizer has done some
manipulation to the WHERE clause. It is now using the new value we added in the Seek
Predicate and the original value as the other Predicate. Both have been changed to use
separate < and > clauses.
Although the changes that the optimizer has made have certainly helped the query by
avoiding an index scan, it's always best to use an equality operator, like = or IN, in your query
if you want the best performance possible. One thing you should consider before making a
change like this is to make sure you have a good understanding of your data, as changes
in your table data can then affect your query results. With that said, and given that
we know our table has very few records that satisfy the WHERE condition, let's flip it to an
equality operator and see the difference in performance. Here is the new query.
SELECT P.ParentID,C.ChildID,C.IntDataColumn
FROM [dbo].[Parent] P INNER JOIN
[dbo].[Child] C ON P.ParentID=C.ParentID
WHERE C.IntDataColumn IN (3423,87347,93423)
Looking at the explain plan for this query we can see that it's also doing an index seek but
looking deeper into the Seek Predicate we can now see it's using the equality operator
which should be much faster given the number of records that satisfy the WHERE condition.
Now let's take a look at the SQL Profiler results for these two queries. We can see below
that the example using the equality operator runs faster and requires much less resources.
Note: Both queries returned the same result set.
Clause       CPU   Reads    Writes   Duration
Inequality   250   110901            255
Equality     15    654               15
Additional Information
-- cleanup statements
DROP INDEX Parent.idxParent_DateDataColumn
Now let's look at a simple query which would return all the records in the Parent table that
are less than 30 days old. Here is one way that we could write the SQL statement.
SELECT ParentID
FROM [dbo].[Parent]
WHERE dateadd(d,30,DateDataColumn) > getdate()
Looking at the explain plan for this query we can see that the index on the DateDataColumn
that we created is ignored and an index scan is performed.
Now let's rewrite this query and move the function to the other side of the > operator. Here
is the SQL statement.
SELECT ParentID
FROM [dbo].[Parent]
WHERE DateDataColumn > dateadd(d,-30,getdate())
Looking at the explain plan for this query we can see that the optimizer is now using the
index and performs a seek rather than a scan.
To confirm that it is indeed faster let's take a look at the SQL Profiler results for these two
queries. We can see below that when using an index, as is usually the case, we use fewer
resources and our statement executes faster.
              CPU   Reads   Writes   Duration
Function      274   43
No Function
Additional Information
Before we get into the details of our explanation let's first create an index on the column
that we are going to use in the WHERE clause of our query. Here is the code to create that
index on the Child table.
CREATE NONCLUSTERED INDEX idxChild_VarcharDataColumn
ON [dbo].[Child] ([VarcharDataColumn])
-- cleanup statements
--DROP INDEX Child.idxChild_VarcharDataColumn
So why does it have to perform a table/index scan? Since all SQL Server indexes are stored
in a B-Tree structure, when we begin our search criteria with a wildcard character
the optimizer is not able to use an index to perform a seek to find the data quickly. It either
performs a scan of the table, or a scan of an index if all the columns required for the query
are part of the index. Now I understand that there are some cases where this is not
possible based on your requirements, but the following example shows why you should try
to avoid doing this whenever possible. Let's write a simple query that performs a search
on the column we indexed above. Here is the code for this simple SQL statement.
SELECT * FROM [dbo].[Child]
WHERE VarcharDataColumn LIKE '%EST5804%'
Looking at the explain plan for this query we can see that the index on the
VarcharDataColumn that we created is ignored and a clustered index scan (essentially a
table scan) has to be performed.
Now let's change the search string in this query to remove the wildcard so the string you are
searching for begins with a valid character. Here is the updated SQL statement. Note: I
picked the search criteria so that both queries return the same result set so that the results
are not skewed by one query returning a larger result set.
SELECT * FROM [dbo].[Child]
WHERE VarcharDataColumn LIKE 'TEST5804%'
Looking at the explain plan for this query we can see that the optimizer is now using the
index we created and performs a seek rather than a scan.
Although we should be able to tell just from comparing the explain plans that the second
query will perform better, let's confirm that it indeed uses fewer resources and executes
faster than our initial query by looking at the SQL Profiler results. We can see below
that by removing the wildcard character from the start of the search string we do in fact
see quite a big improvement.
                       CPU   Reads   Writes   Duration
Wildcard at Start      328   7042             404
No Wildcard at Start         670              64
Additional Information
CREATE NONCLUSTERED INDEX idxParentID_IntDataColumnParentID
ON [dbo].[Parent] ([IntDataColumn],[ParentID])
-- cleanup statements
DROP INDEX Parent.idxParentID_IntDataColumnParentID
Let's look at a query that uses the IN predicate to return the second largest value from a
table. One way to do this would be as follows.
SELECT MIN(IntDataColumn)
FROM [dbo].[Parent]
WHERE ParentID IN (SELECT TOP 2 ParentID
FROM [dbo].[Parent]
ORDER BY IntDataColumn DESC)
Just by looking at the query we can see we are going to access the Parent table twice to get
this result. From the explain plan we can see that the second access does use an index seek,
so it might not be too much of an issue.
Now let's rewrite this query and use a derived table to generate the result. Here is that SQL
statement.
SELECT MIN(IntDataColumn)
FROM (SELECT TOP 2 IntDataColumn
FROM [dbo].[Parent]
ORDER BY IntDataColumn DESC) AS A
Notice that the query only references the Parent table once, and the explain
plan confirms that we no longer have to access the Parent table a second time, even with an
index.
We can also see from the SQL Profiler results below that we do get some significant
resource savings even for this simple query. Although the CPU and total duration were the
same, we only had to perform 2 reads as opposed to the 8 required by the original query.
                CPU   Reads   Writes   Duration
IN Predicate          8
Derived Table         2
Additional Information
Let's add an index on the join column, Child.ParentID, and see how this affects the explain
plan. Here is the SQL statement.
CREATE NONCLUSTERED INDEX idxChild_ParentID
ON [dbo].[Child] ([ParentID])
-- cleanup statements
DROP INDEX Child.idxChild_ParentID
Using the same query as above, if we regenerate the explain plan after adding the index we see
below that the SQL Optimizer is now able to access the Child table using an index seek,
which will more than likely run much faster and use fewer resources.
Let's confirm our assumption by taking a look at the SQL Profiler output of both queries. We
see below that we were correct in our assumption. With the index added the query ran
much faster, used much less CPU and performed far fewer reads.
           CPU   Reads   Writes   Duration
No Index   110   14217            110
Index      63
There is one other thing I'd like to mention when it comes to adding indexes on join
columns. As a general guideline I usually start out by indexing all of my foreign key columns
and only remove them if I find that they have a negative impact. I recommend this practice
because more often than not these are the columns that the tables are joined on and you
tend to see a pretty good performance benefit from having these columns indexed.
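If you want to audit an existing database for this, a quick query against the standard catalog views can flag foreign key columns that are not the leading key of any index. This is just a sketch: it only checks the leading index column and doesn't account for multi-column foreign keys.
-- list foreign key columns that are not the leading key of any index
SELECT OBJECT_NAME(fkc.parent_object_id) AS TableName,
       COL_NAME(fkc.parent_object_id, fkc.parent_column_id) AS ColumnName
FROM sys.foreign_key_columns fkc
WHERE NOT EXISTS (SELECT 1
                  FROM sys.index_columns ic
                  WHERE ic.object_id = fkc.parent_object_id
                    AND ic.column_id = fkc.parent_column_id
                    AND ic.key_ordinal = 1)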
Additional Information
Use WHERE, JOIN, ORDER BY, SELECT Column Order When Creating Indexes
Overview
The order that the columns are specified in your indexes has an effect on whether or not the
entire index can be used when the SQL Optimizer parses your query.
Explanation
When looking at an explain plan for a query you'll notice that the SQL Optimizer first parses
the WHERE clause, then the JOIN clause, followed by the ORDER BY clause, and finally it
processes the data being selected. Based on this fact it makes sense that you would need to
specify the columns in your index in this order if you want the entire index to be used. This
is especially true if you are trying to create a covering index. Let's look at the following
simple query as an example.
SELECT P.ParentID,C.ChildID,C.IntDataColumn,C.VarcharDataColumn
FROM [dbo].[Parent] P INNER JOIN
[dbo].[Child] C ON P.ParentID=C.ParentID
WHERE C.IntDataColumn=32433
ORDER BY ChildID
And we'll use the following index statement to show how progressively adding columns to
the index in the order we mentioned above, WHERE-JOIN-ORDER BY-SELECT, will improve
the query's performance. A couple of things to note. First, I've included the entire index
statement here, but you can add the columns one at a time to see the difference at each
step. Second, the second CREATE INDEX statement is just an alternative to adding
the SELECT columns directly to the index; instead they are part of an INCLUDE clause.
CREATE NONCLUSTERED INDEX idxChild_JOINIndex
ON [dbo].[Child] ([IntDataColumn],[ParentID],[ChildID],[VarcharDataColumn])
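-- Alternative referenced above: the SELECT column moved into an INCLUDE clause
-- (this statement did not survive in the text; the column layout follows the
-- description above and the index name here is illustrative)
CREATE NONCLUSTERED INDEX idxChild_JOINIndex_Include
ON [dbo].[Child] ([IntDataColumn],[ParentID],[ChildID])
INCLUDE ([VarcharDataColumn])
-- cleanup: DROP INDEX Child.idxChild_JOINIndex_Include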
-- cleanup statements
DROP INDEX Child.idxChild_JOINIndex
Let's first take a look at the explain plans for each of these queries as we progressively add
columns to the index.
No Index
WHERE Index
WHERE,JOIN Index
WHERE,JOIN,ORDER BY Index
It's hard to tell just from the explain plans whether each step will see an improvement,
except for maybe the initial index, which eliminated the index scan, so let's take a
look at the SQL Profiler results to see the actual performance benefit.
Table Type                          CPU   Reads   Writes   Duration
No Index                            110   14271            103
WHERE Index                               129
WHERE,JOIN Index                          117
WHERE,JOIN,ORDER BY Index                 117
WHERE,JOIN,ORDER BY,SELECT Index          60
WHERE,JOIN,ORDER BY,INCLUDE Index         60
We can see from these results that as we add each column the SQL engine has to
perform fewer reads to execute the query, thereby executing a little faster. The only exception
to this is the step where we added the ORDER BY column to the index, but this can be attributed
to the fact that we are ordering by ChildID, which is a primary key, so the data is already sorted.
The other thing we should note is that there isn't really a performance difference between
adding the SELECT column directly to the index vs. using the INCLUDE clause.
Additional Information
Now let's recreate this index on the IntDataColumn as a clustered index. Here are the SQL
statements.
DROP INDEX Parent.idxParent_IntDataColumn
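-- The CREATE statement for the clustered version is not shown in the text;
-- it was presumably something along these lines (a sketch)
CREATE CLUSTERED INDEX idxParent_IntDataColumn
ON [dbo].[Parent] ([IntDataColumn])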
Looking at the SQL Profiler results for this query we can confirm that having a clustered
index does in fact allow SQL Server to execute the query using fewer resources, specifically
fewer reads to process the data.
Table Type   CPU   Reads   Writes   Duration
Heap
Clustered
The second benefit to having a clustered index on a table is it provides a way to reorganize
the table data when it becomes fragmented. Let's run an update on our table so it becomes
a little bit fragmented. We'll also put the table back to its original state with only the
clustered primary key to make it easier to view the results. Here are the SQL statements to
perform these tasks.
DROP INDEX Parent.idxParent_IntDataColumn
DECLARE @x BIGINT
DECLARE @y BIGINT
SELECT @x=1
WHILE @x < 100000
BEGIN
   -- pad each row with a long filler string to force page splits;
   -- REPLICATE stands in for the original literal, which spelled out
   -- 'TEST' repeated roughly 224 times
   UPDATE [dbo].[Parent]
   SET VarcharDataColumn=REPLICATE('TEST',224) + CAST(@x AS VARCHAR)
   WHERE ParentID=@x
   SELECT @x=@x+1
END
We can double check the fragmentation level of our table using the following query.
SELECT index_level,avg_fragmentation_in_percent,fragment_count,avg_fragment_size_in_pages,page_count
FROM sys.dm_db_index_physical_stats(DB_ID(N'master'), OBJECT_ID(N'dbo.Parent'), NULL, NULL, 'DETAILED')
We can see from the following results that after executing the update above we have some
fragmentation in our table.

index_level   avg_fragmentation_in_percent   fragment_count   avg_fragment_size_in_pages   page_count
0             14.3                           3507             6.9                          24394
1             5.3                            111              1.0                          112
Now if our table did not have a clustered index we would have to create a temporary table,
reload the data into it, recreate all of the indexes, then drop the original
table and rename the temporary table. We would also have to disable any referential
integrity constraints before doing any of this and add them back when we were done. All of
these tasks would also require downtime for the application. Since our table does have a
clustered index we can simply rebuild this index to reorganize the table data. Doing a
regular rebuild would require some downtime, but we would avoid all the extra steps
required by the reload. If we don't have the luxury of being able to take our application
offline to do maintenance, SQL Server does provide the ability to perform this task
online, while the table is being accessed. Here is the SQL statement to do an online rebuild
(note: simply remove the WITH clause or replace ON with OFF to perform a regular
offline rebuild).
ALTER INDEX PK_Parent ON Parent REBUILD WITH (ONLINE=ON)
After running the index rebuild statement we can again check the fragmentation in our
table using the sys.dm_db_index_physical_stats query from earlier.

index_level   avg_fragmentation_in_percent   fragment_count   avg_fragment_size_in_pages   page_count
0             0.01                           18               694.4                        12500
1             7.5                            30
Additional Information
Use DELETE CASCADE Option to Handle Child Key Removal in Foreign Key
Relationships
Overview
Using the DELETE CASCADE option in your foreign key constraint definitions means better
performance and less code when removing records from tables that have a parent-child
relationship defined.
Explanation
Let's first confirm that our current schema does indeed have the DELETE CASCADE option
defined on the foreign key between the Parent and Child tables. Here is the SQL statement
to check this as well as the result.
SELECT name,delete_referential_action_desc
FROM sys.foreign_keys
name              delete_referential_action_desc
FK_Child_Parent   CASCADE
Now that we've confirmed we have this option defined let's delete a record from the Parent
table using the following SQL statement.
DELETE FROM [dbo].[Parent] where ParentID=82433
Looking at the explain plan for this query we want to note that the SQL Optimizer first
removes the child records and then performs the delete on the Parent table. Because of this
it only needs to access each table once.
Now let's remove the DELETE CASCADE option from our foreign key definition and see if
there are any differences. In order to do this we'll need to drop and recreate the foreign key
without the DELETE CASCADE option. Here are the SQL statements to make this change.
ALTER TABLE [dbo].[Child] DROP CONSTRAINT [FK_Child_Parent]
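The recreate statement is not shown, but it would simply be the original definition from the start of this tip minus the ON DELETE CASCADE clause:
ALTER TABLE [dbo].[Child] WITH CHECK
ADD CONSTRAINT [FK_Child_Parent] FOREIGN KEY([ParentID])
REFERENCES [dbo].[Parent] ([ParentID])
GO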
Once the foreign key has been recreated we can run a second delete to see if there is any
difference in performance. One thing to note here is that without the DELETE CASCADE
option defined we need to run an additional delete statement to remove the records from
the Child table first. Here are the SQL statements to perform the delete.
DELETE FROM [dbo].[Child] where ParentID=62433
DELETE FROM [dbo].[Parent] where ParentID=62433
Looking at the explain plan for these statements we see that they are quite similar, the only
difference being that, because we are executing separate delete statements, the Child table
needs to be accessed a second time to check the foreign key constraint when deleting from
the Parent table.
Using the SQL Profiler results from each query we can confirm that this extra scan of the Child
table does indeed mean the DELETE CASCADE option performs better. We can see
below that the DELETE CASCADE option uses fewer resources in every category and runs
about 20% faster.
                    CPU   Reads   Writes   Duration
No Delete Cascade   344   28488            399
Delete Cascade      250   14249            312
Additional Information
You can't define a foreign key constraint that contains multiple cascade paths
Now let's do some denormalization by moving the ChildDetail table data into the Child
table. We'll first need to add the required columns to the Child table. Then before we can
migrate any data we'll need to remove the primary and foreign key constraints and once the
data is migrated we can recreate them. The following SQL statements perform these tasks.
ALTER TABLE [dbo].[Child] ADD [ChildDetailID] [bigint] NOT NULL DEFAULT 0,[ExtraDataColumn] [bigint]
GO
-- data migration (the original statement was truncated; an INSERT...SELECT
-- joining Child to ChildDetail along these lines is assumed)
INSERT INTO [dbo].[Child]
SELECT C.ChildID,C.ParentID,C.IntDataColumn,C.VarcharDataColumn,CD.ChildDetailID,CD.ExtraDataColumn
FROM [dbo].[Child] C INNER JOIN
[dbo].[ChildDetail] CD ON C.ChildID=CD.ChildID
Looking at the SQL Profiler results from these two queries we do see a pretty big benefit
from removing the join to the ChildDetail table. SQL Server performed fewer reads and the
total execution time also improved.
               CPU   Reads   Writes   Duration
Normalized     365   75
Denormalized   250
We should also take a look at how much extra space we are using, as this is important in
deciding whether or not to implement this type of change. The following SQL statement will
tell you the amount of disk space each of your tables is consuming.
SELECT o.name,SUM(reserved_page_count) * 8.0 / 1024 AS 'Size (MB)'
FROM sys.dm_db_partition_stats ddps INNER JOIN
sys.objects o ON ddps.object_id=o.object_id
WHERE o.name in ('Parent','Child','ChildDetail')
GROUP BY o.name
The following table shows the results of the above query for both the normalized and
denormalized table schemas. As we can see, the denormalized schema uses about
18MB more disk space. The only question now is whether the performance benefit is worth
the space this redundant data consumes.
Table         Normalized (MB)   Denormalized (MB)
Parent        5.9               5.9
Child         151.6             679.2
ChildDetail   509.6             N/A
Total         667.1             685.1
Additional Information