Writers: Eric N. Hanson, Kevin Farlee, Stefano Stefani, Shu Scott, Gopal Ashok, Torsten
Grabs, Sara Tahir, Joachim Hammer, Sunil Agarwal, T.K. Anand, Richard
Tkachuk, Catherine Chang, and Edward Melomed, Microsoft Corp.
Summary: With the 2008 release, SQL Server makes a major advance in scalability for
data warehousing. It meets the data warehouse needs of the largest enterprises more
easily than ever. SQL Server 2008 provides a range of integrated products that enable
you to build your data warehouse, and query and analyze its data. These include the
SQL Server relational database system, Analysis Services, Integration Services, and
Reporting Services. This paper introduces the new performance and manageability
features for data warehousing across all these components. All these features
contribute to improved scalability.
The information contained in this document represents the current view of Microsoft Corporation on the issues
discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it
should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the
accuracy of any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS,
IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under
copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or
transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or
for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights
covering subject matter in this document. Except as expressly provided in any written license agreement
from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks,
copyrights, or other intellectual property.
Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses,
logos, people, places and events depicted herein are fictitious, and no association with any real company,
organization, product, domain name, email address, logo, person, place or event is intended or should be
inferred.
Microsoft and SQL Server are either registered trademarks or trademarks of Microsoft Corporation in the United States
and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective
owners.
Table of Contents
Introduction
Map of New Data Warehousing Features
SQL Server Relational DBMS DW Improvements
    Star Join
    Partitioned Table Parallelism
    Partition-Aligned Indexed Views
    GROUPING SETS
    MERGE
    Change Data Capture
    Minimally Logged INSERT
    Data Compression
    Backup Compression
    Resource Governor
Integration Services Improvements
    Lookup Performance
    Pipeline Performance
Analysis Services Improvements
    MDX Query Performance: Block Computation
    Query and Writeback Performance
    Analysis Services Enhanced Backup
    Scalable Shared Database for AS
Reporting Services Improvements
    Reporting Scalability
    Server Scalability
Conclusion
References
An Introduction to New Data Warehouse Scalability Features in SQL Server 2008
Introduction
Microsoft® SQL Server™ 2008 provides a comprehensive data warehouse platform. It
enables you to build and manage your data warehouse, and deliver insight to your
users, with a single, integrated product suite. It scales to meet the needs of the largest
enterprises, in a way that empowers both your end users and your IT staff.
The number one focus of development in the SQL Server 2008 release was to improve
scalability across the entire product suite to comfortably meet the needs of large
enterprises. Here, we’ll introduce the features and enhancements we’ve added to
improve your data warehouse experience. Build. Manage. Deliver. SQL Server 2008 lets
you do it all, with ease.
Map of New Data Warehousing Features
• SQL Server relational DBMS: star join, partitioned table parallelism, partition-aligned indexed views, GROUPING SETS, MERGE, Change Data Capture, minimally logged INSERT, data compression, backup compression, Resource Governor
• Integration Services: Lookup performance, pipeline performance
• Analysis Services: MDX query performance (block computation), query and writeback performance, enhanced backup, scalable shared database
• Reporting Services: reporting scalability, server scalability
SQL Server Relational DBMS DW Improvements
Star Join
With dimensionally modeled data warehouses, a big part of your workload typically
consists of what are known as star join queries. These queries follow a common pattern
that joins the fact table with one or several dimension tables. In addition, star join
queries usually express filter conditions against the non-key columns of the dimension
tables and perform an aggregation (typically SUM) on a column of the fact table (called
a measure column). With SQL Server 2008, you will experience significant performance
improvements for many star join queries that process a significant fraction of fact table
rows.
The new technology employed is based on bitmap filters, also known as Bloom filters
(see Bloom filter, Wikipedia 2007, http://en.wikipedia.org/wiki/Bloom_filter). It allows
SQL Server to eliminate non-qualifying fact table rows from further processing early
during query evaluation. This saves a considerable amount of CPU time compared to
query processing technologies used by competing products. While your results may
vary, we’ve typically seen entire relational data warehouse query workloads experience
performance improvements of 15-25% when using the new star join query processing
capability. Some individual queries speed up by a factor of seven or more.
Figure 1: Star join query plan with join reduction processing for efficient DW
The new star join optimization uses a series of hash joins, building a hash table for each
dimension table that participates. As a byproduct of building this hash table, additional
information, called a bitmap filter, is built. Bitmap filters are represented as boxes in
Figure 1, labeled “Join Reduction Info.” These filters are pushed down into the scan on
the fact table, and effectively eliminate almost all the rows that would be eliminated
later by the joins. This eliminates the need to spend CPU time later copying the
eliminated rows and probing the hash tables for them. The illustration shows the effect
of this filtering within the fact table scan. The SQL Server 2008 query executor also re-
orders the bitmaps during execution, putting the most selective one first, then the next
most selective one, and so forth. This saves more CPU time, because once a fact table
row fails a check against a bitmap, the row is skipped.
The new star join optimization is available in Microsoft SQL Server 2008 Enterprise
Edition. The query processor in SQL Server applies the optimization automatically to
queries following the star join pattern when this is attractive in terms of estimated
query cost. You do not need to make any changes to your application to benefit from
this significant performance improvement.
Partition-Aligned Indexed Views
The following figure shows how aggregates move with base table partitions when
switching in a partition.
GROUPING SETS
GROUPING SETS allow you to write one query that produces multiple groupings and
returns a single result set. The result set is equivalent to a UNION ALL of differently
grouped rows. By using GROUPING SETS, you can focus on the different levels of
information (groupings) your business needs, rather than the mechanics of how to
combine several query results. GROUPING SETS enables you to write reports with
multiple groupings easily, with improved query performance.
In this simple but typical example, using the AdventureWorksDW sample database, you
may want to see the following aggregates for a specific reporting period:
• Total sales amount by quarter and country
• Total sales amount by quarter for all countries
• The grand total
To get this result without GROUPING SETS, you must either run multiple queries or, if
a single result set is desired, use UNION ALL to combine those queries. With GROUPING
SETS, your query can be expressed like this:
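One possible formulation, using illustrative AdventureWorksDW-style table and column names (the exact names may differ in your version of the sample database):

```sql
SELECT d.CalendarQuarter,
       t.SalesTerritoryCountry,
       SUM(f.SalesAmount) AS TotalSales
FROM FactInternetSales AS f
JOIN DimDate AS d
    ON f.OrderDateKey = d.DateKey
JOIN DimSalesTerritory AS t
    ON f.SalesTerritoryKey = t.SalesTerritoryKey
WHERE d.CalendarYear = 2004                        -- the reporting period
GROUP BY GROUPING SETS
(
    (d.CalendarQuarter, t.SalesTerritoryCountry),  -- by quarter and country
    (d.CalendarQuarter),                           -- by quarter, all countries
    ()                                             -- grand total
);
```

Columns that do not participate in a particular grouping set are returned as NULL in the rows produced by that set.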
The result contains one row per quarter and country (Canada, Mexico, USA), one row per
quarter across all countries, and a grand total row.
MERGE
The MERGE statement allows you to perform multiple Data Manipulation Language
(DML) operations (INSERT, UPDATE, and DELETE) on a table or view in a single
Transact-SQL statement. The target table or view is joined with a data source and the
DML operations are performed on the results of the join. The MERGE statement has
three WHEN clauses, each of which allows you to perform a specific DML action on a
given row in the result set:
• For every row that exists in both the target and the source, the WHEN MATCHED
clause allows you to UPDATE or DELETE the given row in the target table.
• For every row that exists in the source but not in the target, the WHEN
[TARGET] NOT MATCHED clause allows you to INSERT a row into the target.
• For every row that exists in the target but not in the source, the WHEN SOURCE
NOT MATCHED clause allows you to UPDATE or DELETE the given row in the
target table.
You can also specify a search condition with each of the WHEN clauses to choose which
type of DML operation should be performed on the row. The OUTPUT clause for the
MERGE statement includes a new virtual column called $action, which you can use to
identify the DML action that was performed on each row.
In the context of data warehousing, the MERGE statement is used to perform efficient
INSERT and UPDATE operations for Slowly Changing Dimensions (SCD) and to maintain
the fact table in various common scenarios. The MERGE statement has better
performance characteristics than running separate INSERT, UPDATE, and DELETE
statements since it only requires a single pass over the data.
SQL Server 2008 also includes a powerful extension to the INSERT statement that
allows it to consume rows returned by the OUTPUT clause of a nested INSERT, UPDATE,
DELETE, or MERGE statement.
Suppose you have a DimBook table (ISBN, Price, IsCurrent) that tracks the price
history and current price for each book in a bookstore. Price changes and new book
additions are made on a weekly basis. Every week a source table WeeklyChanges
(ISBN, Price) is generated and these changes are applied to the DimBook table. A
row is inserted for each new book. Existing books whose prices have changed during
the week are updated with IsCurrent=0 and a new row is inserted to reflect the new
price. The following single Transact-SQL statement performs these operations using the
new MERGE and INSERT capabilities.
INSERT INTO DimBook(ISBN, Price, IsCurrent)
SELECT ISBN, Price, 1
FROM
(
MERGE DimBook as book
USING WeeklyChanges AS src
ON (book.ISBN = src.ISBN and book.IsCurrent = 1)
WHEN MATCHED THEN
UPDATE SET book.IsCurrent = 0
WHEN NOT MATCHED THEN
INSERT VALUES (src.ISBN, src.Price, 1)
OUTPUT $action, src.ISBN, src.Price
) AS Changes(action, ISBN, Price)
WHERE action = 'UPDATE';
Change Data Capture
Change Data Capture (CDC) records INSERT, UPDATE, and DELETE activity on your
tables and exposes the changes through relational table-valued functions. This
provides you with a very efficient way to extract changes on an incremental basis,
reducing overall ETL processing time.
The following diagram provides an overview of the components that make up Change
Data Capture.
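Enabling CDC and reading changes takes only a few system procedure calls; the table name below is illustrative:

```sql
-- Enable CDC at the database level, then on the table to track
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'FactSales',
    @role_name     = NULL;

-- Later, pull all changes committed within an LSN range
DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn(N'dbo_FactSales');
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_FactSales(@from_lsn, @to_lsn, N'all');
```

An ETL process would persist the high-water-mark LSN after each extraction and use it as the starting point of the next incremental pull.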
Minimally Logged INSERT
Minimal logging improves the performance of large-scale INSERT operations by
reducing the number of log records to be written and the amount of log space required
to complete the operation. For a discussion of table requirements for minimal logging,
see SQL Server Books Online. In particular, you must use table locking (TABLOCK) on
the target table.
Operations that can be minimally logged in SQL 2005 include bulk import operations,
SELECT INTO, and index creation and rebuild. SQL 2008 extends the optimization to
INSERT INTO…SELECT FROM T-SQL operations that insert a large number of rows into
an existing target table when that table is a heap that has no nonclustered indexes, and
the TABLOCK hint is used on the target. The optimization works whether the target
table is empty or contains data.
A key scenario for using minimally logged INSERT is this: you create an empty table on
specific file groups, so you can control where the data is physically placed. Then you
use INSERT INTO…SELECT FROM to populate it, in a minimally logged fashion. This puts
the data where you want it, and only writes it to disk once. Once the data is loaded, you
can then create the required indexes. It is important to note that indexes themselves
can be created with minimal logging.
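A sketch of this load pattern; table, filegroup, and column names are illustrative:

```sql
-- Empty heap created on a specific filegroup, with no nonclustered indexes
CREATE TABLE dbo.FactSalesStage
(
    OrderKey    bigint NOT NULL,
    ProductKey  int    NOT NULL,
    SalesAmount money  NOT NULL
) ON FG_Current;

-- Heap target + TABLOCK hint: the insert qualifies for minimal logging
INSERT INTO dbo.FactSalesStage WITH (TABLOCK)
SELECT OrderKey, ProductKey, SalesAmount
FROM dbo.SalesStagingSource;

-- Build indexes after the load; index creation can itself be minimally logged
CREATE CLUSTERED INDEX CIX_FactSalesStage
    ON dbo.FactSalesStage (OrderKey);
```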
Data Compression
The new data compression feature in SQL Server 2008 reduces the size of tables,
indexes or a subset of their partitions by storing fixed-length data types in variable
length storage format and by reducing the redundant data. The space savings achieved
depends on the schema and the data distribution. Based on our testing with various
data warehouse databases, we have seen a reduction in the size of real user databases
up to 87% (a 7 to 1 compression ratio) but more commonly you should expect a
reduction in the range of 50-70% (a compression ratio between roughly 2 to 1 and 3
to 1).
SQL Server provides two types of compression as follows:
• ROW compression enables storing fixed-length types in variable-length storage
format. For example, a column of data type BIGINT takes 8 bytes of storage in
fixed format; when compressed, it takes a variable number of bytes, anywhere
from 0 to 8. Since column values are stored as variable length, an additional
4-bit length code is stored for each field within the row. Additionally, zero and
NULL values take no storage beyond the 4-bit code.
• PAGE compression is built on top of ROW compression. It minimizes storage of
redundant data on the page by storing commonly occurring byte patterns on the
page once and then referencing these values for respective columns. The byte
pattern recognition is type-independent. Under PAGE compression, SQL Server
optimizes space on a page using two techniques.
The first technique is column prefix. In this case, the system looks for a common byte
pattern as a prefix for all values of a specific column across rows on the page. This
process is repeated for all the columns in the table or index. The column prefix values
that are computed are stored as an anchor record on the page and the data or index
rows refer to the anchor record for the common prefix, if available, for each column.
The second technique is a page-level dictionary. The dictionary stores values that occur
commonly across columns and rows on the page, and the columns are then modified to
refer to the dictionary entry.
Compression comes with additional CPU cost. This overhead is paid when you query or
execute DML operations on compressed data. The relative CPU overhead with ROW is
less than for PAGE, but PAGE compression can provide better compression. Since there
are many kinds of workloads and data patterns, SQL Server exposes compression
granularity at a partition level. You can choose to compress the whole table or index or
a subset of partitions. For example, in a DW workload, if CPU is the dominant cost in
your workload but you want to save some disk space, you may want to enable PAGE
compression on partitions that are not accessed frequently while not compressing the
current partition(s) that are accessed and manipulated more frequently. This reduces
the total CPU cost, at a small increase in disk space requirements. If I/O cost is
dominant for your workload, or you need to reduce disk space costs, compressing all
data using PAGE compression may be the best choice. Compression can give many-fold
speedups if it causes your working set of frequently touched pages to be cached in the
main memory buffer pool, when it does not otherwise fit in memory. Preliminary
performance results on one large-scale internal DW query performance benchmark used
to test SQL Server 2008 show a 58% disk savings, an average 15% reduction in query
runtime, and an average 20% increase in CPU cost. Some queries sped up by a
factor of up to seven. Your results depend on your workload, database, and hardware.
The commands to compress data are exposed as options in CREATE/ALTER DDL
statements and support both ONLINE and OFFLINE mode. Additionally, a stored
procedure is provided to help you estimate the space savings prior to actual
compression.
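For example, using illustrative object names (sp_estimate_data_compression_savings is the estimation procedure that ships with SQL Server 2008):

```sql
-- Estimate the space savings before committing to compression
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'FactSales',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- Compress only a historical partition with PAGE compression
ALTER TABLE dbo.FactSales
REBUILD PARTITION = 1
WITH (DATA_COMPRESSION = PAGE);

-- Or compress the whole table with the cheaper ROW compression
ALTER TABLE dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = ROW);
```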
Backup Compression
Backup compression helps you to save in multiple ways.
By reducing the size of your SQL backups, you save significantly on disk media for your
SQL backups. While all compression results depend on the nature of the data being
compressed, results of 50% are not uncommon, and greater compression is possible.
This enables you to use less storage for keeping your backups online, or to keep more
cycles of backups online using the same storage.
Backup compression also saves you time. Traditional SQL backups are almost entirely
limited by I/O performance. By reducing the I/O load of the backup process, we actually
speed up both backups and restores.
Of course, nothing is entirely free, and this reduction in space and time comes at the
expense of using CPU cycles. The good news here is that the savings in I/O time offsets
the increased use of CPU time, and you can control how much CPU is used by your
backups at the expense of the rest of your workload by taking advantage of the
Resource Governor.
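For example (the backup path is illustrative):

```sql
-- Compress a single backup
BACKUP DATABASE AdventureWorksDW
TO DISK = N'D:\Backups\AdventureWorksDW.bak'
WITH COMPRESSION;

-- Or make compression the server-wide default for all backups
EXEC sp_configure 'backup compression default', 1;
RECONFIGURE;
```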
Resource Governor
The new Resource Governor in SQL Server 2008 enables you to control the amount of
CPU and memory resources allocated to different parts of your relational database
workload. It can be used to prevent runaway queries (that deny resources to others)
and to reserve resources for important parts of your workload. SQL Server 2005
resource allocation policies treat all workloads equally, and allocate shared resources
(for example, CPU bandwidth, and memory) as they are requested. This sometimes
causes a disproportionate distribution of resources, which in turn results in uneven
performance or unexpected slowdowns.
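A minimal sketch of such a policy; the pool, group, and login names are illustrative, and the classifier function must be created in the master database:

```sql
-- Cap ad hoc reporting sessions at 30% of CPU
CREATE RESOURCE POOL ReportPool WITH (MAX_CPU_PERCENT = 30);
CREATE WORKLOAD GROUP ReportGroup USING ReportPool;
GO

-- The classifier runs at login time and routes each session to a workload group
CREATE FUNCTION dbo.rg_classifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    IF SUSER_SNAME() = N'report_user'
        RETURN N'ReportGroup';
    RETURN N'default';
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```

Sessions that the classifier does not match fall into the default group and share resources as in SQL Server 2005.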
Microsoft Corporation ©2007
Integration Services Improvements
Lookup Performance
The Lookup component in SSIS runs faster, and is even easier to program than in SQL
Server 2005. A lookup tests whether each row in a stream of rows has a matching row
in another dataset. A lookup is like a database join operation. Typically you use lookup
within an integration process, such as the ETL layer that populates a data warehouse
from source systems.
A lookup builds a cache of retrieved rows pulled from the dataset being probed. In SQL
Server 2005, the Lookup component could only get data from specific OleDb
connections, and the cache could be populated only by using a SQL query. In SQL
Server 2008, the new version of Lookup allows you to populate the cache using a
separate pipeline in the same package or a different package. You can use source data
from just about anywhere.
SQL Server 2005 reloads the cache every time it is used. For example, if you have two
pipelines in the same package that each require the same reference dataset, each
Lookup component would cache its own copy. In SQL Server 2008, you can save the
cache to virtual memory or permanent file storage. This means that within the same
package, multiple Lookup components can share the same cache. You can save the
cache to a file and share it with other packages. The cache file format is optimized for
speed, and access to it can be orders of magnitude faster than reloading the reference
dataset from the original relational source.
In SQL Server 2008, the Lookup component introduces the miss-cache feature. When
the component is configured to perform lookups directly against the database, the
miss-cache feature saves time by optionally loading into cache the key values that have
no matching entries in the reference dataset. For example, if the component receives
the value 123 in the incoming pipeline, but the Lookup component already knows that
there are no matching entries in the reference dataset, the component will not try again
to find 123 in the reference dataset. This reduces a redundant and expensive trip to the
database. The miss-cache feature alone can contribute up to a 40% performance
improvement in some scenarios.
Other enhancements to the Lookup component include:
• Optimized I/O routines leading to faster cache loading and lookup operations.
• More intuitive user interface that simplifies the configuration of the Lookup
component, in particular the caching options.
• Rows in the input that do not match at least one entry in the reference dataset
are now sent to the No Match output. The Error output only handles errors such
as truncations.
• Query statements in lookup transformations can be changed at runtime, making
programming transformations more flexible.
• Informational and error messages are improved to help with troubleshooting and
performance analysis.
The following figure illustrates a scenario that uses the new Lookup.
Pipeline Performance
In SQL Server 2008 SSIS, several threads can work together to do the work that a
single thread is forced to do by itself in SQL Server 2005 SSIS. This can give you a
several-fold speedup in ETL performance.
In SQL Server 2005 SSIS, pipeline parallelism is more coarse-grained. When users have
a simple package with one or two execution trees, there are only one or two processors
used, and the package might not benefit from a multiprocessor machine with more than
a few processors. Even if users logically split the data flow by using multicast, all output
paths of a multicast belong to the same execution tree, and they are executed serially
by the SQL Server 2005 SSIS data flow task.
Pipelines in SQL Server 2008 SSIS allow more fine-grained parallel processing, so on a
multiprocessor machine this should result in faster performance.
By using a shared thread pool, multiple outputs of a multicast can be executed
simultaneously. In short, the multicast gains the ability to have an active buffer (and an
active thread) on each output, rather than a single buffer that is passed to each output
in turn. You no longer need the “Union All” trick as a workaround to introduce more
parallelism.
For example, suppose you have a flow that contains a multicast with four outputs. Each
output flows into an aggregate. In SQL Server 2005 SSIS, only one of the aggregates is
processed at a time. In SQL Server 2008 SSIS, all four aggregates can be processed in
parallel.
The following figure shows how the enhanced SQL Server 2008 pipeline parallelism
works.
Analysis Services Improvements
MDX Query Performance: Block Computation
Sales is a base measure, so we simply obtain the storage engine data to fill the two
spaces at the leaves, and then work up the tree, applying the operator to fill the space
at the root. Hence the one row (Product3, 2004, 3) and the two rows { (Product3,
2005, 20), (Product6, 2005, 5)} are retrieved, and the + operator applied to them to
yield the result.
Figure 9: Block computation example that avoids doing work for NULL cells
The + operator operates on spaces, not simply scalar values. It is responsible for
combining the two given spaces to produce a space that contains each product that
appears in either space, with the summed value.
We only operate on data that could contribute to the result. There is no notion of the
complete space over which we must perform the calculation.
Query and Writeback Performance
In SQL Server 2008 Analysis Services, writeback data can be stored in a MOLAP
partition. Querying writeback data from the compressed MOLAP format is much faster
than querying the relational data source. Hence, MOLAP writeback partitions have
better query performance than ROLAP. The extent of the performance improvement varies and
depends on a number of factors including the volume of writeback data and the nature
of the query.
MOLAP writeback partitions should also improve cell writeback performance since the
server internally sends queries to compute the writeback deltas and these queries
probably access the writeback partition. Note that the writeback transaction commit can
be a bit slower since the server must update the MOLAP partition data in addition to the
writeback table, but this should be insignificant compared with the other performance
gains.
Reporting Services Improvements
Reporting Scalability
The SQL Server 2008 Reporting Services reporting engine has had a major upgrade
from the prior release, so that it can render much larger reports than it could before.
Although this is not specifically a data warehousing improvement (it is useful in
operational reporting too), it is useful in some data warehousing scenarios. If you
create reports with hundreds or thousands of pages, SQL Server 2008 Reporting
Services helps you to render the reports faster. Moreover, the size of the largest report
that can be rendered has been increased dramatically, given the same hardware
configuration.
Server Scalability
SQL Server 2008 Reporting Services does not run inside Internet Information Services
(IIS). It manages its own memory and has its own memory limits. This allows you
to configure the memory settings so SSRS can run on the same computer more
effectively with other services, such as SQL Server.
Conclusion
SQL Server gives you everything you need for data warehousing. With the 2008
release, it scales to meet the needs of the largest enterprises more readily than ever. As
illustrated by the many data warehouse scale enhancements introduced in this paper, it
is a major advance over previous releases. The number one change you’ll see is
improved scalability across the board, for data warehouse construction, relational query
processing, reporting, and analysis.
References
Bloom filter, Wikipedia 2007, http://en.wikipedia.org/wiki/Bloom_filter.
Hanson, Eric. Improving Performance with SQL Server 2005 Indexed Views,
http://www.microsoft.com/technet/prodtechnol/sql/2005/impprfiv.mspx.