Writers: Eric N. Hanson, Kevin Farlee, Stefano Stefani, Shu Scott, Gopal Ashok, Torsten
Grabs, Sara Tahir, Joachim Hammer, Sunil Agarwal, T.K. Anand, Richard
Tkachuk, Catherine Chang, and Edward Melomed, Microsoft Corp.
Summary: With the 2008 release, SQL Server makes a major advance in scalability for
data warehousing. It meets the data warehouse needs of the largest enterprises more
easily than ever. SQL Server 2008 provides a range of integrated products that enable
you to build your data warehouse, and query and analyze its data. These include the
SQL Server relational database system, Analysis Services, Integration Services, and
Reporting Services. This paper introduces the new performance and manageability
features for data warehousing across all these components. All these features
contribute to improved scalability.
The information contained in this document represents the current view of Microsoft Corporation on the issues
discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it
should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the
accuracy of any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS,
IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under
copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or
transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or
for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights
covering subject matter in this document. Except as expressly provided in any written license agreement
from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks,
copyrights, or other intellectual property.
Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses,
logos, people, places and events depicted herein are fictitious, and no association with any real company,
organization, product, domain name, email address, logo, person, place or event is intended or should be
inferred.
Microsoft and SQL Server are either registered trademarks or trademarks of Microsoft Corporation in the United States
and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective
owners.
Table of Contents
Introduction
Map of New Data Warehousing Features
SQL Server Relational DBMS DW Improvements
    Star Join
    Partitioned Table Parallelism
    Partition-Aligned Indexed Views
    GROUPING SETS
    MERGE
    Change Data Capture
    Minimally Logged INSERT
    Data Compression
    Backup Compression
    Resource Governor
Integration Services Improvements
    Lookup Performance
    Pipeline Performance
Analysis Services Improvements
    MDX Query Performance: Block Computation
    Query and Writeback Performance
    Analysis Services Enhanced Backup
    Scalable Shared Database for AS
Reporting Services Improvements
    Reporting Scalability
    Server Scalability
Conclusion
References
An Introduction to New Data Warehouse Scalability Features in SQL Server 2008
Introduction
Microsoft® SQL Server™ 2008 provides a comprehensive data warehouse platform. It
enables you to build and manage your data warehouse, and deliver insight to your
users, with a single, integrated product suite. It scales to meet the needs of the largest
enterprises, in a way that empowers both your end users and your IT staff.
The number one focus of development in the SQL Server 2008 release was to improve
scalability across the entire product suite to comfortably meet the needs of large
enterprises. Here, we’ll introduce the features and enhancements we’ve added to
improve your data warehouse experience. Build. Manage. Deliver. SQL Server 2008 lets
you do it all, with ease.
Map of New Data Warehousing Features
• SQL Server relational DBMS: star join, partitioned table parallelism, partition-aligned indexed views, GROUPING SETS, MERGE, Change Data Capture, minimally logged INSERT, data compression, backup compression, Resource Governor
• Integration Services: Lookup performance, pipeline performance
• Analysis Services: MDX query performance (block computation), query and writeback performance, enhanced backup, scalable shared database
• Reporting Services: reporting scalability, server scalability
SQL Server Relational DBMS DW Improvements
Star Join
With dimensionally modeled data warehouses, a big part of your workload typically
consists of what are known as star join queries. These queries follow a common pattern
that joins the fact table with one or several dimension tables. In addition, star join
queries usually express filter conditions against the non-key columns of the dimension
tables and perform an aggregation (typically SUM) on a column of the fact table (called
a measure column). With SQL Server 2008, you will experience significant performance
improvements for many star join queries that process a significant fraction of fact table
rows.
The new technology employed is based on bitmap filters, also known as Bloom filters
(see Bloom filter, Wikipedia 2007, http://en.wikipedia.org/wiki/Bloom_filter). It allows
SQL Server to eliminate non-qualifying fact table rows from further processing early
during query evaluation. This saves a considerable amount of CPU time compared to
query processing technologies used by competing products. While your results may
vary, we’ve typically seen entire relational data warehouse query workloads experience
performance improvements of 15-25% when using the new star join query processing
capability. Some individual queries speed up by a factor of seven or more.
Figure 1: Star join query plan with join reduction processing for efficient DW
The new star join optimization uses a series of hash joins, building a hash table for each
dimension table that participates. As a byproduct of building this hash table, additional
information, called a bitmap filter, is built. Bitmap filters are represented as boxes in
Figure 1, labeled “Join Reduction Info.” These filters are pushed down into the scan on
the fact table, and effectively eliminate almost all the rows that would be eliminated
later by the joins. This eliminates the need to spend CPU time later copying the
eliminated rows and probing the hash tables for them. The illustration shows the effect
of this filtering within the fact table scan. The SQL Server 2008 query executor also re-
orders the bitmaps during execution, putting the most selective one first, then the next
most selective one, and so forth. This saves more CPU time, because once a fact table
row fails a check against a bitmap, the row is skipped.
The new star join optimization is available in Microsoft SQL Server 2008 Enterprise
Edition. The query processor in SQL Server applies the optimization automatically to
queries following the star join pattern when this is attractive in terms of estimated
query cost. You do not need to make any changes to your application to benefit from
this significant performance improvement.
Partition-Aligned Indexed Views
The following figure shows how aggregates move with base table partitions when
switching in a partition.
GROUPING SETS
GROUPING SETS allow you to write one query that produces multiple groupings and
returns a single result set. The result set is equivalent to a UNION ALL of differently
grouped rows. By using GROUPING SETS, you can focus on the different levels of
information (groupings) your business needs, rather than the mechanics of how to
combine several query results. GROUPING SETS enables you to write reports with
multiple groupings easily, with improved query performance.
In this simple but typical example, using the AdventureWorksDW sample database, you
may want to see the following aggregates for a specific reporting period:
• Total sales amount by quarter and country
• Total sales amount by quarter for all countries
• The grand total
To get this result without GROUPING SETS, you must either run multiple queries or, if
a single result set is desired, use UNION ALL to combine those queries. With GROUPING
SETS, your query can be expressed like this:
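One possible formulation, using illustrative AdventureWorksDW-style table and column names (the exact names may differ in your version of the sample database):

```sql
SELECT d.CalendarQuarter,
       t.SalesTerritoryCountry,
       SUM(f.SalesAmount) AS TotalSales
FROM FactInternetSales AS f
JOIN DimDate AS d
    ON f.OrderDateKey = d.DateKey
JOIN DimSalesTerritory AS t
    ON f.SalesTerritoryKey = t.SalesTerritoryKey
WHERE d.CalendarYear = 2004                        -- the reporting period
GROUP BY GROUPING SETS
(
    (d.CalendarQuarter, t.SalesTerritoryCountry),  -- by quarter and country
    (d.CalendarQuarter),                           -- by quarter, all countries
    ()                                             -- grand total
);
```

Columns that do not participate in a particular grouping set are returned as NULL in the rows produced by that set.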
The result contains one row per quarter and country (Canada, Mexico, USA), one row per
quarter across all countries, and a grand total row.
MERGE
The MERGE statement allows you to perform multiple Data Manipulation Language
(DML) operations (INSERT, UPDATE, and DELETE) on a table or view in a single
Transact-SQL statement. The target table or view is joined with a data source and the
DML operations are performed on the results of the join. The MERGE statement has
three WHEN clauses, each of which allows you to perform a specific DML action on a
given row in the result set:
• For every row that exists in both the target and the source, the WHEN MATCHED
clause allows you to UPDATE or DELETE the given row in the target table.
• For every row that exists in the source but not in the target, the WHEN
[TARGET] NOT MATCHED clause allows you to INSERT a row into the target.
• For every row that exists in the target but not in the source, the WHEN SOURCE
NOT MATCHED clause allows you to UPDATE or DELETE the given row in the
target table.
You can also specify a search condition with each of the WHEN clauses to choose which
type of DML operation should be performed on the row. The OUTPUT clause for the
MERGE statement includes a new virtual column called $action, which you can use to
identify the DML action that was performed on each row.
In the context of data warehousing, the MERGE statement is used to perform efficient
INSERT and UPDATE operations for Slowly Changing Dimensions (SCD) and to maintain
the fact table in various common scenarios. The MERGE statement has better
performance characteristics than running separate INSERT, UPDATE, and DELETE
statements since it only requires a single pass over the data.
SQL Server 2008 also includes a powerful extension to the INSERT statement that
allows it to consume rows returned by the OUTPUT clause of a nested INSERT, UPDATE,
DELETE, or MERGE statement.
Suppose you have a DimBook table (ISBN, Price, IsCurrent) that tracks the price
history and current price for each book in a bookstore. Price changes and new book
additions are made on a weekly basis. Every week a source table WeeklyChanges
(ISBN, Price) is generated and these changes are applied to the DimBook table. A
row is inserted for each new book. Existing books whose prices have changed during
the week are updated with IsCurrent=0 and a new row is inserted to reflect the new
price. The following single Transact-SQL statement performs these operations using the
new MERGE and INSERT capabilities.
INSERT INTO DimBook(ISBN, Price, IsCurrent)
SELECT ISBN, Price, 1
FROM
(
MERGE DimBook as book
USING WeeklyChanges AS src
ON (book.ISBN = src.ISBN and book.IsCurrent = 1)
WHEN MATCHED THEN
UPDATE SET book.IsCurrent = 0
WHEN NOT MATCHED THEN
INSERT VALUES (src.ISBN, src.Price, 1)
OUTPUT $action, src.ISBN, src.Price
) AS Changes(action, ISBN, Price)
WHERE action = 'UPDATE';
Change Data Capture
Change Data Capture (CDC) records INSERT, UPDATE, and DELETE activity on your
tables and exposes the changes through relational table-valued functions. This
provides you with a very efficient way to extract changes on an incremental basis,
reducing overall ETL processing time.
The following diagram provides an overview of the components that make up Change
Data Capture.
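Enabling CDC and reading changes takes only a few system procedure calls; the table name below is illustrative:

```sql
-- Enable CDC at the database level, then on the table to track
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'FactSales',
    @role_name     = NULL;

-- Later, pull all changes committed within an LSN range
DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn(N'dbo_FactSales');
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_FactSales(@from_lsn, @to_lsn, N'all');
```

An ETL process would persist the high-water-mark LSN after each extraction and use it as the starting point of the next incremental pull.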
Minimally Logged INSERT
Minimal logging improves the performance of large-scale INSERT operations by
reducing the number of log records to be written and the amount of log space required
to complete the operation. For a discussion of table requirements for minimal logging,
see SQL Server Books Online. In particular, you must use table locking (TABLOCK) on
the target table.
Operations that can be minimally logged in SQL 2005 include bulk import operations,
SELECT INTO, and index creation and rebuild. SQL 2008 extends the optimization to
INSERT INTO…SELECT FROM T-SQL operations that insert a large number of rows into
an existing target table when that table is a heap that has no nonclustered indexes, and
the TABLOCK hint is used on the target. The optimization works whether the target
table is empty or contains data.
A key scenario for using minimally logged INSERT is this: you create an empty table on
specific file groups, so you can control where the data is physically placed. Then you
use INSERT INTO…SELECT FROM to populate it, in a minimally logged fashion. This puts
the data where you want it, and only writes it to disk once. Once the data is loaded, you
can then create the required indexes. It is important to note that indexes themselves
can be created with minimal logging.
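A sketch of this load pattern; table, filegroup, and column names are illustrative:

```sql
-- Empty heap created on a specific filegroup, with no nonclustered indexes
CREATE TABLE dbo.FactSalesStage
(
    OrderKey    bigint NOT NULL,
    ProductKey  int    NOT NULL,
    SalesAmount money  NOT NULL
) ON FG_Current;

-- Heap target + TABLOCK hint: the insert qualifies for minimal logging
INSERT INTO dbo.FactSalesStage WITH (TABLOCK)
SELECT OrderKey, ProductKey, SalesAmount
FROM dbo.SalesStagingSource;

-- Build indexes after the load; index creation can itself be minimally logged
CREATE CLUSTERED INDEX CIX_FactSalesStage
    ON dbo.FactSalesStage (OrderKey);
```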
Data Compression
The new data compression feature in SQL Server 2008 reduces the size of tables,
indexes or a subset of their partitions by storing fixed-length data types in variable
length storage format and by reducing the redundant data. The space savings achieved
depends on the schema and the data distribution. Based on our testing with various
data warehouse databases, we have seen a reduction in the size of real user databases
up to 87% (a 7 to 1 compression ratio) but more commonly you should expect a
reduction in the range of 50-70% (a compression ratio between roughly 2 to 1 and 3
to 1).
SQL Server provides two types of compression as follows:
• ROW compression enables storing fixed-length types in variable-length storage
format. For example, a column of data type BIGINT takes 8 bytes of storage in
fixed format; when compressed, it takes a variable number of bytes, anywhere
from 0 to 8. Since column values are stored as variable length, an additional
4-bit length code is stored for each field within the row. Additionally, zero and
NULL values take no storage beyond the 4-bit code.
• PAGE compression is built on top of ROW compression. It minimizes storage of
redundant data on the page by storing commonly occurring byte patterns on the
page once and then referencing these values for respective columns. The byte
pattern recognition is type-independent. Under PAGE compression, SQL Server
optimizes space on a page using two techniques.
The first technique is column prefix. In this case, the system looks for a common byte
pattern as a prefix for all values of a specific column across rows on the page. This
process is repeated for all the columns in the table or index. The column prefix values
that are computed are stored as an anchor record on the page and the data or index
rows refer to the anchor record for the common prefix, if available, for each column.
The second technique is a page-level dictionary. The dictionary stores values that occur
commonly across columns and rows on the page, and the columns are then modified to
refer to the dictionary entry.
Compression comes with additional CPU cost. This overhead is paid when you query or
execute DML operations on compressed data. The relative CPU overhead with ROW is
less than for PAGE, but PAGE compression can provide better compression. Since there
are many kinds of workloads and data patterns, SQL Server exposes compression
granularity at a partition level. You can choose to compress the whole table or index or
a subset of partitions. For example, in a DW workload, if CPU is the dominant cost in
your workload but you want to save some disk space, you may want to enable PAGE
compression on partitions that are not accessed frequently while not compressing the
current partition(s) that are accessed and manipulated more frequently. This reduces
the total CPU cost, at a small increase in disk space requirements. If I/O cost is
dominant for your workload, or you need to reduce disk space costs, compressing all
data using PAGE compression may be the best choice. Compression can give many-fold
speedups if it causes your working set of frequently touched pages to be cached in the
main memory buffer pool, when it does not otherwise fit in memory. Preliminary
performance results on one large-scale internal DW query performance benchmark used
to test SQL Server 2008 show a 58% disk savings, an average 15% reduction in query
runtime, and an average 20% increase in CPU cost. Some queries sped up by a
factor of up to seven. Your results depend on your workload, database, and hardware.
The commands to compress data are exposed as options in CREATE/ALTER DDL
statements and support both ONLINE and OFFLINE mode. Additionally, a stored
procedure is provided to help you estimate the space savings prior to actual
compression.
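For example, using illustrative object names (sp_estimate_data_compression_savings is the estimation procedure that ships with SQL Server 2008):

```sql
-- Estimate the space savings before committing to compression
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'FactSales',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- Compress only a historical partition with PAGE compression
ALTER TABLE dbo.FactSales
REBUILD PARTITION = 1
WITH (DATA_COMPRESSION = PAGE);

-- Or compress the whole table with the cheaper ROW compression
ALTER TABLE dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = ROW);
```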
Backup Compression
Backup compression helps you to save in multiple ways.
By reducing the size of your SQL backups, you save significantly on disk media for your
SQL backups. While all compression results depend on the nature of the data being
compressed, results of 50% are not uncommon, and greater compression is possible.
This enables you to use less storage for keeping your backups online, or to keep more
cycles of backups online using the same storage.
Backup compression also saves you time. Traditional SQL backups are almost entirely
limited by I/O performance. By reducing the I/O load of the backup process, we actually
speed up both backups and restores.
Of course, nothing is entirely free, and this reduction in space and time comes at the
expense of using CPU cycles. The good news here is that the savings in I/O time offsets
the increased use of CPU time, and you can control how much CPU is used by your
backups at the expense of the rest of your workload by taking advantage of the
Resource Governor.
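For example (the backup path is illustrative):

```sql
-- Compress a single backup
BACKUP DATABASE AdventureWorksDW
TO DISK = N'D:\Backups\AdventureWorksDW.bak'
WITH COMPRESSION;

-- Or make compression the server-wide default for all backups
EXEC sp_configure 'backup compression default', 1;
RECONFIGURE;
```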
Resource Governor
The new Resource Governor in SQL Server 2008 enables you to control the amount of
CPU and memory resources allocated to different parts of your relational database
workload. It can be used to prevent runaway queries (that deny resources to others)
and to reserve resources for important parts of your workload. SQL Server 2005
resource allocation policies treat all workloads equally, and allocate shared resources
(for example, CPU bandwidth, and memory) as they are requested. This sometimes
causes a disproportionate distribution of resources, which in turn results in uneven
performance or unexpected slowdowns.
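A minimal sketch of such a policy; the pool, group, and login names are illustrative, and the classifier function must be created in the master database:

```sql
-- Cap ad hoc reporting sessions at 30% of CPU
CREATE RESOURCE POOL ReportPool WITH (MAX_CPU_PERCENT = 30);
CREATE WORKLOAD GROUP ReportGroup USING ReportPool;
GO

-- The classifier runs at login time and routes each session to a workload group
CREATE FUNCTION dbo.rg_classifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    IF SUSER_SNAME() = N'report_user'
        RETURN N'ReportGroup';
    RETURN N'default';
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```

Sessions that the classifier does not match fall into the default group and share resources as in SQL Server 2005.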
Microsoft Corporation ©2007
Integration Services Improvements
Lookup Performance
The Lookup component in SSIS runs faster, and is even easier to program than in SQL
Server 2005. A lookup tests whether each row in a stream of rows has a matching row
in another dataset. A lookup is like a database join operation. Typically you use lookup
within an integration process, such as the ETL layer that populates a data warehouse
from source systems.
A lookup builds a cache of retrieved rows pulled from the dataset being probed. In SQL
Server 2005, the Lookup component could only get data from specific OleDb
connections, and the cache could be populated only by using a SQL query. In SQL
Server 2008, the new version of Lookup allows you to populate the cache using a
separate pipeline in the same package or a different package. You can use source data
from just about anywhere.
SQL Server 2005 reloads the cache every time it is used. For example, if you have two
pipelines in the same package that each require the same reference dataset, each
Lookup component would cache its own copy. In SQL Server 2008, you can save the
cache to virtual memory or permanent file storage. This means that within the same
package, multiple Lookup components can share the same cache. You can save the
cache to a file and share it with other packages. The cache file format is optimized for
speed, and access to it can be orders of magnitude faster than reloading the reference
dataset from the original relational source.
In SQL Server 2008, the Lookup component introduces the miss-cache feature. When
the component is configured to perform lookups directly against the database, the
miss-cache feature saves time by optionally loading into cache the key values that have
no matching entries in the reference dataset. For example, if the component receives
the value 123 in the incoming pipeline, but the Lookup component already knows that
there are no matching entries in the reference dataset, the component will not try again
to find 123 in the reference dataset. This reduces a redundant and expensive trip to the
database. The miss-cache feature alone can contribute up to a 40% performance
improvement in some scenarios.
Other enhancements to the Lookup component include:
• Optimized I/O routines leading to faster cache loading and lookup operations.
• More intuitive user interface that simplifies the configuration of the Lookup
component, in particular the caching options.
• Rows in the input that do not match at least one entry in the reference dataset
are now sent to the No Match output. The Error output only handles errors such
as truncations.
• Query statements in lookup transformations can be changed at runtime, making
programming transformations more flexible.
• Informational and error messages are improved to help with troubleshooting and
performance analysis.
The following figure illustrates a scenario that uses the new Lookup.
Pipeline Performance
In SQL Server 2008 SSIS, several threads can work together to do the work that a
single thread is forced to do by itself in SQL Server 2005 SSIS. This can give you a
several-fold speedup in ETL performance.
In SQL Server 2005 SSIS, pipeline parallelism is more coarse-grained. When users have
a simple package with one or two execution trees, there are only one or two processors
used, and the package might not benefit from a multiprocessor machine with more than
a few processors. Even if users logically split the data flow by using multicast, all output
paths of a multicast belong to the same execution tree, and they are executed serially
by the SQL Server 2005 SSIS data flow task.
Pipelines in SQL Server 2008 SSIS allow more fine-grained parallel processing, so on a
multiprocessor machine this should result in faster performance.
By using a shared thread pool, multiple outputs of a multicast can be executed
simultaneously. In short, the multicast gains the ability to have an active buffer (and an
active thread) on each output, rather than a single buffer that is passed to each output
in turn. You no longer need the “Union All” trick as a workaround to introduce more
parallelism.
For example, suppose you have a flow that contains a multicast with four outputs. Each
output flows into an aggregate. In SQL Server 2005 SSIS, only one of the aggregates is
processed at a time. In SQL Server 2008 SSIS, all four aggregates can be processed in
parallel.
The following figure shows how the enhanced SQL Server 2008 pipeline parallelism
works.
Analysis Services Improvements
MDX Query Performance: Block Computation
Sales is a base measure, so we simply obtain the storage engine data to fill the two
spaces at the leaves, and then work up the tree, applying the operator to fill the space
at the root. Hence the one row (Product3, 2004, 3) and the two rows { (Product3,
2005, 20), (Product6, 2005, 5)} are retrieved, and the + operator applied to them to
yield the result.
Figure 9: Block computation example that avoids doing work for NULL cells
The + operator operates on spaces, not simply scalar values. It is responsible for
combining the two given spaces to produce a space that contains each product that
appears in either space, with the summed value.
We only operate on data that could contribute to the result. There is no notion of the
complete space over which we must perform the calculation.
Query and Writeback Performance
In SQL Server 2008 Analysis Services, writeback data can be stored in a MOLAP
partition. Querying writeback data from the compressed MOLAP format is much faster
than querying the relational data source. Hence, MOLAP writeback partitions have
better query performance than ROLAP. The extent of the performance improvement varies and
depends on a number of factors including the volume of writeback data and the nature
of the query.
MOLAP writeback partitions should also improve cell writeback performance since the
server internally sends queries to compute the writeback deltas and these queries
probably access the writeback partition. Note that the writeback transaction commit can
be a bit slower since the server must update the MOLAP partition data in addition to the
writeback table, but this should be insignificant compared with the other performance
gains.
Reporting Services Improvements
Reporting Scalability
The SQL Server 2008 Reporting Services reporting engine has had a major upgrade
from the prior release, so that it can render much larger reports than it could before.
Although this is not specifically a data warehousing improvement (it is useful in
operational reporting too), it is useful in some data warehousing scenarios. If you
create reports with hundreds or thousands of pages, SQL Server 2008 Reporting
Services helps you to render the reports faster. Moreover, the size of the largest report
that can be rendered has been increased dramatically, given the same hardware
configuration.
Server Scalability
SQL Server 2008 Reporting Services does not run inside Internet Information Services
(IIS). It manages its own memory and has its own memory limits. This allows you
to configure the memory settings so SSRS can run on the same computer more
effectively with other services, such as SQL Server.
Conclusion
SQL Server gives you everything you need for data warehousing. With the 2008
release, it scales to meet the needs of the largest enterprises more readily than ever. As
illustrated by the many data warehouse scale enhancements introduced in this paper, it
is a major advance over previous releases. The number one change you’ll see is
improved scalability across the board, for data warehouse construction, relational query
processing, reporting, and analysis.
References
Bloom filter, Wikipedia 2007, http://en.wikipedia.org/wiki/Bloom_filter.
Hanson, Eric. Improving Performance with SQL Server 2005 Indexed Views,
http://www.microsoft.com/technet/prodtechnol/sql/2005/impprfiv.mspx.