What is De-normalization?
De-normalization is the process of attempting to optimize the performance of a database by
adding redundant data. It is sometimes necessary because current DBMSs implement the
relational model poorly. A true relational DBMS would allow for a fully normalized database
at the logical level, while providing physical storage of data that is tuned for high
performance. De-normalization is a technique to move from higher to lower normal forms of
database modeling in order to speed up database access.
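For illustration only (the table and column names below are hypothetical), a common form of de-normalization is copying a frequently read column into a child table so that routine queries no longer need a join:
-- Hypothetical example: Orders normally stores only CustomerID and joins to
-- Customers to get the name. Adding a redundant CustomerName column avoids
-- the join on every read, at the cost of keeping the copy in sync.
CREATE TABLE dbo.Customers
(
    CustomerID   int PRIMARY KEY,
    CustomerName varchar(100) NOT NULL
);
CREATE TABLE dbo.Orders
(
    OrderID      int PRIMARY KEY,
    CustomerID   int NOT NULL REFERENCES dbo.Customers(CustomerID),
    CustomerName varchar(100) NOT NULL   -- redundant, de-normalized copy
);
GO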
What are the basic functions for master, msdb, model, tempdb databases?
The master database contains the catalog and configuration data for all databases of the SQL Server instance; it holds the instance together, and SQL Server cannot start if the master database is unavailable.
The msdb database stores data for database backups, SQL Server Agent, DTS packages, SQL Server jobs, and log shipping.
The tempdb database holds temporary objects such as global and local temporary tables and temporary stored procedures.
The model database is a template that is used when creating a new user database.
1. A clustered index is a special type of index that reorders the way records in the table are physically stored; therefore, a table can have only one clustered index. The leaf nodes of a clustered index contain the data pages.
2. A non-clustered index is a special type of index in which the logical order of the index does not match the physical order of the rows on disk. The leaf nodes of a non-clustered index do not contain the data pages; instead, they contain index rows.
/* Create a nonclustered index (the UNIQUE keyword is optional) over the table */
CREATE UNIQUE NONCLUSTERED INDEX [IX_MyTable_NonClustered]
ON [dbo].[Table1]
(
    [First] ASC,
    [Second] ASC
) ON [PRIMARY]
GO
With this index in place, the query will use an Index Seek; before the index was created, the query used a Table Scan.
/* Create a clustered index over the table */
CREATE CLUSTERED INDEX [IX_MyTable_Clustered]
ON [dbo].[MyTable]
(
    [ID] ASC
) ON [PRIMARY]
GO
Note: Table fragmentation can occur whenever data modification operations (DML statements - INSERT, UPDATE, or DELETE) are performed. The DBCC DBREINDEX statement can be used to rebuild all the indexes on all the tables in a database, and it is more efficient than dropping and recreating the indexes. E.g. DBCC DBREINDEX (TableName, '', 80)
Query behavior differs depending on whether the table has:
- No indexes
- A clustered index
- A nonclustered index
Index Seek and Index Scan are operations shown in execution plans and are important for query tuning.
A Table Scan reads every row of the table, so its cost is proportional to the total number of rows in the table.
An Index Seek only touches the rows that qualify and the pages that contain those qualifying rows, so its cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.
The 'fill factor' option indicates how full SQL Server will make each index page when it creates the index. When an index page does not have enough free space to insert a new row, SQL Server creates a new index page and moves some rows from the old page to the new one. This process is called a page split.
If we want to reduce the number of page splits, we can use the fill factor option. Using fill factor, SQL Server reserves some free space on each index page.
The fill factor is a value from 1 through 100 that indicates the percentage of the index page to be filled with data (the rest is left as free space). The default value is 0, which is treated the same as 100.
If a table contains data that is not changed frequently, the fill factor can be set to 100. When the table's data is modified frequently, the fill factor can be set to 80 (or another value) to leave room for new rows.
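For example, a fill factor can be specified when an index is created or rebuilt (the table and index names below are only placeholders):
-- Create the index with 20% free space left on each leaf page.
CREATE NONCLUSTERED INDEX IX_MyTable_LastName
ON dbo.MyTable (LastName)
WITH (FILLFACTOR = 80);
-- The same option can be applied when rebuilding an existing index.
ALTER INDEX IX_MyTable_LastName ON dbo.MyTable
REBUILD WITH (FILLFACTOR = 80);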
TRUNCATE is faster and uses fewer system and transaction log resources than DELETE.
TRUNCATE removes the data by de-allocating the data pages used to store the table's data, and only the page de-allocations are recorded in the transaction log.
TRUNCATE removes all rows from a table, but the table structure, with its columns, constraints, indexes and so on, remains. The counter used by an identity column for new rows is reset to the seed for the column.
You cannot use TRUNCATE TABLE on a table referenced by a FOREIGN KEY constraint. Because TRUNCATE TABLE is not logged row by row, it cannot activate a trigger.
DELETE
DELETE removes rows one at a time and records an entry in the transaction log for each
deleted row.
If you want to retain the identity counter, use DELETE instead. If you want to remove the table definition along with its data, use the DROP TABLE statement.
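A minimal sketch of the difference, using a throwaway demo table:
-- Throwaway demo table with an identity column.
CREATE TABLE dbo.DemoRows (Id int IDENTITY(1,1) PRIMARY KEY, Val varchar(10));
INSERT INTO dbo.DemoRows (Val) VALUES ('a'), ('b'), ('c');
DELETE FROM dbo.DemoRows;                       -- fully logged, identity counter keeps its value
INSERT INTO dbo.DemoRows (Val) VALUES ('d');    -- gets Id = 4
TRUNCATE TABLE dbo.DemoRows;                    -- deallocates pages, identity reseeds to 1
INSERT INTO dbo.DemoRows (Val) VALUES ('e');    -- gets Id = 1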
If we drop a table, does it also drop related objects like constraints, indexes,
columns, defaults, Views and Stored Procedures?
YES, SQL Server drops all related objects that exist inside the table, such as constraints, indexes, columns and defaults. BUT dropping a table will not drop Views and Stored Procedures, as they exist outside the table.
What is a table called, if it has neither Cluster nor Non-cluster Index? What is
it used for?
An unindexed table, or heap. A heap is a table that does not have a clustered index and, therefore, its pages are not linked by pointers. It is often better to drop all indexes from a table, perform the bulk of the inserts, and then restore those indexes afterwards.
What is the use of the UPDATE_STATISTICS command?
UPDATE_STATISTICS is used after a large amount of data has been inserted, deleted, or modified, so that the statistics on the affected tables and indexes take these changes into account. UPDATE_STATISTICS refreshes the statistics on the indexes of these tables accordingly.
If duplicate records exist in a table by mistake, how do you delete the extra copies of each record?
WITH [CTE DUPLICATE] AS
(
    SELECT
        RN = ROW_NUMBER() OVER (PARTITION BY CompanyTitle ORDER BY Id DESC),
        Id, CompanyTitle, ContactName, LastContactDate
    FROM Suppliers
)
DELETE FROM [CTE DUPLICATE] WHERE RN > 1
For example, you can monitor a production environment to see which stored procedures are hampering performance by executing too slowly.
Use SQL Profiler to monitor only the events in which you are interested. If traces are becoming
too large, you can filter them based on the information you want, so that only a subset of the event
data is collected. Monitoring too many events adds overhead to the server and the monitoring process
and can cause the trace file or trace table to grow very large, especially when the monitoring process
takes place over a long period of time.
Steps to create and run a new trace based on this definition file.
1. Open Profiler.
2. From the File menu, select New > Trace...
3. In the 'Connect to SQL Server' dialog box, Connect to the SQL Server that you are
going to be tracing, by providing the server name and login information (Make sure
you connect as a sysadmin).
4. In the General tab of the 'Trace Properties' dialog box click on the folder icon
against the 'Template file name:' text box. Select the trace template file that
you've just downloaded.
5. Now we need to save the Profiler output to a table for later analysis, so check the 'Save to table:' check box. In the 'Connect to SQL Server' dialog box, specify the SQL Server name (and login and password) on which you'd like to store the Profiler output. In the 'Destination Table' dialog box, select a database and table name, then click OK. It's a good idea to save the Profiler output to a different SQL Server than the one on which you are conducting the load test.
6. Check 'Enable trace stop time:' and select a date and time at which you want the
trace to stop itself. (I typically conduct the load test for about 2 hours).
7. Click "Run" to start the trace.
It's also a good idea to run Profiler on a client machine instead of on the SQL Server itself. On the client machine, make sure you have enough space on the system drive.
Inline Table-Value User-Defined Function: An Inline Table-Value user-defined function returns a
table data type and is an exceptional alternative to a view as the user-defined function can pass
parameters into a T-SQL select command and in essence provide us with a parameterized, nonupdateable view of the underlying tables.
Multi-statement Table-Value User-Defined Function: A Multi-Statement Table-Value user-defined
function returns a table and is also an exceptional alternative to a view as the function can support
multiple T-SQL statements to build the final result where the view is limited to a single SELECT
statement. Also, the ability to pass parameters into a TSQL select command or a group of them gives
us the capability to in essence create a parameterized, non-updateable view of the data in the
underlying tables. Within the CREATE FUNCTION command you must define the table structure that is being returned. After creating this type of user-defined function, it can be used in the FROM clause of a T-SQL command, unlike a stored procedure, which can also return record sets but cannot be referenced in a FROM clause.
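A minimal sketch of both kinds of function (the table and column names are assumed placeholders, not from the text above):
-- Inline table-valued function: a single SELECT, behaves like a parameterized view.
CREATE FUNCTION dbo.fn_OrdersByCustomer (@CustomerID int)
RETURNS TABLE
AS
RETURN
(
    SELECT OrderID, OrderDate, TotalDue
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID
);
GO
-- Multi-statement table-valued function: the returned table structure is declared
-- explicitly and can be filled by several statements.
CREATE FUNCTION dbo.fn_TopOrders (@CustomerID int, @TopN int)
RETURNS @Result TABLE (OrderID int, OrderDate datetime, TotalDue money)
AS
BEGIN
    INSERT INTO @Result (OrderID, OrderDate, TotalDue)
    SELECT TOP (@TopN) OrderID, OrderDate, TotalDue
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID
    ORDER BY TotalDue DESC;
    RETURN;
END;
GO
-- Both can be used in the FROM clause:
SELECT * FROM dbo.fn_OrdersByCustomer(42);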
WITH (NOLOCK) allows a SELECT statement to read data that is currently locked by a transaction that has not yet been committed (a "dirty read"), instead of waiting for the lock to be released. The hint is placed after the table name in the SELECT statement.
Once the transaction is committed or rolled back, there is no longer any need for NOLOCK, because the locks held by that transaction have been released.
Syntax: WITH (NOLOCK)
Example:
SELECT * FROM EmpDetails WITH (NOLOCK)
Consistency - This property says that the transaction should always leave the database in a consistent state. If a transaction would violate the database's consistent state, the transaction is rolled back.
Isolation - This property says that one transaction cannot retrieve data that has been modified by another transaction until that transaction is completed.
Durability - When a transaction is committed, it must be persisted. In the case of a failure, only committed transactions are recovered, and uncommitted transactions are rolled back.
What is a CTE?
A common table expression (CTE) is a temporary named result set that can be used within other statements like SELECT, INSERT, UPDATE, and DELETE. It is not stored as an object and its lifetime/scope is limited to the single statement in which it is defined. It is defined using the WITH keyword, as the following example shows:
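The example referred to above is not present in the text; a minimal sketch (with placeholder table and column names) would look like this:
-- A CTE that computes order totals per customer, then is queried immediately.
WITH CustomerTotals AS
(
    SELECT CustomerID, SUM(TotalDue) AS TotalSpent
    FROM dbo.Orders
    GROUP BY CustomerID
)
SELECT c.CustomerID, c.TotalSpent
FROM CustomerTotals AS c
WHERE c.TotalSpent > 1000;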
What are temp tables? What is the difference between global and local temp
tables?
Temporary tables are temporary storage structures. You may use temporary tables as buckets to store
data that you will manipulate before arriving at a final format. The hash (#) character is used to
declare a temporary table as it is prepended to the table name. A single hash (#) specifies a local
temporary table.
CREATE TABLE #tempLocal (nameid int, fname varchar (50), lname varchar (50))
Local temporary tables are available to the current connection for the user, so they disappear when
the user disconnects.
Global temporary tables may be created with double hashes (##). These are available to all users via
all connections, and they are deleted only when all connections are closed.
CREATE TABLE ##tempGlobal (nameid int, fname varchar (50), lname varchar (50))
Once created, these tables are used just like permanent tables; they should be deleted when you are
finished with them. Within SQL Server, temporary tables are stored in the Temporary Tables folder of
the tempdb database.
BEGIN TRANSACTION
    Statement1
    Statement2
    ...
IF (@@ERROR > 0)
    ROLLBACK TRANSACTION
ELSE
    COMMIT TRANSACTION
1. ERROR_NUMBER() - Returns the error number; its value is the same as that of the @@ERROR function.
2. ERROR_LINE() - Returns the line number of the T-SQL statement that caused the error.
3. ERROR_SEVERITY() - Returns the severity level of the error. A TRY...CATCH block catches all errors that have a severity between 11 and 19.
4. ERROR_STATE() - Returns the state number of the error.
5. ERROR_PROCEDURE() - Returns the name of the stored procedure or trigger where the error occurred.
6. ERROR_MESSAGE() - Returns the full text of the error message. The text includes the values supplied for any substitutable parameters, such as lengths, object names, or times.
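A minimal sketch of how these functions are typically used inside a TRY...CATCH block:
BEGIN TRY
    -- Deliberately cause an error (divide by zero) for illustration.
    SELECT 1 / 0;
END TRY
BEGIN CATCH
    SELECT
        ERROR_NUMBER()    AS ErrorNumber,
        ERROR_SEVERITY()  AS ErrorSeverity,
        ERROR_STATE()     AS ErrorState,
        ERROR_PROCEDURE() AS ErrorProcedure,   -- NULL when not inside a procedure or trigger
        ERROR_LINE()      AS ErrorLine,
        ERROR_MESSAGE()   AS ErrorMessage;
END CATCH;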
GO
CREATE TABLE dbo.Vendors
(VendorID int PRIMARY KEY, VendorName nvarchar (50),
CreditRating tinyint)
GO
ALTER TABLE dbo.Vendors ADD CONSTRAINT CK_Vendor_CreditRating
CHECK (CreditRating >= 1 AND CreditRating <= 5)
Note: To modify a CHECK constraint, you must delete the existing CHECK constraint and then recreate it with the new definition.
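For example, changing the rule above to allow ratings up to 10 would be done by dropping and re-adding the constraint (a sketch based on the Vendors example):
-- Drop the existing CHECK constraint...
ALTER TABLE dbo.Vendors DROP CONSTRAINT CK_Vendor_CreditRating;
GO
-- ...and re-create it with the new definition.
ALTER TABLE dbo.Vendors ADD CONSTRAINT CK_Vendor_CreditRating
CHECK (CreditRating >= 1 AND CreditRating <= 10);
GO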
What is Identity?
Identity (or AutoNumber) is a column that automatically generates numeric values. A start (seed) and increment value can be set, but most DBAs leave these at 1. A GUID column also generates unique values automatically, but those values cannot be controlled. Identity/GUID columns do not need to be indexed.
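A minimal sketch of an identity column with an explicit seed and increment:
CREATE TABLE dbo.Employees
(
    EmployeeID int IDENTITY(1,1) PRIMARY KEY,   -- seed 1, increment 1
    Name       varchar(50) NOT NULL
);
INSERT INTO dbo.Employees (Name) VALUES ('Alice');  -- EmployeeID = 1
INSERT INTO dbo.Employees (Name) VALUES ('Bob');    -- EmployeeID = 2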
What is a view? What is the WITH CHECK OPTION clause for a view?
Views are designed to control access to data. A view is a virtual table that consists of fields from one or more real tables. Views are often used to join multiple tables or to control access to the underlying tables.
The WITH CHECK OPTION for a view prevents data modifications that do not conform to the WHERE clause of the view definition. This allows data to be updated via the view, but only if it still belongs in the view.
The WITH CHECK OPTION makes the WHERE clause a two-way restriction. This option is useful when the view should limit inserts and updates with the same restrictions applied to the WHERE clause.
This clause is important because it prevents changes that do not meet the view's criteria.
Example: Create a view on the pubs database for the authors table that shows the name, phone number, and state of all authors from California. This is very simple:
CREATE VIEW dbo.AuthorsCA
AS
SELECT au_id, au_fname, au_lname, phone, state, contract
FROM dbo.authors
WHERE state = 'ca'
This is an updatable view and a user can change any column, even the state column:
UPDATE AuthorsCA SET state='NY'
After this update, there will be no authors from California state. This might not be the expected
behavior.
Example: Same as above but the state column cannot be changed.
CREATE VIEW dbo.AuthorsCA2
AS
SELECT au_id, au_fname, au_lname, phone, state, contract
FROM dbo.authors
WHERE state = 'ca'
WITH CHECK OPTION
The view is still updatable, except for the state column:
UPDATE AuthorsCA2 SET state='NY'
Note: This will cause an error and the state will not be changed.
What is an execution plan? When would you use it? How would you view the
execution plan?
An execution plan is basically a road map that graphically or textually shows the data retrieval methods chosen by the SQL Server query optimizer for a stored procedure or ad-hoc query. It is a very useful tool for a developer to understand the performance characteristics of a query or stored procedure, since the plan is what SQL Server places in its cache and uses to execute the stored procedure or query. Within Query Analyzer there is an option called "Show Execution Plan" (located on the Query drop-down menu). If this option is turned on, the query execution plan is displayed in a separate window when the query is run again.
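A textual plan can also be requested directly from T-SQL; a small sketch (any query can be substituted for the SELECT):
-- Ask SQL Server to return the estimated plan as text instead of running the query.
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM dbo.authors WHERE state = 'ca';
GO
SET SHOWPLAN_TEXT OFF;
GO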
RANK is one of the ranking functions, which are used to assign a rank to each row in the result set of a SELECT statement.
To use such a function, first specify the function name followed by empty parentheses, then the OVER clause. The OVER clause takes an ORDER BY clause as an argument, which specifies the column(s) on which the rows are ranked.
For example:
SELECT ROW_NUMBER() OVER(ORDER BY Salary DESC) AS [RowNumber], EmpName, Salary,
[Month], [Year] FROM EmpSalary
In the result you will see that the highest salary gets the first number and the lowest salary gets the last. Note that this example uses ROW_NUMBER(), so rows with equal salaries will not get the same number.
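A small sketch contrasting RANK() with ROW_NUMBER() on the same (assumed) EmpSalary table: RANK() gives tied salaries the same rank and then skips numbers, while ROW_NUMBER() always produces unique values:
SELECT
    EmpName,
    Salary,
    ROW_NUMBER() OVER (ORDER BY Salary DESC) AS RowNumber,  -- always unique: 1,2,3,4,...
    RANK()       OVER (ORDER BY Salary DESC) AS SalaryRank  -- ties share a rank: 1,2,2,4,...
FROM EmpSalary;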
2. Transactional Replication
3. Merge Replication
Merge replication replicates data from multiple sources into a single central database. The initial load is the same as in snapshot replication, but afterwards data can be changed both on the subscriber and on the publisher; when they come back on-line, the changes are detected, combined, and applied accordingly.
- When a record is updated in a table, the existing record is placed in the DELETED magic table and the modified data is placed in the INSERTED magic table.
- When a record is deleted from a table, that record is placed in the DELETED magic table.
What is a Trigger?
- In SQL, a trigger is procedural code that is executed automatically when you INSERT, DELETE or UPDATE data in a table.
- Triggers are useful when you want to perform automatic actions such as cascading changes through related tables, enforcing column restrictions, comparing the results of data modifications, and maintaining the referential integrity of data across a database.
- For example, to prevent users from deleting any employee from the EmpDetails table, the following trigger can be created.
CREATE TRIGGER del_emp
ON EmpDetails
FOR DELETE
AS
BEGIN
    ROLLBACK TRANSACTION
    PRINT 'You cannot delete any Employee!'
END
- When someone tries to delete a row from the EmpDetails table, the del_emp trigger cancels the deletion by rolling back the transaction and prints the message "You cannot delete any Employee!"
What is Normalization of database? What are its benefits?
Normalization is a set of rules that are applied while designing database tables that are to be connected with each other by relationships.
Benefits of normalizing the database are:
1. No need to restructure existing tables for new data.
2. Reduced repetitive entries.
3. Reduced required storage space.
4. Increased speed and flexibility of queries.
Note: Messages can be sent from within the same database, a different database, or even a different SQL Server instance.
Cursor:
Cursors are required when we need to update records in a database table one row at a time (row by row). A cursor also impacts the performance of SQL Server, since it uses the SQL Server instance's memory, reduces concurrency, decreases network bandwidth and locks resources, so it is better to avoid cursors where possible. Alternative solutions include a WHILE loop, temporary tables, and table variables.
Note: We should use a cursor only when there is no option except a cursor.
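A minimal sketch of the row-by-row cursor pattern (the table and column names, and the per-row UPDATE, are placeholders):
DECLARE @EmpID int;
DECLARE emp_cursor CURSOR FOR
    SELECT EmpID FROM dbo.EmpDetails;
OPEN emp_cursor;
FETCH NEXT FROM emp_cursor INTO @EmpID;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Placeholder for per-row work, e.g. an UPDATE based on @EmpID.
    UPDATE dbo.EmpDetails SET LastChecked = GETDATE() WHERE EmpID = @EmpID;
    FETCH NEXT FROM emp_cursor INTO @EmpID;
END;
CLOSE emp_cursor;
DEALLOCATE emp_cursor;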
@Table Variable
A table variable is similar to a temporary table but offers more flexibility. It is not physically stored on the hard disk; it is stored in memory. We should choose a table variable when we need to store fewer than about 100 records.
Syntax:
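The syntax example is not present in the text; a minimal sketch would be:
-- Declare a table variable, fill it, and query it within the same batch.
DECLARE @EmpList TABLE
(
    EmpID int,
    EmpName varchar(50)
);
INSERT INTO @EmpList (EmpID, EmpName) VALUES (1, 'Alice'), (2, 'Bob');
SELECT * FROM @EmpList;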
Which function is used to count more than two billion rows in a table?
COUNT() returns an int, so for tables with more than two billion rows use COUNT_BIG(), which returns a bigint.
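For example (the table name is a placeholder):
-- COUNT_BIG returns bigint, so it can report row counts above 2,147,483,647.
SELECT COUNT_BIG(*) AS TotalRows FROM dbo.VeryLargeTable;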
Example: We can have a DeptID column in the Employee table which points to the DeptID column in a Department table, where it is the primary key.
Defined keys:
CREATE TABLE Department
(DeptID int PRIMARY KEY, Name varchar(50) NOT NULL, Address varchar(200) NOT NULL)
CREATE TABLE Student
(ID int PRIMARY KEY, RollNo varchar(10) NOT NULL, Name varchar(50) NOT NULL,
 EnrollNo varchar(50) UNIQUE, Address varchar(200) NOT NULL,
 DeptID int FOREIGN KEY REFERENCES Department(DeptID))
Note: Practically, in a database we have only three types of keys: Primary Key, Unique Key and Foreign Key. The other types of keys are only RDBMS concepts that we need to know.
While inserting a high volume of data into the target table (which had a primary clustered key and two non-clustered indexes), the indexes became heavily fragmented, up to 85%-90%. We could use the online index rebuild feature to rebuild/defrag the indexes, but the fragmentation level was back to 90% after every 15-20 minutes during the load. This whole process of data transfer and parallel online index rebuilds took almost 12-13 hours, which was much more than our expected time for the data transfer.
We then came up with an approach: make the target table a heap by dropping all the indexes on the target table at the beginning, transfer the data to the heap and, on completion of the data transfer, recreate the indexes on the target table. With this approach, the whole process (dropping indexes, transferring data and recreating indexes) took just 3-4 hours, which was what we were expecting.
So the recommendation is to consider dropping your target table's indexes, if possible, before inserting data into it, especially if the volume of inserts is very high.
Beware when you are using the "Table or view" or "Table name or view name from variable" data access mode in the OLEDB source. It behaves like SELECT * and pulls all the columns.
Tip: Try to fit as many rows as possible into the buffer, which will eventually reduce the number of buffers passing through the dataflow pipeline engine and improve performance.
Best Practice #3 - Effect of OLEDB Destination Settings
There are a couple of settings on the OLEDB destination which can impact the performance of the data transfer, as listed below.
Data Access Mode - This setting provides the 'fast load' option, which internally uses a BULK INSERT statement for uploading data into the destination table instead of a simple INSERT statement per row, as is the case for the other options. So unless you have a reason to change it, keep the default value of fast load.
Keep Identity - By default this setting is unchecked, which means the destination table (if it has an identity column) will create identity values on its own. If you check this setting, the dataflow engine will ensure that the source identity values are preserved and the same values are inserted into the destination table.
Keep Nulls - Again, by default this setting is unchecked, which means that a default value will be inserted into the destination table's column (if a default constraint is defined on that column) whenever a NULL value comes from the source for that particular column. If you check this option, the default constraint on the destination table's column will be ignored and the NULL from the source column will be preserved and inserted into the destination.
Table Lock - By default this setting is checked, and the recommendation is to leave it checked unless the same table is being used by some other process at the same time. It specifies that a table lock will be acquired on the destination table instead of acquiring multiple row-level locks, which could turn into lock escalation problems.
Check Constraints - Again, by default this setting is checked, and the recommendation is to un-check it if you are sure that the incoming data is not going to violate the constraints of the destination table. This setting specifies that the dataflow pipeline engine will validate the incoming data against the constraints of the target table. If you un-check this option it will improve the performance of the data load.
Best Practice #4 - Effect of the Rows Per Batch and Maximum Insert Commit Size settings
Rows per batch - The default value for this setting is -1, which means all incoming rows are treated as a single batch. You can change this default behavior and break the incoming rows into multiple batches; the only allowed value is a positive integer, which specifies the maximum number of rows in a batch.
Maximum insert commit size - The default value for this setting is 2147483647 (the largest value for a 4-byte integer type), which means all incoming rows are committed once, on successful completion.
You can specify a positive value for this setting to indicate that a commit will be done for that number of records. You might be wondering whether changing the default value for this setting puts overhead on the dataflow engine by making it commit several times. Yes, that is true, but at the same time it relieves the pressure on the transaction log and tempdb, which otherwise grow tremendously during high-volume data transfers.
The above two settings are very important to understand in order to control the growth of tempdb and the transaction log. For example, if you leave 'Max insert commit size' at its default, the transaction log and tempdb will keep growing during the extraction process, and if you are transferring a high volume of data, tempdb will soon run out of space and, as a result, your extraction will fail. So it is recommended to set these values to an optimum value based on your environment.
Best Practice #5 - SQL Server Destination Adapter
It is recommended to use the SQL Server Destination adapter if your target is a local SQL Server database. It provides a similar level of data-insertion performance as the Bulk Insert task, and provides some additional benefits: with the SQL Server Destination adapter you can transform the data before uploading it to the destination, which is not possible with the Bulk Insert task.
Note: Remember, if your SQL Server database is on a remote server, you cannot use the SQL Server Destination adapter. It is better to use the OLEDB destination adapter to minimize future changes.
Best Practice #6 - Avoid asynchronous transformations (such as Sort and Aggregate transformations) wherever possible
Internally, the SSIS runtime engine executes the package. It executes every task other than the data flow task in the defined order. Whenever the SSIS runtime engine encounters a data flow task, it hands over the execution of that data flow task to the data flow pipeline engine. The data flow pipeline engine breaks the execution of a data flow task into one or more execution trees to achieve high performance.
At run time the data flow engine divides the data flow task operations into execution trees. These execution trees specify how buffers and threads are allocated in the package. Each tree creates a new buffer and may execute on a different thread. When a new buffer is created, such as when a partially blocking or blocking transformation is added to the pipeline, additional memory is required to handle the data transformation; in other words, a new buffer requires extra memory to deal with the transformation it is associated with.
Buffer usage
Use of buffers by SSIS transformation type:
- Row-by-row transformations: Rows are processed as they enter the component, so there is no need to accumulate data. Because the component can use the buffers created by preceding components, it is not necessary to create new buffers and copy data into them. Examples: Data Conversion, Lookup, Derived Column, etc.
- Partially blocking transformations: These are usually used to combine data sets. Since there is more than one data input, it is possible to have huge numbers of rows waiting in memory for the other data set to reach the component. In these cases, the component's data output is copied to new buffers and new execution threads may be created. Examples: Union All, Merge Join, etc.
- Fully blocking transformations: Some transformations need the complete data set before they start running; therefore these are the ones that impact performance the most. In these cases, as well, new buffers and new execution threads are created. Examples: Aggregate, Sort.
SSIS reuses previously created buffers as much as possible in order to increase performance.
Row-by-row transformations are known as synchronous: each input row produces one output row. On the other hand, partially-blocking and fully-blocking transformations are known as asynchronous: there is no need to have the same number of input rows as output rows (they may produce no output rows at all until they have received some or all of their input).
As per the above best practice, the execution tree creates buffers for storing incoming rows and performing transformations. So:
How many buffers does it create?
How many rows fit into a single buffer?
How does it impact performance?
The number of buffers created depends on how many rows fit into a buffer, and how many rows fit into a buffer depends on a few other factors.
-- The first consideration is the estimated row size, which is the sum of the maximum sizes of all the columns from the incoming records.
-- The second consideration is the DefaultBufferMaxSize property of the data flow task. This property specifies the default maximum size of a buffer. The default value is 10 MB, and its upper and lower boundaries are constrained by two internal properties of SSIS, MaxBufferSize (100 MB) and MinBufferSize (64 KB). This means the size of a buffer can be as small as 64 KB and as large as 100 MB.
-- The third factor is DefaultBufferMaxRows, which is again a property of the data flow task, and which specifies the default number of rows in a buffer. Its default value is 10000.
Best Practice #8 - How DelayValidation property can help you
SSIS uses validation to determine whether the package could fail at runtime. SSIS uses two types of validation. The first is package validation (early validation), which validates the package and all its components before starting the execution of the package. The second is component validation (late validation), which validates the components of the package once execution has started.
Let's consider a scenario where the first component of the package creates an object, e.g. a temporary table, which is referenced by the second component of the package. During package validation, the first component has not yet executed, so no object has been created, causing a package validation failure when validating the second component. SSIS will throw a validation exception and will not start the package execution. So how will you get this package running in this common scenario?
To help you in this scenario, every component has a DelayValidation property (default FALSE). If you set it to TRUE, early validation will be skipped and the component will be validated only during package execution.
Best Practice #9 - Better performance with parallel execution
SSIS has been designed to achieve high performance by running the executables of the
package and data flow tasks in parallel. This parallel execution of the SSIS package
executables and data flow tasks can be controlled by two properties provided by SSIS as
discussed below.
EngineThreads - This property specifies the number of source threads (which pull data from the source) and worker threads (which perform transformations and upload data into the destination) that can be created by the data flow pipeline engine to manage the flow of data and data transformation inside a data flow task. It means that if EngineThreads has the value 5, then up to 5 source threads and up to 5 worker threads can be created. Please note, this property is just a suggestion to the data flow pipeline engine; the pipeline engine may create fewer or more threads if required.
Best Practice #10 - Lookup transformation considerations
In the data warehousing world, it's a frequent requirement to match records from a source against a lookup (reference) table.
The Lookup transformation has been designed to perform optimally; for example, by default it uses Full Caching mode, in which all reference dataset records are brought into memory at the beginning (the pre-execute phase of the package) and kept for reference. This ensures the lookup operation performs faster, and at the same time it reduces the load on the reference data table, as the transformation does not have to fetch each individual record one by one when required.
If you do not have enough memory, or the reference data changes frequently, you can use either Partial Caching mode or No Caching mode.
In Partial Caching mode, whenever a record is required it is pulled from the reference table and kept in memory. You can also specify the maximum amount of memory to be used for this caching; if that limit is crossed, the least-used records are removed from memory to make room for new ones. This mode is recommended when you have memory constraints and your reference data does not change frequently.
No Caching mode performs slower because every time it needs a record it pulls it from the reference table, and no caching is done except for the last row. It is recommended if you have a large reference dataset that you don't have enough memory to hold, or if your reference data changes frequently and you want the latest data.
Best Practice #11 - Finally, a few more general SSIS tips
1. Merge Statement: Use the MERGE statement to combine INSERT and UPDATE operations in a single statement while incrementally uploading data (no need for a lookup transformation), and use Change Data Capture for incremental data pulls (see the MERGE sketch after this list).
2. Change Data Capture: CDC is a new feature in SQL Server 2008 that records
insert, update and delete activity in SQL Server tables. A good example of how
this feature can be used is in performing periodic updates to a data warehouse.
The requirement for the extract, transform, and load (ETL) process is to update the
data warehouse with any data that has changed in the source systems since the
last time the ETL process was run.
Note: CDC is a feature that must be enabled at the database level; it is disabled
by default. To enable CDC you must be a member of the sysadmin fixed server
role. You can enable CDC on any user database; you cannot enable it on system
databases.
3. RunInOptimizedMode: The RunInOptimizedMode property of a data flow task (default FALSE) can be set to TRUE to prevent columns that are not used by downstream components from flowing down the pipeline, which improves the performance of the data flow task.
The SSIS project also has a RunInOptimizedMode property, which is applicable at design time only; if you set it to TRUE, it ensures all the data flow tasks are run in optimized mode irrespective of the individual settings at the data flow task level.
4. Make use of sequence containers to group logically related tasks into a single group for better visibility and understanding.
5. By default a task, such as an Execute SQL task or Data Flow task, opens a connection when it starts and closes it once its execution completes. If you want to reuse the same connection in multiple tasks, you can set the RetainSameConnection property of the connection manager to TRUE; in that case, once the connection is opened it will stay open, so that other tasks can reuse it within that single connection.
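A minimal sketch of the MERGE pattern mentioned in tip 1, upserting a staging table into a target table (all table and column names are placeholders):
MERGE dbo.DimCustomer AS target
USING dbo.StagingCustomer AS source
    ON target.CustomerID = source.CustomerID
WHEN MATCHED THEN
    UPDATE SET target.CustomerName = source.CustomerName,
               target.City         = source.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, CustomerName, City)
    VALUES (source.CustomerID, source.CustomerName, source.City);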
What is a data mart? Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in a data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse.
What is a Fact? A fact is something that is quantifiable (or measurable). Facts are typically (but not always) numerical values that can be aggregated.
Additive Measures
Additive measures can be used with any aggregation function like Sum(), Avg() etc. An example is Sales Quantity.