Lab 01

Decision Support Systems LEIC - Alameda 2010/2011
Lab Session #1
Goals
During this lab session, you are expected to fulfill the following objectives:

1. Familiarize yourself with SQL Server Management Studio;
2. Acquire basic familiarity with Transact-SQL;
3. Interact with a database using basic analytical processing operations (namely, the CUBE and ROLLUP operations).

NOTE: If you are already familiar with SQL Server Management Studio or Transact-SQL, you may quickly browse through the document and go directly to Section 3.3.
SQL Server Management Studio
SQL Server Management Studio is a tool included with Microsoft SQL Server 2005 and later versions for configuring, managing, and administering all components within Microsoft SQL Server. The tool includes both script editors and graphical tools which work with objects and features of the server. In the tutorial SQL Server Management Studio:

1. Follow Lesson 1: Basic Navigation in SQL Server Management Studio. In this lesson you will learn how to use the components of Management Studio, how to reconfigure the environment layout, and how to restore the default layout.
2. Follow Lesson 2: Writing Transact-SQL. In this lesson, you will learn how to open Query Editor, how to manage code, and how to use other features of Query Editor.
Transact-SQL
Transact-SQL (T-SQL) is an extension to SQL. T-SQL expands on the SQL standard to include procedural programming, local variables, various support functions for string processing, date processing, mathematics, etc., and changes to the DELETE and UPDATE statements. Of particular interest for our course is the fact that Transact-SQL has grown to support new aggregate operators (e.g., the ROLLUP and CUBE operators) that are specifically optimized for very large databases, such as those found in data marts and data warehouses. In the tutorial Writing Transact-SQL Statements:
Lab Session 1
Decision Support Systems
Page 2 of 19
1. Follow Lesson 1: Creating Database Objects. In this lesson, you create a database, create a table in the database, insert data into the table, update the data, and read the data.
2. Follow Lesson 3: Deleting Database Objects. In this lesson, you remove access to data, delete data from a table, delete the table, and then delete the database.
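As a rough preview of what these two lessons cover, the sketch below walks through the same life cycle in a single script. The database name LabDemo and the table dbo.Products are hypothetical names chosen for illustration and are not necessarily those used in the tutorial. GO is a batch separator understood by Management Studio and sqlcmd, not a Transact-SQL statement.

```sql
-- Hypothetical names (LabDemo, dbo.Products), used only for illustration.
CREATE DATABASE LabDemo;
GO
USE LabDemo;
GO

-- Create a table in the new database.
CREATE TABLE dbo.Products (
    ProductId   INT         NOT NULL PRIMARY KEY,
    ProductName VARCHAR(50) NOT NULL,
    Price       MONEY       NULL
);
GO

-- Insert, update, and read data.
INSERT INTO dbo.Products (ProductId, ProductName, Price) VALUES (1, 'Clamp', 12.48);
INSERT INTO dbo.Products (ProductId, ProductName, Price) VALUES (2, 'Screwdriver', 3.17);

UPDATE dbo.Products
SET ProductName = 'Flat Head Screwdriver'
WHERE ProductId = 2;

SELECT ProductId, ProductName, Price
FROM dbo.Products;

-- Delete the data, then the table, then the database.
DELETE FROM dbo.Products;
DROP TABLE dbo.Products;
GO
USE master;   -- a database cannot be dropped while it is the current database
GO
DROP DATABASE LabDemo;
GO
```

Running the statements one batch at a time in Query Editor makes it easier to see the effect of each step in Object Explorer.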
Aggregation functions in Transact-SQL
Throughout the rest of this lab session, we will use the Job Booking System (JoBS) database, a business database for the fictional company DisasterFix. We will use this database to review common Transact-SQL operations before moving to some analytical processing operations. The tasks described here and several others on Transact-SQL were taken from http://www.blackwasp.co.uk/SQLProgrammingFundamentals.aspx.
3.1 Joins in T-SQL
As you may know, JOIN is a clause within a SELECT statement that specifies that the results will be obtained from two tables. It usually includes a join predicate, which is a conditional statement that determines exactly which rows from each of the tables will be joined by the query. Usually the join will be based upon a foreign key relationship and will only return combined results from the two tables when the key values in both tables match. However, it is possible to join tables based upon non-key values or even perform a cross join that has no predicate and returns all possible combinations of values from the two tables.

Usually normalization results in a database with many related tables. You may therefore want to join more than two tables for a single set of results. In this case, you can include multiple joins in a SELECT statement. The first join combines the first two tables; the second join combines the results of the first join with a third table, and so on.

In this part, we will use an extended version of the JoBS database. For this purpose, download the script SQLscript1.sql and use it to create the database and sample data.

1. Possibly the most common form of join that you will use is the inner join. With this type of join, the two tables are combined based upon a join predicate. Wherever a row in one table matches a row in the other, the two rows are combined and added to the results. If a row in either table matches several in the other, each combination will be included in the results. If a row in either table does not match any in the other table, it will be excluded from the results altogether. The join clause, second table name and join predicate are included in the SELECT statement immediately after the name of the first table. The join uses the INNER JOIN clause and the predicate uses the ON clause.
The basic syntax for the SELECT statement is as follows:

    SELECT <columns> FROM <table-1> INNER JOIN <table-2> ON <predicate>

(a) As an example, we can join the Jobs and Engineers tables. The following query returns a list of jobs and the details of the engineer that performed the work. In this case, the predicate specifies that the tables will be joined only where the EngineerId in both tables is a match. These columns are used in a foreign key relationship definition for the two tables, although this is not a requirement for the join to be executed. However, if non-key columns are frequently used in join operations, you should consider adding appropriate indexes to improve query performance.

    SELECT *
    FROM Jobs
    INNER JOIN Engineers ON Jobs.EngineerId = Engineers.EngineerId

The query returns four results as each of the four jobs in the database has an engineer associated with it. However, not all of the engineers have been assigned work so the twenty-one that have not undertaken a job are not included in the results.

The example query returns every column from both tables. For larger tables, or when using multiple joins, this may include a lot of data that you do not require. In addition to making the results more difficult to read, it can also lengthen the processing time for the query and increase the amount of
network traffic generated. You should therefore only return the columns that you require. This can be achieved by specifying a column list as usual. However, if the two tables include columns with the same name, the ambiguity can cause an error.

(b) Try executing the following query:

    SELECT EngineerId, JobId, VisitDate, EngineerName
    FROM Jobs
    INNER JOIN Engineers ON Jobs.EngineerId = Engineers.EngineerId

This query fails because the EngineerId column appears in both tables. You must therefore specify which table's EngineerId column you require by prefixing it with the table name and a full-stop (period) character. The following query resolves the problem:

    SELECT Jobs.EngineerId, JobId, VisitDate, EngineerName
    FROM Jobs
    INNER JOIN Engineers ON Jobs.EngineerId = Engineers.EngineerId

(c) When using joins, you can also use WHERE clauses, ORDER BY clauses, etc. As with the column selection, you must specify the table name for any columns with ambiguous names. For example:

    SELECT Jobs.EngineerId, JobId, VisitDate, EngineerName
    FROM Jobs
    INNER JOIN Engineers ON Jobs.EngineerId = Engineers.EngineerId
    WHERE Jobs.EngineerId = 4 OR Jobs.EngineerId = 8
    ORDER BY Jobs.EngineerId

(d) A second syntax can be used for creating inner joins without providing a join clause. This syntax is known as an implicit inner join. In this case, the names of the tables to be joined are provided in a comma-delimited list. The join predicate becomes part of the WHERE clause.
    SELECT Jobs.EngineerId, JobId, VisitDate, EngineerName
    FROM Jobs, Engineers
    WHERE Jobs.EngineerId = Engineers.EngineerId

Generally speaking, the syntax that you use for inner joins is a matter of personal preference or imposed coding standards. Many people prefer the explicit syntax as the join predicates are separated from those in the WHERE clause and are kept closer to the names of the tables that they are acting upon. It is also easier to change the type of join with the explicit syntax. It is useful to understand both variations, as you are likely to encounter each syntax in real-world scenarios.

(e) Providing the full name of the table as a prefix to column names can lead to long query statements. This is especially true if, to make the query more easily understood, you prefer to prefix every column name rather than just those that are ambiguous. In such circumstances, it is useful to replace the table names with table aliases. A table alias provides a short code, usually of one or two letters, that can be used in place of a table name. Each alias is defined in the query after the full table name. In the next example, the J alias represents the Jobs table and the Engineers table has the alias E. This makes the query much more readable.

    SELECT J.EngineerId, J.JobId, J.VisitDate, E.EngineerName
    FROM Jobs J
    INNER JOIN Engineers E ON J.EngineerId = E.EngineerId

(f) Table aliases are always required when creating self-referencing joins, where the two tables that are being joined are actually the same table. In the following example, the Jobs table is joined to itself so that the initial job and the follow-up job can be combined in a single row in the results. The initial job uses the alias J, whilst the follow-up job uses the alias JF.
    SELECT J.JobId, J.StandardJobId, J.EngineerId,
           JF.JobId AS FollowUpJobId,
           JF.StandardJobId AS FollowUpStandardJobId,
           JF.EngineerId AS FollowUpEngineerId
    FROM Jobs J
    INNER JOIN Jobs JF ON J.FollowUpJobId = JF.JobId

(g) When you wish to join more than two tables, you can simply add further INNER JOIN and ON clauses to the query. For example, in the JoBS database there is a many-to-many link between
Engineers and their Skills via a junction table named EngineerSkills. We can join the three tables to generate a list that contains all engineers that have skills with one row for every engineer and skill combination. The following query returns these results in order of skill name. Note that some engineers appear more than once in the list because they have multiple skills.

    SELECT E.EngineerName, E.HourlyRate, E.OvertimeRate, S.SkillName
    FROM Engineers E
    INNER JOIN EngineerSkills ES ON E.EngineerId = ES.EngineerId
    INNER JOIN Skills S ON ES.SkillCode = S.SkillCode
    ORDER BY S.SkillName

2. Outer joins use a similar syntax to explicit inner joins but provide different results in some cases. The key difference is that outer joins include all of the rows from one or both tables, even if there are no matching rows defined by the join predicate. Where information is missing, NULL values are substituted in the columns of the returned results. The easiest way to explain the difference is with an example.

(a) Firstly, let's consider an inner join. The query below joins the CustomerComplaints table to the Engineers table so that each complaint can be shown alongside the engineer that was at fault. Although there are three complaints in the database, only one row is returned by the query. This is because the other two rows define complaints that were deemed not to be the engineer's fault. In these cases the EngineerId column in the CustomerComplaints table is set to NULL.

    SELECT E.EngineerName, C.Complaint
    FROM CustomerComplaints C
    INNER JOIN Engineers E ON C.EngineerId = E.EngineerId

    EngineerName    Complaint
    Joey Ohara      Engineer did not have appropriate parts and was rude.
This may not be the information that we require from the query. If we actually want to list every complaint in the database, showing the engineer's name only where an engineer is associated with the complaint, we must use an outer join. In this case, we will use a left outer join. This indicates that all of the values from the table to the left of the join clause will be returned. If there is no associated row from the table on the right of the join, the columns from that table will be included but will contain NULL values. To demonstrate, execute the following command, noting that the join clause has been modified to LEFT JOIN:
    SELECT E.EngineerName, C.Complaint
    FROM CustomerComplaints C
    LEFT JOIN Engineers E ON C.EngineerId = E.EngineerId

The results from this query contain all three complaints. For the two complaints not associated with an engineer, the EngineerName column's value is NULL.

    EngineerName    Complaint
    NULL            Customer has received an incorrect charge on their direct debit account.
    NULL            Customer does not wish to receive direct marketing information.
    Joey Ohara      Engineer did not have appropriate parts and was rude.
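The difference between the two joins can also be reproduced without the JoBS database. The following is a minimal sketch that uses table variables with made-up rows (the names @Engineers and @Complaints and all sample values are assumptions for illustration). It assumes SQL Server 2008 or later for the multi-row VALUES syntax, and the whole script must be run as a single batch because table variables only live for the duration of the batch.

```sql
-- Hypothetical sample data; not the contents of the JoBS database.
DECLARE @Engineers TABLE (
    EngineerId   INT PRIMARY KEY,
    EngineerName VARCHAR(50) NOT NULL
);
DECLARE @Complaints TABLE (
    ComplaintId INT PRIMARY KEY,
    EngineerId  INT NULL,              -- NULL when no engineer was at fault
    Complaint   VARCHAR(100) NOT NULL
);

INSERT INTO @Engineers (EngineerId, EngineerName)
VALUES (1, 'Engineer A'), (2, 'Engineer B');

INSERT INTO @Complaints (ComplaintId, EngineerId, Complaint)
VALUES (10, NULL, 'Billing error'),
       (11, NULL, 'Unwanted marketing'),
       (12, 2,    'Missing parts');

-- Inner join: a NULL EngineerId never satisfies the predicate,
-- so only the complaint linked to Engineer B is returned (1 row).
SELECT E.EngineerName, C.Complaint
FROM @Complaints C
INNER JOIN @Engineers E ON C.EngineerId = E.EngineerId;

-- Left join: all 3 complaints are returned; EngineerName is NULL
-- for the two complaints without an engineer.
SELECT E.EngineerName, C.Complaint
FROM @Complaints C
LEFT JOIN @Engineers E ON C.EngineerId = E.EngineerId;
```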
(b) The opposite of a left outer join is a right outer join. As you may imagine, this returns all of the rows from the table on the right of the join clause with null values for columns from the table to the left of the clause when no matching row exists. If you alter the previous query to use a right join, all twenty-five engineers will be returned. One of these engineers, Joey Ohara, will have a complaint listed, whilst all of the others will have NULL in the Complaint column.

    SELECT E.EngineerName, C.Complaint
    FROM CustomerComplaints C
    RIGHT JOIN Engineers E ON C.EngineerId = E.EngineerId

(c) The third style of outer join is the full outer join. This type of join ensures that every row from both tables is included in the final results. If any row in either table does not have a matching partner, the row is populated with NULL values for the missing data. Try modifying the query to use a full join as shown below. This time you will see twenty-seven returned rows. One will be a complaint with a matching engineer, two will be complaints without engineers and twenty-four will be for engineers with no complaints made against them.

    SELECT E.EngineerName, C.Complaint
    FROM CustomerComplaints C
    FULL JOIN Engineers E ON C.EngineerId = E.EngineerId

3. Cross joins are the simplest form of join and possibly the least used. A cross join does not define a join predicate. This means that the results contain every possible combination from the two tables. This is the Cartesian product of the tables, so this type of join is often called a Cartesian join.

(a) If we modify the outer join query to provide a cross join, all combinations of data are returned. As there are twenty-five rows in the Engineers table and three rows in the CustomerComplaints table, the resultant list contains seventy-five rows.
    SELECT E.EngineerName, C.Complaint
    FROM CustomerComplaints C
    CROSS JOIN Engineers E

Note that cross joins should be used with care, particularly with large tables, as the number of results can grow very quickly.

(b) As with inner joins, cross joins have an implicit syntax variation. This is the same syntax as for inner joins but with no join predicate in the WHERE clause:

    SELECT E.EngineerName, C.Complaint
    FROM CustomerComplaints C, Engineers E
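To see how the size of a cross join follows from the sizes of its inputs, the sketch below counts the rows produced by the explicit and implicit syntax on a small, made-up data set (three complaints and four engineers; all names and values are assumptions for illustration). It assumes SQL Server 2008 or later and must be run as a single batch.

```sql
-- Hypothetical sample data; not the contents of the JoBS database.
DECLARE @Engineers TABLE (
    EngineerId   INT PRIMARY KEY,
    EngineerName VARCHAR(50) NOT NULL
);
DECLARE @Complaints TABLE (
    ComplaintId INT PRIMARY KEY,
    EngineerId  INT NULL,
    Complaint   VARCHAR(100) NOT NULL
);

INSERT INTO @Engineers (EngineerId, EngineerName)
VALUES (1, 'Engineer A'), (2, 'Engineer B'), (3, 'Engineer C'), (4, 'Engineer D');

INSERT INTO @Complaints (ComplaintId, EngineerId, Complaint)
VALUES (10, NULL, 'Billing error'),
       (11, NULL, 'Unwanted marketing'),
       (12, 2,    'Missing parts');

-- Explicit cross join: 3 complaints x 4 engineers = 12 rows.
SELECT COUNT(*) AS ExplicitRows
FROM @Complaints C
CROSS JOIN @Engineers E;

-- Implicit cross join: the same 12 rows.
SELECT COUNT(*) AS ImplicitRows
FROM @Complaints C, @Engineers E;

-- Adding a predicate to the WHERE clause turns the implicit cross join
-- into an implicit inner join: only 1 row matches.
SELECT COUNT(*) AS MatchedRows
FROM @Complaints C, @Engineers E
WHERE C.EngineerId = E.EngineerId;
```

The last query shows why a forgotten WHERE predicate in the implicit syntax is a common source of accidental Cartesian products.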
3.2 More Queries in T-SQL
A union allows you to combine the results of two queries. The results of the second query are simply appended to those of the first. If you wish to combine the rows from three or more queries, you can chain two or more UNION commands. The source of the data in each query is unimportant. You can combine information from two different tables or perform two queries against the same table, potentially using different columns for each select. The key limitations are that the two queries must return the same number of columns and that the columns must have compatible data types that appear in the same order.

In this part, we again make use of the JoBS database. For this purpose, download the script SQLscript2.sql and use it to create the database and sample data.

1. In the first example we will look at the basic UNION command. This command combines the results of two queries into a single set of data. It then looks through the data to find any exact duplicates and removes them before returning the results.

(a) We use two queries for the demonstration. The first query lists all of the complaints from the CustomerComplaints table. Run the statement below to see the three rows of data.

    SELECT CustomerNumber, Complaint, ComplaintTime
    FROM CustomerComplaints

The CustomerFeedback table stores customer comments that are not formal complaints. To see the three rows that exist in the table, execute the following statement. Note that we are using columns with compatible data types to those in the previous query, as will be required for the UNION. These are deliberately simple queries so that the UNION is not over-complicated. In normal use you can include queries containing WHERE clauses.

    SELECT CustomerNumber, Message, FeedbackTime
    FROM CustomerFeedback

If you compare the results of the two queries you should notice that one row in each query is an exact duplicate. This will be removed when we execute the query containing a union of both of these sets of data.
The UNION command is used by placing it directly between two SELECT statements. We can therefore combine the results of the previous two queries by executing the following statement. This
will append the results of the second query to those of the first, remove one of the two duplicates and return a result set containing five rows.

    SELECT CustomerNumber, Complaint, ComplaintTime
    FROM CustomerComplaints
    UNION
    SELECT CustomerNumber, Message, FeedbackTime
    FROM CustomerFeedback

(b) The column names in the two individual queries do not match. When you view the results, you should notice that the column names are taken from the first query. If you wish to use column aliases, apply them to the first query only. For example:

    SELECT CustomerNumber, Complaint AS Feedback, ComplaintTime AS Time
    FROM CustomerComplaints
    UNION
    SELECT CustomerNumber, Message, FeedbackTime
    FROM CustomerFeedback

(c) There is a single clause that can be used to modify the behavior of the UNION command. By adding the ALL clause, you instruct SQL Server that any duplicate rows in the combined results should be retained rather than excluded. This is ideal when you need to include the duplicated rows or when you know that every returned result will be unique. As the duplicate-removal step is skipped, the overall performance of the query can be substantially improved. The query below matches the first UNION query except for the addition of the ALL clause. As the identical rows are retained in the results, this query returns six rows instead of the previous five.

    SELECT CustomerNumber, Complaint, ComplaintTime
    FROM CustomerComplaints
    UNION ALL
    SELECT CustomerNumber, Message, FeedbackTime
    FROM CustomerFeedback
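The effect of the ALL clause is easy to check on a small data set. The following sketch uses two table variables that share exactly one identical row (the table variable names and all sample values are assumptions for illustration, not JoBS data). It assumes SQL Server 2008 or later and must be run as a single batch.

```sql
-- Hypothetical sample data; not the contents of the JoBS database.
DECLARE @Complaints TABLE (
    CustomerNumber INT,
    Complaint      VARCHAR(100),
    ComplaintTime  DATETIME
);
DECLARE @Feedback TABLE (
    CustomerNumber INT,
    Message        VARCHAR(100),
    FeedbackTime   DATETIME
);

INSERT INTO @Complaints (CustomerNumber, Complaint, ComplaintTime)
VALUES (1, 'Late arrival',        '20110301 09:00'),
       (2, 'No marketing please', '20110302 10:30');

INSERT INTO @Feedback (CustomerNumber, Message, FeedbackTime)
VALUES (2, 'No marketing please', '20110302 10:30'),   -- duplicate of a complaint row
       (3, 'Friendly engineer',   '20110303 14:15');

-- UNION removes the duplicate: 3 rows.
SELECT CustomerNumber, Complaint AS Feedback, ComplaintTime AS FeedbackTime
FROM @Complaints
UNION
SELECT CustomerNumber, Message, FeedbackTime
FROM @Feedback;

-- UNION ALL keeps every row from both queries: 4 rows.
SELECT CustomerNumber, Complaint AS Feedback, ComplaintTime AS FeedbackTime
FROM @Complaints
UNION ALL
SELECT CustomerNumber, Message, FeedbackTime
FROM @Feedback;
```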
(d) You may only have one ORDER BY clause in a SELECT statement. When using queries that include unions, the ORDER BY clause must appear at the end of the T-SQL command. The columns being ordered can include the alias names applied to the first query. The sorting is applied to the entire, combined set of data, as in the following example:

    SELECT CustomerNumber, Complaint AS Feedback, ComplaintTime AS Time, 'Complaint' AS Type
    FROM CustomerComplaints
    UNION
    SELECT CustomerNumber, Message, FeedbackTime, 'Feedback'
    FROM CustomerFeedback
    ORDER BY Time

The UNION command allows you to combine the results of two separate queries, either as a distinct query or with the inclusion of duplicated values. The UNION command is an example of a set operation. We now examine two other set operations provided natively by SQL Server. These are accessed using the EXCEPT and INTERSECT commands, which are used in the same manner as UNION. They also have similar limitations, in that the combined queries must present columns with compatible data types that appear in the same order.

Unlike the UNION command, the two new operations perform comparisons of the values in addition to returning only distinct rows. This presents an interesting situation when the individual query results include null values. In a normal query, two null values are not considered equal. However, when two nulls are compared during an EXCEPT or INTERSECT operation they are deemed to be equivalent.

2. The first of the two new set operations that we will examine is EXCEPT. This command uses the results of two queries to generate a set of results. All of the rows from the first query that do not have matching rows in the second query are returned. This means that the order of the two queries is important.

(a) To demonstrate the use of the EXCEPT command we use the same two queries that were used in the union operations. The first query retrieves the three customer complaints from the JoBS database.
    SELECT CustomerNumber, Complaint, ComplaintTime
    FROM CustomerComplaints

The second query returns the three rows of customer feedback. If you run both queries you will see that a matching row appears in each. This row has the feedback, 'Customer does not wish to receive direct marketing information.'

    SELECT CustomerNumber, Message, FeedbackTime
    FROM
    CustomerFeedback

We can use the EXCEPT operation to return every complaint that does not have a matching customer feedback row with the following query. This will return two rows:

    SELECT CustomerNumber, Complaint, ComplaintTime
    FROM CustomerComplaints
    EXCEPT
    SELECT CustomerNumber, Message, FeedbackTime
    FROM CustomerFeedback

Similarly, we can retrieve all of the customer feedback messages that do not appear in the customer complaints table by reversing the two queries as follows:

    SELECT CustomerNumber, Message, FeedbackTime
    FROM CustomerFeedback
    EXCEPT
    SELECT CustomerNumber, Complaint, ComplaintTime
    FROM CustomerComplaints

3. The second set operation uses the INTERSECT command. This returns the distinct set of rows that appear in both queries, represented by the shaded area in the diagram below.

(a) We can use INTERSECT to return all of the customer complaints in the JoBS database that also appear within the customer feedback table. In this case, the order of the two queries is unimportant. The query below will return the row that is common to both tables.

    SELECT CustomerNumber, Complaint AS Feedback, ComplaintTime AS Time
    FROM CustomerComplaints
    INTERSECT
    SELECT CustomerNumber, Message, FeedbackTime
    FROM CustomerFeedback
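The special treatment of NULL values by EXCEPT and INTERSECT, described earlier in this section, can be checked with a small sketch. The table variables @A and @B and their contents are assumptions made up for illustration. The script assumes SQL Server 2008 or later and must be run as a single batch.

```sql
-- Hypothetical sample data used only to illustrate NULL handling.
DECLARE @A TABLE (Id INT NOT NULL, Note VARCHAR(20) NULL);
DECLARE @B TABLE (Id INT NOT NULL, Note VARCHAR(20) NULL);

INSERT INTO @A (Id, Note) VALUES (1, 'x'), (2, NULL), (3, 'z');
INSERT INTO @B (Id, Note) VALUES (2, NULL), (3, 'other');

-- INTERSECT treats the two NULL notes as equivalent:
-- returns the single row (2, NULL).
SELECT Id, Note FROM @A
INTERSECT
SELECT Id, Note FROM @B;

-- EXCEPT returns the rows of @A with no equivalent row in @B:
-- (1, 'x') and (3, 'z').
SELECT Id, Note FROM @A
EXCEPT
SELECT Id, Note FROM @B;

-- By contrast, an ordinary join predicate never considers NULL = NULL
-- to be true, so this query returns no rows at all.
SELECT A.Id, A.Note
FROM @A A
INNER JOIN @B B ON A.Id = B.Id AND A.Note = B.Note;
```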
3.3 Aggregation in T-SQL
Transact-SQL (T-SQL) includes several aggregate functions that can be used in queries. An aggregate function calculates a value based upon the contents of multiple result rows. The calculated values of aggregations are returned in columns in a result set.

3.3.1 Basic Aggregation
We examine five commonly used aggregate functions. These allow you to count the rows returned by a query, calculate a sum or average value and determine the maximum and minimum values in a column. There are further aggregations for calculating checksums and statistical values but these are used less often and are not examined here. We consider only simple examples that perform calculations across entire result sets. We will then introduce grouping, which allows multiple aggregate values to be generated for groups of related rows in a result set.

The queries in this part are executed against the JoBS tutorial database. As such, if you have not done so already, you should download the script SQLscript2.sql and use it to create the database and sample data.

1. The first aggregation function that we will examine is COUNT. In its simplest form, this function returns the number of rows in a table as an integer value.

(a) For example, to count all of the rows in the CustomerAddresses table you can execute the following query, which should return a value of twenty-one:

    SELECT COUNT(*) FROM CustomerAddresses

If you have a query that may count billions of rows, an int value may not be large enough. In this case, you may use the COUNT_BIG function. The functionality is almost identical to COUNT, except that the returned result is a bigint value.

    SELECT COUNT_BIG(*) FROM CustomerAddresses

(b) The above syntax uses an asterisk to specify that all rows in the resultant data will be counted irrespective of the data that they contain. If you replace the asterisk character with the name of a column, the functionality is modified slightly. In that case, all rows that contain information in the named column are counted. Any rows with a NULL value in the stated column are excluded from the count. The CustomerAddresses table contains several columns for each customer address. The second of these, Address2, is a nullable column so can be used in a demonstration.
Run the following query to see that the NULL values are not counted:

    SELECT COUNT(Address2) FROM CustomerAddresses

(c) In addition to counting only non-null values, you may decide that you want to count only distinct, non-null values. This can be achieved by adding the DISTINCT clause to the column name. The following query should return a count of nineteen rows even though there are twenty-one values in the TownOrCity column. As the Leeds and Manchester values are duplicated, each is only counted once.

    SELECT COUNT(DISTINCT TownOrCity) FROM CustomerAddresses

(d) Notice that the name of the calculated column is generated automatically by SQL Server when running aggregation queries. As with other types of query, you can include an alias to give a name to the calculated columns. The next query calculates a count of distinct cities and a count of all addresses. To easily identify the two values, the columns are named.

    SELECT COUNT(DISTINCT TownOrCity) AS Cities, COUNT(*) AS Addresses
    FROM CustomerAddresses

Note that this query includes two aggregate functions. It is acceptable to add many aggregated columns to a query. However, you may not mix aggregated and non-aggregated columns in a query without the use of grouping.
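The three counting variants can be compared side by side on a small data set. The sketch below uses a table variable with four made-up addresses (the name @Addresses and all values are assumptions for illustration, not JoBS data). It assumes SQL Server 2008 or later and must be run as a single batch.

```sql
-- Hypothetical sample data; not the contents of the JoBS database.
DECLARE @Addresses TABLE (
    AddressId  INT PRIMARY KEY,
    Address2   VARCHAR(50) NULL,
    TownOrCity VARCHAR(50) NOT NULL
);

INSERT INTO @Addresses (AddressId, Address2, TownOrCity)
VALUES (1, 'Flat 2', 'Leeds'),
       (2, NULL,     'Leeds'),
       (3, NULL,     'York'),
       (4, 'Unit 7', 'Hull');

SELECT COUNT(*)                   AS AllRows,       -- 4: every row is counted
       COUNT(Address2)            AS WithAddress2,  -- 2: NULL values are skipped
       COUNT(DISTINCT TownOrCity) AS Towns          -- 3: Leeds is counted once
FROM @Addresses;
```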
(e) So far, the queries we have executed have counted the rows from an entire table. It is possible to include a WHERE clause in a query to filter the results before performing a calculation. In the next example, only addresses within the city of Leeds are counted. This should return a value of two.

    SELECT COUNT(*) FROM CustomerAddresses WHERE TownOrCity = 'Leeds'

2. The second aggregate function that we will consider is SUM. This function can be applied to numeric columns to add all of the non-null values together and return the total. The syntax is similar to that of the COUNT function but you may not use an asterisk between the parentheses.

(a) To obtain the total value for all of the contracts in the JoBS database, we can use the following T-SQL statement:

    SELECT SUM(ContractValue) FROM Contracts

(b) The value being totaled need not be a simple column name. You may include a mathematical expression to be calculated for each row in the table before the summing occurs. A very simplistic expression could double the contract value before totaling [1].

    SELECT SUM(ContractValue * 2) FROM Contracts

(c) As with the COUNT function, you may decide to only sum distinct results using the DISTINCT clause.

    SELECT SUM(DISTINCT ContractValue) FROM Contracts

3. The AVG function calculates the arithmetic mean of a column or expression. This is achieved by summing all of the values and dividing by the number of rows. However, as with all other aggregate functions, any NULL values are ignored. This is important when calculating the average as it can produce unexpected results. For example, if the column being processed contains the values 1, 2, 3, 4 and NULL, the sum of the values is 10. You may expect that the average would be 2 as there are five rows. Actually, the result of the AVG function will be 2.5, as the total is divided by the number of non-null rows (4), not the overall number of rows (5).
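The arithmetic above can be verified with a short sketch that stores the values 1, 2, 3, 4 and NULL in a table variable (the name @Samples is an assumption for illustration). The column is declared as DECIMAL on purpose: AVG over an INT column returns an INT, which would truncate 2.5 to 2 and hide the effect. The script assumes SQL Server 2008 or later and must be run as a single batch.

```sql
-- Hypothetical sample data: the values 1, 2, 3, 4 and NULL.
DECLARE @Samples TABLE (Amount DECIMAL(10, 2) NULL);

INSERT INTO @Samples (Amount)
VALUES (1), (2), (3), (4), (NULL);

SELECT SUM(Amount)              AS Total,           -- 10: the NULL is ignored
       COUNT(*)                 AS AllRows,         -- 5
       COUNT(Amount)            AS NonNullRows,     -- 4
       AVG(Amount)              AS Average,         -- 2.5 = 10 / 4
       AVG(COALESCE(Amount, 0)) AS AverageNullAsZero -- 2 = 10 / 5
FROM @Samples;
```

If a NULL should count as zero in the average, replacing it explicitly with COALESCE, as in the last column, makes that choice visible in the query.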
To calculate the average contract value for all contracts, run the following query:

    SELECT AVG(ContractValue) FROM Contracts

As with the SUM function, you can use the DISTINCT clause.

4. The last two aggregate functions that we will consider are MIN and MAX. These return the smallest and largest values, respectively, when used with numeric columns. These functions may also be applied to character-based columns. Used in this manner, they return the first or last item in the column, according to the sort order (collation) of the column.

    SELECT MAX(ContractValue), MIN(ContractValue) FROM Contracts

3.3.2 Grouping
So far, we examined some of the aggregate functions provided by SQL Server. These functions allow us to count rows, calculate totals and averages, and find the maximum and minimum values across a series of rows returned by a query. In each case, we performed an aggregation over a query's entire result set. This allowed us to answer questions such as "What is the entire sales income for the business?" A common requirement is to perform similar functions over groups of related rows in a result set, calculating the aggregates for each group to provide a summary. This allows us to refine the questions asked into, for example, "What is the sales income by region for the business?" The results of this query would include one row per group with totals in each.
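The two questions can be contrasted on a toy data set. The sketch below is only an illustration: the @Sales table variable, its Region and Income columns and all values are hypothetical and do not exist in the JoBS database. The second query uses the GROUP BY clause, which is the subject of the rest of this part. The script assumes SQL Server 2008 or later and must be run as a single batch.

```sql
-- Hypothetical sample data; there is no such table in the JoBS database.
DECLARE @Sales TABLE (
    Region VARCHAR(20) NOT NULL,
    Income MONEY       NOT NULL
);

INSERT INTO @Sales (Region, Income)
VALUES ('North', 100), ('North', 50), ('South', 70);

-- "What is the entire sales income for the business?"
-- One row for the whole result set: 220.
SELECT SUM(Income) AS TotalIncome
FROM @Sales;

-- "What is the sales income by region for the business?"
-- One row per group: North 150, South 70.
SELECT Region, SUM(Income) AS RegionIncome
FROM @Sales
GROUP BY Region;
```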
[1] This query is for example purposes only. In reality it would be more efficient to calculate the sum and double the final result.
This type of query can be created using the GROUP BY clause within a SELECT statement. The GROUP BY clause defines one or more columns or expressions that should be used to determine which rows belong to each group. The groups can then be summarized using the appropriate aggregate functions.

In this part, we again make use of the JoBS database. For this purpose, download the script SQLscript2.sql and use it to create the database and sample data.

1. The GROUP BY clause is added to the end of a query's SELECT statement. The clause is followed by a comma-separated list of the expressions for which you wish to create grouped sets. Often each item in the list will be a simple column name. However, you can create expressions that combine column names, literal values and functions to create groups.

    GROUP BY expression1, expression2, ..., expressionN

(a) As an example, we may wish to calculate the number of units of stock held by each engineer in the JoBS database. To calculate the total number of stocked parts for all of the engineers combined, we would use the following statement:

    SELECT SUM(UnitsHeld) FROM EngineerStock

To modify the statement to show a subtotal for each engineer individually we can add the engineer ID to the list of returned columns and to a GROUP BY clause. Note that EngineerId is not required in the column list, but without it you will not be able to see the link between each total and the engineer.

    SELECT EngineerId, SUM(UnitsHeld)
    FROM EngineerStock
    GROUP BY EngineerId

(b) As with previous aggregation operations you will see that the calculated column in the results has no column name assigned. As before, we can add a column alias if desired:

    SELECT EngineerId, SUM(UnitsHeld) AS Units
    FROM EngineerStock
    GROUP BY EngineerId

(c) When creating queries that use grouping, the list of columns to be returned in the results is more restricted than for a standard query.
The column list may only include aggregated values and columns or expressions that appear in the GROUP BY clause's list. The following query attempts to select the PartNumber column, which meets neither criterion, so causes an error when executed.

    SELECT EngineerId, PartNumber, SUM(UnitsHeld)
    FROM EngineerStock
    GROUP BY EngineerId

2. A GROUP BY clause operates on any data that is gathered by the preceding query. This makes it appropriate for queries that gather information from more than one linked table. However, there are some common mistakes that can occur when aggregating data from joined tables with grouping. When using joins for one-to-many or many-to-many relationships, queries often return some duplicated data from the joined tables. If the duplicated data includes values that are to be included within SUM or AVG functions, they can return unexpected results due to double counting. Similarly, when using the COUNT function, duplicates can be counted twice unless the DISTINCT clause is included.

When using small amounts of data for testing purposes it is possible that you will not notice such calculation errors. You should always pay extra attention when testing aggregates in grouped queries. It can be useful to execute queries without the grouping to help identify areas of potential duplication that are likely to cause miscalculations.

In the following query we join the Engineers and EngineerStock tables to improve the previous query. In this example we include the engineer's name in addition to their unique ID. As both the ID and name are displayed, they are both included in the GROUP BY clause. This does not generate extra groups because there is a one-to-one relationship between an engineer's name and ID.
    SELECT E.EngineerName, S.EngineerId, SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY S.EngineerId, E.EngineerName

The behavior of the GROUP BY clause can be modified using the ALL keyword. This keyword causes all of the possible groups from a table to be returned, even if the query's WHERE clause filters out all of the rows for a particular group. In this case, the group is displayed but all aggregated values in the row are null.

GROUP BY ALL has some limitations. Chief among them, Microsoft does not support it in queries that access remote tables when the query also contains a WHERE clause, and without a WHERE clause the ALL keyword has no effect. Generally you should not use the ALL variation; it is included here only for completeness. Microsoft has indicated that GROUP BY ALL will be removed from future versions of SQL Server.

3. The results of a query containing grouping can be filtered in two ways. Firstly, a WHERE clause may be used before the GROUP BY clause to filter the rows before any aggregate calculations are made. Secondly, you can use the HAVING clause to add criteria that are applied after the grouping and aggregation operations.

(a) The HAVING clause uses a similar syntax to the WHERE clause. The key difference is the ability to filter on calculated (aggregated) columns. For example, to amend the previous query so that it only returns engineers who hold a total stock of more than twenty units, we can use the following query:

    SELECT E.EngineerName, S.EngineerId, SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY S.EngineerId, E.EngineerName
    HAVING SUM(S.UnitsHeld) > 20

(b) Often both a WHERE clause and a HAVING clause are used in the same query. For example, the following query lists the engineers who hold more than twenty units of 15mm Copper Pipe (15COPIPE). The WHERE clause restricts the rows to the copper pipe item before the grouping occurs. The HAVING clause then removes any group whose total stock is twenty or less.

    SELECT E.EngineerName, S.EngineerId, SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    WHERE S.PartNumber = '15COPIPE'
    GROUP BY S.EngineerId, E.EngineerName
    HAVING SUM(S.UnitsHeld) > 20

4. For the final example we add a sort order to the above query. When grouping results, ordering may be applied to any of the columns in the query, including those calculated by aggregate functions. The query below demonstrates this by ordering the results by the calculated stock total, with the highest stock level first and the lowest last.

    SELECT E.EngineerName, S.EngineerId, SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    WHERE S.PartNumber = '15COPIPE'
    GROUP BY S.EngineerId, E.EngineerName
    HAVING SUM(S.UnitsHeld) > 20
    ORDER BY SUM(S.UnitsHeld) DESC

3.3.3 Analytical Processing in T-SQL
In Section 3.3.2 we investigated the use of the GROUP BY clause to create queries that aggregate data for groups of related rows. You could use this grouping to calculate a business's sales income for each of a set of targeted regions and list the results using a single query. Without the grouping, the same aggregate functions could be used to calculate the total sales income for the entire business. However, obtaining both the subtotals and the overall total would require two separate queries. If there were several columns to calculate subtotals across, for example sales region and sales person, populating even a simple table of subtotals would require further queries.

We now look at two clauses that can be added to a query containing a GROUP BY clause. The WITH ROLLUP and WITH CUBE clauses assist in summarizing information in the manner described above. In addition to the totals for each grouped row, new rows are added to the results to provide subtotals and grand totals. Although the examples in this part use the SUM aggregate function, the clauses can also be used with COUNT, AVG, MIN and MAX.

In this part, we again make use of the JoBS database. For this purpose, download the script SQLscript2.sql and use it to create the database and sample data.

1. Let's start with a very simple example of the WITH ROLLUP clause that can be built upon as we progress through the rest of the lab session.

(a) If we wish to find the total number of stock items held by each engineer in the JoBS database, we can use a query with a GROUP BY clause. The query below obtains this information by summing the UnitsHeld value from the EngineerStock table, grouped by the engineer holding the stock.

    SELECT E.EngineerName, SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY E.EngineerName

The above query finds the subtotals of stock levels for each engineer but does not find the total stock held by all of the engineers combined. To obtain this, we must execute the aggregate function without any grouping, as follows:

    SELECT SUM(UnitsHeld)
    FROM EngineerStock

If we wished to combine the results of the two queries above, we could add a new first column to the second query, possibly using a literal VarChar value of 'Total'. This would make the column lists compatible so that the queries' result sets could be combined using a UNION. However, using two separate queries would be inefficient. Instead, we can use a rollup.

(b) When only one column is used for grouping, a rollup adds one new row to the query's output. This row contains aggregated values for the entire result set. If you are using the SUM function, the new row holds the total of all of the rows returned by the query. For other aggregate functions, the appropriate result for the entire result set is calculated.
To use the rollup functionality, add WITH ROLLUP immediately after the GROUP BY list:

    SELECT E.EngineerName, SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY E.EngineerName
    WITH ROLLUP

The results of the above query should include one extra row holding the grand total for all of the rows. In the grouping column, EngineerName, the new row has a NULL value. Sometimes this is useful and sometimes you will want to replace the NULL with more descriptive text. This process will be explained shortly.
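To see the shape of the result without a SQL Server instance, the sketch below uses Python's sqlite3 module with invented rows. SQLite does not support WITH ROLLUP, so the extra row is produced with the UNION approach described earlier; this is an emulation of the single-column rollup, not the T-SQL clause itself.

```python
import sqlite3

# In-memory SQLite database with a cut-down, invented version of the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Engineers (EngineerId INTEGER PRIMARY KEY, EngineerName TEXT);
    CREATE TABLE EngineerStock (EngineerId INTEGER, PartNumber TEXT, UnitsHeld INTEGER);
    INSERT INTO Engineers VALUES (1, 'Engineer A'), (2, 'Engineer B');
    INSERT INTO EngineerStock VALUES (1, 'P1', 10), (1, 'P2', 8), (2, 'P1', 16);
""")

# First SELECT: one subtotal per engineer.
# Second SELECT: the grand total, with NULL in the grouping column,
# which is the row that WITH ROLLUP would add in SQL Server.
rows = conn.execute("""
    SELECT E.EngineerName, SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY E.EngineerName
    UNION ALL
    SELECT NULL, SUM(UnitsHeld)
    FROM EngineerStock
""").fetchall()

for name, units in rows:
    print(name, units)
```

The grand total row is the one whose name is None, mirroring the NULL that the rollup places in the EngineerName column.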
The ROLLUP clause becomes more interesting when used in queries that group across several columns. In these cases, the grouping columns are used to generate a hierarchy. The first grouping column is the root of the hierarchy and each subsequent grouping column adds another layer. Totals are then added at each level of the hierarchy. For example, if we modify the previous query so that it groups by the engineer's name and then by the stock item's part number, a subtotal is created for each engineer and a grand total is created for all of the engineers in the results. These rows are added by the clause in addition to the totals calculated for each combination of engineer and part.

(c) You can picture this by placing the data into a simple pivot table. Below there are three engineers and three products. We can see totals for each engineer but not for each product. The missing totals are shown in the table as question marks (?).

    Part/Engineer   Part 1   Part 2   Part 3   Totals
    Engineer A          10        8       12       30
    Engineer B          16       10       11       37
    Engineer C          20        7       23       50
    Totals               ?        ?        ?      117
To produce the engineer subtotals and the grand total from the data in the JoBS database, execute the following query. The output will not, of course, be displayed as a pivot table.

    SELECT E.EngineerName, S.PartNumber, SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY E.EngineerName, S.PartNumber
    WITH ROLLUP

(d) If the order of grouping is altered, the totals that are calculated also change. If we change the grouping to be by part first, then by engineer, we lose the engineer totals but gain summaries for each part. Again, we can see this by pivoting the data:

    Part/Engineer   Part 1   Part 2   Part 3   Totals
    Engineer A          10        8       12        ?
    Engineer B          16       10       11        ?
    Engineer C          20        7       23        ?
    Totals              46       25       46      117
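The effect of the grouping order can be checked by hand with a small sketch. The Python code below is not how SQL Server implements ROLLUP; it simply sums the figures from the two pivot tables above for every prefix of the grouping columns, which is the set of rows a rollup reports. The rollup function and its parameter names are made up for this illustration, and the third product is labelled Part 3.

```python
from collections import defaultdict

# Stock figures from the pivot tables: (engineer, part, units).
STOCK = [
    ("Engineer A", "Part 1", 10), ("Engineer A", "Part 2", 8), ("Engineer A", "Part 3", 12),
    ("Engineer B", "Part 1", 16), ("Engineer B", "Part 2", 10), ("Engineer B", "Part 3", 11),
    ("Engineer C", "Part 1", 20), ("Engineer C", "Part 2", 7), ("Engineer C", "Part 3", 23),
]

def rollup(rows, key_indexes, value_index):
    """Sum one column for every prefix of the grouping columns.

    Rolled-up columns are reported as None, mirroring the NULLs
    that WITH ROLLUP places in its summary rows.
    """
    totals = defaultdict(int)
    for row in rows:
        keys = [row[i] for i in key_indexes]
        # Prefix lengths n, n-1, ..., 0 give the detail rows,
        # the subtotals and finally the grand total.
        for depth in range(len(keys), -1, -1):
            group = tuple(keys[:depth]) + (None,) * (len(keys) - depth)
            totals[group] += row[value_index]
    return dict(totals)

# GROUP BY EngineerName, PartNumber WITH ROLLUP: totals per engineer.
by_engineer_first = rollup(STOCK, key_indexes=(0, 1), value_index=2)
# GROUP BY PartNumber, EngineerName WITH ROLLUP: totals per part.
by_part_first = rollup(STOCK, key_indexes=(1, 0), value_index=2)

print(by_engineer_first[("Engineer A", None)])  # subtotal for Engineer A
print(by_part_first[("Part 1", None)])          # subtotal for Part 1
print(by_part_first[(None, None)])              # grand total
```

Each call yields thirteen rows: nine detail rows, three subtotals for the first grouping column, and one grand total. No subtotal is ever produced for the second grouping column alone.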
The query for the part-first grouping is as follows; note the changed order of the grouping columns.

    SELECT S.PartNumber, E.EngineerName, SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY S.PartNumber, E.EngineerName
    WITH ROLLUP

2. When summarizing information using the ROLLUP clause, SQL Server provides a new function named GROUPING. This function takes a single parameter containing the name of one of the grouping columns in the query. The function then returns either 1 or 0 for every row in the result set. If the named column is aggregated for the row, meaning that the rollup has inserted a NULL value into the column for that row, the value of the function is 1. If not, it is 0.

(a) We can see the GROUPING function in action by executing the following query. This query adds two columns containing the GROUPING results for PartNumber and EngineerName. Scan the results to confirm that wherever a NULL is inserted by the ROLLUP, a corresponding 1 appears in the matching GROUPING column.

    SELECT S.PartNumber, E.EngineerName, SUM(S.UnitsHeld),
           GROUPING(S.PartNumber) AS PartGrouping,
           GROUPING(E.EngineerName) AS EngineerGrouping
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY S.PartNumber, E.EngineerName
    WITH ROLLUP

With the GROUPING function, we can determine whether a NULL in a grouping column has been generated by the summarizing process. We can use this knowledge to replace the NULL with a more suitable value when necessary. To do this we use a conditional expression named CASE. We do not describe CASE in detail here, and instead refer to the Transact-SQL reference manual. We simply describe one usage, which tests a value and uses the outcome to decide which of two other values to return in the results.

(b) We use two CASE expressions. The first is for the PartNumber column. The condition being checked, or predicate, is whether the GROUPING function for the column returns 1. If it does, the literal text 'ALL PARTS' is returned for the row. If not, the column value from the table is returned. The syntax for this condition is:

    CASE WHEN GROUPING(S.PartNumber) = 1 THEN 'ALL PARTS' ELSE S.PartNumber END

As you can see, the predicate appears after the WHEN keyword. The two possible answers appear after the THEN and ELSE keywords. This makes the expression easy to read in English as "When the grouping value of the PartNumber column is 1 then return 'ALL PARTS'. Otherwise, return the value of the PartNumber column." The second CASE expression is similar to the first, except that it operates on the engineer's name.
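The role of the GROUPING flag can be illustrated outside the database. The following Python sketch uses invented rows and hypothetical helper names (rollup_with_grouping, label); it rolls up by part and then by engineer, carries a 0/1 flag per grouping column in the way GROUPING does, and applies the same test as the CASE expression above. One row deliberately has no part number, to show that the flag separates a NULL stored in the data from a NULL produced by the rollup.

```python
from collections import defaultdict

# Invented sample rows: (part_number, engineer_name, units_held).
# The None part number stands for a genuine NULL stored in the table.
ROWS = [
    ("15COPIPE", "Engineer A", 10),
    ("15COPIPE", "Engineer B", 16),
    (None, "Engineer A", 4),
]

def rollup_with_grouping(rows):
    """Return {(part, engineer, part_grouping, engineer_grouping): total}.

    The two flags play the role of GROUPING(PartNumber) and
    GROUPING(EngineerName): 1 when the column was rolled up, 0 otherwise.
    """
    totals = defaultdict(int)
    for part, engineer, units in rows:
        totals[(part, engineer, 0, 0)] += units  # detail level
        totals[(part, None, 0, 1)] += units      # subtotal per part
        totals[(None, None, 1, 1)] += units      # grand total
    return dict(totals)

def label(value, grouping_flag, text):
    """Mimic CASE WHEN GROUPING(col) = 1 THEN text ELSE col END."""
    return text if grouping_flag == 1 else value

labelled = [
    (label(part, pg, "ALL PARTS"), label(eng, eg, "ALL ENGINEERS"), total)
    for (part, eng, pg, eg), total in rollup_with_grouping(ROWS).items()
]

for row in labelled:
    print(row)
```

The subtotal for the rows with a missing part number keeps None in the part column, while the grand total is labelled ALL PARTS, because only the latter has its part flag set to 1.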
To demonstrate, execute the following query. You should see several occurrences of the 'ALL ENGINEERS' text and one grand total row containing both 'ALL PARTS' and 'ALL ENGINEERS'.

    SELECT CASE WHEN GROUPING(S.PartNumber) = 1 THEN 'ALL PARTS' ELSE S.PartNumber END,
           CASE WHEN GROUPING(E.EngineerName) = 1 THEN 'ALL ENGINEERS' ELSE E.EngineerName END,
           SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY S.PartNumber, E.EngineerName
    WITH ROLLUP

Sometimes you will find queries where the NULL values are replaced with standard text using the ISNULL function. This gives the correct results when the grouped columns cannot contain NULL values. However, if they can contain NULL values, those genuine NULLs are replaced too, potentially giving misleading results.

3. For our final example we use the WITH CUBE clause. This is similar to the WITH ROLLUP clause in its operation, as it adds new summary rows for groups of results. However, rather than creating a hierarchy of results, a cube contains totals for every possible combination of the grouping columns. Returning to our pivot table layout, every total is calculated.

    Part/Engineer   Part 1   Part 2   Part 3   Totals
    Engineer A          10        8       12       30
    Engineer B          16       10       11       37
    Engineer C          20        7       23       50
    Totals              46       25       46      117
(a) To show the results, execute the following query. This is the same as the previous query except that it uses CUBE instead of ROLLUP. You can see that the result set contains subtotals for every engineer and for every part individually.

    SELECT CASE WHEN GROUPING(S.PartNumber) = 1 THEN 'ALL PARTS' ELSE S.PartNumber END,
           CASE WHEN GROUPING(E.EngineerName) = 1 THEN 'ALL ENGINEERS' ELSE E.EngineerName END,
           SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY S.PartNumber, E.EngineerName
    WITH CUBE
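As a cross-check on the pivot table for the cube, the sketch below computes every total by hand in Python. It is an illustration of what the cube reports rather than of how SQL Server evaluates WITH CUBE: each grouping column is either kept or rolled up to None, and a sum is kept for every combination. The cube function and its parameter are hypothetical names, and the third product is labelled Part 3.

```python
from collections import defaultdict
from itertools import product

# Stock figures from the pivot table: (part, engineer, units).
STOCK = [
    ("Part 1", "Engineer A", 10), ("Part 2", "Engineer A", 8), ("Part 3", "Engineer A", 12),
    ("Part 1", "Engineer B", 16), ("Part 2", "Engineer B", 10), ("Part 3", "Engineer B", 11),
    ("Part 1", "Engineer C", 20), ("Part 2", "Engineer C", 7), ("Part 3", "Engineer C", 23),
]

def cube(rows, n_keys):
    """Sum the last field of each row for every subset of the first n_keys columns."""
    totals = defaultdict(int)
    for row in rows:
        keys, value = row[:n_keys], row[-1]
        # Each mask says which columns are kept (True) or rolled up to None (False).
        for mask in product((True, False), repeat=n_keys):
            group = tuple(k if keep else None for k, keep in zip(keys, mask))
            totals[group] += value
    return dict(totals)

result = cube(STOCK, n_keys=2)

print(result[("Part 1", None)])      # total for Part 1 across engineers
print(result[(None, "Engineer A")])  # total for Engineer A across parts
print(result[(None, None)])          # grand total
```

With two grouping columns the cube holds sixteen rows: nine detail rows, three part totals, three engineer totals and the grand total. A rollup over the same columns would omit one of the two sets of three subtotals.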
