
Arrays and Lists in SQL Server 2005

Introduction

In the public forums for SQL Server, you often see people asking "How do I use arrays in SQL Server?" or "Why does SELECT * FROM tbl WHERE col IN (@list) not work?" The short answer to the first question is that SQL Server does not have arrays; SQL Server has tables. However, you cannot specify a table as input to SQL Server from a client. What you can do is to specify a string and unpack that into a table. This article describes a number of different ways to do this, both good and bad.

I first give a background to the problem (including a quick, if not the best, solution). I then give a brief overview of the methods, whereupon I discuss general issues that apply no matter which method you use. Having dealt with these introductory topics, I devote the rest of the main article to detailed descriptions of all methods, and I discuss their strengths and weaknesses. To find out how well these methods perform, I have conducted performance tests, and I relate the results of these tests in a separate appendix.

If you feel deterred by the sheer length of this article, you should be relieved to know that this is the kind of article where you may come and go as you please. If you are a plain SQL programmer who wants to know "how do I?", you can drop off already after the first solution if you are in a hurry. If you have a little more time, you read the background, the overview and the General Considerations section, and study the methods that look the most appealing to you. True SQL buffs who are curious about the performance numbers might find the explanations of the methods a little tedious and may prefer to skim these parts, and then go directly to the performance-test appendix.

Note: this article covers SQL 2005 only (save a few back references to SQL 2000). If you are using SQL 2000, there is an older version of this article that covers SQL 2000, SQL 7 and SQL 6.5.

Note: all samples in this article refer to the Northwind database. This database does not ship with SQL 2005, but you can download the script to install it from Microsoft's web site.

Background

Comma-separated List of Values

You have a number of key values, identifying a couple of rows in a table, and you want to retrieve these rows. If you are the sort of person who composes your SQL statements in client code, you might have something that looks like this:

SQL = "SELECT ProductID, ProductName FROM Northwind..Products " & _
      "WHERE ProductID IN (" & List & ")"
rs = cmd.Execute(SQL)

List is here a variable to which you somewhere have assigned a string value of a comma-separated list, for instance "9, 12, 27, 39". The sort of code above is bad practice, because you should never interpolate parameter values into your query string. (Why is beyond the scope of this article, but I discuss this in detail in my article The Curse and Blessings of Dynamic SQL, particularly in the sections on SQL Injection and Caching Query Plans.) Since this is bad practice, you want to use stored procedures. However, at first glance there does not seem to be any apparent way of doing this. Many have tried with:
CREATE PROCEDURE get_product_names @ids varchar(50) AS
   SELECT ProductID, ProductName
   FROM   Northwind..Products
   WHERE  ProductID IN (@ids)

But when they test this:

EXEC get_product_names '9, 12, 27, 37'

The reward is this error message:

Server: Msg 245, Level 16, State 1, Procedure get_product_names, Line 2
Syntax error converting the varchar value '9, 12, 27, 37' to a column of data type int.

This fails because we are no longer composing an SQL statement dynamically, and @ids is just one value in the IN clause. An IN clause could also read:

... WHERE col IN (@a, @b, @c)

Or more directly, consider this little script:
CREATE TABLE #csv (a varchar(20) NOT NULL)
go
INSERT #csv (a) VALUES ('9, 12, 27, 37')
INSERT #csv (a) VALUES ('something else')
SELECT a FROM #csv WHERE a IN ('9, 12, 27, 37')

The SELECT returns only the row '9, 12, 27, 37': the IN condition here is a plain string comparison against a single value, not a test against a list of four numbers, and the same thing happens with IN (@ids) in the procedure above. The correct way of handling the situation is to use a function that unpacks the string into a table. Here is a very simple such function:
CREATE FUNCTION iter$simple_intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS
BEGIN
   DECLARE @pos      int,
           @nextpos  int,
           @valuelen int

   SELECT @pos = 0, @nextpos = 1
   WHILE @nextpos > 0
   BEGIN
      SELECT @nextpos = charindex(',', @list, @pos + 1)
      SELECT @valuelen = CASE WHEN @nextpos > 0
                              THEN @nextpos
                              ELSE len(@list) + 1
                         END - @pos - 1
      INSERT @tbl (number)
         VALUES (convert(int, substring(@list, @pos + 1, @valuelen)))
      SELECT @pos = @nextpos
   END
   RETURN
END

The function simply iterates over the string looking for commas, and extracts the values one by one. The only complexity is the logic to handle the last value in the string. Here is an example of how you could use this function:
CREATE PROCEDURE get_product_names_iter @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   iter$simple_intlist_to_tbl(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_iter '9, 12, 27, 37'

So there you have a solution to the problem. But let me say it directly: the function above is not extremely speedy, and almost all methods I will discuss in this article are faster than it. Nevertheless it's good enough for many situations, particularly if your list is short. So if you are in a hurry and want to move on with your project, feel free to stop here and come back later if you are curious or run into problems with performance and need to learn more. If your lists are in a table column and you are in a hurry, head for the section Unpacking Lists in a Table.

Inserting Many Rows

A related problem is that you need to insert many rows. You suspect that one INSERT statement at a time, or calling a stored procedure for every row, will be slow because of the many network roundtrips, so you would like some more efficient method. Of course, if you are looking into importing data on a major scale, you should consider SQL Server Integration Services (SSIS) or bulk load with BCP or BULK INSERT. But sometimes SSIS or bulk load would overshoot the target, and you want something more lightweight, yet more efficient than sending one row at a time.

Indeed, most of the methods that I describe in this article can be used for this purpose. Most of them are best fitted to handle "arrays" of single values, although they can be reworked to handle records with several fields. There are two methods that serve this purpose better, and that is XML and a trick with INSERT-EXEC that I discuss in the section Making the List into Many SELECT. In this article I'm focusing on comma-separated lists, since most questions on the newsgroups are about this scenario. But I will occasionally touch on the topic of inserting many rows.

Overview of the Methods

As I've already hinted, there are quite a few methods to unpack a list into a table. Here I will just give a quick overview of the methods, before I move on to the general considerations.

The Iterative Method. Looping through a comma-separated list and returning the elements in a table. While being one of the slower methods, performance is still acceptable for most applications. Its main advantage is that it is easy to understand and easily adaptable to different input formats.

Using the CLR. SQL 2005 adds the possibility to write table-valued functions (and stored procedures etc.) in .Net languages such as C# and Visual Basic. This is one of the fastest methods, and if you are used to C# or VB programming, you will find that this method lends itself well to extensions.

XML. XML is the prime choice when you need to insert many rows at one time. To handle a comma-separated list it is maybe a bit of overkill. If you use the new XQuery methods added in SQL 2005, XML has good performance.

Using a Table of Numbers to unpack a comma-separated list. A relationally "pure" solution that is not far behind the CLR in performance.

Fixed-length Elements. Rather than using a comma-separated list, use a string where all elements have the same length. This is the fastest method of all up to a certain limit (which is quite high). This method also uses a table of numbers.

Using a Recursive Common Table Expression (CTE). If you want a method that does not require any extra support such as the CLR or a table of numbers, and maybe not even a function, this method is a good choice. Performance is not marvellous, but better than for the iterative method.

Dynamic SQL. For a list of numbers, it may appear simpler than any other method, but there are several complications with regard to security. And while performance has improved a lot since SQL 2000, this method is slower than most other methods, particularly for long input.

Making the List into Many SELECT. The list is transformed into many SELECT statements of which the result sets are inserted into a temp table. This method does not really have any advantage for handling comma-separated lists, but it is an interesting alternative when you need to insert many rows.

Really Slow Methods. Methods that use charindex, patindex or LIKE. These solutions are just unbelievably slow even for short input.

General Considerations

Interface

Most of the methods I present are packaged into functions that take an input parameter which is a list of values and return a table, like this:
CREATE FUNCTION list_to_table (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS

The reason the methods are in functions is obvious: this permits you to easily reuse the function in many queries. Here I will discuss some considerations about the interface of such functions.

The Input Parameters

In this article, the input parameter is always of the data type nvarchar(MAX). This is a new data type in SQL 2005 that can fit up to 2 GB of data, just like the old ntext data type, but nvarchar(MAX) does not have the many quirks of ntext. I made this choice because I wanted to make the functions as generally applicable as possible. By using nvarchar the functions can handle Unicode input, and with MAX the functions permit practically unlimited input. With nvarchar(4000), they would silently yield incorrect results with longer input, which is very bad in my book. Nevertheless, there is a performance cost for these choices. If you use an SQL collation, you should know that varchar gives you better performance (more on that in a minute). And some operations are slower with the MAX data types. Thus, if you know that your lists will never exceed 8000 bytes and you will only work with your ANSI code page, but you need all the performance you can get, feel free to use varchar(8000) instead.

Some of the functions take a second parameter that permits you to specify the delimiter, or in the case of fixed-length, the element length. In some functions I have opted to hard-code the delimiter, but for all methods that take a delimiter (save dynamic SQL), you can always add such a parameter. To keep things simple, I have consistently used one-character delimiters. If you need multi-character delimiters, most methods can be extended to handle this.

The Output Table

The output from the functions is tables. For all methods, I've included one function for returning a table of strings, and for some methods also a function for returning a table of integers. (Which is likely to be the most common data type for this kind of list.) If you have a list of integers and a function that returns strings, you can use it like this:
SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l ON t.id = convert(int, l.str)

The same applies to other data types. You can easily clone a version of the function that has the convert built in. (However, check the Robustness section for some things to look out for.) The data type that requires most consideration is actually strings. Return nvarchar or varchar? Obviously, when you work with Unicode data, you need to get nvarchar strings back. It may be tempting to always return nvarchar, but for reasons that I will return to in the performance section, you should make sure that you have a varchar string when you join with a varchar column. For some methods, the return table includes both a varchar and an nvarchar column.

In some functions, I also return the position in the list for the list elements. This can be handy when you have two or more lists that are horizontal slices of the same source, so you can say things like:
INSERT tbl (id, col2, col3)
   SELECT a.number, b.str, c.str
   FROM   intlist_to_table(@list1) a
   JOIN   charlist_to_table(@list2) b ON a.listpos = b.listpos
   JOIN   charlist_to_table(@list3) c ON a.listpos = c.listpos

That is, this is a way to insert many rows in one go, although it's not really the best one. Sometimes this can be OK if you only have two or three columns per row to insert, but as the number of parallel lists grows, it gets out of hand, and you should investigate XML instead. The particular danger with the approach above is that if the lists get out of sync with each other, you will insert incorrect data. For some methods, the list position can easily be derived from the method itself; for others (but not all) you can use the row_number() function, a very valuable addition to SQL 2005.

Robustness

It can't be denied that parsing strings is a bit risky. As long as the input plays by the rules everything goes fine, but what happens if it doesn't? A good list-to-table function can protect you from some accidents, but not all. Here are a couple of situations to watch out for.

Delimiter in the Input

Say that you have a couple of city names like this: Berlin, Barcelona, Frankfurt (Main), Birmingham, København. From this you want to compose a comma-separated list that you pass to a list-to-table function. With the names listed above, that works fine, but then some joker enters Dallas, TX. Oh-oh. There are several ways to deal with this problem. One is to use a delimiter that is unlikely to appear in the input data, for instance a control character. Many programs put strings into quotes, so the above list would read "Berlin","Barcelona" etc. This latter format is not supported by any of the functions I present, but you could tweak some of them a bit to get there. Sometimes you cannot really make any assumption about the delimiter at all, for instance if the input comes from user input or on the wire. In such a case you will need to use a method with a general escape mechanism, of which I present one, to wit XML. Or you can avoid the delimiter business completely by using fixed-length strings. When you work with lists of integers, this is not very likely to be a problem.
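To make the control-character idea concrete, here is a minimal sketch of my own (it is not one of the article's examples). It uses TAB, nchar(9), as delimiter together with the iter_charlist_to_tbl function that is presented later in this article:

-- A sketch only: TAB is unlikely to appear in city names, so 'Dallas, TX'
-- survives as one element even though it contains a comma.
DECLARE @tab nchar(1), @list nvarchar(MAX)
SELECT @tab = nchar(9)
SELECT @list = N'Berlin' + @tab + N'Frankfurt (Main)' + @tab + N'Dallas, TX'
SELECT listpos, str FROM iter_charlist_to_tbl(@list, @tab)

Any other character that cannot appear in the data works just as well; the point is only that the delimiter must be chosen with the data in mind.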

Extra Spacing

If you have an input list that goes: ALFKI, VINET, BERGS,FRANK, do you want those extra spaces to be included in the data returned by the list-to-table function? Probably not. All functions in this article strip trailing and leading spaces from list elements. However, there are some methods where this is not possible. (Or more precisely, they are not able to handle inconsistent spacing.)

Illegal Input

Say that you have a function that accepts a list of integers, and the input is 9, 12, a, 23, 12. What should happen? With no particular coding, SQL Server will give you a conversion error, and the batch will be aborted. If you prefer, you can add checks to your function so that the illegal value is ignored or replaced with NULL. To focus on the main theme, I have not added such checks to the functions in this article.

Empty Elements

What if your function that accepts a list of integers is fed the input 9, 12,, 23, 12? How should that double comma be interpreted? If you just do a simple-minded convert, you will get a 0 back, which is not really good. It would be better to return NULL or just leave out the element. (Raise an error? You cannot raise errors in functions.) One approach I have taken in some functions in this article is to avoid the problem altogether by using space as delimiter. But since T-SQL does not provide a function to collapse internal spacing, the approach is not without problems. For methods that build on the logic of traditional programming, you can easily handle multiple spaces, but for methods that use a combination of charindex and set-based logic, you would still have to filter out empty elements in the WHERE clause. (Something I have not done in this article.)

Performance Considerations

While I have conducted performance tests and devoted a long appendix to them, the important performance aspect is not the methods themselves, but how you use them and how they are packaged. In this section I will look into some important issues.

varchar vs. nvarchar

As I discussed in the Interface section, it appears to be a good choice for a function that unpacks a list of strings to have an nvarchar column in its return table, so it can work with both Unicode and 8-bit data. Functionally, it's sound. Performance-wise it can be a disaster. Say that you have:
SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l ON t.indexedvarcharcol = l.nvarcharcol

Why is this bad? Recall that SQL Server has a strict data-type precedence, which says that if two values of different data types meet, the one with lower precedence is converted to the higher type. varchar has lower precedence than nvarchar. (Quite naturally, since the set of possible values for a varchar column is a subset of the possible values for an nvarchar column.) Thus, in the query above indexedvarcharcol will be converted to nvarchar. The cost of this depends on the collation of the column. If the column has a Windows collation, SQL Server will still be able to use the index, because in a Windows collation the possible varchar values come in the same order as the nvarchar values. SQL 2005 takes advantage of this by doing a range seek. But this range seek is more expensive than a straight seek, and my tests indicate a doubling of the execution time, which of course is reason enough to avoid it. The real disaster is if the column has an SQL collation: in this case the index is completely useless, because in an SQL collation, varchar and nvarchar values sort differently. This means that SQL Server will have to find some other way to run the query, most likely one that requires a scan of the entire table. Instead of a sub-second response, it may take minutes to run the query. Thus, it is essential that you have separate functions to return varchar and nvarchar data. Or do as I have done in some functions in this article: have two return columns, one for varchar and one for nvarchar. You can also always try to remember to convert the output column to the appropriate data type, but it's human to forget. You can find out the collation of a column with sp_help.
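To spell out that last alternative, here is a minimal sketch of my own (the table and column names are made up for illustration, as in the fragments above) of converting the output column explicitly, so that the indexed varchar column is not converted to nvarchar:

-- Hypothetical names; the point is the explicit convert on the function's
-- nvarchar output, which keeps the index on the varchar column usable.
SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l
  ON   t.indexedvarcharcol = convert(varchar(20), l.nvarcharcol)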

There is one more thing to say on this theme, which I will come back to in a minute.

Inline, Multi-Statement and Temp Tables

This far I have only said "table function", but there are different kinds of table functions, and you can use them in more than one way. And this can have a very significant impact on performance. Essentially there are five possibilities:

SQL Inline. An inline table function in T-SQL is only a function by its syntax. It is in fact a parameterised view (see the sketch after this list). When a query includes an inline function, SQL Server expands the function as if it were a macro, and the optimizer works with the expanded query text. Thus, there is no overhead at all for an inline function. To this category I also count dynamic SQL, although it cannot be packaged into a function as such.

Multi-statement Function. A multi-statement function, on the other hand, has a body that is executed on its own. A multi-statement function is computed separately and returns its result in a table variable. Consequently, there is an overhead for the intermediate storage. There are no statistics associated with table variables, so the optimizer is blind to what the function returns, and it can only apply standard assumptions.

Opaque Inline. Non-SQL methods such as the CLR or XML fall in between the two T-SQL function types. On the one hand, the optimizer has no information about what they might return, and applies standard assumptions. On the other hand, the results of the operations are streamed into the query. That is, there is no intermediate storage as for multi-statement functions, so they are still inline in that sense.

Bounce Data over a Table Variable. Rather than joining with the function directly, you can insert the data into a table variable and then join that with your target table. If you do this with a T-SQL inline function, it is very similar to rewriting that function as a multi-statement function, including the important note that in lieu of statistics the optimizer can only apply standard assumptions. (I get the impression from my test results that the overhead of a multi-statement function is higher than when you just bounce over a table variable, but I have not examined this in detail.)

Bounce Data over a Temp Table. Instead of joining with a function directly, you can unpack your list into a temp table. It is often said that a temp table incurs more overhead than a table variable, since it's fully logged, but from my tests I am not really able to confirm this statement. More importantly, temp tables have statistics, which means that the optimizer has more information and the odds for a good plan are better.
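To make the first category concrete, here is a minimal sketch of my own of what a T-SQL inline table function looks like (the function name and query are invented for illustration; none of the article's split functions is this trivial):

-- An inline table function: just a RETURN of a single SELECT, which the
-- optimizer expands into the calling query like a parameterised view.
CREATE FUNCTION cheap_products (@maxprice money)
RETURNS TABLE AS
RETURN (SELECT ProductID, ProductName
        FROM   Northwind..Products
        WHERE  UnitPrice <= @maxprice)

You would use it as SELECT * FROM cheap_products(20), and the plan is the same as if you had written the underlying SELECT directly in the query.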

No method can be implemented in all five ways, but for most you at least have the choice of a multi-statement function or a temp table. Only some methods lend themselves to either of the two inline alternatives.

It may sound from the above that SQL Inline is the most preferable, since there is no intermediate table and the optimizer has full information. Unfortunately, the latter point is not really true. To be able to estimate the best way to access the other tables in the query, the optimizer would need to know: 1) how many rows the input string will generate, and 2) the distribution of the values. But the optimizer is not able to do that with the SQL inline functions in this article, because the information is buried too deep in the logic of these functions. (There is one exception: dynamic SQL, which has its own set of problems that make it less palatable.) Thus, in practice the optimizer will apply blind assumptions no matter whether you use SQL inline, opaque inline, a multi-statement function or a table variable.

So then it does not matter which one you use? Oh, no. The blind assumptions are different for the different inline methods, and if the function uses an auxiliary table of numbers, the size of that table will affect the blind assumptions. (Because the optimizer has information about that table, it is able to use it.) And the blind assumptions for CLR functions and XML are different from each other and from those for T-SQL inline functions. The assumptions for multi-statement functions and table variables appear to be the same vis-à-vis each other, but yet different from the inline methods. With some luck, if the blind assumption for one method leads the optimizer astray, it may work better with another. Overall, you could say that the T-SQL inline functions have the potential for more "interesting" query plans. Here is an extract from a mail that I received in response to my old article for SQL 2000:

After reading your article 'Arrays and list in SQL server' I tried to use the Fixed-Length Array Elements method in my application. Everything worked fine until I moved the code from a client batch to a stored procedure. When I looked at the query execution plan, I saw that the number of rows retrieved from the Numbers table was over 61 million! Instead of starting by joining the Numbers table with the source table to filter out the 500 rows included in the array, it processes the GROUP BY clause on the entire table (121000 rows) and then it uses a nested loop to match each entry with the Numbers table.

With all other methods but T-SQL inline, the optimizer does not really have much other choice than to first compute the table from the list input, but some T-SQL inline functions open up the possibility of a reverse strategy, which is not likely to be successful. In this particular case, I suggested that he should try a multi-statement function instead, and that resolved his issue. But "interesting" does not always mean bad. Later in the text, I will discuss cases where T-SQL inline gives unexpectedly good performance on multi-CPU machines, because the optimizer finds a parallel plan.

What about temp tables then? Initially, when the optimizer compiles a stored procedure, it makes a blind assumption about a temp table. But if a sufficient amount of data is inserted into the table, this will trigger auto-statistics, which in turn will trigger a recompile of the statements where the temp table is referenced. This recompile is both a blessing and a curse. It's a blessing, because

it gives the optimizer a second chance to find a better plan. But if the optimizer comes up with the same plan as it had before, it was just wasted cycles. In SQL 2000, where the entire procedure would always be recompiled, this could be really expensive. SQL 2005 has statement recompile, so the effect may not be equally drastic.

At this point the reader may feel both confused and uncomfortable over all these complications. In practice, it is not really that bad. Often, these blind assumptions work fairly well, particularly if your input lists are small. So go for a method that you think fits your needs, and stick with it as long as you don't run into problems. When you do run into bad performance, come back and read this section again, to get an idea of what alternatives you should try. One rule of thumb is that the bigger the input list is, the more reason you have to consider using a temp table.

Here is an overview of which strategies are possible with which methods:

                    T-SQL Inline   Opaque Inline   Multi-Statement   Table Variable   Temp Table
Iterative Method    No             No              Yes               Yes              Yes
CLR                 No             Yes             No                Yes              Yes
XML                 No             Yes             No                Yes              Yes
Table of Numbers    Yes            No              Yes               Yes              Yes
Fixed-Length        Yes            No              Yes               Yes              Yes
Recursive CTE       Yes            No              Yes               Yes              Yes
List to SELECT      No             No              No                Yes              Yes
Dynamic SQL         Yes            No              No                No               No
Real Slow           Yes            No              No                No               No
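As a concrete illustration of the "bounce over a temp table" strategy, here is a minimal sketch of my own, using the iter_intlist_to_tbl function presented later in this article (the procedure name is invented for the example):

CREATE PROCEDURE get_product_names_temptbl @ids varchar(50) AS
   -- Unpack the list into a temp table first; once enough rows go in,
   -- auto-statistics give the optimizer real information for the join.
   CREATE TABLE #ids (number int NOT NULL)

   INSERT #ids (number)
      SELECT number FROM iter_intlist_to_tbl(@ids)

   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   #ids i ON P.ProductID = i.number
go
EXEC get_product_names_temptbl '9 12 27 37'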

A Caching Problem with SQL Inline

Consider this procedure:


CREATE PROCEDURE test_sp @str nvarchar(MAX) AS
   SELECT t.col1, t.col2
   FROM   testtbl t
   JOIN   inline_split_me(@str) c ON t.id = c.Value

As you can guess from the name, inline_split_me is an inline function. SQL Server MVP Tony Rogerson discovered that there is a problem here: the query plan for the procedure is not put into cache as it should be, which means that the procedure is compiled each time it is executed. This is bad not only for the cost of the unnecessary compilations: if there are multiple simultaneous calls to this procedure, these calls will be serialised, because if one process (re)compiles a procedure, no other process can execute that procedure but will be blocked until compilation has completed. There is no good reason for this behaviour; an SQL Server developer has confirmed to me that this is a bug in SQL 2005 and nothing else. This problem appears only with T-SQL inline functions, not with multi-statement functions. And it only appears if the input variable is of a MAX data type. (Or the arcane text, ntext or image.) If the input variable is a regular nvarchar(4000), the issue does not arise. A workaround is to copy the input parameter to a local variable like this:
CREATE PROCEDURE test_sp @str nvarchar(MAX) AS
   DECLARE @copy nvarchar(MAX)
   SELECT @copy = @str

   SELECT t.col1, t.col2
   FROM   testtbl t
   JOIN   inline_split_me(@copy) c ON t.id = c.Value

On the other hand, bouncing the data over a temp table or a table variable does not seem to help. The problem does not appear with all inline functions in this article, only with those for table-of-numbers and recursive CTE, but not with those for fixed length. However, this may be due to pure luck, so I encourage you to examine with SQL Profiler whether your procedure falls victim to this bug. Enable all events for stored procedures. If everything is alright, you should see a CacheInsert event the first time you run the procedure, and on subsequent executions you should see a CacheHit event. If you see a CacheMiss every time but no CacheInsert, you have encountered this bug.

MAX Types vs. Regular (n)varchar

In a previous section I discussed the problems with joining nvarchar and varchar. When I ran my performance tests and investigated some unexpected results, I discovered a second problem of a similar kind. Consider this:
SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l ON t.indexednvarcharcol = l.nvarcharmaxcol

The list-to-table function is here written in such a way that its return type is nvarchar(MAX). This too leads to an implicit conversion of the indexed column. It may not be apparent at first sight that it has to be that way, but when SQL Server evaluates an expression, it always works with the same data type for all operands. And apparently, nvarchar(4000) and shorter is a different data type from nvarchar(MAX). The result of the implicit conversion is not fatal. The optimizer applies a range-seek operator and is still able to use the index, but nevertheless there is an overhead. When I initially ran my tests, I had not observed this issue, and my inline functions returned nvarchar(MAX) (for the simple reason that the input string was nvarchar(MAX)). As a consequence, my tests in some cases seemed to indicate that inline functions performed worse than the corresponding multi-statement solutions. Presumably, most of the time when you use list-to-table functions for a list of strings, the strings are short, just a few characters long. Therefore, there is every reason to make sure that your list-to-table function returns a regular varchar or nvarchar. Particularly, this means that for inline functions, you should make sure that the return value is explicitly converted to regular (n)varchar. You will see this in all inline functions in this article.

Collations

All functions in this article use nvarchar for parameters, output and internal variables. If you never work with Unicode data, you may think that you should rewrite the functions to use varchar instead, assuming that 8-bit characters are faster for SQL Server to work with than 16-bit Unicode characters. This may or may not be the case, depending on which collation you are using. As I discussed above under varchar vs. nvarchar, there are two sorts of collations: Windows collations and SQL collations. If you use a Windows collation, you get a slight reduction in performance if you use varchar rather than nvarchar. This is because with a Windows collation, the Unicode rules and routines are always employed internally, so all that using varchar buys you is some extra conversions. On the other hand, with an SQL collation you can get some 30 % improvement in execution time by using varchar instead. This is because SQL collations are 8-bit only, for which there exists a separate set of 8-bit-only routines, and the rules for an 8-bit character set are far simpler than those for Unicode. If you have an SQL collation and use nvarchar, you are in fact using a Windows collation under the cover. Note here that the exact gain depends on the type of operation. 30 % is what you can expect from a plain equality test. There are situations where the difference between varchar and nvarchar in an SQL collation can be as much as a factor of 7. We will look at such a case in the section on really slow methods.

But there is an option for better performance with nvarchar: use a binary collation. Now, if you opt to use a binary collation throughout your database, you will have to accept that all comparisons are case-sensitive and that sorting is funky, particularly for languages other than English. So for most applications, a binary collation is not a viable option. However, there exists a second possibility: force the collation for a certain expression. I have employed this throughout this article where it makes sense.
You will see a lot of things like:

charindex(@delimiter COLLATE Slovenian_BIN2, @list, @pos + 1)

Since @delimiter is cast to an explicit collation, this also happens with @list. (This is discussed in Books Online in the topic Collation Precedence.) When using charindex to find a delimiter, odds are good that you are looking for the exact delimiter and have no need for case- or accent-insensitive searches. Thus, using a binary collation in this situation does not lead to any loss in functionality. When I tested this for the iterative method, I got some 10 % improvement in execution time. (Why Slovenian? And why BIN2? It's Slovenian because my test data is a list of Slovenian words for a spelling dictionary. BIN2 is a new type of binary collation in SQL 2005. I never really grasped the difference from the old binary collations, but they appeared to be the better choice. Anyway, it should not matter much which binary collation you use.)

Unpacking Lists in a Table

Most examples in this article work with a single list being passed as a parameter into a function, and this is probably the most common case. But sometimes you find yourself working with a table like this one:

Modelid   Colours
A200      Blue, Green, Magenta, Red
A220      Blue, Green, Magenta, Red, White
A230      Blue, Green, Magenta, Red, Cyan, White, Black
B130      Brown, Orange, Red
B150      Yellow, Brown, Orange, Red

That is, the available colours for a model appear as a comma-separated list. Let me directly emphasise that this is an extremely poor design that violates a very basic principle of relational databases: no repeating groups. As a consequence of this, tables with this design are often very painful to work with. If you encounter this design, you should seriously consider changing the data model, so there is a sub-table with one row for each model and colour. Then again, to get there, you need to be able to split up these lists into rows. In SQL 2000, it was not possible to call a table-valued function and pass a table column as parameter, but SQL 2005 adds the APPLY operator that permits you to do this. Here is a script that creates the table above and then unpacks it with a query:
CREATE TABLE models (modelid char(4)      NOT NULL,
                     -- other columns like modelname etc.
                     colours varchar(200) NOT NULL,
                     CONSTRAINT pk_models PRIMARY KEY (modelid))
go
INSERT models (modelid, colours)
   SELECT 'A200', 'Blue, Green, Magenta, Red'
   UNION
   SELECT 'A220', 'Blue, Green, Magenta, Red, White'
   UNION
   SELECT 'A230', 'Blue, Green, Magenta, Red, Cyan, White, Black'
   UNION
   SELECT 'B130', 'Brown, Orange, Red'
   UNION
   SELECT 'B150', 'Yellow, Brown, Orange, Red'
go
SELECT m.modelid, t.str AS colour
FROM   models m
CROSS  APPLY iter_charlist_to_tbl(m.colours, ',') AS t
ORDER  BY m.modelid, t.str

(The code for iter_charlist_to_tbl will appear shortly.) Just like the JOIN operator, APPLY takes two table sources as its input. A table source is anything that exposes columns like a table: a view, a table-valued function, a derived table, a rowset function or a common table expression. (The latter is another new feature that I will return to in this article.) With JOIN the table sources have to be autonomous of each other: for instance, a table-valued function cannot take a parameter from a table on the left side. But this is exactly what APPLY permits. The function is evaluated once for each row in the table source on the left side, and that row is exploded into as many rows as the table source on the right side evaluates to. For completeness, I should add that there are two forms of APPLY: CROSS APPLY and OUTER APPLY. The difference lies in what happens when the table source on the right-hand side returns no rows at all. With CROSS APPLY the row from the left side is lost, with OUTER APPLY it is retained. For more information on APPLY, see the topics FROM and Using Apply in Books Online.
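To illustrate the difference, here is a small sketch of my own against the models table above (the W% filter is made up for the example; iter_charlist_to_tbl is the function that appears shortly):

-- Keep only colours starting with W. With CROSS APPLY, models that have no
-- such colour would disappear from the result; with OUTER APPLY they are
-- retained with a NULL colour.
SELECT m.modelid, t.str AS colour
FROM   models m
OUTER  APPLY (SELECT str
              FROM   iter_charlist_to_tbl(m.colours, ',')
              WHERE  str LIKE 'W%') AS t
ORDER  BY m.modelid, t.str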

The Iterative Method

I will now describe the various methods to unpack lists into a table, one by one in detail. I've opted to start with the iterative method. This is because, together with using a recursive CTE, the iterative method is the one that requires the least preparation. Just create the function and you are on the air. Most other methods require some extra step. Another advantage is that the code for the iterative method is very easy to understand, not least if you have a background in C or Visual Basic. That makes things easy if you need to adapt the code to a special input format. This is far from the fastest method in the bunch, but as long as you mainly work with lists of reasonable length, you will find performance acceptable. If you often work with lists of several thousand elements, you should probably investigate some of the faster methods.

List-of-integers

You have already seen an example of the iterative method in the beginning of this article, but I repeat it here:
CREATE FUNCTION iter$simple_intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS
BEGIN
   DECLARE @pos      int,
           @nextpos  int,
           @valuelen int

   SELECT @pos = 0, @nextpos = 1
   WHILE @nextpos > 0
   BEGIN
      SELECT @nextpos = charindex(',', @list, @pos + 1)
      SELECT @valuelen = CASE WHEN @nextpos > 0
                              THEN @nextpos
                              ELSE len(@list) + 1
                         END - @pos - 1
      INSERT @tbl (number)
         VALUES (convert(int, substring(@list, @pos + 1, @valuelen)))
      SELECT @pos = @nextpos
   END
   RETURN
END

The idea is simple. We iterate over the string, look for commas, and then extract the values between the commas. Note the use of the third parameter to charindex; this specifies the position where to start searching for the next comma. The computation of @valuelen includes the only complexity: we must cater for the fact that charindex will return 0 when there are no more commas in the list. However, this function is slower than it has to be. When I wrote the same function for SQL 2000 some years back, I had to apply a technique where I broke up the input into chunks. This was necessary because in SQL 2000 there is no nvarchar(MAX), only ntext, and charindex operates only within the first 8000 bytes of an ntext value. I had hoped that with nvarchar(MAX) chunking would not be necessary, but testing showed that by using chunks of nvarchar(4000) values, I could improve performance by 20-30 %. The culprit is charindex: it's slower on nvarchar(MAX) than on nvarchar(4000). Why, I don't know, but since nvarchar(MAX) values can be up to 2 GB in size, I assume that charindex needs a more complex implementation for nvarchar(MAX). There is a second problem with iter$simple_intlist_to_tbl: if you for some reason feed it two consecutive commas, that will give you a 0 in the output, which isn't really good. While you can easily address this by adding some extra logic to the function, my preference is to avoid the problem by using space as a separator. The comma does not really serve any purpose for a list of integers. So here is a better implementation of the iterative method for a list of integers:
CREATE FUNCTION iter_intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                       number  int NOT NULL) AS
BEGIN
   DECLARE @startpos int,
           @endpos   int,
           @textpos  int,
           @chunklen smallint,
           @str      nvarchar(4000),
           @tmpstr   nvarchar(4000),
           @leftover nvarchar(4000)

   SET @textpos = 1
   SET @leftover = ''
   WHILE @textpos <= datalength(@list) / 2
   BEGIN
      SET @chunklen = 4000 - datalength(@leftover) / 2
      SET @tmpstr = ltrim(@leftover + substring(@list, @textpos, @chunklen))
      SET @textpos = @textpos + @chunklen

      SET @startpos = 0
      SET @endpos = charindex(' ' COLLATE Slovenian_BIN2, @tmpstr)

      WHILE @endpos > 0
      BEGIN
         SET @str = substring(@tmpstr, @startpos + 1,
                              @endpos - @startpos - 1)
         IF @str <> ''
            INSERT @tbl (number) VALUES (convert(int, @str))
         SET @startpos = @endpos
         SET @endpos = charindex(' ' COLLATE Slovenian_BIN2, @tmpstr,
                                 @startpos + 1)
      END

      SET @leftover = right(@tmpstr, datalength(@tmpstr) / 2 - @startpos)
   END

   IF ltrim(rtrim(@leftover)) <> ''
      INSERT @tbl (number) VALUES (convert(int, @leftover))
   RETURN
END

Here is an example of how you would use this function:


CREATE PROCEDURE get_product_names_iter @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   iter_intlist_to_tbl(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_iter '9 12 27 37'

This function has two loops: one that creates the chunks, and one that iterates over the chunks. The first chunk is always 4000 characters (provided that the input is that long, that is). As we come to the end of a chunk, we are likely to be in the middle of an element, which we save in @leftover. We bring @leftover with us to the next chunk, and for this reason we may grab fewer than 4000 characters from @list this time. When we have come to the last chunk, @leftover is simply the last list element. Multiple spaces are handled by simply ignoring @str if it's blank. There are two things from the general considerations that I have added to this function that were not in iter$simple:

1. The output table includes the list position.
2. I use the COLLATE clause to force a binary collation to gain some further performance.

I should also note that there is a minor performance improvement over the version that appears in the SQL 2000 version of this article. Sam Saffron pointed out to me that I kept reallocating the string rather than using the third parameter of charindex.

List-of-strings

Here is a similar function, but one that returns a table of strings.
CREATE FUNCTION iter_charlist_to_tbl (@list      nvarchar(MAX),
                                      @delimiter nchar(1) = N',')
   RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                       str     varchar(4000)      NOT NULL,
                       nstr    nvarchar(2000)     NOT NULL) AS
BEGIN
   DECLARE @endpos   int,
           @startpos int,
           @textpos  int,
           @chunklen smallint,
           @tmpstr   nvarchar(4000),
           @leftover nvarchar(4000),
           @tmpval   nvarchar(4000)

   SET @textpos = 1
   SET @leftover = ''
   WHILE @textpos <= datalength(@list) / 2
   BEGIN
      SET @chunklen = 4000 - datalength(@leftover) / 2
      SET @tmpstr = @leftover + substring(@list, @textpos, @chunklen)
      SET @textpos = @textpos + @chunklen

      SET @startpos = 0
      SET @endpos = charindex(@delimiter COLLATE Slovenian_BIN2, @tmpstr)

      WHILE @endpos > 0
      BEGIN
         SET @tmpval = ltrim(rtrim(substring(@tmpstr, @startpos + 1,
                                             @endpos - @startpos - 1)))
         INSERT @tbl (str, nstr) VALUES (@tmpval, @tmpval)
         SET @startpos = @endpos
         SET @endpos = charindex(@delimiter COLLATE Slovenian_BIN2,
                                 @tmpstr, @startpos + 1)
      END

      SET @leftover = right(@tmpstr, datalength(@tmpstr) / 2 - @startpos)
   END

   INSERT @tbl (str, nstr)
      VALUES (ltrim(rtrim(@leftover)), ltrim(rtrim(@leftover)))
   RETURN
END

An example of how you would use this function:


CREATE PROCEDURE get_company_names_iter @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   iter_charlist_to_tbl(@customers, DEFAULT) s ON C.CustomerID = s.nstr
go
EXEC get_company_names_iter 'ALFKI, BONAP, CACTU, FRANK'

This function is similar to iter_intlist_to_tbl. I've added a parameter to specify the delimiter. When you invoke a user-defined function in SQL Server, you cannot leave out a parameter, not even if it has a default value, but you can specify the keyword DEFAULT to use the default value. Note also that I trim off space that appears directly adjacent to the delimiters, but space within the list elements is retained.

I would like to draw attention to the use of datalength. There are two system functions in T-SQL to return the length of a string: len returns the number of characters and does not count trailing spaces; datalength returns the number of bytes (whence all these / 2) and includes trailing spaces. I'm using datalength here, since there is no reason to ignore trailing spaces in the chunks; they could be in the middle of a list element.

Using the CLR

Introducing the CLR

SQL 2005 added the possibility to create stored procedures, functions etc. in .Net languages such as C# and Visual Basic .Net, or any language that supports the Common Language Runtime. If you have never worked with the CLR before, you may find that this method goes a little over your head, and you may prefer to use a pure SQL method. On the other hand, if you are a seasoned C# or VB programmer, you will surely appreciate this method. Just like the iterative method, this method lends itself very easily to modifications to adapt to special input formats.

There are many ways you can abuse the CLR and use it when you should not, but a list-to-table function is a prime example of what the CLR is good for: operations that do not perform any data access, but that perform complex manipulations of strings or numbers. The reason for this is two-fold: 1) the CLR gives you a much richer set of functions to work with, regular expressions just to name one; 2) the CLR languages are compiled, while T-SQL is interpreted, leading to much better performance with the CLR. In the realm of table-valued functions there is another factor that improves performance: the output from a table-valued CLR function is not written into any intermediate storage, but the rows are fed into the rest of the query as soon as they are produced. So in that sense they are inline functions, but in difference to T-SQL's own inline functions, the optimizer has no idea what they will produce, which is why I refer to them as opaque inline.

A special quirk of CLR functions is that they cannot return varchar data; you can only return nvarchar. This means that when you work with lists of strings, you must always be careful to remember to convert the output to varchar when you join with varchar columns in tables, as I discussed in the section varchar vs. nvarchar.

By default, SQL 2005 ships with the CLR disabled. You can enable it from the Surface Area Configuration tool or by running
EXEC sp_configure 'CLR enabled', 1
RECONFIGURE

from a query window.

In the following I will try to give a crash course in how to write a table-valued function in the CLR. Seasoned .Net programmers may find it inaccurate in points; I will have to admit that the only CLR table functions I've written are those I wrote for this article. (I would have preferred to refer you to Books Online, but I found the topic on CLR table-valued functions in Books Online to be far too terse.) In the interest of brevity, I'm only including examples in C#.

CLR Functions Using Split

We will look at two ways of implementing a list-to-table function in the CLR. The first one, with very little of our own code, serves as an introduction to CLR table functions. In the second alternative, we will roll our own, which opens up for a higher degree of flexibility.

The Code

A complete C# file that implements two list-to-table functions, one for strings and one for integers, need be no longer than this:
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class CLR_split
{
    [SqlFunction(FillRowMethodName="CharlistFillRow")]
    public static IEnumerable CLR_charlist_split(SqlString str, SqlString delimiter)
    {
        return str.Value.Split(delimiter.Value.ToCharArray(0, 1));
    }

    public static void CharlistFillRow(object row, out string str)
    {
        str = (string)row;
        str = str.Trim();
    }

    [SqlFunction(FillRowMethodName="IntlistFillRow")]
    public static IEnumerable CLR_intlist_split(SqlString str)
    {
        return str.Value.Split((char[]) null,
                               System.StringSplitOptions.RemoveEmptyEntries);
    }

    public static void IntlistFillRow(object row, out int n)
    {
        n = System.Convert.ToInt32((string) row);
    }
}

Compile and Install

To compile this, open a command-prompt window and make sure that you have C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727 or corresponding in your path. (Later versions of the .Net Framework will work as well, but 1.1 will not.) Assuming that the name of the file is CLR_split.cs, the command is:

csc /target:library CLR_split.cs

This gives you CLR_split.dll. If your SQL Server is not on your local machine, you will have to copy the DLL to the server box, or make the DLL visible from the server in some way. Then run from a query window:
CREATE ASSEMBLY CLR_split FROM 'C:\somewhere\CLR_split.dll'
go
CREATE FUNCTION CLR_charlist_split(@list  nvarchar(MAX),
                                   @delim nchar(1) = N',')
   RETURNS TABLE (str nvarchar(4000))
   AS EXTERNAL NAME CLR_split.CLR_split.CLR_charlist_split
go
CREATE FUNCTION CLR_intlist_split(@list nvarchar(MAX))
   RETURNS TABLE (number int)
   AS EXTERNAL NAME CLR_split.CLR_split.CLR_intlist_split
go

(Note: it is also possible to deploy the functions from Visual Studio, but I can't show you how to do that, because I don't know how to do it myself. Visual Studio mainly leaves me in a maze, and at the same time I find the command line very simple to use. What I have been told is that VS may require you to add extra attributes to the functions if you want to deploy them that way.) You have now created the functions and can use them from T-SQL. Here is an example for both:
CREATE PROCEDURE get_company_names_clr @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   listtest..CLR_charlist_split(@customers, DEFAULT) s
     ON   C.CustomerID = s.str
go
EXEC get_company_names_clr 'ALFKI, BONAP, CACTU, FRANK'
go
CREATE PROCEDURE get_product_names_clr @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   CLR_intlist_split(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_clr '9 12 27 37'

As with iter_intlist_to_tbl, CLR_intlist_split takes a space-separated list of integers.

What's Going On?

If you have never worked with CLR table functions before, you may at this point wonder how this all works, and I will try to explain. CREATE ASSEMBLY loads the DLL into SQL Server. Note that it does not merely save a pointer to the file; the DLL as such is stored in the database. Since CREATE ASSEMBLY operates from SQL Server, the file path refers to the drives on the server, not on your local machine. (If you are loading the assembly from a network share, it's better to specify the location by \\servername than by drive letter.) It is also possible to load an assembly as a hex-string.

The CREATE FUNCTION statements look just like the statements for creating multi-statement functions. That is, you specify the parameter list and the return table. But instead of a body, AS is followed by EXTERNAL NAME, where you specify the CLR method to use. This is a three-part name where the first part is the assembly, the second part is a class within the assembly, and the last part is the name of the method itself. In this example, I'm using the same name for the assembly in SQL Server as I do for the class. There is one small detail on the return table: for a multi-statement function you can specify that a column is nullable, and you can define CHECK and DEFAULT constraints and define a PRIMARY KEY. This is not possible for CLR functions.

If we turn to the C# code, the table-valued function is implemented through two C# methods. The first method is the one that we point to in the CREATE FUNCTION statement. The second method is specified through the attribute that comes first in the definition of the first method, that is, this line:

[SqlFunction(FillRowMethodName="CharlistFillRow")]

This line specifies that the method is a table-valued function and points to the second method of the function. CLR_charlist_split is the entry point and is called once. The entry point must return a collection or an enumerator, and the CLR will call the method specified in FillRowMethodName once for every element in the collection/enumerator, and each invocation produces a row in the output table of the function.

So this is what happens when you call CLR_charlist_split from T-SQL. The C# method calls the String method Split, which splits the string into a collection over a delimiter. (For full details on Split I refer you to the .Net Framework SDK in MSDN Library.) Since you get a collection, you need do no more. The CLR calls CharlistFillRow for each element in the collection. And as I noted above, as soon as a row is produced, it can be consumed in the outer query, without waiting for the table function to complete.

What about parameters? As you may guess, the parameter list of the entry method must agree with the parameter list in the CREATE FUNCTION statement. The exact rules for mapping SQL data types to those of the CLR are beyond the scope of this text; please refer to Books Online for the full details.

The first parameter of the fill method (CharlistFillRow) is of the type object. This is the current element in the collection/enumeration, and to use it, you will need to cast it to the real type. The remaining parameters to the fill method are all output parameters, and they map to the output table in the CREATE FUNCTION statement. One more thing calls for attention: the return type of the entry function. In this example it is IEnumerable, since Split returns a collection. The only other alternative is IEnumerator, which we will look at shortly.

Back on Track

After this excursion into the CLR, let's try to get back to the topic of list-to-table functions. What are the characteristics of the list-to-table functions that use Split? As you can see, we don't return the list position. As far as I know, you cannot get the list position this way, but I will have to admit that I have not dug into it. Overall, Split puts you pretty much in a straitjacket. You can be as flexible as Split is. That includes specifying an alternate string delimiter, which can be multi-character, and you can specify that empty elements should not be returned (which I make use of in the function for integers). But that's it.

Rolling Our Own in the CLR

Instead of relying on Split, we can do the work ourselves. The advantage of this is that we gain flexibility. Here is a C# file that is longer than the previous one:
using System;
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class CLR_iter
{
    private class stringiter : IEnumerator
    {
        string _str;
        char   delim;
        int    _start_ix;
        int    _end_ix;
        int    _listpos;

        public string str      { get { return this._str; } }
        public int    start_ix { get { return this._start_ix; } }
        public int    end_ix   { get { return this._end_ix; } }
        public int    listpos  { get { return this._listpos; } }

        public stringiter(SqlString str, SqlString delimiter)
        {
            this._str  = str.IsNull ? "" : str.Value;
            this.delim = delimiter.IsNull ? '\0'
                                          : delimiter.Value.ToCharArray(0, 1)[0];
            Reset();
        }

        public bool MoveNext()
        {
            this._start_ix = this._end_ix + 1;
            if (delim == ' ')
            {
                while (this._start_ix < this._str.Length &&
                       this.str[this._start_ix] == ' ')
                {
                    this._start_ix++;
                }
            }
            if (this._start_ix >= this._str.Length)
            {
                return false;
            }
            this._end_ix = this.str.IndexOf(this.delim, this._start_ix);
            this._listpos++;
            if (this.end_ix == -1)
            {
                this._end_ix = this._str.Length;
            }
            return true;
        }

        public Object Current
        {
            get { return this; }
        }

        public void Reset()
        {
            this._start_ix = -1;
            this._end_ix   = -1;
            this._listpos  = 0;
        }
    }

    [SqlFunction(FillRowMethodName="CharlistFillRow")]
    public static IEnumerator CLR_charlist_iter(SqlString str, SqlString delimiter)
    {
        return new stringiter(str, delimiter);
    }

    public static void CharlistFillRow(object obj, out int listpos, out string str)
    {
        stringiter iter = (stringiter) obj;
        listpos = iter.listpos;
        str = iter.str.Substring(iter.start_ix, iter.end_ix - iter.start_ix);
        str = str.Trim();
    }

    [SqlFunction(FillRowMethodName="IntlistFillRow")]
    public static IEnumerator CLR_intlist_iter(SqlString str, SqlString delimiter)
    {
        return new stringiter(str, delimiter);
    }

    public static void IntlistFillRow(object obj, out int listpos, out int number)
    {
        stringiter iter = (stringiter) obj;
        listpos = iter.listpos;
        string str = iter.str.Substring(iter.start_ix, iter.end_ix - iter.start_ix);
        number = System.Convert.ToInt32(str);
    }
}

The key is the internal class stringiter. First note the class declaration itself:

private class stringiter : IEnumerator

This means that the class implements the IEnumerator interface, which is a requirement for a table-valued function. (That, or IEnumerable.) Next follow internal class variables, and property methods to read these values from outside the class. The next piece of interest is the constructor:

public stringiter(SqlString str, SqlString delimiter)

This method creates an instance of the stringiter class. This constructor is called from the entry-point method of the table-valued function. Next follow MoveNext, Current and Reset. These are the methods that implement the IEnumerator interface, and they must have precisely the names and signatures that you see above. (For more details on IEnumerator, I refer you to the .Net Framework SDK.) The interesting action goes on in MoveNext. It is here we look for the next list element, and it is here we determine whether we are at the end of the list. As long as MoveNext returns true, the Fill method of the table function will be called. That is, MoveNext should not return false when it finds the last element, but the next time round. (I hope that I got that right. I was not really able to conclude that from the docs, but had to play around myself.) What is interesting from a list-to-table perspective is that MoveNext handles space as separator in a special way: multiple spaces are collapsed into one. This does not happen with other delimiters.

After the code for the stringiter class come two entry-point methods with accompanying fill methods, one for strings and one for integers. In difference to the previous example with Split, the entry points here return IEnumerator, since we implement IEnumerator ourselves. But similar to the Split example, all the entry points do is to create a stringiter object. The Fill methods, finally, extract the data from the current stringiter object. Most noteworthy is that I grab hold of the list position, so it can appear in the output table. For completeness, here is the SQL declaration of the functions:
CREATE FUNCTION CLR_charlist_iter(@list nvarchar(MAX), @delim nchar(1) = ',') RETURNS TABLE (listpos int, str nvarchar(4000)) AS EXTERNAL NAME CLR_iter.CLR_iter.CLR_charlist_iter go CREATE FUNCTION CLR_intlist_iter(@list nvarchar(MAX), @delim nchar(1) = ' ') RETURNS TABLE (listpos int, number int) AS EXTERNAL NAME CLR_iter.CLR_iter.CLR_intlist_iter go
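For illustration, here is a minimal sketch of a call; the query is my own addition, and it assumes that the assembly and the two functions above have been installed. It uses the Products table as in the other examples in this article:

SELECT P.ProductID, P.ProductName
FROM   Northwind..Products P
JOIN   CLR_intlist_iter(N'9 12 27 37', N' ') AS i ON P.ProductID = i.number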

Are these latter functions so much better than the first pair that used Split? Eh, no. You get the list position, and the intlist function has a delimiter parameter, but performance is virtually identical in my tests. The point is that the above gives you a framework to work with, if you need to handle more complex list formats or some other functionality not supported by the functions as given. Here I have used the same enumerator class for strings and integers, but the fancier you want to be, the better off you probably are with different classes. If you add more bells and whistles, will you have to pay in performance? Some, but unless you do something really complex, not much. For instance, if you want to support multi-character delimiters you will need to use a different overload of IndexOf, which performs a culture-sensitive search and is likely to be slower. When I tried this, I got some 5-7 % increase in execution time, which is hardly anything to be alarmed over. Even if you double the execution time, you will still be ahead of the iterative method.

A Final Caveat on CLR Performance

In my tests, the CLR was one of the absolutely fastest methods. However, all my tests were on an idle server, with only one process running the test queries. SQL Server MVPs Paul Nielsen and Adam Machanic have both reported that when they have performed multi-process tests, the CLR method has scaled a lot worse than other methods. In fact, it scaled so badly that they discarded it in favour of other methods.

XML

Using XML requires that you create an XML document for your input rather than a comma-separated list. This is a little more complex, so in most cases you may not find it worth the trouble. Where XML is really fantastic is when you need to insert many rows. You transform your data to an XML document in the client, and then you unpack it into your tables with the xml type methods.

nodes and value

Let's go straight to the matter. Here are our get_company_names and get_product_names using XML.
CREATE PROCEDURE get_company_names_xml @customers xml AS SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN @customers.nodes('/Root/Customer') AS T(Item) ON C.CustomerID = T.Item.value('@custid', 'nchar(5)') go EXEC get_company_names_xml N'<Root><Customer custid="ALFKI"/> <Customer custid="BONAP"/> <Customer custid="CACTU"/> <Customer custid="FRANK"/> </Root>' go CREATE PROCEDURE get_product_names_xml @ids xml AS SELECT P.ProductName, P.ProductID FROM Northwind..Products P JOIN @ids.nodes('/Root/Num') AS T(Item) ON P.ProductID = T.Item.value('@num', 'int') go EXEC get_product_names_xml N'<Root><Num num="9"/><Num num="12"/> <Num num="27"/><Num num="37"/></Root>'

The two xml type methods we use are nodes and value. nodes is a rowset function that returns a one-column table where each row is an XML fragment for the given path. That is, in the two examples, you get one row for each Customer or Num node. As with derived tables, you must specify an alias for the return table. You must also specify a column name for the column that nodes produces. (You can name columns in derived tables in this way too, but it's not that common to do so.) The sole operation you can perform on T.Item is to employ any of the four xml type methods: exists, query, nodes and value. Of interest to us here is value. This method extracts a single value from an XML document and returns it as a T-SQL data type. value takes two parameters: the first is a node specification for a single element or attribute. In this case we want an attribute, custid and num respectively, which is why we specify it with a @ in front. (Without the @ it would be an element specification.) The second argument to value is the T-SQL data type to return the value as.

List Position

If you need the list position, the only way to get it with XML is to embed it in the XML document itself. There is no xml type method that returns this information. (And row_number will not help you, because row_number requires an ORDER BY clause, and there is nothing to order by.)

Creating the XML Document

It may seem simple to create XML documents by adding brackets etc. in code. Maybe all that is needed is some replace in T-SQL? Please, don't even consider it. You need to use library routines to create your XML documents. With a simple comma-separated list, you have to watch out for a comma appearing in the data. With XML there are far more special characters that need to be encoded in some way. There is already code that knows about all that, so there is no reason to reinvent the wheel. Unfortunately, I cannot show you any examples, as I have never had the need to do this myself. But digging around in the MSDN Library should give you something, no matter whether you are programming in .Net or in native code.

Performance

What about performance? Just like the CLR, the data from nodes streams into the rest of the query, so XML is also an inline method. But since parsing XML is more complex than parsing comma-separated lists, it's slower. In my tests, the execution times for XML are 40-60 % higher than for the CLR, but it is twice as fast as the iterative method. In fact, XML is the fastest method that does not require any preparations in the server: you don't have to activate the CLR and you don't need to create a table of numbers. On the other hand, if the client code already produces a comma-separated list, you need to change that code. There is probably a higher performance cost for creating an XML document than a comma-separated list, but I would assume that it's not that dramatic. In any case, since that is a client-side cost, you are scaling out.

The comparisons here apply to the pure parsing of the list/document into a table. Keep in mind that the optimizer has very little idea of what your document will produce. This applies to just about all methods in this article, but XML appears to have a particularly bad reputation for resulting in poor query plans. If you run into this, you could try using an intermediate temp table.

Inserting Many Rows

I said that where XML is really good is when you need to insert many rows. So how would you do that? In fact, you have already seen the basics: you use nodes and value. Here is an example where I unpack a document with orders and order details:
DECLARE @x xml
SELECT @x =
  N'<Orders>
      <Order OrderID="13000" CustomerID="ALFKI"
             OrderDate="2006-09-20Z" EmployeeID="2">
         <OrderDetails ProductID="76" Price="123" Qty = "10"/>
         <OrderDetails ProductID="16" Price="3.23" Qty = "20"/>
      </Order>
      <Order OrderID="13001" CustomerID="VINET"
             OrderDate="2006-09-20Z" EmployeeID="1">
         <OrderDetails ProductID="12" Price="12.23" Qty = "1"/>
      </Order>
   </Orders>'

SELECT OrderID    = T.Item.value('@OrderID',    'int'),
       CustomerID = T.Item.value('@CustomerID', 'nchar(5)'),
       OrderDate  = T.Item.value('@OrderDate',  'datetime'),
       EmployeeId = T.Item.value('@EmployeeID', 'smallint')
FROM   @x.nodes('Orders/Order') AS T(Item)

SELECT OrderID   = T.Item.value('../@OrderID', 'int'),
       ProductID = T.Item.value('@ProductID',  'smallint'),
       Price     = T.Item.value('@Price',      'decimal(10,2)'),
       Qty       = T.Item.value('@Qty',        'int')
FROM   @x.nodes('Orders/Order/OrderDetails') AS T(Item)

As you see, the document can include data for several tables, and we can extract them by using the appropriate expression with nodes.

Schema-bound XML

In the examples above I have used untyped XML documents, but you can define an XML schema with CREATE XML SCHEMA COLLECTION, and instead of declaring your parameter as just xml, you could specify xml(mycollection). SQL Server will then validate that the document adheres to that schema. I don't really see the point of this for what corresponds to a comma-separated list. It could be more worthwhile for a procedure that inserts rows from a more complex document.

Element-centred XML

What you have seen above is attribute-centred XML. You can also use element-centred XML, in which case the XML document for product IDs would look like this:

<Root><Num>9</Num><Num>12</Num><Num>27</Num><Num>37</Num></Root>

For the first parameter to value you would simply give a single period to denote the current element (see the sketch at the end of this section). However, unless you have some special reason to use element-centred XML, stick to attribute-centred. In my tests, element-centred XML had 40-50 % longer execution times than attribute-centred. Furthermore, there is a serious problem with element-centred XML in SQL 2005 SP2. If you unpack an element-centred XML document into a table without a clustered index, the optimizer will choose a query plan which is a complete disaster when the document generates many rows. For more details, see a bug that I have filed on Microsoft's Connect site. The issue appears to be fixed in SQL 2005 SP3 and SQL 2008.

OPENXML

Already in SQL 2000 you could use XML, thanks to the OPENXML function. OPENXML is still around in SQL 2005, but the only reason to use it would be that you need to support SQL 2000 as well. OPENXML is bulkier to use, and performance is nowhere near nodes. In fact, in my tests it's 30-50 % slower than the iterative method. If you want to see an example of OPENXML, please refer to the SQL 2000 version of this article.
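As promised, here is a minimal sketch of an element-centred variant of get_product_names_xml. The procedure name is my own, and the call to value with a single period follows the description above:

CREATE PROCEDURE get_product_names_xml_elem @ids xml AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   @ids.nodes('/Root/Num') AS T(Item)
     ON   P.ProductID = T.Item.value('.', 'int')
go
EXEC get_product_names_xml_elem
   N'<Root><Num>9</Num><Num>12</Num><Num>27</Num><Num>37</Num></Root>'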

Using a Table of Numbers

This is the fastest way to unpack a comma-separated list of numbers in pure T-SQL. The trick is to use an auxiliary table of numbers: a one-column table with numbers from 1 and up. Here is how you can create a table with numbers from 1 to 999 999:
CREATE TABLE Numbers (Number int NOT NULL PRIMARY KEY); WITH digits (d) AS ( SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 0) INSERT Numbers (Number) SELECT Number FROM (SELECT i.d + ii.d * 10 + iii.d * 100 + iv.d * 1000 + v.d * 10000 + vi.d * 100000 AS Number FROM digits i CROSS JOIN digits ii CROSS JOIN digits iii CROSS JOIN digits iv CROSS JOIN digits v CROSS JOIN digits vi) AS Numbers WHERE Number > 0
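If you want a quick sanity check of the result, something like this will do (the check itself is my addition):

SELECT COUNT(*) AS cnt, MIN(Number) AS minnum, MAX(Number) AS maxnum
FROM   Numbers
-- Expect cnt = 999999, minnum = 1 and maxnum = 999999.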

See the end of the section on Fixed-Length Array Elements for an alternate way to create a table of numbers.

An Inline Function

I will introduce the method by presenting an inline function, as it serves well to explain what is going on, but there are a couple of things to watch out for with this function. Later I will show you a function that uses chunking and which is safer. Below is inline_split_me, a function I originally got from SQL Server MVP Anith Sen. The version I got from Anith appeared in the SQL 2000 version of the article; the version below has a modification from Brian W Perrin:
CREATE FUNCTION inline_split_me(@param nvarchar(MAX))
RETURNS TABLE AS
RETURN(SELECT ltrim(rtrim(convert(nvarchar(4000),
                 substring(@param, Number,
                           charindex(N',' COLLATE Slovenian_BIN2,
                                     @param + N',', Number) -
                           Number)
              ))) AS Value
       FROM   Numbers
       WHERE  Number <= convert(int, len(@param))
         AND  substring(N',' + @param, Number, 1) =
                        N',' COLLATE Slovenian_BIN2)
go

While the iterative solution was somewhat long-winded but straightforward, this approach is compact, but not all readers may grasp this SQL on the first go. (I had a hard time myself.) But let's try to get a grip on what this beast is doing, and let's start with the WHERE clause. The first condition:

Number <= convert(int, len(@param))

is simple. Translated to traditional programming, this is a control loop that says "for all characters in the string". As for the convert, I will return to that. The second condition:

substring(N',' + @param, Number, 1) = N','

filters out all positions where there is a delimiter, or more precisely where a list element begins. The first element is not preceded by a delimiter, but we address that by prepending @param with a comma. Since that moves all characters one step forward, the numbers returned are those where the elements start: the character after the delimiter. Then in the SELECT list, this is the core:

substring(@param, Number, charindex(N',', @param + N',', Number) - Number)

This extracts one element. As I said above, Number is where the element starts. We then find the end of the element by searching for the next delimiter, specifying Number as the starting position for the search, that is, the third parameter to charindex. To be able to handle the end of the string, we append a comma to @param. charindex returns a position, but the third parameter to substring is the desired length of the substring. To compute this, we simply subtract the current position from the position of the next delimiter. The SELECT also has this fluff:

ltrim(rtrim(convert(nvarchar(4000), ...)))

This takes care of trimming any leading and trailing spaces. And we convert the result to nvarchar(4000), since the return type from substring is the same as the type of the input, that is, nvarchar(MAX). And, as I noted previously, joining nvarchar(MAX) with a regular nvarchar column is not good for performance. Finally, to improve speed, I force a binary collation in two places. Here is an example of how to use this particular function:
CREATE PROCEDURE get_company_names_inline @customers nvarchar(2000) AS SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN inline_split_me(@customers) s ON C.CustomerID = s.Value go EXEC get_company_names_inline 'ALFKI,BONAP,CACTU,FRANK'

An obvious problem with this function is robustness. A comma-separated list that has more than one million characters is a very long string, but nevertheless it is bound to happen sooner or later. And in that case, this function will return incorrect results, but there will be no error message. Another issue is performance. As I've already noted, charindex is slower on nvarchar(MAX) than on nvarchar(4000). And to add insult to injury, there is a string comparison for every character. The net effect is that this inline function has only half the speed of the chunking multi-statement function that we will look at in the next section.

The full story about the performance of inline_split_me is a lot more complex and quite confusing, though. You see, when I ran my performance tests I noticed that I got a parallel plan when joining with my test table. For short lists, parallelism rather added overhead. But for my longest test strings with 10 000 elements, inline_split_me suddenly performed faster than the CLR on a 4-CPU machine. (For full details on this, please see the performance appendix.) Whether or not you will get a parallel plan depends on a whole lot of things, but the size of the input parameter is not one of them. No matter the input string, the optimizer will assume that the first WHERE condition will hit 30 % of the rows in Numbers, i.e. 300 000 in this case, and the second condition with substring will reduce the estimated number of rows to around 9 500. Apparently, in combination with my test table, this triggered a parallel plan. From this follows that while the input parameter does not matter, the size of Numbers does. What also matters is a seemingly minuscule detail like:

Number <= convert(int, len(@param))

For the MAX data types, len and datalength return bigint. Since Numbers.Number is int, this column would be implicitly converted without the convert on len. The optimizer would then resolve this with a RangeSeek function, similar to what happens when varchar is converted to nvarchar. The conversion itself is not costly in this case, but for some reason it prevents the optimizer from picking the parallel plan.

The bottom line of all this? That in most cases you should avoid using inline_split_me. It is only for long input that you are likely to gain from parallelism. And if you expect multiple processes to call your stored procedure that uses inline_split_me, you don't really want all of them to try to use all CPUs simultaneously. Thus, if you consider using inline_split_me for performance reasons, you need to test it with your query and see what performance you get.

There is a second performance issue with inline_split_me, to wit, procedures that use this function are likely to be victims of the bug I discussed in the section A Caching Problem with T-SQL Inline. There is a way to make this function always fast: if you know that your input string will always fit into an nvarchar(4000), you can use this data type for the input parameter. This would also resolve the caching issue. But of course, the function would not be very robust, and this is why I have not performed any tests with nvarchar(4000) for the input parameter.

A Chunking Multi-Statement Function

As with the iterative method, breaking the input into chunks of nvarchar(4000) improves performance. Here is a multi-statement function that does that:
CREATE FUNCTION duo_chunk_split_me(@list  nvarchar(MAX),
                                   @delim nchar(1) = N',')
RETURNS @t TABLE (str  nvarchar(4000) NOT NULL,
                  nstr nvarchar(4000) NOT NULL) AS
BEGIN
   DECLARE @slices TABLE (slice nvarchar(4000) NOT NULL)
   DECLARE @slice   nvarchar(4000),
           @textpos int,
           @maxlen  int,
           @stoppos int

   SELECT @textpos = 1, @maxlen = 4000 - 2
   WHILE datalength(@list) / 2 - (@textpos - 1) >= @maxlen
   BEGIN
      SELECT @slice = substring(@list, @textpos, @maxlen)
      SELECT @stoppos = @maxlen -
                        charindex(@delim COLLATE Slovenian_BIN2,
                                  reverse(@slice))
      INSERT @slices (slice)
         VALUES (@delim + left(@slice, @stoppos) + @delim)
      SELECT @textpos = @textpos - 1 + @stoppos + 2
      -- On the other side of the comma.
   END
   INSERT @slices (slice)
      VALUES (@delim + substring(@list, @textpos, @maxlen) + @delim)

   ;WITH stringget (str) AS (
      SELECT ltrim(rtrim(substring(s.slice, N.Number + 1,
                charindex(@delim COLLATE Slovenian_BIN2,
                          s.slice, N.Number + 1) -
                N.Number - 1)))
      FROM  Numbers N
      JOIN  @slices s ON N.Number <= len(s.slice) - 1
                     AND substring(s.slice, N.Number, 1) =
                         @delim COLLATE Slovenian_BIN2
   )
   INSERT @t (str, nstr)
      SELECT str, str
      FROM   stringget

   RETURN
END

We first split up the text into slices and put these in the table variable @slices. When we created chunks for the iterative method, we did not bother if a list element was split over two chunks. But for this method we need to take precautions to avoid that, and we must make sure that the last character in a chunk is a delimiter. We first get a preliminary slice of the maximum length we can handle. Then we find the last delimiter in the slice, by feeding charindex the result of the reverse function, a neat little trick. When we insert the slice into the table variable, we make sure that there is a delimiter both before and after, so that we don't need to deal with that later. You may note that if the input text is within the limits of a regular nvarchar, we never enter the loop, but just insert the text directly into the @slices table. Once @slices is populated, we apply the logic from the inline function, although it looks slightly different, since we know that all strings in @slices start and end with the delimiter. Now the numbers filtered out through the JOIN are the positions of the delimiters, and the elements start one position further ahead. Note here that we do not need to iterate over @slices; we can join directly with Numbers. The thing that starts with WITH is a Common Table Expression. Here we only use it as a macro, so that we don't have to repeat the complex expression where we extract the string. We will look more at common table expressions later in this article. As with iter_charlist_to_table, this function returns a table with both a varchar and an nvarchar column. My tests indicate that there is a cost of 10-15 % over returning a table with only an nvarchar column. This function is in most cases considerably faster than inline_split_me, up to a factor of 2. As I discussed in the previous section, inline_split_me can be very fast with a parallel plan and long input; the nice thing with the chunking multi-statement function is that you get consistent performance. When just unpacking the list into a table, duo_chunk_split_me was 25-50 % slower than the CLR in my tests.
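To see the function in action, here is a sketch of a procedure that joins on the string column; the procedure name is my own and not one of the article's standard examples:

CREATE PROCEDURE get_company_names_chunk @customers nvarchar(MAX) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   duo_chunk_split_me(@customers, DEFAULT) s ON C.CustomerID = s.nstr
go
EXEC get_company_names_chunk 'ALFKI, BONAP, CACTU, FRANK'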

Note that this function does not have the robustness problem of the inline function. By chunking we are protected against running out of numbers, as long as we make sure that there are 4000 numbers in the table. Admittedly, if you get more slices than you have numbers in the table, you have a problem. But with a million numbers in the table, that would mean an input length close to four billion characters, which exceeds the capacity of nvarchar(MAX) by a wide margin.

Concluding Remarks

While faster than the iterative function, this function is more difficult to grasp, and probably also more difficult to extend to more complex formats. These functions do not have any listpos column. You can use the row_number function for this:

listpos = row_number() OVER (ORDER BY s.sliceno, N.Number)

where sliceno is an extra counter column you would have to add to the @slices table. It's likely that this would have some performance cost, but I have not examined this. Compared to other methods, the performance depends a lot on the length of the total string, since we compare every character in the input string with the delimiter. That is, extra spacing and the length of the elements will matter. I did not include any function that returns a list of numbers. You could write a new function which uses convert in the right places, but you can also do like this:

CREATE PROCEDURE get_product_names_tblnum @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   duo_chunk_split_me(@ids, DEFAULT) i ON P.ProductID = convert(int, i.str)
go
EXEC get_product_names_tblnum '9, 12, 27, 37'

Despite the chunking, there are still some robustness issues with the table of numbers. Users who do not know better could delete numbers in the middle or add numbers which should not be there. If you are paranoid, you can set up a check constraint for the minimum value, and add a trigger that cries foul if anyone meddles with the table. Then again, constraints and triggers can be disabled, so the really paranoid will probably prefer another method... That said, a table of numbers is something that comes in handy in several other SQL problems than just unpacking a comma-separated list, so it's a good idea to keep one available in your database.

Fixed-Length Array Elements

This is a method that was proposed by SQL Server MVP Steve Kass, inspired by an idea in Ken Henderson's book The Guru's Guide to Transact-SQL. Just like XML, this is a method that requires a special input format. Instead of using delimiters as in other methods, the list elements have fixed length. There are two advantages with this: 1) You can never run into problems with delimiters appearing in the data. 2) Performance. Except for extremely long input strings, this is the fastest of all methods in this article.

The Core

Here is a quick example where the method is employed directly, without a function:
CREATE PROCEDURE get_product_names_fix @ids varchar(8000), @itemlen tinyint AS SELECT P.ProductID, P.ProductName FROM Northwind..Products P JOIN Numbers n ON P.ProductID = convert(int, substring(@ids, @itemlen * (n.Number - 1) + 1, @itemlen)) AND n.Number <= len(@ids) / @itemlen go EXEC get_product_names_fix ' 9 12 27 37', 4

Each element in the "array" has the same length, as specified by the parameter @itemlen. We use the substring function to extract each individual element. The table Numbers that appears here is the same table that we created at the beginning of the section Using a Table of Numbers.

Here is the method packaged in a function:


CREATE FUNCTION fixstring_single(@str nvarchar(MAX), @itemlen tinyint) RETURNS TABLE AS RETURN(SELECT listpos = n.Number, str = substring(@str, @itemlen * (n.Number - 1) + 1, @itemlen) FROM Numbers n WHERE n.Number <= len(@str) / @itemlen + CASE len(@str) % @itemlen WHEN 0 THEN 0 ELSE 1 END)

The purpose of the expression on the last line is to permit the last element in the array to be shorter than the others, in case trailing blanks have been stripped. You can see that this function returns the list position, which is simply the number from the table. Here is an example using fixstring_single with a list of strings:
CREATE PROCEDURE get_company_names_fix @customers nvarchar(2000) AS SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN fixstring_single(@customers, 6) s ON C.CustomerID = s.str go EXEC get_company_names_fix 'ALFKI BONAP CACTU FRANK'

One possible disadvantage with fixed length is that it's more sensitive to disruption in the input. If you lose one character somewhere, or pass the wrong value in @itemlen, the entire list will be misinterpreted. But assuming that you construct the list programmatically, this should not be a big deal. You may note that I did not convert len(@str) to int as I did in inline_split_me. While it would not be a bad idea, I was not able to detect any impact on performance in this case, so in order to keep the fixed-length procedures easier to read, I opted to leave out the convert. You may also recall the cache bug that I discussed in the beginning of the article. Interestingly enough, procedures using the fixed-length method do not seem to be subject to this bug. But there is every reason to verify before you go ahead, particularly in a multi-user environment.

Unlimited Input

When I presented inline_split_me, I noted that this inline function in most cases does not have very good performance, because of the charindex and substring operations on nvarchar(MAX). This is different for fixstring_single: it has no charindex operations, and it does not perform a substring for each character. In short, there is no need for a chunking version of fixstring_single; it is good as it is. The risk that we run out of numbers is still there, but it is smaller, since we only use one number per list element, not per character. So with the Numbers table above, we can handle one million list elements, which is quite a few. Nevertheless, it is possible to write a function which is waterproof in this regard. Steve Kass proposed this function, which self-joins Numbers to square the maximum number. That is, with one million numbers in the table, you get 1E12 numbers in total to play with.
CREATE FUNCTION fixstring_multi(@str nvarchar(MAX), @itemlen tinyint) RETURNS TABLE AS RETURN(SELECT listpos = n1.Number + m.maxnum * (n2.Number - 1), str = substring(@str, @itemlen * (n1.Number + m.maxnum * (n2.Number - 1) - 1) + 1, @itemlen) FROM Numbers n1 CROSS JOIN (SELECT maxnum = MAX(Number) FROM Numbers) AS m JOIN Numbers n2 ON @itemlen * (n1.Number + m.maxnum * (n2.Number - 1) - 1) + 1 <= len(@str) WHERE n2.Number <= len(@str) / (m.maxnum * @itemlen) + 1 AND n1.Number <= CASE WHEN len(@str) / @itemlen <= m.maxnum THEN len(@str) / @itemlen + CASE len(@str) % @itemlen WHEN 0 THEN 0 ELSE 1 END ELSE m.maxnum END )

This is a more complex function than fixstring_single, but to save space I leave it as an exercise to the reader to understand what's going on, and only make a note about the line with CROSS JOIN: this saves me from hard-coding the number of rows in Numbers. You may think that a self-join that results in 1E12 numbers must be expensive, and indeed, if your lists are mainly short, less than 200 elements, the overhead is considerable. However, in my tests there were cases where fixstring_multi outperformed fixstring_single by a wide margin on a 4-CPU machine. Just as for inline_split_me, the reason is parallelism. In the case of fixstring_multi, the optimizer uses parallelism even for a query like:

SELECT * FROM fixstring_multi('000000123000000456', 9)

As for inline_split_me, the optimizer does not use the length of the input parameter for its estimates. Another approach to permit unlimited input is of course to do chunking. And with the fixed-length method, it's possible to do this with an inline function:
CREATE FUNCTION fixstring_multi2(@str nvarchar(MAX), @itemlen tinyint) RETURNS TABLE AS RETURN( SELECT listpos = (s.sliceno - 1) * (s.maxnum / @itemlen) + n.Number, str = substring(s.slice, @itemlen * (n.Number - 1) + 1, @itemlen) FROM (SELECT m.maxnum, sliceno = n.Number, slice = substring(@str, (m.maxnum - m.maxnum % @itemlen) * (n.Number - 1) + 1, m.maxnum - m.maxnum % @itemlen) FROM Numbers n CROSS JOIN (SELECT maxnum = MAX(Number) FROM Numbers) AS m WHERE n.Number <= len(@str) / (m.maxnum - m.maxnum % @itemlen) + CASE len(@str) % (m.maxnum - m.maxnum % @itemlen) WHEN 0 THEN 0 ELSE 1 END) AS s JOIN Numbers n ON n.Number <= len(s.slice) / @itemlen + CASE len(s.slice) % @itemlen WHEN 0 THEN 0 ELSE 1 END )

Performance is very similar to fixstring_multi, including the good performance on multi-processor machines thanks to parallelism.

Passing Numbers as Binary

The fixed-length method opens up a different way to pass a list of numbers, to wit, as a binary string:
CREATE FUNCTION fixbinary_single(@str varbinary(MAX))
RETURNS TABLE AS
RETURN(SELECT listpos = n.Number,
              n = convert(int, substring(@str, 4 * (n.Number - 1) + 1, 4))
       FROM   Numbers n
       WHERE  n.Number <= datalength(@str) / 4 )

(Do I need to add that I originally got this idea from Steve Kass as well?) When using this from T-SQL it looks less appealing:

CREATE PROCEDURE get_product_names_binary @ids varbinary(2000) AS
   SELECT P.ProductID, P.ProductName
   FROM   Northwind..Products P
   JOIN   fixbinary_single(@ids) b ON P.ProductID = b.n
go
EXEC get_product_names_binary 0x000000090000000C0000001B00000025

In my tests, the performance of fixbinary_single and fixstring_single was virtually identical. However, Alex Kuznetsov, who sent me a similar suggestion, pointed out that you save network bandwidth this way, something which is not covered in my tests. Instead of passing 10 bytes per number, as you would need with a string that can fit all positive integers, with a binary string you only need four bytes per number. This can make a difference for longer lists. To use this, you need to move your integers into a byte array in your client code. Alex was kind enough to send me a C# function that does this for the bigint data type.
static byte[] UlongsToBytes(ulong[] ulongs) {
   int ifrom = ulongs.GetLowerBound(0);
   int ito   = ulongs.GetUpperBound(0);
   int l = (ito - ifrom + 1)*8;
   byte[] ret = new byte[l];
   int retind = 0;
   for(int i=ifrom; i<=ito; i++) {
      ulong v = ulongs[i];
      ret[retind++] = (byte) (v >> 0x38);
      ret[retind++] = (byte) (v >> 0x30);
      ret[retind++] = (byte) (v >> 40);
      ret[retind++] = (byte) (v >> 0x20);
      ret[retind++] = (byte) (v >> 0x18);
      ret[retind++] = (byte) (v >> 0x10);
      ret[retind++] = (byte) (v >> 8);
      ret[retind++] = (byte) v;
   }
   return ret;
}
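If you just want to try fixbinary_single from a query window, without writing any client code, you can build a test string in T-SQL. This little snippet is my own addition and relies on the fact that converting int to binary(4) gives you the bytes in big-endian order:

DECLARE @ids varbinary(MAX)
SELECT @ids = convert(binary(4), 9)  + convert(binary(4), 12) +
              convert(binary(4), 27) + convert(binary(4), 37)
SELECT listpos, n FROM fixbinary_single(@ids)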

Note that fixbinary_single does not take an @itemlen parameter, as that would be a bit superfluous; to handle the other integer types, you would write new functions. Of course, a fixbinary_multi is perfectly possible to write, but I leave that as an exercise to the reader.

Fixed Length and the CLR

Since what is special with this method is the input format, not the algorithm itself, you could use the format with a CLR function instead. I wrote a CLR table-valued function that accepted fixed-length input. There was very little difference in performance compared to the other CLR functions.

Performance with Extremely Long Input

In the beginning of this section I said that the fixed-length method is the fastest method, except for extremely long input. In my tests I found that on one machine, fixed length lagged behind several other methods when the input was a list of 10 000 strings. This puzzled me for a long time, and eventually I ran some special tests for this case. My conclusion is that SQL Server has some change in its internal handling of nvarchar(MAX) which causes processing of it to be slower above a certain limit. This limit is around 500 000 bytes on x64 machines, and 750 000 bytes on 32-bit machines. (I don't have access to any IA64 box, so I don't know what the limit is there.) When the input length exceeds this limit, the execution time for the fixed-length functions about doubles. I only saw this issue with fixed length in my tests, because only fixed length generated such long input strings. But other T-SQL inline functions like inline_split_me are certainly subject to this phenomenon as well. On the other hand, the routines that apply chunking are fairly immune, since they perform relatively few operations on the long string. The issue does not arise at all with the non-SQL methods: the CLR handles the long string in the .Net Framework, and XML also has its own internal handling. I have more details about this limit in the performance appendix.

An Alternate Way to Populate the Numbers Table

The query I presented to populate the Numbers table should be fairly simple to grasp, but it is not terribly efficient. Since you only need to run it once, that is not a big issue. Nevertheless, here is a blazingly fast query that I've taken from Itzik Ben-Gan's book Inside SQL Server 2005: T-SQL Programming:
CREATE FUNCTION dbo.fn_nums(@n AS bigint) RETURNS TABLE AS RETURN WITH L0 AS(SELECT 1 AS c UNION ALL SELECT 1), L1 AS(SELECT 1 AS c FROM L0 AS A, L0 AS B), L2 AS(SELECT 1 AS c FROM L1 AS A, L1 AS B), L3 AS(SELECT 1 AS c FROM L2 AS A, L2 AS B), L4 AS(SELECT 1 AS c FROM L3 AS A, L3 AS B), L5 AS(SELECT 1 AS c FROM L4 AS A, L4 AS B), Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY c) AS n FROM L5) SELECT n FROM Nums WHERE n <= @n; GO INSERT Numbers(Number) SELECT n FROM fn_nums(1000000)

Itzik suggests in his book that if creating tables is not within your powers, you can use the function directly in your query. When I tested this, the results indicated that the more numbers you need, the bigger the overhead of the function. For a function like chunk_split_me, which never requires more than 8000 numbers, there was no significant difference at all. For fixstring_single, which for 10 000 30-char strings requires 300 000 numbers, the overhead was 25 %, which is amazingly low. There is a risk when using this inline function with a method that is also inline: when I ran a test with fixstring_single on SQL 2008, joining against the non-clustered index of my test table, the optimizer got lost completely and arrived at a plan that took several minutes to execute with only 20 entries in the list. On SQL 2005 I got a normal and fast plan, but keep in mind that for your query, it could be the other way round. Note: the performance appendix does not include any data using this function instead of the Numbers table, as I added the function to the article quite some time after the initial publication, and I no longer have access to all the test machines from the initial test.

Using Recursive CTEs

This is a method that is entirely new to SQL 2005. The method was originally suggested to me by SQL Server MVP Nigel Rivett. A function based on this idea also appears in SQL Server MVP Itzik Ben-Gan's book Inside Microsoft SQL Server 2005: T-SQL Querying. This method is not among the quickest; it beats the iterative method by a mere 15 %. Nevertheless, Nigel Rivett says he prefers this method, because when he works as a consultant, he wants to leave as small a footprint as possible, preferably not even a function. And indeed, the CTE method is fairly easy to use directly in a stored procedure. Nevertheless, I will show the method packaged in an inline function:
CREATE FUNCTION cte_split_inline (@list nvarchar(MAX), @delim nchar(1) = ',') RETURNS TABLE AS RETURN WITH csvtbl(start, stop) AS ( SELECT start = convert(bigint, 1), stop = charindex(@delim COLLATE Slovenian_BIN2, @list + @delim) UNION ALL SELECT start = stop + 1, stop = charindex(@delim COLLATE Slovenian_BIN2, @list + @delim, stop + 1) FROM csvtbl WHERE stop > 0 ) SELECT ltrim(rtrim(substring(@list, start, CASE WHEN stop > 0 THEN stop - start ELSE 0 END))) AS Value FROM csvtbl WHERE stop > 0 go

The thing that starts with WITH is a Common Table Expression (CTE). A plain CTE is just like a macro that you define before the query, and which you can then use in the query as if it were a table. A slightly fancier derived table, if you like. But the CTE above is a special form of CTE: a recursive CTE. A recursive CTE consists of two SELECT statements combined by the UNION ALL operator. The first SELECT statement is the starting point. The second SELECT statement makes a reference to the CTE itself. You could see this as a long list of:
SELECT ... UNION ALL SELECT ... FROM CTE UNION ALL SELECT ... FROM CTE ...

where each SELECT statement takes its input for the CTE from the SELECT right above it. The recursion continues as long as the SELECT statements continue to generate new rows. The final result is the UNION ALL of all the SELECT statements. (This recursive query is a bit special, since it does not refer to any tables. Normally, you use recursive CTEs to wind up bills of materials and other hierarchical structures.) Thus, in this function, the first SELECT returns 1, where the list starts, and the position of the first delimiter. The second SELECT sets start to the position after the first delimiter and stop to the position of the second delimiter. The third SELECT, as implied by the recursion, returns the position after the second delimiter as the start position, and the position of the third delimiter. And so it continues until there are no more delimiters and stop is returned as 0, with the result that the last SELECT returns nothing at all.
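To make this concrete, here is a small trace of my own. For the input 'ALFKI,BONAP,CACTU' the CTE produces these rows, and the last one is filtered out by the condition stop > 0:

   start   stop   element extracted by substring
       1      6   ALFKI  (positions 1 to 5)
       7     12   BONAP  (positions 7 to 11)
      13     18   CACTU  (positions 13 to 17)
      19      0   (ends the recursion)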

All this produces a virtual table with the start and stop positions for all the list elements, and we can use the rows in this table as input to substring, with special handling for the last list element. Here is an example of using this function; note that there is a little twist:
CREATE PROCEDURE get_company_names_cte @customers nvarchar(2000) AS SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN cte_split_inline(@customers, ',') s ON C.CustomerID = s.Value OPTION (MAXRECURSION 0) go EXEC get_company_names_cte 'ALFKI, BONAP, CACTU, FRANK'

Note the OPTION clause. Without this clause, SQL Server would terminate the function prematurely if there are more than 100 elements in the list, 100 being the default value for MAXRECURSION. MAXRECURSION serves as a safeguard in case you happen to write a recursive CTE which never terminates. For more "normal" uses of CTEs, like employee-manager relations, 100 is a lot, but for our purposes 100 is far too low a number. Here we set MAXRECURSION to 0, which turns off the check entirely. But why is the OPTION clause not within the function? Simply because OPTION clauses are not permitted in inline functions. Recall that inline functions are not really functions at all, but just macros that are pasted into the referring query. The function does not return the list position, but it would be easy to add, by including a counter column in the CTE:
SELECT ..., listpos = 1 ... UNION ALL SELECT ..., listpos = listpos + 1
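Spelled out in full, a version that returns the list position might look like this; this is my own sketch of the idea and not a function from the article, so test it before you rely on it. The same OPTION (MAXRECURSION 0) hint applies when you call it:

CREATE FUNCTION cte_split_inline_pos (@list nvarchar(MAX), @delim nchar(1) = ',')
RETURNS TABLE AS
RETURN
   WITH csvtbl(listpos, start, stop) AS (
     SELECT listpos = convert(bigint, 1),
            start   = convert(bigint, 1),
            stop    = charindex(@delim COLLATE Slovenian_BIN2, @list + @delim)
     UNION ALL
     SELECT listpos + 1,
            stop + 1,
            charindex(@delim COLLATE Slovenian_BIN2, @list + @delim, stop + 1)
     FROM   csvtbl
     WHERE  stop > 0
   )
   SELECT listpos,
          ltrim(rtrim(substring(@list, start,
                 CASE WHEN stop > 0 THEN stop - start ELSE 0 END))) AS Value
   FROM   csvtbl
   WHERE  stop > 0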

As you may guess, cte_split_inline takes a performance toll from the combination of nvarchar(MAX) and charindex. This can be addressed by writing a multi-statement function that breaks up the input into chunks, in a similar vein to duo_chunk_split_me. However, I found in my tests that the overhead of bouncing the data over the return table in a multi-statement function slightly exceeded the cost of charindex on nvarchar(MAX). Since the selling point of the CTE method is its low intrusiveness, the chunking function is of less interest, and I leave it as an exercise to the reader to implement it. Or peek at this link. Procedures that use cte_split_inline are subject to the caching problem I discussed earlier, so it may be a good idea to copy the input parameter to a local variable.

Dynamic SQL

For a list of numbers, this method appears deceivingly simple:
CREATE PROCEDURE get_product_names_exec @ids nvarchar(4000) AS EXEC('SELECT ProductName, ProductID FROM Northwind..Products WHERE ProductID IN (' + @ids + ')') go EXEC get_product_names_exec '9, 12, 27, 37'

But the full story is far more complex; there are several issues you need to consider. To start with, let's look at an example with a list of strings:
CREATE PROCEDURE get_company_names_exec @customers nvarchar(2000) AS EXEC('SELECT CustomerID, CompanyName FROM Northwind..Customers WHERE CustomerID IN (' + @customers + ')') go EXEC get_company_names_exec '''ALFKI'', ''BONAP'', ''CACTU'', ''FRANK'''

The nested quotes make procedures using this method difficult to call. The next thing to consider is permissions. Normally, when you use stored procedures, users do not need direct permissions on the tables, but this does not apply when you use dynamic SQL. In SQL 2000 there was no way around it. In SQL 2005 you can arrange for permissions by signing the procedure with a certificate; I have a full article that describes how to do this, Giving Permissions through Stored Procedures. But to get there, you now have a method that requires a complexity that no other method in this article calls for.

Whenever you work with dynamic SQL, you must be aware of SQL injection. That is, if the input comes directly from user input, a malicious user may be able to execute SQL code that you did not intend. This is something you have to be particularly aware of when you implement web applications for the Internet; SQL injection is a very common way for hackers to break into sites. The normal way to avoid SQL injection is to use parameterised statements, but since the number of parameters is variable, this is not workable here, so you need to employ other methods which are more complex and less secure. I discuss SQL injection and other issues around dynamic SQL in my article The Curse and Blessings of Dynamic SQL.

When it comes to robustness and flexibility, we can note that you have no choice of delimiter: comma is the only one possible. There is no way to get the list position. And you can only use the method inline; there is no way to unpack the data into a temp table or a table variable.

When it comes to performance, this can be the fastest method for your query. This is the only method where the optimizer has full information about the input, and this can lead to a better plan than one that is based on blind assumptions or the statistics of a temp table. But it might also be the other way round. In fact, in my tests dynamic SQL proved to be one of the slowest methods, way behind the iterative method. The reason for this is that with dynamic SQL you get the cost of query compilation with almost every query, and query compilation of long IN lists is anything but cheap. Nevertheless, the situation has improved significantly from SQL 2000, where a list with 2 000 elements took over 10 seconds to process on all my test machines. In my tests for SQL 2005 this only happened for 10 000 elements on the slowest machine. If you reuse the same list, there is likely to be a plan in the cache, and in this case execution will be fast the second time. But keep in mind that it has to be exactly the same list, as the cache is both case- and space-sensitive. And how likely is that? Rather, this points to a secondary problem with dynamic SQL: cache bloat. You will fill the plan cache with plans that are very rarely reused. On machines with lots of memory, this can lead to severe performance problems because of hash collisions in cache lookups.

There is, however, a way to reduce the compilation costs for dynamic SQL. SQL 2005 adds a new database option, forced parameterisation. When this setting is in effect, SQL Server parameterises all ad-hoc statements. This has the effect that for a dynamically constructed IN expression like the one above, for a given number of list elements, there is only one entry in the cache that can be reused by all queries with the same number of elements. That is, there is one entry for one element, one for two elements etc. With this setting dynamic SQL is indeed very fast, up to 2100 list elements, that is. 2100 is the maximum number of parameters for a stored procedure or parameterised statement. When this number is exceeded, the string cannot be fully parameterised, and there will again be one cache entry for each string, with the same poor performance as before. And note that with forced parameterisation you lose what is maybe the strongest selling point of dynamic SQL: that the optimizer has full information. On the first invocation with, say, five list elements, you may get the best plan for those values.
But if the next set of five values calls for a different plan, you will not get that plan. And since forced parameterisation is a database setting, you cannot rely on it being on or off. (There are query hints to control the behaviour, though.) The conclusion of all this is that as long as you get good plans with other methods, there is absolutely zero reason to use dynamic SQL. Only if you have exhausted all other methods and have not been able to get good performance is it worth looking at dynamic SQL. And even then you need to decide whether it is worth the hassle of handling permissions and taking precautions against SQL injection.

Making the List into Many SELECT

Inserting Many Rows

For once, let's look at a method from the angle that you need to insert many rows. Say that you don't want to use XML for some reason. You realise that sending one INSERT statement per row is not very efficient, if nothing else because of the many network round-trips, so you want to send away a batch of rows at a time. The fastest way to do this is to use bulk-copy. To this end you can use BCP, the BULK INSERT statement, or the bulk-copy routines in your client API. Alternatively, you can use SQL Server Integration Services. Bulk-copy and SSIS lie completely beyond the scope of this article, though, so I will not go into any details. In any case, these are the methods you use when you import large data files. If you have 8 000 rows in a grid, using the bulk-copy API may be a bit of overkill. A simple-minded approach is to run a batch with a lot of INSERT VALUES statements in it. Just remember to issue SET NOCOUNT ON, or else each INSERT will generate a (1 row affected) message to the client and you are back to the network round-trips. It is not going to perform fantastically, not least because one row is inserted at a time. How could you insert many rows in one statement? Here is a way:

INSERT tbl (...)
   SELECT val1a, val2a, val3a
   UNION ALL
   SELECT val1b, val2b, val3b
   UNION ALL
   ...

That is, instead of a lot of INSERT statements you have one big fat SELECT with lots of UNION ALL clauses. Alas, this does not perform well either. For small batches, around 20 elements, it's faster than many INSERT statements. But as the batch size grows, the time it takes for the optimizer to compile the query plan explodes, and the total execution time can be several times that of a batch of INSERT VALUES. One way to handle this is to generate many INSERT SELECT UNION batches, with a reasonable batch size of 20-50 elements. That gives you the best of those two methods. However, there is a third method, which is a lot faster. You get the benefit of a single INSERT statement, but you don't have to pay the price of compilation. The trick is to use INSERT-EXEC, and it was suggested to me by Jim Ebbers:
INSERT sometable (a, b, c) EXEC('SELECT 1, ''New York'', 234 SELECT 2, ''London'', 923 SELECT 3, ''Paris'', 1024 SELECT 4, ''Munich'', 1980')

Normally when you use INSERT-EXEC you have a stored procedure or a batch of dynamic SQL that returns one result set, and that result set is inserted into the table. But if the stored procedure or batch of dynamic SQL returns multiple result sets, this still works, as long as all result sets have the same structure. The reason why this method is so much faster than using UNION is that every SELECT statement is compiled independently. One complexity with this method is that you have to be careful to get the nested quotes correct when you generate the statement. It may come as a surprise that I propose dynamic SQL here, given how cool I was to it in the previous section. But the full story is that all three methods here presume that the client generates the full batch, INSERT statements with values and all, and what is that if not dynamic SQL? So you can only walk this route in situations where permissions are not an issue. (Which could be because the table inserted into is a temp table, and later a stored procedure inserts from the temp table into a target table.) And obviously you would need to apply methods to deal with SQL injection.

Comma-Separated Lists

It may not seem apparent, but it is possible to use the methods above as a foundation for unpacking comma-separated lists, although it should be said directly that this sorts more under the title "Crazy things you can do in T-SQL". It's difficult to see a situation where this method would be the best choice. Originally this idea was suggested to me by Steve Kass. The procedure is that you use the replace function to replace the delimiter with binding elements. In Steve Kass's original proposal he used UNION ALL SELECT and then added some more plumbing to the string so that you have an SQL batch. Then you can use INSERT EXEC to insert the result of the SQL batch into a temp table, which you then use in your target query. Although UNION has very poor performance, I first show a stored procedure that uses UNION, for the simple reason that this code is simpler to understand than what is to follow:
CREATE PROCEDURE unpack_with_union @list      nvarchar(MAX),
                                   @tbl       varchar(30),
                                   @delimiter nchar(1) = N',' AS
DECLARE @sql nvarchar(MAX),
        @q1  char(1),
        @q2  char(2)
SELECT @q1 = char(39), @q2 = char(39) + char(39)
SELECT @sql = 'INSERT INTO ' + @tbl + ' SELECT ' +
              replace(replace(@list, @q1, @q2),
                      @delimiter, N' UNION ALL SELECT ')
--PRINT @sql
EXEC (@sql)

The inner replace is there to handle the potential risk that @list includes single quotes, which we double in an attempt to protect us against SQL injection. The outer replace replaces the delimiter with the SQL plumbing. The name of the table to insert into is passed in the parameter @tbl. The variables @q1 and @q2 save me from having a mess of single quotes all over the place. (It will get worse than this.) Here is an example of how to use it:
CREATE PROCEDURE get_product_names_union @ids varchar(50) AS
   CREATE TABLE #temp (id int NULL)
   EXEC unpack_with_union @ids, '#temp'
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   #temp t ON P.ProductID = t.id
go
EXEC get_product_names_union '9, 12, 27, 37'

The reason that this is a procedure and not a function is of course that you cannot use dynamic SQL in functions. However, the performance of this procedure, and of a similar one that uses INSERT VALUES, is not defensible. Decent performance can be achieved if we use the trick with INSERT-EXEC and many small SELECT statements, suggested by Jim Ebbers. Here is such a procedure, certainly more complex than the one above:
CREATE PROCEDURE unpack_with_manyselect @list      nvarchar(MAX),
                                        @tbl       varchar(30),
                                        @delimiter nchar(1) = ',' AS
DECLARE @sql nvarchar(MAX),
        @q1  char(1),
        @q2  char(2)
SELECT @q1 = char(39), @q2 = char(39) + char(39)
SELECT @sql = 'INSERT ' + @tbl + ' EXEC(' + @q1 + 'SELECT ' +
              replace(replace(@list, @q1 COLLATE Slovenian_BIN2, @q2 + @q2),
                      @delimiter COLLATE Slovenian_BIN2, ' SELECT ') +
              @q1 + ')'
--PRINT @sql
EXEC (@sql)

Here, the inner replace replaces a single quote within @list with no less than four single quotes. This is because the INSERT EXEC itself is nested inside an EXEC(). (You may guess why I have that commented-out PRINT @sql there!) In this procedure I have also added a COLLATE clause. (The other procedure is so slow that the COLLATE would not make any difference anyway.) If you look closer, you will see that the procedures presented so far only work well with numeric values. If you want to feed them a list of strings, you need to quote the strings yourself. Here is a version for handling strings:
CREATE PROCEDURE unpackstr_with_manyselect @list      nvarchar(MAX),
                                           @tbl       varchar(30),
                                           @delimiter nchar(1) = ',' AS
DECLARE @sql nvarchar(MAX),
        @q1  char(1),
        @q2  char(2)
SELECT @q1 = char(39), @q2 = char(39) + char(39)
SELECT @sql = 'INSERT ' + @tbl + ' EXEC(' + @q1 +
              'SELECT ltrim(rtrim(' + @q2 +
              replace(replace(@list, @q1 COLLATE Slovenian_BIN2, @q2 + @q2),
                      @delimiter COLLATE Slovenian_BIN2,
                      @q2 + ')) SELECT ltrim(rtrim(' + @q2) +
              @q2 + '))' + @q1 + ')'
--PRINT @sql
EXEC (@sql)

Here is an example of usage:


CREATE PROCEDURE get_company_names_manyselect @custids nvarchar(2000) AS CREATE TABLE #temp (custid nchar(5) NULL) EXEC unpackstr_with_manyselect @custids, '#temp' SELECT C.CompanyName, C.CustomerID FROM Northwind..Customers C JOIN #temp t ON C.CustomerID = t.custid go EXEC get_company_names_manyselect 'ALFKI, BONAP, CACTU, FRANK'

Is this method really a good alternative for handling comma-separated lists? No. Performance for integer lists is on par with the iterative method, but it is significantly slower for lists of strings. The iterative method has the benefit that it is easy to adapt to support different input formats, and getting information like the list position is simple; unpack_with_manyselect does not lend itself to any of that. The strength here lies in inserting many values. Note: all procedures here include an INSERT. But you could leave out the INSERT from the procedure and instead call the procedure from INSERT EXEC. In that case, you have the choice of using a temp table or a table variable.

Really Slow Methods

In a Q&A column of an SQL journal, the following solution was suggested by one SQL Server MVP, referring to another MVP, both of whom shall go nameless:
CREATE PROCEDURE get_company_names_charindex @customers nvarchar(2000) AS SELECT CustomerID, CompanyName FROM Northwind..Customers WHERE charindex(',' + CustomerID + ',', ',' + @customers + ',') > 0 go EXEC get_company_names_charindex 'ALFKI,BONAP,CACTU,FRANK'

You may recognize the theme from when we used a table of numbers. By adding commas on both sides of the input string, we can use charindex to find ",ALFKI," etc. (Note that you cannot have embedded blanks here.) The author noted in his column that this method would not have good performance, since embedding the table column in an expression precludes the use of any index on that column, leading to a table scan. But that's only a small part of the story. A plain table scan of my test table takes 800 ms once the table is entirely in cache. This method needs 65 seconds for an input list of 20 strings! Variations on this theme are illustrated by these WHERE clauses:
WHERE patindex('%,' + CustomerID + ',%', ',' + @customers + ',') > 0

WHERE ',' + @customers + ',' LIKE '%,' + CustomerID + ',%'

The solution with LIKE is equally slow as, or slower than, charindex. I have never tested patindex on SQL 2005. There is a "go faster" switch here: the COLLATE clause. Add a COLLATE clause to force a binary collation, and performance improves by a factor of 7 to 10. But the problem is that if you pass a string like alfki,bonap,cactu,frank you may still expect a hit, which you would not get with a binary collation. So in my tests, I only forced the collation when the input was a list of integers, as in this example:
CREATE PROCEDURE get_product_names_realslow @ids varchar(200) AS
   SELECT ProductName, ProductID
   FROM   Northwind..Products
   WHERE  charindex(',' + ltrim(str(ProductID)) + ',' COLLATE Slovenian_BIN2,
                    ',' + @ids + ',' COLLATE Slovenian_BIN2) > 0
go
EXEC get_product_names_realslow '9,12,27,37'

If you use an SQL collation, you also get this huge performance improvement if you can stick to varchar (and then you could still be case-insensitive). But while an improvement by a factor of ten is impressive, we are still talking seven seconds for a list of 20 integers, when no other method needs as much as fifty milliseconds.

Conclusion

You have now seen a number of methods for passing a list of values to SQL Server, and then using the values to find data in a table. Most methods transform the list of values into a table. I have also discussed general considerations on how to apply these methods. If you want to know more about the performance of these methods, there is an appendix to this article where I present data from my performance tests.
