This is not the first time I have found non-intuitive issues with collation, Unicode and 8-bit strings in SQL Server, but this time I discovered (more precisely: one of my customers discovered and asked me…) a strange behavior that affects the query plan and the overall query optimization strategy.
This was my original post to Microsoft Connect:
A SQL query with a WHERE condition does not use the existing index if the comparison is made against a Unicode constant string. It happens when the field has a collation setting different from the server default (for example, the field has SQL_Latin1_General_CP1_CI_AS while the server default is now Latin1_General_CI_AS).
The query plan uses a Clustered Index Scan instead of a Clustered Index Seek operation.
Performance becomes very slow when the query is executed many times in a loop (not a good practice, but the program can’t be changed).
With an ANSI (8-bit) string constant the problem disappears and the query plan shows a Clustered Index Seek operation.
The answer is detailed:
This is not about the collation but about the difference between 8-bit and 16-bit character set.
In fact if you use the default collation (remove “COLLATE SQL_Latin1_General_CP1_CI_AS” from the column definition) then the plans do not change.
There is no generalized valid way to down-convert a 16-bit string to an 8-bit string but an 8-bit string is required to perform a seek because that is the type of the index.
The scan plan works because the 8-bit strings in the table are up-converted to 16-bit and compared to the 16-bit constant.
If you use nvarchar in the table then both forms of the predicate will support index seek (since the 8-bit constant in the predicate can be up-converted to 16-bit).
SQL Server dev.
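To make the reasoning in that answer concrete, here is a toy model in Python (not SQL Server code, and not how the engine is actually implemented). It assumes code page 1252 stands in for the column's 8-bit character set, and every name in it (`CODE_PAGE`, `rows`, `seek`, `scan`, `down_convert_lossy`) is made up for illustration. It only shows why an 8-bit constant can be looked up directly in a sorted 8-bit index, while a 16-bit constant pushes the comparison to the 16-bit side, row by row.

```python
import bisect

# Assumption: cp1252 stands in for the 8-bit code page of the indexed column.
CODE_PAGE = "cp1252"

# Toy "index": 8-bit keys kept in sorted order.
rows = sorted(name.encode(CODE_PAGE) for name in ["Alice", "Bob", "Carol", "Zoë"])


def seek(index, key_8bit):
    """Binary search on the 8-bit keys, analogous to an index seek."""
    pos = bisect.bisect_left(index, key_8bit)
    return pos < len(index) and index[pos] == key_8bit


def scan(index, key_16bit):
    """Up-convert every stored key and compare, analogous to an index scan."""
    return any(stored.decode(CODE_PAGE) == key_16bit for stored in index)


def down_convert_lossy(text):
    """Force a Unicode string into 8 bits, replacing what does not fit."""
    return text.encode(CODE_PAGE, errors="replace")


# 8-bit constant: the sorted index can be searched directly.
print(seek(rows, b"Bob"))

# 16-bit constant: each stored key is up-converted before comparing.
print(scan(rows, "Bob"))

# Down-conversion loses information: two different Unicode strings
# collapse to the same 8-bit value, so a seek on it could match wrong rows.
print(down_convert_lossy("Ω") == down_convert_lossy("?"))
```

The last line is the point of the answer: because different 16-bit strings can end up as the same 8-bit string, the constant cannot safely be converted down to match the index type, so the rows have to be converted up instead.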
I understand all of this well, and apparently the issue doesn’t affect many of us. However, think about it: when you build a Data Warehouse, for a lot of reasons you could have to manage a VARCHAR field instead of an NVARCHAR one. If such fields are in the staging database, you have to be careful when comparing them with constants (it appears to be simple). If you decide to build a dimension using VARCHAR types instead of NVARCHAR, the users who access the database could use a query tool that uses Unicode constants for user input. Hmm… very dangerous: angry users could blame you for this.
Now, guess what? SQL Server Reporting Services 2005 builds parameterized queries using the sp_executesql statement, prefixing all string constants with N (for Unicode). Can you hear the users screaming now?
If I wasn’t clear, you can read the feedback here: a repro script is included.
Now, while I can find a workaround here (probably a stored procedure can convert the Unicode constant into the corresponding 8-bit version), I wonder whether this issue alone should encourage you to move dimension fields to the NVARCHAR type. I know all the implications and I know that dimensions size