Today I implemented a classification through a NTILE equivalent function. The T-SQL NTILE implementation has very bad perfomance when used against large datasets of rows, so I and Davide Mauri realized a better implementation that Davide posted some weeks ago on his blog. No reaction from Microsoft guys and no comments on Davide blog, so I retry here to catch more attention on this topic.

Instead to use NTILE function you can use this faster query:

SELECT 
   CustomerKey,
   CEILING(ROW_NUMBER() OVER (ORDER BY YearlyIncome ASC) / ((SELECT CAST(COUNT(*) + 1 AS FLOAT) FROM DimCustomer) / @n) ) AS NTileFast
FROM 
   DimCustomer

Note that I prefer to use the RANK function instead than the ROW_NUMBER: even if I could get non-uniform distribution of data among clusters, I get all elements with the same value in the same cluster and in many scenario you can prefer this technique to avoid cases with random cluster assignment for items with value equal to a cluster limit. This is different from NTILE result, but I prefer it for my cluster assignment.

SELECT 
   CustomerKey,
   CEILING(RANK() OVER (ORDER BY YearlyIncome ASC) / ((SELECT CAST(COUNT(*) + 1 AS FLOAT) FROM DimCustomer) / @n) ) AS NTileFast
FROM 
   DimCustomer

Please read this post of Davide Mauri for a complete explanation and let us know what do you think about it

 

CEILING

Rounds a number up, to the nearest integer or to the nearest unit of significance.

CEILING ( <Number>, <Significance> )

COUNT

Counts the number of rows in the table where the specified column has a non-blank value.

COUNT ( <ColumnName> )

RANK

Returns the rank for the current context within the specified partition sorted by the specified order or on the axis specified.

RANK ( [<Ties>] [, <Relation>] [, <OrderBy>] [, <Blanks>] [, <PartitionBy>] [, <MatchBy>] [, <Reset>] )