To join or not to join: that is the question (in LINQ)

A comment received by one reader of Programming LINQ suggested me to underline a concept that is not so intuitive using LINQ, especially if you come from years of SQL coding.

The idea is very simple. Two entities in LINQ might be related in the model. Whenever this happen, usually it is better to leverage on this existing relationship and not to write the join syntax in an explicit way. If you are using LINQ to SQL, the generated SQL code might be more performant or at least correspondant to the one generated by writing an explicit join in your LINQ query. The less constraints in your query, the better.

Let’s look at an example on the Northwind database. Imagine you want to see a list of all categories with a flag set for the one which a particular product belongs to. This is a SQL query we could write:

SELECT

    c.CategoryID,
    c.CategoryName,
    CASE WHEN p.ProductID IS NULL
        THEN 0
        ELSE 1
    END AS Selected
FROM Categories c
LEFT JOIN Products p
    ON p.CategoryID = c.CategoryID
    AND p.ProductID = 10
ORDER BY CategoryName

Ok, we can write the same query in many other ways, but there are several more complex situations where a LEFT JOIN is used to test the presence of an element in a related table. A correspondant LINQ query might be the following one:

from

c in dc.Categories
orderby c.CategoryName
join p in dc.Products.Where(p => p.ProductID == 10)
    on c.CategoryID equals p.CategoryID
    into pj
from x in pj.DefaultIfEmpty()
select new {
    c.CategoryID,
    c.CategoryName,
    Selected = x != null
};

The LINQ query above will generate a SQL query containing a LEFT JOIN statement. However, a relationship exists between Categories and Customer, and you can leverage on this relationship in the point where you really need to traverse the relationship (in the projection statement). The following one is a better way to get the same result:

from

c in dc.Categories
orderby c.CategoryName
select new {
    c.CategoryID,
    c.CategoryName,
    Selected = c.Products.Any( p => p.ProductID == 10 ) ? true : false
};

This new version has two advantages. First, it is shorter and express its intent more explicitly. Second, it generates a SQL query with an EXISTS statement, similar to the following one.

SELECT

CategoryID, CategoryName,
    (CASE
        WHEN EXISTS(
            SELECT NULL AS [EMPTY]
            FROM Products AS p
            WHERE (p.ProductID = 10) AND (p.CategoryID = c.CategoryID)
            ) THEN 1
        ELSE 0
    END) AS Selected
FROM Categories AS c
ORDER BY CategoryName

The execution plan used by SQL Server might be similar if not equal. However, using the implicit relationship between Categories and Products in the LINQ query is usually better, because it gives more freedom to the LINQ provider to generate a more efficient SQL code.