LINQ to SQL, Aggregates, EntitySet, and Quantum Mechanics

Monday 10 September, 2007, 11:32 AM

LINQ to SQL lets you write queries in C#, and it will turn them into SQL queries. This is not limited to the C# 3.0 query syntax – it also applies when you use LINQ through ordinary function call syntax. (This should come as no surprise, as the compiler transforms all query expressions into function calls.) For example, consider this code, running against the AdventureWorks example database:

MyDataContext ctx = new MyDataContext();
decimal maxPrice = ctx.Products.Max(product => product.ListPrice);

The aim here is to find out the highest list price of any product. (Not hugely useful, but I want to illustrate the use of aggregate functions such as Max.) Logically speaking, Max evaluates the function you pass it for each item in the source collection (the Products table in this case) and returns the highest result. Here we’ve used a lambda expression that picks out the ListPrice member of the Product class, so this example will get the highest ListPrice. The Product class was built by importing the Product table in Visual Studio 2008’s LINQ to SQL O/R designer, which created properties to represent the table’s columns, including ListPrice.

LINQ’s Max operation is an extension method, meaning that it’s not a member of the object returned by the Products property, despite how it looks. It’s a static method defined by LINQ’s Queryable class, so the call to Max is effectively doing this:

decimal maxPrice = Queryable.Max(ctx.Products, 
                       product => product.ListPrice);

On the face of it, we’re asking the Max function to do this:

// Note: it doesn't actually do this
decimal maxPrice = 0;
foreach (Product product in ctx.Products)
{
    // Evaluate the lambda body:
    decimal currentValue = product.ListPrice;

    // Keep the highest result so far
    maxPrice = Math.Max(maxPrice, currentValue);
}

But as the comment says, that’s not really what happens. In fact, it ends up getting SQL server to do the work, by building a SELECT that uses the MAX aggregate function. How was it able to do that, given the code we supplied?

The Max function takes an argument of type Expression<Func<Product, decimal>>. The significance of the Expression part is that this lambda will not be turned into code. The C# compiler will instead turn it into an expression tree – a tree of objects representing the structure of the expression. (Note: Max is declared as taking an Expression<Func<TSource, TResult>>, where TSource and TResult are generic function type parameters. The arguments for these parameters, Product and decimal, have been inferred by the C# compiler. The Product argument comes from invoking Max on a source of Product objects. The return type of decimal comes from the item returned by our lambda expression: Product.ListPrice.)

Because the lambda is passed in expression form, Max can delve into the expression and work out what we’re asking it to do. It can see that we are simply retrieving the ListPrice field, and this makes it possible for the LINQ to SQL infrastructure to convert that into a suitable SQL query. Here’s the query it ends up running:

SELECT MAX([t0].[ListPrice]) AS [value]
FROM [Production].[Product] AS [t0]

In other words it does what you would hope – it gets SQL server to calculate the maximum. This is good, because downloading the whole table and calculating the maximum on the receiving end would be a disaster!

However, I’ve run into a couple of situations where things don’t go quite so swimmingly with aggregates in LINQ to SQL. They’re both easy to deal with, you just need to be aware of how LINQ to SQL works in order to avoid running into trouble.

Aggregates and Null

The first issue is that an aggregate might return null. How would that happen? One way is if your query finds no rows. Here’s an example:

SELECT MAX(ListPrice) m FROM Production.Product
  WHERE Color='Puce'

It turns out that the AdventureWorks sample database doesn’t offer any products whose colour is Puce, so the only reasonable result of MAX here is NULL. You could also end up with a NULL result if the column itself was nullable, and all of the rows your query finds have a NULL for that column. Here’s the equivalent query as we might try to write it in C#:

decimal maxPrice = ctx.Products.
    Where(product => product.Color == "Puce").
    Max(product => product.ListPrice);

Obviously we have a problem – the query evaluates to null, but you can’t store that in a decimal. So if we run this we get:

Unhandled Exception: System.InvalidOperationException: The null value cannot be assigned to a member with type System.Decimal which is a non-nullable value type.

This is exactly the kind of scenario for which .NET provides nullable types. We might not be able to store a null in a decimal, but decimal? can store either a decimal or a null. (The ‘?’ suffix is C# shorthand for Nullable<decimal>.)

However, if you try the obvious fix – adding a ‘?’ to the decimal where we declare maxPrice – it compiles, but perplexingly we get exactly the same error. The reason is that the exception was actually being thrown deep inside the guts of LINQ to SQL – it fails long before it tries to give us back the result. It can’t see that we’ve changed the maxPrice declaration, and it thinks it’s supposed to be returning us a normal decimal. Why does it think that? It’s because of the type of the lambda we’ve supplied – it turns out that the return type of the lambda we pass to Max gets to dictate the return type of Max. You can see this from the signature:

public static TResult Max<TSource, TResult> (
    this IQueryable<TSource> source,
    Expression<Func<TSource, TResult>> selector
)

Note that the TResult type parameter appears as both the return type of Max, and also as the type parameter that indicates the return type of the expression. C# infers TResult from the lambda expression, so the lambda’s return type flows out as Max’s return type.

This is unfortunate. As we’ve seen in SQL, even when the column type passed to MAX is non-nullable, the result can still be NULL. So it would seem to make more sense for the return type of Max to be TResult? rather than just TResult. But the problem with that is that you can only use the ‘?’ suffix on a value type – this would prevent Max from working on reference types. And unfortunately you can’t overload generic functions by constraint. (This stops you providing two different versions of a generic method, one for reference types and one for value types.) If we want Max to return us a nullable type, we need to make sure the selector we provide returns one. That’s fairly straightforward:

decimal? maxPrice = ctx.Products.
    Where(product => product.Color == "Puce").
    Max(product => (decimal?) product.ListPrice);

We’re forcing the lambda passed to Max to return a nullable decimal. This in turn causes Max to return a nullable decimal. So now, when this aggregate query evaluates to null, maxPrice gets set to null – no more error.

This is one of those situations where dynamic language advocates like to point out the ‘tax’ static typing sometimes makes you pay. Certainly it’s a little irritating to have to type in an extra 10 characters here. On the other hand, lambdas get rid of a lot of manifest typing (i.e. most of the time you don’t need to declare the type of the lambdas) and yet because the compiler infers the type statically, IntelliSense continues to work. So static typing still seems like a net win here, despite the wart in this example.

Querying Relationships and EntitySet

The second problem can arise if we take advantage of a convenient feature of LINQ to SQL: its ability to present relationships between tables as properties on the entity objects. For example, suppose we also bring in the ProductModel table from AdventureWorks. The O/R designer in Visual Studio 2008 will see from the database schema that a foreign key relationship exists between these two tables, and will add a member to the generated Product class to provide access to it. In this case, a given Product will be related to exactly one row of the ProductModel table, so it appears as a simple property of Product. Here’s a query that uses it:

var names = from product in ctx.Products
  where product.Color == "White"
  select new
  {
    ProductName = product.Name,
    ModelName = product.ProductModel.Name
  };

foreach (var name in names)
{
    Console.WriteLine(name.ProductName +
        " - " + name.ModelName);
}

Again, this generates some reasonable SQL:

SELECT [t0].[Name], [t1].[Name] AS [Name2]
FROM [Production].[Product] AS [t0]
  LEFT OUTER JOIN [Production].[ProductModel] AS [t1]
     ON [t1].[ProductModelID] = [t0].[ProductModelID]
WHERE [t0].[Color] = @p0

How about traversing the relationship is in the other direction? Let’s use the same two types, but we’ll start by searching for a product model, and then work out from there to the products that are related to that product model:

ProductModel productModel = ctx.ProductModels.Single(
    model => model.Name == "LL Road Frame");
foreach (Product product in productModel.Products)
{
    Console.WriteLine(product.Name);
}

The Single method simply picks out the first item that matches the criteria. (In SQL this becomes a SELECT TOP 1 query.) Then we use the ProductModel.Products property – this returns an EntitySet<Product>. This is used as a collection, but the first time we retrieve its contents, it goes and fetches all the Product rows that are related to the product model in question. (It’s possible to arrange for this to be prefetched, rather than being fetched on demand.)

So far so convenient. Now what if we decide we want to execute a query on that related collection? Suppose, instead of printing out all the products for a product model, we just wanted to find the most expensive –a product model specific version of the query we started out with. We could do this:

ProductModel productModel = ctx.ProductModels.Single(
    model => model.Name == "LL Road Frame");

decimal? maxPrice = productModel.Products.
    Max(product => (decimal?) product.ListPrice);

Console.WriteLine(maxPrice);

The only obvious difference between this and the first query (aside from the nullable trick) is that we’re now running the query against the products related to a particular product model, as returned by the ProductModel.Products property.

Good, right? Well, let’s not leap to conclusions. Here’s the SQL that runs as a result of that last query to get the maximum:

SELECT [t0].[ProductID], [t0].[Name], [t0].[ProductNumber],
  [t0].[MakeFlag], [t0].[FinishedGoodsFlag], [t0].[Color],
  [t0].[SafetyStockLevel], [t0].[ReorderPoint],
  [t0].[StandardCost], [t0].[ListPrice], [t0].[Size],
  [t0].[SizeUnitMeasureCode], [t0].[WeightUnitMeasureCode],
  [t0].[Weight], [t0].[DaysToManufacture], [t0].[ProductLine],
  [t0].[Class], [t0].[Style], [t0].[ProductSubcategoryID],
  [t0].[ProductModelID], [t0].[SellStartDate],
  [t0].[SellEndDate], [t0].[DiscontinuedDate], [t0].[rowguid],
  [t0].[ModifiedDate]
FROM [Production].[Product] AS [t0]
WHERE [t0].[ProductModelID] = @p0

Yikes! Why’s it fetching all those columns? And why’s it bringing down all the product rows for that model instead of using MAX like it did earlier? This time, it really is going to work like my earlier example with the “it doesn’t actually do this” comment! If there are a lot of rows, then as I said, that’s a disaster.

This is one of those leaky abstraction moments.

Let’s take a step back so we can work out what just happened.

One of the nice things about LINQ is that the same basic techniques apply across LINQ to SQL, LINQ to Objects, LINQ to XML, and LINQ to Any Other Technology That Someone Felt Like Adding LINQ Support For. But that can be a double-edged sword. It is the root of this problem, for example: this final Max query turned out to be using LINQ to Objects.

The query that fetches the product model was against the ProductModels member of the data context. That property is of type Table<ProductModel>. (This generic class is defined in the System.Data.Linq namespace.) This implements an interface called IQueryable<T>, which is at the heart of the clever expression support, where C# code gets transformed into something else, such as SQL. This interface is critical to the process that ends up generating the MAX-based SQL query we want.

However, the query for maxPrice in our current example runs against the Products property of the ProductModel class, and as we saw earlier, that property is of type EntitySet<Product>. The EntitySet class does not implement IQueryable<T>. In fact the entity objects you get back from a LINQ to SQL query are pretty much detached from the original data context. (Deferred loading is an exception to this rule, but even then, entities track the bare minimum to enable the feature.)

So if ProductModel.Products doesn’t implement IQueryable<T>, how come we were still able to do LINQ stuff with it? The answer is LINQ to Objects. This implements the LINQ operations on top of anything that implements IEnumerable or IEnumerable<T>. The EntitySet<T> implements those, so that’s what we end up using.

Of course, LINQ to Objects just works by iterating through collections and accessing properties of objects. It tries to be smart whenever it can, but when it comes to implementing the Max function, it really has no choice but to examine the whole collection – IEnumerable<T> doesn’t offer an abstract mechanism for implementing Max. (And even if it did, the EntitySet downloads the whole collection in one go the first time you ask it for even a single item.)

Accessing Relationships within Queries is Different

We hit this problem by applying a query to the result of another query that had already run. But if we had used the following subtly different approach, we wouldn’t have had the problem:

var productModelQuery =
    from model in ctx.ProductModels
    where model.Name == "LL Road Frame"
    select model.Products;

decimal? maxPrice = productModelQuery.
    SelectMany(product => product).
    Max(product => (decimal?) product.ListPrice);

The critical difference here is that we don’t evaluate the first query. Before, we used the Single method to go and fetch the product, but here we just leave the query in query form. After the first statement executes, nothing has gone to the database yet – the statement simply builds an object that represents our query.

The second line uses SelectMany. This flattens multiple collections into one. We’re using this because the first query evaluates to a collection of collections: the select part of our productModelQuery picks model.Products, which is a collection – it’s all the products related to the given model. So this query will return a collection of collections of products. As it happens, we only expect the query to return one result – there should only be one product model with that name – so we’re getting back a collection of length 1 which contains a collection. But it doesn’t really matter – SelectMany lets us take as many collections as we have, and concatenate them into one single collection. And it’s not that SelectMany is particularly special. There’s one particular feature that we need here, and SelectMany is one way of getting that. Here’s the critical part:

SelectMany doesn’t destroy the queryness of the query.

What this means is that if you apply SelectMany to an IQueryable<T>, its output is also an IQueryable<T>. (Possibly a different T... In our case, the lambda we’ve provided is the identity function, so it’ll be the same T, but you have the option to project as you flatten.) This is different from Single – even if the input to Single is IQueryable<T>, the output is a single object.

By using SelectMany, we ensure that when we apply Max, we’re doing so on an IQueryable<T>, not an IEnumerable<T>. The difference this makes would be clearer in a world without extension methods – we are doing this:

decimal? maxPrice = Queryable.Max(
    Queryable.SelectMany(
      productModelQuery.
      product => product),
    product => (decimal?) product.ListPrice);

as opposed to this:

decimal? maxPrice = Enumerable.Max(
    Queryable.Single(productModel.Products),
    product => (decimal?) product.ListPrice);

The upshot is that the query-based version of Max can see it all – it has access to the whole structure of the query, enabling it to calculate a reasonable SQL query. It can see that although the first part of the query selects model.Products, we don’t need all that data to come back from the server, because the final result we care about is just the Max function applied over one of the columns that comes back. Here’s what it comes up with:

SELECT MAX([t2].[value]) AS [value]
FROM (
    SELECT [t1].[ListPrice] AS [value],
       [t0].[Name], [t1].[ProductModelID],
       [t0].[ProductModelID] AS [ProductModelID2]
    FROM [Production].[ProductModel] AS [t0],
         [Production].[Product] AS [t1]
    ) AS [t2]
WHERE ([t2].[Name] = @p0) AND
  ([t2].[ProductModelID] = [t2].[ProductModelID2])

That’s not the most readable thing in the world, but it has done the right thing. It has told the database server to do the work and just return the one value we care about.

Notice that we were still able to use the ProductModel.Products property here. The crucial point was to make sure we didn’t try to perform the query until all the pieces were in place. That way, LINQ to SQL can work out that we don’t actually want to see the ProductModel object or its Products – it can see that those are intermediate steps in the query, so it doesn’t need to realise those values into entity objects. The reason it all went pear-shaped in the previous attempt was that we resolved the first query to an actual result before executing the second part.

LINQ to Schrödinger’s Cat – Collapsing the Wave Function

Evaluating a query is a bit like looking inside the box in the Schrödinger’s Cat thought experiment. This query:

var productModelQuery =
    from model in ctx.ProductModels
    where model.Name == "LL Road Frame"
    select model;

is not yet committed to a particular outcome. Just as the Schrödinger’s box does not resolve to a live or dead cat until we observe its contents, a query does not resolve to specific entities until we evaluate it, e.g.:

ProductModel productModel = productModelQuery.Single();

This returns an entity object, and in doing so, it collapses the wave function, i.e. the entity does not retain the ‘queryness’ of the query. And that’s why using an entity object as the starting point for further queries is sometimes less effective than either building a query on top of an existing query, or creating a new query from scratch.

Obviously the analogy is not perfect. In particular, we haven’t really collapsed the query, in the sense that the query object still exists, and we are free to evaluate it as many more times as we like. The main point is that ‘observing’ (evaluating) a query yields something of a different nature from the query itself.

Notes for Google Cut and Paste Warriors

I know from experience that some people find code samples on the internet in books and paste them into their code without thinking about what they’re doing. Many of these people believe that when their code appears to work, they’re done. I have some notes for the benefit of these people, because you probably shouldn’t just copy that ‘combined query’ code – there’s a simpler solution. If you’re not the type to copy without thinking, feel free to skip to the next step. For the rest of you:

Note 1: stop being so lazy.

Note 2: existing code on the web is generally unlikely to be an ideal fit for your problem. That’s especially true when the code was posted as an illustration to aid understanding, rather than as part of a complete solution. This example is a case in point, leading me to...

Note 3: the query-on-query example earlier is rather contrived and you shouldn’t use it. Its purpose is to illustrate that the use of collection-like properties that represent relationships aren’t necessarily problematic in LINQ queries. But in practice, you’d never implement that particular query that way. The following example does the same job (fetches the maximum price for a particular product model) and is much simpler:

var productQuery = from product in ctx.Products
    where product.ProductModel.Name == "LL Road Frame"
    select product;
decimal? maxPrice = productQuery.Max(
    product => (decimal?) product.ListPrice);

This also happens to generate slightly less convoluted SQL – it uses a JOIN as you might expect, where the previous example generates a nested SELECT. (It’s still not as succinct as it could be. But it’s not a big deal because SQL uses the same execution plan for both of these generated queries as it would for the simplest possible hand-built query.)

Same Syntax, Different Technology

I know some people prefer to skim for the interesting bits by looking at the titles of the article. Given the typical length of my blog entries, I can understand that. So I’d like to highlight an important point that might otherwise get buried:

Two near-identical LINQ-based code fragments can behave in fundamentally different ways.

This is by design. LINQ defines a set of standard operators such as ‘Where’ or ‘Max’. Different LINQ implementations (LINQ to SQL, LINQ to Objects etc.) are free to implement these in different ways as long as they generate the expected results. It’s a form of Duck Typing – consumers of LINQ implementations only demand that methods with certain names and signatures are present. The benefit is a uniform programming model (with, in some cases, language-integrated support, such as C# 3.0’s query syntax). The price is that code might turn out not to be doing quite what you thought it did. And if you think this is hypothetical scaremongering, bear in mind that I wrote this blog post because I ran into this problem when using LINQ for real.

And this isn’t the first time I’ve been savaged by a duck.

Know Your Tools

The critical point here is to know what LINQ is doing for you. The ability to follow relationships between tables using simple property syntax in C# can simplify some code considerably. With appropriately-formed queries, LINQ to SQL’s smart transformation from C# to SQL will generate efficient queries when you use these relationships.

If you use powerful technology without understanding what it does, then it’s easy to cause big performance problems. LINQ is no exception, so if you’re learning it I would recommend using SQL Profiler, or the DataContext.Log property to make sure you know what’s really happening.

As always, a sound understanding of your tools is essential.

April (2018)	(1 item)
August (2014)	(1 item)
July (2014)	(5 items)
April (2014)	(1 item)
March (2014)	(1 item)
January (2014)	(2 items)
November (2013)	(2 items)
July (2013)	(4 items)
April (2013)	(1 item)
February (2013)	(6 items)
September (2011)	(2 items)
November (2010)	(4 items)
September (2010)	(1 item)
August (2010)	(4 items)
July (2010)	(2 items)
September (2009)	(1 item)
June (2009)	(1 item)
April (2009)	(1 item)
November (2008)	(1 item)
October (2008)	(1 item)
September (2008)	(1 item)
July (2008)	(1 item)
June (2008)	(1 item)
May (2008)	(2 items)
April (2008)	(2 items)
March (2008)	(5 items)
January (2008)	(3 items)
December (2007)	(1 item)
November (2007)	(1 item)
October (2007)	(1 item)
September (2007)	(3 items)
August (2007)	(1 item)
July (2007)	(1 item)
June (2007)	(2 items)
May (2007)	(8 items)
April (2007)	(2 items)
March (2007)	(7 items)
February (2007)	(2 items)
January (2007)	(2 items)
November (2006)	(1 item)
October (2006)	(2 items)
September (2006)	(1 item)
June (2006)	(2 items)
May (2006)	(4 items)
April (2006)	(1 item)
March (2006)	(5 items)
January (2006)	(1 item)
December (2005)	(3 items)
November (2005)	(2 items)
October (2005)	(2 items)
September (2005)	(8 items)
August (2005)	(7 items)
June (2005)	(3 items)
May (2005)	(7 items)
April (2005)	(6 items)
March (2005)	(1 item)
February (2005)	(2 items)
January (2005)	(5 items)
December (2004)	(5 items)
November (2004)	(7 items)
October (2004)	(3 items)
September (2004)	(7 items)
August (2004)	(16 items)
July (2004)	(10 items)
June (2004)	(27 items)
May (2004)	(15 items)
April (2004)	(15 items)
March (2004)	(13 items)
February (2004)	(16 items)
January (2004)	(15 items)

IanG on Tap

Blog Navigation

Writing

Other Sites