I faced an interesting situation today. I was getting a collection of objects from a database using a stored proc, but I only needed some of the objects that were returned. And I had to return the objects as a List.
Sounds like a good job for a LINQ where clause, right?
public List<object> DoWork() => GetObjects().Where(ShouldKeep).ToList();
private bool ShouldKeep(object o)
{
    // Do a bit of work here...
}
Ah, nice. Short. Readable. Beautiful. And unfortunately, not a good idea in this case.
When LINQ is not appropriate
Listen, I love LINQ. When Java devs ask me what makes me love C# so much, LINQ is at the very top of my list. That said, I think that one of the biggest sins we .NET developers commit is the overuse of LINQ. My use case today gave me several reasons not to use my favorite feature. Any one of them individually would probably be enough to disqualify LINQ as a valid tool for this problem. Here are a few:
Two heads are better than one. But two collections aren't.
Because of the way our data repository is implemented, that GetObjects() method doesn't return an IEnumerable<object>. It doesn't yield return the elements one at a time. It uses no delayed execution. It returns a list. A list that is taking up memory on my heap. A list that can get pretty big.
And unfortunately, I need to return a list, too. Just a (potentially) smaller one. And here's the rub: that .Where(ShouldKeep).ToList() creates a second list that is anywhere up to the same size as the potentially large list I fetched from my database call. You will rightly point out that the first list, in this case, becomes just an intermediate list. It will fall out of scope and be marked for garbage collection the moment the method returns the final results list. And that's true. But while that final results list is being built by my beloved LINQ, you can't escape the fact that you will have two potentially large lists putting memory pressure on your process.
LINQ is best for atomic operations
OK, I lied. My ShouldKeep() method wouldn't just take the element as its parameter. The decision of whether any given element should be returned in the final list is a lot more complex and nuanced (and very stateful). It depends not only on the nature of the element itself, but also on features of other elements in the list, elements of other lists, and even (gulp) potentially another database call.
private bool ShouldKeep(List<object> list, object o, int index, DataProvider database, CustomConfiguration cfg, ...)
{
    // Do a LOT of work here...
}
All of these factors bring in their own weight of timers and logging and business logic. I will grant that some developer out there can probably break it into a bunch of separate predicates and use some lesser-known LINQ overloads that return the index with the element and perform all sorts of wonderful voodoo, but that's not a game I want to play right now. It makes the code harder to read and even harder to set breakpoints in. LINQ excels at handling atomic operations for its predicates, but if you try to get too fancy, you're probably going to end up sacrificing something big.
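For the curious, here's roughly what that index-aware overload looks like; ShouldKeepRelativeTo is a made-up stand-in for the real, far messier predicate:

var listFromDB = GetObjects();
var result = listFromDB
    // Where has an overload whose predicate also receives the element's
    // index, letting the predicate peek at neighbors by position.
    .Where((element, index) =>
        index == 0 || ShouldKeepRelativeTo(listFromDB[index - 1], element))
    .ToList();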
IEnumerable is awesome. But only once.
This next point isn't strictly related to my particular use case since I already mentioned that the database method I'm calling returns a list, but since I'm already on my soapbox, I might as well. Don't treat IEnumerable<T> as an actual collection. Here's an example:
public static void Main()
{
    Console.WriteLine("Starting");
    var myNumbers = GetNumbersFromDatabase();
    if (myNumbers.Any())
    {
        Console.WriteLine("Got back " + myNumbers.Count() + " results. Here they are:");
        foreach (var number in myNumbers)
        {
            Console.WriteLine(number);
        }
    }
    else
    {
        Console.WriteLine("Didn't get any results");
    }
    Console.WriteLine("Done");
}
At first glance during an over-the-shoulder code review, this may look fine. Maybe you even know that GetNumbersFromDatabase() hits 3 different tables so it's kind of expensive, but that's OK, you're only hitting it once, so it's fine, right?
Sure it's all fine. Fine until it hits production and your performance slows to a crawl. And your users start complaining that they are seeing, "Got back 0 results. Here they are:" instead of "Didn't get any results." So, after hours of looking for the bug, you finally add logging to the GetNumbersFromDatabase() method:
private static IEnumerable<int> GetNumbersFromDatabase()
{
    Console.WriteLine("In GetNumbersFromDatabase. Let's hit the database!");
    for (var i = 1; i <= 3; i++)
    {
        Console.WriteLine("Querying table " + i);
        yield return i;
    }
}
And you see the following in the console:
Starting
In GetNumbersFromDatabase. Let's hit the database!
Querying table 1
In GetNumbersFromDatabase. Let's hit the database!
Querying table 1
Querying table 2
Querying table 3
Got back 3 results. Here they are:
In GetNumbersFromDatabase. Let's hit the database!
Querying table 1
1
Querying table 2
2
Querying table 3
3
Done
Whenever you call a LINQ extension method on anything that implements IEnumerable<T>, it will re-enumerate the source. If the underlying object is an array or a list, that's usually fine since it will just work on the object that already exists in memory. But if the implementation is doing something fancy like yield-returning its elements one at a time, you end up redoing whatever work that method does each and every time.
The fix? If you need to know anything about (or do anything to) a bunch of objects other than iterate through them exactly once, put the elements in a collection. Adding .ToArray() to the end of the call to GetNumbersFromDatabase() fixes it right away.
Another tip: I like the var keyword in general. But in this case, var hid the fact that the object returned from the database call was an iterator instead of a collection. So maybe rethink your use of var in that case.
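A minimal sketch of both fixes at once; the explicit array type now makes it obvious that the results have been materialized:

// Enumerate (and "hit the database") exactly once, up front. The Any(),
// Count(), and foreach calls now all work on this in-memory array.
int[] myNumbers = GetNumbersFromDatabase().ToArray();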
Back to the use case
So what did I do for my GetObjects() case? Well, ultimately I wanted to do something like this:
List<object> listFromDB = GetObjects();
foreach (var element in listFromDB)
{
    if (!ShouldKeep(element, ...))
    {
        listFromDB.Remove(element);
    }
}
return listFromDB;
But alas, if I try to run that, I will get the good old "Collection was modified; enumeration operation may not execute" exception. It's almost like the CLR is begging you to duplicate the collection (which was reason #1 for not using LINQ in the first place).
Ditching LINQ and foreach
So how can we get around this? Well, we already ditched LINQ. Let's ditch our other C# favorite, foreach. If we access the elements by index, the responsibility of keeping track of iteration state falls on us instead of the runtime, so .NET won't complain and yell at us when we try to change something:
List<object> listFromDB = GetObjects();
for (var i = 0; i < listFromDB.Count; i++)
{
    if (!ShouldKeep(listFromDB[i], ...))
    {
        listFromDB.RemoveAt(i);
    }
}
return listFromDB;
Let's try that with a toy example where we filter out numbers between 3 and 7 inclusive:
public static void Main()
{
    var list = Enumerable.Range(1, 10).ToList();
    for (var i = 0; i < list.Count; i++)
    {
        if (list[i] >= 3 && list[i] <= 7)
        {
            list.RemoveAt(i);
        }
    }
    Console.WriteLine(string.Join(",", list));
}
A collection removal gotcha
Let's look at the output:
1,2,4,6,8,9,10
Whoa, what's with that 4,6,8 nonsense? Well, like I said, when we access the elements by index, the responsibility of keeping track of iteration state falls on us instead of the runtime. And we're doing it wrong.

When we remove an element at an index, the indices of all elements past this one shift down by one. So we remove the element at i, then the for loop takes us on to element i + 1. But what used to be i + 1 has dropped down to become i, so we're now evaluating what used to be i + 2, skipping i + 1 entirely. Whenever you remove at the current index, make sure to decrement the iterating index so you don't skip anything:
public static void Main()
{
    var list = Enumerable.Range(1, 10).ToList();
    for (var i = 0; i < list.Count; i++)
    {
        if (list[i] >= 3 && list[i] <= 7)
        {
            list.RemoveAt(i);
            i--;
        }
    }
    Console.WriteLine(string.Join(",", list));
}
That's better:
1,2,8,9,10
Now we have a method that doesn't create an additional collection, and in fact, will only shrink the existing collection.
So next time you're faced with filtering a large data set in a non-trivial way, take a hard look at your LINQ and foreach and see if you can't easily release some memory pressure on your application by getting back to your roots.
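One footnote for the simpler cases: List<T> has a built-in RemoveAll(Predicate<T>) that does this same in-place filtering in a single pass, though it hits the same atomic-predicate limitation described above once the keep-or-drop decision needs outside state. The toy example again:

var list = Enumerable.Range(1, 10).ToList();
// RemoveAll compacts the list in place in one pass; no second list is created.
list.RemoveAll(n => n >= 3 && n <= 7);
Console.WriteLine(string.Join(",", list)); // 1,2,8,9,10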
Top comments (8)
You should always be careful when you're using LINQ.
There is a huge difference between using LINQ on collections implementing IEnumerable<T> versus IQueryable<T>. Long story short: when we're using the IEnumerable interface, as you've written, we're working on objects which are stored in memory. On the other hand, IQueryable uses expression trees, which can be modified right up until we execute them (by calling the ToList() or ToArray() method).

A common mistake is working on the IEnumerable interface. Let's take an example. We have a repository containing a method GetAll() which returns IEnumerable. When we work with that implementation, like in the sketch below, we'll first load the whole collection into memory, then we'll filter the results, after that we'll skip 10 elements, and at the end we'll take only 5 results.
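A sketch of that trap, with a hypothetical Product repository (the names here are illustrative):

var page = repository.GetAll()      // IEnumerable<Product>: loads the whole table into memory
    .Where(p => p.Price > 100)      // filters in memory
    .Skip(10)                       // skips 10 elements in memory
    .Take(5)                        // finally keeps just 5
    .ToList();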
But if our repository implemented a GetAll() method returning IQueryable, the expression wouldn't be executed until we explicitly call the ToList() method. For instance, when we work with Entity Framework against a database, our expression will be translated into a proper SQL query, and only 5 elements (or fewer, if not enough rows satisfy the condition) will be returned to our program.
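The same chain over IQueryable (again, names are illustrative); here everything is deferred and translated by the provider:

var page = repository.GetAll()      // IQueryable<Product>: builds an expression tree
    .Where(p => p.Price > 100)
    .Skip(10)
    .Take(5)
    .ToList();                      // executes here: ONE SQL query, at most 5 rows back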
To sum up: be careful when working with LINQ. Check whether you're working with the IEnumerable or the IQueryable interface. It matters. Cheers.

BTW, the difference between IEnumerable and IQueryable is one of my favorite interview questions :)
This comment sums up the thinking behind using LINQ. 👍 Thank you!
LINQ allows you to focus on what you want to get rather than how you can get it. Compare two pieces of code; using the article's toy example, they would look something like:
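var result = list.Where(x => x < 3 || x > 7).ToList();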
and
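for (var i = 0; i < list.Count; i++)
{
    if (list[i] >= 3 && list[i] <= 7)
    {
        list.RemoveAt(i);
        i--; // easy to forget!
    }
}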
It takes almost nothing to understand the first piece of code, while you need to get through several lines to understand the second one. And as you have noted, the second one is error prone: do not forget to decrease i, and so on.

As for performance and memory pressure, LINQ is far better than your approach. I put together a gist with a benchmark, and the results show LINQ is 50 times faster while memory allocation is almost the same. Well, you'll create a new List, right. But the complexity of FilterEnumerable is O(N) while FilterList is O(N²).
As with any other tool, LINQ is great in the right hands. You need to know its pitfalls to use it efficiently, but LINQ not only lets you write more efficient code, it also gives you better tools for decomposing your code.
You said that you were getting a collection of objects from a database using a stored proc but you only needed some of the objects that were returned. To me, this indicates that you needed to refine that proc so that it would handle this for you via extra parameters or create a new proc that did it.
To me, one of the biggest problems with LINQ and, by extension Entity Framework, is that it encourages ignoring the power of the database engine, be it SQL Server, Oracle or something else. DB engines are specifically designed and tuned to extract and sort large amounts of data. Sure, you can do this on your middleware or client side using LINQ but why not use the DB's strength to its fullest?
I've found LINQ most useful when working with non-DB data (CSV, XML, etc) where there was no database engine in play or smaller sets of data that needed a quick and minor refinement, like sorting line items in a specific invoice.
Correct. Unfortunately, modifying the proc (or writing a new proc) is not feasible for this. Heck, it's not even like EF where my filter criteria could be to some degree translated into SQL. I'm just stuck with this potentially massive list of objects and no say over which ones get returned for the foreseeable future.
Turf wars on data sources can be a huge problem and create frustration and ugly hacks. I've run into this myself from time to time. Sometimes it's somewhat justified and other times it's just someone being a control freak.
Of course, the best solution is to get everyone functioning and cooperating on the same team/page but when you have dysfunctional management, this can be quite difficult.
If I'm not mistaken, the additional memory used by a larger list, minus the objects referenced by said list, is quite minimal - perhaps nothing more than a stored count. A list with a count of one million, but all of the elements pointing to the same object, will consume no more memory than a list with a single element pointing to said object.
The main problem is the materialization of all those objects, for which you have a number of solutions, such as an IQueryable<T> backed by a database LINQ provider, on which you could use the Queryable extension methods that would be converted to SQL by the provider. But obfuscating the intent of your code by avoiding LINQ and foreach seems rather pointless to me.
Update: I've just looked at the definition (.NET Framework) of List<T>, and none of the fields consume more memory as the size of the list grows (unless T is a value type).

Good article which inspires one to look more closely at what LINQ calls do under the hood.
For removing items in a loop I like reversing the loop:
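Something along these lines, reusing the toy example; counting down means a RemoveAt never shifts the indices you haven't visited yet, so no manual correction is needed:

var list = Enumerable.Range(1, 10).ToList();
// Iterate from the end: removing at i only shifts elements AFTER i,
// which we've already visited, so no i-- bookkeeping is required.
for (var i = list.Count - 1; i >= 0; i--)
{
    if (list[i] >= 3 && list[i] <= 7)
    {
        list.RemoveAt(i);
    }
}
Console.WriteLine(string.Join(",", list)); // 1,2,8,9,10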