.NET & Architecture Thoughts: November 2009

This is the fourth part of my ongoing series on Language Integrated Query, or LINQ. I have been away from this series for a while. However, my upcoming posts (including this one) are aimed at completing it. In the second and third posts, we had an overview of the language features that are important for a full grasp of LINQ. In this post, we will have a detailed look at LINQ syntax and explore its different features. So let us get started.

Introduction

LINQ is a set of language extensions added to C# and VB.NET which make queries a first-class concept in these languages. It provides a unified programming model for managing data across different data domains. As I mentioned in my first post, using LINQ we can query and operate on different data domains including relational databases, XML, custom entities, DataSets or any third-party data source. Above all, the concept of queries is now applicable to in-memory data as opposed to using queries with a persistent medium only. From a developer’s point of view, the interface to each data domain remains the same, while the LINQ engine is responsible for converting the queries to target the domain being referenced.

Since LINQ is a first-class concept in C# and VB.NET, these languages come loaded with support for LINQ. LINQ queries take advantage of features including IntelliSense and compile-time syntax checking. LINQ queries rely on standard query operators (discussed shortly), which are a set of functions used to fetch, project, sort and filter the data.

Query Expression and Method-Based Queries

A LINQ query is reminiscent of standard SQL query syntax. The purpose of LINQ is to add data querying capabilities to the .NET Framework such that any data domain can be processed with the same ease. LINQ relies on concepts including Extension Methods, Anonymous Types, Anonymous Methods and Lambda Expressions discussed in earlier posts.

A LINQ query is also known as a Query Expression or Query Syntax. A query expression is a declarative syntax for writing queries which allows us to filter, group and order data. According to MSDN, “a query expression operates on one or more information sources by applying one or more query operators from either the standard query operators or domain-specific operators”. This means that LINQ can operate on different data domains using either the standard operators or operators developed by a third party. To me this is a polymorphic behavior of LINQ. The result of a query expression is an in-memory sequence of elements (or objects). Any object which implements the IEnumerable<T> interface is a sequence. The resultant sequence can be iterated with the language's built-in iteration constructs, such as foreach.

Let us see a simple example in listing 1 which shows a LINQ Query in action:

Listing 1


int[] prime = new int[8] {2, 3, 5, 7, 11, 13, 17, 19}; // Line 1

var primeNumbers =  // Line 2
        from p in prime      // Line 3
        where p > 0         // Line 4
        select p;            // Line 5

foreach (int p in primeNumbers) // Line 6
{
    // use p
}

In the above listing, a list of prime numbers is queried and the result is assigned to a variable primeNumbers. The LINQ query (lines 2–5) resembles a SQL statement with the standard from, where and select clauses. But it is being applied to an in-memory collection of numbers rather than a persistent medium (such as a database or XML). The example is simple, but I am sure you can already see the strength of LINQ in it. The rest of the example (var, foreach) is what we talked about in the previous post.

A query expression can also be written in method-based query syntax. A method-based query uses extension methods and lambda expressions. It is a rather concise way of writing query expressions. There is no performance difference between the two, because the compiler translates every query expression into a method-based query. A query expression is more readable while a method-based query is concise to write, so it really comes down to your choice of syntax. Using a method-based query, listing 1 can be written as follows:

Listing 2


IEnumerable <int> primeNumbers =
                                    prime           
                                    .Where (p => p > 0)
                                    .Select  (p => p);

Avid readers must have figured out why the foreach loop can be applied to the variable primeNumbers in listing 1. The compiler infers the type of primeNumbers as IEnumerable<int>, which represents a sequence of elements. Any such sequence can be iterated using a foreach loop.

I am sure by now you can spot many of the features explained in part 2 and part 3 of this series. The above queries are just making use of concepts including var keyword, extension methods, lambda expressions and enumerators.
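
To make the equivalence of the two syntaxes concrete, here is a minimal sketch that writes one filter both ways and checks that the results match. The array contents and variable names are my own, chosen only for illustration.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Collections.Generic;
using System.Linq;

int[] numbers = { 2, 3, 5, 7, 11, 13, 17, 19 };

// Query expression syntax.
IEnumerable<int> viaQuery =
    from n in numbers
    where n > 5
    select n;

// Method-based syntax: the same filter written with the Where extension method.
IEnumerable<int> viaMethods = numbers.Where(n => n > 5);

// Both forms yield the same elements in the same order.
bool sameResult = viaQuery.SequenceEqual(viaMethods);
Console.WriteLine(sameResult);                   // True
Console.WriteLine(string.Join(", ", viaQuery));  // 7, 11, 13, 17, 19
```

Which form to use is a matter of taste; the query expression tends to read better once several clauses are combined.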

LINQ Syntax

LINQ is reminiscent of standard SQL and bears a sharp resemblance to it. Like its counterpart, a query expression consists of clauses. The three main clauses in a LINQ expression are from, where and select. The general syntax of a LINQ query is as follows:

var [query] = from …
where …
select …

The first clause in a LINQ query is the from clause. You may be wondering why a LINQ query begins with the from clause, unlike a standard SQL query which begins with the select clause. The reason for this order is to support IntelliSense when working in the Visual Studio IDE. Since the from clause names the data source first, the compiler knows the element type while the rest of the query is being typed, and the editor can therefore offer IntelliSense. The from clause is then followed by the where and select clauses. You can find the full syntax of a LINQ Query here.

Let us see listing 3 to analyze a LINQ query piece by piece. We start by defining a simple class followed by object initialization:

Listing 3


public class Car
{
        public string Type { get; set; }
        public string Color { get; set; }
}

Car[] cars = 
{
        new Car { Type = "SUV",   Color = "Green" }, 
        new Car { Type = "SUV",   Color = "Black" }, 
        new Car { Type = "4x4",   Color = "Red" }, 
        new Car { Type = "Truck", Color = "Orange" }, 
        new Car { Type = "Jeep",  Color = "Black"}
};

Now that we have an array of cars with their properties initialized, we use a LINQ query to find all the cars of a specific type:


IEnumerable<Car> search =
            from myCar in cars
            where myCar.Type == "SUV"
            select myCar;

The query begins with the from clause. A from clause only operates on sequences implementing the IEnumerable interface. This clause is actually made up of two parts: from and in. The in part points to the source sequence, which must be of type IEnumerable, while the from part declares a range variable (myCar here) that stands for each element of the source sequence in turn.

Next is the where clause, used for filtering. Behind the scenes, this clause is converted to the Where query operator, an extension method defined in the Enumerable class. This method accepts a lambda expression as a parameter to apply the filter.

Next in the sequence is the select clause. This clause defines the expression that is produced for each element of the result. The expression can be of any type, including an instance of a class, a string, a number, a boolean etc. Indeed, this clause lets a type be created on the fly (an anonymous type) and used as the result element.

Finally, we can iterate through the variable ‘search’ since it is of type IEnumerable using the following code:


foreach (Car c in search)
{
// use c.Type, c.Color
}
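
To illustrate the point that the select clause can create a type on the fly, the following sketch projects each match into an anonymous type. The sample data mirrors the cars above but uses anonymous types instead of the Car class to stay self-contained; the property names Label and NameLength are made up for this example.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

// Hypothetical sample data; anonymous types stand in for the Car class.
var cars = new[]
{
    new { Type = "SUV",   Color = "Green" },
    new { Type = "SUV",   Color = "Black" },
    new { Type = "Truck", Color = "Orange" },
};

// The select clause builds a new anonymous type for each matching car.
var labels =
    from car in cars
    where car.Type == "SUV"
    select new { Label = car.Color + " " + car.Type, NameLength = car.Color.Length };

foreach (var item in labels)
{
    Console.WriteLine(item.Label + " (" + item.NameLength + ")");
}
// Prints:
// Green SUV (5)
// Black SUV (5)
```

Because the result type has no name, the variable has to be declared with var.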

Standard Query Operators

So far we have hardly scratched the surface of LINQ syntax and have seen some very simple LINQ queries. In reality, the discussion of query expressions is incomplete without the Standard Query Operators. The standard query operators represent an API defined in the Enumerable and Queryable classes under the System.Linq namespace. These operators are extension methods, most of which accept lambda expressions as arguments. They operate on sequences, where any object which implements the IEnumerable<T> interface qualifies as a sequence. These operators are used to traverse, filter, sort and perform various other functions on the given data. In other words, they provide many of the features of a standard SQL query including Distinct, Group, Set, Order By, Select etc.

I have stated above that a query expression is converted to a method-based query. In a method-based query, each clause is converted to its respective Standard Query Operator (an extension method). For example, the where clause is converted to the Where operator while the select clause is converted to the Select operator. To keep it simple, just remember that a clause in a query expression is represented by an operator when converted to a method-based query.

According to LINQ’s official documentation, Standard Query Operators can be categorized into the following:

• Restriction operators
• Projection operators
• Partitioning operators
• Join operators
• Concatenation operator
• Ordering operators
• Grouping operators
• Set operators
• Conversion operators
• Equality operator
• Element operators
• Generation operators
• Aggregate operators

A detailed explanation of each of the above is beyond the scope of this post. However, in the following sections, we will look at some of these operators and their use.
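
For orientation only, here is a brief sampler with one operator from a few of the categories that are not covered in detail below. It is a sketch rather than a tour; the values are arbitrary.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

// Generation: the integers 1 through 10.
var numbers = Enumerable.Range(1, 10);

// Partitioning: skip the first two elements, then take the next three.
// Conversion: ToArray materializes the result.
var middle = numbers.Skip(2).Take(3).ToArray();        // 3, 4, 5

// Aggregate: sum of all elements and count of the even ones.
int total = numbers.Sum();                              // 55
int evens = numbers.Count(n => n % 2 == 0);             // 5

// Element: the first element that satisfies a condition.
int firstOverSeven = numbers.First(n => n > 7);         // 8

// Equality: compare two sequences element by element.
bool same = middle.SequenceEqual(new[] { 3, 4, 5 });    // true

Console.WriteLine(string.Join(", ", middle) + " | " + total + " | " + evens
    + " | " + firstOverSeven + " | " + same);
// Prints: 3, 4, 5 | 55 | 5 | 8 | True
```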

Select / SelectMany – Projection Operators

The Select operator performs a projection over a sequence and returns an object of type IEnumerable<S>, where S is the type produced by the projection. When this object is enumerated, it enumerates the source sequence and produces one output element for each item in the sequence. The signature of the Select operator is as follows:

public static IEnumerable<S> Select<T, S> (
this IEnumerable<T> source,
Func<T, S> selector);
public static IEnumerable<S> Select<T, S> (
this IEnumerable<T> source,
Func<T, int, S> selector);

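In both overloads, source is the sequence to project and selector is the function applied to each element to produce the output value. In the second overload, the selector receives a second argument of type int, which is the zero-based index of the element within the source sequence. I will skip an example for this operator as all the above examples make use of this operator :)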
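
The index-aware overload does not appear in the earlier listings, so a short sketch of both overloads side by side may help. The colors array is just sample data.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

string[] colors = { "Green", "Black", "Red" };

// First overload: the selector receives each element.
var lengths = colors.Select(c => c.Length);

// Second overload: the selector also receives the zero-based index.
var numbered = colors.Select((c, i) => i + ": " + c);

Console.WriteLine(string.Join(", ", lengths));   // 5, 5, 3
Console.WriteLine(string.Join(", ", numbered));  // 0: Green, 1: Black, 2: Red
```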

The SelectMany operator is used with nested sequences or in other words sequence of sequences. It merges all the sub-sequences into one single enumerable sequence. The SelectMany operator first enumerates the source sequence and converts its respective sub-sequence into an enumerable object. It then enumerates each element in the enumerable object to form a flat sequence. The operator has the following signature:

public static IEnumerable<S> SelectMany<T, S>(
this IEnumerable<T> source,
Func<T, IEnumerable<S> > selector);
public static IEnumerable<S> SelectMany<T, S>(
this IEnumerable<T> source,
Func<T, int, IEnumerable<S>> selector);

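The source is the sequence to be enumerated. The selector is the function that is applied to each element in the sequence and returns the sub-sequence to be flattened. As with Select, the second overload also passes the zero-based index of the element to the selector.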

Listing 4 shows the use of the SelectMany operator:

Listing 4


public class Region
{
     public int RegionID;
     public string RegionName;
     public List<Product> Products;
}

public class Product
{
    public string ProductCode;
    public string ProductName;
}

Now we will initialize a list of type Region with a child object of type Product:


List<Region> products = new List<Region>
        {
            new Region { RegionID = 1, RegionName = "North",
                            Products = new List<Product> { 
                                    new Product { ProductCode = "EG", ProductName = "Eggs" },
                                    new Product { ProductCode = "OJ", ProductName = "Orange Juice" },
                                    new Product { ProductCode = "BR", ProductName = "Bread" }
                            }
            },

            new Region { RegionID = 2, RegionName = "South",
                            Products = new List<Product> { 
                                    new Product { ProductCode = "CR", ProductName = "Cereal" },
                                    new Product { ProductCode = "HO", ProductName = "Honey" },
                                    new Product { ProductCode = "MI", ProductName = "Milk" },
                            }
            },

            new Region { RegionID = 3, RegionName = "East",
                            Products = new List<Product> { 
                                    new Product { ProductCode = "SO", ProductName = "Soap" },
                                    new Product { ProductCode = "BS", ProductName = "Biscuits" },
                            }
            }
        };

Now we apply the SelectMany operator to select products from the North and East region:


var ProductsByRegion =
            products
            .Where (p => p.RegionName == "North" || p.RegionName == "East")
            .SelectMany (p => p.Products); // using SelectMany operator

foreach (var product in ProductsByRegion)
{
    string code, name;

    code = product.ProductCode;
    name = product.ProductName;
    // use code & name
}

The above listing begins by defining two classes, Region and Product. The Region class has a child collection property, Products. Next, a list of regions is initialized such that each Region in turn has multiple products. The LINQ query is applied using the SelectMany operator. This flattens the sub-lists (products in this case) of the selected regions (North and East) into a single sequence to be iterated. If you run the above code and print each code and name, you get the following result:

EG: Eggs, OJ: Orange Juice, BR: Bread, SO: Soap, BS: Biscuits
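
The same flattening can be written as a query expression: a second from clause over the child collection plays the role of SelectMany. The sketch below uses a trimmed-down, hypothetical version of the region data, with anonymous types and plain product names in place of the Region and Product classes.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

// Hypothetical, simplified region data.
var regions = new[]
{
    new { RegionName = "North", Products = new[] { "Eggs", "Orange Juice", "Bread" } },
    new { RegionName = "South", Products = new[] { "Cereal", "Honey", "Milk" } },
    new { RegionName = "East",  Products = new[] { "Soap", "Biscuits" } },
};

// The second from clause iterates the child collection of each selected region,
// producing one flat sequence.
var flattened =
    from region in regions
    where region.RegionName == "North" || region.RegionName == "East"
    from product in region.Products
    select region.RegionName + ": " + product;

foreach (var line in flattened)
{
    Console.WriteLine(line);
}
// Prints:
// North: Eggs
// North: Orange Juice
// North: Bread
// East: Soap
// East: Biscuits
```

A side benefit of this form is that the parent (region) is still in scope in the select clause, so each product can be paired with its region.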

Where – Restriction Operator

The Where operator, also known as the restriction operator, filters a sequence based on a condition. The condition is provided as a predicate. The Where operator enumerates the source sequence and yields those elements for which the predicate returns true. In a query expression, the where keyword plays the same role. The signature for the Where operator is as follows:

public static IEnumerable<TSource> Where<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate);
public static IEnumerable<TSource> Where<TSource>(
this IEnumerable<TSource> source,
Func<TSource, int, bool> predicate);

The source represents the sequence to be enumerated while the predicate defines the condition or filter to be applied on the given sequence.
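
Here is a small sketch of both Where overloads, together with the equivalent where clause. In the second overload the predicate also receives the zero-based index of the element. The numbers are arbitrary sample data.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

int[] numbers = { 10, 15, 20, 25, 30, 35 };

// First overload: keep elements greater than 15.
var large = numbers.Where(n => n > 15);

// Second overload: the predicate also receives the zero-based index;
// keep the elements at positions 0, 2 and 4.
var evenPositions = numbers.Where((n, index) => index % 2 == 0);

// Query-expression form of the first filter.
var largeQuery = from n in numbers where n > 15 select n;

Console.WriteLine(string.Join(", ", large));          // 20, 25, 30, 35
Console.WriteLine(string.Join(", ", evenPositions));  // 10, 20, 30
Console.WriteLine(large.SequenceEqual(largeQuery));   // True
```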

Join / GroupJoin - Join Operators

The Join operator is the counterpart of the inner join used in SQL and serves the same purpose. It pairs elements from two different sequences whose keys match and returns a sequence of the results. This operator has the following signature:

public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult> (
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner, TResult> resultSelector);
public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult> (
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner, TResult> resultSelector,
IEqualityComparer<TKey> comparer);

In the above overloads, outer and inner represent the two source sequences. The functions outerKeySelector and innerKeySelector extract the keys on which the join will be performed; both must return the same key type. The resultSelector function builds the projected result for each matching pair. In the second overload, we can also supply a custom comparer to decide when two keys are equal.

Listing 5 shows the Join operator joining two different lists on a matching key:

Listing 5


public class Developer
{
    public string name {get; set; }
    public string projecttitle { get; set; }
}

public class Project
{
    public string title { get; set; }
    public int manDays {get; set; }
    public string company { get; set; }
}

List<Developer> developers =
                new List<Developer>
                {
                    new Developer { name = "Steven", projecttitle = "ImageProcessing" },
                    new Developer { name = "Markus", projecttitle = "ImageProcessing" },
                    new Developer { name = "Matt",   projecttitle = "ImageProcessing" },
                    new Developer { name = "Shaza",  projecttitle = "GraphicsLibrary" },
                    new Developer { name = "Neil",   projecttitle = "WebArt" },
                };

        List<Project> projects =
                new List<Project>
                {
                    new Project { title = "ImageProcessing",    company = "Future Vision",  manDays = 120 },
                    new Project { title = "DatabaseFusion",     company = "Open Space",     manDays = 30 },
                    new Project { title = "GraphicsLibrary",    company = "Kid Zone",       manDays = 88  },
                    new Project { title = "WebArt",             company = "Web Ideas",      manDays = 57 },
                    new Project { title = "GamingZone",         company = "Play with Us",   manDays = 50},
                };

        var ProjectDetails =
            from dev in developers
            join pro in projects
            on dev.projecttitle
            equals pro.title
            select new
            {
                Programmer = dev.name,
                ProjectName = pro.title,
                Company = pro.company,
                ManDays = pro.manDays   // the source property holds days, not hours
            };

        foreach (var detail in ProjectDetails)
        {
            // use detail.Programmer, detail.Company, detail.ProjectName, detail.ManDays
        }

In the above code, the one thing to notice is the use of ‘equals’ rather than the ‘=’ sign. This is different from what we use in a regular SQL join statement. The example returns a sequence of elements with matching titles from both sequences.
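
For comparison, here is a sketch of a join in method-based syntax using the second overload, which takes a comparer. The data is hypothetical (including the developer "Alex" and the lowercase project key) and is chosen so that a case-insensitive comparer makes a difference; StringComparer.OrdinalIgnoreCase is a comparer that ships with the framework.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

// Hypothetical sample data; note the different casing of the first key.
var developers = new[]
{
    new { Name = "Steven", ProjectTitle = "imageprocessing" },
    new { Name = "Shaza",  ProjectTitle = "GraphicsLibrary" },
    new { Name = "Alex",   ProjectTitle = "Unassigned" },
};
var projects = new[]
{
    new { Title = "ImageProcessing", Company = "Future Vision" },
    new { Title = "GraphicsLibrary", Company = "Kid Zone" },
    new { Title = "DatabaseFusion",  Company = "Open Space" },
};

var details = developers.Join(
    projects,                                              // inner sequence
    dev => dev.ProjectTitle,                               // outer key selector
    pro => pro.Title,                                      // inner key selector
    (dev, pro) => dev.Name + " works for " + pro.Company,  // result selector
    StringComparer.OrdinalIgnoreCase);                     // decides when two keys match

foreach (var line in details)
{
    Console.WriteLine(line);
}
// Prints:
// Steven works for Future Vision
// Shaza works for Kid Zone
```

As with a SQL inner join, elements without a match on the other side (Alex and DatabaseFusion here) do not appear in the result.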

The query in listing 5 works well when we only want the pairs that have a matching key on both sides. However, if we wanted information on all ‘projects’, irrespective of whether any ‘developer’ is assigned to them, that query doesn’t work. In a SQL environment, a left join would do the trick since it returns all ‘projects’ along with their matching ‘developer(s)’, and returns ‘null’ in the developer columns for any ‘project’ that has no associated ‘developer’. In LINQ, the same purpose is served by the GroupJoin operator.

The GroupJoin operator does not return a flat sequence of elements; instead it returns a hierarchical sequence. This sequence contains one element for each element of the outer sequence. For each of these, the matching elements from the inner sequence are grouped and attached to it. So the result is a hierarchical grouping of all elements from the outer sequence, each having a child sequence of matching values from the inner sequence. This operator has the following signature:

public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey,
TResult> (
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, IEnumerable<TInner>, TResult> resultSelector);
public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey,
TResult> (
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
IEqualityComparer<TKey> comparer);

In the above overloads, the arguments are similar to those of the Join operator, but how it works is interesting. When the sequence returned by GroupJoin is iterated, it first enumerates the inner sequence, evaluates innerKeySelector for each element and groups the elements by key. The groups are stored in a hash table against their key. Next, the elements of the outer sequence are enumerated and outerKeySelector is evaluated for each. For each outer element, the hash table is searched for a group with a matching key. If one is found, that group is associated with the outer element; otherwise the outer element gets an empty sequence. This way we get a parent-child grouping of elements. Listing 6 shows how to use the GroupJoin operator (the data sample is from listing 5):

Listing 6


var query = projects.GroupJoin( // outer sequence
                    developers,           // inner sequence 
                    p => p.title,          // outer key selector 
                    d => d.projecttitle, // inner key selector 
                    (p, g) => new
                    {     // result projector
                        ProjectTitle = p.title,
                        Programmers = g
                    });

foreach (var detail in query)
{
        // use detail.ProjectTitle
        foreach (var programmer in detail.Programmers)
        {
                // use programmer.name, programmer.projecttitle;
        }
}

You must have noticed that we are using nested loops to access the elements. This reflects the parent-child hierarchy of the result: for each element in the outer sequence, we have the matching elements (grouped together) from the inner sequence. The above example produces the following result set, where all the ‘projects’ are displayed irrespective of whether any developers are assigned to them:

ImageProcessing
Steven, Markus, Matt
DatabaseFusion
GraphicsLibrary
Shaza
WebArt
Neil
GamingZone
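
The hash-table strategy described before listing 6 can be sketched by hand with ToLookup, which builds a key-to-elements lookup. This is a simplified illustration of the idea and not the framework's actual implementation; the data is a shortened, hypothetical version of listing 5.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

var projects = new[] { "ImageProcessing", "DatabaseFusion", "WebArt" };
var developers = new[]
{
    new { Name = "Steven", ProjectTitle = "ImageProcessing" },
    new { Name = "Markus", ProjectTitle = "ImageProcessing" },
    new { Name = "Neil",   ProjectTitle = "WebArt" },
};

// Step 1: enumerate the inner sequence once and group its elements by key.
var byProject = developers.ToLookup(d => d.ProjectTitle);

// Step 2: enumerate the outer sequence and attach the matching group.
// Indexing a lookup with a missing key yields an empty sequence, not null.
var manual = projects.Select(p =>
    p + " [" + string.Join(", ", byProject[p].Select(d => d.Name)) + "]");

// The built-in operator, for comparison.
var viaGroupJoin = projects.GroupJoin(
    developers,
    p => p,
    d => d.ProjectTitle,
    (p, g) => p + " [" + string.Join(", ", g.Select(d => d.Name)) + "]");

foreach (var line in manual)
{
    Console.WriteLine(line);
}
// Prints:
// ImageProcessing [Steven, Markus]
// DatabaseFusion []
// WebArt [Neil]
Console.WriteLine(manual.SequenceEqual(viaGroupJoin));   // True
```

Building the lookup once means the inner sequence is read a single time, however many outer elements there are.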

OrderBy..ThenBy / OrderByDescending..ThenByDescending - Ordering Operators

The OrderBy operator orders the elements of a sequence by a key, in ascending order. The signature of this operator is as follows:

public static IOrderedEnumerable<T> OrderBy<T, K>(
this IEnumerable<T> source,
Func<T, K> keySelector);
public static IOrderedEnumerable<T> OrderBy<T, K>(
this IEnumerable<T> source,
Func<T, K> keySelector,
IComparer<K> comparer);

In both overloads, source is the sequence on which the operator will operate. The keySelector represents a function that extracts a key of type K from each element of type T in the source sequence. In the second overload, comparer is a custom comparer where we can write custom code to perform the comparison.

You must have noticed that the return type of this operator is IOrderedEnumerable<T> and not IEnumerable<T>. Before I explain this, let me mention that the OrderBy operator is supported by the ThenBy operator. In a regular SQL command, we can order the result set by any number of fields in addition to the primary field. The same concept is supported by the ThenBy operator. The primary ordering is done by the OrderBy operator, followed by the ThenBy operator. The ThenBy operator defines a secondary ordering and can be used any number of times in a LINQ query. Both operators work together as follows:

source-sequence.OrderBy ().ThenBy ().ThenBy ()…

The above shows that the output of OrderBy is the input to the ThenBy operator. Going back to the return type, the ThenBy operator can only be applied to an IOrderedEnumerable<T> and not to a plain IEnumerable<T>. For this reason, the return type of the OrderBy operator is IOrderedEnumerable<T>. The example in listing 7 sorts the projects by manDays (OrderBy) and then by title (ThenBy):

Listing 7


List<Project> projects =
                new List<Project>
                {
                    new Project { title = "ImageProcessing",    company = "Future Vision",  manDays = 120 },
                    new Project { title = "DatabaseFusion",     company = "Open Space",     manDays = 30 },
                    new Project { title = "GraphicsLibrary",    company = "Kid Zone",       manDays = 88  },
                    new Project { title = "WebArt",             company = "Web Ideas",      manDays = 57 },
                    new Project { title = "GamingZone",         company = "Play with Us",   manDays = 50},
                };

IEnumerable<Project> details =
    projects.OrderBy (p => p.manDays).ThenBy (p => p.title);

foreach (var project in details)
{
            // use project.manDays , project.title , project.company
}
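
A secondary key only matters when the primary key has ties. The sketch below uses hypothetical data with a tie on ManDays and shows the same ordering in both method-based and query-expression syntax, where several keys are listed in one orderby clause.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

// Hypothetical data with a tie on ManDays so the secondary key matters.
var projects = new[]
{
    new { Title = "WebArt",         ManDays = 50 },
    new { Title = "GamingZone",     ManDays = 50 },
    new { Title = "DatabaseFusion", ManDays = 30 },
};

// Method-based syntax: primary key first, then the tie-breaker.
var viaMethods = projects.OrderBy(p => p.ManDays).ThenBy(p => p.Title);

// Query-expression syntax: comma-separated keys in one orderby clause.
var viaQuery =
    from p in projects
    orderby p.ManDays, p.Title
    select p;

Console.WriteLine(string.Join(", ", viaMethods.Select(p => p.Title)));
// Prints: DatabaseFusion, GamingZone, WebArt
Console.WriteLine(viaMethods.SequenceEqual(viaQuery));   // True
```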

Ordering can also be done with the OrderByDescending and ThenByDescending operators. In this case, as the names imply, the direction of ordering is descending. The signature of the OrderByDescending operator is as follows, with arguments identical to those of the OrderBy operator:

public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector);
public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer);
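
Here is a short sketch of the descending operators, including a mix of directions. The data is hypothetical, with a tie on ManDays to make the secondary ordering visible.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

var projects = new[]
{
    new { Title = "WebArt",          ManDays = 50 },
    new { Title = "GamingZone",      ManDays = 50 },
    new { Title = "DatabaseFusion",  ManDays = 30 },
    new { Title = "ImageProcessing", ManDays = 120 },
};

// Largest ManDays first; ties broken by Title in reverse alphabetical order.
var descending = projects
    .OrderByDescending(p => p.ManDays)
    .ThenByDescending(p => p.Title);

// Directions can be mixed: descending primary key, ascending secondary key.
var mixed = projects
    .OrderByDescending(p => p.ManDays)
    .ThenBy(p => p.Title);

Console.WriteLine(string.Join(", ", descending.Select(p => p.Title)));
// Prints: ImageProcessing, WebArt, GamingZone, DatabaseFusion
Console.WriteLine(string.Join(", ", mixed.Select(p => p.Title)));
// Prints: ImageProcessing, GamingZone, WebArt, DatabaseFusion
```

In a query expression, the same effect comes from the descending keyword after a key, for example orderby p.ManDays descending, p.Title.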

Distinct / Union - Set Operators

The Distinct operator removes duplicate items from a given sequence. It is the counterpart of the DISTINCT keyword used in regular SQL statements. It has the following signature:

public static IEnumerable<TSource> Distinct<TSource>(
this IEnumerable<TSource> source);
public static IEnumerable<TSource> Distinct<TSource>(
this IEnumerable<TSource> source,
IEqualityComparer<TSource> comparer);

In both overloads, source is the sequence on which the operator will operate. Using the second overload, we can specify a custom comparer that decides when two elements are equal. Listing 8 shows the use of the Distinct operator:

Listing 8


public class Fruit
    {
        public string name { get; set; }
    }

List<Fruit> fruits =
                new List<Fruit>
                {
                    new Fruit { name = "Orange"},
                    new Fruit { name = "Orange"},
                    new Fruit { name = "Orange"},
                    new Fruit { name = "Apple"},
                    new Fruit { name = "Apple"},
                    new Fruit { name = "Papaya" }
                };

        var query =
            (from fruit in fruits
             select new { fruit.name}
            ).Distinct();

        foreach (var fruit in query)
        {
            // use fruit.name
        }

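Note that the query projects each fruit into an anonymous type before calling Distinct. Anonymous types compare by the values of their properties, so equal names count as duplicates. Calling Distinct directly on the Fruit objects would compare references and keep all six items, unless Fruit overrides Equals and GetHashCode or a comparer is supplied. The output of the above query will be the following: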

Orange
Apple
Papaya
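
The second Distinct overload is easiest to see with strings. In this sketch, the built-in StringComparer.OrdinalIgnoreCase serves as the comparer; the fruit names are sample data with deliberately inconsistent casing.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

string[] fruits = { "Orange", "orange", "ORANGE", "Apple", "apple", "Papaya" };

// First overload: default equality, so differently cased names all count as distinct.
var exact = fruits.Distinct();

// Second overload: the comparer decides what counts as a duplicate.
var ignoringCase = fruits.Distinct(StringComparer.OrdinalIgnoreCase);

Console.WriteLine(exact.Count());                    // 6
Console.WriteLine(string.Join(", ", ignoringCase));  // Orange, Apple, Papaya
```

For your own types, the comparer would be a class implementing IEqualityComparer<T> with matching Equals and GetHashCode logic.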

Another set operator is Union. The Union operator returns the unique elements from two sequences, yielding each distinct element only once. The Union operator has the following signature:

public static IEnumerable<TSource> Union<TSource>(
this IEnumerable<TSource> first,
IEnumerable<TSource> second);
public static IEnumerable<TSource> Union<TSource>(
this IEnumerable<TSource> first,
IEnumerable<TSource> second,
IEqualityComparer<TSource> comparer);

In both overloads, first and second represent the two sequences on which the Union operator is applied. The third parameter in the second overload is a custom comparer used to decide when two elements are equal. Listing 9 shows the use of the Union operator:

Listing 9


int[] A = { 1, 2, 3, 4, 5 };
int[] B = { 4, 5, 6, 7, 8};

var union = A.Union (B);

foreach (var n in union)
{
        // use n
}

The result of the above query will be 1, 2, 3, 4, 5, 6, 7, 8. It will ignore the duplicates 4 and 5 and yield one element each.
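
To see what Union adds over plain concatenation, the sketch below contrasts it with Concat (the concatenation operator from the list above) and also shows the comparer overload. The string arrays are made-up sample data.

```csharp
// Top-level statements (C# 9 or later); on older compilers, place this in a Main method.
using System;
using System.Linq;

int[] a = { 1, 2, 3, 4, 5 };
int[] b = { 4, 5, 6, 7, 8 };

// Concat keeps every element, including duplicates.
var concatenated = a.Concat(b);

// Union yields each distinct element once, in order of first appearance.
var union = a.Union(b);

// Second Union overload: the comparer decides when two elements are equal.
string[] x = { "apple", "Orange" };
string[] y = { "APPLE", "papaya" };
var names = x.Union(y, StringComparer.OrdinalIgnoreCase);

Console.WriteLine(string.Join(", ", concatenated));  // 1, 2, 3, 4, 5, 4, 5, 6, 7, 8
Console.WriteLine(string.Join(", ", union));         // 1, 2, 3, 4, 5, 6, 7, 8
Console.WriteLine(string.Join(", ", names));         // apple, Orange, papaya
```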

Summary

In this post, we had a look at the basic LINQ syntax. A LINQ query is also known as a Query Expression. A query expression is a declarative way of writing a query with which we can perform different operations such as filtering, sorting, grouping etc. The result of a query expression is a sequence.

Another way of writing a query is the Method-based Query, which is a concise alternative to the query expression. It makes use of Lambda Expressions and Extension Methods. In the background, every query expression is converted to a method-based query, so there is no performance difference between the two and the choice comes down to preference.

A query expression makes use of Standard Query Operators. These operators represent an API defined in the System.Linq namespace. They operate on a sequence to perform different functions such as sorting, filtering, projection, grouping and much more.

With this we come to the end of this post. In the next post, you will see the use of LINQ in real-world applications. We will begin with LINQ-to-SQL (a LINQ provider), which is used to query relational databases. So stay tuned for more…

Thursday, November 26, 2009

LINQ Explained - Index

LINQ Explained – Part 4