Eager-loading RFC

There are two more features I want to get into Simple.Data before releasing 1.0beta1: upserts, and eager loading. Upserts are pretty straight-forward, but there are a few different approaches to eager loading and I want to get some input from the community before I pick one.

In case you don’t know, eager loading is a pattern for getting data from the database in an optimal fashion, with the minimum number of requests or least-possible server load. It’s the opposite of lazy loading, which Simple.Data already does.

The implementation syntax will look like this:

    IEnumerable<Customer> customers = _db.Customers.FindAllByRegion("South")
                .With(_db.Customers.Orders)
.Cast<Customer>();

This will create the customer objects and populate an ICollection<Order> Orders property in each one, assuming there is one.

There are two directions in which a class may have relationships with other classes: it may have “parent” (or “lookup”) classes, where there is at most a single instance of that parent class within the instance of the class in question; and it may have “child” classes, where there is a collection of instances of the child class. Eager loading of parent classes is trivial; you just add the necessary join(s) to the master table(s) and marshal the data into the right places in code.

Eager loading of child classes is similarly easy when there is only a single child collection to be loaded; you join to the detail table and then process the results so that a single instance of the main table row is created, and instances of the detail table rows are grouped into collections.

For example, loading user objects with a list of phone numbers:

SELECT User.Id, User.Name, Phone.Number
FROM User
LEFT JOIN Phone ON User.Id = Phone.UserId

Returned rows:

User.Id User.Name Phone.Number
1 Alice 0123456789
1 Alice 0738473284
2 Bob 0129934845
2 Bob 0729478594

From this we create a User object for Alice with her two phone numbers, and an object for Bob with his. Even if there are a lot of rows in the detail table, this is still the preferred way of handling eager loading for this scenario as it involves the fewest round-trips to the database. Often, you can handle a parent-child-child relationship using this algorithm, too, although beyond that point you’re getting a bit scary.

The complexity arises when you want to load two unrelated child collections, say, load Customers with their Invoices and Orders:

SELECT Customer.*, Invoice.*, Order.*
FROM Customer
LEFT JOIN Invoice ON Customer.Id = Invoice.CustomerId
LEFT JOIN Order ON Customer.Id = Order.CustomerId

This creates what is technically called an “outer Cartesian product”, where a row is returned for every possible combination of the rows from the two outer-joined tables. So if you’re grabbing the details for a single valuable customer who has placed a thousand orders and generated a thousand invoices, you get a million rows back from the database. I have a hunch this might not be optimal.

So this is my Request for Comments: what would you like to see done here?

The options are:

  1. Only allow one detail-table With clause to be specified per query. Throw an exception if there is more than one.
  2. Allow multiple detail-table With clauses, and use additional SELECT operations for all except the first.
  3. Add a parameter to the With method allowing the developer to specify the technique on a per-detail-table basis:
  4.     var customers = _db.Customers.FindAllByRegion("South")
                    .With(_db.Customers.Orders, WithUsing.Join)
                    .With(_db.Customers.Orders.Items, WithUsing.Join)
                    .With(_db.Customers.Invoices, WithUsing.ExtraQuery);
    

    If only one detail table is specified, or one detail with a detail below it (e.g. _db.Customers.Orders.Items), the default behaviour when no WithUsing parameter is supplied will be to use JOIN. Multiple detail tables without a WithUsing will cause the ADO adapter to throw an exception (other adapters, such as the one for MongoDB, can use the With clause as they see fit).

I think I’ve rubber ducked here a bit, as Option 3 is looking like a no-brainer, but I’ve typed it all now, so if you do have any comments or suggestions, please do leave them below. I’m planning on starting code for this on Sunday morning, so there’s plenty of time to change my mind.

Comments

  1. For this scenario:-
    var c = _db.Customers.FindAllByRegion(“South”)
    .With(_db.Customers.Orders)
    .With(_db.Customers.Invoices)

    Wouldn’t you be better of to fire 2 queries to the database?
    1st query customers left join orders
    2nd query customers left join invoices
    and only hydrate customers once, orders once and invoices once. Mind you the more I think about this the more I would like to swap out “left” with “inner”

    • That’s basically what would be happening with Orders WithUsing.Join and Invoices WithUsing.ExtraSelect, but the extra queries would use inner join and only select the PK from the main table and all the columns from the joined table.

      • So ExtraSelect (or ExtraQuery) will force a second query. This should work fine for any parent->child relation.

        Its the parent->child.grandchild that I am not sure about, I remember seeing a blog from Ayende that shows NHibernate doing a similar thing using futures.

  2. I probably wouldn’t force them to specify. Obviously, MongoDB doesn’t need this type of thing (in terms of join types), so making this column optional always with obvious defaults is the way to go.

    First With gets a Join, second, third, (etc…) get a second query.

  3. Even though I doubt what is the real usage scenario of 3+ level of cascading, I think the best way to handle this is to allow the developer to specify how to map the query to sql and the framework takes care how to map the result to object hierarchy, otherwise you’ll always find some1 complaining.

  4. Frode N. Rosand says:

    Option 3 is definitely the way to go. Defaulting to a join on the first child table only, and extra calls for all other child tables, with the option to override this. Your fluid syntax can provide:

    var customers = _db.Customers.FindAllByRegion(“South”)
    .WithJoin(_db.Customers.Orders)
    .WithJoin(_db.Customers.Orders.Items)
    .WithQuery(_db.Customers.Invoices);

    A WithJoin method could return an interface or class that would allow it to be followed by either WithJoin or WithQuery. WithQuery would return a different interface/class that would only allow sub-sequent WithQuery methods class. (Other methods like your Cast may be allowed of course.)

    • The method will just be called “With”, since the technique is an implementation detail that applies only to relational databases. For MongoDB it will just be a means of aliasing nested collection names, and for the upcoming OData adapter it will allow the query to specify collections to include.

Trackbacks

  1. [...] Simple.Data protects against SQL Injection & Eager-loading RFC – Mark Rendle (not to be confused with his evil twin Merk Rendle) discusses the origins of [...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 3,740 other followers

%d bloggers like this: