Tech blog (41)

NHibernate queued adds on lazy-load collections

I don’t know if this behaviour is documented (well, yeah, the (N)Hibernate documentation is pretty thin but it’s improving), I wasn’t fully aware of it and this caused a bug… I’m posting this in hope it may save for someone else the time I have lost today :).

In our software we use a custom NHibernate collection type that is derived from AbstractPersistentCollection and keeps track of “back references”: that is, references to the record that owns the collection. It does this automatically when an object is added to the collection.

Now, on collections mapped as lazy-loading, Add() operations are allowed even if the collection is not initialized (i.e. the collection just acts as a proxy). When adding an object to a collection in this state, a QueueAdd() method is called that stores the added object in a secondary collection. Once a lazy initialization is performed, this secondary collection is merged into the main one (I believe it’s the DelayedAddAll() method that does this). This can be hard to debug because lazy load is transparently triggered if you just touch the collection with the debugger (providing the session is connected at that moment), and everything gets initialized properly.

Our backreference was initialized at the moment the object was really added into the main collection. But this is not enough, we had to support queued adds - that is, the cases when QueueAdd returns true. The other alternative is to disable delayed adds by commenting out the places where QueueAdd is called – I don’t know if this is possible, there seems to be some code that supports it. We decided to support delayed add, and it seems to work. The modification looks something like this (this is the PersistentXyz class):

int IList.Add(object value) 
    if (!QueueAdd(value)) 
        return ((IList) bag).Add(value); 
        // if the add was queued, we must set the back reference explicitly 
        if (BackReferenceController != null) 
        return -1; 

Debugging SQL Express Client Sync Provider

How to finally get the SQL Express Client Sync Provider to work correctly? It’s been almost a year since it was released, and still it has documented bugs. One was detected by Microsoft more than a month after release and documented on the forum, but the fix was never included in the released version. We could analyze this kind of shameless negligence in the context of Microsoft's overall quality policies, but it’s a broad (and also well documented) topic, so we’ll leave it at that. It wouldn’t be such a problem if there were no people interested in using it, but there are, very much so. So, what else is there to do than to try to fix what we can ourselves… You can find the source for the class here. To use it, you may also want to download (if you don’t already have it) the original sql express provider source which has the solution and project files which I didn’t include. (UPDATE: the original source seems to be removed from the MSDN site, and my code was updated - see the comments for this post to download the latest version). The first (and solved, albeit only on the forum) problem was that the provider was reversing the sync direction. This happens because the client provider basically simulates client behavior by internally using a server provider. In hub-and-spoke replication, the distinction between client and server is important since only the client databases keep track of synchronization anchors (that is, remember what was replicated and when). I also incorporated support for datetime anchors I proposed in the mentioned forum post, which wasn’t present in the original source. But that is not all that’s wrong with the provider: it seems that it also swaps client and server anchors, and that is a very serious blunder because it’s very hard to detect. It effectively uses client time/timestamps to detect changes on the server and vice versa. I tested it using datetime anchors, and this is the most dangerous situation because if the server clocks aren’t perfectly synchronized, data can be lost. (It might behave differently with timestamps, but it doubt it). The obvious solution for anchors is to also swap them before and after running synchronization. This can be done by modifying the ApplyChanges method like this:

foreach (SyncTableMetadata metaTable in groupMetadata.TablesMetadata)
    SyncAnchor temp = metaTable.LastReceivedAnchor;
    metaTable.LastReceivedAnchor = metaTable.LastSentAnchor;
    metaTable.LastSentAnchor = temp;

// this is the original line
SyncContext syncContext = _dbSyncProvider.ApplyChanges(groupMetadata, dataSet, syncSession); 

foreach (SyncTableMetadata metaTable in groupMetadata.TablesMetadata)
    SyncAnchor temp = metaTable.LastReceivedAnchor;
    metaTable.LastReceivedAnchor = metaTable.LastSentAnchor;
    metaTable.LastSentAnchor = temp;

This seems to correct the anchor confusion but for some reason the @sync_new_received_anchor parameter still receives an invalid value in the update/insert/delete stage, so it shouldn’t be used. The reason for this could be that both the client and server use the same sync metadata and that the server sync provider posing as client probably doesn’t think it is required to leave valid anchor values after it’s finished. I promise to post in the future some more information I gathered poking around the sync framework innards. Note that this version is by no means fully tested nor without issues, but its basic functionality seems correct. You have to be careful to use @sync_new_anchor only in queries that select changes (either that or modify the provider further to correct this behaviour: I think this can be done by storing and restoring anchors in the metadata during ApplyChanges, but I’m not sure whether this is compatible with the provider internal logic). Another minor issue I found was that the trace log reports both client and server providers as servers. If you find and/or fix another issue with the provider, please post a comment here so that we can one day have a fully functional provider.

Who cares about C# interfaces?

I remember that when the first version of Java was released one of the frequent questions was: why doesn’t it have multiple inheritance? This also happened later with C#. The answer in both cases was: you can use interfaces instead. It is not the same, you don’t inherit the logic, but you can at least simulate it by implementing multiple interfaces.

Well, this was obviously forgotten by the people designing C# classes: even worse, they seem not to have been aware of it because most of the classes in v1.0 onward don’t have equivalent interfaces that would allow this.

There’s a real-life example (or should I say everyday example? Because I’ve encountered it a million times, you probably did too) that perfectly illustrates the point: why isn’t there an IControl interface that represents a (windows, web, whichever) control type? The logical answer (and it was probably the real reason the framework designers came up with) would be: well, it would be too difficult to implement. Anything that is supposed to work as a control should really have to be derived from Control. True: but the purpose of interfaces is not only to implement them.

Let’s say I want to create my control type that implements some interface, IMyInterface. How do I reference this component, using a variable of Control type or IMyInterface? Whichever I use it isn’t complete, I have to cast it into the other one and I lose compile-time type safety. I could create a ControlWithMyInterface base class that implements the interface and use this instead of both, but then I’d have to inherit everything from it, which in most cases would not be possible. No, the only solution would be to support the “interfaces instead of multiple inheritance” principle and derive IMyInterface from IControl, but IControl doesn’t exist. Although it wouldn’t be too difficult to implement, one could just put all of Control’s public members into the interface.

There could really be a programming rule enforcing the existence of an equivalent interface for every class, even something that generates them on the fly from its public members. I wonder if there is a pre-processor that can help with this?

Supporting triggers that occasionally generate values with NHibernate

NHibernate supports fields that are generated by the database, but in a limited way. You can mark a field as generated on insert, on update, or both. In this case, NHibernate doesn’t write the field’s value to the database, but creates a select statement that retrieves its value after update or insert.

Ok, but what if you have a trigger that updates this field in some cases and sometimes doesn’t? For example, you may have a document number that is generated for some types of documents, and set by the user for other types. You cannot do this with NHibernate in a regular way – but there is a workaround…

It is possible to map multiple properties to the same database column. So, if you make a non-generated property that is writable, and a generated read-only property, this works. You have to be careful, though, because the non-generated property’s value won’t be refreshed after database writes.

A more secure solution would be to make one of the properties non-public and implement the other one to support both functionalities. Like this:

// This field is used only to send a value to the database trigger:
// the value set here will be written to the database table and can be consumed by
// the trigger. But it will not be refreshed if the value was changed by the trigger.
private int? setDocumentNumber;	

private int? _documentnumber;

// The public property that works as expected, generated but not read-only
// public int? DocumentNumber { get { return _documentnumber; } set { // NHibernate is indifferent to this property's value (it will not // be written to the database), so we have to update the setDocumentNumber // field which is regularly mapped _documentnumber = value; setDocumentNumber = value; } }

Here’s the NHibernate mapping for these two:

<property name="DocumentNumber" generated="always" insert="false" update="false"/>
<property name="setDocumentNumber" column="DocumentNumber" access="field"/>

Model-Controller pattern for business rules

The model-view-controller (MVC) pattern has an interesting, let’s call it sub-pattern, that could be more broadly used. The basic purpose of MVC is to “clean up” the data (model) and interface (view) by delegating all the in-between “dirty” logic to the controller. The controller is aware of both the data and the interface and knows how to handle both: it controls what is happening in them, but from the outside. The controller, if written correctly, can be reused for multiple types of model or view. If we go a step further, we could program multiple controllers that know how to handle different aspects of the functionality: in this way we very much get add-ons that we can attach onto non-intelligent components and make them smarter. By choosing which controllers we attach, we can customize the behaviour of our application. And because the controller is made to work “from the outside” it is isolated from the internal implementation of the components: this could make it instantly reusable.

But it could be interesting to generalize the “add-on” controller pattern and make it work in other situations as well. One interesting area of application could be controlling data: if you use an Object-relational manager (ORM) to read your data (and if you don’t, you should), you already have your data in the perfect form for this - that is, as objects. Likewise if you use a traditional data-access layer and then create business entities as object-oriented structures. Usually, these classes contain much of the business logic inside and this is quite natural to do because business logic represents the behaviour of the data. But it’s not good for reusability: whether you combine the data access and business entities or implement them separately, the business logic is tightly coupled with data/hardcoded inside data so it’s not easy to customize. It would be nice if we could externalize at least some parts of it, but keep it as close to data as possible. Even better would be if we could split the logic and implement different aspects as different components.

If we attached controller objects to our data, they could take the responsibility of providing its behaviour. We could have different controller types for different types of behaviour (status changes, automatic calculations etc.), and if we instantiate them using some kind of configuration engine (or dependency injection framework), they will be easy to replace and thus make your application highly customizable.

Of course, these “data controllers” sound similar to what rules engines do, only more lightweight. But how much of an overhead would they add? As always, it depends on what we do, but I would say not too much. They would contain the code we already have, with probably one layer of isolation because it would be split into distinct classes. Would the instantiated rules waste memory? Probably some: but when you consider that the ORM instantiates each record as an object and has a complex infrastructure for tracking its changes; that for each data binding there is a couple of objects created in the interface; that there can possibly be a control (which is quite a large object) bound to each field… It seems that it shouldn’t make a big difference.

The most sensitive bit is collections: there we can have hundreds of records with one control (a data grid) bound to them. In this case, the overhead of the rules engine could be significant compared to the overhead of the interface – but not in comparison to ORM overhead. Again, this depends on how we implement the rules: the rule could be optimized to utilize a single instance for the whole collection – this can be done using an IBindingList implementation where the collection transmits events for everything happening inside it… Of course, here it is the collection which incurs the overhead, but let’s say we would use it anyway for data binding purposes.

Obviously, there’s a lot of technical issues here. This is partly because the base framework doesn’t have built-in support for what we need. But, it is to be expected that having an ORM (which can be interrogated for data structures and from which we can receive notifications when data is loaded/saved) generally helps.

Of course, I’m still talking hypothetically here: but I’ve already made a couple of small steps in the direction of having a living prototype (in fact, I’m testing an minimal working something as we – uhm, speak). More on that in another post.

Common Table Expression For Sorting Hierarchical Records By Depth

I tried to find this on the net, and was unable to find a satisfying and simple solution. How do you retrieve hierarchical records (say, from a self-referencing table) sorted by their depth in the hierarchy? Microsoft SQL Server 2005/2008 has a new SQL construct,  Here’s one example: a Category table with an ID (primary key) and ParentID (pointing to the hierarchical parent).

WITH CategoryCTE(ID, ParentID, Depth) 
  SELECT ID, ParentID, 0 
  FROM Category 
  WHERE ParentID IS NULL – root records


  SELECT cRecursive.ID, cRecursive.ParentID, cCte.Depth+1 
  FROM Category AS cRecursive JOIN CategoryCTE AS cCte 
      ON cRecursive.ParentID = cCte.ID 
FROM CategoryCTE 

Running Composite UI Application Block inside a windows service

This was a brain-twister: not a lot of work but hard to figure out. How does one use CAB in a windows service?

Is this a reasonable requirement, a Composite UI framework in an application with no UI? Well, it is, since CAB is not only about UI… If you have a framework of components that use CAB services and need to run them in unattended mode, it would be much easier to implement CAB support in the service instead of modifying everything to run with as well as without CAB.

I’m going to present one solution that worked for me, but I believe there are other variations. Since your requirements may vary, I’ll describe the general idea so you can modify or improve it.

Microsoft Sync Framework: where will it end?

It seems that the Microsoft Sync Framework is being developed in a hurry. It is a quite a big task - or at least it should become big if we're to have a serious data distribution framework - therefore it probably merits some patience, but the first thing that is sacrificed in similar cases is documentation. So what we now have is a piece of software for which it is not easy to figure out how it works, and once you do, there are cases when you’re left to your own power of deduction to figure out how it’s supposed to work.

I haven't dug into the framework deeply enough, but I'm digging. And I have the intention to document the findings, even if it means just sketching everything in short sentences. There are many things not obvious until you start disassembling (and Reflector Ilspy-ing, of course) the innards of the dlls. And even in that case, you have to keep notes because it's not a simple system. I intend to come back to this subject in the posts to come (and get way more technical), be sure to check back if you're interested.

I cannot say for sure, but given the complexity of problem the Sync Framework set out to solve, it is commendably (somewhat bravely, even) comprehensive, well thought out - and quite stable for a Microsoft V1. There's a V2 "on the air" right now, but it's a technology preview and it mostly contains a more mature version of the stuff we've already seen before. But even V2 or V3 would only be the small first step: what it currently does, copying database rows back and forth between PCs is not a mechanism that will allow us to one day easily build distributed systems. Even Sync Framework guys themselves acknowledge that the biggest obstacle is replication conflicts - irregularities that occur when the same piece of data is changed in multiple locations at the same time. Microsoft cannot help but give us a simplified solution in the form of record-by-record detection and resolution, and this is because the framework is in its very early stages: I don't know even if (or when) it will grow smart enough to handle more serious conflict resolution.

The thing is, record-by-record resolution cannot help you enforce business rules: for example, if your business logic depends on an invoice not containing the same product multiple times, how do you prevent this from happening in a distributed system? Two users working with two different databases can each add a record for a single item, but when the records replicate you get two of them. This really needs to be detected, and not in such way that would require programming a separate copy of validation logic for synchronization issues (which, when you think of it, should validate data that was already succesfully written to the database… I shudder to think of it). The synchronization framework would really need to somehow integrate with validation logic: in this aspect (and this is probably the biggest issue but only one of the issues present), Microsoft Sync Framework is much closer to the start line than to the finish. But at least it's moving...

The IT industry has so far moved on mostly in a step by step fashion, by implementing better solutions than the existing ones. This is where the Sync Framework will be of most use, to finally help us start thinking in terms of distributed data. Also, once there's a working system for data distribution, most will be interested in having it. And once they do, it will be much easier to persuade them that they need to structure their data and/or applications differently. Hopefully we'll be moving onto an application design philosophy in which it is a "good thing" to have distributed data just like it currently is a "good thing" to have object-orientedness, layered structure etc. There’s a good chance CRUD will be the one of the things that we'll start getting rid of. Because, once you look at it, storing the current state of data (which is the essence of CRUD - Create, Read, Update, Delete - philosophy) is the major factor in causing replication conflicts. If the databases stored operations - that is, changes to the data – besides data itself, it would be much easier to resolve conflicts, many of them automatically. The logic would know what the two mentioned users did - added the same product to the invoice - and act with this knowledge. In this concrete example there would be a much clearer situation for conflict resolution, the system could replay the operations so that the second one gets a chance to detect there already is a record present and act accordingly - be it to add the second quantity to the first or raise an error. Note that now a common validation logic for the operation could be employed... This is light years away from getting "fait accompli" duplicated rows and having to do a Sherlock Holmes to discover what has happened. Of course, this is also light years from where we currently are, but when you think of it, the database servers are way overdue for serious feature upgrades – and in any case, they already store something that resembles this in transactional logs.

So, it seems we're making the first step in the right general direction, even if we're not sure what precise direction we should move in. Trying to wrap our heads around distributed data philosphy is good – and seeing this practice widely deployed will be even better.

Subscribe to this RSS feed