Implementing Equals() and GetHashCode() in ORM classes with autoincrement keys
- Written by Boris Drajer
- Published in NHibernate
The requirements for implementing Equals() and GetHashCode() in .Net are very hard to satisfy and in some areas nearly impossible. There are some situations where an object's identity inevitably changes, and saving an ORM-mapped object with autogenerated key is a case in point. Making its hash code conform with requirements would require an inordinately large quantity of code.
The plot goes something like this: loading ORM objects on the same session (in NHibernate speak) or context (Entity Framework) guarantees that for one record only one object instance will be created. But, if you use multiple sessions/context, no such guarantee exists. (And you want to do this if objects have different lifespans: for example, you fill a combo box once and then bind it to different records... Obviously, I'm talking about WinForms here, but the principle applies to server-side logic although it's probably not as frequent). In case of multiple instances, .Net components don't know it's the same record unless you override Equals() and get them compared by primary key values. In WinForms, for example, this means that a combo box won't know which record in the dropdown is equal to the one in the bound property, and won't select it.
Ok, so we override Equals(): usually, the record has an autoincrement key called, say, ID. We implement it so that it compares ID's of different objects (and type, obviously)... And now we run into the .Net requirement which says that two objects that are equal must have the same hash code, and that the hash code must never change for an object. We can override GetHashCode() to return the hash of the ID, but if the object gets saved to the database, the ID - and therefore the hash - will change.
Here's an example of how it would work: create a new ORM object instance, it's ID is NULL or zero. Use it as index in a dictionary, the dictionary retrieves the index's hash code and stores data in a special bucket for this hash. Save the record - the ID changes. If the hash code changes now, you won't be able to retrieve your data from the dictionary anymore. But if you load this record on a different session/context, it will have a different hash code unless we somehow notify it to use the already generated one... Which would probably mean using a static instance of a component that tracks all objects. Way too much work to get a hash code right, isn't it...
A couple of details that could lead us closer to a solution:
- On a new object instance (unsaved - with ID null or zero or whatever), we cannot use the ID in our overrides. Two objects with the empty ID are not equal nor will they ever be: if they both get saved into the database, it will create two separate records, and their IDs will acquire new values that were never used before... An unsaved object is only equal to itself. We could generate same hash codes for unsaved objects, but this wouldn't resolve our problem if a saved object gets its hash from the ID - it would still be different.
- While we're at it, it's useful knowing how the Dictionary works: it calls GetHashCode() on the given index and stores the entry in a bucket tagged with this hash code. Of course, there may be multiple objects with the same hash, and a bucket may contain multiple entries. Therefore, when retrieving data, the dictionary also calls Equals() on indexes in the bucket to see which of the entries is the right one. This means we have to get both Equals() and GetHashCode() right for the dictionary to work: Equals() should be OK in its simplest form if we always use the same instance for the index - basically, Equals() must be able to recognise the object as equal to itself.
- Other components, like grids and combo boxes, also use hash codes to efficiently identify instances, so a dictionary isn't the only thing we're supporting with this.
One part of the solution seems mandatory: we need to remember the generated hash on the object. This is mandatory only on an object whose ID may change in the future: if a saved object gets a permanent ID (as they usually do), caching is not necessary. If an unsaved object gets saved, we still use the hash we generated using the empty ID. We do this on-demand: when GetHashCode() is called, the hash is generated and remembered. This is probably the only meaningful way to do this, but it's worth pointing out one detail: if the object isn't used in a dictionary, it's hash won't be generated and won't change when it's saved. Thus, we narrowed down our problem to where this feature is actually used.
But there's still the possibility to have two objects that are equal (same record loaded on different sessions) but have different hash codes (the first one saved into the database and then the same record loaded afterwards). I'm not sure what problems this would create, but one is obvious: we wouldn't be able to use both of them interchangeably in a dictionary (or, possibly, a grid). This is not entirely unnatural, at least to me: if I used an object as an index in a dictionary, I'm regarding the stored entry as related to the object instance and not the record that sits behind it. I'm unaware of other consequences - please comment if you know any.
Note that we can also avoid the dictionary problem by using the IDs themselves instead of objects... But still, it would remain for grids and elsewhere. Also I'm not sure if it could be resolved by not using the data layer (ORM) objects in the dictionaries and grids but having data copied into business layer object instances: if we did this, we'd still need a component that tracks duplicates, only it would track business objects instead of data objects.
Can we narrow this further down? A rather important point is that the ID gets changed only when saving data - and ORMs usually save the whole session as one transaction. If we discarded the saved session and started afresh, we'd get correct hash codes and unchanged IDs. We'd only have a brief period of possible irregularity in the time after the old data is saved and before new data is loaded, and only if we load data on different sessions and use it in a dictionary or some other similar component. In client-side applications, this is a risky period anyway because different components get the new data at different times, and care must be taken not to mix it. At least some kind of freeze should be imposed on the components - suspending layout, disabling databinding etc. Also, reloading data is natural if you have logic that runs in the database (usually triggers) that may perform additional changes to data after we saved ours (and it may do this on related records, not just the ones we saved)... But that is a different, as they say, can of worms: it's just that these worms often link up to better plot our ruin.