
July 13, 2008

Is "save" obsolete?

Hangover aside, I want to talk about a recent "realization" of mine. I am not sure it is a good idea, but it might be worth investigating. So here we go.

In the ORMs I know (I do not pretend to know all of them, and I have extensive experience with only a few, mostly in Ruby), the way you handle your objects' persistence is basically a read-edit-save loop (RESL). There is nothing wrong with that, especially given that typical relational databases and ORMs have no built-in versioning.

But as you may know, we on the idbns team are experimenting with some weird ideas and prototypes (like StrokeDB), and one of the things we definitely love about our approach to data management is built-in versioning.

At the moment, StrokeDB implements this RESL thing, too. It is quite a common approach, but it isn't any fun. It does work with versioning pretty well: it just increments the document version once you save. Nothing really tricky.

But there is one thing. The viewpoint I have developed over the past few years is that your persistence mechanism should not let you "separate" your objects from your programming environment. So why the hell should I remind my programming environment to persist an object's change every time I modify it? Wouldn't it be nice to persist data transparently?

Maybe. There is nothing new about this idea, but as far as I understand, it does not see much public use in the industry.

Let's see where it leads us. Given built-in versioning, every change (like a change to a slot's value) will create a new version, so versions will change quite frequently and, more importantly, will be pretty much pointless on their own. You will have a great history of every single change, but you won't be able to say "and here we did that" for any update that touches more than one slot.

Unless you describe it explicitly. What if we make a record for every "business operation", something like a document that says:

  • this operation was performed at Jul 13, 2008 04:17AM PST
  • this operation was "week expenses adjustment"
  • this operation was performed on document 61fe324e-3e6e-49e8-9427-6ebab7c31ff9
  • this operation starts at version 164c2a4b-294a-4253-97e7-124cc1e4a1cc and ends at version 9d888a2a-7f69-40ae-83c6-c55262d89d99
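To make the shape of such a record concrete, here is a minimal sketch in Python. It is not StrokeDB code: the `Document` and `OperationRecord` classes and all of their fields are hypothetical names made up for illustration, and the store is just an in-memory list of snapshots.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


class Document:
    """Hypothetical document: every slot change is persisted as a new version."""

    def __init__(self):
        self.id = str(uuid.uuid4())
        self.slots = {}
        # History of (version id, snapshot of all slots at that version).
        self.history = [(str(uuid.uuid4()), {})]

    @property
    def version(self):
        return self.history[-1][0]

    def set(self, slot, value):
        # No explicit save: the change is versioned as soon as it happens.
        self.slots[slot] = value
        self.history.append((str(uuid.uuid4()), dict(self.slots)))


@dataclass
class OperationRecord:
    """Explicit record of one business operation over a range of versions."""
    name: str
    document_id: str
    start_version: str
    end_version: Optional[str] = None
    performed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


doc = Document()
start = doc.version
doc.set("food", 120)       # creates one version
doc.set("transport", 45)   # creates another
record = OperationRecord("week expenses adjustment", doc.id, start, doc.version)
```

The two intermediate versions mean little on their own, but the record says which range of them made up one "week expenses adjustment".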

It seems that having such explicit records would also allow us to run some kind of smart and safe compaction on the database.

I am not sure about the whole idea, but it still sounds interesting to me. What do you think?


Comments


What about atomic transactions? Sometimes we have to build an object (or series of objects) through various invalid states before we want to flag it as "done" and available for reading by someone else. Validation in general becomes difficult when you don't have a clear save point.

Having said that, you could do all this if you just implemented transaction blocks. The end of a transaction would be an ideal place to trigger any validation checks for newly persisted objects. I would definitely prefer it to explicit save, which can land you in all sorts of catch-22s with co-dependent validity and all sorts of crazy junk.

If all we had was transactions, then all I would have to think about is when I am "done" building some objects. And I wouldn't even have to do that all the time: object changes outside a transaction can simply be persisted immediately.

Another potential problem would be managing conflict resolution when 2 clients edit the same object. With explicit save there is a clear point where you can intercept conflicts (the save call), whereas with implicit save you'll probably have to resort to exception catching and the flow control is going to get difficult to reason about.


Aaaaand finally there's the price you pay every time you add a layer of abstraction: it's more difficult to reason about the underlying mechanism from outside the box which can make performance VERY difficult to tune and you pretty much have to rely on the abstraction's author(s) to do it for you.

I imagine implicit save when your data store is a high-latency remote wouldn't fly at all ;)

>What about atomic transactions? Sometimes we have to build an object (or series of objects) through various invalid states before we want to flag it as "done" and available for reading by someone else.

That's where these operation log records might come in handy. We can have something like a record with an open "end version", meaning the operation is still in progress.

> Validation in general becomes difficult when you don't have a clear save point.

The same — validate on operation completion only?

Speaking about transactions, these operation log records ARE some sort of transactions, though explicit and persistent.
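A rough sketch of what I mean, in Python rather than anything StrokeDB-specific. The `Store` class, its `operation` block, and the `published` pointer (what other readers get to see) are all assumptions invented for this example; the only idea taken from above is that a record stays open while work is in progress and validation runs only when it closes.

```python
import uuid
from contextlib import contextmanager


class Store:
    """Hypothetical in-memory store: every write is versioned immediately."""

    def __init__(self):
        self.versions = [(str(uuid.uuid4()), {})]  # (version id, snapshot)
        self.operations = []                       # persistent operation log
        self.published = self.versions[0]          # what other readers see

    def write(self, slot, value):
        snapshot = dict(self.versions[-1][1])
        snapshot[slot] = value
        self.versions.append((str(uuid.uuid4()), snapshot))

    @contextmanager
    def operation(self, name, validate):
        # An open "end" marks the operation as still in progress.
        record = {"name": name, "start": self.versions[-1][0], "end": None}
        self.operations.append(record)
        yield record
        version_id, snapshot = self.versions[-1]
        # Validation happens only here, when the operation completes.
        if not validate(snapshot):
            raise ValueError(f"operation {name!r} left the data invalid")
        record["end"] = version_id
        self.published = self.versions[-1]


def totals_match(doc):
    return doc.get("total") == sum(doc.get("lines", []))


store = Store()
with store.operation("create invoice", totals_match) as op:
    store.write("lines", [10, 15])  # invalid intermediate state, still versioned
    store.write("total", 25)        # valid again by the time the block ends
```

If validation fails, the record simply stays open and `published` does not move, so readers never see the half-built state.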

>Another potential problem would be managing conflict resolution when 2 clients edit the same object.

Again, let's merge data when operations are closed. I think it makes sense to do it only after that, doesn't it?

Performance is currently out of scope; I am more into investigating the logical side of the idea at the moment. I am too lazy to think about performance issues while I am not even sure about the whole idea. "It is a problem when it is a problem".

A high-latency remote is not an issue at all given my viewpoint on decentralized databases. You just maintain your local copy and replicate whenever you need to.

The important point of the whole idea is that you actually end up with a very definite log of operations and a complete time-based change log. Isn't it nice? :)

Yes the operation log markers can definitely be thought of as just a way to implement transactions, transactions being the more familiar higher-level abstractions (and activity log markers being the more granular handle for them).

I think the merge thing would become an issue when you're not really working with transactions (which would be the more common case I think, except for particularly hairy object instantiations).


Although, come to think of it, perhaps using transaction blocks everywhere SHOULD be the more common case. OK, so you still have to call "save" (kind of, except you end your transaction instead of calling save), but it's still better than having to call .save on every object, AND the storage layer itself has an easier time figuring out what needs saving (since the answer is simply "whatever isn't already saved").

Also I have a feeling this is not impossible to do using good ol' ActiveRecord ;)

Yeah, so to summarize

* you don't have to "save" each object, which is definitely nice
* you can have a clean log of operations, which may come in handy at some point
* you operate within transaction boundaries (which is basically good)

and yes, I wasn't speaking about ActiveRecord, you know. I am more into StrokeDB and other things we are experimenting with.

Sounds more like git than an ordinary DB
Why not?
How would you like an approach like this:
the application asks the server for documents and modifies them or just creates new documents, then it asks the server to pull its changes, and the server can accept if there are no collisions or decline if there are any.
This transaction will have a common comment/date/version, so then we do not have to store version info for each type of document.
I think git is going exactly that way, and most DBs will sometime; now it's the application's turn to work with that approach.
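A minimal Python sketch of that accept/decline flow, under my own assumptions: the `Server` class and its `checkout`/`push` methods are hypothetical, and a "collision" is taken to mean that a document in the push was changed on the server after the client fetched it.

```python
from datetime import datetime, timezone


class Server:
    """Hypothetical server that accepts a push only if nothing collides."""

    def __init__(self):
        self.head = 0           # version of the latest accepted transaction
        self.documents = {}
        self.changed_at = {}    # document id -> version that last changed it
        self.log = []           # one comment/date/version entry per transaction

    def checkout(self):
        # The client remembers which version its copy is based on.
        return self.head, dict(self.documents)

    def push(self, base, changes, comment):
        # Collision: a pushed document was changed after the client's base.
        if any(self.changed_at.get(doc_id, 0) > base for doc_id in changes):
            return False
        self.head += 1
        for doc_id, doc in changes.items():
            self.documents[doc_id] = doc
            self.changed_at[doc_id] = self.head
        self.log.append({"version": self.head, "comment": comment,
                         "date": datetime.now(timezone.utc)})
        return True


server = Server()
base_a, _ = server.checkout()
base_b, _ = server.checkout()

accepted_a = server.push(base_a, {"doc-1": {"title": "from A"}}, "A edits doc-1")
declined_b = server.push(base_b, {"doc-1": {"title": "from B"}}, "B edits doc-1")
accepted_b = server.push(base_b, {"doc-2": {"title": "from B"}}, "B adds doc-2")
```

Here the version info lives on the transaction log entry, not on each document, which is the point of the comment above.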

Philipp,

Actually, there should not be any specific "server"; the application should work with its (partial) clone, and updates should somehow be distributed across the global database.

Yes, it is somewhat reminiscent of git, but in a more general "sense".

You'd better have a way to mark the end of an "operation", so that the persistent store has better granularity than a change to any single component of a stored object.

BTW, while thinking about validation issues, the Git model came to my mind.

Imagine you have 2 repositories: one for production use and one for experiments. If the feature you're working on is complex enough, you save intermediate states into an experimental branch, and push all the stuff to production only when all the tests have passed.

In the case of a DB, exactly the same thing comes up. We can turn the term "transaction" into the term "branch". You may save your invalid data smoothly into some temporary / in-memory storage, and then push the changes atomically into the storage where validation must be performed.

Now it seems that validations (as well as constraints in general) are not a property of a document or the document type, but rather a property of the repository that hosts the document.
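Here is how that might look as a small Python sketch. Everything in it is hypothetical (the `Repository` class, `save`, `push`, and the `has_total` constraint); it only illustrates the idea that constraints belong to a repository and that a push is checked as a whole before anything lands.

```python
class Repository:
    """Hypothetical repository: constraints belong to it, not to documents."""

    def __init__(self, validators=()):
        self.validators = list(validators)
        self.documents = {}

    def _check(self, doc):
        for validate in self.validators:
            if not validate(doc):
                raise ValueError(f"document rejected by {validate.__name__}")

    def save(self, doc_id, doc):
        self._check(doc)
        self.documents[doc_id] = dict(doc)

    def push(self, target):
        # Check every document against the target's constraints first,
        # so the push is all-or-nothing.
        for doc in self.documents.values():
            target._check(doc)
        target.documents.update({k: dict(v) for k, v in self.documents.items()})


def has_total(doc):
    return "total" in doc


production = Repository(validators=[has_total])
scratch = Repository()  # no constraints: invalid states are welcome here

scratch.save("invoice-1", {"lines": [10, 15]})  # fine in scratch
try:
    scratch.push(production)                    # rejected by production
    pushed_early = True
except ValueError:
    pushed_early = False

scratch.save("invoice-1", {"lines": [10, 15], "total": 25})
scratch.push(production)                        # now accepted
```

The same document is acceptable in one repository and rejected by the other, which is what makes the constraint a property of the repository.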

This approach reminds me of that of the classic Smalltalk object databases, in particular, Gemstone (which begat Maglev).

In Gemstone, you have explicit transactions, and various concurrency primitives (so you can do something intelligent instead of invalidating a transaction when two transactions modify the same object). Further, things are automatically saved, and you can use Smalltalk (and in the case of Maglev, Ruby) as a language for stored procedures.

All told, if you're implementing a new ORM, you may want to look at Gemstone as an example. (or, better yet, just store the objects in Gemstone and deal with the easier problem of mapping between two object models)
