Databases

August 05, 2008

StrokeDB: what's up?

I must confess. We've committed a serious sin. We haven't made a single commit to the StrokeDB repository for almost two months. Sorry.

So... one may wonder whether StrokeDB is still alive. Well, yes, it is, though our plan and approach have changed significantly.

For now, we consider the existing StrokeDB a first prototype: it has both cool things and a lot of crappy code. So the existing version will most probably perish.

However, we are learning the lessons of that awesome development cycle and trying to produce something cleaner, both conceptually and code-wise. One of the versions we've started working on is Oleg's strokedb-core, an attempt to minimize StrokeDB's essential core functionality and thereby get a much simpler and more modular thing.

That said, I am still trying to gather my thoughts about my initial StrokeDB experience and come up with cures for some of its problems. That might mean yet another "rewrite" branch, which I may start soon in order to consolidate my updated vision.

Another change we're working on is replacing our custom storages with more proven things, like TokyoCabinet. As you can guess, that eliminates our skiplist and therefore means less code, which is pretty good anyway.

So, to recap, our plan is to:

  • Modularize the core
  • Consolidate the updated data organization vision
  • Move to more reliable storage

Unfortunately, the deadline date is open. We have no idea when it will be done.

I have a rather busy summer and I still can't work full-time on StrokeDB (and, unfortunately, won't be able to for a substantial period), but rest assured, I am definitely not going to throw this stuff away. Neither is Oleg.

Stay tuned! :)

July 13, 2008

Is "save" obsolete?

Leaving my hangover aside, I want to talk about a recent "realization" I had. I am not sure it is necessarily a good idea, but it might be worth some investigation. So here we go.

In the ORMs I know (I do not pretend to know all of them, and I have extensive experience with only a few, mostly in Ruby), the way you handle your objects' persistence is basically a read-edit-save loop (RESL). There is nothing bad about this, especially given that typical relational databases and ORMs have no built-in versioning.

But as you may know, we at the idbns team are experimenting with some weird ideas and prototypes (like StrokeDB), and one of the things we definitely love about our approach to data management is built-in versioning.

At the moment, StrokeDB implements this RESL thing, too. It is quite a common approach, but it isn't any fun. It does work with versioning pretty well: it just increments the document version once you save. Nothing really tricky.

But there is one thing. A viewpoint I've developed over the past few years is that your persistence mechanism should not let you "separate" your objects from your programming environment. So why the hell should I remind my programming environment to persist an object's change every time I modify it? Wouldn't it be nice to persist data transparently?

Maybe. There is nothing new about this idea, but as far as I understand, it does not see much public use in the industry.

Let's see where this leads us. Given that we have built-in versioning, every change (like a change to a slot's value) will create a new version, so versions will change quite frequently and, more importantly, will be pretty much pointless. You will have a great history of every single change, but you won't be able to say "and here we did that" for any update that touches more than one slot.

Unless you describe it explicitly. What if we make a record for every "business operation", something like a document that says:

  • this operation was performed at Jul 13, 2008 04:17AM PST
  • this operation was "week expenses adjustment"
  • this operation was performed on document 61fe324e-3e6e-49e8-9427-6ebab7c31ff9
  • this operation starts at version 164c2a4b-294a-4253-97e7-124cc1e4a1cc and ends at version 9d888a2a-7f69-40ae-83c6-c55262d89d99
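
A rough sketch of how such a record could be produced, in plain Ruby. This is not StrokeDB's API: `VersionedDocument`, `OperationRecord` and the `operation` helper are all hypothetical names, and the "storage" is just an in-memory array. It only illustrates the idea that every slot write creates a version, while the operation record brackets the versions that belong together.

```ruby
require "securerandom"

# Hypothetical document that persists transparently: every slot write
# appends a new version, and there is no explicit "save" call.
class VersionedDocument
  attr_reader :uuid, :versions

  def initialize(slots = {})
    @uuid = SecureRandom.uuid
    @versions = [[SecureRandom.uuid, slots.dup.freeze]]
  end

  def current_version
    @versions.last[0]
  end

  def [](slot)
    @versions.last[1][slot]
  end

  def []=(slot, value)
    @versions << [SecureRandom.uuid, @versions.last[1].merge(slot => value).freeze]
  end
end

# Hypothetical record of one "business operation" spanning several versions.
OperationRecord = Struct.new(:name, :performed_at, :document, :starts_at, :ends_at)

# Runs the block and records which versions of the document it spans.
def operation(doc, name)
  starts_at = doc.current_version
  yield doc
  OperationRecord.new(name, Time.now.utc, doc.uuid, starts_at, doc.current_version)
end

expenses = VersionedDocument.new(food: 120, travel: 40)
record = operation(expenses, "week expenses adjustment") do |doc|
  doc[:food] = 135    # creates one version
  doc[:travel] = 55   # creates another
end
```

Here the two slot writes produce two separate versions, but `record` says that together they form one named operation, which is exactly the information a plain per-slot history lacks.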

It seems that having this kind of explicit record would also allow us to run some kind of smart and safe compaction on a database.

I am not sure about the whole idea, but it still sounds interesting to me. What do you think?

May 16, 2008

We Don't Need a "Database"

I’ve recently been trying to formulate what StrokeDB is. And here is my summary: StrokeDB is not a database; it is a programming environment on top of Ruby (until we have it ported to other languages). And here are my thoughts about the “database” concept.

Do we really need “databases”? Well, I mean, we surely need some toolset to be able to store and retrieve data, but who said we need it in a form of pure datasets to be stored and retrieved? Who said that there should be a database server to interact with? Who said we might need special domain languages designed to manipulate arbitrary data?

What we really need is a persistence-aware programming environment, don’t we? We just need to be able to store and retrieve data, no matter how its persistence is handled internally. There is nothing new about it, actually: MUMPS and GemStone/S (or even PL/SQL) have been around for decades. What we really might need is to be able to create languages specific to your application’s data domain without any hassle, since we need to manipulate the application’s data, not just any data (as you basically do with SQL).

It is quite popular in the Rails world to say that we need a stupid database, just a kind of storage, and let Ruby do the rest. Basically, they have a point. They use an RDBMS as a data storage layer, and their database is actually smart, because what is really important for data handling is implemented in Ruby. It is still usually limited by RDBMS design constraints, though.

My point is that your data should be as close to your main programming environment as possible. Your structures should be as native as possible, and they should be handled within the same environment. That is reminiscent of things like PL/SQL. Basically, PL/SQL is not THAT bad, but the thing with it is that you usually were using not ONLY PL/SQL but also, say, some Java code to interoperate with the Oracle database.

Your application itself IS a smart database. So, I’d say we’re at the beginning of a long way “back to the future”: persistence-aware programming environments, not just databases.

Viva smart databases!

P.S. I am going to blog about data organization concepts within such environments soon — that’s an interesting topic to talk about and it is surely more concrete than this one :)

P.P.S. This article initially got 17 comments, so as not to lose them after my major blog cleanup, you can still read them in Google's cached version.

April 26, 2008

Top 10 Reasons to Avoid Document Databases FUD

This article is written in response to “Top 10 Reasons to Avoid the SimpleDB Hype”.

First of all, I’d like to note that the answers below are not about SimpleDB as such; they are meant to counter FUD about document-based databases.

  • Data integrity is not guaranteed.

This could be the case with SimpleDB, but overall nothing prevents document databases from managing data integrity very well.

Regarding the constraints, there is nothing that prevents defining validations in a document or its related “meta” document (this is pretty much how StrokeDB works: you can define your validations within a meta document and they will keep your document validated).

More interesting are the concerns about conflicts. I’d say that this problem is hardly addressed in the common RDBMS approach. All you usually get is the most recent update from either user A or user B; there seems to be no easy way to do graceful conflict resolution. On the contrary, since the document database approach is rather novel, there is certainly enough room to adopt ways to deal with conflicts, for example with different, configurable algorithms, like merging them slot-by-slot three ways, or even some special programmer-defined algorithms. I can hardly imagine how to do this sort of stuff with a traditional RDBMS in a relatively easy manner.
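
To make the slot-by-slot three-way merge idea a bit more concrete, here is a minimal sketch in plain Ruby. It is not taken from StrokeDB or any other database: `merge_slots` is a hypothetical helper, documents are plain hashes, and a `nil` value is assumed to mean "slot absent". The optional block plays the role of a programmer-defined conflict algorithm.

```ruby
# Three-way, slot-by-slot merge of two edited copies (ours, theirs)
# against their common ancestor (base). Returns [merged, conflicts].
# Assumption: nil means "slot absent", so nil results are dropped.
def merge_slots(base, ours, theirs)
  merged = {}
  conflicts = []
  (base.keys | ours.keys | theirs.keys).each do |slot|
    b, o, t = base[slot], ours[slot], theirs[slot]
    value =
      if o == t then o                                  # both sides agree
      elsif o == b then t                               # only theirs changed
      elsif t == b then o                               # only ours changed
      elsif block_given? then yield(slot, b, o, t)      # custom resolver
      else
        conflicts << slot                               # unresolved: keep base
        b
      end
    merged[slot] = value unless value.nil?
  end
  [merged, conflicts]
end

base   = { title: "Draft", price: 10, stock: 5 }
user_a = { title: "Final", price: 10, stock: 3 }
user_b = { title: "Draft", price: 12, stock: 4 }

# title and price merge cleanly; stock was changed by both users,
# so it is reported as a conflict and keeps its base value.
merged, conflicts = merge_slots(base, user_a, user_b)

# The same merge with a programmer-defined rule: take the lower stock.
resolved, _ = merge_slots(base, user_a, user_b) { |_slot, _b, o, t| [o, t].min }
```

Neither user's update is simply thrown away here: independent changes are combined, and only the slot both users touched needs a decision.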

  • Inconsistency will provide a terrible user experience.

First of all, it should be noted that the described inconsistencies are also quite possible with distributed RDBMS setups: they too are subject to a certain lag before the data is propagated through the replicas.

The actual problem is not with lag — it is more about leaving documents in a consistent state.

This problem could be easily addressed in any kind of database, either relational or document-based.

  • Aggregate operations will require more coding.

Again, while this seems to be true for SimpleDB, other document-based databases address this problem pretty well with the Views approach (CouchDB, StrokeDB [Views is WIP]), so you can define any kind of aggregation, even ones that are simply not supported by an RDBMS.

  • Complicated reports, and ad hoc queries, will require a lot more coding.

I’d refer to the Views approach once again. It is quite a nice way to produce complicated reports as quickly as well-known RDBMS indexes do.

“Views” could be viewed as subroutines with a special well-defined API, and we can use these subroutines to index specific “queries” even at runtime. That’s pretty interesting.

  • Aggregate operations will be much slower if you don’t use an RDBMS.

This is a dubious statement. First, for the majority of the queries speed is defined by the speed of the index (all that B+ trees stuff). Document-oriented database views are indexed the very same way.

Speaking of those RDBMS “rows” and objects, I wouldn’t say they are much different. An object with key/value slots is definitely a “row” in that sense. So what’s so different about them?

On the other hand, a “real” relational database should actually use aggregating operations (joins) far more frequently than a typical document database. A relational database is basically about storing short “facts” with relations between them and using lots of join operations to aggregate synthetic data. That wouldn’t be efficient or easy enough to program, though, which is why most relational databases in the “real world” are organized in the form of fairly wide tables.

And, finally, well-designed document-oriented databases (DODBs) can offer a nice Map-Reduce API to build and incrementally update very complex aggregations.
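
As an illustration only, here is a toy in-memory view in plain Ruby. The `View` class and its methods are hypothetical and are not the CouchDB or StrokeDB API. A map function emits key/value pairs per document, a reduce function folds the values under a key, and updating one document touches only the entries that document emitted. A real implementation would keep the index in something like a B+ tree and cache partial reductions; this sketch simply re-reduces at query time.

```ruby
# Hypothetical in-memory view: map emits [key, value] pairs per document,
# reduce folds all values stored under one key.
class View
  def initialize(map:, reduce:)
    @map = map
    @reduce = reduce
    @emitted = {}                             # doc id => pairs it emitted
    @values = Hash.new { |h, k| h[k] = [] }   # key => emitted values
  end

  # Incremental update: only entries from this document are touched.
  def update(id, doc)
    remove(id)
    pairs = @map.call(doc)
    pairs.each { |key, value| @values[key] << value }
    @emitted[id] = pairs
  end

  def remove(id)
    (@emitted.delete(id) || []).each do |key, value|
      list = @values[key]
      list.delete_at(list.index(value))
      @values.delete(key) if list.empty?
    end
  end

  # Returns the reduced value for a key, or nil if nothing was emitted.
  def query(key)
    @values.key?(key) ? @reduce.call(@values[key]) : nil
  end
end

expenses_by_category = View.new(
  map:    ->(doc) { doc[:type] == "expense" ? [[doc[:category], doc[:amount]]] : [] },
  reduce: ->(values) { values.sum }
)

expenses_by_category.update("e1", { type: "expense", category: "food",   amount: 30 })
expenses_by_category.update("e2", { type: "expense", category: "food",   amount: 12 })
expenses_by_category.update("e3", { type: "expense", category: "travel", amount: 80 })
expenses_by_category.update("n1", { type: "note", text: "not indexed" })

# Editing one document re-indexes only that document's entries.
expenses_by_category.update("e2", { type: "expense", category: "food", amount: 20 })

food_total = expenses_by_category.query("food")
```

Because the map function is ordinary code, the aggregation can be anything the language can express, not only what a fixed query language supports.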

  • Data import, export, and backup will be slow and difficult.

“There are no such tools for key-value data stores, because these products are so new.”

Is lack of maturity a good reason to blame new technologies?

SimpleDB implementation in particular might have its own flaws in this area — but nothing prevents it from improving things in theory and practice.

  • SimpleDB isn’t that fast.

Since in this post I am talking about document databases in general, I’d skip those “internet latency” issues. They are kinda irrelevant.

  • Relational databases are scalable, even with massive data sets.

The main argument here is that “those guys do scale relational databases, so they are scalable”. True. They are scalable. But at what cost? “Those guys” were able to do a lot of great stuff with sheer manpower decades ago, before machinery took over. But is that a good excuse to manufacture goods without machinery these days, just because it is possible? I doubt it. Throwing manpower at a problem is not always the best approach.

And… you said “relational”? Facebook and others do a lot of denormalization, they don’t ever use JOIN, and they’d rather do several consecutive requests and build intermediate results on a webserver (when you have 20 times more webservers than DBs, it’s obviously good to move some load there). They treat good old MySQL as object storage with very fast B+ tree indexes. In the end, the resulting database is not a relational one. A thousand MySQL instances are just a distributed object storage with simple fast indexes and a bunch of hand-written code in php/ruby/python/whatever around it.

  • Super-scalability is overrated. Slowing the pace of your product development is even worse.

The super-scalability issue is not really overrated. The problem with the approach of “why not wait and address super-scalability once you’ve created a super product” is that once you do address super-scalability, it will be quite a different product.

The issue with scalability these days is that less scalable applications are quite different from the ones that are hugely scalable, and that is why writing a scalable application from scratch is definitely a waste of time and money.

But what if scaling from an SQLite-like backend to 2 datacenters were quite painless and did not require you to rethink database interactions in your application? With the right database API design it is quite possible. The BigTable, Amazon Dynamo, CouchDB and StrokeDB approaches are all about addressing this need.

  • SimpleDB is useful, but only in certain contexts.

The same can be said for relational databases. In the real world, data is not really well structured. It is rather versatile, and its representation depends on the point of view. This problem is very well addressed by document databases (and StrokeDB in particular was created in an attempt to solve it).

“Amazon SimpleDB, Apache CouchDB, and the Google Datastore API aren’t bad products. But we do them a disservice when we construe them to be replacements for general-purpose databases. Used carefully, they can help your organization. But used indiscriminately, you’ll create a lot more work for your programmers and you’ll make your application perform even worse”

Relational databases are not bad products either. Used carefully, they can help your organization. But used indiscriminately, you’ll create a lot more work for your programmers and you’ll make your application development even more complex.