StrokeDB persistable incremental views

Posted by yrashk

This weekend StrokeDB got so-called “persistable incremental views”. What are they?

Well, let’s start with the View concept. A View is basically a map-reduce filter whose map and reduce functions are defined in Ruby.

By default, it maps all documents and lets you reduce them (let’s say we want to find users with age > 21):

 
   my_view = View.create!(:name => "my view").reduce_with {|doc| doc.is_a?(User) && doc.age > 21 }
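
If the map/reduce wording feels abstract, here is a tiny plain-Ruby sketch of the same idea that does not touch StrokeDB at all. ToyView, ToyUser and the in-memory docs array are names I made up for illustration; the real View works against the database, not an array.

```ruby
# Toy model of a view: map every document, then keep those the reduce block accepts.
# ToyView and ToyUser are illustrative stand-ins, not StrokeDB classes.
ToyUser = Struct.new(:name, :age)

class ToyView
  def initialize
    @map = ->(doc) { doc }        # default map: pass each document through
    @reduce = ->(_doc) { true }   # default reduce: keep everything
  end

  def map_with(&block)
    @map = block
    self                          # return self so calls can be chained
  end

  def reduce_with(&block)
    @reduce = block
    self
  end

  def emit(docs)
    docs.map { |doc| @map.call(doc) }.select { |doc| @reduce.call(doc) }
  end
end

docs = [ToyUser.new("ann", 30), ToyUser.new("bob", 18), "not a user"]
adults = ToyView.new.reduce_with { |doc| doc.is_a?(ToyUser) && doc.age > 21 }.emit(docs)
puts adults.map(&:name).inspect
```

Note that "reduce" here simply filters the mapped documents, which matches the way the block is used in the example above.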
 

Or you can specify your own map block (if you need to create a new set of documents to be reduced):

 
   my_view = View.create!(:name => "my view").map_with do |doc|
     Document.create!(:doc => doc)
   end.reduce_with {|doc| doc.doc.is_a?(User) && doc.doc.age > 21 }
 

To get results, simply use

 
  my_view.emit.to_a # or my_view.emit.documents, that's the same
 

Okay, that’s simple. We map documents to documents and then reduce them using some criteria. I would also like to mention that Views can be argument-polymorphic: if you define your map and reduce blocks with more than one argument, you can pass parameters when emitting results:

 
   my_view = View.create!(:name => "my view").reduce_with {|doc,age| doc.is_a?(User) && doc.age > age }
   my_view.emit(21).to_a
 

I think that’s simple and nice :)
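
For the curious, passing emit arguments through to a block is easy to picture in plain Ruby. This is only a sketch under my own assumptions: ParamView and Person are hypothetical names, and the toy emit takes the documents explicitly because there is no database behind it.

```ruby
# Toy illustration of forwarding emit arguments to the reduce block.
# ParamView is a hypothetical name; it is not StrokeDB's View class.
Person = Struct.new(:name, :age)

class ParamView
  def reduce_with(&block)
    @reduce = block
    self
  end

  # Extra arguments to emit are handed to the block after the document.
  def emit(docs, *args)
    docs.select { |doc| @reduce.call(doc, *args) }
  end
end

people = [Person.new("ann", 30), Person.new("bob", 18), Person.new("cy", 45)]
older_than = ParamView.new.reduce_with { |doc, age| doc.age > age }

over_21 = older_than.emit(people, 21).map(&:name)
over_40 = older_than.emit(people, 40).map(&:name)
puts over_21.inspect
puts over_40.inspect
```

The same view object answers both questions; only the emit argument changes.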

Now incremental views come in. When you call my_view.emit, the View emits its first “view cut”, which is the set of documents map/reduced over the whole database. You can then use this view cut to get updates to the view:
   
     first_cut = my_view.emit 
     # ... work with database, add some new documents, update old documents
     next_cut = first_cut.emit
   

The next_cut view cut will contain only newly created or updated documents, so you get updates incrementally.
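
One way to picture a view cut, as a toy sketch and not a description of StrokeDB internals: give every write a sequence number, let each cut remember where it stopped, and have the next cut look only at later writes. ToyStore, ToyCut and the per-write counter are all assumptions made for this example.

```ruby
# Toy model of incremental view cuts. ToyStore and ToyCut are hypothetical;
# the per-write counter is an assumption made for this sketch.
class ToyStore
  attr_reader :clock

  def initialize
    @docs = {}    # id => [document, sequence number of its last write]
    @clock = 0
  end

  def save(id, doc)
    @clock += 1
    @docs[id] = [doc, @clock]
  end

  # Documents written after the given sequence number.
  def changed_since(seq)
    @docs.values.select { |_doc, written| written > seq }.map(&:first)
  end
end

class ToyCut
  attr_reader :documents

  def initialize(store, filter, since)
    @store = store
    @filter = filter
    @upto = store.clock                      # remember where this cut ends
    @documents = store.changed_since(since).select(&filter)
  end

  # The next cut covers only writes made after this cut was taken.
  def emit
    ToyCut.new(@store, @filter, @upto)
  end
end

store = ToyStore.new
store.save(1, { name: "ann", age: 30 })
store.save(2, { name: "bob", age: 18 })

adults = ->(doc) { doc[:age] > 21 }
first_cut = ToyCut.new(store, adults, 0)

store.save(3, { name: "cy", age: 45 })   # new document
store.save(2, { name: "bob", age: 22 })  # updated document
next_cut = first_cut.emit

puts first_cut.documents.map { |d| d[:name] }.inspect
puts next_cut.documents.map { |d| d[:name] }.inspect
```

The first cut sees the whole store; the second sees only the document that was added and the one that was updated after the first cut was taken.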

Now, what about the persistence promised above? That’s really simple: View and ViewCut are documents themselves, so you can easily save them and reuse them later!

P.S. Currently Views are pretty slow, but hopefully that will change.

P.P.S. Incremental views are really, really young in StrokeDB, so I can’t promise that they are bug-free. Also, the API isn’t stable by any means (yet!).

Get StrokeDB

StrokeDB goes public

Posted by yrashk

For the past two weeks Oleg Andreev and I spent most of our time working on something we enjoyed a lot: the StrokeDB project.

What’s it?

StrokeDB is a lightweight approach to a document-oriented database, currently implemented in Ruby. The concept is pretty simple:

  • each document is uniquely identified by UUID
  • each document has a set of slots, which are basically key/value pairs, where the key is a string and the value is a simple data structure (boolean, number, string, array, hash, as in JSON)
  • each time you update a document, its version is updated. The version is basically a hash of the document’s content.
  • a reference to the previous version is automatically maintained by StrokeDB
  • each document may reference one or more “meta documents”, which are documents that declaratively describe the essence of a particular document
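
To make the list above a bit more tangible, here is a rough plain-Ruby sketch of such a document. It is not StrokeDB code: ToyDocument is a made-up class, and hashing the JSON form of the slots with SHA-256 is purely my assumption for the example, since nothing above says which hash function is used.

```ruby
require "securerandom"
require "digest"
require "json"

# Toy document: a UUID, a hash of slots, a content-derived version, and a
# link to the previous version. ToyDocument is hypothetical, and SHA-256
# over JSON is an assumption made for this sketch.
class ToyDocument
  attr_reader :uuid, :slots, :version, :previous_version

  def initialize(slots = {})
    @uuid = SecureRandom.uuid
    @slots = {}
    @version = nil
    @previous_version = nil
    update(slots)
  end

  def update(changes)
    @previous_version = @version
    @slots = @slots.merge(changes.transform_keys(&:to_s))   # slot keys are strings
    # Sort the slots by key so equal content always hashes to the same version.
    @version = Digest::SHA256.hexdigest(JSON.generate(@slots.sort.to_h))
    self
  end
end

apple = ToyDocument.new("color" => "green", "price" => 3)
v1 = apple.version
apple.update("price" => 4)

puts apple.uuid
puts apple.previous_version == v1
puts apple.version != v1
```

Because the version is derived from the content alone, two toy documents with the same slots get the same version even though their UUIDs differ.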

One of the motivations for StrokeDB was my desire to decentralize some databases. Currently databases are pretty much centralized: with the SaaS you use, you basically host your data in some company’s data center. I believe that in some cases this is not the proper way to manage your data. Because of centralization you put your data security at risk, you need their database software to be really fast (because there are a lot of clients working with their data), etc. What I really want is to have my data right where I am working with it (i.e. on my laptop), and to be able to share it with other parties in a secure way, back it up, etc.

So, yes, I just want to return some data to the client’s computer.

That’s how I came to StrokeDB, which was greatly inspired by Git and my previous experiments in metaframe databases.

Why another document database?

Why not CouchDB/ThruDB/SimpleDB? Well, we had a number of reasons to launch our own project:

  • We want it to be really lightweight, and basically, embeddable. That’s how it is implemented now — it is just a Ruby library.
  • We want to work around the natural limitations of the mentioned DBs. CouchDB does not support injecting code into the database core, indexes in particular (as PostgreSQL does). SimpleDB is hosted elsewhere, supports only very primitive queries, and is not extensible. ThruDB supports only a keyword-based search index (no special indexes). Also, partitioning and distribution are done via SimpleDB.
  • We want to build a system on top of the concept of asynchronous operation. We do not rely on locking or on synchronous conflict resolution (aka optimistic locking). A well-designed asynchronous workflow leads to several useful features: unlimited data distribution, offline work, replication-based load balancing, data consistency, availability and fast access all together.

Metadocuments?

Here is a simple example of metadocument usage. Imagine you have a document that represents some concrete apple:


some_apple: 
        weight: 3oz 
        color: green 
        price: $3 

It could have three metadocuments that “describe it”: Apple, Fruit and Product:


some_apple: 
        __meta__: [Apple, Fruit, Product] 
        weight: 3oz 
        color: green 
        price: $3 

When this document is loaded, the Ruby object will be extended by three modules (Apple, Fruit and Product).

For example, suppose you have them defined as


Apple = Meta.new
Fruit = Meta.new do 
        def green? 
                color == 'green' 
        end 
end 
Product = Meta.new do 
        def sell! 
                # ... 
        end 
end 

So when you load that some_apple document (by finding it with slot-based search, or by its UUID), you will have an object that also responds to #green? and #sell! methods.

It will also respond positively to #is_a?(Apple), #is_a?(Fruit) and #is_a?(Product).
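
The underlying Ruby mechanism is plain Object#extend, and it can be tried without StrokeDB. In this sketch ToyApple, ToyFruit, ToyProduct, ToyDoc and load_doc are all hypothetical stand-ins; only the extend and is_a? behaviour is real Ruby.

```ruby
# Toy illustration of extending a loaded document with modules.
# All names here are hypothetical stand-ins for StrokeDB's Meta machinery.
ToyApple = Module.new
ToyFruit = Module.new do
  def green?
    color == "green"
  end
end
ToyProduct = Module.new do
  def sell!
    "sold for #{price}"
  end
end

ToyDoc = Struct.new(:color, :price, :weight)

# Pretend this is what happens when a document is loaded: each module listed
# as a meta is mixed into that one object only, not into its class.
def load_doc(doc, metas)
  metas.each { |meta| doc.extend(meta) }
  doc
end

some_apple = load_doc(ToyDoc.new("green", "$3", "3oz"), [ToyApple, ToyFruit, ToyProduct])
plain_doc = ToyDoc.new("red", "$1", "2oz")

puts some_apple.green?
puts some_apple.is_a?(ToyFruit)
puts plain_doc.respond_to?(:green?)
```

Since the modules are mixed into a single object, another document of the same class stays untouched unless it lists the same metas.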

Some examples?

Here you go:


config = StrokeDB::Config.new(true)

config.add_storage :mem, :memory_chunk
config.add_storage :fs, :file_chunk, 'test/storages/test'

config.chain :mem, :fs
config[:mem].authoritative_source = config[:fs]

config.add_storage :index_storage, :inverted_list_file, 'test/storages/index'
config.add_index :default, :inverted_list, :index_storage

config.add_store :default, :skiplist, :mem, :cut_level => 4

User = StrokeDB::Meta.new
unless u = config.indexes[:default].find(:__meta__ => User.document, :email => "someemail@gmail.com").first
  puts "User not found, creating new user" 
  u = User.new :email => "someemail@gmail.com" 
  u.save!
else
  puts "We've found him!" 
end
puts u

config[:mem].sync_chained_storages!
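
The find call above looks documents up by slot values through an inverted list index. Here is a very small sketch of how an inverted list can answer such a query. I am not claiming StrokeDB's :inverted_list index is built this way; ToyIndex and its methods are invented for the example.

```ruby
require "set"

# Toy inverted list: each slot/value pair points at the ids of the
# documents containing it. ToyIndex is hypothetical, not StrokeDB's index.
class ToyIndex
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = Set.new }
  end

  def insert(id, slots)
    slots.each { |slot, value| @lists[[slot, value]] << id }
  end

  # Ids of documents that match every slot/value pair in the query.
  def find(query)
    query.map { |slot, value| @lists.fetch([slot, value], Set.new) }
         .reduce { |found, ids| found & ids }
         .to_a
  end
end

index = ToyIndex.new
index.insert(1, meta: "User", email: "someemail@gmail.com")
index.insert(2, meta: "User", email: "other@example.com")
index.insert(3, meta: "Order", email: "someemail@gmail.com")

found = index.find(meta: "User", email: "someemail@gmail.com")
missing = index.find(meta: "User", email: "nobody@example.com")
puts found.inspect
puts missing.inspect
```

Each condition in the query selects one list of ids, and the result is the intersection of those lists.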

What do we still miss?

A lot:

  • Transactions (though we have some building blocks ready to build them)
  • Replication (but again, we have building blocks for streaming replication already)
  • Efficient indexes
  • Nice API (time cures this disease!)

But hey, it was only two weeks of hacking — so stuff is definitely coming.

Questions? Ideas?

Join our mailing list