Tagging – various approaches in CouchDB

I finished my last post by raising the issue that there are many ways to approach structuring data in CouchDB. The same applies to the standard RDBMS of course, but in that case the wrongs and rights are clearer even if that’s only because we’ve seen it all before.

On a related note, I get the impression a lot of people are jumping into CouchDB because they don’t like or can’t get their head around the RDBMS. That’s not a great surprise, since in many cases (mostly web apps) the RDBMS is both misused and applied to a job it’s not best suited for in the first place. My advice is this – if you “hate SQL” or “can’t get your head around it”, go back and figure it out, because there’s no doubt that the RDBMS is one of the finest and most well developed tools we have in software today. If you don’t see that, you probably aren’t best placed to decide what kind of database is appropriate to your application, and worse still I predict you’ll end up having exactly the same kind of relationship with CouchDB.

Anyway, I set out to give some concrete examples of how you might approach tagging in CouchDB. In my last post I created a simple MySQL database, where the tagging structure boils down to this:

[Figure: RDBMS tagging structure]

We have a table for our items, a table for our tags, and each tag applied to an item results in an entry to the link table. For an extreme application (either very small or very big) something else might be appropriate, but for the most part I would argue this is “the right way” to do it. The key point about this layout is that given a document, we can efficiently find all the tags applied to it, and vice versa.
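
As a rough illustration of why the link table works in both directions, the three tables can be modelled as plain JavaScript arrays. The table contents, field names and helper functions here are all hypothetical, and a real RDBMS would answer both questions with an indexed join rather than a scan.

```javascript
// Hypothetical in-memory stand-ins for the three tables.
const items = [
  { id: 1, title: "First post" },
  { id: 2, title: "Second post" },
];
const tags = [
  { id: 10, name: "couchdb" },
  { id: 11, name: "sql" },
];
// One row per (item, tag) pairing, as in the link table.
const itemTags = [
  { itemId: 1, tagId: 10 },
  { itemId: 1, tagId: 11 },
  { itemId: 2, tagId: 10 },
];

// Item -> names of the tags applied to it.
function tagsForItem(itemId) {
  return itemTags
    .filter((link) => link.itemId === itemId)
    .map((link) => tags.find((t) => t.id === link.tagId).name);
}

// Tag name -> titles of the items carrying it.
function itemsForTag(name) {
  const tag = tags.find((t) => t.name === name);
  if (!tag) return [];
  return itemTags
    .filter((link) => link.tagId === tag.id)
    .map((link) => items.find((i) => i.id === link.itemId).title);
}

console.log(tagsForItem(1));
console.log(itemsForTag("couchdb"));
```

Both lookups go through the same link rows, which is what makes the layout symmetrical.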

So what are the options for representing this same data in CouchDB? Starting at the very simplest level, we can just define our document like this:

[Figure: A simple tagged document]

The tags field is just a list of textual tags. This would be a heinous crime in an RDBMS, but in CouchDB we can query this very efficiently by defining a view to do it. An obvious drawback though is that we can’t have any metadata about our tags. More importantly, we have to define our views to search for a specific tag. If we know what tags we’ll be using at design time, this could be a very good approach, but if they’re arbitrary and numerous, it’s no good at all. The limiting factor is that views can’t be parameterised in any way (although see below), so if you’re going to search for tag ‘x’ you’ll have to define a view for it. It’s worth noting that temporary views are cached in the same way as permanent ones, so using that approach would be efficient except for the first time you queried a particular tag, but still it’s not good for most situations.
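
The original illustration is not reproduced here, so the following is only a sketch of the idea. The document fields other than `tags`, the tag value and the `runView` harness are all assumptions made for illustration; `runView` is a stand-in that mimics the return-based view style used in this post, and is not part of CouchDB.

```javascript
// Hypothetical documents; only the `tags` field comes from the text.
const docs = [
  { _id: "item-1", title: "Tagging in CouchDB", tags: ["couchdb", "tagging"] },
  { _id: "item-2", title: "Link tables", tags: ["sql", "tagging"] },
  { _id: "item-3", title: "Untagged note", tags: [] },
];

// A view hard-coded to one tag: the tag cannot be passed in,
// so each tag of interest needs its own function like this.
function taggedCouchdb(doc) {
  if (doc.tags && doc.tags.indexOf("couchdb") !== -1)
    return doc;
}

// Stand-in for the database running a view: keep whatever the
// function returns, skip documents where it returns nothing.
function runView(allDocs, viewFn) {
  return allDocs
    .map((doc) => viewFn(doc))
    .filter((result) => result !== undefined);
}

console.log(runView(docs, taggedCouchdb).map((d) => d._id));
```

Searching for a different tag means writing another function with a different literal in it, which is the limitation described above.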

Here’s another effort:

[Figure: A second effort at couch tags]

Now we can have metadata for our tags, and if we have the relevant tag document we have a list of items tagged with that tag. The simplest method would be to get all the tag documents back, which we can do easily by defining a view like this:

// Return every tag document in full; other documents yield no row.
function(doc) {
  if (doc.type == "tag")
    return doc;
}

This might be OK if we don’t have a lot of tags and associated metadata, but if we do, we might want just that specific tag document. Although I said above that you can’t have parameterised views, you can in fact achieve pretty much the same thing as follows:

// Key each tag document by its name instead of its ID.
function(doc) {
  if (doc.type == "tag")
    return { key: doc.name };
}

The above view uses the name as the key, rather than the usual ID, and that allows us to use the startkey and endkey query parameters when retrieving the view. Setting them both to the desired tag value will return either one row (the tag we want) or zero if it doesn’t exist. The row only contains the document ID, revision and key though, so a second lookup is then necessary to get the actual document.
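
To make the two-step lookup concrete, here is a small simulation. The documents, their fields other than `type` and `name`, and the `queryView` helper are hypothetical; `queryView` only imitates the behaviour described above (rows filtered by an inclusive startkey/endkey range, each row carrying ID, revision and key) and is not how CouchDB implements it.

```javascript
// Hypothetical documents in one database, indexed by ID.
const db = {
  "tag-1": { _id: "tag-1", _rev: "1", type: "tag", name: "couchdb", items: ["item-1", "item-2"] },
  "tag-2": { _id: "tag-2", _rev: "1", type: "tag", name: "sql", items: ["item-2"] },
  "item-1": { _id: "item-1", _rev: "1", type: "item", title: "Tagging in CouchDB" },
};

// The view from the post: key each tag document by its name.
function tagsByName(doc) {
  if (doc.type == "tag")
    return { key: doc.name };
}

// Imitation of a view query with startkey and endkey.
function queryView(database, viewFn, startkey, endkey) {
  const rows = [];
  for (const doc of Object.values(database)) {
    const result = viewFn(doc);
    if (result !== undefined)
      rows.push({ id: doc._id, rev: doc._rev, key: result.key });
  }
  return rows
    .filter((row) => row.key >= startkey && row.key <= endkey)
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// First trip: find the row for the tag we want.
const rows = queryView(db, tagsByName, "sql", "sql");
// Second trip: fetch the full document by its ID.
const tagDoc = rows.length ? db[rows[0].id] : null;
console.log(rows, tagDoc);
```

The tag name is supplied at query time rather than baked into the view, which is why this behaves much like a parameterised view.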

Needless to say, the above layout doesn’t help much when we want to go the other way around and, given an item document, find which tags are attached to it. In fact, if you take this to its logical conclusion you end up with the same structure as the original RDBMS table layout I started with, and you will be making a lot of extra effort on the client side that the RDBMS world dealt with in a simple SQL query.
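
A sketch of that extra client-side effort, assuming tag documents carry an `items` array of item IDs (the field name, the helper and the sample data are hypothetical): with only tag documents to go on, finding an item's tags means fetching every tag document and checking each list.

```javascript
// Hypothetical tag documents, as returned by a view of all tags.
const tagDocs = [
  { type: "tag", name: "couchdb", items: ["item-1", "item-2"] },
  { type: "tag", name: "sql", items: ["item-2"] },
  { type: "tag", name: "erlang", items: [] },
];

// Client-side scan: which tags list this item?
function tagNamesForItem(allTagDocs, itemId) {
  return allTagDocs
    .filter((tag) => tag.items.indexOf(itemId) !== -1)
    .map((tag) => tag.name);
}

console.log(tagNamesForItem(tagDocs, "item-2"));
```

The work here grows with the number of tag documents, whereas the link table answers the same question with one join.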

This is far from being a criticism of CouchDB. Firstly, I’ve taken an inherently relational problem which is obviously not what CouchDB is going to excel at. Secondly and more importantly, the increased effort necessary is the cost of the other benefits of CouchDB. I think I’d still find it worthwhile in this scenario in exchange for easy scaling, redundancy and fault tolerance.

An option not considered here, though workable in both the RDBMS and CouchDB scenarios, is using full-text search to implement tagging. A further possibility is that I’m missing something about CouchDB, which is not impossible since I’m new to it. Time will tell on that one, but so far I am liking what I see.

  1. Sam

    > there’s no doubt that the RDBMS is one of the finest and most well developed tools we have in software today

    Unless you are Second Life. Or Bloglines. Or Flickr. Or NASA. Or Craigslist.

    http://www.oreillynet.com/databases/blog/2006/04/database_war_stories.html

  2. Jan

    Excellent article, thanks!

  3. CiaranG

    Sam, I daresay that if you are any of those examples you still agree with my statement but realise that it may not be the right tool for that particular job.

  4. Marcus Breese

    One of the things I’ve wondered about CouchDB documents is whether or not it would be a good idea to allow for non-versioned attributes. If you don’t really care when a tag was applied to a document, it would be nice to allow for document annotations. This would be non-versioned data that could be easily queried… such as tags (or parent document IDs, or pre-calculated values).

    I also think that the view functions may be a little weak. I’d like to see arguments in views and the ability to add a sort function. For arguments with views, the easiest way around it is to just use ad hoc views; however, this isn’t an optimal solution… it’s too hard to index. (Of course the problem is how do you index when you don’t know what the passed arguments will be.)

    For example, if you have a blog and are using CouchDB, you’d have each entry stored as a CouchDB document. Each would have a date, a title, some content, and tags. If you used non-versioned annotations, you could add/remove tags quickly w/o making multiple revisions. If you had parameterized views, you could quickly search for articles by year or month. If you had a sort function, you could retrieve the last 5 added entries for your front page.

    As it is, if you want the documents for a year or month, you have to use an adhoc query. If you want to add/remove tags, you need a new document revision. If you want to display the last 5 added entries, you’ll have to retrieve them all, sort the list, and then get the top 5.

    (Re: sorting, you might be able to do this if you can use the key:attribute trick you mentioned above… but you’d still have to make 2 trips)

  5. CiaranG

    Marcus: Sorting is definitely a job for the key: syntax, and works well. I don’t think it *should* be necessary to make two trips – rather, the other fields you’re interested in would be returned in the same view. I haven’t managed to make that work yet though, either due to ineptitude or a bug. Bear in mind CouchDB is still very much at alpha stage.

    The key: syntax also covers the ‘last 5 added entries’ issue, though currently keys are dealt with as strings only. The plan is to allow keys to be any JSON object though.

    Documents for a year or month – again, use key: in conjunction with startkey and endkey.

  6. Sho

    Sigh. Thanks for the article – but I still just can’t get my head around CouchDB. I feel like I’m missing something, something important, because the software as currently extant seems completely useless to me. I love its design goals, but just can’t understand how it’s supposed to work.

    I don’t understand how relations are supposed to work. I don’t understand how I am able to search for a record without hardcoding the search into a view.

    I don’t get it. I’m willing to make some big sacrifices to get the kind of replication and RESTfulness that couch promises. I’ve worked on a similar project for 6 months now, and was delighted to find couchdb a few weeks ago .. but .. how is this possibly usable? Am I missing some big secret?

    Startkey and endkey, unless I’m reading it wrong, work by iterating over every record until they reach the key. Great for 100 records, rather less great for 100,000.

    Relationships, by which I mean the ability to associate a record with some property and then view keyed on that property, are not some optional extra – they are the essential core of the whole concept of a database. What am I missing?

  7. Randy

    I was under the impression that you could pass parameters to your views.

    The documentation / examples that I have seen seemed to suggest that.
