More CouchDB

At the end of my last post I hadn’t managed to get all my test data uploaded to the CouchDB database. It turns out there was not one problem, but two.

The first was due to xmerl returning unexpected data when an ampersand was present in an attribute value. Say, for example, the attribute has a value of “A&amp;B”. I would expect Element#xmlAttribute.value to return the list [65,38,66]. What it actually returns is [65,[38],66]. That’s not just my expectation either – ejson, on which erlcouch is based, goes into a terminal sulk when asked to encode that. That’s surely a bug; it should at least fail rather than sit there indefinitely. I’m inclined to think the xmerl output is wrong in the first place, as I can’t find anything in the documentation that says it would return anything other than a flat list.

In any case, the simple solution to that problem was to call lists:flatten(Element#xmlAttribute.value).
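
To illustrate the flattening step on its own, here is a small sketch. The nested list is typed in by hand to mimic what xmerl handed back, and the module and function names are made up for the example.

```erlang
-module(flatten_demo).
-export([flatten_value/1]).

% Collapse a possibly nested attribute value into a flat list of
% character codes, so that [65,[38],66] becomes [65,38,66] ("A&B").
flatten_value(Value) ->
    lists:flatten(Value).
```

Calling flatten_demo:flatten_value([65,[38],66]) gives back [65,38,66], which is the flat list I expected from xmerl in the first place.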

The second problem was a bit more predictable – UTF-8 encoding. (Damn that Björk). I thought this must have been the same bug reported by Sam Ruby the other day, and I can see some UTF-8 related stuff in SVN since I last took a copy, but updating to the latest revision (184) didn’t solve my problem. To get it working I had to do this:

cp -r /usr/local/lib/erlang/lib/xmerl-1.1.4/ $HOME/couchdb/lib

For some reason, CouchDB couldn’t find the xmerl_ucs library until I did. Presumably something wrong with my setup, although it works fine from the Erlang shell. (Update: see comments for a better fix.)

After that, it transpired that I had to UTF-8 encode the data myself BEFORE handing it to erlcouch, so with all the above in mind (and a bit of de-windification of the filename field) the new import code looks like this:

-module(djimportc).
-export([start/0]).
 
-include_lib("xmerl/include/xmerl.hrl").
 
start() ->
 
  % Connect to CouchDB and create database...
  Host = couch:get_host([{host, "localhost"}, {port, 8888}]),
  DB = couch:create_db(Host, "dj"),
 
  % Parse the document and save each record to CouchDB...
  { Xml, _Rest } = xmerl_scan:file("../media/LwDJState.xml"),
  Records = xmerl_xpath:string("/lwdj/record",Xml),
  process_record(Records,DB).
 
% Process a list of records...
process_record([],_DB) -> ok;
process_record([Head|Rest],DB) ->
  process_record_attrs(Head#xmlElement.attributes,[],DB),
  process_record(Rest,DB).
 
% Process attributes for a record...
process_record_attrs([],New_rec,DB) ->
  io:format("Writing document: ~p~n",[New_rec]),
  {{_,_},_}=couch:save_doc(DB, New_rec);
 
process_record_attrs([Head|Rest],New_rec,DB) ->
  Fieldname=Head#xmlAttribute.name,
  V=lists:flatten(Head#xmlAttribute.value),
  if
    Fieldname==filename ->
      {_,Value2,_}=regexp:gsub(V,"\\\\","/"),
      {_,Value,_}=regexp:gsub(Value2,"M:","");
    true ->
      Value=V
  end,
  process_record_attrs(Rest,New_rec++[{Fieldname,xmerl_ucs:to_utf8(Value)}],DB).
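
The regexp module used above is not available in newer Erlang/OTP releases, where the re module takes its place. A minimal sketch of the same filename clean-up with re, assuming the same two substitutions are wanted; the module name, function name and sample path are invented for illustration:

```erlang
-module(filename_demo).
-export([unwindify/1]).

% Turn a Windows-style path into a Unix-style one: backslashes become
% forward slashes and every "M:" drive prefix is dropped.
unwindify(Path) ->
    % The Erlang string "\\\\" is the regular expression \\, which
    % matches one literal backslash.
    Slashed = re:replace(Path, "\\\\", "/", [global, {return, list}]),
    re:replace(Slashed, "M:", "", [global, {return, list}]).
```

With this, filename_demo:unwindify("M:\\music\\bjork.mp3") returns "/music/bjork.mp3". The {return, list} option matters here, because re:replace otherwise hands back iodata rather than a plain string.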

With that done, I have some data to play with, and I must say I was surprised (pleasantly) to see it all come back to me with the non-ASCII stuff intact when I tested it from the JavaScript console.
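
The encoding step is easy to try in isolation. This is only a sketch: the module and function names are hypothetical, and the input is written as a list of code points to sidestep any question about the encoding of the source file itself.

```erlang
-module(utf8_demo).
-export([encode/1]).

% Convert a list of Unicode code points into a list of UTF-8 bytes.
% Code points below 128 pass through unchanged; 246 (the "o" with an
% umlaut in "Bjork") becomes the two bytes 195,182.
encode(CodePoints) ->
    xmerl_ucs:to_utf8(CodePoints).
```

For example, utf8_demo:encode([66,106,246,114,107]) returns [66,106,195,182,114,107], which is the byte sequence that ended up being handed to erlcouch.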

  1. CiaranG

    Update: The problem with CouchDB not finding xmerl_ucs turned out to be an issue with CouchDB itself. It was swiftly fixed – use revision 186 from SVN to avoid the problem.
