An Open Access Peon: CMIS vs Google Documents API vs SWORD

Atom (the Atom Syndication Format) is a very simple mechanism for publishing news feeds - that is, date-ordered small bits of information. An Atom feed is a collection of Atom entries. Each entry contains some basic metadata (title, id, updated) and may have links to other resources. Links of particular interest are 'edit' and 'edit-media', which refer, respectively, to the entry's metadata and to its media file.

    <?xml version="1.0"?>
    <entry xmlns="http://www.w3.org/2005/Atom">
      <title>The Beach</title>
      <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
      <updated>2005-10-07T17:17:08Z</updated>
      <author><name>Daffy</name></author>
      <summary type="text" />
      <content type="image/png"
         src="http://media.example.org/the_beach.png"/>
      <link rel="edit-media"
         href="http://media.example.org/edit/the_beach.png" />
      <link rel="edit"
         href="http://example.org/media/edit/the_beach.atom" />
    </entry>
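As a rough illustration of how a client might find those two links, here is a minimal Python sketch using only the standard library. The entry is a trimmed copy of the one above; the helper name `links_by_rel` is my own invention, not part of any Atom library.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Trimmed copy of the example entry above.
ENTRY_XML = """<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>The Beach</title>
  <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
  <link rel="edit-media" href="http://media.example.org/edit/the_beach.png" />
  <link rel="edit" href="http://example.org/media/edit/the_beach.atom" />
</entry>"""


def links_by_rel(entry_xml):
    """Map each link's rel value to its href for a single Atom entry."""
    entry = ET.fromstring(entry_xml)
    links = {}
    for link in entry.findall(f"{{{ATOM_NS}}}link"):
        # RFC 4287: a link with no rel attribute is treated as rel="alternate".
        links[link.get("rel", "alternate")] = link.get("href")
    return links


links = links_by_rel(ENTRY_XML)
print("metadata lives at:", links["edit"])
print("media file lives at:", links["edit-media"])
```

A real entry may carry several links with the same rel, in which case a dictionary of lists would be the safer structure.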

The Atom Publishing Protocol (or AtomPub) adds a way to create new entries (i.e. to publish to feeds). AtomPub uses the HTTP POST, PUT and DELETE methods to, respectively, create, update or delete entries.

To create an entry the client POSTs an Atom entry to the feed's URL. To update an entry the client PUTs an Atom entry to the entry's URL (replacing anything already there). And lastly, DELETEing an Atom entry URL destroys that entry. The protocol specification (RFC 5023) is quite readable so I suggest going there if you're lost!
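The three operations can be sketched in Python with the standard library alone. This only builds the requests, it does not send them; the feed URL is a made-up example.org address and the function names are hypothetical.

```python
import urllib.request

# Media type for a single Atom entry document.
ATOM_ENTRY_TYPE = "application/atom+xml;type=entry"


def create_entry(feed_url, entry_xml):
    """POST a new entry to the feed's URL."""
    return urllib.request.Request(
        feed_url,
        data=entry_xml.encode("utf-8"),
        headers={"Content-Type": ATOM_ENTRY_TYPE},
        method="POST",
    )


def update_entry(entry_url, entry_xml):
    """PUT a replacement entry to the entry's own (edit) URL."""
    return urllib.request.Request(
        entry_url,
        data=entry_xml.encode("utf-8"),
        headers={"Content-Type": ATOM_ENTRY_TYPE},
        method="PUT",
    )


def delete_entry(entry_url):
    """DELETE the entry at its own (edit) URL."""
    return urllib.request.Request(entry_url, method="DELETE")


entry = '<entry xmlns="http://www.w3.org/2005/Atom"><title>The Beach</title></entry>'
feed = "http://example.org/media/"  # assumed feed URL
edit = "http://example.org/media/edit/the_beach.atom"

for req in (create_entry(feed, entry), update_entry(edit, entry), delete_entry(edit)):
    print(req.get_method(), req.full_url)
```

Passing any of these to `urllib.request.urlopen` would perform the call; a successful create normally answers with 201 Created and a Location header giving the new entry's URL.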

AtomPub is sufficient if you just want to post small entries in XML but often the client wants to e.g. publish a photo, which the new Atom entry will then refer to. There are several approaches to this but the simplest is to use the Atom Multipart Media Resource Creation mechanism, which bundles together the Atom entry and the media file into a single POST.
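To give a feel for what such a bundled POST looks like on the wire, here is a minimal sketch that assembles a multipart/related body by hand. It assumes the Atom entry goes first and the media file second, with the outer Content-Type carrying `type="application/atom+xml"`; that is my reading of the multipart draft and should be checked against it. The function name is hypothetical.

```python
import uuid


def build_multipart_related(entry_xml, media, media_type):
    """Return (content_type, body) for a two-part multipart/related POST.

    Part 1 is the Atom entry, part 2 is the raw media file.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = [
        delimiter,
        b"Content-Type: application/atom+xml;type=entry",
        b"",
        entry_xml.encode("utf-8"),
        delimiter,
        b"Content-Type: " + media_type.encode("ascii"),
        b"",
        media,
        delimiter + b"--",  # closing delimiter
        b"",
    ]
    body = b"\r\n".join(lines)
    content_type = (
        'multipart/related; type="application/atom+xml"; '
        f'boundary="{boundary}"'
    )
    return content_type, body


entry = '<entry xmlns="http://www.w3.org/2005/Atom"><title>The Beach</title></entry>'
content_type, body = build_multipart_related(entry, b"fake image bytes", "image/png")
print(content_type)
print(len(body), "bytes")
```

The returned pair is what a client would send as the Content-Type header and the request body of a single POST to the feed's URL.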

Atom/AtomPub provides us with a fairly simple tool to publish items onto a Web site. As Institutional Repository (IR) developers we, unfortunately, require a more complex model than just a feed of entries containing one file each. We have more complex metadata and multiple files making up an object. An editorial workflow means items uploaded by users must first be checked by editors before they can be published. There are various other aspects to consider that I won't go into here.

So we like the simplicity of Atom/AtomPub but it doesn't fulfil all of our requirements. Fortunately it is easy to extend AtomPub by injecting additional links and metadata into entries. These links can connect to other URLs that allow complex manipulations to be made on the underlying data structure (hence also to create more complex data structures). OASIS CMIS, SWORD and the Google Documents API are all extensions of AtomPub better known as "AtomPub Profiles". (I'm sure there are others but these are the obvious candidates for IR use.)

OASIS Content Management Interoperability Services (CMIS) is over 200 pages long but, in part, describes an AtomPub profile. I concur with the sentiment here that being asked to implement CMIS won't make your developers happy. The model underpinning CMIS has a hierarchical folder structure. By supplying a special tag in a POST to a feed, an Atom entry is created that points to another Atom feed (or 'folder'). In this way Atom entries are effectively typed to be either a 'document' or a 'folder'. Atom entries can be moved to other folders by POSTing them to that folder's feed. There is lots more that CMIS adds in, to the extent that I forget what's at the beginning before I get to the end!

    root feed
     |
     |-- document entry
     |
     |-- document entry
     |
     |-- folder entry
          |
          |-- folder feed
               |
               |-- document entry
               |
               |-- folder entry
    ...
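The 'special tag' for a folder can be sketched as follows. This is built from my recollection of the CMIS 1.0 AtomPub binding: the two CMIS namespace URIs, the `cmisra:object` wrapper and the `cmis:objectTypeId` property are assumptions to verify against the specification, and `folder_entry` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
# Namespace URIs as I recall them from CMIS 1.0; check before relying on them.
CMIS = "http://docs.oasis-open.org/ns/cmis/core/200908/"
CMISRA = "http://docs.oasis-open.org/ns/cmis/restatom/200908/"

for prefix, uri in (("atom", ATOM), ("cmis", CMIS), ("cmisra", CMISRA)):
    ET.register_namespace(prefix, uri)


def folder_entry(title):
    """Build an Atom entry whose CMIS object type marks it as a folder."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    obj = ET.SubElement(entry, f"{{{CMISRA}}}object")
    props = ET.SubElement(obj, f"{{{CMIS}}}properties")
    type_id = ET.SubElement(
        props,
        f"{{{CMIS}}}propertyId",
        {"propertyDefinitionId": "cmis:objectTypeId"},
    )
    # The value is what types the entry as a folder rather than a document.
    ET.SubElement(type_id, f"{{{CMIS}}}value").text = "cmis:folder"
    return ET.tostring(entry, encoding="unicode")


print(folder_entry("Reports"))
```

POSTing the resulting entry to a feed would, on a CMIS server, create the folder entry shown in the tree above; swapping the value for `cmis:document` would give the ordinary document case.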

The Google Documents API is in a different sphere to CMIS and SWORD. It is specific to Google's services so would need tweaking to be used elsewhere. As with CMIS, special syntax passed during a POST to a feed can create a folder-type Atom entry. This entry then points to a new feed which can in turn contain a mix of folder-entries or normal entries. Google support a number of parameters to modify the default behaviour of a URL, for instance downloading a document in a different format.
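The URL-parameter idea is easy to sketch in Python. The host, path and the parameter names `docID` and `exportFormat` below are illustrative only and have not been checked against Google's documentation; `with_params` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_params(url, **params):
    """Return url with the given query parameters added or overridden."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update(params)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Assumed download URL for a document; not a real endpoint.
download_url = "https://docs.example.org/feeds/download/documents/Export?docID=abc123"

# Ask for the same document in a different format.
pdf_url = with_params(download_url, exportFormat="pdf")
print(pdf_url)
```

The point is only that the resource stays the same and the parameter changes its representation, which is a different design from SWORD's use of HTTP headers.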

SWORD is an AtomPub profile developed to support deposit in IRs. SWORD v1 adds several HTTP headers to support more complex publishing behaviour. Often in the repository world one user will be depositing on behalf of another (doctoral student depositing her supervisor's old papers ...). To support this SWORD added the X-On-Behalf-Of header, which supplies the username of the user to deposit as - assuming the current user has permission to do that. Another part of SWORD v1 was to support more complex objects (i.e. multiple files) by defining 'packages'. A package is a collection of files and metadata that is archived together and then published, with the server unpacking it to create the complex object. SWORD v2 (at time of writing) will look similar to the previous version but will define means for clients to interact with the packages after upload. OAI-ORE is used to describe the unpacked complex object, while content negotiation will be used to allow clients to retrieve the complex object in an agreed format.

    repository feed
     |
     |-- document entry
     |
     |-- document entry
     |    |
     |    |-- OAI-ORE/RDF
     |
     |-- document entry
    ...
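A SWORD v1 style deposit might be sketched like this: zip the files into a package, then POST it with the X-On-Behalf-Of header. The collection URL, the file names and the Content-Disposition value are assumptions for illustration, and a real server will usually also expect a packaging header and authentication, both omitted here.

```python
import io
import urllib.request
import zipfile


def build_package(files):
    """Zip a mapping of file name -> bytes into an in-memory package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def deposit_request(collection_url, package, on_behalf_of=None):
    """Build (but do not send) a POST that deposits the package."""
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": "filename=deposit.zip",  # assumed file name
    }
    if on_behalf_of:
        # Deposit as another user, if the server lets the current user do so.
        headers["X-On-Behalf-Of"] = on_behalf_of
    return urllib.request.Request(
        collection_url, data=package, headers=headers, method="POST"
    )


package = build_package({
    "paper.pdf": b"%PDF-1.4 placeholder",
    "metadata.xml": b"<metadata/>",
})
req = deposit_request(
    "http://repository.example.org/sword/collection",  # assumed URL
    package,
    on_behalf_of="supervisor",
)
print(req.get_method(), req.full_url, len(req.data), "bytes")
```

The server is then responsible for unpacking the zip into the complex object, which is the step that makes SWORD feel different from plain AtomPub.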

All three AtomPub Profiles probably work with a client speaking just AtomPub. The question that is left is which extension of AtomPub is best adopted to achieve our goals. I don't think any of these protocols is entirely satisfactory: SWORD feels like it is working around AtomPub rather than building on it (publishing .zip files?). It isn't clear what IPR Google's profile has nor whether they will take it in a different (incompatible) direction to what we need. CMIS, given its industrial backing, will likely be essential in the corporate environment but is daunting in its complexity (and that normally means difficult to get right in practice).

Regrettably my influence over any of these profiles is small - as developers we tend to be pushed more by political requirements ("you must support X") than technical merit. I just hope that, given the narrow range these profiles exist in, they adopt the best bits of each other! (NB I would be interested to hear of any other potential AtomPub profiles.)

2 Comments:

I really like the little ecosystem that has grown up around Atom and AtomPub. Erik Wildge's Atom Landscape Overview does a nice job of surveying the various things that build organically on top of Atom. I am personally hopeful that the SWORD effort could result in a lightweight extension mechanism to Atom to represent things that are needed for the repository community. I think publishing a small informational RFC that documented the extension would be a useful goal to have. I also think the culture of "working code wins" in the IETF would be amenable to the way JISC has tried to get SWORD out there in repository systems, instead of simply shouting from a soapbox somewhere.

Also, just wanted to say -- I got a good chuckle from the name of your blog. Thanks!

By Anonymous, at 8:10 pm

whoops, s/Erik Wildge's/Erik Wilde/

By Anonymous, at 8:10 pm



19 January 2011
