An Open Access Peon

20 June 2006

Open access and science

I've been asked to contribute to a 'focus' section on open access:

More precisely, do you think that open-access publication speeds up scientific dialog between researchers and, consequently, should be extended to the whole scientific literature as quickly as possible?
Do you really think that Open-Access Articles Have a Greater Research Impact?
If yes, what are, according to you the main consequences on communication between scientists?

Nico Pitrelli (Deputy editor of JCOM - Journal of Science Communication)


A number of potential and identified benefits have been associated with open access (providing free access to users of research literature). Of foremost interest is whether open access papers receive more citations (and downloads) than papers that are only available through a subscription or similar payment. Eysenbach's work is the latest study to confirm the general finding that open access does increase citation impact. (Obviously, once all research is open access, there can be no citation advantage attributable to the free/non-free comparison.)

Eysenbach has attempted to measure only the free/non-free variable by comparing articles in a journal with an open access option and has argued that other studies mix up a number of potentially confounding variables (number of authors, chronology issues etc.). Regardless of the finer points, I argue that so far the evidence points towards open access articles receiving more citations and, given there is very little cost in providing open access (by author self-archiving the pre-print and/or post-print), the potential benefits outweigh the 'risks'. This isn't to say authors will suddenly see citations where they didn't before - these studies compare averages and also seem to indicate that those papers that would already be high impact benefit most from open access.

There are other potential benefits that open access could provide but that perhaps aren't of immediate interest to authors. One of these is that the duration of the research cycle is reduced: an author writes an article, which is read by other authors, who then cite that article in their own articles. In physics this is the result of rapid pre-printing - authors write an article and simultaneously post the pre-print to the physics arXiv e-print service as well as submit it to a journal (several physics publishers even accept submissions direct from the arXiv). Rapid, free access to pre-prints in physics has dramatically improved the rate of communication in that subject. This isn't to say journals have been sidelined in physics - far from it - as it appears (in studies performed by Michael Kurtz) that authors cite the pre-print, then switch to reading and citing the journal article once published.

Another potential benefit of open access is that it opens up the market for providing services to researchers. Putting all scholarly research on the web, free to access, will allow it to be indexed by a wide range of services. We have already seen Google move into this area with their Scholar service and there are similar moves afoot by the other big players (Microsoft and Yahoo). There is also considerable interest among funding agencies in using open access to help promote the research they fund (e.g. the Wellcome Trust in the UK). There are also research tools built by the academic community - Citeseer (for computer science) and my own Citebase (for the arXiv). These tools gain their usefulness from the seamless way a user can move from the service to the full-text, without having to pay access fees. They are also potentially more powerful than existing bibliographic databases, because the full-text is freely accessible and hence can be made fully searchable.

The consequences for scientists are twofold. Firstly, as authors they will increasingly be expected or even required to provide free access to their research results, either by publishing in open access journals or by author self-archiving their articles in an institutional or subject-based repository. These mandates may come from their institution, funding agency or even government. Secondly, as users of research material scientists will find it easier to locate and gain access to the full-text of research articles (I already find a Google web search to be the quickest way to locate a cited paper). They will also increasingly see articles being automatically and autonomously measured and evaluated by third-party services. This is both to serve the needs of research agencies (who need to evaluate the impact of the researchers they fund) and to provide better search and alerting services.

Ultimately open access is about using the power of the web - to provide instant, near-free access to information - to maximise the benefit of the investment in research. To do anything else is a betrayal of the public investment in science.

14 June 2006

I see data but do you CDATA?

For a technology so widely used HTML and family can throw up a lot of roadblocks in the way of progress. I've been working with javascript ('AJAX') with a view to improving the responsiveness of Citebase's web interface. An important requirement is to maintain a 'single-page-interface' but to allow differing analyses and links to be loaded into that page through a menu-like mechanism: ordinarily I'm a fan of keeping-it-simple, which in this case would mean as many different pages as there are functions, but I suspect most users would get lost in the myriad of page reloads.

One javascript library (the J in AJAX) suggested to me by a colleague was 'DOJO' (http://www.dojotoolkit.org/), which provides a lot of javascript-based widgets that turn your web page into something more akin to a windows GUI. But, and you'll notice this if you go to that page, it breaks the forward/backward browser history navigation. In which case we're somewhere closer to the entire-web-page in a Macromedia flash file than I'm comfortable with. (I dislike flash-based things because they prevent me from using the neat tools that Firefox otherwise provides, like forward/backward navigation on my 5-button mouse, keyboard scrolling/searching/link selection etc.)

After several days trying different techniques I've ended up back where technology was several years ago: the iframe. Coupled with a bit of javascript to make interactive menus (so we don't have to go to the server to get more options), iframes keep the forward/backward navigation working and can be automatically stretched by javascript so as to fit the content, i.e. no scrollbars floating around the middle of a page. Now the question is how I provide hooks in the page for non-javascript, non-iframe users - lest I forget how web pages look to web spiders.
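The automatic stretching mentioned above can be sketched roughly as follows. This is a minimal illustration, not Citebase's actual code: the function name `fitIframeToContent` and the `minHeight` parameter are assumptions made for the example, and it only works when the framed page is served from the same site as the parent page.

```javascript
// Hypothetical helper: stretch an iframe so its content shows without an inner scrollbar.
// Assumes the framed document is same-origin; otherwise contentWindow.document is off limits.
function fitIframeToContent(iframe, minHeight) {
  var doc = iframe.contentWindow.document;
  // scrollHeight is the full height of the content, including any part scrolled out of view.
  var contentHeight = Math.max(
    doc.body ? doc.body.scrollHeight : 0,
    doc.documentElement ? doc.documentElement.scrollHeight : 0
  );
  var height = Math.max(contentHeight, minHeight || 0);
  iframe.style.height = height + "px";
  return height;
}
```

In a page this would typically be called from the iframe's load event, so that the height is recalculated every time a menu option loads a different analysis into the frame.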

Where does CDATA come into all this? It turns out if you XHTML DOCTYPE a libxml2 DOM then try to add a 'script' element libxml2 decides the text must be a CDATA section, but doesn't render it in a fashion compatible with browsers' javascript parsing. Reading around it seems this is because 'script' is defined as containing CDATA (or PCDATA) in the XHTML specs, which strikes me as an unfortunate collision of standards and how things actually end up working (or not, in this case). I can understand why this has come about (mixing of content types within a single document), but 'little' things like this can be really annoying. For example, using DOJO requires setting up a configuration variable, loading its js file, then loading up particular widgets. Putting that into external files would require another two files, with one line in each.

I've pretty much come to the conclusion AJAX is a really nice technology for some specific areas. AJAX is good for complex user interaction (e.g. form-filling) but bad for simple interaction (e.g. navigation, scrolling and reading). Firefox makes a good job of AJAX (and I suspect Opera is better still), but as long as 80% (?) of users insist on using Microsoft's crippleware we're stuck in the slow lane. Then again, perhaps Internet Explorer 7.0 will solve all AJAX's problems ...

11 June 2006

What the ping?

I've been trying to get multicast working on my home ADSL router (a Linux-based PC using a Speedtouch 'frog' USB modem). This is because the BBC and several UK ADSL providers are running multicast trials, including my own ISP Plus.net. While Plus.net seem to have multicast enabled (I see the IGMP messages on the ppp device), getting Linux to actually do something multicast routing wise seems to be one of those "if you're in the club" topics: the preserve of those who find command-line interfaces exciting.

Out-of-date documentation annoyances - if you've read the incomplete multicast routing howto it ends (before the "to be continued") with pinging the 'all-hosts' group (224.0.0.1) to see who'll respond, except sometime between when that was written circa 1999 and now, responding to broadcast packets was disabled by default. If you want your Linux-kernel machine to respond do:

echo 0 >/proc/sys/net/ipv4/icmp_echo_ignore_broadcasts

Now, assuming multicast is alive and well on the Linux router and my Windows XP PC, the next step is to get some routing going between the LAN and ADSL. I'm currently battling xorp, which seems to get horribly confused by an interface on an IPv6 enabled machine but which doesn't have an IPv6 address. Like, say, the ADSL PPP connection? (NB I have IPv6 courtesy of a tunnel to the BT broker service - not that there's much out there in IPv6 land. I'm not even going to go near the kernel patches required to make multicast routing work on IPv6.)

And the purpose of this fruitless exercise? In the UK the BBC and ITV terrestrial stations are broadcasting the World Cup, both of which should be on the multicast ... laptop + multicast TV = portable World Cup!

09 June 2006

Chris Gutteridge once commented that Citebase's abstract ('citations') page is rather overloaded with information. With that in mind I've been reworking Citebase's display in a similar fashion to Celestial. Of course, with Citebase this is a much bigger job.

A problem for pretty much any web site is how to provide lots of information while avoiding 'overloading'. I've always thought Citebase had a reasonably clean interface, but the fact that references, citations and co-citations were laid out sequentially on the page resulted in a lot of scrolling to navigate around.

So I've now redesigned the citations page making use of some 'AJAX' to provide buttons that - when selected - load a different chunk of citation data into the page. Looking at the users of Citebase, most use Internet Explorer, followed by Firefox etc. In Firefox this loading/reloading is pretty clean and works with forward/backward navigation. With some suitable 'fixes' to get hold of the XMLHttpRequest object in IE (http://developer.apple.com/internet/webcontent/xmlhttpreq.html) the same sub-pages can be dynamically loaded. However, when navigating back to a page using forward/backward IE shows the original page (and not the one modified by AJAX). So, do I change to a total-page reload (now much easier to do having broken down the page into distinct methods) or find some way to hack IE?
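The kind of 'fix' needed to get a request object in both browser families usually looks something like the sketch below. It is an illustration under stated assumptions rather than the code Citebase uses: `createRequest` is a made-up name, and `env` stands in for the browser's global `window` object so that the logic can be exercised outside a browser.

```javascript
// Hypothetical factory: return an HTTP request object for whichever browser we are in.
// `env` is an assumption of this sketch; in a page you would pass `window`.
function createRequest(env) {
  if (env.XMLHttpRequest) {
    // Browsers such as Firefox, Opera and Safari expose a native constructor.
    return new env.XMLHttpRequest();
  }
  if (env.ActiveXObject) {
    // Older IE exposes the same functionality through ActiveX, under more than one ProgID.
    var progIds = ["Msxml2.XMLHTTP", "Microsoft.XMLHTTP"];
    for (var i = 0; i < progIds.length; i++) {
      try {
        return new env.ActiveXObject(progIds[i]);
      } catch (e) {
        // This ProgID is not installed; try the next one.
      }
    }
  }
  // No AJAX support: the caller should fall back to a full page load.
  return null;
}
```

Returning `null` rather than throwing keeps the fallback simple: a page can test the result once and, if there is no connector, leave its ordinary links in place.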

It's also become apparent IE has a really bad way of handling button padding. It uses a percentage of the text width to pad and ignores any stylesheeting. It turns out the way to fix this is to use "* button { overflow: visible; }", which is a css hack to affect only IE.

The drip-feed of emails concerning the mismatch of citation-count figures in Citebase continues. This is because the citation counts in some places are taken from the record and in others from the duplicates. I thought I had got this correct, but it seems not.

08 June 2006

Reinventing the Wheel and Other Exercises

So, it looks like the re-write of Celestial is nearing its conclusion. As it was, Celestial was getting to be a pain to maintain, as well as being particularly user-unfriendly. Now that the OAI set membership has been separated into per-repository tables, set-based selection performance is greatly increased (at least until a single repository ends up getting stupidly big). There's now a proper interface for managing subscriptions - the ability to register for reports on Celestial's harvesting.

Internally I've adopted a more-modular approach to the web interface, with each section in its own (lightly wrapped) .pm file. This has made it surprisingly easy (and neat) to add additional outputs. All of the existing functionality is there: the OAI interface, ListFriends, repository listing and editing, but it now all goes through a common Apache interface.

So, stuff I've learnt in this exercise (NB this is under Redhat Enterprise 4):

  1. Apache::RequestIO is needed to enable $r->read(), otherwise CGI fails on HTTP POST (had similar problems with hashes and requiring APR::Table). The mod_perl use of modules is pretty infuriating.
  2. There are some interesting, subtle issues in XML::LibXML when trying to generate SAX events from a sub-tree. Basically, the thing that SAX events are generated from has to be the root node, which means if you want to include a DOM fragment in another structure that generates SAX events the subtree has to be extracted and set as the root node of another DOM. So far, so annoying, but another bug inside LibXML caused me a headache. It seems LibXML segfaults if you try to set a subtree as the documentElement (presumably because it frees() the old root element, clobbering the subtree you just set as the documentElement). If you're wondering why you would ever want to do this, well my OAI library is based on sticking DOM fragments into a perl OO structure that outputs by generating SAX events.
  3. The HTTP connector classes across browsers behave the same, but are called different things and set self to different things in the callback. Why does this matter? Well, the different names can be handled using javascript voodoo, but if you want to open up multiple HTTP connectors getting a handle to the particular connector that triggered a callback is impossible (with the exception of Opera, which apparently sets self to be the connector object). Instead, for my Celestial AJAX experimentation, I had to store each connector in an array, then interrogate each one in turn to spot the one that was in a ready state. More regrettable global-fudges to get around stateless callbacks.
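The array-scanning workaround in item 3 can be sketched as below, together with one possible alternative. Both function names (`takeCompleted` and `watch`) are hypothetical and the code is not taken from Celestial. The alternative assumes the callback can be written as a closure, in which case it keeps its own reference to the connector and no global list is needed.

```javascript
// readyState 4 means the request has completed.
var DONE = 4;

// The approach described in item 3: keep every connector in a list and, when any
// callback fires, scan the list for the ones that have finished.
function takeCompleted(connectors) {
  var finished = [];
  // Walk backwards so that removing an entry does not skip its neighbour.
  for (var i = connectors.length - 1; i >= 0; i--) {
    if (connectors[i].readyState === DONE) {
      finished.push(connectors.splice(i, 1)[0]); // remove so it is handled only once
    }
  }
  return finished;
}

// A possible alternative: the callback is a closure over `connector`, so it
// does not depend on what the browser sets self/this to.
function watch(connector, onDone) {
  connector.onreadystatechange = function () {
    if (connector.readyState === DONE) {
      onDone(connector);
    }
  };
  return connector;
}
```

With `watch`, each connector carries its own handler, so several requests can be in flight at once without any shared bookkeeping.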

This is the first post to this blog - if you're reading it, blimey. Essentially this will be my effort to document problems I've grappled with and the general grind of working on the tools I've developed and now support (Citebase, Celestial and ROAR). This blog is aimed at myself - as I'm useless at keeping a lab-book and can't find what I want in them anyway - but if it helps you, that's great.