An Open Access Peon

16 November 2006

Google CSE added to the Registry of Open Access Repositories

A few weeks ago OpenDOAR announced the inclusion of an experimental Google CSE ("Custom Search Engine"). Google released their CSE (or 'co-op') tool on the 26th of October, and it's a testament to Google's skill at identifying new niches that CSE has already gained such interest.

As OpenDOAR's nearest (good-natured) competitor, it's incumbent on me to make sure we don't fall behind in the technical stakes - quality of content is a very subjective issue that I'll leave to others to judge. So, what's involved in setting up a CSE?

Google CSE is the normal Google web index, but constrained to a human-edited list of web sites. The CSE is identified by a unique (Google-generated) identifier that, when passed to Google co-op with a query, returns matching pages from only those sites. As the creator of a CSE you can either point users at the CSE's virtual home page (at Google) or embed it in your own site. How closely it matches your own site's look and feel depends on how long you're prepared to spend implementing JavaScript wrappers. Google does, however, impose some constraints: in particular, commercial users must allow advertising to be included, and the "Google Custom Search" line must appear either in the search box or nearby (which the Google-supplied code snippets do for you).

What Google CSE means for us registries is that we can now (theoretically) provide full-content searches of registered repositories with a minimum of effort. This is what OpenDOAR have done, and we've followed suit in our own search interface.

In addition to constraining the search to given sites, Google CSE provides 'refinements' - editor-provided key terms that either filter the list of sites or weight certain sites higher in the search results. Refinements let the CSE creator offer sub-customised searches that control the results more finely. The typical example is to create a CSE for a topic area (e.g. tropical diseases) and then provide refinements for different types of user (e.g. medical practitioners).

To create the Google CSE and refinements for ROAR I created two exports: the TSV and Context files. The TSV file contains the URLs of the sites to be included, labels for each site and the site's weighting. A label is an identifier that can be used to refine search results (e.g. Australian repositories are labelled with 'country_au').
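As a rough illustration of the TSV export, here is a minimal Python sketch. It is not the actual ROAR export code: the column names ("URL", "Label", "Score"), the comma-joined label column, the repository records and every label other than 'country_au' are assumptions made up for the example, so check Google's own documentation for the exact layout it expects.

```python
import csv
import io

# Hypothetical repository records: the URLs, scores and the
# 'software_eprints' label are invented for illustration only.
repositories = [
    {"url": "eprints.example.edu.au/*", "labels": ["country_au"], "score": 1.0},
    {"url": "repository.example.ac.uk/*",
     "labels": ["country_uk", "software_eprints"], "score": 1.0},
]


def export_tsv(repos):
    """Return a tab-separated export with one row per site.

    The header names and the comma-joined label column are assumptions,
    not a confirmed description of Google's format.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["URL", "Label", "Score"])
    for repo in repos:
        writer.writerow([repo["url"], ",".join(repo["labels"]), repo["score"]])
    return out.getvalue()


print(export_tsv(repositories))
```

Generating the file from the registry's own records means the list of sites and labels can be regenerated whenever a repository is added or reclassified.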

The Context file contains the basic search engine configuration (title, description etc.) and what effect refinements have on the search results. Refinements can change the weighting of sites, alter the query or filter out given sites. In the ROAR CSE I've set all the labels to 'filter' (i.e. only include sites that contain the given label).
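The 'filter' behaviour I'm expecting can be sketched in a few lines of Python. This models only my reading of what a filter refinement should do, not how Google implements it; the site names and the 'software_...' labels are hypothetical.

```python
# Hypothetical sites and their labels, invented for illustration.
SITES = {
    "eprints.example.edu.au": {"country_au", "software_eprints"},
    "dspace.example.edu.au": {"country_au", "software_dspace"},
    "eprints.example.ac.uk": {"country_uk", "software_eprints"},
}


def apply_filter_refinements(sites, refinements):
    """Return (sorted) the sites that carry every selected refinement label."""
    selected = set(refinements)
    return sorted(host for host, labels in sites.items() if selected <= labels)


print(apply_filter_refinements(SITES, []))
print(apply_filter_refinements(SITES, ["country_au"]))
print(apply_filter_refinements(SITES, ["country_au", "software_eprints"]))
```

Under this reading each extra refinement can only keep the set of searched sites the same size or shrink it, so adding refinements should never produce more matches.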

So, that's the theory, but in practice Google CSE refinements don't appear to work like that (or work at all). If you try out the ROAR search you'll notice, first, that you get very few matches back and, second, that you actually get more matches the more refinements you use. Hopefully this will be resolved in time; in the meantime, prepare to be confused!