Open Harvester Systems support questions and answers, bug reports, and development issues.

Moderators: jmacgreg, michael, John

Forum rules
The Public Knowledge Project Support Forum is moving to

This forum will be maintained permanently as an archived historical resource, but all new questions should be added to the new forum. Questions will no longer be monitored on this old forum after March 30, 2015.
Posts: 3
Joined: Mon Jun 25, 2007 7:32 pm

NLNZ modifications

Postby paynter » Thu Jul 12, 2007 5:31 pm

Hi all:

The National Library of New Zealand has been building a website based on PKP Harvester, with a lot of help from Kris Thornley at the OARiNZ project (you may have noticed his various contributions in these forums).

We are happy to contribute some or all of these changes back to the PKP.

Our website is designed to highlight high-quality metadata in New Zealand institutional repositories and to provide lots of ways for research users to discover the research documents. Rather than competing with Google Scholar on search, we have provided other features like sophisticated browse, RSS feeds, and OAI-PMH re-export. We have also made a big effort to feature the contributing institutions.

We have tried to isolate our modifications so that they are in plugins that can be added or removed. However, a lot of our work has been user interface customisations: you won't recognise the website as a PKP Harvester system when you see it. Unfortunately, this means that the changes have bled through the PKPH code more than we'd like (as we have to change page classes, templates, locale files, etc). Another issue is that we are using an Oracle database, and we have had to tweak SQL in a number of places (for example, Oracle doesn't let you use LIMIT clauses but PKP Harvester uses them extensively).
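To give a flavour of the LIMIT issue: the sketch below shows one common way to emulate `LIMIT ... OFFSET ...` on Oracle, using the nested ROWNUM idiom. It is an illustration only, not the actual NLNZ patch; the function name `oracle_page`, the aliases `page_` and `rnum_`, and the `records` table in the usage note are all made up for the example.

```python
def oracle_page(sql: str, limit: int, offset: int = 0) -> str:
    """Wrap a SELECT so Oracle returns `limit` rows after skipping `offset`.

    Emulates `LIMIT limit OFFSET offset` with nested ROWNUM filters.
    The inner query should carry its own ORDER BY, otherwise the page
    boundaries are not stable between calls.
    """
    limit, offset = int(limit), int(offset)
    if limit < 0 or offset < 0:
        raise ValueError("limit and offset must be non-negative")
    last_row = offset + limit
    # Inner filter cuts off everything after the last wanted row;
    # outer filter drops the rows before the first wanted row.
    return (
        "SELECT * FROM ("
        "SELECT page_.*, ROWNUM rnum_ FROM (" + sql + ") page_ "
        "WHERE ROWNUM <= " + str(last_row) +
        ") WHERE rnum_ > " + str(offset)
    )
```

For example, `oracle_page("SELECT * FROM records ORDER BY record_id", 10, 20)` asks for rows 21 to 30. Note that the result set gains an extra `rnum_` column, which the calling code has to ignore.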

However, there are some discrete sets of functionality that you may find useful. Some of the main modifications are:

1. Completely new browse system. For example, you can browse by author, with an A-Z list presented, then a list of names when the user selects a letter. There are similar Title browse and Year browse features. There is also a "dissertation list" browser. All of these are linked to RSS feeds (using Kris Thornley's SRU plugin) so you could, for example, subscribe to an RSS feed of all papers by John Smith, or all dissertations. We also have a hierarchical subject browse system.

2. You may be wondering how well this browse system works, given how patchy and inconsistent the metadata from various sources can be. Well, we have also updated the OAI-PMH Harvester to allow for the validation and transformation of metadata as it is harvested. (Kris and I and Conal Tuohy came up with the following idea, and it is a really cool piece of work, if I say so myself.)

Kris has implemented a dc_interceptor schema plugin that we use in place of the dc plugin for OAI-PMH harvests. The dc_interceptor plugin works by applying XSL transformations to the harvested Dublin Core metadata to produce two new sets of metadata for each record: Administrative metadata, which validates the metadata by recording any errors in the DC; and Internal metadata, which transforms the harvested DC metadata into consistent forms. We then store all three sets of metadata against the record (this is possible because each "entry" can have its own schema, independent of the record's schema). So for each record, we now have three sets of metadata. It is remarkable how little we had to tweak the PKP Harvester code to make it work (i.e. PKP Harvester really is very extensible), though I think we had to make a few more UI changes than we'd like in order to view records and generate summary text (Kris has a vanilla version of that code though).

3. Validation. The Administrative metadata is used for validation: it checks for metadata errors (or warnings). These errors can be used as the basis for metadata quality reports (for each institution, and for the repository as a whole). They can also be exported back to the source repositories, so a repository administrator can subscribe to an RSS feed (or OAI-PMH feed) of the errors in their repository metadata, which they can then fix. The validation XSL files are generated from a Schematron schema.

4. Transformation. The Internal metadata is used to provide consistent browsing, search filtering etc across all the archives. For example, we transform all DC Creator metadata into internal Author metadata that is in "Lastname, Firstnames" format (for consistent browsing); we convert several different Subject metadata schemas into a standard internal Subject metadata schema based on a crosswalk that is stored in an XML file; and we convert all sorts of dc:type metadata into the Eprints Type Vocabulary schema (again based on an XML crosswalk file).

5. Click-through counter. We added a table to count the number of click-throughs per record, and generate monthly reports.

6. Institution pages. Each institution (i.e. PKP Harvester Archive) has a page that shows the institution logo, the description, a list of the most popular records, and also the most recent additions (in a sidebar). There are also links to institution-specific browse by author and browse recent records pages (with RSS feeds, etc).

7. OAI-PMH export, as per Kris Thornley's plugin. We also use Kris's SRU/SRW plugins, both to provide search services and as the basis of RSS feeds.

8. The "Placebo" harvester is a new harvester plugin that does nothing. It simply creates an archive with no records. I find it useful if I want to "switch off" an Archive but not delete it.

9. The "Meta Harvester" plugin is a more useful harvester plugin, and is used to create virtual "Institutions" within our system that don't correspond directly to OAI-PMH data providers. This is a harvester plugin that can be used to set up a "virtual" Archive that "steals" records from other Archives. The first use is for merging feeds. For example, suppose your university has two different OAI-PMH feeds, but you want to present these to the user as though they came from a single (merged) source. You can set up an Archive with an OAI harvester for each of the two feeds, then set up an Archive with a Meta Harvester that merges the data from the other two Archives. The second use is for splitting feeds. We have a case where we have one OAI-PMH source that has to be split up among 6 different institutions, and the only way to tell which record belongs to which institution is by examining the OAI-PMH record identifier (no useful sets are available). We have done this by creating one Archive with an OAI Harvester, and six more with Meta Harvesters, each of which "steals" the records appropriate to one particular institution.
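For readers who want a concrete picture of the validation and transformation described in items 2 to 4, here is a rough sketch. It deliberately swaps in plain Python functions for the XSL transformations and Schematron-generated rules that the real dc_interceptor plugin uses, so it shows the shape of the idea rather than the implementation. The crosswalk XML format, the validation rules, the target type labels, and all function names are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical crosswalk format and illustrative target labels;
# the real crosswalk files are not shown in this thread.
TYPE_CROSSWALK_XML = """
<crosswalk>
  <map from="PhD thesis" to="Thesis"/>
  <map from="journal article" to="JournalArticle"/>
</crosswalk>
"""

def load_crosswalk(xml_text):
    """Read <map from=... to=.../> pairs into a case-insensitive lookup."""
    root = ET.fromstring(xml_text)
    return {m.get("from").lower(): m.get("to") for m in root.findall("map")}

def normalise_author(creator):
    """Turn 'John A. Smith' into 'Smith, John A.'; leave 'Smith, John' alone."""
    creator = " ".join(creator.split())
    if "," in creator or " " not in creator:
        return creator
    first, last = creator.rsplit(" ", 1)
    return f"{last}, {first}"

def validate(dc):
    """Administrative metadata: a list of errors and warnings about the DC."""
    problems = []
    for field in ("title", "creator", "date"):
        if not dc.get(field):
            problems.append(f"error: missing dc:{field}")
    for date in dc.get("date", []):
        if not (len(date) >= 4 and date[:4].isdigit()):
            problems.append(f"warning: dc:date '{date}' does not start with a year")
    return problems

def intercept(dc, type_map):
    """Return all three metadata sets kept against one harvested record."""
    internal = {
        "author": [normalise_author(c) for c in dc.get("creator", [])],
        # Unmapped types pass through unchanged.
        "type": [type_map.get(t.lower(), t) for t in dc.get("type", [])],
    }
    return {"dc": dc, "admin": validate(dc), "internal": internal}
```

Calling `intercept(record, load_crosswalk(TYPE_CROSSWALK_XML))` on a record given as a dict of DC field lists returns the original DC untouched, alongside the Administrative and Internal sets. Keeping the harvested DC unmodified is what lets the errors be reported back to the source repository in terms of its own metadata.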
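The feed-splitting use of the Meta Harvester in item 9 boils down to matching on the OAI-PMH record identifier. A minimal sketch of that matching follows, assuming each institution can be recognised by an identifier prefix. The prefixes, archive names, and the `split_records` function are invented for illustration; the real identifier scheme is not given here.

```python
from collections import defaultdict

# Hypothetical identifier prefixes, one per virtual archive.
INSTITUTION_PREFIXES = {
    "inst-a": "oai:shared.example.org:a/",
    "inst-b": "oai:shared.example.org:b/",
}

def split_records(records, prefixes):
    """Assign each harvested record to the archive whose prefix matches.

    Records that match no prefix are collected under 'unclaimed' so
    they are visible rather than silently dropped.
    """
    archives = defaultdict(list)
    for record in records:
        for archive, prefix in prefixes.items():
            if record["identifier"].startswith(prefix):
                archives[archive].append(record)
                break
        else:
            archives["unclaimed"].append(record)
    return dict(archives)
```

Merging is the reverse case: a virtual archive that claims every record from two or more source archives.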

Anyway, let me know if any of this is likely to be useful. I should add, by way of proviso, that we have not tested this with very large archives (though the new harvester is remarkably quick).

We should have a public demo available in a few weeks (we're starting some closed demoing at the moment).


Posts: 10015
Joined: Wed Aug 10, 2005 12:56 pm

Postby asmecher » Mon Jul 16, 2007 3:50 pm

Hi Gordon,

Thanks for contributing! This list definitely looks interesting and we'd love to be able to merge some of the code into the Harvester distribution for a future release. Let me know if you'd like a hand with testing, and likewise when a demo is available.

Alec Smecher
Public Knowledge Project Team
