Metadata and Filter Framework

Introduction

This guide introduces the concepts of the filter and meta-data frameworks. It is meant as a high-level conceptual introduction. There will be several code examples, but I'll only highlight the important steps, not all the details.

In other words: I assume that you're a PHP developer and are able to adapt high-level code examples to your own situation. To make this even easier I'll point you to real code examples (mostly from the citation assistant) wherever possible.


Basic notions

Filter Framework

The filter framework is a completely independent module that is not directly linked to the meta-data framework. The meta-data framework makes heavy use of the filter framework, though, so you have to understand it before you can understand the meta-data framework itself. That's why we explain both frameworks in one place.

That said, please keep in mind that the filter framework is much more generic than the meta-data framework, which means that you can use it wherever you like, not only to transform meta-data.

The filter framework has one central class: the Filter class. In short, a filter is nothing but an enhanced function that takes input data and generates output data from it, just like any normal PHP function. The Filter class, however, introduces a lot of additional features that make it much more capable than a plain function. We'll talk about these in a bit.


Meta-data Framework

The meta-data framework has several important components. The most basic components of the framework are MetadataSchema and MetadataDescription. Several MetadataDescriptions from distinct schemas can be joined in a single MetadataRecord as a single entity. Among other things MetadataRecords are useful for protocols like OAI that allow transmission of mixed records (e.g. administrative meta-data from one standard together with discovery meta-data from another standard) or for indexing.

For those who know our legacy Open Harvester Systems (OHS) terminology:

  • MetadataSchema corresponds to the OHS Schema,
  • MetadataDescription is conceptually similar to an OHS Record, and
  • MetadataRecord corresponds to the OHS CompoundRecord.

The meta-data schema defines the meta-data standard that we want to implement. It defines the fields available within a given description, the validation rules that apply to these fields, and how descriptions relate to each other.

I'll give an example to make this a bit clearer:

Say we want to model citations in NLM. NLM 3.0 defines the <citation-element> schema, which consists of several sub-elements that may or may not have nested sub-elements themselves.

In our case we assume that we have a flat list of elements in <citation-element>, like <article-title>, <source>, and so on. We also assume that we have <person-name> elements for authors and editors that require further nesting of a <name> element with, e.g., <given-names>, <surname> and <prefix>.

This means that we have to model a hierarchical structure that looks something like this:

citation-element

  • article-title
  • source
  • author
    • given-names
    • surname
    • prefix
    • ...
  • editor
    • given-names
    • surname
    • prefix
    • ...
  • ...

To implement such a structure without unnecessary duplication, we'll define two schemas: one for the citation as a whole and one for names. Then we'll point the author and editor entries in the citation to a so-called "composite" element that is based on the name schema.

Schematically, this will look like so:

citation schema:

  • article-title
  • source
  • author (is-a name)
  • editor (is-a name)
  • ...

name schema:

  • given-names
  • surname
  • prefix

In practice the schemas allow for much more flexibility. You can define any type of directed graph with them by pointing back and forth between schemas, which means that you can implement every relevant meta-data schema out there. Conceptually this means that we are 100% compatible with two of the main meta-standards: the DCMI abstract model and the RDF triple specification.

The schema will also define the encoding of fields. We use the basic types defined in the DCMI abstract model, from which we can build arbitrarily complex encoding definitions. This includes authorities/vocabularies as well as pattern-specific encodings like date, ISBN, etc. Validators are attached to each field to make sure that data being stored in descriptions complies with what the standard defines.

The schema will also define the cardinality of fields. In the given example there can be only one article-title but there can be several authors in a record. We are able to define the following cardinalities:

  • one-to-many
  • one-to-one
  • zero-to-many
  • zero-to-one
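
To make this a bit more tangible, here is a minimal sketch of how the two schemas from the example might be declared in code. The method name (addProperty) and the TYPE_*/CARDINALITY_* constants are illustrative assumptions, not the exact API - please consult the MetadataSchema and MetadataProperty class documentation for the real signatures:

// Sketch only - hypothetical names, parent constructor call omitted.
class NameSchema extends MetadataSchema {
  function NameSchema() {
    // Flat fields, each with an encoding type and a cardinality.
    $this->addProperty('given-names', TYPE_STRING, CARDINALITY_ZERO_TO_MANY);
    $this->addProperty('surname', TYPE_STRING, CARDINALITY_ONE_TO_ONE);
    $this->addProperty('prefix', TYPE_STRING, CARDINALITY_ZERO_TO_ONE);
  }
}

class CitationSchema extends MetadataSchema {
  function CitationSchema() {
    $this->addProperty('article-title', TYPE_STRING, CARDINALITY_ZERO_TO_ONE);
    $this->addProperty('source', TYPE_STRING, CARDINALITY_ZERO_TO_ONE);
    // Composite properties point to another schema ("is-a name").
    $this->addProperty('author', TYPE_COMPOSITE, CARDINALITY_ZERO_TO_MANY, 'NameSchema');
    $this->addProperty('editor', TYPE_COMPOSITE, CARDINALITY_ZERO_TO_MANY, 'NameSchema');
  }
}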

Once you have defined the schema you can instantiate it via a MetadataDescription. Whenever you create a meta-data description you specify which standard the description must conform to. The MetadataDescription will then automatically enforce the standard so that you can only add standard-conforming statements to it.

You add a statement to a meta-data description by specifying the property or "key" (e.g. "source" in the above example) and the value for the property (e.g. "Journal of Communication Science"). You won't be able to add a statement that does not conform to the standard, i.e. one that is not properly encoded or that breaks the cardinality rules.
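
A quick usage sketch, assuming the constructor and addStatement() signatures roughly as described (in real code the schema is referenced by its fully qualified name):

// Create a description that must conform to the citation schema.
$citationDescription = new MetadataDescription('CitationSchema', ASSOC_TYPE_CITATION);

// Add a standard-conforming statement.
$source = 'Journal of Communication Science';
$citationDescription->addStatement('source', $source);

// This would fail: 'unknown-key' is not defined in the schema.
// $citationDescription->addStatement('unknown-key', $someValue);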

The Relation of Filters and Meta-data

When you work with meta-data you'll make heavy use of filters in many situations. Filters are able to validate and accept very complex input/output types: whole Word documents or XML documents, meta-data already codified in a MetadataDescription object, application objects, and even simple primitive string or numeric values.

Filters help to abstract a lot of implementation details away from the user. You can define a transformation like "take a plain text citation and transform it to NLM 3.0 XML" in a single line of code without knowing much about the sometimes extremely complex algorithms necessary to do such magic. Most of the time filters will internally make use of complex tools, external databases or other, nested filters. All this complexity should be hidden from client code.

The most prominent examples where filters are used in the meta-data framework are the following:

  • import serialized meta-data into a meta-data description,
  • transform complex objects to make them available for the meta-data framework, e.g. by transforming an MS Word document into an Open Office document which can then be read as XML,
  • export a meta-data description into serialized meta-data,
  • parse plain text into structural data, e.g. a plain text author representation into a name schema,
  • crosswalk from one meta-data format to another, no matter whether still serialized or already encoded in a MetadataDescription,
  • enrich meta-data descriptions, e.g. by passing them through a cleansing process or by enriching them with data from external sources,
  • index meta-data. This is done by tokenizing and linking meta-data, thereby preparing it for storage in an index engine,
  • extract and inject meta-data from/to application objects, e.g. extract/inject NLM meta-data from/into a PublishedArticle object.

Please keep in mind that the meta-data framework relies on the filter framework but the filter framework does not (and never should) rely anywhere on the meta-data framework.

The example

In this guide we'll exemplify the capabilities of the filter and meta-data frameworks based on one example that we carry through the different situations we'll explain.

We assume that we want to:

  1. import serialized MARCXML from an external system,
  2. store the record to the database,
  3. crosswalk the record to Dublin Core,
  4. index the record for search and finally
  5. export it in, say, RDF format.

Note: When I show (pseudo-)code within the example I'll leave out all paths, class packages, etc. to make the example more readable. We are only interested in concepts here, not in actual code. The intent of the example is to give you the necessary notions to understand the conceptual background of the meta-data framework.

So what would normally be

metadata::plugins.metadata.marc.schema.MarcSchema(*)

will rather become

metadata::MarcSchema(*)

There are lots of fully qualified samples out in the PKP code, so before you actually implement something, look at a few of them to make sure that you get the naming right. All the important core classes of the filter and meta-data frameworks also contain extensive class and inline comments which are closer to the actual technical implementation of the concepts presented here. I'll also point you to real-world examples where available and appropriate.

Filters

As we've said before, filters are nothing but highly sophisticated functions. This means they take input and generate output, and as with every function their advantage is that they hide their inner workings from the outside world.

What makes filters so much more capable than normal, say, PHP functions is that:

  • they can be stored to the database together with their parameterization (e.g. the OCLC API key for a citation database connector),
  • they can be configured per context (i.e. journal, press, conference),
  • they can define runtime requirements (e.g. PHP version, PHP modules, programs available on the server, ...) and enforce them automatically,
  • they can enforce and validate arbitrarily complex input/output types (e.g. Word documents, XML schemas/DTDs, meta-data schemas, etc.),
  • they can be nested to form composite filters (serialized, parallelized) for better re-use of generic filters, like an XSL filter,
  • they can be configured with XML (look out for filterConfig.xml files),
  • they can be discovered in the database based on input/output types, which makes them very pluggable. Think: "Give me all filters that any plug-in contributed that take a MARC description and transform it to something." or "Give me all filters that transform NLM descriptions to plain text, a.k.a. citation styles."

In the context of our example we'll have to define several filters:

  1. one that takes serialized MARCXML and spits out a MARC MetadataDescription. We'll call it "MarcXmlMarcFilter" throughout this text.
  2. one that takes a MARC MetadataDescription and maps it to PKP application objects like PublishedArticle.
  3. one that takes a MARC MetadataDescription and serializes it to XML. Let's call it "MarcMarcXmlFilter".
  4. one that takes serialized MARCXML and crosswalks it to Dublin Core, using XSL internally. Let's call it "MarcXmlDublinCoreXmlFilter".
  5. one that takes a MARC MetadataDescription and tokenizes it into an index format, e.g. a Lucene document or the PKP keyword format. Let's call it "MarcLuceneFilter".
  6. one that takes a MARC MetadataDescription and serializes it to RDF using Turtle. Let's call it "MarcRdfFilter".

Filters consist of two parts:

  • a filter class that does the actual work
  • a filter configuration that defines how the filter is persisted to the database, including the available filter settings, the input/output types, etc.

In the filter configuration you can also define partially pre-configured filters that define some but not all necessary settings for them to work. We call these filter templates. Such filter templates are useful when filters combine internal settings (e.g. a display name) with user settings (e.g. a personal citation database user name and password the user must enter).

Technical note: Filter templates are always stored in the context of the application (context "0") and they are marked as templates. Fully configured filter instances can live both at the application level (context "0") and in an individual context (journal, press, conference).

The Filter Class

All filter classes must inherit from the Filter class. The Filter class contains extensive class and inline documentation. Please consult it before you implement a filter.

The Filter class provides several abstract template methods that you have to implement. The most prominent one is the process() method that does the actual work.

The filter framework ensures that the input parameters passed into the process() method will already have been validated. So you can rely on getting the right data without having to validate it again.

You can do arbitrarily complex transformations in the process() method. There are no specific rules for what these transformations should look like. You only have to make sure that the output you generate will be of a valid type, otherwise the filter will throw an error when you try to execute it.

When you extend the Filter class directly you'll write a filter that cannot be persisted to the database. Such filters can be useful when you use them only internally, without the user having to configure them and without having to share them via plug-ins. Often such simple filters encapsulate small pieces of work that you want to integrate into other filters.
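
As an illustration, a minimal non-persistable filter could look like this. This is a sketch under the assumption that the Filter constructor takes the input and output type descriptors; consult the Filter class documentation for the actual template methods:

import('lib.pkp.classes.filter.Filter');

class UppercaseFilter extends Filter {
  function UppercaseFilter() {
    // Declare input and output types - both plain strings here.
    parent::Filter('primitive::string', 'primitive::string');
  }

  // The actual transformation. The input has already been validated
  // against the input type when this method is called.
  function &process(&$input) {
    $output = strtoupper($input);
    return $output;
  }
}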

Persistable Filters

If you want to create a filter that implements all of the above-mentioned features you'll have to extend PersistableFilter. This will create a filter that can be written to the database together with its configuration. Persistable filters are the most useful filters in the context of the meta-data framework because they can be contributed by plug-ins and shared across all applications with very weak coupling.

The FilterDAO, which is responsible for filter persistence, allows many different ways in which client code can "discover" filters in the database without having to know which exact filter it is looking for.

This means that you can, for example, add further citation output filters to the database via a plug-in, and client code will immediately be able to use them simply because they will appear in the result list when looking for a certain class of filters.

Persistable filters are configured via XML, which makes them highly re-usable. If you look for filterConfig.xml files in the application you'll find many examples.

Type Definitions

Filter input/output types are defined via instances of the TypeDescription class. The type description class contains methods to check a given value against arbitrarily complex business rules. Every type description class defines its own descriptive language to specify the expected type in a more readable format.

The description of a type consists of two parts:

  1. a namespace
  2. the actual type description

Every type description class defines its own namespace.

Type descriptions can therefore be written like this:

namespace::type-description

e.g.

metadata::plugins.metadata.mods34.schema.Mods34Schema(*)

We currently have the following type descriptions:

  • primitive types (strings, numbers, etc.) in the primitive namespace
  • class types that test whether a given object inherits from a given class - uses the class namespace
  • meta-data descriptions - uses the metadata namespace
  • XML, which can be validated against schemas, DTDs and RELAX NG - uses the xml namespace

Please look at the corresponding type description classes for an in-depth explanation of how their specific type description syntax works. We'll use several examples here which should give you a feeling for how type descriptors usually look. They are designed to document complex types in a very readable format.
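
For orientation, here is one example descriptor per namespace, taken from or modeled on the examples used throughout this guide:

primitive::string
class::lib.pkp.classes.citation.Citation
metadata::plugins.metadata.mods34.schema.Mods34Schema(*)
xml::schema(marcxml.xsd)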

Filter Groups

Filters are grouped together by input/output type which facilitates type-based discovery of filters in the FilterDAO. All filters within one group have the exact same input/output type specification. There can be several filter groups with the same type combination, though, if they do semantically different things. This is especially important when they work with unstructured data that can only be described with relatively generic types like "primitive::string" or "xml::*" (the asterisk means "all xml formats").

Composite Filters

Sometimes it makes sense to combine simpler filters into more complex ones.

A real-world example would be that you have several citation extraction filters and you want to run them in parallel and then combine their results into a single output.

To achieve this you would first use a GenericMultiplexerFilter, which takes a single input, runs it through several nested filters and outputs an array of results. Then you would take the array resulting from the first step and de-multiplex it into a single result.

The two steps can be combined into a single filter by way of the GenericSequencerFilter, which simply chains nested filters in a given order.

The so-called composite filter network would then look like this:

  • GenericSequencerFilter
    • GenericMultiplexerFilter
      • CitationExtractorFilter1
      • CitationExtractorFilter2
      • ...
    • DeMultiplexerFilter

You can execute such a filter in one single line of code without the client code having to know that the filter really consists of maybe ten other filters internally and does really complex things. We'll show an example of such a filter below. All composite filters extend PersistableFilter, which means that you can configure them and save them to the database like any other filter.

Composite filters implement a mechanism that exposes filter settings from the nested filters to the end user as if they were settings of the composite filter itself. You can link settings to each other so that if, e.g., two nested filters take the same user/password combination, the end user has to enter this data only once.

Filter Configuration

The filter configuration draws all these concepts together into an XML file that allows you to define:

  • filter groups
  • filters
  • filter settings
  • composite filters

I'll give a simple example of the filter configuration based on our MARCXML use case that we introduced above. Several more complex examples will be shown later.

Say we want to configure a filter that understands serialized MARCXML and creates a MetadataDescription from it. This would be the MarcXmlMarcFilter from the example.

The filter definition would probably look somewhat like this (simplified!):

<filterGroup
  symbolic="marc-xml=>marc"
  inputType="xml::schema(marcxml.xsd)"
  outputType="metadata::MarcSchema(*)" />

<filter
  inGroup="marc-xml=>marc"
  class="MarcXmlMarcFilter" />

If this filter is defined in the MARC meta-data plug-in's "filter/filterConfig.xml" file, then it will automatically be discovered and saved to the filter registry in the database when installing the plug-in.

In this guide we'll show several examples for different filter configurations - including filters that require settings and filters that are installed as templates. For an exhaustive description of the filterConfig.xml please look at the corresponding DTD, which contains inline comments.

Filter Installation

The best way to install filters is via the filterConfig.xml. You define all filters and filter groups in a plug-in configuration file as outlined above. The configuration file must always be found in filter/filterConfig.xml within the plug-in. This file will be discovered automatically when you install your plug-in with the upgrade.php tool or via the plug-in web installer. It will also be read when you install a PKP application for the first time.

All filters and filter groups in the filterConfig.xml will then be installed automatically as soon as you execute tools/upgrade.php upgrade, install a plug-in via the web installer or re-install/upgrade your application as a whole.

Metadata

Now that you have a basic understanding of what filters do, we'll move on to show how the meta-data framework uses them to provide high-level functionality to client code.

We do this based on the example we defined above. The sub-sections are also roughly in the order in which you would introduce a new meta-data standard to the code-base via a meta-data plug-in.

Define the Meta-data Schema

First you have to define the MARC MetadataSchema. If you look in the meta-data plug-ins' "schema" folders (in lib/pkp/plugins/metadata) you'll find lots of examples for such schemas.

Most of the time it makes sense to simplify the original schema and only use a sub-set of it. This also means that it often makes sense to unnest sub-elements into a flat key/value list if we only use them with 1:1 cardinality anyway. The MODS plug-in uses this technique quite extensively. Please have a look there.

The general guideline is: Don't implement fields you don't have data for. A good way to think about this is whether we have the field somewhere in our applications or plan to introduce it there in the foreseeable future. Thanks to the decoupling that we achieve with the meta-data framework, introducing new fields isn't very complicated and doesn't take a lot of time. You should try to get the composites right from the start, however. It is not so easy to introduce additional composites at a later stage because oftentimes filter code needs to be aware of composites to some extent.

The names of meta-data keys can be anything that makes sense. The schemas that I've defined so far all have their main binding in XML, so I have used a kind of pseudo-XPath grammar to define fields. This makes serialization very easy because you can handle it mostly generically, without much knowledge about the semantics of the schema, which in turn makes the schema much easier to extend. This technique also avoids namespace clashes because you stick very closely to the original standard's nomenclature.

I won't go into more detail with respect to schemas because the samples available in the code should be really easy to read and reproduce.
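
To illustrate the pseudo-XPath naming idea, the keys of a citation-style schema might look like this (hypothetical examples, not copied from a real schema):

article-title
source
person-group[@person-group-type="author"]
name/surname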


Map Meta-data to Application Objects

The next step usually is to define the binding of PKP application objects to your standard. This means that you create a mapping between the meta-data schema and an object like "PublishedArticle" so that you can inject and extract meta-data into/from such an object.

This is only necessary if you want to exchange meta-data between the specific standard and the application objects directly. Sometimes (e.g. in OHS or when working with citations) this is not necessary because the meta-data can be worked with in its raw state. In many cases you can also re-use an existing binding, e.g. by crosswalking your data to NLM, which already has meta-data mappings to several PKP objects. We do this when we work with OpenURL, for example, which we transform to NLM to interface with application objects.

We use a very special category of filters for the mapping of application objects to meta-data descriptions. These filters are called meta-data adapters.

Meta-data adapters all extend the MetadataDataObjectAdapter class, which in turn is a PersistableFilter (see above).

Technically speaking meta-data adapters transform "class::..." types into "metadata::..." types and back. They take an application object like PublishedArticle and transform it to, say, a MARC MetadataDescription or the other way round.

Meta-data adapters always have to implement two main methods which are defined in the MetadataDataObjectAdapter abstract class:

injectMetadata()

and

extractMetadata()

The first takes a meta-data description and maps its data to an application object which it then returns. The second takes an existing application object and extracts data from it into a standards-conforming meta-data description.
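
A skeleton of such an adapter might look like this. The signatures and the 'title' mapping are simplified assumptions - see the MetadataDataObjectAdapter class for the real template methods:

import('lib.pkp.classes.metadata.MetadataDataObjectAdapter');

class MarcArticleAdapter extends MetadataDataObjectAdapter {
  // Map MARC statements onto an Article object.
  function &injectMetadata(&$marcDescription, &$article) {
    $title = $marcDescription->getStatement('title');
    $article->setTitle($title, 'en_US');
    // ... map the remaining properties here ...
    return $article;
  }

  // Build a new MARC description from an Article object.
  function &extractMetadata(&$article) {
    $marcDescription = new MetadataDescription('MarcSchema', ASSOC_TYPE_ARTICLE);
    $title = $article->getTitle('en_US');
    $marcDescription->addStatement('title', $title);
    // ... extract the remaining properties here ...
    return $marcDescription;
  }
}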

Meta-data adapters are closely integrated with our DataObject class. This means that if you define and configure a meta-data adapter correctly then the application object will automatically "know" about the existence of your adapter. As soon as you have installed your meta-data plug-in you'll be able to say anywhere in the code:

$article =& $articleDao->getArticle();
$marcMetadata = $article->extractMetadata(new MarcSchema());

This will give you a MetadataDescription that conforms to the MARC schema you defined in the previous step. Please note that you won't have to change a single line of code within the Article class to achieve this! This also means that any application object can be bound to several standards at the same time without an additional performance hit.

You could as well have said:

$dcMetadata = $article->extractMetadata(new Dc11Schema());

if you had a meta-data adapter for DC meta-data (which we actually have).

Once you have got the meta-data object you can transform it to whatever you have filters for, completely independent of your use case or the application context you're working with. This is what standards are for, after all!

This is an extremely powerful cross-application interface that will do away with lots of the code duplication that we currently have. We currently re-implement schema-to-application-object mappings in different places within the same application and across applications. We have many implementations of the Dublin Core and NLM schemas throughout the applications, for example. Not only is this duplicate code, but these implementations are also often inconsistent with each other. Using the meta-data framework you can re-use a single consistent meta-data standard implementation throughout all applications. No need to port OAI plug-ins between applications any more, for example. We could even have a generic OAI plug-in now that automatically discovers all available meta-data plug-ins with XML binding and provides an OAI protocol wrapper for them.

Import Meta-data from External Sources

Now let's go to the next part of our example. Say we have got MARC data in serialized XML form and want to import it, e.g. into articles.

First we have to create a filter that takes XML data conforming to some XML schema. You use the "xml::..." type description for that. We've already shown an example of this when we explained the filter configuration, see above. The internals of that filter are quite obvious: You'll have to parse the XML and then add its contents to a MetadataDescription object that has been initialized with the MarcSchema. If you choose your key names intelligently (e.g. using some kind of XPath nomenclature) then the mapping can be done mostly generically, without having to define mappings for every single entry.

The logic would be something like this:

  1. Transform your XML into a sequence of fully qualified XPath entries + the values found at these locations. You can use a simple GoF visitor pattern to traverse the XML tree, which is what the PHP4-compatible XML parser does anyway.
  2. Check whether the XPath entry exists in your schema.
  3. If such an entry exists then add the value to the description.

You may have to implement this algorithm recursively if you implement nested composite meta-data descriptions as we already explained.
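
In (pseudo-)code, the mapping loop could look like this; flattenToXPathEntries() stands in for whatever visitor-based helper does step 1, and hasProperty() is equally illustrative - both are hypothetical names:

$marcSchema = new MarcSchema();
$marcDescription = new MetadataDescription('MarcSchema', ASSOC_TYPE_ARTICLE);

// Step 1: flatten the parsed XML tree into XPath => value pairs.
foreach (flattenToXPathEntries($xmlTree) as $xpath => $value) {
  // Steps 2 and 3: only add values for entries defined in the schema.
  if ($marcSchema->hasProperty($xpath)) {
    $marcDescription->addStatement($xpath, $value);
  }
}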

We assume that the filter you created is in a filter group with the symbolic name 'marc-xml=>marc'. This is exactly what you can see when you look at our configuration example above.

Now assume that you want to use that filter, which has been installed via a plug-in, somewhere in your code. To do so you do something like:

// Retrieve the filter from the database by group.
// (There are many other "discovery" methods available).
$marcxmlImportFilters = $filterDao->getObjectsByGroup('marc-xml=>marc');

// We assume that there is only one such filter. More than one filter wouldn't make
// a lot of sense in this case.
assert(count($marcxmlImportFilters) == 1);

// Ingest serialized MARC.
$marcDescription =& $marcxmlImportFilters[0]->execute($serializedMarc);

// Inject the meta-data into an article object. This is done via the
// meta-data adapter we explained in the previous section.
$article = new Article();
$article->injectMetadata($marcDescription);

// Persist the imported article.
$articleDao->insertArticle($article);


The variable $marcDescription will now contain an instance of a MetadataDescription object conforming to the MARC standard.

It is also possible to enrich application objects with meta-data that was not originally part of their interface. Such additional meta-data can be added to the DataObject via the meta-data adapter. The meta-data adapter can declare additional meta-data fields (see its class documentation) which will then be automatically persisted by the ArticleDAO (or any other DAO) and saved alongside other data in the settings table of the application object.


Store Metadata to the Database

Rather than storing your meta-data via application objects as in the previous example you can also use the MetadataDescriptionDAO to store meta-data records directly to the database.

The meta-data description DAO works like any other DAO. It persists, retrieves and updates MetadataDescription objects. Please have a look at its interface. There's nothing special about this DAO.
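
A short usage sketch - the method names here are modeled on other PKP DAOs and may differ in the actual interface:

// Persist a description directly, independent of any application object.
$metadataDescriptionDao =& DAORegistry::getDAO('MetadataDescriptionDAO');
$descriptionId = $metadataDescriptionDao->insertObject($marcDescription);

// ...and retrieve it again later.
$storedDescription =& $metadataDescriptionDao->getObjectById($descriptionId);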

Serializing Meta-data

Our sample requirements also contain a filter that allows us to serialize MARC data to MARCXML. Say we want to export an article to MARCXML to be sent to some library.

First we'll have to implement a filter that takes a MARC meta-data description object and transforms it to XML. We can do this very similarly to how we imported XML if we defined our meta-data property names well.

There are examples of XML serialization filters for Dublin Core, NLM 3.0 and NLM 2.3 in the code base which all use this generic approach.

In a nutshell, they use a Smarty template which takes a (pseudo-)XPath expression and transforms it into an XML tag with the meta-data value as content. This is repeated for all statements in the meta-data description, including nested composite statements, which will be serialized by recursively calling the serialization filter on the sub-description, thereby building an arbitrarily deeply nested XML hierarchy.

The filter would again have to be configured and installed via some filterConfig.xml as explained above. As soon as you've done this you can write code like this:

// Extract meta-data from an article object. This is done via the
// meta-data adapter we explained already.
$marcDescription = $article->extractMetadata(new MarcSchema());

// Retrieve the MARCXML serialization filter.
$marcxmlExportFilters = $filterDao->getObjectsByGroup('marc=>marc-xml');

// We assume that there is only one such filter. More than one filter wouldn't make
// a lot of sense in this case.
assert(count($marcxmlExportFilters) == 1);

// Serialize the MARC meta-data to MARCXML.
$serializedMarc =& $marcxmlExportFilters[0]->execute($marcDescription);

The $serializedMarc variable now contains XML that you can write to disk or deliver via OAI, SOAP or whatever other web-based protocol.

Crosswalks

There are several types of crosswalks that can be implemented with the meta-data framework. Some crosswalks take one MetadataDescription object and transform it into another one that conforms to a different schema. This is how we do the crosswalk between NLM and OpenURL, for example, which can be found in the NLM meta-data plug-in. Such crosswalks can extend the CrosswalkFilter class, which provides some basic infrastructure.

Other crosswalks, the ones that interest us here, re-use existing open source code and make it usable within our code base.

We assume for our example that we've found an XSL stylesheet that takes serialized MARCXML and crosswalks it to Dublin Core XML. We have saved this file and all its dependencies somewhere on the file system or we have it accessible somewhere on the web.

With the meta-data framework we don't have to write a single line of code to configure and persist a filter that can do the crosswalk based on this XSL.

The only thing that we have to do is to configure the generic XSL transformation filter that the meta-data framework provides in a filterConfig.xml and install it via a plug-in.

Here is how the configuration would look:

<filter
  inGroup="marc-xml=>dc-xml"
  class="lib.pkp.classes.xslt.XSLTransformationFilter"
  isTemplate="0">
  <setting type="string"><name>displayName</name><value>MARCXML to Dublin Core Crosswalk</value></setting>
  <setting type="const"><name>xslType</name><value>XSL_TRANSFORMER_DOCTYPE_FILE</value></setting>
  <setting type="string"><name>xsl</name><value>lib/pkp/plugins/metadata/marc/filter/marcxml-dc.xsl</value></setting>
</filter>

This configuration also shows how to configure settings. The filter does not require any user settings so we don't have to define it as a template. See the filter documentation above.

Put this in the filter/filterConfig.xml of a plug-in, install the plug-in, and you can immediately write the following code without having to do anything else:

// Retrieve all MARCXML crosswalk filters from the database.
$marcxmlCrosswalkFilters = $filterDao->getObjectsByTypeDescription('xml::schema(marcxml.xsd)', 'xml::%');

// Code that lets the user choose among one of these, e.g. via drop-down.
...

// We assume that the user has chosen the above "MARCXML to Dublin Core Crosswalk"
// filter which has been in the drop-down under its display name. We further assume
// that the selected filter now is in $selectedMarcCrosswalk. 

// Do the actual crosswalk.
$crosswalkedXml =& $selectedMarcCrosswalk->execute($serializedMarc);

The $crosswalkedXml variable now contains standards-compliant Dublin Core meta-data! All this has been achieved in about 15 lines of code and can now be re-used across all applications, assuming that the plug-in lives in lib/pkp/plugins, which is possible with the new shared plug-in infrastructure available in all applications.

Please note that this time we used a different filter discovery mechanism than before. We now used type-based filter discovery, which allows you to discover filters based on the input/output types they process, even if these filters come from different filter groups. The type definition can contain SQL wildcards (i.e. '%' and '?') to match several distinct type descriptions. You can even pass in the input object so that only filters will be returned that can process that exact object.

Composite Filters and Templates

Let's go to the next filter from our sample list: The tokenizer filter for indexing/searching. We do not yet have an implementation of a meta-data tokenizer filter in our code base. As soon as we provide better Lucene support we'll get one.

If we work with a tool like Lucene that already provides several tokenizers that take a special input format then we only have to cross-walk our meta-data to that format and create a wrapper filter around Lucene that can tokenize it from there.

Imagine that we are doing just that. This means that we'll have to pull those two filters together into a single composite filter using the GenericSequencerFilter for it. I'll sketch the filter configuration required for that:

<filter
  inGroup="..."
  class="GenericSequencerFilter"
  isTemplate="1">
  <setting type="string"><name>displayName</name><value>MARC Lucene Tokenizer</value></setting>

  <filter
    inGroup="..."
    class="MarcLuceneCrosswalkFilter">
    <setting type="int"><name>seq</name><value>1</value></setting>
    <setting type="string">...</setting>
  </filter>

  <filter
    inGroup="..."
    class="LuceneTokenizerFilter">
    <setting type="int"><name>seq</name><value>2</value></setting>
    <setting type="string">...</setting>
  </filter>

  <setting type="object">
    <name>settingsMapping</name>
    <value><array>
      <element key="solrUrl"><array>
        <element>seq1_solrUrl</element>
        <element>seq2_solrUrl</element>
      </array></element>
      <element key="..."><array>
        ...
      </array></element>
    </array></value>
  </setting>
</filter>

This filter configuration shows several concepts that we already named but did not yet exemplify.

You can see that this is a nested filter. It contains a main filter configuration which relies on the GenericSequencerFilter which we already mentioned and then configurations of the nested filters. The filters have a "seq" setting which is used by the sequencer filter to execute them in the right order. The output of the filter with sequence number one will be the input of the filter with sequence number two and so on.

Then you can see that this time the filter is configured as a template. This is necessary because we assume that the end user has to configure the base URL of a Solr installation for the filter to work properly.

This means that a pre-configured instance of this filter is stored in the database in site context (context "0") with the "solrUrl" setting still empty. The user will have to go to some configuration page and configure the Solr URL prior to being able to use this filter. Once the user has configured the filter, a fully-configured non-template copy of it will be saved in the user's journal, press or conference context.

Finally the configuration tells the filter framework that both nested sub-filters share a common setting - the solrUrl setting. This is what the settingsMapping is for in the code. If you did not define such a mapping the filter framework would present the setting twice to the end user because it needs to be configured for both sub-filters.

A good example for that kind of filter is in step 3 of the journal set-up where we configure the filters for the citation assistant. You can see grids there which let you choose from all installed filter templates. When you choose one of the templates you are presented with the missing configuration parameters. Once you fill them in and save the filter, it will be stored as a non-template filter and appear in the filter grid. The FilterDAO provides the $isTemplate variable in several places to distinguish between template and non-template filters.

If you put the above configuration in a filterConfig.xml of a plug-in and install the plug-in then you can discover and use the filter in the exact same way as any non-composite filter we have shown in the previous examples. Your code does not have to be aware of the fact that it is dealing with a composite filter. Composite and non-composite filters will be returned in the same result set so that they can be handled in the exact same way.

Several real-world examples of composite filters already exist. Please have a look at the filterConfig.xml of the NLM 3.0 plug-in. You'll find a composite filter there.

You can also construct composite filters on the fly. A good real-world example of this is in the citation assistant. The citation assistant user interface allows users to choose from several citation parsers and citation database connectors with checkboxes. The selected filters will then be combined into a GenericMultiplexerFilter at runtime, which in turn will be inserted into a GenericSequencerFilter for de-multiplexing and transformation into a Citation object. This code can be seen in the CitationDAO's _filterCitation() method and looks something like this (extract):

// Instantiate the citation multiplexer filter.
import('lib.pkp.classes.filter.GenericMultiplexerFilter');
$citationMultiplexer = new GenericMultiplexerFilter($filterGroup, $transformationDefinition['displayName']);

// Add sub-filters to the multiplexer.
foreach($filterList as $citationFilter) {
  if ($citationFilter->supports($muxInputData, $nullVar)) {
    $citationMultiplexer->addFilter($citationFilter);
    unset($citationFilter);
  }
}

// Instantiate the citation de-multiplexer filter.
import('lib.pkp.plugins.metadata.nlm30.filter.Nlm30CitationDemultiplexerFilter');
$citationDemultiplexer = new Nlm30CitationDemultiplexerFilter();

// Combine multiplexer and de-multiplexer to form the
// final citation filter network.
import('lib.pkp.classes.filter.GenericSequencerFilter');
$citationFilterNet = new GenericSequencerFilter(
  PersistableFilter::tempGroup(
    $filterGroup->getInputType(),
    'class::lib.pkp.classes.citation.Citation'),
  'Citation Filter Network');
$citationFilterNet->addFilter($citationMultiplexer);
$citationFilterNet->addFilter($citationDemultiplexer);

// Send the input through the citation filter network.
$filteredCitation =& $citationFilterNet->execute($muxInputData);

You can see that nothing in here points to any specific filter implementation, which is why it is so easy to extend the citation assistant with further citation filters. You just have to write the connector as a filter and configure it in a plug-in with the correct filter group. Then it will be discovered automatically by the citation assistant.

Define Runtime Requirements

The final example we'll look at takes a MARC MetadataDescription and serializes it to RDF using Turtle. We won't go into much detail for this example as the general steps to implement such a requirement should be quite obvious by now. The only thing I want to demonstrate here is how filters can define runtime requirements so that they won't be executed without all their requirements in place.

Runtime requirements are defined as normal settings in the filterConfig.xml and will be translated internally into a RuntimeRequirements object. You can also instantiate the RuntimeRequirements object yourself and pass it into a filter programmatically.
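
Programmatically this might look like the following sketch; the setter names are assumptions derived from the setting names listed below, so check the RuntimeRequirements class for the actual interface:

// Hypothetical sketch: require PHP 5.1+ and an external program.
$runtimeRequirements = new RuntimeRequirements();
$runtimeRequirements->setPhpVersionMin('5.1.0');
$runtimeRequirements->setExternalPrograms(array('/usr/bin/marc2rdf'));

// Attach the requirements to a manually instantiated filter.
$marcRdfFilter->setRuntimeRequirements($runtimeRequirements);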

Currently the following runtime requirements can be configured for a filter:

  • min required PHP version (setting name: phpVersionMin, the format is x.x.x, e.g. 5.1.0)
  • max allowed PHP version (setting name: phpVersionMax)
  • required PHP extensions (setting name: phpExtensions, this is an array - see the array example in the above composite filter configuration)
  • required external programs (setting name: externalPrograms, again an array)

Additional parameters can easily be added if required.

The filter requirements are enforced in two places:

  1. The FilterDAO will not return any filters that don't work in the current runtime environment even if they are in the database.
  2. The Filter will check the runtime requirements before passing control to the process() method and fail if not all requirements are met.

Let's assume that in our example we'll use an external tool to transform MARCXML data into RDF data. Let's say the program is called marc2rdf. In that case you'd delegate to this program in your filter's process() method and configure your filter with the following additional setting:

...
<setting type="object">
  <name>externalPrograms</name>
  <value><array>
    <element key="0">/usr/bin/marc2rdf</element>
  </array></value>
</setting>
...

That's all that is required. Now if the given external program is not present, the filter will not be returned from the FilterDAO, and if you instantiate it manually it will throw a runtime requirement error when you try to call its execute() method.

Additional Information

It is helpful to understand the DCMI abstract model well before you try to work with the meta-data framework. You should also look at the class documentation of all the classes I mentioned in my text. Usually they contain relatively detailed documentation in the file header.

Finally: Please don't hesitate to contact me directly whenever you have a question.