OJSdeSearchConcept

Overview

The following two sub-sections provide an overview of the project background and the structure of this document. Use them as a guide to the document as a whole, or for quick access to individual sub-sections if you prefer to skim specific parts first.

Project Background

The Center for Digital Systems (CeDiS) of Free University Berlin (FUB) is currently implementing “OJS.de” – a project funded by Deutsche Forschungsgemeinschaft (DFG), Germany's central research funding organization. The project has been set up to adapt OJS more closely to the needs of OJS users in Germany and other German-speaking countries while – wherever possible – creating value for the larger OJS community, too. One of the project tasks is to implement an “optimized search function”. Three main goals should be achieved:

  1. The current OJS search function experiences problems in dealing with multilingual content. The optimized search function should be able to deal with documents in all supported OJS languages.
  2. OJS search should – where possible – benefit from additional search features provided by Lucene/solr.
  3. The current search function does not scale well. The optimized search function should work even for large OJS providers who want to provide a central search server including journals from several separate OJS installations.

Currently OJS provides custom search across article meta-data and full texts. A simple algorithm is used to process text for search: text is first split up at white space (“tokenization”), then common words (“stopwords”) are removed from the token stream. All remaining tokens are stored in a relational database table. This yields satisfactory search results for languages like German or English. It does not work for languages that use logographic writing systems such as Japanese or Chinese. Such languages should be made searchable from OJS. In addition, the new search platform should support advanced search features, e.g. improved ranking, faceted search or searching across several OJS installations.

From an end user, administrator and OJS provider perspective, the new Lucene/solr solution must implement at least all currently available OJS search functionality. Several deployment scenarios should be supported (ranging from single journal installations to large OJS provider deployments). Depending on the deployment scenario, design principles like simplicity and flexibility have to be reconciled. Both the deployment scenarios and the corresponding design principles will be detailed below.

The new search function should be implemented based on the enterprise search server “solr” and the underlying “Lucene” search framework. As solr and Lucene provide complex and flexible configuration support, a detailed specification is required that defines the exact search functionality and user interface, the architecture of the search platform and the integration of OJS with solr/Lucene. Potentially conflicting project goals have to be reconciled separately for each deployment scenario. This document specifies the exact scope, requirements, user interface, design principles and technical architecture of the new search function.
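
To illustrate the current approach, the following is a minimal sketch of the preprocessing described above (whitespace tokenization followed by stopword removal); the function and variable names are invented for illustration and do not correspond to the actual OJS code:

  <?php
  // Minimal sketch of the preprocessing described above: whitespace tokenization
  // followed by stopword removal. Names are illustrative, not the actual OJS code.
  function tokenizeForIndex($text, array $stopwords) {
      $tokens = preg_split('/\s+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
      return array_values(array_diff($tokens, $stopwords));
  }

  $stopwords = array('the', 'a', 'and', 'of');
  $tokens = tokenizeForIndex('The analysis of multilingual content', $stopwords);
  // $tokens is now array('analysis', 'multilingual', 'content') and would be
  // stored in a relational keyword table by the current implementation.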

Structure of this Document

This document is divided into several conceptually distinct parts: First we enumerate all project requirements that guide the scope of this specification as well as our architectural choices and recommendations. Based on these requirements we specify several distinct aspects of the system:

  • the end-user visible OJS interface (administration, configuration and search)
  • the indexing back-end (submission and preprocessing of meta-data and full texts, extraction, transformation, analysis and storage of term data; index configuration and maintenance)
  • the search query back-end (submission, transformation, parsing and analysis of the query; ranking, paging and highlighting of the result list; advanced browsing features like faceting, alternative spelling suggestions and “more-like-this” search proposals)
  • the network protocol for communications between the OJS installation and the search engine for all indexing, querying and configuration needs
  • the recommended deployment options for different usage scenarios (from single journal installations to large OJS provider deployments)

For each of these system parts and for all possible usage scenarios we'll identify the necessary changes to OJS and the appropriate configuration of the solr/Lucene back-end. Once the design and implementation options have been identified and recommendations made for all aspects of the system, we'll be able to compile a complete feature implementation/recommendation matrix. This will help project owners to select and prioritize specific, well-defined features for implementation and will serve as an initial guideline for subsequent implementation phases of the project.

Project Requirements

The requirements for this project can be divided into the following areas:

  • Deployment scenarios: the proposed search architecture should scale from small single-journal deployments up to large OJS providers that potentially host hundreds of OJS installations.
  • Design principles: depending on the deployment scenario, principles like simplicity and flexibility have to be reconciled.
  • Compatibility: a specific list of search features has to be implemented to guarantee compatibility with the existing OJS search function.
  • Advanced design problems: Solr provides a large number of configuration options. Alternative options should be identified and implementation choices defined for each feature or deployment scenario.

The following sub-sections will detail these requirements to provide a guideline for the implementation recommendations to be made in this document.

Deployment Scenarios

OJS is being used in very different organizational contexts that range from individual scientists publishing their own OJS journal on shared hosting servers all the way up to specialized OJS providers hosting hundreds of distinct OJS installations for a large number of 3rd-party publishers.

The new search function should work in all these scenarios and must therefore support:

  • S1: search across articles of a single journal
  • S2: search across multiple journals of a single OJS installation
  • S3: search across various OJS installations within the provider's network
  • S4: search across various applications (including one or more OJS installations) within the provider's network

We will refer to these scenarios by their abbreviations (S1-S4) below where necessary.

Design Principles

Depending on the deployment scenario several conflicting design principles have to be reconciled:

  • Simplicity and transparency: One of the crucial strengths of OJS is its simplicity and the low entry barrier for installation, configuration and use. This advantage should be maintained as far as possible.
  • Functional robustness and flexibility: The currently existing search function should remain available as an option. It should be possible to configure whether to use the current or the newly developed solr search for an OJS installation.
  • Compatibility with PKP OJS development: The solr integration should be compatible with the PKP development code line.

The relative importance of the design principles will depend on the deployment scenario and its specific requirements. As the design principles may conflict with each other, it is possible that a compromise has to be made for each deployment scenario.

Compatibility with the Current OJS Search

The current OJS search function implements the following features that must be supported by the newly developed system, too:

Basic indexing/search features:

  • search across all journals of an OJS installation
  • search for author, title, abstract, keywords and full text.
  • search for publication type, coverage, supplementary files and publication date (advanced search only)
  • Full text documents can be indexed if they are in HTML, PDF, PS or Microsoft Word format.

Search syntax:

  • no distinction between lower and upper case
  • ignore common words of low relevance (“stopwords”)
  • list all documents by default that contain all search terms (implicit “AND”)
  • select documents that contain one of the given terms (“OR”)
  • only select documents that do not contain a given term (“NOT”)
  • implement advanced search syntax, e.g. archive ((journal OR proceeding) NOT dissertation)
  • search for exact word phrases
  • support wildcard searches (“*”)

Miscellaneous features:

  • Search results contain the corresponding articles and can be paged.
  • The index can be re-built from within OJS.

Advanced Design Problems

The following additional design problems have to be explicitly addressed in this document.

Architecture:

  • How to implement multi-client capabilities for the configuration of solr, the communication interfaces and data?
  • Which users (role) will install and configure solr – in other words: will the configuration be done on journal, publisher or overall system level?
  • Scalability: When should features like “distributed search” or “replication” be implemented?
  • How should solr be deployed?
  • Establish complete configuration recommendations.
  • Which platforms will be supported (e.g. OS, servlet container)?
  • How will the search server be integrated with the OJS PHP environment?
  • Can solr be integrated as a plug-in? What are (dis-)advantages of such a deployment option?
  • Which disadvantages or problems may be expected when integrating OJS search with an organization-wide search (S4)?

Indexation:

  • Which field types and schema should be defined?
  • Which tokenizers and filters should be used?
  • How many indexes/cores are required?
  • To what architectural level will these indexes correspond (e.g. per journal, per installation)?
  • When and how will documents be indexed (addition, update, optimization and deletion)?
  • How can the index be re-built?
  • How will data be sent to solr?
  • Will documents be parsed with a native solr extension or is an external program required?
  • Which further file formats will be supported?
  • Will meta-data be extracted from documents?
  • Will all manifestations of an article be supported (e.g. when both HTML and PDF versions are available)?

Search:

  • Which search syntax will be supported? (Ideally the search syntax should be identical to the currently existing OJS search syntax.)
  • How can auto-suggestions be implemented?
  • How could the ranking be implemented (spanning several indexes where necessary)?
  • Which after-search options (e.g. sorting) will be available?
  • How could faceting be realized?

High-Level Feature Summary

The following table shows an overview of important requirements and deployment scenarios. It is meant as a high-level summary, not as an exhaustive list.

Requirements x Deployment Scenario

  • Document types: article meta-data, galleys and supp. files (S1-S3); article meta-data, galleys and supp. files plus arbitrary additional documents (S4)
  • Document source: a single journal (S1); several journals of a single installation (S2); journals across (groups of) installations (S3); several installations plus arbitrary external applications (S4)
  • Search fields (simple search): author*, title, abstract, galley content, keyword search (discipline, subject, type**, coverage***)
  • Search fields (advanced search): author*, title, discipline, subject, type**, coverage***, galley content, supp. file content, publication date
  • Supported document languages: English, German, Spanish, Chinese, Japanese, plus a reasonable fallback for all other languages and foreign-language citations/mix-ins
  • Document formats: plain text, HTML, PDF, PS, Microsoft Word
  • Basic search syntax: AND (default), OR, NOT, nesting of queries, wildcards, phrase search
  • Result presentation: paged results returned on article level, ranked by term frequency (“TF-IDF”)
  • Optional (advanced) features: auto-suggestion, faceting, alternative ranking criteria, highlighting, search proposals (e.g. alternative spelling or “more-like-this”)

* contains first name, middle name, last name, affiliation, biography

** usually contains a research approach or method for the article

*** consists of the article's geo coverage, chronological coverage and "sample" coverage

Test Data and Sample Queries

Requirements are covered and exemplified by a number of sample queries. These are an integral part of the requirements specification of this project. All sample queries are executed against a mixed-language, mixed-discipline corpus of OJS test journals and articles. Both sample queries and sample data are provided by FUB and its partners.

Test data were taken from live OJS journals. Wherever possible, complete copies of journals were made. Where full copies were not available, partial content (selected journal issues and/or articles) was imported into an OJS test database for indexing and querying.

The following process was applied to collect sample queries:

  • An online form simulates the OJS search form (simplified and advanced).
  • Test users (editors and readers) of various OJS journals were asked to provide realistic test queries.
  • Submitted test queries were executed against the test corpus.
  • Result sets were returned to test users for review.
  • Search results (precision, recall, ranking) were tuned according to user feedback.

OJS User Interface

The following sub-sections propose changes to the OJS user interface for potential additional search features. We will only describe new or changed features. A full description of the existing search interface is not in scope for this document.

Core Code Changes vs. Integration as a Plug-In

The question whether solr/Lucene should be integrated as a plug-in or whether changes should be made directly to OJS core code is not only one of the user interface. But it certainly makes a difference whether solr-related configuration options are separated out into a plug-in or whether they appear on core set-up pages like administrative settings or journal set-up. The main question is whether solr/Lucene will be used by a majority of users or not. If only a minority of users opts to use solr/Lucene then it would not be appropriate to “pollute” the core configuration pages with those options. If, however, many users are interested in using solr/Lucene, then hiding the feature away in a plug-in would make it unnecessarily difficult for those users to find the options they are looking for.

It is difficult to decide this without asking a representative number of users. From prior experience with similar changes, though, it seems reasonable to assume that the majority of users will continue to use the existing search interface. This is above all because solr/Lucene requires Java to be present in the hosting environment (see installation requirements below), which is often not the case. It also requires a certain amount of additional installation and configuration. And it requires a servlet container like jetty to be up and running all the time. While we'll reduce the installation and maintenance overhead to a minimum, the solr/Lucene search back-end will still be “heavier” and more difficult to deploy than the default search implementation.

There are important user groups, though, who definitely require an improved solr search back-end. These are OJS service providers and publishers of journals that contain content in non-Western languages which are not supported by the current search back-end. While the former are advanced users who will be well acquainted with OJS plug-ins, the latter may not be. If solr/Lucene is moved to a plug-in, this should be well advertised to the second user group, e.g. through OJS forums which rank well in Google.

Other, more technical arguments are in favor of factoring the solr/Lucene integration into a plug-in: It will be easier for the PKP core development team to review and maintain the new code if it is concentrated in a single place. And it will be easier to port the code to other PKP applications like OCS or OMP because the interface points of a plug-in with the core code are relatively easy to identify. Providing solr integration as a plug-in will also make it easier to include block plug-ins as needed for use cases like faceting. A disadvantage of implementing the integration as a (generic) plug-in is that we'll probably have to introduce a considerable number of additional plug-in hooks into OJS core code with little future potential for re-use. This can however be mitigated by factoring the search plug-in into its own plug-in category later, if PKP wishes to do so.

With these arguments in mind and after consulting the PKP core development team, we recommend integrating solr/Lucene and jetty as a generic OJS plug-in.

Search Interface

In accordance with the design principles defined above, the OJS search interface should be changed as little as possible. This means that existing features are to be maintained unchanged, no matter what search back-end will be used. The following sections only describe changes required by additional search features that are not part of the current search solution and may be optionally provided by the new solr search function. As before, all search features are open to the public. By using forced return field configurations for our search interface (see “Querying” below) we'll make sure that full texts of subscription-based journals cannot leak. Subscription-based journals may not want to enable highlighting, though (see “Result List” below).

Search Syntax

The search syntax of the solr-driven search will be a super-set of the syntax currently provided by OJS. This means that all queries that work in the current OJS search will be supported in the same way by the solr back-end.

Additionally, any search query understood by solr's “edismax” query parser (see the corresponding solr documentation) will be supported. Some advanced search options that are only available when searching via the Lucene back-end are:

  • a question mark as a wildcard allows matching a single letter
  • phrase query with term proximity (e.g. “some phrase”~3 finds documents containing “some” and “phrase” with not more than three words in between)
  • additional ranking parameters like term boost and field boost
  • fuzzy queries (e.g. research~ would match words similar to “research”)

These details will be completely transparent to most end users while still giving advanced users the full query power of solr/Lucene directly from OJS should they wish to use it.
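
As an illustration only, the following sketch shows how such a query could be handed to solr's edismax parser over HTTP from PHP. The endpoint corresponds to the preconfigured “/solr/search” handler of the embedded deployment described below; the port, field names and response handling are assumptions, not the final implementation:

  <?php
  // Sketch: forward an (extended-syntax) user query to solr's edismax parser.
  // Endpoint, port and field names are assumptions based on this specification.
  $params = array(
      'defType' => 'edismax',                      // use the edismax query parser
      'q' => '+research~ "algebraic topology"~3',  // raw user query (super-set of the OJS syntax)
      'qf' => 'title_en_US abstract_en_US galleyFullText_pdf_en_US', // queried fields
      'fl' => 'article_id score',                  // return only the document ID and score
      'start' => 0,                                // paging: offset
      'rows' => 25,                                // paging: page size
      'wt' => 'json'                               // response format
  );
  $url = 'http://127.0.0.1:8983/solr/search?' . http_build_query($params);
  $response = json_decode(file_get_contents($url), true);
  foreach ($response['response']['docs'] as $doc) {
      echo $doc['article_id'] . "\n"; // article meta-data is then retrieved locally in OJS
  }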

Auto-Suggest

When typing a query into a search box (simple or advanced), potential search terms starting with the same letters as the last entered search term will be proposed. The proposed search terms will be taken from all terms indexed for the search field the user is typing in (“query term completion”).
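
One way to realize this (a sketch only; the handler path and the spelling field name are assumptions, see “Data Model” below) is solr's terms component, which can return all indexed terms starting with a given prefix:

  <?php
  // Sketch: ask solr's terms component for completions of the typed prefix.
  // Handler path and field name are assumptions.
  $params = array(
      'terms' => 'true',
      'terms.fl' => 'title_spell',   // minimally analyzed spelling/suggest field
      'terms.prefix' => 'epistem',   // what the user has typed so far
      'terms.limit' => 10,           // number of suggestions to offer
      'wt' => 'json'
  );
  $url = 'http://127.0.0.1:8983/solr/terms?' . http_build_query($params);
  $suggestions = json_decode(file_get_contents($url), true);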

Alternative Spelling Proposals

After executing a search, OJS may propose alternative spellings of the same search query. These alternative search proposals – if they exist – may be offered as hyperlinks above or below the result list. Clicking on one of these hyperlinks will immediately execute the alternative search and return the corresponding result set.

Result List

Results will be presented and paged in the same way as for the existing OJS search as long as no additional search features are being activated.

Ranking is according to the default Lucene TF-IDF ranking method, see the “Ranking” chapter below for details. We may optionally provide alternative ranking metrics, see “Custom Document Ranking” below.

As Lucene is not very good at retrieving documents far down the result list, we'll restrict the result list to 1000 documents independently of the actual size of the result set. This keeps users and web crawlers from executing overly expensive query operations.

We may implement result “highlighting”. If enabled (see “Configuring Search Features” below) then an extract from the full text may be provided containing highlighted search keywords from the query. This helps end users to better judge the relevance of search results.

We may implement “instant search” functionality. This means that searches are being executed in the background while the user is still entering a query. A few top results could be immediately displayed – without the user having to hit the submit button – using AJAX requests and dynamic HTML. Implementing “instant search” would require us to place the search query field(s) and the result set on the same page (see Google's instant search feature for an example). This would be a considerable deviation from the current “two-page” OJS search interface and would require us to adapt the search interface for the default OJS search, too. It doesn't mean that the default OJS search needs to be implemented with instant search but it would have to be implemented as a “one page” search solution, too.

Finally we may want to implement a "More like this" hyperlink or button beside every article in the result list. Clicking on this UI element will yield documents that contain "interesting terms" similar to those of the chosen document. See the solr documentation for a definition of what is considered an "interesting term".

Result Manipulation and Refinement

Currently the ordering of search results cannot be manipulated. Order is by “relevance” according to the default ranking method.

We may provide an optional configuration option to enable alternative ordering criteria, e.g. alphabetically by author or title or by publication date. When enabled as an optional search feature, such ordering criteria could appear as a drop-down at the top of the search result list.

We may also propose a dynamic list of filter criteria (e.g. authors, publication date ranges, disciplines, type, subject and coverage keywords) to further refine the result list. This is called "faceting". Facets could be provided as a list of links organized by facet category (aka search field) in an optional block plug-in. This allows OJS administrators to easily enable/disable facets and flexibly place them according to their journal design.

Clicking on one of the facet links will re-execute the original search with an additional filter as defined by the clicked facet. Once a search has been re-executed with a given facet, the facet will be displayed above the result list as a regular filter. Like all filters, it shows a “delete symbol” next to it. Clicking on the delete symbol will re-execute the search without the facet filter.

Facets will only be shown in the currently selected UI language. This avoids the same search term being displayed in many different languages, which would be inappropriate for keyword fields or journal titles. We show a maximum of 15 facets per category. Facet categories will use the "extras on demand" pattern, i.e. facets belonging to a category will be hidden by default and only appear after clicking the category name.

Only facets that improve the selectivity of the search will be displayed, i.e. facet filters that would return the same number of results as the currently displayed result set will not be shown. Likewise, only facets that return at least one document will be shown.
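
Translated into solr request parameters, this corresponds roughly to the following sketch (field names are assumptions, see “Data Model” below; one facet.field parameter would be sent per facet category):

  <?php
  // Sketch: facet parameters matching the behaviour described above.
  $facetParams = array(
      'facet' => 'true',
      'facet.field' => 'discipline_en_US_facet', // one parameter per facet category
      'facet.limit' => 15,                       // at most 15 facets per category
      'facet.mincount' => 1,                     // only facets that match at least one document
      'facet.sort' => 'count'                    // facets with the highest document counts first
  );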

While displaying a facet filtered search, facet categories corresponding to active filters will disappear from the list of available facets. Multiple facet filters can be applied by clicking another facet link while displaying an already filtered search.

If we want to support selection of multiple facets from one category then we could place check-boxes beside the facets rather than implementing them as links. In this case we need a “Search Again” button below the facet list so that all selected facet filters can be applied. For the sake of interface simplicity we do not recommend enabling selection of multiple facets from one category. Users can enable and delete facets from the same category one after the other to achieve similar results. Advanced users can use properly filtered queries directly from the search field.

Specification of a common search interface

To achieve our goal of keeping the existing and the new search interfaces as much in sync as possible, we have to create a layer of abstraction that allows us to manipulate certain elements of the search UI while maintaining a common structure for both implementations.

The following paragraphs will outline common interface elements for search query definition and result set display. This will allow for maximum consistency of the user experience across search implementations. As a side effect we reduce code duplication and maintenance cost.

The specification has to define the areas of:

  • search query definition (search filters) for simple and advanced search
  • presentation of search filters, facets and alternative spelling suggestions on the search results page
  • the search results list (article data, highlighting, similarity search, enabling instant results)

To specify a search query, the user has to be able to influence the following parameters:

  • query terms per field (search filters)
  • desired result set ordering criteria and ordering direction
  • paging parameters
  • possibly a document for a similarity search

To achieve these goals we propose several changes to the current user interface.


Combine search query definition and result display in one interface

We propose to remove the distinction between the advanced search page and the result set page. This is necessary to enable instant search and to integrate advanced search and faceting. It will also remove the distinction between simple and advanced search and make search refinement easier and more consistent.


Overall page layout

We propose the following order of search interface elements:

  • the main search box
  • directly below the main search box an alternative spelling suggestion, preceded by the words "did you mean..." (if active and if one was returned by the search engine)
  • below that the list of currently active advanced filters
  • then a toggle button "advanced search options" to display additional search categories
  • empty advanced search fields (hidden by default)
  • the content of the pre-results hook (if any); this includes the drop-down selectors for ordering criteria and ordering direction
  • the result list if a search was already executed, produced dynamically in case of instant search whenever filters change
  • paging parameters
  • search instructions

This interface remains the same, completely independent of the way in which the search query was specified (simple or advanced search, faceting, similarity search, initial or refined search query, query proposed by the spelling correction module, etc.).


Advanced search

We propose to hide advanced search fields by default and let the user access them as "extras on demand" with a toggle button. For better accessibility, advanced search fields will be visible by default in case JavaScript is disabled. The main search box (simple search) will be visible at all times.


Search form fields

To enable auto-suggest and reduce code duplication we'll create a separate template for search form fields that produces either an auto-suggest enabled field or a normal field, depending on whether auto-suggest is active or not. The template takes the necessary input parameters to be able to produce both the active filters and empty advanced search fields.


Active filters

Currently, active advanced search filters are not displayed on the results page, which can be quite confusing. We therefore propose to display all currently active search filters.

Active search filters originate from several sources: advanced search, simple search, faceted search and similarity search. They should be presented in a common visual format independent of their source:

  • The filter on "all fields" will always be displayed in the main search box on the results page.
  • Advanced search terms (journal, authors, title, abstract, full text, supplementary files, date, discipline, keywords/subject, type, coverage and index terms) will be displayed in their respective form fields. Non-empty form fields will be visible by default. A delete button ("Delete this filter") for each form field will make it easy to remove a filter.
  • In case of an "index terms" search we display the search terms in a specific search field that comprises all of discipline, keywords/subject, type and coverage fields. This is necessary to achieve the required "OR" disjunction for the SQL-based search.
  • In case of a similarity search, we show the terms that have actually been used. These will be displayed as usual per-field filters.
  • In case of a faceted search, facet filters will be translated into advanced filters and displayed as such.

To achieve uniform treatment of facet filters and advanced filters we have to either treat both as filter queries (i.e. no influence on ranking, better caching) or as part of the main query (influence on ranking). We agreed that performance bottlenecks are improbable in our case due to the usually quite limited number of documents in an index. Simplicity of the UI therefore seems more important. We also believe that our data is mostly high-cardinality and therefore caching will not play an important role anyway. We therefore decided that low-cardinality search fields (journal, publication date, installation id) will be consistently implemented as filter queries (no influence on ranking) while all other fields will be included in the main query when filtered (and will therefore influence ranking) no matter whether they were filtered through faceting or advanced search input.
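
As an illustration of this decision (a sketch under the assumptions of the “Data Model” section below; all values are invented), low-cardinality filters would be sent as solr filter queries (fq) while all other filtered fields remain part of the main query (q):

  <?php
  // Sketch: low-cardinality filters as fq (cacheable, no influence on ranking),
  // all other filtered fields folded into the main query (influence ranking).
  $main = 'discipline_en_US:"molecular biology" AND abstract_en_US:(gene expression)';
  $filters = array(
      'journal_id:2',                                   // journal: filter query
      'inst_id:a1b2c3',                                 // installation ID: filter query
      'publicationDate_dt:[2010-01-01T00:00:00Z TO *]'  // publication date: filter query
  );
  $query = 'q=' . rawurlencode($main) . '&fl=article_id&wt=json';
  foreach ($filters as $fq) {
      $query .= '&fq=' . rawurlencode($fq); // solr accepts repeated fq parameters
  }
  $url = 'http://127.0.0.1:8983/solr/search?' . $query;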

Result list

The result list will no longer be produced from a list of OJS objects but from an indexed array containing only the visible data.

  • When searching across multiple installations, we can no longer retrieve data from our database. We'll get content directly from the search server.
  • Some of the new display elements, e.g. highlighting and "more like this", are independent of OJS objects.

Whether or not a certain element will be displayed in the result set depends simply on its presence in the indexed array. This is an easy way to keep application logic in the controller and simplify the view template.
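
A sketch of the kind of indexed array the result template could iterate over (keys and values are invented for illustration; the presence of a key decides whether the corresponding element is shown):

  <?php
  // Sketch: result list data decoupled from OJS objects.
  // Optional elements (highlighting, "more like this") are simply present or absent.
  $results = array(
      array(
          'title' => 'On the Indexing of Multilingual Journals',
          'authors' => array('Ana Author', 'Ben Beispiel'),
          'journalTitle' => 'Journal of Library Technology',
          'highlight' => '... supports <em>multilingual</em> metadata and galleys ...',
          'moreLikeThisUrl' => 'https://example.com/index.php/jlt/search/similar/15'
      )
      // ... further results of the current page
  );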

Administration and Configuration Interface

All solr, Lucene and servlet container (e.g. jetty) related configurations in OJS will appear on a plug-in settings page as is the case for other generic OJS plug-ins.

Installation- vs. Journal-Level Configuration

Most of the aforementioned search features could potentially be dis-/enabled and configured on journal level, e.g. highlighting, faceting, additional ordering criteria, “more-like-this”, etc. Other configuration options make more sense on installation level, e.g. the configuration of the network endpoint of the solr server (see “Configuring the Deployment Scenario” below). The problem with journal-specific configuration is that OJS has an installation-wide search option on its central home page. This means that each of the journal-level options would have to be repeated on installation level, too. This is comparable to OJS language options which exist for both the installation and specific journals. While increasing configuration flexibility, providing journal-level configuration has a few drawbacks:

  1. It is considerably more implementation effort to have both, installation- and journal-level configuration.
  2. It will confuse some users to find the same configuration options in two different places. This has at least been a problem for internationalization options in the past.
  3. End users using the search function will find an inconsistent user interface with some options enabled for one journal and disabled for other journals of the same installation. This may be quite confusing.

With our project goal of simplicity in mind it therefore seems preferable to provide all or at least most search options on system level only, as long as there is not a strong case for journal-specific configuration. This also implies a recommendation for the authorization model for search options: Most search options would be system-level and therefore be set by the OJS installation's administrators (admin role). Providers often do not give away the administrator credentials to journals they host. So this would be equivalent to reserving search configuration to providers, too.

There is one notable exception to this principle: As we've seen before, faceting is best implemented as a journal-level block plug-in so that it can easily be adapted to the journal-specific design and page layout. As this means that faceting has to be implemented as a separate plug-in anyway, it doesn't seem to be a strong disadvantage to have it implemented on journal level. This also means that placement of faceting within the journal design, once faceting has been enabled system-wide by the administrator, would be the responsibility of journal managers.

Configuring the Deployment Scenario

We support two main deployment options (see “Deployment Options” below):

  1. a fully preconfigured local jetty/solr server (embedded deployment) and
  2. a central solr server running in an arbitrary servlet container somewhere on the network (network deployment).

The former is the default configuration. The embedded jetty server runs local to the OJS installation and listens on the loopback IP address (127.0.0.1) to protect it from exposure to other servers. To support the second deployment option we'll need a configuration option consisting of the host and port of the solr server. We recommend this to be an OJS administrator-level (installation-wide) option so that we have a unique and unambiguous solr endpoint to send article meta-data to.

We can optionally provide an additional configuration parameter for the solr search handler to be used. This is “/solr/search” in the embedded deployment, but advanced users may want to deploy additional preconfigured search handlers. This will enable them to work with installation-specific search parameters (e.g. ranking-related or for a sub-set of journals) without having to customize any OJS code.

For the sake of simplicity we do not provide any means to directly set solr parameters from within OJS. Less advanced users should be able to use solr from a very simple interface while advanced users can still customize search to a very large extent by changing parameters directly in the solr configuration files. Keeping solr configuration within solr's configuration files also helps keep solr secure: Search endpoints can be constrained through mandatory configuration parameters, which would not be possible with client-side configuration. Such configuration would have to be communicated over the network and would thereby be open to manipulation from the outside.

The solr plug-in's home page will display a warning message whenever the current configuration does not point to a running solr server. In this case, the plug-in will point to the README file distributed in its home directory. This file will contain all necessary installation and configuration information to get up and running with OJS solr search.
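
A sketch of the resulting installation-wide settings (the option names and default values are illustrative assumptions, not the final option names):

  <?php
  // Sketch: installation-wide solr endpoint settings as discussed above.
  $settings = array(
      'solrServerHost' => '127.0.0.1',      // embedded deployment: loopback only
      'solrServerPort' => 8983,             // port of the embedded jetty server (assumption)
      'solrSearchHandler' => '/solr/search' // preconfigured handler; can point to a custom handler
  );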

Starting/Stopping solr

We provide a shell script to start/stop the embedded solr server. This script could be started/stopped from OJS if (and only if) it should be run under the same user as PHP. This user depends on the local web server configuration. In most cases it will be either the web server's user or – in more advanced installations – a dedicated PHP user. There may be other difficulties in starting/stopping solr directly from within OJS, see “Starting/Stopping Solr” in the “Embedded Deployment” chapter below. If all preconditions for tool execution are met then we can place a Start/Stop button onto the solr plugin main page. This allows administrators to start/stop solr from within OJS which will further simplify work with the embedded scenario.

Configuring Search Features

If we follow the recommendation to keep all search configuration on installation level then the following features could be dis-/enabled system-wide through simple check-boxes on the search plug-in's settings page:

  • highlighting
  • auto-suggestions
  • alternative spelling proposals
  • alternative order criteria
  • instant search
  • more-like-this links
  • faceting
  • custom document ranking

Rather than providing many feature-specific configuration parameters it seems more appropriate to provide a well thought-out default configuration for all of these features to keep the user interface as simple as possible. It has to be kept in mind that advanced users will always be able to tune features directly in the solr configuration. Therefore it is recommended to only provide OJS configuration for what cannot be configured directly in solr and choose good defaults otherwise.

It may even be defined that search features like auto-suggestions, alternative spelling proposals or highlighting that occupy little screen real estate and do not have a strong performance impact could be implemented out-of-the-box without the possibility of disabling them. The difference in implementation (configurable or not) seems to be negligible. So it is essentially up to the project owner to take this decision based on a trade-off between flexibility and simplicity of the user interface.

There are two notable exceptions to the recommendation to keep search feature configuration limited to simple on/off switches:

  1. The configuration of the faceting block would be done through the usual interface in step 5 of the journal setup and the normal OJS design customization process.
  2. If the inclusion of additional ranking data (e.g. citation index, usage statistics, etc.) should be possible, then we'll need an interface where such ranking information can be uploaded or integrated from external sources. One possibility is the “Custom Document Ranking Factor Configuration” described below.

Custom Document Ranking Factor Configuration

If a custom document ranking factor (e.g. citation index data, usage metrics, etc.) should be supported, this can easily be done as a generic input field on the article editing page. When custom document ranking is enabled in the plug-in, such a field will appear there. If editors insert a low value then the article will rank lower than default. A high value will increase its mean ranking position. The numbers will be linearly normalized so that their mean is one. See the “Ranking” chapter below for internal implementation details and more examples of potential alternative ranking methods. Alternatively an import plug-in could be realized that allows import of document ranking data from different file formats (CSV, XML, etc.) or even pulls ranking data from an external source via HTTP.
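
A minimal sketch of the normalization step described above (the function name and values are invented for illustration):

  <?php
  // Sketch: linearly normalize editor-supplied ranking factors so their mean is 1.0.
  // Factors above 1.0 boost an article's ranking, factors below 1.0 lower it.
  function normalizeRankingFactors(array $factors) {
      $mean = array_sum($factors) / count($factors);
      return array_map(function ($factor) use ($mean) {
          return $factor / $mean;
      }, $factors);
  }

  // Example: factors 0.5, 1.0 and 2.5 (mean 4/3) become 0.375, 0.75 and 1.875.
  $boosts = normalizeRankingFactors(array(0.5, 1.0, 2.5));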

Index Administration

Usually no special index administration should be necessary to keep the solr index up to date. All index maintenance due to article additions, updates or deletions should be handled automatically. There are situations, though, in which it may be necessary to re-index some articles, e.g. when the solr index has been lost, has gotten out of sync or has become corrupted.

Partial or full re-indexing

The existing OJS “re-indexing” button will trigger a re-indexing operation in solr if the solr plug-in is switched on. An additional drop-down field can be implemented to select a single journal for re-indexing.

Additionally we recommend exposing a CLI interface for index rebuild so that rebuilding indexes across several OJS instances can be easily automated if required.

Index Optimization

Index optimization is most likely not relevant to the embedded scenario. Lucene does a good job of automatically merging index segments, thereby keeping a good balance between index re-organization load and long-term query/update performance.

To keep the OJS interface simple and easy to use, we recommend not to support index optimization from within OJS. Providers that work with large multi-installation indexes can use the default solr interface to optimize their index if required. Index optimization can also be scripted if a provider wishes to automate this process.

Indexing

The following sections describe several aspects of the indexing back-end of the proposed search system. This comprises some changes to the OJS back-end but above all includes solr/Lucene configuration recommendations.

Index Architecture

Index architecture is one of the most important aspects of solr configuration. We list available options in this area and provide recommendations with respect to the requirements specified for this project.

Single Index vs. Multi-Index Architecture

The main decision with respect to index architecture is whether to use a single index or multiple indexes (and corresponding solr cores).

Advantages of a single index for all journals and document types:

  • enables search across various OJS instances
  • easy installation, configuration and maintenance (no need for solr configuration when adding additional OJS instances)
  • easy search across multiple document types: a single search across article meta-data, galleys and supplementary files with the intent to retrieve articles is possible.
  • easy search across languages
  • no need to merge, de-duplicate and rank search results from different indexes (distributed search)

Disadvantages of a single index:

  • potential ranking problems when restricting search to heterogeneous sub-sets of an index (e.g. a single journal)
  • potential namespace collisions for fields if re-using the same schema for different document types (e.g. supp. file title and galley title in the same field)
  • scalability problems if scaling beyond tens of millions of documents
  • adding documents invalidates caches for all documents (i.e. activity in one journal will invalidate the cache of all journals)
  • the whole index may have to be rebuilt in case of index corruption

Implications of Multilingual Support for the Index Architecture

There are two basic design options to index a multilingual document collection:

  1. Use one index per language
  2. Use one field per language in a single index

See http://lucene.472066.n3.nabble.com/Designing-a-multilingual-index-td688766.html for a discussion of multilingual index design.

Advantages of a single index:

  • One index is simpler to manage and query. A single configuration can be used for all languages.
  • Results will already be joined and jointly ranked. No de-duplication of search results required.

Advantages of a multi-index approach:

  • The multi-index approach may be more scalable in very large deployment scenarios - especially where a large number of OJS installations are indexed on a central search server.
  • Language configurations may be modularized into separate solr core configurations. No re-indexing of all documents is required when a new language is being introduced into existing documents. It is questionable, though, whether journals will ever introduce a new language into already published articles. So this advantage is probably only theoretical in the case of OJS.
  • The ranking metric "docFreq" is per-field while "maxDoc" is not. Using one index per language these parameters will be correct even when using a single field definition for all languages. We can easily work around this in a single-index design, however, by providing one field per language.

In our case the advantages of a single-index approach for multilingual content definitely outweigh its disadvantages.

Index Architecture Recommendations

The following sections provide index architecture recommendations for all deployment scenarios.

Single Index Architecture

We generally recommend a single-index architecture if possible.

Several disadvantages of the single index scenario are not relevant in scenarios S1 to S3:

  • We have only one relevant document type: OJS articles. By properly de-normalizing our data we can easily avoid field name collisions or ranking problems due to re-use of fields for different content (e.g. we would certainly have two separate 'name' fields for article name and author name).
  • It is not to be expected that the number of documents per journal (S1), installation (S2) or provider (S3) will exceed millions of articles. If it should happen then providers of this size will certainly have the skill available to configure a replicated search server while maintaining API compatibility based on our search interface documentation.
  • In usual scenarios the cost of cache invalidation due to new galley or supplementary file upload seems reasonable. If the cost of cache invalidation or synchronous index update after galley/supp. file addition becomes prohibitive we can still choose a nightly update strategy (see “Pull Processing” below). This is in line with the current 24 hour index caching strategy.
  • Our multilingual design can be implemented in a single index.

On the other hand there are advantages of a single index architecture (e.g. search across several OJS instances, simplicity of configuration, maintenance, etc.) which are relevant in our case, see above.

There are two potential problems that can occur when consolidating many journals in a single index:

  • more costly index rebuild
  • potential ranking distortions

The first point refers to the fact that if the whole index needs to be rebuilt (e.g. due to index corruption) we have to trigger the rebuild from all connected OJS instances. This cannot be automated within OJS as OJS does not allow actions across instances. It can, however, be easily automated via a simple custom shell script when we provide a CLI interface for index rebuilds which we recommend.

Whether ranking will suffer from a single-index approach depends on the heterogeneity of the journals added to the index. It may become a problem when search terms that have a high selectivity for one journal are much less selective for other journals thereby distorting Lucene's default inverse document frequency (IDF) scoring measure when restricting query results to a single journal.

An example will illustrate this: Imagine that you have two mathematics journals. One of these journals accepts contributions from all sub-disciplines while the other is specialized in topology. Now a search on "algebraic topology" may be quite selective in the general maths journal while it may hit a whole bunch of articles in the topology journal. This is probably not a problem as long as we search across both journals. If we search within the general maths journal only, then documents matching "algebraic topology" will probably receive lower scores than they should because the overall index-level document frequency for "algebraic topology" is higher than appropriate for the article sub-set of the general maths journal. This means that in a search with several search terms, e.g. "algebraic topology AND number theory", the second term will probably be overrepresented in the journal-restricted query result set. Only experiments with test data can show whether this is relevant in practice. It is fair to believe, though, that the majority of queries will be across all indexed journals and will therefore not suffer such distortion. This is because most users have an interest in their topic matter rather than being interested in a specific publication only.

NB: We do not have to bother about content heterogeneity on lower granularity levels, e.g. journal sections, as these cannot be selected as search criteria to limit search results.

The same ranking distortion could theoretically apply to multilingual content if we were to collect all languages in a single index field. In the proposed schema, however, we use a separate field per language, see “Multilingual Documents” below. As document frequency counts are per index field, we'll get correct language-specific document counts. The total document count will also be ok as we'll denormalize all language versions to the article level.

While we generally recommend a single index design there are cases where a multi-index design may be appropriate and can be optionally implemented by a provider:

  • when frequent index corruption or cache invalidation turns out to be a relevant problem,
  • when ranking distortions become relevant or
  • when reaching scaling limits.

Whether these problems occur or not can only be decided by experimentation. While one index per OJS instance is supported, even in a network scenario, it must be kept in mind that multiple indexes have disadvantages: From a user perspective the most relevant potential disadvantage is that searches across several journals will only be supported when those journals are in the same index. This is because we do not recommend distributed search across several indexes: it is much more complex and therefore costly to implement and creates difficult ranking problems that we can hardly solve. See the full list of disadvantages above.

S1 and S2: Embedded Solr Core

While we generally recommend a single-index architecture for all deployment options, there are a few comments to be made with respect to specific deployment scenarios.

In deployment scenario S1 and S2 we only search within the realm of a single OJS installation. This means that a single embedded solr core listening on the loopback IP interface could serve such requests, see “Embedded Deployment” below.

S3: Single-Core Dedicated Solr Server

In deployment scenario S3 we search across installations. This means that the default deployment approach with a per-installation embedded solr core will not be ideal as it means searching across a potentially large number of distributed cores. Therefore, the provider will probably want to maintain a single index for all OJS installations deployed on their network.

This has a few implications:

  1. We have to provide guidance on how to install, configure and operate a stand-alone solr server to receive documents from an arbitrary number of OJS installations.
  2. The OJS solr integration will need a configuration parameter that points to the embedded solr core by default but can be pointed to an arbitrary solr endpoint (host, port) on the provider's network. See “Configuring the Deployment Scenario” above.
  3. The OJS solr document ID will have to include a unique installation ID so that documents can be uniquely identified across OJS installations. See the “Data Model” and document update XML protocol specifications below.

S4: Multi-Core Dedicated Solr Server(s)

In deployment scenario S4 we have an unspecified number of disparate document types to be indexed. This means that the best index design needs to be defined on a per-case basis. We may distinguish two possible integration scenarios:

  1. display non-OJS search results in OJS
  2. include OJS search results into non-OJS searches

The present specification only deals with the second case as the first almost certainly requires provider-specific customization of OJS code that we have no information about.

Our index architecture recommendation for the S4 scenario is to create a separate dedicated solr core with OJS documents exactly as in scenario S3. Then searches to the "OJS core" can be combined with queries to solr cores with non-OJS document types in federated search requests from arbitrary third-party search interfaces within the provider's network. (See http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set for one possible solution of federated search.)

This has the advantage that the standard OJS solr search support can be used unchanged based on the same documentation resources that we provide to support S3 (see previous section).

The only extra requirement to support the S4 scenario is to make sure that the unique document ID of other document types does not clash with the OJS unique article id. This is important so that a federated search can uniquely identify OJS documents among other application documents. When working with a globally unique installation ID such clashes are extremely improbable. Potential ID clashes are only a problem when using solr's built-in federated search feature. Otherwise the search client will query the cores separately and join documents based on application-specific logic (e.g. displaying separate result lists for different document types).

Data Model

Our recommendation for the data model is based on the type of queries and results required according to our feature list. We also try to design a data model that requires as few schema and index modifications in the future as possible to reduce maintenance cost.

Meta-data fields that we want to search separately (e.g. in an advanced search) must be implemented as separate fields in Lucene. Sometimes all text is joined in an additional "catch-all" field to support unstructured queries. We do not believe that such a field is necessary in our case as we'll do query expansion instead.

To support multilingual search and proper ranking of multilingual content we need one field per language for all localized meta-data fields, galleys and supplementary files.

In order to avoid ranking problems we also prefer to have separate fields per document format (e.g. PDF, HTML, MS Word) rather than joining all data formats into a single search field. We can use query expansion to cover all formats while still maintaining good ranking metrics even when certain formats are not used as frequently as other formats.

The relatively large number of required fields for such a denormalized multilingual/multiformat data model is not a problem in Lucene (see http://lucene.472066.n3.nabble.com/Maximum-number-of-fields-allowed-in-a-Solr-document-td505435.html). Storing sparse or denormalized data is efficient in Lucene, comparable to a NoSQL database.

We prefer dynamic fields over statically configured fields:

  • Dynamic fields allow us to reduce our configuration to one generic field definition per analyzer chain (i.e. language).
  • No re-configuration or re-indexing of the data schema will be required to support additional languages or document formats.
  • No re-configuration of the data schema will be required to add additional meta-data fields.

The publication date will be indexed into a field of solr's trie date type.

Authors are not localized and will be stored verbatim in a multi-valued string type field.

Specific fields are:

  • the globally unique document ID field ("article_id") concatenating a globally unique installation ID, the journal ID and the article ID,
  • “inst_id” and “journal_id” fields required for administrative purposes,
  • the authors field (“authors_txt”) which is the only multi-valued field,
  • localized article meta-data fields ("title_xx_XX", "abstract_xx_XX", "discipline_xx_XX", "subject_xx_XX", "type_xx_XX", "coverage_xx_XX") where "xx_XX" stands for the locale of the field,
  • the publication date field ("publicationDate_dt"),
  • a single localized field for supplementary file data ("suppFiles_xx_XX") where "xx_XX" stands for the locale,
  • localized galley full-text fields ("galleyFullText_mmm_xx_XX") where "mmm" stands for the data format, e.g. "pdf", "html" or "doc", and "xx_XX" stands for the locale of the document,
  • fields to support result set ordering ("title_xx_XX_txtsort", "authors_txtsort", "issuePublicationDate_dtsort", etc.), see below
  • fields to support auto-suggest and alternative spelling proposals ("XXXX_spell"), see below
  • fields to support faceting ("discipline_xx_XX_facet", "authors_facet", etc.), see below

These fields will be analyzed for search query use cases, potentially including stemming (see “Analysis” below). The exact data schema obviously depends on the number of languages and data formats used by the indexed journals.
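
To make the denormalized model concrete, the following sketch shows the kind of field/value map a bilingual article with one PDF galley might be flattened into before being serialized for solr (all values are invented; the field names follow the list above):

  <?php
  // Sketch: denormalized article document following the field list above (values invented).
  $doc = array(
      'article_id' => 'a1b2c3-2-15',                        // installation ID + journal ID + article ID
      'inst_id' => 'a1b2c3',
      'journal_id' => '2',
      'authors_txt' => array('Ana Author', 'Ben Beispiel'), // the only multi-valued field
      'title_en_US' => 'On the Indexing of Multilingual Journals',
      'title_de_DE' => 'Zur Indexierung mehrsprachiger Zeitschriften',
      'abstract_en_US' => '...',
      'abstract_de_DE' => '...',
      'publicationDate_dt' => '2012-06-15T00:00:00Z',
      'suppFiles_en_US' => '...',                           // all supp. file data joined per locale
      'galleyFullText_pdf_en_US' => '...'                   // extracted full text of the PDF galley
  );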

In the case of supplementary files there may be several files for a single locale/document format combination. As we only query for articles, all supplementary file full text can be joined into a single field per language/document format. And as we do not allow queries on specific supplementary file meta-data fields we can even further consolidate supplementary file meta-data into a single field per language.

To reduce index size and minimize communication over the network link, all our fields are indexed but not stored. The only field to be stored in the index is the ID field, which will also be the only field to be returned over the network in response to a query request. Article data (title, abstract, etc.) will then have to be retrieved locally in OJS for display. As we are using paged result sets this can be done without relevant performance impact.

If we want to support highlighting then the galley fields need to be stored, too.

Further specialized fields will be required for certain use cases. If we want to support auto-suggestions or alternative spelling suggestions then we'll have to provide textual article meta-data fields in a minimally analyzed (lowercase only, non-localized) version. These fields will be called “xxxxx_spell” where “xxxxx” stands for the field name without locale extension.

Fields that we want to use as optional sort criteria need to be single-valued, indexed, and not tokenized. This means that sortable values will potentially have to be analyzed separately into “xxxxx_xx_XX_txtsort” or "xxxxx_dtsort" fields where “xxxxx” stands for the field name and “xx_XX” for the locale (if any) of the sort field.

Faceting fields ("xxxxx_xx_XX_facet") need to be localized. They are minimally analyzed (lower case only) and tokenized by separator (e.g. "," or ";") rather than by whitespace.
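
A corresponding faceting field type could be sketched as follows (the type name and the exact separator pattern are assumptions; see schema.xml for the authoritative definition):

   <!-- sketch: split keyword lists on "," or ";", trim and lowercase, but otherwise keep the original value -->
   <fieldType name="text_facet" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.PatternTokenizerFactory" pattern="[,;]"/>
       <filter class="solr.TrimFilterFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>
   <dynamicField name="*_facet" type="text_facet" indexed="true" stored="false"/>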

If we want to support the “more-like-this” feature then we may have to store term vectors for galley fields if we run into performance problems. We do not store term vectors by default, though.

Further technical details of the data model can be found in plugins/generic/solr/embedded/solr/conf/schema.xml.

Document Submission and Preprocessing

Article data needs to be submitted to solr and preprocessed so that it can be ingested by solr's Lucene back-end. This is especially true for binary galley and supplementary file formats that need to be transformed into a UTF-8 character stream. The following sections will describe various options and recommendations with respect to document submission and preprocessing.

Existing OJS Document Conversion vs. Tika

The current OJS search engine implements document conversion based on 3rd-party commandline tools that need to be installed on the OJS server. Solr, on the other hand, is well integrated with Tika, a document and document meta-data extraction engine written in pure Java. We have to decide whether to re-use the existing OJS solution or whether to use Tika instead.

Advantages of the existing OJS conversion:

  • We can re-use an established process that some OJS users already know about.
  • Conversion of PostScript files can be provided out-of-the-box.

Advantages of Tika:

  • According to our tests, Tika works at least one order of magnitude faster than the current OJS solution. This is especially important for large deployment scenarios, i.e. when re-indexing a large number of articles.
  • Tika is easier to use and install than the current OJS solution. No additional 3rd-party tools have to be installed as is now the case (except for solr itself of course). Plain text, HTML, MS Word (97 and 2010), ePub and PDF documents are supported out-of-the-box by the code that comes with the standard solr distribution. Caution: Tika does not convert PostScript files!
  • Can be deployed independently on the search server and does not need an OJS installation to work. In scenarios S3 and S4 this means considerably less infrastructure to be deployed on OJS nodes.
  • Very well tested and maintained.
  • Enables indexing of several additional source file types out-of-the-box, see https://tika.apache.org/1.0/formats.html and https://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml.

The only real disadvantage of Tika with respect to our requirements is that it does not support conversion of PS files. PS could be supported indirectly by first converting it to PDF locally and then submitting the PDF to the solr server. It is, however, not clear whether there are any OJS installations with an interest in solr that still use PostScript as a publishing format. The advantage of solr being able to support the ePub format seems more important than the missing PS support.

Recommendation: Use the Tika conversion engine.

Local vs. Remote Processing

In the multi-installation scenarios S3 and S4 document preprocessing could be done locally to the installation or on the central solr server.

Advantages of local processing are:

  • The solr server experiences less load for resource-intensive preprocessing tasks.

Advantages of remote processing are:

  • Doing all processing on a single server will simplify deployment and maintenance as 3rd-party dependencies only need to be installed and configured on a single server. OJS installations can be added without additional search-related installation requirements.
  • Solr preprocessing extensions like the Solr Content Extraction Library (Cell) or the Data Import Handler (IDH) work locally to the solr core.
  • We can keep load off the end-user facing OJS application servers for consistent perceived response time.

Recommendation: Use remote processing, mostly due to the reduced deployment cost and easy use of Solr extensions.

Push vs. Pull

Document load can be initiated on the client side (push processing) or on the server side (pull processing). Both options have their strengths and weaknesses.

Advantages of push configuration:

  • Indexing can be done on-demand when new documents are added to OJS. This guarantees that the index is always up-to-date.
  • No solr-side import scheduler needs to be configured and maintained.
  • Push is simplest when implemented as a synchronous call without callback. This may be sufficient in our case, especially for the embedded scenario, although it implies the risk that documents may not be indexed if something unexpected goes wrong during indexing or if the solr server is down. Without additional safety measures this means that a full index rebuild is required after the solr server comes back online. This problem can be mitigated by implementing the "dirty" pattern: Articles are marked "dirty" when they are updated and every call to the synchronous indexing API will update all "dirty" articles. Articles will only be marked "clean" when the solr server has confirmed that they were successfully indexed. In case of solr server downtime, the next index update can either be triggered by the next regular update of any article or manually, through an additional button in the plug-in interface (e.g. "Index all pending changes") or an additional switch to the "rebuildSearchIndex" script.


Advantages of pull configuration:

  • Push processing means that editorial activity during daytime will cause update load peaks on the solr server exactly while it also experiences high search volume. This load can be quite erratic and fluctuating in larger system environments and therefore difficult to balance. In pull mode indexing schedules can be configured and co-ordinated in one single place (for scenarios S3 or S4) to balance document import load on the central search server and keep it to off hours.
  • Pull also means that the process is more resilient against solr server downtime. In case of service outage, updates can be postponed until the server is back online. A full index rebuild is not required in that case.

Recommendation: Use the simpler push configuration by default but check its performance and reliability early on. If it turns out to be slow or unreliable, especially in the network deployment case, then provide instructions and sample configuration for an optional pull configuration for larger deployments, see “OJS/solr Protocol Specification” and “Deployment Options” below.

Both push and pull processing can be implemented with or without callback. We recommend callback for the network deployment only, where large amounts of data have to be indexed and full index re-builds can be very costly, see “OJS/solr Protocol Specification” below.

Implications of Multilingual Document Processing

The OJS search feature returns result sets on article level rather than listing galleys or supplementary files as independent entities. This means that ideally our index should contain one entry per article so that we do not have to de-duplicate and join result sets. Different language versions and formats of articles should be spread over separate fields rather than documents. Such a denormalized design also facilitates multilingual search and ranking. A detailed argumentation for this preferred index design will be given in the “Multilingual Documents” section below.

For document preprocessing this design implies that we have to join various binary files (galleys and supplementary files in all languages and formats) plus the article meta-data fields into a single solr/Lucene document. As we'll see in the “Solr Preprocessing plug-ins” section, this considerably influences and restricts the implementation options for document import.

Custom Preprocessing Wrapper vs. solr Plug-Ins

We have to decide whether we want to implement our own custom preprocessing wrapper to solr as in the current OJS search implementation or whether we want to re-use the preprocessing interface and capabilities provided by native solr import and preprocessing plug-ins.

Advantages of a custom preprocessing interface are:

  • We could use an arbitrary data transmission protocol, e.g. re-use existing export formats like OAI or the OJS native export format or use solr's native document addition format directly over the wire. The former implies that we somehow have to interpret these formats on the server side. The latter means that we have to transform binaries into a UTF-8 character stream on the client side, see the discussion of local document preprocessing above.
  • We could re-use the existing document conversion code, rather than using Tika. See the discussion of the existing OJS preprocessing code above.

Advantages of standard solr plug-ins:

  • We can re-use solr's elaborate document preprocessing capabilities which are more powerful than those currently implemented in OJS.
  • Tika is well integrated with solr through two different plug-ins: DIH and Cell. Using native solr plug-ins means that we can use Tika as a conversion engine without having to write custom Tika integration code.
  • Custom remote preprocessing code to interpret OAI messages or OJS export formats is expensive: It means either implementing and maintaining a separate server-side PHP application or extending solr with custom Java code.
  • Solr plug-ins support pull and push configurations out-of-the-box.

A priori both options have their strengths and weaknesses. In our case, though, the choice is relatively clear due to our preference for remote document preprocessing and Tika as an extraction engine. Having to maintain custom Java code or creating a separate server-side PHP preprocessing and Tika integration engine are certainly not attractive options for FUB or PKP.

Recommendation: The advantages of using established solr plug-ins for data extraction and preprocessing outweigh the advantages of a custom preprocessing interface in our case.

Solr Preprocessing plug-ins: IDH vs. Cell

Currently there are two native solr extensions that support Tika integration: The "Data Import Handler" (IDH) and the "Solr Content Extraction Library" (Solr Cell).

Cell is meant to index large numbers of files with very little configuration. It does not support more complex import scenarios with several data sources and complex transformation requirements, though. It also does not support data pull. In our case, these disadvantages rule it out as a solution.

The second standard solr preprocessing plug-in, IDH, is a flexible extraction, transformation and loading framework for solr that allows integration of various data sources and supports both, pull and push scenarios.

Unfortunately even IDH has two limitations that are relevant in our case:

  • IDH's XPath implementation is incomplete. It does not support certain types of XPath queries that are relevant to us: An IDH XPath query cannot qualify on two different XML attributes at the same time, which rules out transmitting native OJS XML to IDH.
  • Due to its sequential “row” concept imposed on XML parsing, IDH also does not usually support denormalizing several binary documents into a single Lucene document. In fact no standard solr contribution is designed to do so out-of-the-box (see http://lucene.472066.n3.nabble.com/multiple-binary-documents-into-a-single-solr-document-Vignette-OpenText-integration-td472172.html). Only by developing a custom XML data transmission format with CDATA-embedded XML sub-documents did we manage to work around this limitation without having to resort to custom compiled Java code on the server side, see “XML format for article addition” below.

Recommendation: Use IDH for document preprocessing with a custom XML document transmission format.

Should we use Tika to retrieve Meta-Data from Documents?

Tika can retrieve document meta-data from certain document formats, e.g. MS Word documents. This functionality is also well integrated with IDH.

Using this meta-data is problematic, though:

  • Document meta-data cannot be consistently retrieved from all document types.
  • Even where the document theoretically allows for storage of a full meta-data set, these meta-data may be incomplete or inconsistent with OJS meta-data.
  • We do have a full set of high-quality document meta-data in OJS that we can use instead.

Recommendation: Do not use Tika to extract document meta-data but use the data provided by OJS instead.

Transmission Protocol

IDH supports several data transmission protocols, e.g. direct file access, HTTP, JDBC, etc. In our case we could use direct file access or JDBC for the embedded deployment scenario. But as we also have to support multi-installation scenarios we prefer channeling all data through the network stack so that we can use a single preprocessing configuration for all deployment options. Using the network locally is only marginally slower than accessing the database and file system directly. By far most processing time is spent on document conversion and indexing, so document transmission will hardly become a performance bottleneck.

HTTP is the network protocol supported by IDH. HTTP can be used for push and pull configurations. It supports transmission of character stream (meta-)data as well as binary (full text) documents. Our recommendation is therefore to use HTTP as the only data transmission protocol in all deployment scenarios.
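
For illustration, in the push configuration an index update could then boil down to a single HTTP POST against the IDH request handler, with the article XML as the request body (host, core and handler names are placeholders; "command", "clean" and "commit" are standard IDH request parameters):

   POST http://search-server:8983/solr/ojs/dih?command=full-import&clean=false&commit=true

In the pull configuration the same handler would instead be triggered by a scheduler on the solr server and fetch the XML from a URL exposed by the OJS installation.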

Non-HTTP protocols can still be optionally supported (e.g. for performance reasons) by making relatively small custom changes to the default IDH configuration.

Exact details of the transmission protocol will be laid out in the “OJS/solr Protocol Specification” below.

Submission and Preprocessing Recommendations

To sum up: Our analysis of the data import process revealed that the following requirements should be met by a data preprocessing solution:

  • No custom Java programming should be required.
  • Push and pull scenarios should be supported.
  • Remote preprocessing should be supported.
  • We have to support denormalization of various binary files into a single Lucene document.
  • Preprocessing should be done with Tika using native solr plug-ins.
  • Documents and meta-data should be sent over the network.

We provide a prototypical IDH configuration that serves all these import and preprocessing needs:

  • We provide push and pull configurations. Push is supported by IDH's ContentStreamDataSource and pull is supported via the URLDataSource.
  • Neither configuration requires direct file or database access; all communication goes over the network stack.
  • In our prototype we demonstrate a way to use an IDH FieldReaderDataSource to pass embedded XML between nested IDH XPathEntityProcessor instances. This allows us to denormalize our article entity with a single IDH configuration. We also draw heavily on IDH's ScriptTransformer to dynamically create new solr fields when additional languages or file types are being indexed for the first time. This means that no IDH maintenance will be necessary to support additional locales.
  • All file transformations are done via IDH's Tika integration (TikaEntityProcessor). We nest the Tika processor into an XPathEntityProcessor and combine it with a ScriptTransformer to denormalize several binary files into dynamic solr fields.

Please see plugins/generic/solr/embedded/solr/conf/dih-ojs.xml for details.
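
The overall structure of that configuration can be sketched roughly as follows (heavily simplified; entity, column and data source names are assumptions, and the sketch omits the FieldReaderDataSource/ScriptTransformer nesting described above — the dih-ojs.xml file just referenced is the authoritative version):

   <dataConfig>
     <!-- the XML message pushed by OJS is read as a content stream -->
     <dataSource name="ojsXml" type="ContentStreamDataSource"/>
     <!-- binary galleys and supplementary files are fetched and converted by Tika -->
     <dataSource name="binaries" type="BinURLDataSource"/>
     <document>
       <entity name="article" dataSource="ojsXml"
               processor="XPathEntityProcessor" forEach="/articles/article">
         <field column="article_id" xpath="/articles/article/articleId"/>
         <field column="galleyUrl" xpath="/articles/article/galley/@url"/>
         <!-- nested entity: one Tika pass per binary file, written into a dynamic full-text field -->
         <entity name="galley" dataSource="binaries"
                 processor="TikaEntityProcessor" format="text"
                 url="${article.galleyUrl}">
           <field column="text" name="galleyFullText_pdf_en_US"/>
         </entity>
       </entity>
     </document>
   </dataConfig>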

Analysis

In the Lucene context, “analysis” means filtering the character stream of preprocessed document data (e.g. filter out diacritics), splitting it up into indexed search terms (tokenization) and manipulating terms to improve the relevance of search results (e.g. synonym injection, lower casing and stemming).

Precision and Recall

This part of the document describes how we analyze and index documents and queries to improve precision and recall of the OJS search. In other words: We have to include a maximum number of documents relevant to a given search query (recall) into our result set while including a minimum of false positives (precision).

Measures that may improve recall in our case are:

  • not making a difference between lower and upper case letters
  • removing diacritics to ignore common misspellings
  • using an appropriate tokenization strategy (e.g. n-gram for logographic notation or unspecified languages and whitespace for alphabetical notation)
  • using "stemmers" to identify documents containing different grammatical forms of the words in a query
  • using synonym lists (thesauri) to include documents that contain terms with similar meaning

Measures that improve precision may be:

  • ignore frequently used words that usually carry little distinctive meaning ("stopwords")

Often there is a certain conflict between optimizing recall and precision. Measures that improve recall by ignoring potentially significant differences between search terms may produce false positives thereby reducing precision.

Please observe that most of the above measures require knowledge about the text language, i.e. its specific notation, grammar or even pronunciation. A notable exception to this rule is n-gram analysis which is language-agnostic. Support for a broad number of languages is one of our most important requirements. Therefore appropriate language-specific treatment of meta-data and full text documents is critical to the success of the proposed design. We'll therefore treat language-specific analysis in detail in the following section.

Our general approach is to keep the analysis process as simple as possible by default. This also includes minimal stemming and language-specific analysis. This is to honor the “simplicity” design goal as specified for this project. Whenever we discover unsatisfactory relevance of result lists during testing (see our testing approach above), especially insufficient recall of multilingual documents, we'll further customize analysis chains. This ensures that additional complexity is only introduced when well justified by specific user needs.

Multilingual Documents

It is one of the core requirements of this project to better support search in multilingual content. This is especially true for languages with logographic notation, such as Japanese or Chinese, that are not supported by the current OJS search implementation. We've already analyzed the impact of multilingual documents on index and data model design. The most important part of multilingual support lies in the analysis process, though. In fact, allowing for language-specific analysis is one of the reasons why we recommend a “one-field-per-language” data model.

There is no recommended default approach for dealing with multilingual content in solr/Lucene. The range of potential applications is so large that individual solutions have to be found for every use case. We'll therefore treat this question in considerable detail: First we'll list a few specific analysis requirements derived from the more general project requirements presented earlier. Then we'll discuss several approaches to multilingual analysis. Finally we'll recommend an individual solution for the use cases to be supported in this project.

Requirements

Requirements for the analysis process must above all be derived from expected user queries and the corresponding correctly ranked result lists. The following list of analysis requirements is therefore derived from properties specific to multilingual OJS search queries:

  • The OJS search form is language agnostic. Search terms can be entered in any language. Both, single and mixed-language queries, should be allowed.
  • The indexing process should be able to deal with galleys and supplementary files in different languages.
  • The indexing process should usually be able to rely on the locale information given for the galley or supplementary file being indexed. A language classifier might optionally be used for galleys whose locale information is unreliable or cannot be identified.
  • The indexing process should be able to deal with mixed-language documents where short foreign-language paragraphs alternate with the main galley/supplementary file language. This means that e.g. an English search term entered into the OJS search box should find a German document containing the search word in an English-language citation.
  • The following languages should be specifically supported: English, German, Spanish, Chinese, Japanese. Other languages should be supported by a generic analysis process that works reasonably well for multilingual documents.
  • A process should be defined and documented for plugging in additional language-specific analysis chains on demand.

Further requirements derive from multilingual test queries. Consult the list of test queries linked in the main “Requirements” section above for details.

Language Recognition vs. Preset Language

When multilingual content should be analyzed in a language-specific manner (e.g. stemming, stopwords, etc.) we need to know the document language to be able to branch into the correct analysis chain. There are two basic approaches to obtain such language identity information: machine language recognition and user input.

Advantages of machine language recognition:

  • Deals with incomplete or unreliable locale information of meta-data, galleys and supplementary files.

Advantages of preset languages:

  • Simpler to implement.

Whether preset languages are as reliable as machine language recognition mainly depends on the quality of user input: In our case, user-provided language information will probably be quite reliable for meta-data and galleys. This is not the case for the content of supplementary files, as these do not have a standardized locale field. This seems to be a minor problem, though: It is assumed that searches on supplementary file content are of minor importance in our case.

Our recommendation therefore is to work with preset languages to avoid unnecessary implementation/maintenance cost and complexity. If we see in practice that important test queries cannot be run with preset languages then we can still plug in language recognition where necessary. We can use solr's “langid” plug-in in this case, see https://wiki.apache.org/solr/LanguageDetection. It provides field-level language recognition out-of-the-box.

Document vs. Paragraph-Level Language Recognition

The granularity of multilingual analysis has a great influence on implementation complexity and cost. While document-level language processing is largely supported with standard Lucene components, paragraph or sentence-level language recognition and processing requires considerable custom implementation work. This includes development and maintenance of custom solr/Lucene plug-ins based on 3rd-party natural language processing (NLP) frameworks like OpenNLP or LingPipe.

We identified the following implementation options for multilingual support:

  1. Allow language-specific treatment only on a document level and treat all documents as "monolingual". Document parts that are not in the main document language may or may not be recognized depending on the linguistic/notational similarity between the main document language and the secondary language.
  2. Allow language-specific treatment on document level and provide an additional "one-size-fits-all" analysis channel that works reasonably well with a large number of languages (e.g. using an n-gram approach, see below). Search queries would then be expanded across the language-specific and generic search fields. This will probably improve recall but reduce precision for secondary-language search terms.
  3. Perform paragraph or sentence-level language recognition and analyze text chunks individually according to their specific language. This should provide highest precision and recall but will be considerably more expensive to implement and maintain.

The advantage of the first two options is that they can be implemented with standard solr/Lucene components. The third option will require development and maintenance of custom solr/Lucene plug-ins and integration with third-party language processing tools. This is not an option in our case as it would require custom Java programming which has been excluded as a possibility for this project.

We recommend the second approach which will be further detailed in the next section.

Language-Specific Analysis vs. n-gram Approach

There are two basic approaches to dealing with multilingual content: a generic n-gram approach that works in a language-agnostic manner and provides relatively good mixed-language analysis results, or language-specific analysis chains that analyze text whose language is known at analysis time.

Advantages of an n-gram approach:

  • relatively easy to implement with a single multilingual analyzer chain
  • easy to introduce new languages (no additional configuration required)
  • easy to query (no need for query expansion)
  • can be used to index mixed-language documents
  • no language identification required
  • may speed up wildcard searches in some situations

Advantages of language-specific analysis chains:

  • higher relevancy of search results (fewer false positives and false negatives)
  • language information contributes to proper ranking of documents
  • easier to tune in case of language-specific relevancy or ranking problems
  • requires less storage space, especially when compared to multi-gram analysis (e.g. full 2-, 3- and 4-gram analysis for a single field).

While language-specific analysis chains may not be ideal for mixed-language content, it is improbable that n-gram analysis alone will provide satisfactory relevance of result sets.

We therefore recommend a mixed approach: We should provide language-specific analysis chains for the main language of a document or meta-data fields where the language is known and supported. All fields and documents may additionally undergo partial n-gram (e.g. edge-gram) analysis if we find that this is necessary to support multilingual document fields or fields that do not have a language specified. The results from both analysis processes will have to go into separate fields. This requires separate fields per language (see “Data Model” above) and query expansion to all language fields (see the “Query Transformation and Expansion” below).

Character Stream Filtering

According to our “simplicity by default” approach we do not recommend any character stream filtering unless specific test use cases require us to do so. The recommended stemming filters deal to a large extent with diacritics. Lower case filtering is done on a token level.

Tokenizing

Tokenization differs between alphabetic languages on the one hand and logographic languages on the other. We recommend standard whitespace tokenization for most Western languages, while a bigram approach is usually recommended for Japanese, Chinese and Korean. We therefore recommend solr's CJK tokenizer for these languages.

Token Filtering

We recommend lowercase filtering for alphabetic languages and language-specific stopword filtering by default. In order to simplify analysis and avoid additional maintenance cost, we do not recommend synonym filtering unless required to support specific test cases.

Stemming

We recommend solr's minimal language-specific stemming implementations where they exist. Should these yield insufficient recall during testing then we can replace them with more aggressive stemmers on a case-by-case basis.

One might even want to remove all stemming and cluster all alphabetic languages into a single analysis chain, similar to what is currently done in the standard OJS search. In order to keep flexibility for advanced use cases in scenarios S3 and S4 we do recommend language-specific analysis chains, though, even if they are not used out-of-the-box. It has to be kept in mind that this complexity is completely transparent to end users.
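
As an illustration, a language-specific field type along the lines of the last three sections might look as follows (a minimal sketch; the type names, the stopword file location and the choice of the light German stemmer are assumptions — the recommended configuration is the schema.xml referenced in the “Default Implementation” section below):

   <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <!-- whitespace tokenization for alphabetic languages -->
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <!-- lowercasing and language-specific stopword filtering -->
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt"/>
       <!-- light (minimal) German stemming -->
       <filter class="solr.GermanLightStemFilterFactory"/>
     </analyzer>
   </fieldType>
   <!-- for Chinese, Japanese and Korean a bigram-based field type would use the CJK tokenizer instead -->
   <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.CJKTokenizerFactory"/>
     </analyzer>
   </fieldType>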

Special Fields

Keyword fields like discipline, subject, etc. are not usually passed through stemming filters. We therefore recommend a generic, language-agnostic analysis chain for all keyword fields.

We have to support a special analysis chain for the article and issue publication date so that range queries on the publication date can be supported. There are default analyzers and field types for dates which we recommend here.

Text fields to be sorted on must not be tokenized. Date fields to be sorted on must be of a different type than date fields to be queried on. We therefore provide special field types for sort ordering.

Theoretically, geographic coverage could be analyzed with a location analyzer if (and only if) it were given in a well-defined latitude/longitude format. As this is not usually the case in OJS we recommend analyzing geographic coverage in the same way as other keyword fields.

Field Storage

Most use cases only require us to index fields. Storage is not required. The only field we need to store (and return from queries) is the document ID field which will be required by OJS to retrieve article data for display in result sets. There is a notable exception to this rule, though: If we enable highlighting then storage of galley fields is mandatory. This is necessary so that the highlighting component can return search terms in their original context. Therefore highlighting considerably increases the storage space required by OJS solr indexes. This should be considered when deciding whether this feature is to be supported out-of-the-box.

Default Implementation

Please see plugins/generic/solr/embedded/solr/conf/schema.xml for our recommended analysis configuration.

Querying

Query Entry and Auto-Suggest

Search queries entered into the search fields will be submitted to the OJS server as POST requests. This does not differ from the current search implementation. OJS will then start a nested HTTP request to the solr server which will return matching article IDs and (if enabled) advanced search information such as highlighting, alternative spelling suggestions, similar articles, etc.

OJS will access its article database to present the result set. This works exactly in the same way as in the current OJS search implementation. The only difference is that article IDs are provided by solr rather than being retrieved from OJS' own index implementation.

Auto-suggestions can be obtained from prefixed facet searches, a dictionary-based suggester component or term searches. In all cases AJAX search requests have to be sent to a specific OJS handler operation on every keystroke in one of the search fields. The OJS handler will then delegate internally to the OJS solr search endpoint and return the results to the OJS client browser. This requires us to implement a specific OJS handler operation for auto-suggestions plus corresponding JavaScript client code which can be based on PKP's JavaScript framework.

Prefixed facet searches have the advantage that they only suggest searches that will actually return results. They can even inform about the number of documents that will be returned for each proposed search query. The problem with prefixed facet searches is that they do not scale well when searching large indexes. Suggester searches are extremely fast as they can be provided almost without processing from a pre-calculated dictionary. We have to provide separate dictionary fields for each search field that we want to enable for auto-suggest. These fields have the same requirements with respect to the data model and analysis process as alternative spelling suggestions. See the data model and corresponding faceting/spelling chapters for details.

We recommend implementing both prefixed facet and suggester-based auto-suggestions. Prefixed facet suggestions should be the default. If it turns out that prefixed facet searches do not provide the necessary performance in a given deployment scenario then switching from one implementation to the other should only require a bit of re-configuration and no programming effort. Most of the implementation effort lies in the JavaScript front-end and the OJS handler, which are the same for both auto-suggest implementations except for a few search parameters. We do not consider the third possibility to obtain suggestions: term searches. Term searches do not provide a considerable advantage over suggester searches but are less flexible.
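
For illustration, a prefixed facet auto-suggest request could look roughly like this (the field name follows the “_spell” convention from the data model section; all parameter values are examples only):

   .../q=*:*&rows=0&facet=on&facet.field=title_spell&facet.prefix=euro&facet.limit=10

The facet counts returned for each completion can be displayed directly as the number of expected hits for the proposed search.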

Query Parser

Solr provides a considerable number of query parsers. Prominent choices in our case are: “lucene” (the default Lucene query syntax), “dismax” (provides a simplified query syntax and allows additional fine tuning of query parameters, e.g. term boost) and “edismax” which improves on details of the dismax implementation and safety.

Unfortunately none of these query parsers currently corresponds 100% to our needs. The problematic requirements are "implicit AND", "cross-language search", "advanced multi-field search", "language specific stopword lists" and full support for the "NOT" query keyword.

The "lucene" parser supports implicit AND and would therefore be a potential choice in our case. Unfortunately, though, it does not implement multi-field queries. Fielded searches can only be done one field at a time. Naively one could do something like:

   .../defType=lucene&q=field1:(some phrase) OR field2:(some phrase)

to implement multi-field queries. This however has unexpected side effects when querying with NOT. Imagine an article with the following title in two languages: The German version is "Kanzlerin Merkel und die Eurokrise" and the English version would then be "Chancellor Merkel and the Euro Crisis". Now some user who does not want any article to remind him of his personal stock portfolio disaster might query "Merkel NOT Eurokrise". This query would then be expanded to:

   .../defType=lucene&q=title_de_DE:(Merkel NOT Eurokrise) OR title_en_US:(Merkel NOT Eurokrise)

While the first field search returns no results, the second certainly will. The English title contains the word "Merkel" but not the word "Eurokrise". And "FALSE OR TRUE" will become "TRUE" overall so the article matches. While this may make sense technically it causes unintuitive results in practice.

In principle the "edismax" parser's multi-field query could come to the rescue. Transforming the above query to:

   .../defType=edismax&q=Merkel NOT Eurokrise&qf=title_de_DE title_en_US

yields the expected result. Now the two fields are treated as if they were concatenated.

Apart from that the "edismax" parser has further advantages over the "lucene" parser: It has the most advanced features for OJS providers to customize search handlers, e.g. with respect to field or document boost. It also is more forgiving about erroneous user input and will gracefully fall back to a simple keyword search in case syntax errors are encountered in the query.

The problem is that "edismax" (and "dismax", too) do not support implicit AND conjunction of search phrases. Setting the "edismax" min-match parameter to 100% could simulate implicit AND but unfortunately it is buggy when used together with fielded search. This means that:

   .../defType=edismax&q=(Merkel Eurokrise)&qf=title_de_DE title_en_US&mm=100%

works as expected but

   .../defType=edismax&q=(Merkel Eurokrise) +journal_id:test-inst-2&qf=title_de_DE title_en_US&mm=100%

does not, as the min-match parameter will suddenly stop working when used in conjunction with a fielded query. See http://lucene.472066.n3.nabble.com/Dismax-mm-per-field-td3222594.html. Actually the problem is worse than described in the cited thread, as can be seen in our example: The min-match parameter stops working even for the main query when a completely independent field query is added.

This problem can be worked around with subqueries, though. When reformulating the above query as:

   .../q=_query_:"{!edismax qf='title_de_DE title_en_US' v='Merkel Eurokrise'}" +journal_id:test-inst-2 +inst_id:test-inst&mm=100%

everything will work as expected. Subqueries have the additional advantage that they enable advanced multi-lingual queries which require us to query different field lists with individual search phrases, like this:

   .../q=_query_:"{!edismax qf='title_de_DE title_en_US' v='Merkel Eurokrise'}" _query_:"{!edismax qf='subject_de_DE subject_en_US' v='Obama'}" +journal_id:test-inst-2 +inst_id:test-inst&mm=100%

So why not do it like this? The problem is another quirk of the min-match parameter: It does not work as expected when used in conjunction with stopword lists. Searching for "Die Eurokrise" in the above example will return no results as "die" is a German stopword but it is not on the list of English stopwords. Internally the query would therefore be transformed to something like this (simplified!):

   +title_en_US:die +(title_de_DE:Eurokrise OR title_en_US:Eurokrise)

Edismax will "optimize" the query by removing the stopword "die" from the query on the German title. A min-match setting of "100%" will cause all parts of the query to be mandatory, though. Now there obviously will be no match for the first search term in our example. And as this search phrase is marked "mandatory" by our min-match setting, the article will not be selected for the result set although it is certainly relevant. See http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html and https://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/ for details.

The solutions proposed by solr's developers are to either remove stopword lists altogether, use the same stopword list for all fields (aka languages) or use a min-match setting of "1" (i.e. implicit OR). None of these solutions is compatible with our requirements.

In other words: There is currently no known way to implement all of our requirements with the existing solr query parsers.

This leaves us with the following options:

  1. Drop the "implicit AND" requirement. This means we can set edismax "min-match" to 1 which would work around the stopword problem while correctly dealing with "NOT" queries and cross-language search. Additionally providers would then be able to use edismax's advanced result set optimization features.
  2. Drop the requirement to implement language-specific stopword lists: We could implement a single stopword list across all languages (or no stopword list at all) which will allow us to use a min-match setting of "100%" which will make implicit AND work in most situations with edismax. Cross-language search, advanced multi-field search and "NOT" will then be supported as required.
  3. Drop the cross-language search requirement: We can implement this by providing a drop-down box in the search GUI to select the language to be searched. If we do this then we can do without cross-language searches which will bring the "lucene" query parser back into consideration. We can implement multi-field search (i.e. "all fields" and "index term fields") by indexing "combined" fields that concatenate index terms from several fields. Such concatenation is not possible across languages as solr requires separate fields for separate analysis chains. This option will solve the problems with searches containing a "NOT" keyword and also make "implicit AND" work correctly.
  4. Drop the requirement to fully support the "NOT" operator: Using the "lucene" parser together with OR-related field-based searches as demonstrated in our first example would then be the proposed solution.

I ordered the proposals according to our current judgment of "best usability" for the end user. "Implicit OR" may be a deviation from current OJS requirements but it certainly is what most users are already used to, e.g. from Google searches. The "implicit AND" makes sense for the OJS SQL search as it does not implement a very effective ranking mechanism. With solr's improved ranking algorithm results that match the query only partially will appear lower in the list so basically the first few hits in the list should be exactly those containing all search terms. Therefore dropping the "implicit AND" requirement seems to have almost no impact on usability or may even improve usability by increasing recall while maintaining precision through intelligent result ranking.

Stopword lists are mainly used to improve query performance. Removing stopword lists will probably only marginally influence results from a user perspective. It is, however, probable that recall decreases which may reduce overall result set relevance.

Introducing a language drop-down for queries or no longer supporting the "NOT" operator are highly visible changes to the current search implementation and will have adverse impact on usability. We therefore do not recommend these options.

Wildcards and Stemming

A known problem of solr is the lack of analysis filter support for wildcards. Wildcards are not being analyzed in the same way as normal query terms. The support for wildcards has significantly improved in solr 3.6.0: It is now possible for filters in the analysis chain to support a special "multiterm" interface which will handle wildcard transformation. There are not many filters, though, which support this interface.

Stemmers, in particular, will probably not be able to do anything about wildcard terms in the foreseeable future. Terms with wildcards cannot be stemmed as the usual stemming heuristics need the whole word to be present. This means that wildcard queries will often not match stemmed index entries. If the words "scientist" and "scientific", say, are both stemmed to "scient" then the search "scienti*" will find none of them. This may not be a huge problem in a language like English where relatively unaggressive stemming yields good results. In German, French and other languages with much more mutable grammatical forms, stemming may have to be more aggressive. In this case the combination of wildcards and stemming can produce quite visible problems.

This conflict between tolerance towards varying grammatical forms of the same word and support for wildcard queries cannot be completely resolved with current solr technology. We therefore compromise by using light stemmers and tolerating a certain number of false negatives in our search results.

Query Analysis and Synonym Injection

We recommend that queries be analyzed in exactly the same way as the queried document fields were analyzed at indexing time. Therefore there isn't much to say about query analysis that has not been said before in the analysis chapter. There is only one possible exception to this rule: We may want to consider synonym injection at query time. This means that an additional analysis component could be added to the query analyzer that checks for synonyms, either in a static, manually maintained language-specific thesaurus or in online sources like WordNet. Whenever a synonym is found it will be injected into the token stream and handled as if it had been part of the original query. Whether or not query-side synonym injection should be implemented and from which source is to be decided by the project owner.

Ranking

Ranking of OJS articles is done through the default solr/Lucene ranking algorithm. The algorithm is called “term frequency – inverse document frequency” or “TF-IDF” for short and will be outlined in the next few paragraphs. Lucene-specific details of the ranking algorithm are out of scope for this document and can be looked up in the Lucene documentation, above all in the JavaDoc of Lucene's default Similarity implementation (DefaultSimilarity).

Term Frequency

The term frequency TF(t, d) is the number of times the term t occurs in a given document d.

Inverse Document Frequency

The inverse document frequency of a term t in an index is:

IDF(t) = log(N / DF(t))

Where:

  • t is an arbitrary dictionary term
  • N is the total number of documents in an index. In the Lucene documentation this measure is usually referred to as "maxDoc".
  • DF(t) is the document frequency of term t in an index. The document frequency is defined as the number of documents containing the term t one or more times. In Lucene this is usually called "docFreq".

NB: IDF is finite if every dictionary term t occurs at least once in the document collection (as then DF(t) > 0). If we build our dictionary exclusively from the document collection itself then this is guaranteed and IDF is defined everywhere.

Combined Term / Inverse Document Frequency

Ranking in Lucene is some variant of the combined term/inverse document frequency:

TF-IDF(t, d) = TF(t, d) * IDF(t)

NB: TF-IDF is zero if a document does not contain the term t. If the dictionary contains terms that are not in the document collection then TF-IDF is defined to be zero for this term for all documents. We can usually avoid this by building our dictionary exclusively from the document collection.

Overlap Score Measure

Score(q, d) is the sum over all t in q of TF-IDF(t, d) where q is the set of all terms in a search query.

In other words: A search term contributes most to a document's ranking for a given search query when the term occurs often in that document while having high discriminatory significance in the collection, i.e. when it occurs in only a small percentage of the collection's documents.
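
A small worked example with invented numbers: Assume an index of N = 10,000 articles in which the term "eurokrise" occurs in DF = 100 documents and the term "die" in DF = 9,000 documents. For a document that contains "eurokrise" three times and "die" twenty times (using the natural logarithm):

   IDF(eurokrise) = log(10000 / 100)  ≈ 4.6
   IDF(die)       = log(10000 / 9000) ≈ 0.1
   TF-IDF(eurokrise, d) = 3 * 4.6  ≈ 13.8
   TF-IDF(die, d)       = 20 * 0.1 ≈ 2.1

Although the common word occurs far more often in the document, it contributes much less to the score than the rare, discriminating term.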

In Lucene scoring of search queries with multiple terms is done with a “coordination factor” that works similarly to the approach just outlined. Lucene calculates a score for each field to derive its document-level score. Lucene also further modifies this simple ranking approach for easy customization (e.g. term boost, field boost, document boost, etc.) and efficiency.

Vector Space Model

Another perspective on Lucene's scoring algorithm is by modeling documents as term vectors in a vector space spanning all terms of a dictionary. More precisely: If D is a set of terms (i.e. a “dictionary”) then a single document d can be modeled as a vector V(d) in a card(D)-dimensional vector space.

Example: When using TF-IDF as scoring measure then the n-th component of the vector produced by the vector function V(d) is the TF-IDF of the document for the n-th dictionary term.

In this model a similarity (or distance) measure can be defined that computes the similarity of two documents (or a document and a query).

A common similarity measure is the cosine similarity:

sim(d1, d2) = V(d1) . V(d2) / |V(d1)||V(d2)|

where the dot (.) stands for the vector dot product (inner product) and |V(d)| is the Euclidean norm. This is equal to the inner product of V(d1) and V(d2) after both vectors have been normalized to unit length.

The advantage of this model is that not only distances/similarities between documents but also between a search query and a document can easily be calculated:

sim(d, q) = v(d) . v(q)

where v() stands for the document vector V() normalized to unit length.

NB: The “more-like-this” feature relies on similarity calculations between documents. It therefore requires term vectors to be stored in the index.

Fine-Tuning the Ranking

By default we do not change any preconfigured ranking factors. By using the “edismax” query parser (see above) we do keep the option, though, to fine-tune client-side (query-level) ranking parameters should it become necessary. It is recommended to avoid client-side ranking adjustments. By configuring different solr search request handlers, different ranking approaches can be provided to different OJS installations by changing their search endpoint configuration (see deployment option configuration above).

Potential Additional Ranking Metrics

While TF-IDF is the default scoring/ranking model in Lucene it can be customized by providing so-called boost factors for different entities in the model. These are simple multipliers that can occur on term and document level that increase (multiplier > 1) or decrease (multiplier < 1) the scoring contribution of that entity.

In our case document boost opens up a few interesting possibilities to further tune the relevancy of search results. Here are a few examples of metrics that could be used to “boost” certain articles so that they rank higher in result lists:

  • citation index data
  • usage metrics, e.g. as supplied by the OJS.de statistics sub-project
  • click-through popularity fed back from OJS, i.e. the number of times an article was actually opened after being presented in a result list
  • article recency, i.e. favor articles with a more recent publication date over older articles

The question is how such data could be provided to solr. I propose that we implement an API in OJS that can receive and store document-level boost data for each article. This can be implemented as a non-mandatory setting in the article settings table. If such boost data is present then it will automatically be sent to solr at indexing time. We'll have to implement a normalization method so that editors can enter arbitrary numbers that will then be translated to proper boost factors. Changing the boost data would mean that the article would have to be re-indexed (like any other change to search-related article meta-data).
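
For illustration, in solr's native update XML a document-level boost is expressed as an attribute on the document element (the field value is invented; in the IDH-based import described above the boost value would have to be mapped onto IDH's special "$docBoost" column instead):

   <add>
     <doc boost="2.5">
       <field name="article_id">inst-A-2-15</field>
       <!-- remaining article fields as usual -->
     </doc>
   </add>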

Alternatively, advanced users can provide periodically updated files with document boost data, see http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html.

Instant Search

Instant search means that while the user is still typing in a search request, a first result-preview is being displayed instantly in the result list. It requires a “one page” search interface as explained above. It also requires an additional OJS AJAX request handler that forwards AJAX search events to the solr server and returns results in a format that can be interpreted by the OJS browser code. The OJS development branch has access to the new JavaScript framework that was developed for OMP. We could use it to implement the necessary client-side JavaScript for “instant search”.

Result Presentation

Result presentation is mostly an OJS-side task. It will not differ from the current implementation except for a few details that will be outlined later. The solr search server returns ID fields only, which OJS can use to retrieve all required additional data from its database. There are two reasons for this recommendation:

  • The solr index will consume considerably less space if the original text does not have to be stored.
  • In a subscription-based scenario we want to prevent full texts from being leaked if malicious users gain direct access to the Lucene server. It has to be admitted, though, that not storing full texts is relatively weak protection and can be worked around to a certain extent. The best protection against full-text leakage is proper firewalling of the solr server as described in the deployment option chapter below.

If we want to support highlighting then we have to store galley full texts for all interface languages. Highlighting requires the original non-analyzed text to be present so that the context of search terms can be retrieved. Other necessary changes to the result presentation have already been defined in the interface specification above. Please refer there for more details.

Paging

Paging is supported in solr through a query parameter. OJS will restrict search queries using solr's “start” and “rows” query parameters so that only actually displayed articles will be returned. This reduces the size of messages to be passed over the network.

Highlighting

Highlighting can be supported in solr by a simple additional search query parameter: “...&hl=on...” plus a few configuration parameters defining the highlighted fields and the amount of context to be returned. If highlighting should be supported then we propose to base it on galley full text. In this case solr will automatically return extracts from the galley full text containing query terms. These extracts can then be displayed to the end user as part of the result list.
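
For illustration, a highlighting request could add parameters along these lines (the field names follow the galley full-text naming convention above; snippet count and fragment size are arbitrary examples):

   ...&hl=on&hl.fl=galleyFullText_pdf_en_US galleyFullText_pdf_de_DE&hl.snippets=3&hl.fragsize=120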

NB: Highlighting requires original (non-tokenized) galley full text to be stored in the index which can considerably blow up index size.

Ordering

Lucene allows server-side ordering based on any indexed field. Server-side ordering is especially important when retrieving paged result sets. By default result sets will be sorted by the “virtual” field score which has been described above in the ranking section. Any other field can be specified with the sort parameter, e.g. “...&sort=authors_txtsort asc, score desc...”.

OJS displays "localized" fields in its result set. This localized data may differ from the indexed data in case an article or journal title does not exist in the currently selected interface locale. In this case another locale will be displayed for the articles concerned. This means that data in sort fields may have to differ from the data indexed for query purposes. See the "data model" and "XML format" sections for technical requirements of sortable fields.

Faceting

As described in the interface specification above, we propose faceting for the following search fields:

  • authors
  • publication date
  • disciplines
  • type
  • subject
  • coverage
  • journal title (cross-journal queries only)

Some special filtering of these fields can be applied for faceting:

  • It may make sense to only include the first author for faceting. It is up to the project owner to define this.
  • Faceting on the publication date must be by date range rather than discrete date values.
  • All other fields must be tokenized by separators (e.g. "," and ";") and otherwise left intact so that facets display with the original spelling. Lowercasing facets may make sense, though.

We propose that we only return faceting results for the currently chosen interface language. This means that the fields for faceting cannot be preconfigured in the OJS search request handler but must be provided in the query string, e.g. “...&facet.field=subject_de_DE_facet”. Facets will be selected as links in the faceting block plug-in as described in the interface section above. Selecting one or more facets will result in the original query being re-issued with an additional filter query, e.g. in the case of a date range “...&fq=publicationDate_dt:([2006-01-01T00:00:00Z TO 2007-01-01T00:00:00Z] NOT "2007-01-01T00:00:00Z") ...”.

Alternative Spelling Suggestions

Alternative spelling suggestions for a given search query can be provided based on solr's spellcheck search component. If this should be implemented then we recommend creating a multilingual dictionary based on a concatenated field that contains all meta-data and full text. The spellcheck dictionary will be stored in a separate Lucene index in …/files/solr/data/spellchecker. Such a spellcheck configuration has been implemented in the default configuration and details can be checked there. The dictionary needs to be rebuilt after large updates to the solr index. For performance reasons, we recommend issuing dictionary build commands on demand from the OJS installation after a few updates rather than updating after every commit. There is also an automatic “build after optimize” option for the spellchecker component. We cannot use this as we do not recommend optimization in the embedded scenario (see search interface specification above). We suggest issuing build commands to the usual OJS search interface with the following parameters: “q=nothing&rows=0&spellcheck=true&spellcheck.build=true&spellcheck.dictionary=default”. The q parameter is not usually required to build the dictionary. In our set-up with a preconfigured search request handler, omitting it will result in an error.
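
The corresponding solrconfig.xml entry can be sketched roughly as follows (a simplified sketch; the component, dictionary and field names are assumptions and the authoritative configuration is the one shipped with the plug-in's default solr configuration):

   <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
     <lst name="spellchecker">
       <str name="name">default</str>
       <!-- concatenated meta-data/full-text field used to build the dictionary -->
       <str name="field">default_spell</str>
       <!-- the separate Lucene index holding the dictionary -->
       <str name="spellcheckIndexDir">./spellchecker</str>
       <!-- rebuild on demand from OJS rather than on every commit or optimize -->
       <str name="buildOnCommit">false</str>
       <str name="buildOnOptimize">false</str>
     </lst>
   </searchComponent>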

"More like this"

The “more-like-this” (MLT) feature can be implemented by configuring the solr MLT search component. This component extracts "interesting terms" from a document and then executes a query against the index to identify documents that contain these interesting terms.

Usually the results from the MLT component can be displayed unchanged. In our case, however, this won't work because we want to support a mixture of features (e.g. highlighting, sorting, etc.) not directly supported by the MLT component. Using MLT results directly is also problematic because re-executing the search with the "interesting terms" might produce a different result, which would be non-intuitive from an end user's perspective.

We therefore considered the following alternatives:

  1. Disable all incompatible features on the plugin settings page when enabling the MLT feature.
  2. Disable all incompatible features just for the MLT query itself.
  3. Use the MLT component to extract "interesting terms" and execute a second query with these terms against our usual search component, with all enabled features working.

The first option seems unnecessarily restrictive. MLT queries do not seem to be important (popular) enough to warrant such a drastic approach. Probably no one would enable MLT if it means forgoing sorting, highlighting, alternative spellings, etc.

The second option could be appropriate if the features to be disabled were not so exposed. Unfortunately the MLT component does not support sorting of results, which means that a central element of our UI would be disabled when executing MLT queries. It would be difficult to make this transparent to users. Even worse: subsequent searches with the same terms could return different results, as we'd then query against our usual search request handler, which may handle search terms differently. So this option would result in a rather inconsistent user experience.

The third option also has drawbacks. It means that we have to execute two queries rather than one: a first query to retrieve "interesting terms" and a second query to retrieve the documents matching these terms. This will impact performance of MLT queries. It also means that we cannot use the automatic ranking boost feature of the MLT component as we do not officially support boosting through our regular query interface (although expert users may use it as a non-documented feature as described elsewhere).

As MLT queries do not make up a large proportion of overall user requests, the performance hit does not seem too relevant. Users may intuitively understand that MLT queries are "somehow more complex" and therefore take a bit longer to execute than normal queries. So consistency of the user experience will probably not be a problem. The fact that we cannot use automatic boost can be considered a trade-off for simplicity and consistency of the user interface. The advantage of seeing exactly the query that was executed and being able to reproduce the exact results by manually re-executing the same query with the "interesting terms" seems valuable enough to accept this drawback.

We therefore recommend implementing the third option.

There are several ways to access the MLT component:

  • as a dedicated request handler that usually is supplied a single document id,
  • as a search component of another request handler or
  • as a request handler that ingests full text and proposes similar documents from the index.

The first option is the most frequently used option and seems adequate to retrieve "interesting terms" in our case.

We recommend the following default parameters for the MLT request handler (a query sketch follows the list):

  • mlt=on,
  • mlt.fl=xxx where xxx is set to the document field corresponding to the title and abstract with all languages supported for these fields in the index, e.g. mlt.fl=title_en_US title_de_DE abstract_en_US abstract_de_DE,
  • mlt.interestingTerms=list to return the terms for a subsequent search query
  • q=id:xxx where xxx is the solr document ID (see “Data Model” above) of the article we wish to base our search on,
  • start=0 and rows=0 as we do not want any direct results,
  • mlt.boost=off as we do not support boosting in our regular interface.
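
The following sketch illustrates the recommended two-step approach. The MLT handler path, the document ID and the result handling are illustrative assumptions; only the parameters themselves follow the list above.

 // Step 1: ask the MLT handler for "interesting terms" only (no direct results).
 $mltParams = array(
     'q' => 'id:inst-1-2-3',      // hypothetical solr document ID
     'mlt' => 'on',
     'mlt.fl' => 'title_en_US title_de_DE abstract_en_US abstract_de_DE',
     'mlt.interestingTerms' => 'list',
     'mlt.boost' => 'off',
     'start' => 0,
     'rows' => 0,
     'wt' => 'phps'
 );
 $curl = curl_init('http://127.0.0.1:8983/solr/ojs/mlt?' . http_build_query($mltParams));
 curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
 $mltResult = unserialize(curl_exec($curl));
 curl_close($curl);
 // Terms may come back prefixed with their field name and may need
 // normalization before being re-used as a query.
 $terms = isset($mltResult['interestingTerms']) ? $mltResult['interestingTerms'] : array();
 // Step 2: re-issue the terms as a normal query against our search handler so
 // that sorting, highlighting, paging, etc. work as for any other query.
 $searchParams = array('q' => implode(' ', $terms), 'wt' => 'phps');
 $curl = curl_init('http://127.0.0.1:8983/solr/ojs/search?' . http_build_query($searchParams));
 curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
 $searchResult = unserialize(curl_exec($curl));
 curl_close($curl);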

As we expect the MLT feature not to be used too frequently in most cases we propose not to store term vectors by default. This means that when an MLT request is being issued the corresponding document fields will have to be re-analyzed to derive term vector information. This is slower than storing term vectors but saves storage space. Term vectors can always be activated by advanced users if the default configuration is found to be too slow or resource consuming.

OJS/solr Protocol Specification

Internal API Specification

Several 3rd-party PHP/solr integration libraries exist, including the solr PECL extension, see https://wiki.apache.org/solr/SolPHP. Unfortunately none of these has been officially included into PHP and/or solr distributions. Regularly upgrading these extensions therefore adds maintenance cost and installation complexity. Furthermore, none of the 3rd-party components complies with PKP's policy of maintaining maximum backwards PHP compatibility. Finally, these libraries seem unnecessarily complex and overly generic for our use cases.

We therefore recommend creating a lightweight and specialized PHP wrapper library that hides the internal workings of solr HTTP requests behind an easy-to-use internal service facade. This will allow us to keep OJS handler code separate from solr communication logic while keeping complexity to a minimum. This solr service facade can be maintained within the solr plug-in. A sketch of such a facade follows the hook placement rules below.

We recommend using the PHP curl() extension for solr HTTP communication. This extension is widely used in other areas of OJS and it can therefore be assumed to be present in most OJS server environments. We can use curl() to place GET and POST requests to the solr request handlers as described in the following sections. We propose using the “phps” response writer (“...&wt=phps...”) when we want to use response data in PHP. This is easier to use, faster and less memory intensive than using XML as a response format. When we respond to AJAX calls (e.g. for instant search or auto-suggest), we propose using the JSON response writer (“...&wt=json...”) for easy re-routing of solr responses to the OJS JavaScript client. Administrative calls will always return XML, which we'll have to parse with PKP's PHP4-compatible XML libraries.

The fact that we implement solr integration as a generic plug-in requires changes to the OJS core code: we'll have to introduce additional hooks that call plug-in code at appropriate places in OJS core code. The following rules apply:

  • If possible, hooks should not be placed in code that will later be refactored into a plugin (e.g. classes/search/ArticleSearchDAO.inc.php, and any MySQL index specific methods in classes/search/ArticleSearch.inc.php and classes/search/ArticleSearchIndex.inc.php).
  • We'll introduce an ArticleSearchManager for consistency with naming conventions elsewhere and have article indexing functions delegated to that class. It'll be responsible for invoking whatever plugins are configured and new hooks will mostly be placed there. Hooks should be named accordingly.
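
As announced above, here is a rough sketch of the proposed service facade. Class and method names are illustrative only, not an existing PKP API, and the endpoint assumes the embedded deployment configured later in this document.

 // Lightweight facade hiding solr HTTP communication from OJS handler code.
 class SolrWebService {
     var $searchEndpoint = 'http://127.0.0.1:8983/solr/ojs/search';
     /**
      * Execute a search and return the parsed solr response.
      * @param $params array solr query parameters (q, fq, sort, facet, ...)
      * @return array|null parsed response or null on communication error
      */
     function search($params) {
         $params['wt'] = 'phps'; // serialized PHP is the cheapest format to parse
         $curl = curl_init($this->searchEndpoint . '?' . http_build_query($params));
         curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
         $response = curl_exec($curl);
         $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
         curl_close($curl);
         if ($response === false || $status != 200) return null;
         return unserialize($response);
     }
 }
 // Usage: $service = new SolrWebService(); $result = $service->search(array('q' => 'solar energy'));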

Index Maintenance Protocol

The index maintenance protocol is responsible for enabling write access from OJS to the solr index. It provides functions for adding/updating and deleting documents. Both functions are batch functions that can be invoked with one or many articles at once.

Adding Documents

As we've seen above, articles can be pushed or pulled. In the “push” configuration, OJS will take action whenever one or more articles need to be added or updated. In the “pull” configuration, solr will initiate index updates due to a central update schedule.

Push Processing

Changes to an article trigger the following actions (a sketch of the push call follows the list):

  1. Whenever an aspect of an article changes (article meta-data, galleys, supplementary file meta-data or files), the OJS core code will call the indexing API marking the article as "changed". We therefore recommend renaming the update...(), index...() and delete...() methods of the ArticleSearchIndex class to article(Metadata|File|Files|)Changed()/suppFileMetadataChanged()/articleFileDeleted(). The action of these functions for legacy SQL indexing does not have to change. This is a mere rename to document the changed intent of the call. This leaves it up to the indexing implementation whether it wants to implement the "dirty" pattern or not.
  2. When all changes to an article are done then the core OJS code will inform the indexing API that all changes are done. We recommend introducing an "articleChangesFinished()" method to the ArticleSearchIndex class. Indexing implementations that do not implement the "dirty" pattern will simply ignore that call. Plugins that implement the "dirty" pattern will now collect all changes and update the index accordingly. In the case of the solr plug-in this means that an XML document with all article meta-data, including the corresponding galley and supplementary file meta-data, of all "dirty" articles is sent over HTTP POST to the OJS DIH request handler .../solr/ojs/dih that is part of the recommended default solr configuration.
  3. To keep the HTTP call as light as possible, the XML does not contain galleys or supplementary files but only contains links to full text documents. DIH asynchronously pulls these documents from the OJS server for extraction and preprocessing.
  4. OJS will only mark an article as "clean" if the response given by solr indicates indexing success. To keep the implementation as simple as possible, the first implementation will be synchronous. If this turns out to make the OJS client unresponsive then the existing processing framework can be used to implement asynchronous indexing.
  5. If processing returned an error then a notification will be provided to OJS technical administrators so that they can correct the error. The indexing status of the articles will not be changed to "clean". The next update will either occur when another article changes or when the admin triggers a manual refresh, either through the plug-in home page or through an additional switch of the rebuildSearchIndex.php script.
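
The push call in step 2 could look roughly as follows. The DIH query parameters, the XML builder and the handling of the "clean" flag are assumptions for illustration; only the endpoint is named above.

 // Push the article list XML of all "dirty" articles to the DIH handler and
 // only report success if solr confirms successful indexing.
 function pushChangedArticles($articleListXml) {
     // The exact DIH invocation depends on the data import configuration;
     // "command=full-import&clean=false" is an assumption.
     $curl = curl_init('http://127.0.0.1:8983/solr/ojs/dih?command=full-import&clean=false');
     curl_setopt($curl, CURLOPT_POST, true);
     curl_setopt($curl, CURLOPT_POSTFIELDS, $articleListXml);
     curl_setopt($curl, CURLOPT_HTTPHEADER, array('Content-Type: text/xml; charset=utf-8'));
     curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
     $response = curl_exec($curl);
     $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
     curl_close($curl);
     if ($response === false || $status != 200) {
         // Leave the articles "dirty" and notify the technical administrator.
         return false;
     }
     // Mark the pushed articles "clean" here (omitted).
     return true;
 }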

Pull Processing

A pull processing protocol may be implemented like this (a sketch of the OJS-side web service endpoint follows the list):

  1. Article editing works in the same way as for the push protocol above, with the exception that a call to the articleChangesFinished() method will not do anything (=noop) as indexing will be initiated on the solr-side rather than being initiated by OJS. Whenever a change made to an article (publish, change, unpublish) is complete, the article's state will change to "dirty". The atomic nature of database transactions will make sure that concurrent access to the dirty flag will be properly serialized.
  2. The solr server will implement a scheduler that initiates indexing (e.g. via a cron script). When the scheduler script fires, it will send a parameter-less GET request to a well-known, installation-wide OJS endpoint.
  3. When a request is made to this web service endpoint then all "dirty" articles up to a maximum batch size (the same as for reindexing) will be published to the solr server via XML. The pulled XML (or the request response code?) will indicate whether the response contained all articles or whether another batch needs to be pulled later. Otherwise the XML is in the same format used for push processing so that the DIH push configuration can be used completely unchanged for pull processing.
  4. All retrieved articles (and only those!) will immediately be marked "clean". Changes occurring during the pull request must be able to continue normally. This means that we have to reset the flag selectively to avoid race conditions and to support multi-batch processing.
  5. The server-side scheduler script will save the retrieved XML in a staging folder as a file with a unique ID (e.g. timestamp). Ordering the files by name should lead to a unique FIFO serialization of the files in the staging folder.
  6. If an error occurs (e.g. connectivity problems) while an XML is being received, then the server has to discard everything that has been retrieved so far and no articles may be marked "clean" on the OJS side.
  7. If the XML (or the response code) indicated that more articles must be retrieved then the scheduler will do so until all article changes have been pulled.
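
A very rough sketch of the OJS-side endpoint announced above (steps 2 to 4). All DAO and helper names, the batch size and the "more changes" signal are hypothetical placeholders; error handling (step 6) is omitted.

 define('SOLR_PULL_MAX_BATCH_SIZE', 500);    // assumed batch size
 function pullChangedArticles($articleDao) {
     // Fetch up to one batch of "dirty" articles.
     $dirtyArticles = $articleDao->getDirtyArticles(SOLR_PULL_MAX_BATCH_SIZE);
     $hasMore = ($articleDao->countDirtyArticles() > count($dirtyArticles));
     // Serialize the batch into the same <articleList> format used for pushing.
     $xml = buildArticleListXml($dirtyArticles);
     // Mark exactly the retrieved articles "clean"; concurrent changes keep
     // their "dirty" flag and will be picked up by the next pull request.
     foreach ($dirtyArticles as $article) {
         $articleDao->markArticleClean($article->getId());
     }
     // Indicate via the response whether another batch should be pulled.
     header('Content-Type: text/xml; charset=utf-8');
     header('X-OJS-More-Changes: ' . ($hasMore ? 'yes' : 'no'));
     echo $xml;
 }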

Server-side XML processing (a sketch of the polling script follows the list):

  1. A second polling script on the server side will check whether an XML file exists in the staging folder. It will select the next file to be processed and POST it to the local DIH, which will process it in exactly the same way as in the push configuration.
  2. If processing is successful then the polling script will compress the file and move it to an archive folder. The name of the compressed file should maintain the unique ordering for easier debugging or replay. In case of a processing error the script will pause and try to process the file again. If, after three iterations, the file could not be successfully processed, it will be moved to an "errors" folder for later inspection and the polling script continues to process the next file.
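
The polling script could look roughly like this. Folder locations and the DIH invocation are assumptions; compression and retry handling are simplified.

 // Process staged XML files in FIFO order, archive on success, quarantine on error.
 $stagingDir = '/var/lib/ojs-solr/staging';   // assumed folder locations
 $archiveDir = '/var/lib/ojs-solr/archive';
 $errorDir   = '/var/lib/ojs-solr/errors';
 $files = glob($stagingDir . '/*.xml');
 sort($files);                                // file names encode the FIFO ordering
 foreach ($files as $file) {
     $xml = file_get_contents($file);
     $success = false;
     for ($attempt = 1; $attempt <= 3 && !$success; $attempt++) {
         $curl = curl_init('http://127.0.0.1:8983/solr/ojs/dih?command=full-import&clean=false');
         curl_setopt($curl, CURLOPT_POST, true);
         curl_setopt($curl, CURLOPT_POSTFIELDS, $xml);
         curl_setopt($curl, CURLOPT_HTTPHEADER, array('Content-Type: text/xml; charset=utf-8'));
         curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
         curl_exec($curl);
         $success = (curl_getinfo($curl, CURLINFO_HTTP_CODE) == 200);
         curl_close($curl);
         if (!$success) sleep(30);            // pause before retrying
     }
     if ($success) {
         // Compress and archive, keeping the unique ordering in the file name.
         file_put_contents($archiveDir . '/' . basename($file) . '.gz', gzencode($xml));
         unlink($file);
     } else {
         rename($file, $errorDir . '/' . basename($file));
     }
 }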

Please note that we'll have to implement an additional OJS handler operation that returns the XML article list. We also have to make sure that articles marked for deletion can be processed by DIH (which is not necessary in the case of push processing). We can do this with the $deleteDocById feature of the DataImportHandler.

As indexing is an idempotent deletion/addition process in Lucene, network link or processing errors during any of the above steps will not result in an incomplete or corrupt index as long as DIH correctly confirms indexing. If anything goes wrong in the process, no information will be lost as all offending files will be saved for manual inspection. Once the error has been resolved, the file can be moved back to the staging folder polled by the server-side polling script and it will immediately be processed by the server.

If the index becomes corrupt then a provider has two options:

  1. If a point in time can be identified up to which the index was healthy and a backup of the index at that time exists then the provider can restore the index and select all files from the archive that were pulled after the time of the backup. These files can be "replayed" by decompressing them and moving them to the staging folder. Due to the idempotent nature of the indexing process this should lead to a healthy index in almost all possible situations.
  2. If "replaying" files onto a backup index is not possible then the provider could run a script marking all articles in all installations dirty. The polling mechanism will make sure that even if there are many files coming in due to such a measure, the solr server will maintain constant load until all journals have been re-indexed. If the file names contain some indication about the source of the file (i.e. the installation id) then it is easy to monitor indexing process in such a case by issuing "ls some-file-pattern | wc -l" commands on the staging folder.

XML Format for Article Addition

As laid out in the preprocessing section, we recommend using native solr plug-ins for data extraction. In our case we have chosen the Data Import Handler (DIH) for document extraction and preprocessing.

It has been evaluated whether existing OAI providers (i.e. NLM, MODS or MARC over OAI) could be used with DIH. It has also been analyzed whether the native (or other existing) export formats could be imported. Unfortunately neither is possible because DIH imposes limitations on the OJS/solr data exchange format:

  • DIH's XPath implementation is not complete. Only a subset of the XPath specification is supported. XPath queries that qualify on several attributes cannot be used, which rules out the OJS native export format. We have to provide a simple XML format that can be interpreted with DIH.
  • DIH's Tika integration is usually restricted to a fixed number of binary documents per Lucene document. In our case, however, we have to support indexing of an arbitrary, dynamic number of galleys and supplementary files per article. We work around this limitation by embedding CDATA-wrapped XML sub-documents for galleys and supplementary files into the main XML article list. Such documents can be extracted separately into fields and – together with a special field data source and a custom DIH ScriptTransformer – make DIH "believe" that it is dealing with one binary file at a time. This workaround rules out both the OAI and the OJS XML export formats as DIH source formats in our case.

Fortunately the required OJS/solr XML data exchange format is quite simple. A sample implementation exists which executes a pure SQL script to construct the XML for push to the solr test server from an arbitrary OJS database. The XML format is as follows:

  • <articleList>...</articleList>: This is the root element containing a list of article entities.
    • <article id=”...” instId=”...” journalId=”...”>...</article>: This element is the only allowed child element of the <articleList> element and its sub-elements contain all meta-data and file information of a single OJS article. The id attribute contains a combination of a universally unique OJS installation ID, the journal ID and the article ID. This is necessary so that IDs are unique even when providers collect article data from several installations into a single search index. “instId” and “journalId” contain the installation and journal IDs separately, which will be required for administrative purposes, e.g. batch deletion of articles.
      • <authorList><author>...</author>...</authorList>: Full names of one or several article authors. This and the following elements are placed below the <article> element. If the information for this or any of the following search fields is not available then the element will be missing completely. The order of elements matters in the case of authors.
      • <titleList><title locale=”...” sortOnly="(true|false)">...</title>...</titleList>: The article title together with its locale. Order of sub-elements does not matter for this or the following meta-data fields. The "sortOnly" flag is used to indicate whether the given title is to be used for result set ordering only. This may happen when a title does not exist for a given locale and OJS displays a "localized" title instead which differs from the currently chosen display locale.
      • <abstractList><abstract locale=”...”>...</abstract>...</abstractList>: Localized article abstracts.
      • <disciplineList><discipline locale=”...”>...</discipline>...</disciplineList>: Localized article disciplines.
      • <subjectList><subject locale=”...”>...</subject>...</subjectList>: Localized article subjects.
      • <typeList><type locale=”...”>...</type>...</typeList>: Localized article types.
      • <coverageList><coverage locale=”...”>...</coverage>...</coverageList>: A list of coverage keywords (concatenates geographic, time and sample coverage).
      • <journalTitleList><journalTitle locale=”...” sortOnly="(true|false)">...</journalTitle>...</journalTitleList>: The journal title together with its locale and sort only flag.
      • <publicationDate>...</publicationDate>: The article's publication date in ISO 8601 format without second fractions (“YYYY-MM-DDTHH:MM:SSZ”). All dates are treated as UTC dates. OJS must translate local publication dates into UTC before sending them to solr.
      • <issuePublicationDate>...</issuePublicationDate>: The issue's publication date in ISO 8601 format.
      • <galley-xml>...</galley-xml>: A UTF-8 encoded CDATA field that contains an embedded XML file (including the <?xml …?> header). We have to embed this XML so that solr's DIH extension can treat it separately during import processing. This is a workaround so that we can import several binary files for a single article.
      • <suppFile-xml>...</suppFile-xml>: A UTF-8 encoded CDATA field containing embedded XML with supplementary file data. See <galley-xml> above for an explanation why we embed a secondary XML character stream.

Description of the embedded galley XML:

  • <galleyList>...</galleyList>: Wraps a list of galleys. This is the root element of the XML file embedded in <galley-xml>...</galley-xml>.
    • <galley locale=”...” mimetype=”...” url=”...” />: An element representing a single galley. It has no sub-elements. The mimetype attribute is the MIME type as stored in OJS' File class. The url attribute points to the URL of the full text file. DIH will pull the file from there over the network and extract its content.

Description of the embedded supplementary file XML:

  • <suppFileList>...</suppFileList>: Wraps a list of supplementary files. This is the root element of the XML file embedded in <suppFile-xml>...</suppFile-xml>.
    • <suppFile locale=”...” mimetype=”...” url=”...”>...</suppFile>: An element representing a single supplementary file. It contains further sub-elements with some supplementary file meta-data. See the <galley> element above for the definition of the mimetype and url attributes. OJS has to make sure that the locale is one of the valid OJS locales or “unknown”. This requires internal transformation of the supplementary file language to the OJS 5-letter locale format where possible.
      • <titleList><title locale=”...”>...</title>...</titleList>: The supplementary file's localized title information.
      • <creatorList><creator locale=”...”>...</creator>...</creatorList>: Supplementary file creators.
      • <subjectList><subject locale=”...”>...</subject>...</subjectList>: Supplementary file subjects.
      • <typeOtherList><typeOther locale=”...”>...</typeOther>...</typeOtherList>: Supplementary file types.
      • <descriptionList><description locale=”...”>...</description>...</descriptionList>: Supplementary file descriptions.
      • <sourceList><source locale=”...”>...</source>...</sourceList>: Supplementary file sources.

The <articleList> is the only mandatory element. All other <*List> elements have cardinality 0..1 with respect to their parent elements. All other elements have cardinality 0..n with respect to their parent elements.

The update handler listening at “http://127.0.0.1:8983/solr/ojs/dih” in the default embedded solr server configuration will be able to consume this XML format.
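
For illustration, a hypothetical minimal instance of this format follows. All values are made up; in particular, the exact composition of the id attribute follows the data model section and is only sketched here.

 <?xml version="1.0" encoding="UTF-8"?>
 <articleList>
   <article id="inst-1-2-3" instId="inst-1" journalId="2">
     <authorList><author>Jane Doe</author></authorList>
     <titleList><title locale="en_US" sortOnly="false">A Sample Article</title></titleList>
     <abstractList><abstract locale="en_US">A short abstract.</abstract></abstractList>
     <publicationDate>2012-01-01T00:00:00Z</publicationDate>
     <galley-xml><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
 <galleyList>
   <galley locale="en_US" mimetype="application/pdf" url="http://www.example.com/index.php/journal/article/download/3/1" />
 </galleyList>]]></galley-xml>
   </article>
 </articleList>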

Updating Documents

When a user updates an OJS article, galley or supplementary file, all documents and meta-data belonging to the same article will have to be re-indexed.

Lucene does not support partial update of already indexed documents. Therefore the OJS/solr protocol does not implement a specific update syntax. Adding a document with an ID that already exists in the index will automatically delete the existing document and add the updated document.

See the protocol for document addition for more details.

Deleting Documents

We propose to support four use cases:

  • delete a single article from the index
  • delete all articles of a journal
  • delete all articles of an installation
  • delete all articles in the index

Deletion of a single article from the index is required when an article is being unpublished in OJS (“rejected and archived”). Deletion of articles from a journal or installation will be required when (partially) re-building an index, see the interface specification above. Deletion of all articles in the index only differs from the third case in scenarios S3 and S4. As these are installation-overarching operations we do not recommend providing an end-user interface for this task in OJS. We rather recommend that providers completely delete or move their index directory, or build a new index in the background in a separate core and switch to this core after the re-build by using direct access to the solr web interface.

All other use cases can be supported by calling solr's native update handler “.../solr/ojs/update” with the usual <delete>...</delete> syntax from within OJS. We provide journal and installation IDs in our data model so that we can batch delete all documents from these entities with a simple delete search query.
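
For illustration, deleting all articles of one journal could look roughly like this. The field names inst_id and journal_id stand in for the installation and journal ID fields of the data model and are assumptions here.

 // Batch-delete all documents of one journal via solr's update handler.
 $deleteXml = '<delete><query>inst_id:"inst-1" AND journal_id:2</query></delete>';
 $curl = curl_init('http://127.0.0.1:8983/solr/ojs/update?commit=true');
 curl_setopt($curl, CURLOPT_POST, true);
 curl_setopt($curl, CURLOPT_POSTFIELDS, $deleteXml);
 curl_setopt($curl, CURLOPT_HTTPHEADER, array('Content-Type: text/xml; charset=utf-8'));
 curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
 $response = curl_exec($curl);
 curl_close($curl);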

When working with push updates, all deleted documents can immediately be pushed for re-indexing if required. When working with pull updates, deleted articles will be marked “not indexed” so that they'll be re-indexed automatically the next time a pull request arrives from the solr server.

Subscription-Based Publications

In order for the Solr server to gain access to subscription-only content on the server, its server IP will have to be authorized as an "institutional subscriber". We'll make sure that the normal subscription checks will be valid in a pull scenario. This means that our XML web service for "dirty" articles will only provide access to subscription-protected articles if the requesting server can properly authenticate itself and has been authorized to access the article data.

Search Protocol

The OJS/solr search protocol is the well documented “edismax” query and result format. We do not reproduce the general “edismax” syntax here. Please refer to the official solr documentation for details.

We implement a custom search request handler for OJS search queries. In the embedded scenario it will listen to requests at “http://127.0.0.1:8983/solr/ojs/search”. We do not place queries through solr's default “/select” request handler. That handler should not allow public access in the default configuration as it permits direct requests to solr's administrative request handlers.

Configuring our own request handler has further advantages:

  • We can preconfigure it with mandatory parameters that cannot be changed client-side. This helps to secure our request handler to a certain extent and reduces the number of parameters that need to be passed in from OJS.
  • We enable advanced users to set almost all search parameters (mandatory or default) without having to change OJS code.
  • We enable advanced users to define their own restricted search endpoints (e.g. filtering on a certain category of journals) if they implement a provider-wide search server. These endpoints can then be configured as custom search endpoints in OJS, see “Configuring the Deployment Scenario” above.

We only use a subset of the “edismax” search options. This subset has been described in the “querying” chapter above. Please refer there for details on the protocol parts actually being used by OJS search access to solr.

Deployment Options

According to our requirements, we need to cover a large range of deployment scenarios, from single journal deployments of OJS (S1) all the way up to large system landscapes including integration of OJS search with arbitrary search applications (S4). Fortunately the large majority of the configuration described in this document is independent of the deployment scenario. This means that only very few parameters will differ for the recommended configuration of different deployment scenarios. More specifically we recommend two deployment options:

  • The single-journal and single-installation scenarios (S1 and S2) can be supported with an embedded solr server. The configuration for the embedded server will be part of the default OJS distribution. We call this deployment option the “embedded deployment”.
  • The multi-installation and “just-another-app” scenarios (S3 and S4) can be supported with a central solr server reachable from all OJS servers over the provider's internal network. We call this deployment option the “network deployment”. We believe that even large OJS providers with one hundred or more journals will not require advanced solr scalability features like replication, see the “Index Architecture” discussion above. There is nothing, however, that keeps providers from replicating their OJS core to several servers if they wish to do so. Balancing between replicated servers can be done over an HTTP proxy or by configuring part of the OJS installations with one back-end and part with the other. Such configurations are out of scope for this document, though.

Common Deployment Properties

Installation Requirements

All OJS installation requirements apply unchanged. The following additional installation requirements must be met by the OJS server (embedded deployment) or the solr server (network deployment):

  • Operating System: Any operating system that supports J2SE 1.5 or greater (this includes all recent versions of Linux and Windows).
  • Disk Space: In the case of embedded deployment, the disk the OJS installation resides on should have at least 150 MB of free disk space, and the disk where the "files" directory resides should have enough free disk space to accommodate the search index created by solr. This should be no more than double the space occupied by galleys and supplementary files in that same folder. In the network deployment, disk space requirements for the servlet container and solr binaries depend on the chosen installation details. The space for the index should be at least double the space occupied by all galleys and supplementary files of the journals to be indexed.
  • RAM: Memory requirements depend very much on the size of the indexed journals. If the journals have several GB of article galley files then, for best performance, a few GB of RAM will be required for the solr server and for the operating system's file cache. Smaller installations require less memory. We recommend starting the embedded server with default settings and only revisiting them if performance problems occur in practice. In most cases, default settings will work well.

Code/Binary Distribution

Both deployment options have in common that the solr client and configuration will be integrated into OJS as a generic plug-in. While the plug-in is disabled, the current OJS search function will work unchanged. Enabling the plug-in will switch to solr as a search back-end.

OJS plug-in code will be maintained within PKP's official github repository. The already existing SWORD plug-in creates a precedent for the integration of 3rd-party software libraries through PKP's plug-in mechanism. No Java software has been integrated into OJS by way of plug-ins so far. We therefore expect that a few additional integration techniques need to be developed.

Our proposed integration approach is described in the README.txt provided with the plug-in and will be summarized here.

For several reasons Java binaries for jetty or solr/Lucene should not be part of PKP's default distribution:

  • The binaries are large and would inappropriately blow up the OJS distribution as many OJS users will not want to use solr search.
  • When distributing binaries, PKP will have to take care to always upgrade binaries to the latest version and even release hot fixes when security updates occur. This adds a lot of unnecessary maintenance cost.

An integration as a git subproject as in the case of SWORD also does not seem appropriate as jetty/solr do not use git for their projects and maintaining our own jetty/solr binary git release server would be relatively costly.

We rather recommend that users download jetty/solr binaries from their original sources unchanged and extract them into well documented destinations within the solr OJS plug-in. A preconfigured installation script can then take care of copying or linking binaries to their required locations.

We cannot prescribe a precise installation procedure for solr in a network deployment as this will largely depend on the provider's installation policy. Most providers will probably already have a preferred servlet container and may want to install and configure the container and solr through OS-specific installation mechanisms.

Security

Solr's example server does not come preconfigured with security in mind. Solr itself does not provide any authentication or authorization mechanisms. Securing solr must mostly be done through the servlet container and by properly protecting the server solr runs on. The following recommendations should be followed:

  • Servers that host solr must be properly firewalled. Only search client applications should have (restricted) access to the solr search and update interfaces. In the case of the embedded scenario this means that solr should not be exposed to the network at all.
  • Administrators should pay special attention to potential CSRF risks when developing their firewall strategy for solr. Clients with access to solr (e.g. browsers of admin staff) should be protected from 3rd-party “takeover”.
  • Exposing solr to the public is strongly discouraged. If done, an authentication scheme must be implemented in the servlet container or HTTP proxy to limit access to solr's admin interface, the OJS DIH import handler, the default solr update handler and the generic select handler. A sample configuration for jetty using BASIC HTTP authentication is provided in the default configuration. This is not a recommended protection mechanism, though!
  • We have chosen to provide custom search handlers rather than making search available through the generic select handler. The generic select handler allows unsecured access to update and admin handlers and must therefore NOT be exposed to the public.
  • We recommend disabling remote streaming in solrconfig.xml: enableRemoteStreaming = false. Otherwise content of arbitrary files the solr process has access to locally or over the network will be exposed to whoever can access solr!
  • We recommend disabling JMX unless actually used.
  • We recommend never to use solr's example configuration unchanged as it is not secure.

As most providers operate in an Open Access scenario, we do not recommend access limitations to the search handler by default (except for the firewalling as described above). The default recommended configuration will expose the query interface to all users on the provider's network who have HTTP access to the solr endpoint.

In order to limit access in a subscription based environment and reduce the amount of data to be transferred over the network, our custom search handler was configured with mandatory (“invariant”) query parameters limiting – among other things – the returned fields to the article ID field and search score. Further recommendations for subscription-based journals have been given in this document where appropriate.

Deployment Descriptors

The default solr deployment descriptor has been provided in plugins/generic/solr/embedded/solr/conf/solrconfig.xml. This descriptor is recommended for both embedded and network deployments.

A default jetty configuration has been developed for the embedded scenario, see plugins/generic/solr/embedded/etc/*.*

Details of recommended solr and servlet container configurations for both scenarios will be given in the following sections.

Embedded Deployment

The embedded deployment option will work for the large majority of OJS users. With a few easy and well-documented additional installation steps it is possible to transform every OJS server into a solr server that should be reasonably secure for the majority of OJS users. We have laid out these steps in the README.txt that comes with the plug-in; they will also be displayed on the plug-in home page as long as no working solr server has been configured for the plug-in.

The embedded deployment works with a preconfigured Jetty server and solr binaries directly deployed to the special plug-in directory “plugins/generic/solr/lib”. It is sufficient to download and extract the binaries and execute an installation script to get up and running, both on Linux and Windows operating systems. We pre-package all solr configuration required for embedded deployment inside the plug-in. No additional manual configuration is usually required. Transient data, i.e. the index and the spellchecking dictionary, will by default be saved to the files_dir configured in config.inc.php. We'll create a “solr” sub-directory there for our purposes.

See plugins/generic/solr/embedded/bin/start.sh for further details of the configuration of the embedded scenario.

Security

The configuration of the embedded scenario follows a "secure by default" approach. While we do recommend proper firewalling of the OJS server even in the embedded scenario, the default configuration will provide basic protection even with no firewall in place. We do this by binding the embedded Jetty server to the loopback device (127.0.0.1) which should prohibit external access to the server on most operating systems. The above comments about CSRF vulnerability of solr apply to the embedded deployment if users log into the OJS server and open a browser (or other client software with network access) there.

Jetty/solr Upgrade

Even in the embedded scenario, jetty and solr will need to be upgraded from time to time, e.g. in case of security or performance updates. In this case the new versions can simply be extracted into “plugins/generic/solr/lib” following the instructions in README.txt.

Starting/Stopping Solr

In the embedded scenario, solr can be started from within OJS with a background exec() call of a start script running a daemonized version of jetty with proper start parameters. On Windows this will probably not work without additional installation steps to create a system service. We may alternatively try to work around this restriction by running Jetty within a “permanent” PHP background process (e.g. http://stackoverflow.com/questions/45953/php-execute-a-background-process). Whether this works has to be tested in practice. It doesn't seem to be a very scalable and reliable option, though.
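
A minimal sketch of such a background start call (Linux/Unix only; whether this is sufficient in a given environment has to be tested):

 // Start the embedded jetty/solr server in the background from within OJS.
 // The start script path corresponds to the plug-in layout described below.
 $startScript = 'plugins/generic/solr/embedded/bin/start.sh';
 exec($startScript . ' > /dev/null 2>&1 &');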

Alternatively the Linux or Windows shell solr start script provided in plugins/generic/solr/embedded/bin can always be executed directly on the OJS server.

In the embedded scenario, the privileges of the web server / PHP user are probably appropriate for the solr server too. This will be the default case when starting solr from within OJS. Users are free, though, to execute the start script manually with any other user as long as they make sure that that user has write permissions to the solr index files.

Logging

Analyzing search query logs is a great tool to optimize search. We do not recommend enabling query logging by default in the embedded scenario, though. Most users opting for the embedded scenario will not be able to interpret query logs, so these logs will just unnecessarily occupy disk space. The default configuration sets logging levels to obtain enough information on the console when users need support through the forum or other remote communication means.

Network Deployment

The network deployment option enables large service providers to connect any OJS installation to a solr server running in an arbitrary servlet container (e.g. Jetty or Tomcat) deployed somewhere on the local network. We do not give specific installation instructions for solr servers deployed like this as these instructions depend on the provider's individual OS and (in the case of Linux) OS distribution. We make sure, though, that providers can copy the solr configuration directory provided with the solr plug-in unchanged and plug it into their servlet container. This enables providers to get up-and-running with an OJS-compatible solr installation in very little time.

We'll also recommend providing full step-by-step installation instructions for a well-known Linux distribution, e.g. Debian/Ubuntu. These can then usually be adapted easily to other distributions as well.

Feature Implementation Matrix

The feature implementation matrix details search front-end and back-end features that must be implemented to provide minimum search functionality that guarantees compatibility with the current OJS search as well as additional optional features. It also contains back-end features that provide additional administrative advantages to providers or improve index maintenance.

Every entry in the matrix contains

  • a short feature description including its relevance to the different deployment options where applicable,
  • the OJS authorization level to access the feature,
  • whether it is a feature already present in OJS' search implementation or whether it is a new feature,
  • a description of the feature's business value,
  • an approximate implementation effort classification differentiating between the OJS back-end and user interface,
  • alternative implementation options (if any),
  • test cases (if defined by FUB or their partners) and
  • comments

The complete feature implementation matrix can be found here: https://docs.google.com/spreadsheet/ccc?key=0ArYsBcy_S9NkdFlBS0VqcE9wQjFHU3NhOFBFT191dHc&pli=1#gid=0

The feature implementation matrix is meant as a specific guideline for the project owner to select and prioritize features to be implemented in future projects. It also may be used as an implementation guideline for 3rd party service providers executing future implementation projects.