- 1 Overview
- 2 Project Requirements
- 3 OJS User Interface
- 3.1 Core Code Changes vs. Integration as a Plug-In
- 3.2 Search Interface
- 3.3 Administration and Configuration Interface
- 4 Indexing
- 4.1 Index Architecture
- 4.1.1 Single Index vs. Multi-Index Architecture
- 4.1.2 Implications of Multilingual Support for the Index Architecture
- 4.1.3 Index Architecture Recommendations
- 4.2 Data Model
- 4.3 Document Submission and Preprocessing
- 4.3.1 Existing OJS Document Conversion vs. Tika
- 4.3.2 Local vs. Remote Processing
- 4.3.3 Push vs. Pull
- 4.3.4 Implications of Multilingual Document Processing
- 4.3.5 Custom Preprocessing Wrapper vs. solr Plug-Ins
- 4.3.6 Solr Preprocessing plug-ins: IDH vs. Cell
- 4.3.7 Should we use Tika to retrieve Meta-Data from Documents?
- 4.3.8 Transmission Protocol
- 4.3.9 Submission and Preprocessing Recommendations
- 4.4 Analysis
- 4.4.1 Precision and Recall
- 4.4.2 Multilingual Documents
- 4.4.3 Character Stream Filtering
- 4.4.4 Tokenizing
- 4.4.5 Token Filtering
- 4.4.6 Stemming
- 4.4.7 Special Fields
- 4.4.8 Field Storage
- 4.4.9 Default Implementation
- 5 Querying
- 5.1 Query Entry and Auto-Suggest
- 5.2 Query Parser
- 5.3 Query Transformation and Expansion
- 5.4 Query Analysis and Synonym Injection
- 5.5 Ranking
- 5.6 Instant Search
- 5.7 Result Presentation
- 5.8 Paging
- 5.9 Highlighting
- 5.10 Ordering
- 5.11 Faceting
- 5.12 Alternative Spelling Suggestions
- 5.13 "More like this"
- 6 OJS/solr Protocol Specification
- 7 Deployment Options
- 8 Feature Implementation Matrix
The following two sub-sections provide an overview of the project background and the structure of this document. Read them as a guide to this document, or use them to jump quickly to specific sub-sections if you prefer to skim parts of the document first.
The Center for Digital Systems (CeDiS) of Free University Berlin (FUB) currently implements “OJS.de” – a project funded by Deutsche Forschungsgemeinschaft (DFG), the German government's science funding organization. The project has been set up to adapt OJS even better to the needs of OJS users in Germany and other German-speaking countries while – wherever possible – creating value for the larger OJS community, too. One of the project tasks is to implement an “optimized search function”. Three main goals should be achieved:
- The current OJS search function experiences problems in dealing with multilingual content. The optimized search function should be able to deal with documents in all supported OJS languages.
- OJS search should – where possible – benefit from additional search features provided by Lucene/solr.
- The current search function does not scale well. The optimized search function should work even for large OJS providers who want to provide a central search server including journals from several separate OJS installations.
Currently OJS provides custom search across article meta-data and full texts. A simple algorithm is used to process text for search: Text is first split on white space (“tokenization”), then common words (“stopwords”) are removed from the token stream. All remaining tokens are stored in a relational database table. This yields satisfactory search results for languages like German or English. It does not work for languages that use logographic writing systems like Japanese or Chinese, which do not separate words with white space. Such languages should be made searchable from OJS. In addition, the new search platform should support advanced search features, e.g. improved ranking, faceted search or searching across several OJS installations.
From an end user, administrator and OJS provider perspective the new Lucene/solr solution must implement at least all currently available OJS search functionality. Several deployment scenarios should be supported (ranging from single journal installations to large OJS provider deployments). Depending on the deployment scenario, design principles like simplicity and flexibility have to be reconciled. Both the deployment scenarios and the corresponding design principles will be detailed below.
The new search function should be implemented based on the enterprise search server “solr” and the underlying “Lucene” search framework. As solr and Lucene provide complex and flexible configuration support, a detailed specification is required that defines the exact search functionality and user interface, the architecture of the search platform and the integration of OJS with solr/Lucene. Potentially conflicting project goals have to be reconciled separately for each deployment scenario. This document specifies the exact scope, requirements, user interface, design principles and technical architecture of the new search function.
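The limitation of the current white-space based algorithm can be illustrated with a minimal sketch. The stopword list, the table layout and the sample texts are assumptions made for this illustration only; they are not taken from the OJS code base.

```python
import sqlite3

# Hypothetical stopword list; the real OJS list is much longer.
STOPWORDS = {"the", "a", "of", "and", "in"}

def tokenize(text):
    """Split on white space, lower-case, and drop stopwords."""
    return [token for token in text.lower().split() if token not in STOPWORDS]

def index_document(conn, doc_id, text):
    """Store one row per remaining token (assumed table layout)."""
    conn.executemany(
        "INSERT INTO keywords (doc_id, token) VALUES (?, ?)",
        [(doc_id, token) for token in tokenize(text)],
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (doc_id INTEGER, token TEXT)")

english_title = "The history of open access publishing"
chinese_title = "開放獲取出版的歷史"  # sample text without any white space

index_document(conn, 1, english_title)
index_document(conn, 2, chinese_title)

english_tokens = tokenize(english_title)
chinese_tokens = tokenize(chinese_title)
```

The English title yields four searchable tokens. The Chinese title contains no white space, so it ends up as a single long token that only an exact match on the whole title would find.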
Structure of this Document
This document is divided into several conceptually distinct parts: First we enumerate all project requirements that guide the scope of this specification as well as our architectural choices and recommendations. Based on these requirements we specify several distinct aspects of the system:
- the end-user visible OJS interface (administration, configuration and search)
- the indexing back-end (submission and preprocessing of meta-data and full texts, extraction, transformation, analysis and storage of term data; index configuration and maintenance)
- the search query back-end (submission, transformation, parsing and analysis of the query; ranking, paging and highlighting of the result list; advanced browsing features like faceting, alternative spelling suggestions and “more-like-this” search proposals)
- the network protocol for communications between the OJS installation and the search engine for all indexing, querying and configuration needs
- the recommended deployment options for different usage scenarios (from single journal installations to large OJS provider deployments)
For each of these system parts and for all possible usage scenarios we'll identify necessary changes to OJS and the appropriate configuration of the solr/Lucene back-end. Once the design and implementation options have been identified and recommendations have been made for all aspects of the system, we'll be able to compile a complete feature implementation/recommendation matrix. This will help project owners to select and prioritize specific, well-defined features for implementation and will serve as an initial guideline for subsequent implementation phases of the project.
The requirements for this project can be divided into the following areas:
- Deployment scenarios: the proposed search architecture should scale from small single-journal deployments up to large OJS providers that potentially host hundreds of OJS installations.
- Design principles: depending on the deployment scenario, principles like simplicity and flexibility have to be reconciled.
- Compatibility: a specific list of search features has to be implemented to guarantee compatibility with the existing OJS search function.
- Advanced design problems: Solr provides a large number of configuration options. Alternative options should be identified and implementation choices be defined for each feature or deployment scenario.
The following sub-sections will detail these requirements to provide a guideline for the implementation recommendations to be made in this document.
OJS is being used in very different organizational contexts that range from individual scientists publishing their own OJS journal on shared hosting servers all the way up to specialized OJS providers hosting hundreds of distinct OJS installations for a large number of 3rd-party publishers.
The new search function should work in all these scenarios and must therefore support:
- S1: search across articles of a single journal
- S2: search across multiple journals of a single OJS installation
- S3: search across various OJS installations within the provider's network
- S4: search across various applications (including one or more OJS installations) within the provider's network
We will reference these scenarios with their abbreviations (S1-S4) below, where necessary.
Depending on the deployment scenario several conflicting design principles have to be reconciled:
- Simplicity and transparency: One of the crucial strengths of OJS is its simplicity and low entry level of installation, configuration and use. This advantage should be maintained as far as possible.
- Functional robustness and flexibility: The currently existing search function should remain available as an option. It should be possible to configure whether to use the current or the newly developed solr search for an OJS installation.
- Compatibility with PKP OJS development: The solr integration should be compatible with the PKP development code line.
The relative importance of the design principles will depend on the deployment scenario and its specific requirements. As the design principles may conflict with each other, it is possible that a compromise has to be made for each deployment scenario.
Compatibility with the Current OJS Search
The current OJS search function implements the following features that must be supported by the newly developed system, too:
Basic indexing/search features:
- search across all journals of an OJS installation
- search for author, title, abstract, keywords and full text.
- search for publication type, coverage, supplementary files and publication date (advanced search only)
- Full text documents can be indexed if they are in HTML, PDF, PS or Microsoft Word format.
- no distinction between lower and upper case
- ignore common words of low relevance (“stopwords”)
- list all documents by default that contain all search terms (implicit “AND”)
- select documents that contain one of the given terms (“OR”)
- only select documents that do not contain a given term (“NOT”)
- implement advanced search syntax, e.g. archive ((journal OR proceeding) NOT dissertation)
- search for exact word phrases
- support wildcard searches (“*”)
- Search results contain the corresponding articles and can be paged.
- The index can be re-built from within OJS.
Advanced Design Problems
The following additional design problems have to be explicitly addressed in this document.
- How to implement multi-client capabilities for the configuration of solr, the communication interfaces and data?
- Which users (role) will install and configure solr – in other words: will the configuration be done on journal, publisher or overall system level?
- Scalability: When should features like “distributed search” or “replication” be implemented?
- How should solr be deployed?
- Establish complete configuration recommendations.
- Which platforms will be supported (e.g. OS, servlet container)?
- How will the search server be integrated with the OJS PHP environment?
- Can solr be integrated as a plug-in? What are (dis-)advantages of such a deployment option?
- Which disadvantages or problems may be expected when integrating OJS search with an organization-wide search (S4)?
- Which field types and schema should be defined?
- Which tokenizers and filters should be used?
- How many indexes/cores are required?
- To what architectural level will these indexes correspond (e.g. per journal, per installation)?
- When and how will documents be indexed (addition, update, optimization and deletion)?
- How can the index be re-built?
- How will data be sent to solr?
- Will documents be parsed with a native solr extension or is an external program required?
- Which further file formats will be supported?
- Will meta-data be extracted from documents?
- Will all manifestations of an article be supported (e.g. when both HTML and PDF versions are available)?
- Which search syntax will be supported? (Ideally the search syntax should be identical to the currently existing OJS search syntax.)
- How can auto-suggestions be implemented?
- How could the ranking be implemented (spanning several indexes where necessary)?
- Which after-search options (e.g. sorting) will be available?
- How could faceting be realized?
High-Level Feature Summary
The following table shows an overview of important requirements and deployment scenarios. It is meant as a high-level summary, not as an exhaustive list.
| ||Single Journal (S1)||Multi-Journal (S2)||Multi-Installation (S3)||Institution-wide Index (S4)|
|Document types||article meta-data, galleys and supp. files||article meta-data, galleys, supp. files + arbitrary additional documents|
|Document source||single journal||several journals of a single installation||journals across (groups of) installations||several installations + arbitrary external applications|
|Search fields (simple search)||author*, title, abstract, galley content, keyword search (discipline, subject, type**, coverage***)|
|Search fields (advanced)||author*, title, discipline, subject, type**, coverage***, galley content, supp. file content, publication date|
|Supported document languages||English, German, Spanish, Chinese, Japanese + a reasonable fallback for all other languages and foreign language citations/mix-ins|
|Document formats||Plaintext, HTML, PDF, PS, Microsoft Word|
|Basic search syntax||AND (default), OR, NOT, nesting of queries, wildcards (*), phrase search|
|Result presentation||Paged results are returned on article level, ranked by relevance using term frequency/inverse document frequency scoring (“TF-IDF”).|
|Optional (advanced) features||auto-suggestion, faceting, alternative ranking criteria, highlighting, search proposals (e.g. alternative spelling or “more-like-this”)|
* contains first name, middle name, last name, affiliation, biography
** usually contains a research approach or method for the article
*** consists of the article's geo coverage, chronological coverage and "sample" coverage
Test Data and Sample Queries
Requirements are covered and exemplified by a number of sample queries. These are an integral part of the requirements specification of this project. All sample queries are executed against a mixed-language, mixed-discipline corpus of OJS test journals and articles. Both sample queries and sample data are provided by FUB and their partners.
Test data were taken from live OJS journals. Wherever possible complete copies of journals were made. When full copies were not available, partial content (selected journal issues and/or articles) was imported into an OJS test database for indexing and querying.
The following process was applied to collect sample queries:
- An online form simulates the OJS search form (simplified and advanced).
- Test users (editors and readers) of various OJS journals were asked to provide realistic test queries.
- Submitted test queries were executed against the test corpus.
- Result sets were returned to test users for review.
- Search results (precision, recall, ranking) were tuned according to user feedback.
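The review step above boils down to comparing the returned result set with the documents that test users judged relevant. A minimal sketch of that comparison follows; the function name and the sample document ids are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: document ids in the result set
    relevant: document ids the test user judged relevant
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    # Precision: share of returned documents that are relevant.
    precision = hits / len(returned) if returned else 0.0
    # Recall: share of relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
```

In this example two of the four returned documents are relevant, and two of the three relevant documents were found.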
OJS User Interface
The following sub-sections propose changes to the OJS user interface for potential additional search features. We will only describe new or changed features. A full description of the existing search interface is not in scope for this document.
Core Code Changes vs. Integration as a Plug-In
The question whether solr/Lucene should be integrated as a plug-in or whether changes should be made directly to OJS core code is not only a user interface question. But it certainly makes a difference whether solr-related configuration options are separated out into a plug-in or whether they appear on core set-up pages like administrative settings or journal set-up.
The main question is whether solr/Lucene will be used by a majority of users or not. If only a minority of users opts to use solr/Lucene then it would not be appropriate to “pollute” the core configuration pages with those options. If, however, many users are interested in using solr/Lucene, then hiding the feature away in a plug-in would make it unnecessarily difficult for those users to find the options they are looking for.
It is difficult to decide this without asking a representative number of users. From prior experience with similar changes, though, it seems reasonable to assume that the majority of users will continue to use the existing search interface. This is above all because solr/Lucene requires Java to be present in the hosting environment (see installation requirements below), which is often not the case. It also requires a certain amount of additional installation and configuration. And it requires a servlet container like jetty to be up and running all the time. While we'll reduce the installation and maintenance overhead to a minimum, the solr/Lucene search back-end will still be “heavier” and more difficult to deploy than the default search implementation.
There are important user groups, though, who definitely require an improved solr search back-end. These are OJS service providers and publishers of journals that contain content in non-Western languages which are not supported by the current search back-end. While the former are advanced users who will be well acquainted with OJS plug-ins, the latter may not be. If solr/Lucene is moved into a plug-in then this should be well advertised to the second user group, e.g. through OJS forums which rank well in Google.
Other, more technical arguments are in favor of factoring the solr/Lucene integration into a plug-in: It will be easier for the PKP core development team to review and maintain the new code if it is concentrated in a single place. And it will be easier to port the code to other PKP applications like OCS or OMP because the interface points of a plug-in with the core code are relatively easy to identify. Providing solr integration as a plug-in will also make it easier to include block plug-ins as needed for use cases like faceting. A disadvantage of implementing the integration as a (generic) plug-in is that we'll probably have to introduce a significant number of additional plug-in hooks into OJS core code with little future potential for re-use. This can however be mitigated by factoring the search plug-in into its own plug-in category later, if PKP so wishes.
With these arguments in mind and after consulting the PKP core development team, we recommend integrating solr/Lucene and jetty as a generic OJS plug-in.
In accordance with the design principles defined above, the OJS search interface should be changed as little as possible. This means that existing features are to be maintained unchanged, no matter what search back-end will be used. The following sections only describe changes required by additional search features that are not part of the current search solution and may be optionally provided by the new solr search function. As before, all search features are open to the public. By using forced return field configurations for our search interface (see “Querying” below) we'll make sure that full texts of subscription-based journals cannot leak. Subscription-based journals may not want to enable highlighting, though (see “Result List” below).
The search syntax of the solr-driven search will be a super-set of the syntax currently provided by OJS. This means that all queries that work in the current OJS search will be supported in the same way by the solr back-end.
Additionally, any search query understood by solr's “edismax” query parser (see the corresponding solr documentation) will be supported. Some advanced search options that are only supported when searching via the Lucene back-end are:
- a question mark as a wildcard allows matching a single letter
- phrase query with term proximity (e.g. “some phrase”~3 finds documents containing “some” and “phrase” with not more than three words in between)
- additional ranking parameters like term boost and field boost
- fuzzy queries (e.g. research~ would match words similar to “research”)
- range queries (e.g. on the publication date)
These details will be completely transparent to most end users while still giving advanced users the full query power of solr/Lucene directly from OJS if they so wish.
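As a rough illustration of how such queries could be handed to solr unchanged, the following sketch builds request parameters for the “edismax” parser. The field names are placeholders, not the schema defined in this document, and using `mm=100%` to approximate the implicit “AND” is an assumption whose exact behaviour with explicit operators depends on the solr version.

```python
from urllib.parse import urlencode, parse_qs

def build_search_params(user_query, fields=("title", "abstract", "galley_full_text")):
    """Build request parameters for a solr edismax search.

    The field names are hypothetical placeholders.
    """
    return {
        "defType": "edismax",    # select the extended dismax query parser
        "q": user_query,         # the user's query is passed through unchanged
        "qf": " ".join(fields),  # fields searched when the query names none
        "mm": "100%",            # assumption: require all terms (implicit AND)
    }

example_queries = [
    "archive ((journal OR proceeding) NOT dissertation)",  # existing OJS syntax
    '"some phrase"~3',                                      # phrase with proximity
    "research~",                                            # fuzzy query
    "publi?h*",                                             # single and multi character wildcards
    "publication_date:[2010-01-01T00:00:00Z TO *]",         # range query on a hypothetical field
]
query_strings = [urlencode(build_search_params(q)) for q in example_queries]
```

Each entry of `query_strings` could be appended to the URL of a solr search handler. OJS itself does not need to parse the advanced syntax in this approach.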
While the user types a query into a search box (simple or advanced), search terms that start with the letters of the term currently being entered will be proposed. The proposed terms will be taken from all terms indexed for the search field the user is typing in (“query term completion”).
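Query term completion amounts to a prefix lookup in the sorted list of terms indexed for one field. The following sketch shows the idea in plain Python; the function name and the sample terms are made up, and a real implementation would delegate the lookup to solr.

```python
from bisect import bisect_left

def complete_term(indexed_terms, query, limit=5):
    """Propose completions for the last term the user is typing.

    indexed_terms must be a sorted list of lower-case terms of one field.
    """
    words = query.lower().split()
    if not words:
        return []
    prefix = words[-1]
    # Jump to the first term that is not smaller than the prefix.
    start = bisect_left(indexed_terms, prefix)
    suggestions = []
    for term in indexed_terms[start:]:
        if not term.startswith(prefix) or len(suggestions) == limit:
            break
        suggestions.append(term)
    return suggestions

title_terms = sorted(["journal", "journalism", "journey", "open", "publishing"])
suggestions = complete_term(title_terms, "open jour")
```

Only the last term of the query ("jour") is completed, the preceding terms are left untouched.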
Alternative Spelling Proposals
After executing a search, OJS may propose alternative spellings of the same search query. These alternative search proposals – if they exist – may be offered as hyperlinks above or below the result list. Clicking on one of these hyperlinks will immediately execute the alternative search and return the corresponding result set.
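To illustrate what an alternative spelling proposal is, the following sketch compares a query term with a list of indexed terms using Python's standard `difflib` module. This is only a stand-in for illustration: the similarity cutoff and the sample terms are assumptions, and the sketch does not describe how solr computes its proposals.

```python
from difflib import get_close_matches

def propose_spellings(term, indexed_terms, max_proposals=3):
    """Return indexed terms that are similar to a possibly misspelled term."""
    if term in indexed_terms:
        return []  # the term exists in the index, no proposal needed
    return get_close_matches(term, indexed_terms, n=max_proposals, cutoff=0.8)

indexed = ["journal", "journey", "manual"]
proposals = propose_spellings("jurnal", indexed)
```

Each proposal would be rendered as a hyperlink that re-executes the search with the corrected term.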
Results will be presented and paged in the same way as for the existing OJS search as long as no additional search features are being activated.
Ranking is according to the default Lucene TF-IDF ranking method, see the “Ranking” chapter below for details. We may optionally provide alternative ranking metrics, see “Custom Document Ranking” below.
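The principle behind TF-IDF ranking can be sketched in a few lines. The sketch below uses a simplified textbook variant; it is not Lucene's exact scoring formula, which applies additional normalization and boost factors. The sample documents are made up.

```python
import math

def tf_idf_scores(query_terms, documents):
    """Score each document by summing tf * idf over the query terms.

    documents maps a document id to its list of tokens.
    """
    n_docs = len(documents)
    scores = {}
    for doc_id, tokens in documents.items():
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)  # term frequency within this document
            df = sum(1 for other in documents.values() if term in other)
            if tf and df:
                # Terms that occur in few documents get a higher weight.
                score += tf * math.log(1 + n_docs / df)
        scores[doc_id] = score
    return scores

docs = {
    "a": ["open", "access", "journal"],
    "b": ["open", "source", "software"],
    "c": ["journal", "ranking"],
    "d": ["journal", "review"],
}
scores = tf_idf_scores(["open", "journal"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Document "a" ranks first because it contains both query terms. Document "b" ranks above "c" and "d" because "open" occurs in fewer documents than "journal" and therefore carries more weight.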
As Lucene is not very good at retrieving documents far down the result list, we'll restrict the result list to 1000 documents regardless of the actual size of the result set. This keeps users and web crawlers from executing overly expensive query operations.
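A minimal sketch of how the cap could be enforced when translating a page number into an offset and a row count follows. The function and constant names are made up for this illustration.

```python
MAX_RESULTS = 1000  # hard cap proposed in the text

def result_window(page, items_per_page, total_hits):
    """Return (start, rows) for a 1-based page, never reaching past the cap.

    Returns None if the page lies outside the reachable result list.
    """
    reachable = min(total_hits, MAX_RESULTS)
    start = (page - 1) * items_per_page
    if page < 1 or start >= reachable:
        return None
    rows = min(items_per_page, reachable - start)
    return start, rows

first_page = result_window(1, 25, 5000)
last_page = result_window(40, 25, 5000)
beyond_cap = result_window(41, 25, 5000)
```

With 25 items per page, page 40 is the last page that can be requested, no matter how many documents match the query.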
We may implement result “highlighting”. If enabled (see “Configuring Search Features” below) then an extract from the full text may be provided containing highlighted search keywords from the query. This helps end users to better judge the relevance of search results.
We may implement “instant search” functionality. This means that searches are executed in the background while the user is still entering a query. A few top results could be displayed immediately, without the user having to hit the submit button, using AJAX requests and dynamic HTML. Implementing “instant search” would require us to place the search query field(s) and the result set on the same page (see Google's instant search feature for an example). This would be a considerable deviation from the current “two-page” OJS search interface and would require us to adapt the search interface for the default OJS search, too. The default OJS search would not need to support instant search itself, but it would have to be implemented as a “one page” search solution, too.
Result Manipulation and Refinement
Currently the ordering of search results cannot be manipulated. Order is by “relevance” according to the default ranking method.
We may provide an optional configuration option to enable alternative ordering criteria, e.g. alphabetically by author or title or by publication date. When enabled as an optional search feature, such ordering criteria could appear as a drop-down at the top of the search result list.
We may also propose a dynamic list of filter criteria (e.g. authors, publication date ranges, disciplines, type, subject and coverage keywords) to further refine the result list. This is called “faceting”. Facets could be provided as a list of links organized by facet category (aka search field) in an optional block plug-in. This allows OJS administrators to easily enable/disable facets and flexibly place them according to their journal design.
Clicking on one of the facet links will re-execute the original search with an additional filter as defined by the clicked facet. Once a search has been re-executed with a given facet, the facet will be displayed above the result list with a “delete symbol” next to it. Clicking on the delete symbol will re-execute the search without the deleted facet.
While displaying a facet filtered search, facet categories corresponding to active filters will disappear from the list of available facets. Multiple facet filters can be applied by clicking another facet link while displaying an already filtered search.
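The facet interaction described above can be sketched as a small set of helper functions that add or remove filters and build the corresponding solr request parameters (`facet`, `facet.field` and `fq`). The helper names and the field names are made up for illustration.

```python
def apply_facet(active_filters, field, value):
    """Return a new filter dict with one more facet filter applied."""
    updated = dict(active_filters)
    updated[field] = value
    return updated

def remove_facet(active_filters, field):
    """Return a new filter dict without the given facet filter."""
    return {f: v for f, v in active_filters.items() if f != field}

def quote(value):
    """Quote a facet value for use in a filter query."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

def build_facet_params(query, active_filters, facet_fields):
    """Build solr request parameters as a list of (name, value) pairs."""
    params = [("q", query), ("facet", "true")]
    for field in facet_fields:
        # Categories with an active filter disappear from the facet list.
        if field not in active_filters:
            params.append(("facet.field", field))
    for field, value in active_filters.items():
        params.append(("fq", field + ":" + quote(value)))
    return params

filters = apply_facet({}, "discipline", "Sociology")
params = build_facet_params("migration", filters, ["authors", "discipline"])
```

Clicking a facet link corresponds to `apply_facet`, clicking the delete symbol corresponds to `remove_facet`. In both cases the original query is re-executed with the updated filters.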
If we want to support selection of multiple facets from one category then we could place check-boxes beside facets rather than implementing them as links. In this case we need a “Search Again” button below the facet list, so that all selected facet filters can be applied. For the sake of interface simplicity we do not recommend enabling selection of multiple facets from one category. Users can enable and delete facets from the same category to achieve similar results. Advanced users can use properly filtered queries directly from the search field.
Finally we may want to implement a "More like this" hyperlink beside every article in the result list. Clicking on this link will yield documents containing similar “interesting terms” as the chosen document. See the solr documentation for a definition of what is considered an “interesting term”.
Administration and Configuration Interface
All solr, Lucene and servlet container (e.g. jetty) related configurations in OJS will appear on a plug-in settings page as is the case for other generic OJS plug-ins.
Installation- vs. Journal-Level Configuration
Most of the aforementioned search features could potentially be enabled, disabled and configured on journal level, e.g. highlighting, faceting, additional ordering criteria, “more-like-this”, etc. Other configuration options make more sense on an installation level, e.g. the configuration of the network endpoint of the solr server (see “Configuring the Deployment Scenario” below). The problem with journal-specific configuration is that OJS has an installation-wide search option on its central home page. This means that each of the journal-level options would have to be repeated on installation level, too. This is comparable to OJS language options which exist for both the installation and specific journals. While increasing configuration flexibility, providing journal-level configuration has a few drawbacks:
- It takes considerably more implementation effort to have both installation- and journal-level configuration.
- It will confuse some users to find the same configuration options in two different places. This has at least been a problem for internationalization options in the past.
- End users using the search function will find an inconsistent user interface with some options enabled for one journal and disabled for other journals of the same installation. This may be quite confusing.
With our project goal of simplicity in mind it therefore seems preferable to provide all or at least most search options on system level only, as long as there is no strong case for journal-specific configuration. This also implies a recommendation for the authorization model for search options: Most search options would be system-level and would therefore be set by the OJS installation's administrators (admin role). Providers often do not give away the administrator credentials to journals they host. So this would be equivalent to reserving search configuration to providers, too. There is one notable exception to this principle: As we've seen before, faceting is best implemented as a journal-level block plug-in so that it can easily be adapted to the journal-specific design and page layout. As this means that faceting has to be implemented as a separate plug-in anyway, it doesn't seem to be a strong disadvantage to have it implemented on journal level. This also means that placement of faceting within the journal design, once faceting has been enabled system-wide by the administrator, would be the responsibility of journal managers.
Configuring the Deployment Scenario
We support two main deployment options (see “Deployment Options” below):
- a fully preconfigured local jetty/solr server (embedded deployment) and
- a central solr server running in an arbitrary servlet container somewhere on the network (network deployment).
The former is the default configuration. The embedded jetty server runs local to the OJS installation and listens on the loopback IP address (127.0.0.1) to protect it from exposure to other servers. To support the second deployment option we'll need a configuration option consisting of the host and port of the solr server. We recommend this to be an OJS administrator-level (installation-wide) option so that we have a unique and unambiguous solr endpoint to send article meta-data to.
We can optionally provide an additional configuration parameter for the solr search handler to be used. This is “/solr/search” in the embedded deployment but advanced users may want to deploy additional preconfigured search handlers. This will enable them to work with installation-specific search parameters (e.g. ranking-related or for a sub-set of journals) without having to customize any OJS code.
For the sake of simplicity we do not provide any means to directly set solr parameters from within OJS. Less advanced users should be able to use solr from a very simple interface while advanced users may still customize search to a very large extent by changing parameters directly in solr configuration files. Keeping solr configuration within solr's configuration files also helps to keep solr secure: Search endpoints can be constrained through mandatory configuration parameters, which would not be possible with client-side configuration. Such configuration would have to be communicated over the network and would thereby be open to manipulation from the outside.
The solr plug-in's home page will display a warning message whenever the current configuration does not point to a running solr server. In this case, the plug-in will point to the README file distributed in its home directory. This file will contain all necessary installation and configuration information to get up and running with OJS solr search.
We provide a shell script to start/stop the embedded solr server. This script could be started/stopped from OJS if (and only if) it is run under the same user as PHP. This user depends on the local web server configuration. In most cases it will be either the web server's user or, in more advanced installations, a dedicated PHP user. There may be other difficulties in starting/stopping solr directly from within OJS, see “Starting/Stopping Solr” in the “Embedded Deployment” chapter below. If all preconditions for tool execution are met then we can place a Start/Stop button onto the solr plug-in's main page. This allows administrators to start/stop solr from within OJS, which will further simplify work with the embedded scenario.
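The endpoint configuration and the availability check behind the warning message could look roughly like the following sketch. The function names and the message text are made up, and the default port 8983 is an assumption; this document does not prescribe a port.

```python
import socket

def solr_search_url(host="127.0.0.1", port=8983, handler="/solr/search"):
    """Combine the configured host, port and search handler into one URL.

    The defaults mirror the embedded deployment; the port is an assumption.
    """
    return f"http://{host}:{port}{handler}"

def solr_is_reachable(host="127.0.0.1", port=8983, timeout=1.0):
    """Check whether something accepts TCP connections on the endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def status_message(reachable):
    """Return the text shown on the plug-in's home page (hypothetical wording)."""
    if reachable:
        return "The solr server is running."
    return "No running solr server found. Please see the plug-in's README file."

embedded_url = solr_search_url()
network_url = solr_search_url("search.example.org", 8080, "/solr/custom-search")
```

A successful TCP connection only shows that some service listens on the port. A production check would additionally send a request to solr and inspect the response.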
Configuring Search Features
If we follow the recommendation to keep all search configuration on installation level then the following features could be dis-/enabled system-wide through simple check-boxes on the search plug-in's settings page:
- alternative spelling proposals
- alternative order criteria
- instant search
- more-like-this links
- custom document ranking
Rather than providing many feature-specific configuration parameters it seems more appropriate to provide a well thought-out default configuration for all of these features to keep the user interface as simple as possible. It has to be kept in mind that advanced users will always be able to tune features directly in the solr configuration. Therefore it is recommended to only provide OJS configuration for what cannot be configured directly in solr and choose good defaults otherwise.
It could even be decided that search features like auto-suggestions, alternative spelling proposals or highlighting, which occupy little screen real estate and do not have a strong performance impact, are implemented out of the box without the possibility of disabling them. The difference in implementation effort (configurable or not) seems to be negligible. So it is essentially up to the project owner to take this decision based on a trade-off between flexibility and simplicity of the user interface.
There are two notable exceptions to the recommendation to keep search feature configuration limited to simple on/off switches:
- The configuration of the faceting block would be done through the usual interface in step 5 of the journal setup and the normal OJS design customization process.
- If the inclusion of additional ranking data (e.g. citation index, usage statistics, etc.) should be possible, then we'll need an interface where such ranking information can be uploaded or integrated from external sources. One possibility is the “Custom Document Ranking Factor Configuration” described below.
Custom Document Ranking Factor Configuration
If a custom document ranking factor (e.g. citation index data, usage metrics, etc.) should be supported, this can easily be implemented as a generic input field on the article editing page. When custom document ranking is enabled in the plug-in then such a field will appear there. If editors enter a low value then the article will rank lower than by default. A high value will raise its mean ranking position. Numbers will be linearly normalized so that their mean is one. See the “Ranking” chapter below for internal implementation details and more examples of potential alternative ranking methods. Alternatively an import plug-in could be realized that allows import of document ranking data from different file formats (CSV, XML, etc.) or even pulling ranking data from an external source via HTTP.
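The linear normalization mentioned above can be illustrated with a minimal Python sketch. It assumes that the raw factors are positive numbers keyed by article; the function name and the sample values are hypothetical and not part of the plug-in.

```python
def normalize_ranking_factors(factors):
    """Scale raw per-article ranking factors linearly so that their mean is 1.0."""
    if not factors:
        return {}
    mean = sum(factors.values()) / len(factors)
    if mean <= 0:
        raise ValueError("ranking factors must have a positive mean")
    # Dividing by the mean keeps the relative proportions between articles.
    return {article_id: value / mean for article_id, value in factors.items()}


# Hypothetical editor input: raw values on an arbitrary scale.
raw = {"article-1": 2.0, "article-2": 4.0, "article-3": 6.0}
boosts = normalize_ranking_factors(raw)
```

After normalization an article with an average raw value receives a factor of one and therefore ranks as if no custom factor had been set.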
Usually no special index administration should be necessary to keep the solr index up-to-date. All index maintenance due to article additions, updates or deletions should be handled automatically. There are situations, though, in which it may be necessary to re-index some articles, e.g. when the solr index has been lost, is out-of-sync or has been corrupted.
Partial or full re-indexing
The existing OJS “re-indexing” button will trigger a re-indexing operation in solr if the solr plug-in is switched on. An additional drop-down field can be implemented to select a single journal for re-indexing.
Additionally we recommend exposing a CLI interface for index rebuild so that rebuilding indexes across several OJS instances can be easily automated if required.
Index optimization is most likely not relevant to the embedded scenario. Lucene does a good job in automatically joining index segments thereby keeping a good balance between index re-organization load and long term query/update performance.
To keep the OJS interface simple and easy to use, we recommend not to support index optimization from within OJS. Providers that work with large multi-installation indexes can use the default solr interface to optimize their index if required. Index optimization can also be scripted if a provider wishes to automate this process.
The following sections describe several aspects of the indexing back-end of the proposed search system. This comprises some changes to the OJS back-end but above all includes solr/Lucene configuration recommendations.
Index architecture is one of the most important aspects of solr configuration. We list available options in this area and provide recommendations with respect to the requirements specified for this project.
Single Index vs. Multi-Index Architecture
The main decision with respect to index architecture is whether to use a single index or multiple indexes (and corresponding solr cores).
Advantages of a single index for all journals and document types:
- enables search across various OJS instances
- easy installation, configuration and maintenance (no need for solr configuration when adding additional OJS instances)
- easy search across multiple document types: A single search across article meta-data, galleys and supplementary files with the intent to retrieve articles is possible.
- easy search across languages
- no need to merge, de-duplicate and rank search results from different indexes (distributed search)
Disadvantages of a single index:
- potential ranking problems when restricting search to heterogeneous sub-sets of an index (e.g. a single journal)
- potential namespace collisions for fields if re-using the same schema for different document types (e.g. supp. file title and galley title in the same field)
- scalability problems if scaling beyond tens of millions of documents
- adding documents invalidates caches for all documents (i.e. activity in one journal will invalidate the cache of all journals)
- the whole index may have to be rebuilt in case of index corruption
Implications of Multilingual Support for the Index Architecture
There are two basic design options to index a multilingual document collection:
- Use one index per language
- Use one field per language in a single index
See http://lucene.472066.n3.nabble.com/Designing-a-multilingual-index-td688766.html for a discussion of multilingual index design.
Advantages of a single index:
- One index is simpler to manage and query. A single configuration can be used for all languages.
- Results will already be joined and jointly ranked. No de-duplication of search results required.
Advantages of a multi-index approach:
- The multi-index approach may be more scalable in very large deployment scenarios - especially where a large number of OJS installations are indexed on a central search server.
- Language configurations may be modularized into separate solr core configurations. No re-indexing of all documents is required when a new language is being introduced into existing documents. It is questionable, though, whether journals will ever introduce a new language into already published articles. So this advantage is probably only theoretical in the case of OJS.
- The ranking metric "docFreq" is per-field while "maxDoc" is not. Using one index per language these parameters will be correct even when using a single field definition for all languages. We can easily work around this in a single-index design, however, by providing one field per language.
In our case the advantages of a single-index approach for multilingual content definitely outweigh its disadvantages.
Index Architecture Recommendations
The following sections provide index architecture recommendations for all deployment scenarios.
Single Index Architecture
We generally recommend a single-index architecture if possible.
Several disadvantages of the single index scenario are not relevant in scenarios S1 to S3:
- We have only one relevant document type: OJS articles. By properly de-normalizing our data we can easily avoid field name collisions or ranking problems due to re-use of fields for different content (e.g. we would certainly have two separate 'name' fields for article name and author name).
- It is not to be expected that the number of documents per journal (S1), installation (S2) or provider (S3) will exceed millions of articles. If it should happen then providers of this size will certainly have the skill available to configure a replicated search server while maintaining API compatibility based on our search interface documentation.
- In usual scenarios the cost of cache invalidation due to new galley or supplementary file upload seems reasonable. If the cost of cache invalidation or synchronous index update after galley/supp. file addition becomes prohibitive we can still choose a nightly update strategy (see “Pull Processing” below). This is in line with the current 24 hour index caching strategy.
- Our multilingual design can be implemented in a single index.
On the other hand there are advantages of a single index architecture (e.g. search across several OJS instances, simplicity of configuration, maintenance, etc.) which are relevant in our case, see above.
There are two potential problems that can occur when consolidating many journals in a single index:
- more costly index rebuild
- potential ranking distortions
The first point refers to the fact that if the whole index needs to be rebuilt (e.g. due to index corruption) we have to trigger the rebuild from all connected OJS instances. This cannot be automated within OJS as OJS does not allow actions across instances. It can, however, be easily automated via a simple custom shell script when we provide a CLI interface for index rebuilds which we recommend.
Whether ranking will suffer from a single-index approach depends on the heterogeneity of the journals added to the index. It may become a problem when search terms that have a high selectivity for one journal are much less selective for other journals thereby distorting Lucene's default inverse document frequency (IDF) scoring measure when restricting query results to a single journal.
An example will illustrate this: Imagine that you have two Mathematics journals. One of these journals accepts contributions from all sub-disciplines while the other specializes in topology. Now a search on "algebraic topology" may be quite selective in the general Maths journal while it may hit a whole bunch of articles in the topology journal. This is probably not a problem as long as we search across both journals. If we search within the general maths journal only, then documents matching "algebraic topology" will probably receive lower scores than they should because the overall index-level document frequency for "algebraic topology" is higher than appropriate for the article sub-set of the general maths journal. This means that in a search with several search terms, e.g. "algebraic topology AND number theory", the second term will probably be overrepresented in the journal-restricted query result set. Only experiments with test data can show whether this is relevant in practice. It is fair to believe, though, that the majority of queries will be across all indexed journals and therefore not suffer such distortion. This is because most users are interested in their subject matter rather than in a specific publication only.
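The size of this effect can be estimated with a small Python sketch. It assumes the classic Lucene-style IDF formula 1 + ln(maxDoc / (docFreq + 1)); the journal sizes and document frequencies are invented for illustration and are not measurements.

```python
import math


def classic_idf(doc_freq, max_doc):
    """Classic Lucene-style IDF: 1 + ln(max_doc / (doc_freq + 1))."""
    return 1.0 + math.log(max_doc / (doc_freq + 1))


# Hypothetical journal sizes (number of articles).
GENERAL_SIZE, TOPOLOGY_SIZE = 1000, 1000

# Hypothetical document frequencies: (general journal, topology journal).
doc_freq = {
    "algebraic topology": (10, 400),
    "number theory": (50, 5),
}


def term_weights(journal_only_statistics):
    """IDF per term, based on general-journal or index-wide statistics."""
    weights = {}
    for term, (in_general, in_topology) in doc_freq.items():
        if journal_only_statistics:
            weights[term] = classic_idf(in_general, GENERAL_SIZE)
        else:
            weights[term] = classic_idf(
                in_general + in_topology, GENERAL_SIZE + TOPOLOGY_SIZE
            )
    return weights


index_wide = term_weights(journal_only_statistics=False)
journal_only = term_weights(journal_only_statistics=True)

# Weight of "number theory" relative to "algebraic topology".
ratio_index_wide = index_wide["number theory"] / index_wide["algebraic topology"]
ratio_journal_only = journal_only["number theory"] / journal_only["algebraic topology"]
```

With these invented numbers "number theory" outweighs "algebraic topology" under index-wide statistics, while statistics restricted to the general journal would weight it lower than "algebraic topology". This is the overrepresentation of the second term described above.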
NB: We do not have to bother about content heterogeneity on lower granularity levels, e.g. journal sections, as these cannot be selected as search criteria to limit search results.
The same ranking distortion could theoretically apply to multilingual content if we were to collect all languages in a single index field. In the proposed schema, however, we use a separate field per language, see “Multilingual Documents” below. As document frequency counts are per index field, we'll get correct language-specific document counts. The total document count will also be ok as we'll denormalize all language versions to the article level.
While we generally recommend a single index design there are cases where a multi-index design may be appropriate and can be optionally implemented by a provider:
- when frequent index corruption or cache invalidation turns out to be a relevant problem,
- when ranking distortions become relevant or
- when reaching scaling limits.
Whether these problems occur or not can only be decided by experimentation. While one index per OJS instance is supported, even in a network scenario, it must be kept in mind that multiple indexes may have disadvantages: From a user perspective the most relevant potential disadvantage is that searches across several journals will only be supported when those journals are in the same index. This is because we do not recommend distributed search across several indexes: it is much more complex and therefore costly to implement, and it creates difficult ranking problems that we can hardly solve. See the full list above.
S1 and S2: Embedded Solr Core
While we generally recommend a single-index architecture for all deployment options, there are a few comments to be made with respect to specific deployment scenarios.
In deployment scenario S1 and S2 we only search within the realm of a single OJS installation. This means that a single embedded solr core listening on the loopback IP interface could serve such requests, see “Embedded Deployment” below.
S3: Single-Core Dedicated Solr Server
In deployment scenario S3 we search across installations. This means that the default deployment approach with a per-installation embedded solr core will not be ideal as it means searching across a potentially large number of distributed cores. Therefore, the provider will probably want to maintain a single index for all OJS installations deployed on their network.
This has a few implications:
- We have to provide guidance on how to install, configure and operate a stand-alone solr server to receive documents from an arbitrary number of OJS installations.
- The OJS solr integration will need a configuration parameter that points to the embedded solr core by default but can be pointed to an arbitrary solr endpoint (host, port) on the provider's network. See “Configuring the Deployment Scenario” above.
- The OJS solr document ID will have to include a unique installation ID so that documents can be uniquely identified across OJS installations. See the “Data Model” and document update XML protocol specifications below.
S4: Multi-Core Dedicated Solr Server(s)
In deployment scenario S4 we have an unspecified number of disparate document types to be indexed. This means that the best index design needs to be defined on a per-case basis. We may distinguish two possible integration scenarios:
- display non-OJS search results in OJS
- include OJS search results into non-OJS searches
The present specification only deals with the second case as the first almost certainly requires provider-specific customization of OJS code that we have no information about.
Our index architecture recommendation for the S4 scenario is to create a separate dedicated solr core with OJS documents exactly as in scenario S3. Then searches to the "OJS core" can be combined with queries to solr cores with non-OJS document types in federated search requests from arbitrary third-party search interfaces within the provider's network. (See http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set for one possible solution of federated search.)
This has the advantage that the standard OJS solr search support can be used unchanged based on the same documentation resources that we provide to support S3 (see previous section).
The only extra requirement to support the S4 scenario is to make sure that the unique document ID of other document types does not clash with the OJS unique article id. This is important so that a federated search can uniquely identify OJS documents among other application documents. When working with a globally unique installation ID such clashes are extremely improbable. Potential ID clashes are only a problem when using solr's built-in federated search feature. Otherwise the search client will query the cores separately and join documents based on application-specific logic (e.g. displaying separate result lists for different document types).
Our recommendation for the data model is based on the type of queries and results required according to our feature list. We also try to implement a data model that requires as few schema and index modifications in the future as possible to reduce maintenance cost.
Meta-data fields that we want to search separately (e.g. in an advanced search) must be implemented as separate fields in Lucene. Sometimes all text is joined in an additional "catch-all" field to support unstructured queries. We do not believe that such a field is necessary in our case as we'll do query expansion instead.
To support multilingual search and proper ranking of multilingual content we need one field per language for all localized meta-data fields, galleys and supplementary files.
In order to avoid ranking problems we also prefer to have separate fields per document format (e.g. PDF, HTML, MS Word) rather than joining all data formats into a single search field. We can use query expansion to cover all formats while still maintaining good ranking metrics even when certain formats are not used as frequently as other formats.
The relatively large number of required fields for such a denormalized multilingual/multiformat data model is not a problem in Lucene (see http://lucene.472066.n3.nabble.com/Maximum-number-of-fields-allowed-in-a-Solr-document-td505435.html). Storing sparse or denormalized data is efficient in Lucene, comparable to a NoSQL database.
We prefer dynamic fields over statically configured fields:
- Dynamic fields allow us to reduce our configuration to one generic field definition per analyzer chain (i.e. language).
- No re-configuration or re-indexing of the data schema will be required to support additional languages or document formats.
- No re-configuration of the data schema will be required to add additional meta-data fields.
The publication date will be indexed to a trie date type field.
Authors are not localized and will be stored verbatim in a multi-valued string type field.
Specific fields are:
- the globally unique document ID field ("article_id") concatenating a globally unique installation ID, the journal ID and the article ID,
- “inst_id” and “journal_id” fields required for administrative purposes,
- the authors field (“authors_txt”) which is the only multi-valued field,
- localized article meta-data fields ("title_xx_XX", "abstract_xx_XX", "discipline_xx_XX", "subject_xx_XX", "type_xx_XX", "coverage_xx_XX") where "xx_XX" stands for the locale of the field,
- the publication date field ("publicationDate_dt"),
- a single localized field for supplementary file data ("suppFiles_xx_XX") where "xx_XX" stands for the locale,
- localized galley full-text fields ("galleyFullText_mmm_xx_XX") where "mmm" stands for the data format, e.g. "pdf", "html" or "doc", and "xx_XX" stands for the locale of the document,
- fields to support result set ordering ("title_xx_XX_txtsort", "authors_txtsort", "issuePublicationDate_dtsort", etc.),
- fields to query across several values at the same time ("all_xx_XX" and "indexTerms_xx_XX"). These fields are necessary to enable correct "NOT" searches, e.g. "keyword NOT author". If we run such a search as a disjunctive query across all fields then "keyword NOT author" may result in a hit in the subject field but not in the author field. According to the usual boolean logic, the resulting "(TRUE in subject) OR (FALSE in author)" will be interpreted as a hit for the corresponding document, which is not what users will usually expect.
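The "NOT" problem described in the last list item can be reproduced with a small Python sketch. It uses naive whitespace tokenization instead of Lucene analysis, and the sample document and search terms are hypothetical.

```python
def matches(text, required, excluded):
    """True if `required` occurs in the text and `excluded` does not."""
    tokens = set(text.lower().split())
    return required in tokens and excluded not in tokens


# Hypothetical article: the keyword is in the subject, the name in the authors.
doc = {"subject": "lucene search engines", "authors": "smith"}

# Disjunctive evaluation: "lucene NOT smith" is checked field by field
# and the per-field results are combined with OR.
per_field_hit = any(matches(value, "lucene", "smith") for value in doc.values())

# Evaluation against a combined field holding all values of the document.
all_field = " ".join(doc.values())
combined_hit = matches(all_field, "lucene", "smith")
```

The field-by-field evaluation reports a hit because the subject field alone satisfies the query, whereas the combined field correctly excludes the document.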
These fields will be analyzed for search query use cases, potentially including stemming (see “Analysis” below). The exact data schema obviously depends on the number of languages and data formats used by the indexed journals.
In the case of supplementary files there may be several files for a single locale/document format combination. As we only query for articles, all supplementary file full text can be joined into a single field per language/document format. And as we do not allow queries on specific supplementary file meta-data fields we can even further consolidate supplementary file meta-data into a single field per language.
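The naming scheme above can be summarized in a short Python sketch that assembles one denormalized index document per article. The "-" separator in the document ID, the function names and the sample data are assumptions made for illustration; the authoritative field definitions are those in schema.xml.

```python
def document_id(inst_id, journal_id, article_id):
    """Concatenate installation, journal and article ID (separator is assumed)."""
    return f"{inst_id}-{journal_id}-{article_id}"


def build_index_document(inst_id, journal_id, article_id, titles, galleys, supp_files):
    """Map article data to dynamic field names, one field per locale/format.

    titles:     {locale: title}
    galleys:    {(data_format, locale): full text}
    supp_files: {locale: [text of each supplementary file]}
    """
    doc = {
        "article_id": document_id(inst_id, journal_id, article_id),
        "inst_id": inst_id,
        "journal_id": str(journal_id),
    }
    for locale, title in titles.items():
        doc[f"title_{locale}"] = title
    for (data_format, locale), text in galleys.items():
        doc[f"galleyFullText_{data_format}_{locale}"] = text
    for locale, texts in supp_files.items():
        # All supplementary files of one locale share a single field.
        doc[f"suppFiles_{locale}"] = " ".join(texts)
    return doc


example = build_index_document(
    "inst1", 2, 37,
    titles={"en_US": "An Example", "de_DE": "Ein Beispiel"},
    galleys={("pdf", "en_US"): "full text extracted from the PDF galley"},
    supp_files={"de_DE": ["Datensatz", "Fragebogen"]},
)
```

Adding a further locale or data format only adds keys to the document, which is the reason why dynamic fields avoid schema changes.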
To reduce index size and minimize communication over the network link all our fields are indexed but not stored. The only field to be stored in the index is the ID field, which will also be the only field to be returned over the network in response to a query request. Article data (title, abstract, etc.) will then have to be retrieved locally in OJS for display. As we are using paged result sets this can be done without relevant performance impact.
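A paged query that returns only the stored ID field could look like the following Python sketch. It relies on the standard solr request parameters "q", "fl", "start" and "rows"; the base URL, the core name "ojs" and the sample query are assumptions.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, page, items_per_page):
    """Build a solr select URL that returns one page of article IDs only."""
    params = {
        "q": query,
        "fl": "article_id",                      # return the stored ID field only
        "start": (page - 1) * items_per_page,    # offset of the first result
        "rows": items_per_page,                  # page size
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical embedded core on the loopback interface, third page of 25 results.
url = build_search_url(
    "http://127.0.0.1:8983/solr/ojs", "title_en_US:topology", 3, 25
)
```

OJS would then load title, abstract and other display data for the returned IDs from its own database.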
If we want to support highlighting then the galley fields need to be stored, too.
Further specialized fields will be required for certain use cases. If we want to support auto-suggestions, faceting or alternative spelling suggestions then we'll have to provide textual article meta-data fields in a minimally analyzed (lowercase only, non-localized) version. These fields will be called “xxxxx_spell” where “xxxxx” stands for the field name without locale extension.
Fields that we want to use as optional sort criteria need to be single-valued, indexed and not tokenized. This means that sortable values will potentially have to be analyzed separately into “xxxxx_sort_xx_XX” fields where “xxxxx” stands for the field name and “xx_XX” for the locale (if any) of the sort field.
If we want to support the “more-like-this” feature then we may have to store term vectors for galley fields if we run into performance problems.
Further technical details of the data model can be found in plugins/generic/solr/embedded/solr/conf/schema.xml.
Document Submission and Preprocessing
Article data needs to be submitted to solr and preprocessed so that it can be ingested by solr's Lucene back-end. This is especially true for binary galley and supplementary file formats that need to be transformed into a UTF-8 character stream. The following sections will describe various options and recommendations with respect to document submission and preprocessing.
Existing OJS Document Conversion vs. Tika
The current OJS search engine implements document conversion based on 3rd-party command-line tools that need to be installed on the OJS server. Solr, on the other hand, is well integrated with Tika, a document and document meta-data extraction engine written in pure Java. We have to decide whether to re-use the existing OJS solution or whether to use Tika instead.
Advantages of the existing OJS conversion:
- We can re-use an established process that some OJS users already know about.
- Conversion of PostScript files can be provided out-of-the-box.
Advantages of Tika:
- According to our tests, Tika works at least one order of magnitude faster than the current OJS solution. This is especially important for large deployment scenarios, i.e. when re-indexing a large number of articles.
- Tika is easier to use and install than the current OJS solution. No additional 3rd-party tools have to be installed as is now the case (except for solr itself of course). Plain text, HTML, MS Word (97 and 2010), ePub and PDF documents are supported out-of-the-box by the code that comes with the standard solr distribution. Caution: Tika does not convert PostScript files!
- Can be deployed independently on the search server and does not need an OJS installation to work. In scenarios S3 and S4 this means considerably less infrastructure to be deployed on OJS nodes.
- Very well tested and maintained.
- Enables indexing of several additional source file types out-of-the-box, see https://tika.apache.org/1.0/formats.html and https://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml.
The only real disadvantage of Tika with respect to our requirements is that it does not support conversion of PS files. PS could be supported indirectly by first converting it to PDF locally and then submitting the PDF to the solr server. It is however not clear whether there are still OJS installations with an interest in solr that actually use PostScript as a publishing format. The advantage of solr being able to support the ePub format seems more important than the missing PS support.
Recommendation: Use the Tika conversion engine.
Local vs. Remote Processing
In the multi-installation scenarios S3 and S4 document preprocessing could be done locally to the installation or on the central solr server.
Advantages of local processing are:
- The solr server experiences less load for resource-intensive preprocessing tasks.
Advantages of remote processing are:
- Doing all processing on a single server will simplify deployment and maintenance as 3rd-party dependencies only need to be installed and configured on a single server. OJS installations can be added without additional search-related installation requirements.
- Solr preprocessing extensions like Cell or DIH work locally to the Solr core.
- We can keep load off the end-user facing OJS application servers for consistent perceived response time.
Recommendation: Use remote processing, mostly due to the reduced deployment cost and easy use of Solr extensions.
Push vs. Pull
Document load can be initiated on the client side (push processing) or on the server side (pull processing). Both options have their strengths and weaknesses.
Advantages of push configuration:
- Indexing can be done on-demand when new documents are added to OJS. This guarantees that the index is always up-to-date.
- Push is simpler to implement when implemented as a synchronous call without callback. This may be sufficient in our case, especially for the embedded scenario, although it implies a small risk that documents may not be indexed if something unexpected goes wrong during indexing.
- No solr-side import scheduler needs to be configured and maintained.
Advantages of pull configuration:
- Push processing means that editorial activity during daytime will cause update load peaks on the solr server exactly while it also experiences high search volume. This load can be quite erratic and fluctuating in larger system environments and therefore difficult to balance. In pull mode indexing schedules can be configured and co-ordinated in one single place (for scenarios S3 or S4) to balance document import load on the central search server and keep it to off hours.
Recommendation: Use the simpler push configuration by default but check its performance and reliability early on. If it turns out to be slow or unreliable, especially in the network deployment case, then provide instructions and sample configuration for an optional pull configuration for larger deployments, see “OJS/solr Protocol Specification” and “Deployment Options” below.
Both push and pull processing can be implemented with or without callback. We recommend callback for network deployment only, where large amounts of data have to be indexed and full index re-builds can be very costly, see “OJS/solr Protocol Specification” below.
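A synchronous push without callback could be sketched in Python as follows. This is an illustration under assumptions: the handler URL is hypothetical, and we assume a Data Import Handler endpoint that accepts the posted XML as a content stream together with the "command" and "clean" request parameters. The actual message format is laid out in the protocol specification.

```python
import urllib.request
from urllib.parse import urlencode


def build_push_request(endpoint, article_xml):
    """Prepare an HTTP POST that submits article XML to the import handler."""
    # clean=false is meant to keep documents that are already in the index.
    params = urlencode({"command": "full-import", "clean": "false"})
    return urllib.request.Request(
        url=f"{endpoint}?{params}",
        data=article_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def push_article(endpoint, article_xml, timeout=30):
    """Send the request synchronously; return True if solr answered with 200."""
    request = build_push_request(endpoint, article_xml)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Network failure or HTTP error status. Without a callback the caller
        # must decide how to retry, e.g. by marking the article for re-indexing.
        return False


# Hypothetical endpoint of an embedded core; nothing is sent at this point.
request = build_push_request("http://127.0.0.1:8983/solr/ojs/dih", "<articleList/>")
```

The failure branch corresponds to the small risk mentioned above that a document is not indexed when something unexpected goes wrong during a push.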
Implications of Multilingual Document Processing
The OJS search feature returns result sets on article level rather than listing galleys or supplementary files as independent entities. This means that ideally our index should contain one entry per article so that we do not have to de-duplicate and join result sets. Different language versions and formats of articles should be spread over separate fields rather than documents. Such a denormalized design also facilitates multilingual search and ranking. A detailed argumentation for this preferred index design will be given in the “Multilingual Documents” section below.
For document preprocessing this design implies that we have to join various binary files (galleys and supplementary files in all languages and formats) plus the article meta-data fields into a single solr/Lucene document. As we'll see in the “Solr Preprocessing plug-ins” section, this considerably influences and restricts the implementation options for document import.
Custom Preprocessing Wrapper vs. solr Plug-Ins
We have to decide whether we want to implement our own custom preprocessing wrapper to solr as in the current OJS search implementation or whether we want to re-use the preprocessing interface and capabilities provided by native solr import and preprocessing plug-ins.
Advantages of a custom preprocessing interface are:
- We could use an arbitrary data transmission protocol, e.g. re-use existing export formats like OAI or the OJS native export format or use solr's native document addition format directly over the wire. The former implies that we somehow have to interpret these formats on the server side. The latter means that we have to transform binaries into a UTF-8 character stream on the client side, see the discussion of local document preprocessing above.
- We could re-use the existing document conversion code, rather than using Tika. See the discussion of the existing OJS preprocessing code above.
Advantages of standard solr plug-ins:
- We can re-use solr's elaborate document preprocessing capabilities which are more powerful than those currently implemented in OJS.
- Tika is well integrated with solr through two different plug-ins: DIH and Cell. Using native solr plug-ins means that we can use Tika as a conversion engine without having to write custom Tika integration code.
- Custom remote preprocessing code to interpret OAI messages or OJS export formats is expensive: It means either implementing and maintaining a separate server-side PHP application or extending solr with custom Java code.
- Solr plug-ins support pull and push configurations out-of-the-box.
A priori both options have their strengths and weaknesses. In our case, though, the choice is relatively clear due to our preference for remote document preprocessing and Tika as an extraction engine. Having to maintain custom Java code or creating a separate server-side PHP preprocessing and Tika integration engine are certainly not attractive options for FUB or PKP.
Recommendation: The advantages of using established solr plug-ins for data extraction and preprocessing outweigh the advantages of a custom preprocessing interface in our case.
Solr Preprocessing plug-ins: DIH vs. Cell
Currently there are two native solr extensions that support Tika integration: The "Data Import Handler" (DIH) and the "Solr Content Extraction Library" (Solr Cell).
Cell is meant to index large amounts of files with very little configuration requirements. Cell does not support more complex import scenarios with several data sources and complex transformation requirements, though. It also does not support data pull. In our case, these disadvantages rule it out as a solution.
The second standard solr preprocessing plug-in, DIH, is a flexible extraction, transformation and loading framework for solr that allows integration of various data sources and supports both pull and push scenarios.
Unfortunately even DIH has two limitations that are relevant in our case:
- DIH's XPath implementation is incomplete. It does not support certain types of XPath queries that are relevant to us: A DIH XPath query cannot qualify on two different XML attributes at the same time, which rules out the possibility of transmitting native OJS XML to DIH.
- Due to the sequential “row” concept that it imposes on XML parsing, DIH also does not usually support denormalizing several binary documents into a single Lucene document. In fact no standard solr contribution is designed to do so out-of-the-box (see http://lucene.472066.n3.nabble.com/multiple-binary-documents-into-a-single-solr-document-Vignette-OpenText-integration-td472172.html). Only by developing a custom XML data transmission format with CDATA-embedded XML sub-documents did we manage to work around this limitation without having to resort to custom compiled Java code on the server side, see “XML format for article addition” below.
Recommendation: Use DIH for document preprocessing with a custom XML document transmission format.
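The idea of a CDATA-embedded XML sub-document can be illustrated with a Python sketch. All element and attribute names below are hypothetical placeholders and do not reproduce the actual transmission format, which is specified in “XML format for article addition”.

```python
from xml.dom.minidom import Document, parseString


def build_article_xml(article_id, galleys):
    """Wrap a galley list as an XML sub-document inside a CDATA section.

    galleys: list of dicts with the keys 'locale', 'mimetype' and 'url'.
    """
    # Inner sub-document: one element per binary galley file.
    inner = Document()
    galley_list = inner.createElement("galleyList")
    inner.appendChild(galley_list)
    for galley in galleys:
        element = inner.createElement("galley")
        for key in ("locale", "mimetype", "url"):
            element.setAttribute(key, galley[key])
        galley_list.appendChild(element)

    # Outer document: the article row carries the inner XML as plain text.
    outer = Document()
    root = outer.createElement("articleList")
    outer.appendChild(root)
    article = outer.createElement("article")
    article.setAttribute("id", article_id)
    root.appendChild(article)
    wrapper = outer.createElement("galley-xml")
    wrapper.appendChild(outer.createCDATASection(galley_list.toxml()))
    article.appendChild(wrapper)
    return root.toxml()


xml = build_article_xml("inst1-2-37", [
    {"locale": "en_US", "mimetype": "application/pdf", "url": "http://example.org/a.pdf"},
    {"locale": "de_DE", "mimetype": "text/html", "url": "http://example.org/a.html"},
])

# Round trip: the CDATA text is itself well-formed XML for a nested parser.
embedded = parseString(xml).getElementsByTagName("galley-xml")[0].firstChild.data
nested_galleys = parseString(embedded).getElementsByTagName("galley")
```

The outer parser sees one row per article with the galley list as an opaque text value, and a second, nested parser can then iterate over the galleys of that row.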
Should we use Tika to retrieve Meta-Data from Documents?
Tika can retrieve document meta-data from certain document formats, e.g. MS Word documents. This functionality is also well integrated with DIH.
Using this meta-data is problematic, though:
- Document meta-data cannot be consistently retrieved from all document types.
- Even where the document theoretically allows for storage of a full meta-data set, these meta-data may be incomplete or inconsistent with OJS meta-data.
- We do have a full set of high-quality document meta-data in OJS that we can use instead.
Recommendation: Do not use Tika to extract document meta-data but use the data provided by OJS instead.
DIH supports several data transmission protocols, e.g. direct file access, HTTP, JDBC, etc. In our case we could use direct file access or JDBC for the embedded deployment scenario. But as we also have to support multi-installation scenarios we prefer channeling all data through the network stack so that we can use a single preprocessing configuration for all deployment options. Using the network locally is only marginally slower than accessing the database and file system directly. By far the most processing time is spent on document conversion and indexing, so document transmission will hardly become a performance bottleneck.
HTTP is the network protocol supported by DIH. HTTP can be used for push and pull configurations. It supports transmission of character stream (meta-)data as well as binary (full text) documents. Our recommendation is therefore to use HTTP as the only data transmission protocol in all deployment scenarios.
Non-HTTP protocols can still be optionally supported (e.g. for performance reasons) by making relatively small custom changes to the default DIH configuration.
Exact details of the transmission protocol will be laid out in the “OJS/solr Protocol Specification” below.
Submission and Preprocessing Recommendations
To sum up: Our analysis of the data import process revealed that the following requirements should be met by a data preprocessing solution:
- No custom Java programming should be required.
- Push and pull scenarios should be supported.
- Remote preprocessing should be supported.
- We have to support denormalization of various binary files into a single Lucene document.
- Preprocessing should be done with Tika using native solr plug-ins.
- Documents and meta-data should be sent over the network.
We provide a prototypical IDH configuration that serves all these import and preprocessing needs:
- We provide push and pull configurations. Push is supported by IDH's ContentStreamDataSource and pull is supported via the UrlDataSource.
- Neither configuration requires direct file or database access. All communication is over the network stack.
- In our prototype we demonstrate a way to use an IDH FieldReaderDataSource to pass embedded XML between nested IDH XPathEntityProcessor instances. This allows us to denormalize our article entity with a single IDH configuration. We also draw heavily on IDH's ScriptTransformer to dynamically create new solr fields when additional languages or file types are being indexed for the first time. This means that no IDH maintenance will be necessary to support additional locales.
- All file transformations are done via IDH's Tika integration (TikaEntityProcessor). We nest the Tika processor into an XPathEntityProcessor and combine it with a ScriptTransformer to denormalize several binary files into dynamic solr fields.
Please see plugins/generic/solr/embedded/solr/conf/dih-ojs.xml for details.
In the Lucene context, “analysis” means filtering the character stream of preprocessed document data (e.g. filtering out diacritics), splitting it up into indexed search terms (tokenization) and manipulating terms to improve the relevance of search results (e.g. synonym injection, lower casing and stemming).
Precision and Recall
This part of the document describes how we analyze and index documents and queries to improve precision and recall of the OJS search. In other words: We have to include a maximum number of documents relevant to a given search query (recall) into our result set while including a minimum of false positives (precision).
Measures that may improve recall in our case are:
- not making a difference between lower and upper case letters
- removing diacritics to ignore common misspellings
- using an appropriate tokenization strategy (e.g. n-gram for logographic notation or unspecified languages and whitespace for alphabetical notation)
- using "stemmers" to identify documents containing different grammatical forms of the words in a query
- using synonym lists (thesauri) to include documents that contain terms with similar meaning
Measures that improve precision may be:
- ignore frequently used words that usually carry little distinctive meaning ("stopwords")
Often there is a certain conflict between optimizing recall and precision. Measures that improve recall by ignoring potentially significant differences between search terms may produce false positives thereby reducing precision.
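The trade-off between the two measures can be made concrete with a small calculation. The following is only an illustrative sketch: the article IDs and both result sets are invented, and the function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a result set against the relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four articles are relevant to the query.
relevant = {1, 2, 3, 4}
strict_hits = {1, 2}            # exact matching: few hits, all of them relevant
stemmed_hits = {1, 2, 3, 7, 9}  # with stemming: more relevant hits, plus false positives

print(precision_recall(strict_hits, relevant))   # (1.0, 0.5)
print(precision_recall(stemmed_hits, relevant))  # (0.6, 0.75)
```

In this invented case the more aggressive analysis raises recall from 0.5 to 0.75 while precision drops from 1.0 to 0.6.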
Please observe that most of the above measures require knowledge about the text language, i.e. its specific notation, grammar or even pronunciation. A notable exception to this rule is n-gram analysis which is language-agnostic. Support for a broad number of languages is one of our most important requirements. Therefore appropriate language-specific treatment of meta-data and full text documents is critical to the success of the proposed design. We'll therefore treat language-specific analysis in detail in the following section.
Our general approach is to keep the analysis process as simple as possible by default. This also includes minimal stemming and language-specific analysis. This is to honor the “simplicity” design goal as specified for this project. Whenever we discover unsatisfactory relevance of result lists during testing (see our testing approach above), especially insufficient recall of multilingual documents, we'll further customize analysis chains. This ensures that additional complexity is only introduced when well justified by specific user needs.
It is one of the core requirements of this project to better support search in multilingual content. This is especially true for languages with logographic notation, such as Japanese or Chinese, that are not supported by the current OJS search implementation. We've already analyzed the impact of multilingual documents on index and data model design. The most important part of multilingual support lies in the analysis process, though. In fact, allowing for language-specific analysis is one of the reasons why we recommend a “one-field-per-language” data model.
There is no recommended default approach for dealing with multilingual content in solr/Lucene. The range of potential applications is so large that individual solutions have to be found for every use case. We'll therefore treat this question in considerable detail: First we'll list a few specific analysis requirements derived from the more general project requirements presented earlier. Then we'll discuss several approaches to multilingual analysis. Finally we'll recommend an individual solution for the use cases to be supported in this project.
Requirements for the analysis process must above all be derived from expected user queries and the corresponding correctly ranked result lists. The following list of analysis requirements is therefore derived from properties specific to multilingual OJS search queries:
- The OJS search form is language agnostic. Search terms can be entered in any language. Both single-language and mixed-language queries should be allowed.
- The indexing process should be able to deal with galleys and supplementary files in different languages.
- The indexing process should usually be able to rely on the locale information given for the galley or supplementary file being indexed. A language classifier might optionally be used for galleys whose locale information is unreliable or cannot be identified.
- The indexing process should be able to deal with mixed-language documents where short foreign-language paragraphs alternate with the main galley/supplementary file language. This means that e.g. an English search term entered into the OJS search box should find a German document containing the search word in an English-language citation.
- The following languages should be specifically supported: English, German, Spanish, Chinese, Japanese. Other languages should be supported by a generic analysis process that works reasonably well for multilingual documents.
- A process should be defined and documented for plugging in additional language-specific analysis chains on demand.
Further requirements derive from multilingual test queries. Consult the list of test queries linked in the main “Requirements” section above for details.
Language Recognition vs. Preset Language
When multilingual content should be analyzed in a language-specific manner (e.g. stemming, stopwords, etc.) we need to know the document language to be able to branch into the correct analysis chain. There are two basic approaches to obtain such language identity information: machine language recognition and user input.
Advantages of machine language recognition:
- Deals with incomplete or unreliable locale information of meta-data, galleys and supplementary files.
Advantages of preset languages:
- Simpler to implement.
Reliability of machine language recognition vs. preset languages mainly depends on the reliability of user input in the case of preset languages: In our case user provided language information will probably be quite reliable for meta-data and galleys. This is not the case for the content of supplementary files as these do not have a standardized locale field. This seems to be a minor problem, though: It is assumed that searches on supplementary file content are of minor importance in our case.
Our recommendation therefore is to work with preset languages to avoid unnecessary implementation/maintenance cost and complexity. If we see in practice that important test queries cannot be run with preset languages then we can still plug in language recognition where necessary. We can use solr's “langid” plug-in in this case, see https://wiki.apache.org/solr/LanguageDetection. It provides field-level language recognition out-of-the-box.
Document vs. Paragraph-Level Language Recognition
The granularity of multilingual analysis has a great influence on implementation complexity and cost. While document-level language processing is largely supported with standard Lucene components, paragraph or sentence-level language recognition and processing requires considerable custom implementation work. This includes development and maintenance of custom solr/Lucene plug-ins based on 3rd-party natural language processing (NLP) frameworks like OpenNLP or LingPipe.
We identified the following implementation options for multilingual support:
- Allow language-specific treatment only on a document level and treat all documents as "monolingual". Document parts that are not in the main document language may or may not be recognized depending on the linguistic/notational similarity between the main document language and the secondary language.
- Allow language-specific treatment on document level and provide an additional "one-size-fits-all" analysis channel that works reasonably well with a large number of languages (e.g. using an n-gram approach, see below). Search queries would then be expanded across the language-specific and generic search fields. This will probably improve recall but reduce precision for secondary-language search terms.
- Perform paragraph or sentence-level language recognition and analyze text chunks individually according to their specific language. This should provide highest precision and recall but will be considerably more expensive to implement and maintain.
The advantage of the first two options is that they can be implemented with standard solr/Lucene components. The third option will require development and maintenance of custom solr/Lucene plug-ins and integration with third-party language processing tools. This is not an option in our case as it would require custom Java programming which has been excluded as a possibility for this project.
We recommend the second approach which will be further detailed in the next section.
Language-Specific Analysis vs. n-gram Approach
There are two basic approaches to deal with multilingual content: A generic n-gram approach that works in a language-agnostic manner and provides relatively good mixed-language analysis results. Alternatively language-specific analysis chains can be used to analyze text whose language is known at analysis time.
Advantages of an n-gram approach:
- relatively easy to implement with a single multilingual analyzer chain
- easy to introduce new languages (no additional configuration required)
- easy to query (no need for query expansion)
- can be used to index mixed-language documents
- no language identification required
- may speed up wildcard searches in some situations
Advantages of language-specific analysis chains:
- higher relevancy of search results (fewer false positives or false negatives)
- language information contributes to proper ranking of documents
- easier to tune in case of language-specific relevancy or ranking problems
- requires less storage space, especially when compared to multi-gram analysis (e.g. full 2-, 3- and 4-gram analysis for a single field).
While language-specific analysis chains may not be ideal for mixed-language content, it is improbable that n-gram analysis alone will provide satisfactory relevance of result sets.
We therefore recommend a mixed approach: We should provide language-specific analysis chains for the main language of a document or meta-data fields where the language is known and supported. All fields and documents may additionally undergo partial n-gram (e.g. edge-gram) analysis if we find that this is necessary to support multilingual document fields or fields that do not have a language specified. The results from both analysis processes will have to go into separate fields. This requires separate fields per language (see “Data Model” above) and query expansion to all language fields (see the “Query Transformation and Expansion” below).
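The difference between full multi-gram analysis and the partial (edge-gram) analysis mentioned above can be illustrated with a few lines of Python. This is only a sketch of the idea, not of solr's actual filter implementations; the function names and the gram sizes are assumptions chosen for the example.

```python
def ngrams(token, n):
    """All contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def edge_ngrams(token, min_len, max_len):
    """Only the n-grams anchored at the start of the token (its prefixes)."""
    longest = min(max_len, len(token))
    return [token[:n] for n in range(min_len, longest + 1)]


token = "journal"
full = ngrams(token, 2) + ngrams(token, 3) + ngrams(token, 4)
edge = edge_ngrams(token, 2, 4)

print(full)  # 6 bigrams + 5 trigrams + 4 four-grams = 15 indexed terms
print(edge)  # ['jo', 'jou', 'jour'] = 3 indexed terms
```

The term counts show why full 2-, 3- and 4-gram analysis needs considerably more index space than an edge-gram variant of the same field.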
Character Stream Filtering
According to our “simplicity by default” approach we do not recommend any character stream filtering unless specific test use cases require us to do so. The recommended stemming filters deal to a large extent with diacritics. Lower case filtering is done on a token level.
Tokenization differs for alphabetic languages on the one side and logographic languages on the other. We recommend standard whitespace tokenization for most Western languages while a bigram approach is usually recommended for Japanese, Chinese and Korean. We therefore recommend the solr CJK-tokenizer for these languages.
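The two tokenization strategies can be sketched as follows. This is a deliberately simplified illustration: the function names are hypothetical, and solr's real CJK tokenizer handles script detection and mixed text that this sketch ignores.

```python
def whitespace_tokens(text):
    """Split alphabetic-language text on runs of whitespace."""
    return text.split()


def cjk_bigrams(text):
    """Emit overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(whitespace_tokens("Open Journal Systems"))  # ['Open', 'Journal', 'Systems']
print(cjk_bigrams("電子出版"))                     # ['電子', '子出', '出版']
```

Because logographic text carries no whitespace between words, overlapping bigrams give the index searchable units without requiring a dictionary-based word segmenter.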
We recommend lowercase filtering for alphabetic languages and language-specific stopword filtering by default. In order to simplify analysis and avoid additional maintenance cost, we do not recommend synonym filtering unless required to support specific test cases.
We recommend solr's minimal language-specific stemming implementations where they exist. Should these yield insufficient recall during testing then we can replace them with more aggressive stemmers on a case-by-case basis.
One might even want to remove all stemming and cluster all alphabetical languages into a single analysis chain similarly to what currently is being done in standard OJS search. In order to keep flexibility for advanced use cases in scenarios S3 and S4 we do recommend language-specific analysis chains, though, even if not used out-of-the-box. It has to be kept in mind that this complexity is completely transparent to end users.
Keyword fields like discipline, subject, etc. are not usually passed through stemming filters. We therefore recommend a generic, language-agnostic analysis chain for all keyword fields.
We have to support a special analysis chain for the article and issue publication date so that range queries on the publication date can be supported. There are default analyzers and field types for dates which we recommend here.
Text fields to be sorted on must not be tokenized. Date fields to be sorted on must be of a different type than date fields to be queried on. We therefore provide special field types for sort ordering.
Theoretically geographical coverage could be analyzed with a location analyzer if (and only if) it were given in a well-defined latitude/longitude format. As this is not usually the case in OJS we recommend analyzing geographic coverage in the same way as other keyword fields.
Most use cases only require us to index fields. Storage is not required. The only field we need to store (and return from queries) is the document ID field which will be required by OJS to retrieve article data for display in result sets. There is one notable exception to this rule, though: If we enable highlighting then storage of galley fields is mandatory. This is necessary so that the highlighting component can return search terms in their original context. Therefore highlighting considerably increases the storage space required by OJS solr indexes. This should be considered when deciding whether this feature is to be supported out-of-the-box.
Please see plugins/generic/solr/embedded/solr/conf/schema.xml for our recommended analysis configuration.
Query Entry and Auto-Suggest
Search queries entered into the search fields will be submitted to the OJS server as POST requests. This does not differ from the current search implementation. OJS will then start a nested HTTP request to the solr server which will return matching article IDs and (if enabled) advanced search information such as highlighting, alternative spelling suggestions, similar articles, etc.
OJS will access its article database to present the result set. This works exactly in the same way as in the current OJS search implementation. The only difference is that article IDs are provided by solr rather than being retrieved from OJS' own index implementation.
Prefixed facet searches have the advantage that they only suggest searches that will actually return results. They can even inform about the number of documents that will be returned for each proposed search query. The problem with prefixed facet searches is that they do not scale linearly when searching in very large indexes. Term searches on the other hand are extremely fast as they can be provided almost without processing from the default Lucene term index. Prefixed facet searches have the same requirements with respect to the data model and analysis process as faceting or alternative spelling suggestions. See the data model and corresponding faceting/spelling chapters for details.
Solr provides a considerable number of query parsers. Prominent choices in our case are: “lucene” (the default Lucene query syntax), “dismax” (provides a simplified query syntax and allows additional fine tuning of query parameters, e.g. term boost) and “edismax”, which improves on details and safety of the dismax implementation and supports both simplified and default Lucene query syntax.
In principle the “edismax” parser would be preferred as it allows power users to enter advanced ranking parameters within the query. The “edismax” parser also has the most advanced parameters for OJS providers to customize search handlers. Due to a restriction with respect to the default search operator ("AND") in combination with fielded (multi-lingual) queries, the edismax parser will, however, not be able to provide all required functionality. We therefore currently recommend the Lucene parser as it supports all required search syntax plus the required default search operator in fielded searches. See http://lucene.472066.n3.nabble.com/Dismax-mm-per-field-td3222594.html for a discussion of this restriction.
Query Transformation and Expansion
As we are using the “edismax” query parser, we have to translate OJS queries to solr's extended dismax syntax before posting them to the search handler:
- We add the search field in front of the query and wrap the query in parentheses, i.e. searchfield:(some query).
- We expand all queries to the journal's configured languages and publication formats. E.g. when a query is done on the title field (e.g. titles:something), then we'll expand this query phrase to OR-joined query phrases on localized fields (e.g. titles_en_US:something OR titles_de_DE:something OR titles_es_ES:something OR titles_txt:something).
This design ensures that queries are always language agnostic and support mixed-language documents. In our example: if the Spanish version of a galley contains an English citation then this citation can be found correctly stemmed in “galley_full_text_pdf_en_US” or generically analyzed in “galley_full_text_pdf_txt”.
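The expansion step described above can be sketched as a small helper. The function name and its parameters are hypothetical; only the field naming pattern (field name plus locale suffix, plus a generic “txt” field) is taken from the examples above. Expansion across publication formats (as in “galley_full_text_pdf_en_US”) would add one more loop and is left out for brevity.

```python
def expand_query(field, query, locales, generic_suffix="txt"):
    """Expand a single-field query to OR-joined queries on all localized fields."""
    suffixes = list(locales) + [generic_suffix]
    return " OR ".join(f"{field}_{suffix}:({query})" for suffix in suffixes)


print(expand_query("titles", "something", ["en_US", "de_DE", "es_ES"]))
# titles_en_US:(something) OR titles_de_DE:(something) OR titles_es_ES:(something) OR titles_txt:(something)
```

Wrapping each query in parentheses keeps multi-term queries bound to their field, which matters once the user enters more than one search term.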
Query Analysis and Synonym Injection
We recommend that for query purposes, the query will be analyzed exactly in the same way as the queried document fields at indexing time. Therefore there isn't much to say about query analysis that has not been said before in the analysis chapter. There is only one possible exception to this rule: We may want to consider synonym injection at query time. This means that an additional analysis component could be added to the query analyzer that checks for synonyms, either in a static, manually maintained language-specific thesaurus or in online sources like WordNet. Whenever a synonym is found it will be injected to the token stream and handled as if it had been part of the original query. Whether or not query-side synonym injection should be implemented and from which source is to be decided by the project owner.
Ranking of OJS articles is done through the default solr/Lucene ranking algorithm. The algorithm is called “term frequency – inverse document frequency” or “TF-IDF” for short and will be outlined in the next few paragraphs. Lucene-specific details of the ranking algorithms are out of scope for this document and can be looked up in the Lucene documentation, above all in the JavaDoc of Lucene's Similarity classes (e.g. DefaultSimilarity).
The term frequency TF(t, d) is the number of times the term t occurs in a given document d.
Inverse Document Frequency
The inverse document frequency of a term t in an index is:
IDF(t) = log(N / DF(t))
- t is an arbitrary dictionary term
- N is the total number of documents in an index. In the Lucene documentation this measure is usually referred to as "maxDoc".
- DF(t) is the document frequency of term t in an index. The document frequency is defined as the number of documents containing the term t one or more times. In Lucene this is usually called "docFreq".
NB: IDF is finite if every dictionary term t occurs at least once in the document collection (as then DF(t) > 0). If we build our dictionary exclusively from the document collection itself then this is guaranteed and IDF is defined everywhere.
Combined Term / Inverse Document Frequency
Ranking in Lucene is some variant of the combined term/inverse document frequency:
TF-IDF(t, d) = TF(t, d) * IDF(t)
NB: TF-IDF is zero if a document does not contain the term t. If the dictionary contains terms that are not in the document collection then TF-IDF is defined to be zero for this term for all documents. We can usually avoid this by building our dictionary exclusively from the document collection.
Overlap Score Measure
Score(q, d) is the sum over all t in q of TF-IDF(t, d) where q is the set of all terms in a search query.
In other words: A search term contributes most to a document's ranking for a given search query when the term occurs often in the document and has a high discriminatory significance in the document collection, i.e. it occurs in only a small percentage of the collection's documents.
In Lucene, scoring of search queries with multiple terms is done with a “coordination factor” that works similarly to the approach just outlined. Lucene calculates a score for each field to derive its document-level score. Lucene also further modifies this simple ranking approach for easy customization (e.g. term boost, field boost, document boost, etc.) and efficiency.
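The simplified overlap score defined above can be written down in a few lines. This sketch follows the formulas of this section literally (natural logarithm, raw term counts) and is not Lucene's actual scoring code; the three tokenized documents are invented for illustration.

```python
import math
from collections import Counter


def idf(term, docs):
    """IDF(t) = log(N / DF(t)); defined as 0 for terms absent from the collection."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def score(query_terms, doc, docs):
    """Score(q, d): sum of TF(t, d) * IDF(t) over the set of query terms."""
    tf = Counter(doc)
    return sum(tf[t] * idf(t, docs) for t in set(query_terms))


# Invented, already tokenized mini-collection.
docs = [
    ["open", "access", "journal"],
    ["open", "journal", "journal"],
    ["peer", "review"],
]

for doc in docs:
    print(doc, round(score(["journal", "review"], doc, docs), 3))
```

The third document ranks highest for this query although it matches only one term once: “review” occurs in a single document and therefore has a higher IDF than “journal”, which occurs in two of three.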
Vector Space Model
Another perspective on Lucene's scoring algorithm is by modeling documents as term vectors in a vector space spanning all terms of a dictionary. More precisely: If D is a set of terms (i.e. a “dictionary”) then a single document d can be modeled as a vector V(d) in a card(D)-dimensional vector space.
Example: When using TF-IDF as scoring measure then the n-th component of the vector produced by the vector function V(d) is the TF-IDF of the document for the n-th dictionary term.
In this model a similarity (or distance) measure can be defined that computes the similarity of two documents (or a document and a query).
A common similarity measure is the cosine similarity:
sim(d1, d2) = V(d1) . V(d2) / |V(d1)||V(d2)|
where the dot (.) stands for the vector dot product (inner product) and || is the Euclidean norm. This is equal to the inner product of V(d1) and V(d2) normalized to unit length.
The advantage of this model is that not only distances/similarities between documents but also between a search query and a document can easily be calculated:
sim(d, q) == v(d) . v(q)
where v() stands for the document vector V() normalized to unit length.
NB: The “more-like-this” feature relies on similarity calculations between documents. It therefore requires term vectors to be stored in the index.
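The cosine similarity defined above can be sketched as follows. The vectors are plain lists of term weights over a shared dictionary; the function name is hypothetical and returning 0 for an all-zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(v1, v2):
    """sim(d1, d2) = V(d1) . V(d2) / (|V(d1)| * |V(d2)|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention for this sketch: empty vectors match nothing
    return dot / (norm1 * norm2)


print(cosine_similarity([1, 2], [2, 4]))  # same direction: similarity close to 1
print(cosine_similarity([1, 0], [0, 1]))  # no shared terms: similarity 0
```

Because the vectors are normalized, a long document and a short query can be compared directly; only the relative weights of the terms matter.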
Fine-Tuning the Ranking
By default we do not change any preconfigured ranking factors. By using the “edismax” query parser (see above) we do keep the option, though, to fine-tune client-side (query-level) ranking parameters should it become necessary. It is recommended to avoid client-side ranking adjustments. By configuring different solr search request handlers, different ranking approaches can be provided to different OJS installations by changing their search endpoint configuration (see deployment option configuration above).
Potential Additional Ranking Metrics
While TF-IDF is the default scoring/ranking model in Lucene it can be customized by providing so-called boost factors for different entities in the model. These are simple multipliers that can occur on term and document level that increase (multiplier > 1) or decrease (multiplier < 1) the scoring contribution of that entity.
In our case document boost opens up a few interesting possibilities to further tune the relevancy of search results. Here are a few examples of metrics that could be used to “boost” certain articles so that they rank higher in result lists:
- citation index data
- usage metrics, e.g. as supplied by the OJS.de statistics sub-project
- click-through popularity feedback from OJS, i.e. the number of times an article was actually opened after being presented on a result list
- article recency, i.e. favor articles with a more recent publication date over older articles
The question is how such data could be provided to solr. I propose that we implement an API in OJS that can receive and store document-level boost data for each article. This can be implemented as a non-mandatory setting in the article settings table. If such boost data is present then it will automatically be sent to solr at indexing time. We'll have to implement a normalization method so that editors can enter arbitrary numbers that will then be translated to proper boost factors. Changing the boost data would mean that the article would have to be re-indexed (like any other change to search-related article meta-data).
Alternatively, advanced users can provide periodically updated files with document boost data, see http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html.
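One conceivable shape for the normalization method mentioned above is a linear min-max scaling of the editor-supplied numbers. This is purely an assumption for illustration: the function name, the target range and the choice of linear scaling are not part of the proposed design.

```python
def normalize_boosts(raw_values, max_boost=2.0):
    """Map arbitrary numbers to boost factors in the range [1.0, max_boost].

    raw_values: dict of article ID -> arbitrary number entered by an editor.
    Hypothetical scheme: the lowest value gets a neutral boost of 1.0,
    the highest value gets max_boost, everything else scales linearly.
    """
    lowest, highest = min(raw_values.values()), max(raw_values.values())
    if highest == lowest:
        return {article_id: 1.0 for article_id in raw_values}
    span = highest - lowest
    return {
        article_id: 1.0 + (value - lowest) / span * (max_boost - 1.0)
        for article_id, value in raw_values.items()
    }


print(normalize_boosts({"a": 0, "b": 50, "c": 100}))
# {'a': 1.0, 'b': 1.5, 'c': 2.0}
```

Keeping the lowest boost at 1.0 means that articles are never penalized by the metric, only favored.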
Result presentation is mostly an OJS-side task. It will not differ from the current implementation except for a few details that will be outlined later. The solr search server returns ID fields only which can be used by OJS to retrieve all required additional data from its database. This implementation recommendation is due to two reasons:
- The solr index will consume considerably less space if the original text does not have to be stored.
- In a subscription-based scenario we want to avoid that full texts can be leaked if malicious users gain direct access to the Lucene server. It has to be admitted, though, that not storing full texts is a relatively weak protection and can be worked around to a certain extent. The best protection against full text leakage is proper firewalling of the solr server as described in the deployment option chapter below.
If we want to support highlighting then we have to store galley full texts for all interface languages. Highlighting requires the original non-analyzed text to be present so that the context of search terms can be retrieved. Other necessary changes to the result presentation have already been defined in the interface specification above. Please refer there for more details.
Paging is supported in solr through a query parameter. OJS will restrict search queries using solr's “start” and “rows” query parameters so that only actually displayed articles will be returned. This reduces the size of messages to be passed over the network.
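Translating a result page number into the “start” and “rows” parameters is simple arithmetic. The helper below is a sketch; its name and the 1-based page numbering are assumptions.

```python
from urllib.parse import urlencode


def paging_params(page, items_per_page):
    """Return solr 'start'/'rows' parameters for a 1-based result page."""
    if page < 1 or items_per_page < 1:
        raise ValueError("page and items_per_page must be positive")
    return {"start": (page - 1) * items_per_page, "rows": items_per_page}


params = paging_params(page=3, items_per_page=25)
print(params)             # {'start': 50, 'rows': 25}
print(urlencode(params))  # start=50&rows=25
```

With 25 results per page, the third page starts at offset 50, so solr only has to return documents 51 to 75 of the ranked list.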
Highlighting can be supported in solr by a simple additional search query parameter: “...&hl=on...” plus a few configuration parameters defining the highlighted fields and the amount of context to be returned. If highlighting should be supported then we propose to base it on galley full text. In this case solr will automatically return extracts from the galley full text containing query terms. These extracts can then be displayed to the end user as part of the result list.
NB: Highlighting requires original (non-tokenized) galley full text to be stored in the index which can considerably blow up index size.
The “edismax” server allows server-side ordering based on any indexed field. Server-side ordering is especially important when retrieving paged result sets. By default result sets will be sorted by the “virtual” field score which has been described above in the ranking section. Any other field can be specified with the sort parameter, e.g. “...&sort=authors_txtsort...”. See the data model section above for requirements of sortable fields.
As described in the interface specification above, we propose faceting for the following search fields:
- publication date
Some special filtering of these fields can be applied for faceting:
- It may make sense to only include the first author for faceting. It is up to the project owner to define this.
- Faceting on the publication date must be by date range rather than discrete date values.
- All other fields must be analyzed with the keyword analyzer and otherwise left intact so that facets display with the original spelling.
We propose that we only return faceting results for the currently chosen interface language. This means that the fields for faceting cannot be preconfigured in the OJS search request handler but must be provided in the query string, e.g. “...&facet.field=subjects_de_DE”. Facets will be selected as links or check-boxes in the faceting block plug-in as described in the interface section above. Selecting one or more facets will result in the original query being re-issued with an additional filter query, e.g. in the case of a date range “...&fq=publication_date_dt:([2006-01-01T00:00:00Z TO 2007-01-01T00:00:00Z] NOT "2007-01-01T00:00:00Z") ...”.
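The two query-string fragments above could be generated along the following lines. Both helper names are hypothetical; the field names and the filter syntax are copied from the examples above, and one calendar year is assumed as the range size.

```python
def facet_field(field, locale):
    """Name of the facet field for the current interface language."""
    return f"{field}_{locale}"


def year_filter_query(year, field="publication_date_dt"):
    """Filter query covering one calendar year, excluding the end instant."""
    start = f"{year:04d}-01-01T00:00:00Z"
    end = f"{year + 1:04d}-01-01T00:00:00Z"
    return f'{field}:([{start} TO {end}] NOT "{end}")'


print(facet_field("subjects", "de_DE"))
# subjects_de_DE
print(year_filter_query(2006))
# publication_date_dt:([2006-01-01T00:00:00Z TO 2007-01-01T00:00:00Z] NOT "2007-01-01T00:00:00Z")
```

Excluding the end instant keeps adjacent ranges from overlapping, so an article published exactly at the boundary is counted in one facet only. The resulting strings still need URL encoding before they are appended to the request.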
Alternative Spelling Suggestions
Alternative spelling suggestions for a given search query can be provided based on solr's spellcheck search component. If this should be implemented then we recommend creating a multilingual dictionary based on a concatenated field that contains all meta-data and full text. The spellcheck dictionary will be stored in a separate Lucene index in …/files/solr/data/spellchecker. Such a spellcheck configuration has been implemented in the default configuration and details can be checked there. The dictionary needs to be rebuilt after large updates to the solr index. For performance reasons, we recommend issuing dictionary build commands on demand from the OJS installation after a few updates rather than rebuilding after every commit. There is also an automatic “build after optimize” option for the spellchecker component. We cannot use this as we do not recommend optimization in the embedded scenario (see search interface specification above). We suggest issuing build commands to the usual OJS search interface with the following parameters: “q=nothing&rows=0&spellcheck=true&spellcheck.build=true&spellcheck.dictionary=default”. The q parameter is not usually required to build the dictionary. In our set-up with a preconfigured search request handler, however, omitting it will result in an error.
"More like this"
The “more-like-this” (MLT) feature can be implemented by configuring the solr MLT search component. There are several options how to access the MLT component:
- as a dedicated request handler that usually is supplied a single document id,
- as a search component of another request handler or
- as a request handler that ingests full text and proposes similar documents from the index.
While the first option is the most frequently used one, we do not recommend it as it will require configuration of a second search request handler within OJS, which is contrary to our “simplicity” policy. We rather propose to provide MLT functionality through our standard search handler as an optional search component. The search component will be disabled by default and only be activated when doing specific MLT requests. We recommend issuing MLT requests by sending a GET request with the following parameters to the usual search request handler (“.../search”):
- mlt.fl=xxx where xxx is set to the document field corresponding to the galley formats with the current interface language, e.g. mlt.fl=galley_full_text_pdf_en_US,
- q=id:xxx where xxx is the solr document ID (see “Data Model” above) of the article we wish to base our search on,
- start=... and rows=... as for usual queries according to the current result set page selected by the user.
As we expect the MLT feature not to be used very frequently in most cases, we propose not to store term vectors by default. This means that when an MLT request is issued, the corresponding document fields will have to be re-analyzed to derive term vector information. This is slower than storing term vectors but saves considerable storage space. Term vectors can always be activated by advanced users if the default configuration is found to be too slow or resource consuming.
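A minimal sketch of how such an MLT request could be assembled. The function name, the default page size and the one-based page numbering are assumptions made for illustration. The field name follows the galley_full_text_pdf_en_US example above. The mlt=true switch is solr's standard parameter for enabling the MoreLikeThis component on a request; that our handler is activated this way is an assumption.

```python
from urllib.parse import urlencode


def mlt_request_url(endpoint, document_id, locale, page, rows_per_page=25,
                    galley_format="pdf"):
    """Build the GET URL for a "more like this" request on one article.

    page is the one-based result page currently selected by the user.
    """
    field = "galley_full_text_%s_%s" % (galley_format, locale)
    params = [
        ("q", "id:%s" % document_id),  # the article the search is based on
        ("mlt", "true"),               # assumption: enables the optional component
        ("mlt.fl", field),
        ("start", str((page - 1) * rows_per_page)),
        ("rows", str(rows_per_page)),
    ]
    return endpoint + "?" + urlencode(params)
```

Document IDs containing characters that are special in the solr query syntax would additionally need escaping or quoting before being placed into the q parameter.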
OJS/solr Protocol Specification
Internal API Specification
- If possible, hooks should not be placed in code that will later be refactored into a plugin (e.g. classes/search/ArticleSearchDAO.inc.php, and any MySQL index specific methods in classes/search/ArticleSearch.inc.php and classes/search/ArticleSearchIndex.inc.php).
- We'll introduce an ArticleSearchManager for consistency with naming conventions elsewhere and have article indexing functions delegated to that class. It'll be responsible for invoking whatever plugins are configured and new hooks will mostly be placed there. Hooks should be named accordingly.
Index Maintenance Protocol
The index maintenance protocol is responsible for enabling write access from OJS to the solr index. It provides functions for adding/updating and deleting documents. Both functions are batch functions that can be invoked with one or many articles at once.
As we've seen above, articles can be pushed or pulled. In the “push” configuration, OJS will take action whenever one or more articles need to be added or updated. In the “pull” configuration, solr will initiate index updates according to a central update schedule.
Adding an article to the OJS/solr index is done in three steps:
- First an XML document with all article meta-data, including the corresponding galley and supplementary file meta-data, is sent over HTTP POST to the OJS DIH request handler .../solr/ojs/dih that is part of the recommended default solr configuration. We can do this in non-blocking mode so that the OJS front-end remains responsive independent of indexing processing time.
- The XML links to all related full text documents which DIH asynchronously pulls from the OJS server for extraction and preprocessing.
- OJS will mark an article as "indexed" if the response given by solr indicates indexing success. The processing request will not exit until the indexing has completed.
- If processing returned an error then a notification will be provided to OJS editors so that they can correct the error and re-index the document.
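The push steps above can be sketched as follows. This is an illustration only: the function names and the callback parameters are hypothetical, any DIH command parameters are left out, and treating an HTTP 200 status as "indexing success" is an assumption, since a real client would also inspect the response body returned by DIH.

```python
import urllib.request

# Default DIH endpoint of the embedded deployment, as named in this document.
DIH_ENDPOINT = "http://127.0.0.1:8983/solr/ojs/dih"


def post_to_dih(article_list_xml, endpoint=DIH_ENDPOINT, timeout=60):
    """POST the article list XML to DIH and report whether solr accepted it."""
    request = urllib.request.Request(
        endpoint,
        data=article_list_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200  # assumption: 200 means success
    except OSError:  # URLError, HTTPError and socket errors derive from OSError
        return False


def push_articles(article_ids, article_list_xml, post, mark_indexed, notify_editors):
    """Push one batch; mark articles as indexed only if solr reported success."""
    if post(article_list_xml):
        for article_id in article_ids:
            mark_indexed(article_id)
        return True
    # On failure nothing is marked, so the batch can simply be pushed again.
    notify_editors(article_ids)
    return False
```

Passing the transport in as the post argument keeps the bookkeeping independent of the network call, so post_to_dih can be swapped for a stub in tests.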
A pull processing protocol may be implemented like this:
- First the solr server will send a parameter-less GET request to a well known, installation-wide OJS end point. Providers will have to configure appropriate DIH schedulers for this purpose. OJS will respond with a list of newly added articles since the last request. This list is encoded in the same XML format used for push processing so that the DIH push configuration can be used nearly unchanged for pull processing. Only the initial ContentStreamDataSource has to be changed to a URLDataSource.
- DIH will loop through the document list and request binary documents for galleys and supplementary files from their usual locations as indicated in the pulled XML document. This works exactly the same as for the push protocol described above.
- Finally DIH will send a confirmation XML to OJS containing the successfully indexed documents. These documents will be marked "indexed" in OJS and not be offered for indexing again.
Please note that we'll have to implement two additional OJS handler operations to support pull requests:
- An operation that returns the XML article list and
- a second operation that accepts an XML-encoded list of article IDs to be marked “indexed”.
As indexing is an idempotent deletion/addition process in Lucene, network link or processing errors during any of the above steps will not result in an incomplete or corrupt index as long as DIH compiles and returns a reliable list of indexed articles to OJS. If anything goes wrong in the process OJS will not mark the sent articles as “indexed” and the articles will be re-included for indexing during the next pull request. Additionally providers can monitor indexing errors solr-side.
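The bookkeeping on the OJS side of the pull protocol can be illustrated with a small in-memory model. The class and method names are hypothetical; in OJS the "indexed" flag would be persisted with the article data rather than held in a dictionary.

```python
class PullIndexState:
    """Tracks which articles still have to be offered to solr."""

    def __init__(self, article_ids):
        # Every known article starts out as "not indexed".
        self._indexed = {article_id: False for article_id in article_ids}

    def pending(self):
        """Articles returned by the first handler operation (the XML article list)."""
        return sorted(a for a, done in self._indexed.items() if not done)

    def confirm(self, article_ids):
        """Second handler operation: mark the confirmed articles as indexed."""
        for article_id in article_ids:
            if article_id in self._indexed:
                self._indexed[article_id] = True

    def mark_not_indexed(self, article_id):
        """Called when an article changes and must be offered again."""
        self._indexed[article_id] = False
```

If a confirmation is lost, the affected articles simply stay in the pending list and are offered again on the next pull request. Because adding a document with an existing ID replaces the old one, indexing them twice does no harm.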
XML Format for Article Addition
As laid out in the preprocessing section, we recommend using native solr plug-ins for data extraction. In our case we have chosen the Data Import Handler (DIH) for document extraction and preprocessing.
It has been evaluated whether existing OAI providers, i.e. NLM, MODS or MARC over OAI, could be used with DIH. It has also been analyzed whether the native (or other existing) export formats could be imported. Unfortunately neither is possible because DIH imposes limitations on the OJS/solr data exchange format:
- DIH's XPath implementation is not complete. Only a subset of the XPath specification is supported. XPath queries that qualify on several attributes cannot be used which rules out OJS native export format. We have to provide a simple XML format that can be interpreted with DIH.
- DIH's Tika integration is usually restricted to a fixed number of binary documents per Lucene document. In our case, however, we have to support indexing of an arbitrary, dynamic number of galleys and supplementary files per article. We work around this limitation by embedding CDATA-wrapped XML sub-documents for galleys and supplementary files into the main XML article list. Such documents can be extracted separately into fields and, together with a special field data source and a custom DIH ScriptTransformer, make DIH "believe" that it is dealing with one binary file at a time. This workaround rules out both OAI and OJS XML export formats as DIH source formats in our case.
Fortunately the required OJS/solr XML data exchange format is quite simple. A sample implementation exists which executes a pure SQL script to construct the XML for push to the solr test server from an arbitrary OJS database. The XML format is as follows:
- <articleList>...</articleList>: This is the root element containing a list of article entities.
- <article id=”...” instId=”...” journalId=”...”>...</article>: This element is the only allowed child element of the <articleList> element and its sub-elements contain all meta-data and file information of a single OJS article. The id attribute contains a combination of a universally unique OJS installation ID, the journal ID and the article ID. This is necessary so that IDs are unique even when providers collect article data from several installations into a single search index. “instId” and “journalId” contain the installation and journal IDs separately which will be required for administrative purposes, e.g. batch deletion of articles.
- <authorList><author>...</author>...</authorList>: Full names of one or several article authors. This and the following elements are placed below the <article> element. If the information for any of this or the following search fields is not available then the element will be missing completely. Order of elements matters in the case of authors.
- <titleList><title locale=”...”>...</title>...</titleList>: The article title together with its locale. Order of sub-elements does not matter for this or the following meta-data fields.
- <abstractList><abstract locale=”...”>...</abstract>...</abstractList>: Localized article abstracts.
- <disciplineList><discipline locale=”...”>...</discipline>...</disciplineList>: Localized article disciplines.
- <subjectList><subject locale=”...”>...</subject>...</subjectList>: Localized article subjects.
- <typeList><type locale=”...”>...</type>...</typeList>: Localized article types.
- <coverageList><coverage locale=”...”>...</coverage>...</coverageList>: A list of coverage keywords (concatenates geographic, time and sample coverage).
- <journalTitleList><journalTitle locale=”...”>...</journalTitle>...</journalTitleList>: The journal title together with its locale.
- <publicationDate>...</publicationDate>: The article's publication date in ISO 8601 format without second fractions (“YYYY-MM-DDTHH:MM:SSZ”). All dates are treated as UTC dates. OJS must translate local publication dates into UTC before sending them to solr.
- <issuePublicationDate>...</issuePublicationDate>: The issue's publication date in ISO 8601 format.
- <galley-xml>...</galley-xml>: A UTF-8 encoded CDATA field that contains an embedded XML file (including the <?xml …?> header). We have to embed this XML so that solr's DIH extension can treat it separately during import processing. This is a workaround so that we can import several binary files for a single article.
- <suppFile-xml>...</suppFile-xml>: A UTF-8 encoded CDATA field containing embedded XML with supplementary file data. See <galley-xml> above for an explanation why we embed a secondary XML character stream.
Description of the embedded galley XML:
- <galleyList>...</galleyList>: Wraps a list of galleys. This is the root element of the XML file embedded in <galley-xml>...</galley-xml>.
- <galley locale=”...” mimetype=”...” url=”...” />: An element representing a single galley. It has no sub-elements. The mimetype attribute is the MIME type as stored in OJS' File class. The url attribute points to the URL of the full text file. DIH will pull the file from there over the network and extract its content.
Description of the embedded supplementary file XML:
- <suppFileList>...</suppFileList>: Wraps a list of supplementary files. This is the root element of the XML file embedded in <suppFile-xml>...</suppFile-xml>.
- <suppFile locale=”...” mimetype=”...” url=”...”>...</suppFile>: An element representing a single supplementary file. It contains further sub-elements with some supplementary file meta-data. See the <galley> element above for the definition of the mimetype and url attributes. OJS has to make sure that the locale is one of the valid OJS locales or “unknown”. This requires internal transformation of the supplementary file language to the OJS 5-letter locale format if possible.
- <titleList><title locale=”...”>...</title>...</titleList>: The supplementary file's localized title information.
- <creatorList><creator locale=”...”>...</creator>...</creatorList>: Supplementary file creators.
- <subjectList><subject locale=”...”>...</subject>...</subjectList>: Supplementary file subjects.
- <typeOtherList><typeOther locale=”...”>...</typeOther>...</typeOtherList>: Supplementary file types.
- <descriptionList><description locale=”...”>...</description>...</descriptionList>: Supplementary file descriptions.
- <sourceList><source locale=”...”>...</source>...</sourceList>: Supplementary file sources.
The <articleList> is the only mandatory element. All other <*List> elements have cardinality 0..1 with respect to their parent elements. All other elements have cardinality 0..n with respect to their parent elements.
The update handler listening at “http://127.0.0.1:8983/solr/ojs/dih” in the default embedded solr server configuration will be able to consume this XML format.
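The following sketch builds a small article list in the format described above, including the CDATA-wrapped galley sub-document. It is an illustration under stated assumptions: the function names and the input dictionary layout are hypothetical, only a few of the meta-data lists are covered, and the "-" separator used to combine installation, journal and article ID is an assumption, since the format only requires that the combination be unique.

```python
from datetime import datetime, timedelta, timezone
from xml.dom.minidom import Document


def utc_timestamp(moment):
    """Format a timezone-aware datetime as ISO 8601 UTC without second fractions."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def append_localized_list(doc, parent, list_tag, item_tag, values):
    """Append a list element with one localized item per entry; omit it when empty."""
    if not values:
        return
    list_element = doc.createElement(list_tag)
    for locale, text in values.items():
        item = doc.createElement(item_tag)
        item.setAttribute("locale", locale)
        item.appendChild(doc.createTextNode(text))
        list_element.appendChild(item)
    parent.appendChild(list_element)


def galley_xml(galleys):
    """Build the stand-alone <galleyList> document, including its XML header."""
    doc = Document()
    root = doc.createElement("galleyList")
    doc.appendChild(root)
    for galley in galleys:
        element = doc.createElement("galley")
        for name in ("locale", "mimetype", "url"):
            element.setAttribute(name, galley[name])
        root.appendChild(element)
    return doc.toxml(encoding="UTF-8").decode("UTF-8")


def article_list_xml(inst_id, articles):
    """Build the <articleList> document for one batch of articles."""
    doc = Document()
    root = doc.createElement("articleList")
    doc.appendChild(root)
    for article in articles:
        journal_id = str(article["journal_id"])
        element = doc.createElement("article")
        # Assumption: "-" as separator of the combined document ID.
        element.setAttribute("id", "-".join([inst_id, journal_id, str(article["article_id"])]))
        element.setAttribute("instId", inst_id)
        element.setAttribute("journalId", journal_id)
        root.appendChild(element)
        if article.get("authors"):
            author_list = doc.createElement("authorList")
            for name in article["authors"]:  # author order is significant
                author = doc.createElement("author")
                author.appendChild(doc.createTextNode(name))
                author_list.appendChild(author)
            element.appendChild(author_list)
        append_localized_list(doc, element, "titleList", "title", article.get("titles"))
        append_localized_list(doc, element, "abstractList", "abstract", article.get("abstracts"))
        if article.get("publication_date"):
            date = doc.createElement("publicationDate")
            date.appendChild(doc.createTextNode(utc_timestamp(article["publication_date"])))
            element.appendChild(date)
        if article.get("galleys"):
            wrapper = doc.createElement("galley-xml")
            wrapper.appendChild(doc.createCDATASection(galley_xml(article["galleys"])))
            element.appendChild(wrapper)
    return doc.toxml(encoding="UTF-8").decode("UTF-8")


# Usage example with made-up data: a publication date given in UTC+2.
sample = article_list_xml("inst1", [{
    "journal_id": 2,
    "article_id": 42,
    "authors": ["Jane Doe"],
    "titles": {"en_US": "A title"},
    "publication_date": datetime(2012, 5, 1, 12, 0, tzinfo=timezone(timedelta(hours=2))),
    "galleys": [{"locale": "en_US", "mimetype": "application/pdf",
                 "url": "http://example.org/a.pdf"}],
}])
```

Elements without data, such as the abstract list in the usage example, are left out entirely, in line with the cardinality rules above. The supplementary file sub-document would be built in the same way as the galley sub-document.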
When a user updates an OJS article, galley or supplementary file, all documents and meta-data belonging to the same article will have to be re-indexed.
Lucene does not support partial update of already indexed documents. Therefore the OJS/solr protocol does not implement a specific update syntax. Adding a document with an ID that already exists in the index will automatically delete the existing document and add the updated document.
See the protocol for document addition for more details.
We propose to support four use cases:
- delete a single article from the index
- delete all articles of a journal
- delete all articles of an installation
- delete all articles in the index
Deletion of a single article from the index is required when an article is unpublished in OJS (“rejected and archived”). Deletion of articles from a journal or installation will be required when (partially) re-building an index, see the interface specification above. Deletion of all articles in the index only differs from the third case in scenarios S3 and S4. As these are installation-overarching operations we do not recommend providing an end-user interface for this task in OJS. We rather recommend that providers completely delete or move their index directory or build a new index in the background in a separate core and switch to this core after the re-build by using direct access to the solr web interface.
All other use cases can be supported by calling solr's native update handler “.../solr/ojs/update” with the usual <delete>...</delete> syntax from within OJS. We provide journal and installation IDs in our data model so that we can batch delete all documents from these entities with a simple delete search query.
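The delete messages for the first three use cases can be sketched as follows. The <delete><id> and <delete><query> forms are solr's standard update syntax. The index field names inst_id and journal_id are assumptions made for this example (the actual names are defined in the data model), as are the function names and the choice to qualify a journal by its installation.

```python
from xml.sax.saxutils import escape


def _phrase(value):
    """Quote a value as a solr phrase so that query syntax characters are literal."""
    return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'


def delete_article(document_id):
    """Delete a single article by its solr document ID."""
    return "<delete><id>" + escape(str(document_id)) + "</id></delete>"


def delete_by_query(query):
    """Delete all documents matching a solr query."""
    return "<delete><query>" + escape(query) + "</query></delete>"


def delete_journal(inst_id, journal_id):
    """Delete all articles of one journal within one installation."""
    return delete_by_query(
        "inst_id:" + _phrase(inst_id) + " AND journal_id:" + _phrase(journal_id)
    )


def delete_installation(inst_id):
    """Delete all articles of one installation."""
    return delete_by_query("inst_id:" + _phrase(inst_id))
```

Each message would be sent over HTTP POST to the update handler. As with any solr update, the deletion only becomes visible to searches after a commit.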
When working with push updates, all deleted documents can immediately be pushed for re-indexing if required. When working with pull updates, deleted articles will be marked “not indexed” so that they'll be re-indexed automatically the next time a pull request arrives from the solr server.
The OJS/solr search protocol is the well documented “edismax” query and result format. We do not reproduce the general “edismax” syntax here. Please refer to the official solr documentation for details.
We implement a custom search request handler for OJS search queries. In the embedded scenario it will listen to requests at “http://127.0.0.1:8983/solr/ojs/search”. We do not place queries through solr's default “/select” request handler. This handler should not allow public access in the default configuration as it allows direct requests to solr's administrative request handler.
Configuring our own request handler has further advantages:
- We can preconfigure it with mandatory parameters that cannot be changed client-side. This helps to secure our request handler to a certain extent and reduces the number of parameters that need to be passed in from OJS.
- We enable advanced users to set almost all search parameters (mandatory or default) without having to change OJS code.
- We enable advanced users to define their own restricted search endpoints (e.g. filtering on a certain category of journals) if they implement a provider-wide search server. These endpoints can then be configured as custom search endpoints in OJS, see “Configuring the Deployment Scenario” above.
We only use a subset of the “edismax” search options. This subset has been described in the “querying” chapter above. Please refer there for details on the protocol parts actually being used by OJS search access to solr.
According to our requirements, we need to cover a large range of deployment scenarios, from single journal deployments of OJS (S1) all the way up to large system landscapes including integration of OJS search with arbitrary search applications (S4). Fortunately the large majority of the configuration described in this document is independent of the deployment scenario. This means that only very few parameters will differ for the recommended configuration of different deployment scenarios. More specifically we recommend two deployment options:
- The single-journal and single-installation scenarios (S1 and S2) can be supported with an embedded solr server. The configuration for the embedded server will be part of the default OJS distribution. We call this deployment option the “embedded deployment”.
- The multi-installation and “just-another-app” scenarios (S3 and S4) can be supported with a central solr server reachable from all OJS servers over the provider's internal network. We call this deployment option the “network deployment”. We believe that even large OJS providers with one hundred or more journals will not require advanced solr scalability features like replication, see the “Index Architecture” discussion above. There is nothing, however, that keeps providers from replicating their OJS core to several servers if they so wish. Balancing between replicated servers can be done over an HTTP proxy or by configuring part of the OJS installations with one back-end and part with the other. Such configurations are out of scope for this document, though.
Common Deployment Properties
All OJS installation requirements apply unchanged. The following additional installation requirements must be met by the OJS server (embedded deployment) or the solr server (network deployment):
- Operating System: Any operating system that supports J2SE 1.5 or greater (this includes all recent versions of Linux and Windows).
- Disk Space: In the case of embedded deployment, the disk the OJS installation resides on should have at least 150 MB of free disk space, and the disk on which the "files" directory resides should have enough free space to accommodate the search index created by solr. The index should require no more than double the space occupied by galleys and supplementary files in that same folder. In the network deployment, disk space requirements for the servlet container and solr binaries depend on the chosen installation details. The space for the index should be at least double the space occupied by all galleys and supplementary files of the journals to be indexed.
- RAM: Memory requirements depend very much on the size of the indexed journals. If the journals have several GB of article galley files then for best performance a few GB of RAM will be required for the solr server and for the operating system's file cache. Smaller installations require less memory. We recommend starting the embedded server with default settings and revisiting them only if performance problems occur in practice. In most cases, default settings will work well.
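The disk space rule of thumb above can be turned into a quick check. This is a rough sketch with hypothetical function names. It measures a whole directory tree, so pointing it at the "files" directory will also count files other than galleys and supplementary files and therefore over-estimates slightly.

```python
import os


def directory_size(path):
    """Return the total size in bytes of all regular files below path."""
    total = 0
    for root, _dirs, names in os.walk(path):
        for name in names:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def index_space_budget(document_bytes):
    """Upper bound for the index size: double the space of the indexed files."""
    return 2 * document_bytes


def has_room_for_index(free_bytes, document_bytes):
    """True if the free space covers the estimated index size."""
    return free_bytes >= index_space_budget(document_bytes)
```

The free space of the target disk can be obtained with shutil.disk_usage(path).free and passed in as free_bytes.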
Both deployment options have in common that the solr client and configuration will be integrated into OJS as a generic plug-in. While the plug-in is disabled, the current OJS search function will work unchanged. Enabling the plug-in will switch to solr as a search back-end.
OJS plug-in code will be maintained within PKP's official github repository. The already existing SWORD plug-in creates a precedent for the integration of 3rd-party software libraries through PKP's plug-in mechanism. No Java software has been integrated into OJS by way of plug-ins so far. We therefore expect that a few additional integration techniques need to be developed.
Our proposed integration approach is described in the README.txt provided with the plug-in and will be summarized here.
For several reasons Java binaries for jetty or solr/Lucene should not be part of PKP's default distribution:
- The binaries are large and would inappropriately blow up the OJS distribution as many OJS users will not want to use solr search.
- When distributing binaries, PKP will have to take care to always upgrade binaries to the latest version and even release hot fixes when security updates occur. This adds a lot of unnecessary maintenance cost.
An integration as a git subproject as in the case of SWORD also does not seem appropriate as jetty/solr do not use git for their projects and maintaining our own jetty/solr binary git release server would be relatively costly.
We rather recommend that users download jetty/solr binaries from their original sources unchanged and extract them into well documented destinations within the solr OJS plug-in. A preconfigured installation script can then take care of copying or linking binaries to their required locations.
We cannot define a precise prescription for installing solr in a network deployment as this will largely depend on the provider's installation policy. Most providers will probably already have a preferred servlet container and may want to install and configure container and solr through OS-specific installation mechanisms.
Solr's example server does not come preconfigured with security in mind. Solr itself does not provide any authentication or authorization mechanisms. Securing solr must mostly be done through the servlet container and by properly protecting the server solr runs on. The following recommendations should be followed:
- Servers that host solr must be properly firewalled. Only search client applications should have (restricted) access to the solr search and update interfaces. In the case of the embedded scenario this means that solr should not be exposed to the network at all.
- Administrators should pay special attention to potential CSRF risks when developing their firewall strategy for solr. Clients with access to solr (e.g. browsers of admin staff) should be protected from 3rd-party “takeover”.
- Exposing solr to the public is strongly discouraged. If done, an authentication scheme must be implemented in the servlet container or HTTP proxy to limit access to solr's admin interface, the OJS DIH import handler, the default solr update handler and the generic select handler. A sample configuration for jetty using BASIC HTTP authentication is provided in the default configuration. This is not a recommended protection mechanism, though!
- We have chosen to provide custom search handlers rather than making search available through the generic select handler. The generic select handler allows unsecured access to update and admin handlers and must therefore NOT be exposed to the public.
- We recommend disabling remote streaming in solrconfig.xml: enableRemoteStreaming = false. Otherwise content of arbitrary files the solr process has access to locally or over the network will be exposed to whoever can access solr!
- We recommend disabling JMX unless actually used.
- We recommend never to use solr's example configuration unchanged as it is not secure.
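The remote streaming recommendation can be verified automatically. The sketch below assumes the usual solrconfig.xml layout in which the switch is the enableRemoteStreaming attribute of the requestParsers element inside requestDispatcher. The function name is hypothetical. The check is deliberately strict and only passes when the attribute is explicitly set to "false".

```python
import xml.etree.ElementTree as ET


def remote_streaming_disabled(solrconfig_xml):
    """Return True only if remote streaming is explicitly switched off."""
    root = ET.fromstring(solrconfig_xml)
    parsers = root.findall("./requestDispatcher/requestParsers")
    if not parsers:
        # No explicit setting found: treat the configuration as not verified.
        return False
    return all(
        parser.get("enableRemoteStreaming", "").strip().lower() == "false"
        for parser in parsers
    )
```

A provider could run such a check against the deployed solrconfig.xml as part of an installation or upgrade script.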
As most providers operate in an Open Access scenario, we do not recommend access limitations to the search handler by default (except for the firewalling as described above). The default recommended configuration will expose the query interface to all users on the provider's network who have HTTP access to the solr endpoint.
In order to limit access in a subscription based environment and reduce the amount of data to be transferred over the network, our custom search handler was configured with mandatory (“invariant”) query parameters limiting – among other things – the returned fields to the article ID field and search score. Further recommendations for subscription-based journals have been given in this document where appropriate.
The default solr deployment descriptor has been provided in plugins/generic/solr/embedded/solr/conf/solrconfig.xml. This descriptor is recommended for both embedded and network deployments.
A default jetty configuration has been developed for the embedded scenario, see plugins/generic/solr/embedded/etc/*.*
Details of recommended solr and servlet container configurations for both scenarios will be given in the following sections.
The embedded deployment option will work for the large majority of OJS users. With a few easy and well-documented additional installation steps it is possible to transform every OJS server into a solr server that should be reasonably secure for the majority of OJS users. We have laid out these steps in the README.txt that comes with the plug-in and will be displayed on the plug-in home page as long as no working solr server has been configured for the plug-in.
The embedded deployment works with a preconfigured Jetty server and solr binaries directly deployed to the special plug-in directory “plugins/generic/solr/lib”. It is sufficient to download and extract the binaries and execute an installation script to get up and running, both on Linux and Windows operating systems. We pre-package all solr configuration required for embedded deployment inside the plug-in. No additional manual configuration is usually required. Transitive data, i.e. the index and the spellchecking dictionary, will by default be saved to the files_dir configured in config.inc.php. We'll create a “solr” sub-directory there for our purposes.
See plugins/generic/solr/embedded/bin/start.sh for further details of the configuration of the embedded scenario.
The configuration of the embedded scenario follows a "secure by default" approach. While we do recommend proper firewalling of the OJS server even in the embedded scenario, the default configuration will provide basic protection even with no firewall in place. We do this by binding the embedded Jetty server to the loopback device (127.0.0.1) which should prohibit external access to the server on most operating systems. The above comments about CSRF vulnerability of solr apply to the embedded deployment if users log into the OJS server and open a browser (or other client software with network access) there.
Even in the embedded scenario, jetty and solr will need to be upgraded from time to time, e.g. in case of security or performance updates. In this case the new versions can simply be extracted into “plugins/generic/solr/lib” following the instructions in README.txt.
In the embedded scenario, solr can be started from within OJS with a background exec() call of a start script running a daemonized version of jetty with proper start parameters. On Windows this will probably not work without additional installation steps to create a system service. We may alternatively try to work around this restriction by running Jetty within a “permanent” PHP background process (e.g. http://stackoverflow.com/questions/45953/php-execute-a-background-process). Whether this works has to be tested in practice. It doesn't seem to be a very scalable and reliable option, though.
Alternatively the Linux or Windows shell solr start script provided in plugins/generic/solr/embedded/bin can always be executed directly on the OJS server.
In the embedded scenario, the privileges of the web server / PHP user are probably appropriate for the solr server too. This will be the default case when starting solr from within OJS. Users are free, though, to execute the start script manually with any other user as long as they make sure that that user has write permissions to the solr index files.
Analyzing search query logs is a great tool to optimize search. We do not recommend enabling query logging by default in the embedded scenario, though. Most users opting for the embedded scenario will not be able to interpret query logs, so these logs will just unnecessarily occupy disk space. The default configuration sets logging levels to obtain enough information on the console when users need support through the forum or other remote communication means.
The network deployment option enables large service providers to connect any OJS installation to a solr server running in an arbitrary servlet container (e.g. Jetty or Tomcat) deployed somewhere on the local network. We do not give specific installation instructions for solr servers deployed like this as these instructions depend on the provider's individual OS and (in the case of Linux) OS distribution. We make sure, though, that providers can copy the solr configuration directory provided with the solr plug-in unchanged and plug it into their servlet container. This enables providers to get up-and-running with an OJS-compatible solr installation in very little time.
We also recommend providing full step-by-step installation instructions for a well-known Linux distribution, e.g. Debian/Ubuntu. These can then usually be adapted easily to other distributions.
Feature Implementation Matrix
The feature implementation matrix details search front-end and back-end features that must be implemented to provide minimum search functionality that guarantees compatibility with the current OJS search as well as additional optional features. It also contains back-end features that provide additional administrative advantages to providers or improve index maintenance.
Every entry in the matrix contains
- a short feature description including its relevance to the different deployment options where applicable,
- the OJS authorization level to access the feature,
- whether it is a feature already present in OJS' search implementation or whether it is a new feature,
- a description of the feature's business value,
- an approximate implementation effort classification differentiating between the OJS back-end and user interface,
- alternative implementation options (if any),
- test cases (if defined by FUB or their partners) and
The complete feature implementation matrix can be found here: https://docs.google.com/spreadsheet/ccc?key=0ArYsBcy_S9NkdFlBS0VqcE9wQjFHU3NhOFBFT191dHc&pli=1#gid=0
The feature implementation matrix is meant as a specific guideline for the project owner to select and prioritize features to be implemented in future projects. It also may be used as an implementation guideline for 3rd party service providers executing future implementation projects.