- 1 Indexing
- 1.1 Requirements
- 1.2 Index Architecture
- 1.3 Data Model
- 1.4 Document Submission and Preprocessing
- 1.4.1 Existing OJS document conversion vs. Tika
- 1.4.2 Local vs. Remote Processing
- 1.4.3 Push vs. Pull
- 1.4.4 Implications of multilingual document processing
- 1.4.5 Custom pre-processing wrapper vs. solr plug-ins
- 1.4.6 Solr Preprocessing plug-ins: DIH vs. Cell
- 1.4.7 Should we use Tika to retrieve meta-data from documents?
- 1.4.8 Transmission Protocol
- 1.4.9 Requirements
- 1.4.10 Recommendation
- 1.5 Analysis
- 1.5.1 Precision and Recall
- 1.5.2 Multilingual Documents
- 1.5.3 Character Stream Filtering
- 1.5.4 Tokenizing
- 1.5.5 Token Filtering
- 1.5.6 Stemming
- 1.5.7 Date Indexing Support
- 1.5.8 Spatial Indexing Support
- 2 Ranking
- 3 Deployment Scenarios
- 3.1 Requirements
- 3.2 Recommendations common to all scenarios
- 3.3 Recommendation for Scenario S1 and S2: Embedded Solr Server
- 3.4 Recommendation for scenario S3 and S4: Shared Solr Server
- 4 OJS/solr Protocol Specification
- 4.1 Index maintenance protocol
- 4.2 Search protocol
- 4.3 Administration and Configuration Interface
- 4.4 Search Interface
|Single Journal (S1)||Multi-Journal (S2)||Multi-Installation (S3)||Institution-wide Index (S4)|
|Document types||article meta-data, galleys and supp. files||article meta-data, galleys, supp. files + arbitrary additional documents|
|Document source||single journal||several journals of a single installation||journals across (groups of) installations||several installations + arbitrary external applications|
|Search fields (simple search)||author*, title, abstract, galley content, keyword search (discipline, subject, type**, coverage***)|
|Search fields (advanced)||author*, title, discipline, subject, type**, coverage***, galley content, supp. file content, publication date|
|Document languages||tbd. (U Berlin)|
|Mixed-in languages||tbd. (U Berlin)|
|Document formats||Plaintext, HTML, PDF, PS, Microsoft Word|
|Multi-client capabilities||scope: per-journal vs. per-installation / features: languages, ranking|
* contains first name, middle name, last name, affiliation, biography
** usually contains a research approach or method for the article
*** consists of the article's geo coverage, chronological coverage and "sample" coverage
Test Data and Sample Queries
A large number of sample queries were an integral part of the requirements specification for this project. All sample queries are executed against a mixed-language, mixed-discipline corpus of OJS test journals and articles. Both sample queries and sample data have been provided by U Berlin and their partners.
Test data were taken from live OJS journals. Wherever possible we tried to work with complete copies of journals. When full copies were not available we imported partial content (select journal issues and/or articles) into an OJS test database for indexing and querying.
The following process was applied to collect sample queries:
- We constructed an online form that simulates the OJS search form (simplified and advanced).
- Test users (editors and readers) of various OJS journals were asked to provide realistic test queries.
- Submitted test queries were executed against the test corpus.
- Result sets were returned to test users for review.
- Search results (precision, recall, ranking) were tuned according to user feedback.
The main decision with respect to index architecture is whether to use a single index or multiple indexes (and corresponding solr cores).
Advantages of a single index for all journals and document types:
- easy maintenance
- easy search across multiple document types: A single search across article meta-data, galleys and supplementary files with the intent to retrieve articles is possible.
- no need to merge, de-duplicate and rank search results from different indexes (distributed search)
NB: Storing sparse or denormalized data (e.g. across document types) is efficient in Lucene, comparable to a NoSQL database.
Disadvantages of a single index:
- ranking problems when restricting search to heterogeneous sub-sets of an index (e.g. a single journal)
- potential namespace collisions for fields if re-using the same schema for different document types (e.g. supp. file title and galley title in the same field)
- scalability problems if scaling beyond tens of millions of documents
- adding documents invalidates caches for all documents (i.e. activity in one journal will invalidate the cache of all journals)
Implications of multilingual support for index architecture
There are two basic design options to index a multilingual document collection:
- Use one index per language
- Use one field per language in a single index
Advantages of a single index:
- One index is simpler to manage and query.
- Results will already be joined and jointly ranked. No de-duplication of search results required.
Advantages of a multi-index approach:
- The multi-index approach may be more scalable in very large deployment scenarios - especially where a large number of OJS installations are indexed on a central search server.
- Language configurations can be better modularized by providing a separate solr core configuration for each language and deploying them as needed. Even hot deployment of new languages is possible, which may be an advantage for large OJS providers. No re-indexing of all documents is required when a new language is introduced. It is questionable, though, whether journals will ever introduce a new language into already published articles. If new languages are introduced for new articles only, then the single index approach does not require re-indexing either.
- The ranking metric "docFreq" is per-field while "maxDoc" is not. Using one index per language these parameters will be correct even when using a single field definition for all languages. We can easily work around this in a single-index design, however, by providing one field per language.
See http://lucene.472066.n3.nabble.com/Designing-a-multilingual-index-td688766.html for a discussion of multilingual index design.
Single Index Architecture
Several disadvantages of the single-index scenario are not relevant in scenarios S1 to S3:
- We have only one relevant document type: OJS articles. By properly de-normalizing our data we can easily avoid field name collisions or ranking problems due to re-use of fields for different content (e.g. we would certainly have two separate 'name' fields for article name and author name).
- It is not to be expected that the number of documents per journal (S1), installation (S2) or provider (S3) will exceed millions of articles. If it should happen then providers of this size will certainly have the skill available to configure a distributed search server while maintaining API compatibility based on our search interface documentation.
- In usual scenarios the cost of cache invalidation due to new galley or supplementary file upload seems reasonable. If the cost of cache invalidation or synchronous index update after galley/supp. file addition becomes prohibitive we can still choose a nightly update strategy. This is in line with the current 24 hour index caching strategy.
- Our multilingual design can be implemented in a single index.
Whether ranking will suffer from a single-index approach depends on the heterogeneity of the journals added to the index. It may become a problem when search terms that have a high selectivity for one journal are much less selective for other journals thereby distorting Lucene's default inverse document frequency (IDF) scoring measure when restricting query results to a single journal.
An example will illustrate this: Imagine that you have two Mathematics journals. One of these journals accepts contributions from all sub-disciplines while the other specializes in topology. Now a search on "algebraic topology" may be quite selective in the general Maths journal while it may hit a large number of articles in the topology journal. This is probably not a problem as long as we search across both journals. If we search within the general maths journal only, then documents matching "algebraic topology" will probably receive lower scores than they should because the overall index-level document frequency for "algebraic topology" is higher than appropriate for the article sub-set of the general maths journal. This means that in a search with several search terms, e.g. "algebraic topology AND number theory", the second term will probably be overrepresented in the journal-restricted query result set. Only experiments with test data can show whether this is relevant in practice. It is fair to assume, though, that the vast majority of queries will be across all indexed journals and therefore not suffer such distortion. This is because most users are interested in their subject matter rather than in a specific publication.
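The effect can be made concrete with a small calculation. The following is only a sketch: the article counts are invented for illustration, and the textbook formula log(N / DF) is used instead of Lucene's exact scoring.

```python
import math


def idf(total_docs: int, doc_freq: int) -> float:
    """Textbook inverse document frequency: log(N / DF)."""
    return math.log(total_docs / doc_freq)


# Hypothetical article counts, chosen only to illustrate the distortion.
general = {"docs": 1000, "algebraic topology": 10, "number theory": 100}
topology = {"docs": 1000, "algebraic topology": 400, "number theory": 20}


def journal_idf(journal: dict, term: str) -> float:
    """IDF as it would be if the journal had an index of its own."""
    return idf(journal["docs"], journal[term])


def combined_idf(term: str) -> float:
    """IDF computed from the statistics of the shared index."""
    total = general["docs"] + topology["docs"]
    doc_freq = general[term] + topology[term]
    return idf(total, doc_freq)


for term in ("algebraic topology", "number theory"):
    print(term, round(journal_idf(general, term), 3), round(combined_idf(term), 3))
```

With these assumed counts, "algebraic topology" is the more selective term within the general journal, but the shared index statistics make "number theory" look more selective. A journal-restricted search that is scored with index-wide statistics would therefore weight the two terms the wrong way round.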
NB: We do not have to bother about content heterogeneity on lower granularity levels, e.g. journal sections, as these cannot be selected as search criteria to limit search results.
The same ranking distortion could theoretically apply to multilingual content if we were to collect all languages in a single index field. In the proposed schema, however, we use a separate field per language. (See the multilingual analysis section below for details.) As document frequency counts are per index field, we'll get correct language-specific document counts. The total document count will also be ok as we'll denormalize all language versions to the article level.
S1 and S2: Embedded Solr Core
In deployment scenario S1 and S2 we only search within the realm of a single OJS installation. This means that a single embedded solr core listening on the loopback IP interface could serve such requests. This allows for very easy set-up and almost zero end-user configuration requirements.
S3: Single-Core Dedicated Solr Server
In deployment scenario S3 we search across installations. This means that the default deployment approach with a per-installation embedded solr core will not be ideal as it means searching across a potentially large number of distributed cores. Therefore, the provider will probably want to maintain a single index for all OJS installations deployed on their network.
This has a few implications:
- We have to provide guidance on how to install, configure and operate a stand-alone solr server to receive documents from an arbitrary number of OJS installations.
- The OJS solr integration will need a configuration parameter that points to the embedded solr core by default but can be pointed to an arbitrary solr endpoint (host, port) on the provider's network.
- The OJS solr document ID will have to include a unique installation ID so that documents can be uniquely identified across OJS installations.
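A minimal sketch of such a composite ID follows. The separator, the function names and the rule that installation IDs must not contain the separator are assumptions made for this example and are not part of the specification.

```python
SEPARATOR = "-"  # assumed separator, not prescribed by the specification


def solr_document_id(installation_id: str, article_id: int) -> str:
    """Combine an installation ID and an article ID into one index-wide ID."""
    if not installation_id or SEPARATOR in installation_id:
        raise ValueError("installation ID must be non-empty and must not contain the separator")
    return f"{installation_id}{SEPARATOR}{article_id}"


def split_document_id(document_id: str) -> tuple:
    """Recover (installation ID, article ID) from a composite ID."""
    installation_id, _, article_id = document_id.partition(SEPARATOR)
    return installation_id, int(article_id)


# Article 42 of two different installations no longer collides in the index.
print(solr_document_id("journals01", 42))
print(solr_document_id("journals02", 42))
```

The receiving OJS installation can split the ID again to look up the article in its own database.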
S4: Multi-Core Dedicated Solr Server(s)
In deployment scenario S4 we have an unspecified number of disparate document types to be indexed. This means that the best index design needs to be decided upon on a per-case basis. We have to distinguish two possible integration scenarios:
- display non-OJS search results in OJS
- include OJS search results into non-OJS searches
The present specification can only deal with the second case as the first almost certainly requires provider-specific customization of OJS code that we have no information about.
Our index architecture recommendation for the S4 scenario is to create a separate dedicated solr core with OJS documents exactly as in scenario S3. Then searches to the "OJS core" can be combined with queries to solr cores with non-OJS document types in federated search requests from arbitrary third-party search interfaces within the provider's network. (See http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set for one possible solution of federated search.)
This has the advantage that the standard OJS solr search support can be used unchanged based on the same documentation resources that we provide to support S3.
The only extra requirement to support the S4 scenario could be the inclusion of a globally unique "application ID" into the OJS core's unique document ID so that a federated search can uniquely identify OJS documents among other application documents. This will only be required when using solr's federated search. Otherwise the search client will query the cores separately and join documents based on application-specific logic (e.g. displaying separate result lists for different document types).
Our recommendation for the data model is based on the type of queries and results required according to our feature list. We also try to implement a data model that requires as little schema and index modifications in the future as possible to reduce maintenance cost.
Meta-data fields that we want to search separately (e.g. in an advanced search) must be implemented as separate fields in Lucene. Sometimes all text is joined in an additional "catch-all" field to support unstructured queries. We do not believe that such a field is necessary in our case as we'll do query expansion instead.
To support multilingual search (and ranking, see there) we need one field per language for all localized meta-data fields, galleys and supplementary files.
In order to avoid ranking problems we also prefer to have separate fields per document format (e.g. PDF, HTML, MS Word) rather than joining all data formats into a single search field. We can use query expansion to cover all formats while still maintaining good ranking metrics even when certain formats are not used as frequently as other formats.
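Query expansion in this sense could look roughly like the following sketch. It assumes field names of the form "name_locale" as used in this section, and it builds a plain Lucene query string with one clause per field. The helper names are hypothetical.

```python
def quote_phrase(text: str) -> str:
    """Wrap a search term in double quotes, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def expand_query(term: str, base_fields: list, locales: list) -> str:
    """Expand one user-supplied term into an OR query over all localized fields."""
    phrase = quote_phrase(term)
    clauses = [
        f"{field}_{locale}:{phrase}"
        for field in base_fields
        for locale in locales
    ]
    return " OR ".join(clauses)


print(expand_query("algebraic topology", ["titles", "abstracts"], ["en_US", "de_DE"]))
```

The same expansion can be applied to per-format full-text fields, so that the user still enters a single search term while each field keeps its own ranking statistics.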
The relatively large number of required fields for such a denormalized multilingual/multiformat data model is not a problem in Lucene (see http://lucene.472066.n3.nabble.com/Maximum-number-of-fields-allowed-in-a-Solr-document-td505435.html).
We prefer dynamic fields over statically configured fields:
- Dynamic fields allow us to reduce our configuration to one generic field definition per analyzer chain (i.e. language).
- No re-configuration or re-indexing of the data schema will be required to support additional languages or document formats.
- No re-configuration of the data schema will be required to add additional meta-data fields later.
- The ID field ("id").
- Localized article meta-data fields ("titles_xx_XX", "abstracts_xx_XX", "disciplines_xx_XX", "subjects_xx_XX", "types_xx_XX", "coverageGeo_xx_XX", "coverageChron_xx_XX", "coverageSample_xx_XX") where "xx_XX" stands for the locale of the field.
- A single localized field for supplementary file meta-data ("suppFile_xx_XX") where "xx_XX" stands for the locale.
- Localized galley and supplementary file full-text fields ("galley_full_text_mmm_xx_XX" and "suppFile_full_text_mmm_xx_XX") where "mmm" stands for the data format, e.g. "pdf", "html" or "word", and "xx_XX" stands for the locale of the document.
The exact data schema obviously depends on the number of languages and data formats used by the indexed journals.
In the case of supplementary files there may be several files for a single locale/document format combination. As we only query for articles, all supplementary file full text can be joined into a single field per language/document format. And as we do not allow queries on specific supplementary file meta-data fields we can even further consolidate supplementary file meta-data into a single field per language.
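The following sketch shows how one denormalized article document could be assembled from these naming rules. The function and its input structures are hypothetical. Only the field name patterns are taken from the list above.

```python
from collections import defaultdict


def build_article_document(doc_id, titles, galleys, supp_files):
    """Assemble one index document per article.

    titles:     {locale: title}
    galleys:    list of (format, locale, text) tuples
    supp_files: list of (format, locale, text) tuples
    """
    doc = {"id": doc_id}
    for locale, title in titles.items():
        doc[f"titles_{locale}"] = title

    # Several files may share a format/locale combination; join their text.
    joined = defaultdict(list)
    for fmt, locale, text in galleys:
        joined[f"galley_full_text_{fmt}_{locale}"].append(text)
    for fmt, locale, text in supp_files:
        joined[f"suppFile_full_text_{fmt}_{locale}"].append(text)
    for field, texts in joined.items():
        doc[field] = "\n".join(texts)
    return doc


document = build_article_document(
    "journals01-42",
    {"en_US": "On Knots", "de_DE": "Über Knoten"},
    [("pdf", "en_US", "full text of the English PDF galley")],
    [("pdf", "en_US", "data set description"), ("pdf", "en_US", "questionnaire")],
)
print(sorted(document))
```

Both supplementary files of the example end up in the single field "suppFile_full_text_pdf_en_US", so the index still contains exactly one entry for the article.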
To reduce index size and minimize communication over the network link all our fields are indexed but not stored. The only field to be stored in the index is the ID field which will also be the only field to be returned over the network in response to the query request. Article data (title, abstract, etc.) will then have to be retrieved locally in OJS for display. As we are using paged result sets this can be done without relevant performance impact.
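As a sketch, a paged, ID-only request could be parameterized as shown below. "q", "fl", "start" and "rows" are standard solr query parameters. The function name and the default page size are assumptions made for this example.

```python
from urllib.parse import urlencode


def search_params(query: str, page: int, page_size: int = 25) -> str:
    """Build the query string for one result page that returns only document IDs."""
    if page < 1:
        raise ValueError("pages are counted from 1")
    return urlencode({
        "q": query,                        # the (expanded) Lucene query
        "fl": "id",                        # return the stored ID field only
        "start": (page - 1) * page_size,   # offset of the first hit of the page
        "rows": page_size,                 # number of hits per page
    })


print(search_params('titles_en_US:"topology"', page=2, page_size=10))
```

OJS would then load title, abstract and other display data for just the IDs of the current page from its own database.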
Document Submission and Preprocessing
Existing OJS document conversion vs. Tika
Advantages of the existing OJS conversion:
- We can re-use an established process that some OJS users already know about.
Advantages of Tika:
- According to our tests, Tika works at least one order of magnitude faster than the current OJS solution. This is especially important for large deployment scenarios.
- Tika is easier to use and install than the current OJS solution. No additional 3rd-party tools have to be installed as is now the case (except for solr itself of course). HTML, MS Word and PDF documents are supported out-of-the-box by the code that comes with the standard solr distribution.
- Can be deployed independently on the search server and does not need an OJS installation to work. In scenarios S3 and S4 this means considerably less infrastructure to be deployed on OJS nodes.
- Very well tested and maintained.
- Enables indexing of additional source file types (TODO: which?).
Recommendation: Use the Tika conversion engine.
Local vs. Remote Processing
In the multi-installation scenarios S3 and S4 document preprocessing could be done locally to the installation or on the central solr server.
Advantages of local processing are:
- The solr server experiences less load for resource-intensive preprocessing tasks.
Advantages of remote processing are:
- Doing all processing on a single server will simplify maintenance as 3rd-party dependencies only need to be installed and configured on a single server. OJS installations can be added without any additional search-related installation requirements.
- Document uploads can be processed synchronously as the time-intensive preprocessing of binary documents is being done asynchronously on the remote server. Remote processing limits the time required locally to add a new document to the time it takes to send a relatively small HTTP POST request to the search server (see OJS/solr protocol specification below).
- We can keep load off the end-user facing OJS application servers for consistent perceived response time.
Recommendation: Use remote processing.
Push vs. Pull
Advantages of the push configuration:
- Indexing can be done on-demand when new documents are added to OJS. This guarantees that the index is always up-to-date.
Advantages of the pull configuration:
- Indexing schedules can be configured and co-ordinated in one single place (for scenarios S3 or S4). This may be interesting to balance the load on the central search server.
Recommendation: Use push configuration out-of-the-box but provide instructions and sample configuration for an optional pull configuration for larger deployments.
Implications of multilingual document processing
The OJS search feature returns result sets on article level rather than listing galleys or supplementary files as independent entities. This means that ideally our index should contain one entry per article so that we do not have to de-duplicate and join result sets. Different language versions and formats of articles should be spread over separate fields rather than documents. Such a denormalized design also facilitates multilingual search and ranking. A detailed argumentation for this preferred index design will be given below.
For document preprocessing this design implies that we have to join various binary files (galleys and supp. files in all languages and formats) plus the article meta-data fields into a single solr/Lucene document.
Custom pre-processing wrapper vs. solr plug-ins
We have to decide whether we want to implement our own preprocessing wrapper to solr (as has been done for the current OJS search implementation to integrate 3rd-party tools) or whether we want to re-use the preprocessing interface and capabilities provided by standard solr plug-ins that wrap the core solr server.
Advantages of a custom preprocessing interface are:
- We could use an arbitrary data transmission protocol, e.g. re-using existing export formats like the OJS native export format. This avoids cost for implementation and maintenance of a solr specific export format. We would have to write a custom conversion engine that outputs solr export format, though.
- We could re-use the existing document conversion code. As we prefer Tika for document conversion, this is not a very strong argument, either.
Advantages of standard solr plug-ins:
- We can re-use elaborate document preprocessing capabilities which we'd otherwise have to partially implement ourselves. This means we do not have to write our own wrapper code around Tika nor do we have to write code that converts the chosen OJS export format to the native solr document submission format.
- We can use Tika as a conversion engine without having to write a custom wrapper. Tika is well integrated with solr through two different plug-ins: DIH and Cell. Even our complex multilingual data model can be supported without custom Java programming.
- The preferred remote processing architecture can easily be implemented. Custom remote preprocessing code would be expensive in our case as it means either implementing and maintaining a separate PHP application or extending solr with custom Java code. As PKP does not have Java development knowledge it might even be impossible to maintain such code.
- Pull and push configurations can be supported out of the box without any additional implementation cost.
While using a solr plug-in means that the transmission protocol can no longer be arbitrary (and as we'll see later, it cannot be any of the existing OJS export formats), we'll still be able to use an export format that is much closer to OJS data structures than if we had to convert from, say, native OJS export format to the solr input format ourselves.
Recommendation: The advantages of using established solr plug-ins for data extraction and preprocessing strongly outweigh the (theoretical) advantages of writing our own solr preprocessing interface code.
Solr Preprocessing plug-ins: DIH vs. Cell
Currently there are two solr extensions that support Tika integration: the "Data Import Handler" (DIH) and the "Solr Content Extraction Library" (Solr Cell).
Cell is meant to index large amounts of files with very little configuration. It does not support more complex import scenarios with several data sources and complex transformation requirements, though. It also does not support data pull. These disadvantages rule Cell out as a solution in our case although it may be the easier-to-configure pre-processing handler.
The second standard preprocessing plug-in, DIH, is a flexible extraction, transformation and loading framework for solr that allows integration of various data sources and supports both pull and push scenarios.
Unfortunately even DIH has two limitations that are relevant in our case:
- DIH's XPath implementation does not support all types of queries. E.g. the fact that an XPath query cannot qualify on two different attributes rules out the possibility to transmit native OJS XML to DIH.
- DIH also does not usually support joining several binary documents into a single Lucene document. In fact no standard solr contribution is designed to do so out-of-the-box (see http://lucene.472066.n3.nabble.com/multiple-binary-documents-into-a-single-solr-document-Vignette-OpenText-integration-td472172.html).
Using a carefully crafted XML format for data transmission, we can work around both limitations, though. This allows us to benefit from the advantages of DIH while still being able to work with our preferred index architecture and data model.
Recommendation: Use DIH for document preprocessing with a custom XML format.
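To give an idea of what a "carefully crafted" format could mean, the sketch below builds a small XML document in which no element has to be selected by more than one attribute. All element and attribute names are invented for illustration. This is not the transmission format defined by this specification, which is laid out in the protocol section.

```python
import xml.etree.ElementTree as ET


def article_xml(article_id: str, titles: dict, galleys: list) -> str:
    """Serialize one article; galleys is a list of (format, locale, url) tuples."""
    article = ET.Element("article", {"id": article_id})

    title_list = ET.SubElement(article, "titleList")
    for locale, text in titles.items():
        ET.SubElement(title_list, "title", {"locale": locale}).text = text

    galley_list = ET.SubElement(article, "galleyList")
    for fmt, locale, url in galleys:
        # Format and locale are merged into one attribute so that a selector
        # only ever has to test a single attribute (hypothetical workaround).
        ET.SubElement(galley_list, "galley", {"key": f"{fmt}_{locale}", "url": url})

    return ET.tostring(article, encoding="unicode")


xml_text = article_xml(
    "journals01-42",
    {"en_US": "On Knots"},
    [("pdf", "en_US", "http://example.org/article/42/pdf")],
)
print(xml_text)
```

Passing only a URL per galley, rather than the file content, would also leave the retrieval and conversion of the binary files to the search server.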
Should we use Tika to retrieve meta-data from documents?
Tika can retrieve document meta-data from certain document formats, e.g. MS Word documents. This functionality is also well integrated with DIH.
Using this meta-data is problematic, though:
- Document meta-data cannot be consistently retrieved from all document types.
- Even where the document theoretically allows for storage of a full meta-data set, these meta-data may be incomplete or even wrong.
- We do have a full set of high-quality document meta-data in OJS that we can use instead.
Recommendation: Do not use Tika to extract document meta-data but use the data provided by OJS instead.
Our preferred solr plug-in DIH supports a host of data transmission protocols (e.g. direct file access, HTTP, JDBC, etc.). In our case we could use direct file access or JDBC for the embedded deployment scenario. But as we also have to support a distributed scenario we should ideally submit all data over the network stack so that we do not have to maintain several different preprocessing configurations. As by far most of the processing time is spent on the actual conversion and indexing of data and not on transmission over (local) networks, the transmission protocol is clearly not the bottleneck.
HTTP is the network protocol supported by DIH. HTTP can be used for push and pull configurations. It supports transmission of ASCII (meta-)data as well as binary (full text) documents. Our recommendation is therefore to use HTTP as the only data transmission protocol in all deployment scenarios.
Non-HTTP protocols can still be optionally supported (e.g. for performance reasons) by making relatively small custom changes to the default DIH configuration.
Exact details of the transmission protocol will be laid out in subsequent sections.
Summing up the analysis our data import process has to meet the following requirements:
- No custom Java programming should be required.
- Push and pull scenarios should be supported.
- The indexing process must work over the network.
- We have to support our article-level multilingual/multiformat data model.
- Preprocessing should be done with Tika using solr plug-ins.
We provide a prototypical DIH configuration that serves all our import and preprocessing needs without having to develop custom Java classes:
- We provide push and pull configurations. Push is supported by DIH's ContentStreamDataSource and pull is supported via the URLDataSource.
- Neither configuration requires direct file or database access. All communication is over the network stack.
- In our prototype we demonstrate a way to use a DIH FieldReaderDataSource to pass embedded XML between nested DIH XPathEntityProcessor instances. This allows us to denormalize our article entity with a single DIH configuration. We also draw heavily on DIH's ScriptTransformer to dynamically create new solr fields when additional languages or file types are being indexed for the first time. This means that no DIH maintenance will be necessary to support additional locales.
- All file transformations are done via DIH's Tika integration (TikaEntityProcessor). We nest the Tika processor into an XPathEntityProcessor and combine it with a ScriptTransformer to denormalize several binary files into dynamic solr fields.
Please see plugins/generic/solr/embedded/solr/conf/dih-ojs.xml for details.
Precision and Recall
This part of the concept describes how we analyze and index documents and queries to improve precision and recall of the OJS search. In other words: We have to include a maximum number of documents relevant to a given search query (recall) into our result set while including a minimum of false positives (precision). The ordering of the result set according to the relative relevance of documents is not being discussed here. See the "Ranking" chapter for details.
Measures that may improve recall in our case are:
- do not make a difference between lower and upper case letters
- remove diacritics to ignore common misspellings
- use an appropriate tokenization strategy (e.g. n-gram for logographic notation or unspecified languages and whitespace for alphabetical notation)
- use "stemmers" to identify documents containing different grammatical forms of the words in a query
- use phonetic analysis to ignore misspellings due to similar pronunciation
Measures that improve precision may be:
- ignore frequently used words that usually carry little distinctive meaning ("stopwords")
Often there is a certain conflict between optimizing recall and precision. Measures that improve recall by ignoring potentially significant differences between search terms may produce false positives thereby reducing precision.
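Three of these measures (case folding, removal of diacritics and stopword removal) can be sketched in a few lines. This only illustrates the idea and is not how the solr analyzers are implemented. The stopword list is a tiny made-up English sample.

```python
import unicodedata

STOPWORDS = {"the", "of", "and"}  # illustrative sample, not a real stopword list


def fold(text: str) -> str:
    """Lower-case the text and strip diacritics (e.g. 'É' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


def analyze(text: str, stopwords=STOPWORDS) -> list:
    """Whitespace-tokenize the folded text and drop stopwords."""
    return [token for token in fold(text).split() if token not in stopwords]


print(analyze("The Élan of Topology"))
print(fold("Müller") == fold("MULLER"))
```

The second call also shows the trade-off described above: folding lets "Müller" match "MULLER" (better recall), but it also makes the names "Müller" and "Muller" indistinguishable (lower precision).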
Please observe that most of the above measures require knowledge about the text language, i.e. its specific notation, grammar or even pronunciation. A notable exception to this rule is n-gram analysis which is language-agnostic. Support for a broad number of languages is one of our most important requirements. Therefore appropriate language-specific treatment of meta-data and full text documents is critical to the success of the proposed design. We'll therefore treat language-specific analysis in detail in a separate section.
- The search form is language agnostic. Search terms can be entered in any language. Queries should even return results if they are mixed-language queries.
- Specific requirements can be derived from multilingual test queries as submitted by test users (see main "Requirements" section above).
- The indexing process should be able to deal with galleys and supplementary files in different languages.
- The indexing process should usually be able to rely on the locale information given for the galley/suppl. file being indexed. A language classifier might optionally be used for galleys whose locale information is unreliable or cannot be identified.
- The indexing process should be able to deal with mixed-language documents where short foreign-language paragraphs alternate with the main galley/suppl. file language.
- The following languages should be supported out of the box: ??? (tbd. by U Berlin).
- A process should be defined and documented for plugging in additional languages on demand.
Language-specific analysis vs. ngram approach
Advantages of an ngram approach:
- very easy to implement, a single analyzer chain will do the job
- no special index design required, all fields are multi-language
- very easy to extend to new languages (no additional configuration required)
- easy to query without query expansion
Disadvantages of an ngram approach:
- language information cannot contribute to the ranking of documents
- potentially inaccurate term recognition (false positives, false negatives)
- requires more storage space
- may not deal well with a mixture of logographic/syllabic (e.g. Chinese, Japanese) and alphabetic (e.g. Western languages, Arabic) writing systems.
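A character n-gram tokenizer is indeed simple, as the following sketch shows. It is a simplified illustration and not the tokenizer shipped with solr/Lucene.

```python
def char_ngrams(text: str, n: int = 2) -> list:
    """Split each whitespace-separated token into overlapping character n-grams."""
    grams = []
    for token in text.lower().split():
        if len(token) <= n:
            grams.append(token)  # short tokens are kept whole
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


print(char_ngrams("Solr"))
print(char_ngrams("東京都"))
```

The second example illustrates the false positives mentioned above: the bigrams of "東京都" (Tokyo Metropolis) include "京都" (Kyoto), so a query for Kyoto would also match documents that only mention Tokyo.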
Language recognition vs. preset language
One possibility to do field-level/document-level language recognition is http://lucene.apache.org/solr/api/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.html.
Advantages of language recognition:
- deals with incomplete locale information of meta-data, galleys and suppl. files
- required to support multi-lingual documents
Advantages of preset languages:
- much simpler to implement
- can be done using only solr/Lucene standard components (no additional plugins based on OpenNLP, LingPipe or similar required)
- may be more reliable in our case on a document level (if locales can be recognized)
Recommendation: We should try to work with preset languages to avoid unnecessary implementation/maintenance cost and complexity. Only if we see that in practice, certain important requirements cannot be met with preset languages, should we implement language recognition where necessary.
Document vs. paragraph-level language recognition
The granularity of multilingual analysis has a great influence on implementation complexity and cost. While document-level language processing is largely supported with standard Lucene components, paragraph or sentence-level language recognition and processing requires considerable custom implementation work. This includes development and maintenance of Java language solr/Lucene plug-ins based on 3rd-party natural language processing (NLP) frameworks like OpenNLP or LingPipe.
We identified the following implementation options for multilingual support:
- Allow language-specific treatment only on a document level and treat all documents as "monolingual". Document parts that are not in the main document language may or may not be recognized depending on the linguistic/notational similarity between the main document language and the secondary language.
- Allow language-specific treatment on a document level and provide an additional "one-size-fits-all" analysis channel that tries to generically tokenize all languages (e.g. using an n-gram approach). Search queries would then be expanded across the language-specific and generic search fields. This will probably improve recall but reduce precision for secondary-language search terms.
- Perform paragraph or sentence-level language recognition and analyze text chunks individually according to their specific language. This should provide highest precision and recall but will most probably be much more difficult to implement and maintain.
The advantage of the first two options is that they can be implemented with standard solr/Lucene components. The third option will require development and maintenance of custom solr/Lucene plug-ins and integration with third-party language processing tools.
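The query expansion mentioned in the second option could look roughly like the sketch below. The field names (`galley_full_text_de`, `galley_full_text_ngram`, ...) and the boost value are assumptions made for illustration only; the only thing taken from Lucene is the standard `field:"term"^boost` query syntax.

```python
def expand_query(term, language_fields, generic_field, generic_boost=0.5):
    """Expand one search term across language-specific fields and a generic field.

    The generic (e.g. n-gram) field gets a lower boost so that matches in a
    language-specific field rank higher. All field names are hypothetical.
    """
    # Escape backslashes and double quotes so the term stays one quoted phrase.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    clauses = [f'{field}:"{escaped}"' for field in language_fields]
    clauses.append(f'{generic_field}:"{escaped}"^{generic_boost}')
    return " OR ".join(clauses)


print(expand_query(
    "Zeitschrift",
    ["galley_full_text_de", "galley_full_text_en"],
    "galley_full_text_ngram",
))
```

Down-weighting the generic field is one possible way to limit the precision loss for secondary-language terms; whether it is needed would have to be checked against sample queries.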
Experiments based on sample search queries will show which of the three options works best in our case.
Character Stream Filtering
TODO: stopwords, synonyms, lowercase, ...
Date Indexing Support
- Publication Date
- Chron Coverage
Spatial Indexing Support
- Geo Coverage
TF(t, d) is the number of times the term t occurs in document d.
Inverse Document Frequency
IDF(t) == log(N / DF(t))
- t is a term from our dictionary
- IDF(t) is the inverse document frequency of term t in an index.
- N is the total number of documents in an index. In Lucene this is called "maxDoc".
- DF(t) is the document frequency of term t in an index. The document frequency is defined as the number of documents containing the term t one or more times. In Lucene this is "docFreq".
Obs: IDF is finite if every term t occurs at least once in the document collection. If we build our dictionary from the document collection then this is guaranteed.
Combined Term / Inverse Document Frequency
TF-IDF(t, d) == TF(t, d) * IDF(t)
Obs: TF-IDF is zero if a document does not contain the term t. If the dictionary contains terms that are not in the document collection then TF-IDF is defined to be zero for this term for all documents.
Overlap Score Measure
Score(q, d) is the sum over all t in q of TF-IDF(t, d) where q is the set of all terms in a search query.
In other words: a search term contributes most to a document's score for a given search query when it occurs often in that document and is highly discriminating within the collection, i.e. it occurs in only a few documents of a large collection.
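The definitions above can be checked with a small worked example. This is a sketch on a made-up three-document collection, using the natural logarithm (the base is an assumption, the definition above does not fix it). It follows the formulas in this section literally and is not Lucene's actual scoring formula.

```python
import math
from collections import Counter

# Toy collection: document id -> list of already tokenized terms.
docs = {
    "d1": "open access journal publishing".split(),
    "d2": "journal search index".split(),
    "d3": "search index ranking search".split(),
}


def idf(term, docs):
    """IDF(t) = log(N / DF(t)); defined as 0 for terms absent from the collection."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, tokens, docs):
    """TF-IDF(t, d) = TF(t, d) * IDF(t)."""
    return Counter(tokens)[term] * idf(term, docs)


def score(query_terms, doc_id, docs):
    """Overlap score: sum of TF-IDF(t, d) over the set of query terms."""
    return sum(tf_idf(t, docs[doc_id], docs) for t in set(query_terms))


query = ["search", "ranking"]
ranked = sorted(docs, key=lambda d: score(query, d, docs), reverse=True)
for doc_id in ranked:
    print(doc_id, round(score(query, doc_id, docs), 3))
```

Here d3 ranks first because it contains "search" twice and is the only document containing "ranking", while d1 contains neither term and scores zero.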
Vector Space Model
If D is a set of terms (a dictionary) built from a collection of documents then a single document d can be regarded as a vector V(d) in a card(D)-dimensional vector space.
Example: When working with the TF-IDF as scoring measure then the n-th component of the vector produced by the vector function V(d) is the TF-IDF of the document for the n-th dictionary term.
In this model a similarity (or distance) measure can be defined that computes the similarity of two documents. A common similarity measure is the cosine similarity:
sim(d1, d2) == V(d1) . V(d2) / |V(d1)||V(d2)|
where the dot (.) stands for the vector dot product (inner product) and |V(d)| is the Euclidean norm of V(d). This is equal to the inner product of V(d1) and V(d2) after each vector has been normalized to unit length.
The advantage of this model is that not only distances/similarities between documents but also between a search query and a document can easily be calculated:
sim(d, q) == v(d) . v(q)
where v() stands for the document vector V() normalized to unit length.
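As a quick numerical check of the two formulas above, the following sketch computes the cosine similarity both ways for two small made-up vectors. The vectors stand in for V(d) and V(q); their component values are arbitrary.

```python
import math


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm."""
    return math.sqrt(dot(u, u))


def unit(u):
    """Normalize a vector to unit length (zero vectors are returned unchanged)."""
    length = norm(u)
    return [a / length for a in u] if length else list(u)


def cosine_similarity(u, v):
    """sim(u, v) = u . v / (|u| |v|); defined as 0 if either vector is zero."""
    denominator = norm(u) * norm(v)
    return dot(u, v) / denominator if denominator else 0.0


document = [1.0, 0.0, 1.0]  # stand-in for V(d)
query = [1.0, 1.0, 0.0]     # stand-in for V(q)

print(cosine_similarity(document, query))
print(dot(unit(document), unit(query)))
```

Both expressions give the same value up to floating point rounding, which is why a query can be treated as just another (short) document vector.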
Potential Additional Metrics
- Citation Index Data (document boost)
- Usage Metrics, e.g. as supplied by the OJS.de statistics sub-project (document boost)
- Click-through popularity fed back to solr via OJS (document boost)
- Document recency
The following deployment scenarios must be supported by our architecture:
- S1: search across articles of a single journal
- S2: search across multiple journals of a single OJS installation
- S3: search across various OJS installations within the provider's network
- S4: search across various applications (including one or more OJS installations) within the provider's network
Recommendations common to all scenarios
- integrate into OJS as a plug-in comparable to sword
- Properly firewall servers that host solr. Only search client applications should have (properly restricted) access to the solr search interfaces. The update, admin, debug and analysis interfaces should be either disabled or only available to properly authorized clients. Administrators should pay special attention to potential CSRF risks when developing their firewall strategy for solr.
- Use BASIC authentication for admin interface, index update, debug and analysis.
- Disable remote streaming in solrconfig.xml: enableRemoteStreaming="false".
- Disable JMX.
- Remove unused request handlers and example configuration.
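Some of these recommendations can be checked automatically. The sketch below audits a solrconfig.xml for two of them: remote streaming and leftover request handlers. The sample configuration and the list of allowed handlers are assumptions for illustration; a real configuration and the final handler list will differ.

```python
import xml.etree.ElementTree as ET

# Illustrative configuration fragment, not a complete solrconfig.xml.
SAMPLE_CONFIG = """<config>
  <requestDispatcher>
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>
  </requestDispatcher>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/debug/dump" class="solr.DumpRequestHandler"/>
</config>"""

# Hypothetical whitelist: the search handler and the DIH endpoint.
ALLOWED_HANDLERS = {"/select", "/dih-ojs"}


def audit(config_xml, allowed=ALLOWED_HANDLERS):
    """Return a list of findings that contradict the recommendations."""
    root = ET.fromstring(config_xml)
    findings = []
    for parsers in root.iter("requestParsers"):
        # Be conservative: anything other than an explicit "false" is reported.
        if parsers.get("enableRemoteStreaming") != "false":
            findings.append("remote streaming is not explicitly disabled")
    for handler in root.iter("requestHandler"):
        name = handler.get("name")
        if name not in allowed:
            findings.append(f"unexpected request handler: {name}")
    return findings


for finding in audit(SAMPLE_CONFIG):
    print(finding)
```

A check like this could run as part of packaging the plug-in, so that example handlers do not slip into a release by accident.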
As most providers operate in an Open Access scenario, we do not recommend access limitations to index data by default. The default solr server configuration will expose the index to all users on the provider's network who have HTTP access to the solr endpoint.
TODO: How do we limit access in a subscription based environment? How is this being done right now in OJS?
TODO: Limit access to DIH push endpoint.
- Which OS and servlet containers will be supported?
- What are client-specific configurations and how do we implement such multi-client capabilities?
- Do we need distributed search or replication?
- What are (dis-)advantages of the implementation as a plug-in?
Recommendation for Scenario S1 and S2: Embedded Solr Server
- maintain a pre-configured Jetty server + solr binaries to be included into a special plug-in directory
- pre-package all solr configuration inside the plug-in
- change the default port to avoid clashes with other servlet containers on the same system
In the embedded scenario we follow a "secure by default" approach.
We recommend pre-configuring all common security recommendations described above except for the firewalling which can only be done by a sufficiently privileged server administrator.
The firewalling issue can be partially mitigated by binding the Jetty server to the loopback device (127.0.0.1) which should prohibit external access in many cases even without proper firewalling in place. This does not mitigate CSRF attacks, though!
Jetty and solr will need to be upgraded from time to time, e.g. in case of security or performance updates. In this case the new versions can simply be extracted into plugins/generic/solr/lib and the symlinks pointing to them be updated.
The symlink approach won't work for a Windows installation. In this case the updated files will have to be copied into their proper place in the plug-in. It will be easy, though, to re-build the plug-in on a Linux box and zip it for distribution.
- Should we move the Jetty work directory under ojs/cache?
- Configure admin-extra.html
- How will we start jetty? From PHP with exec() of a script that runs jetty? How can we daemonize jetty on Windows so that we can do this from PHP? (e.g. http://stackoverflow.com/questions/45953/php-execute-a-background-process ?)
- Should we log requests in Jetty? And what do we do with these requests? Maybe as a debug option at dev time?
- Should we recommend a separate user to run solr under even in the embedded scenario? Or could it be the PHP/Apache user (which would simplify file permissions a lot)?
- How can we configure jetty authentication to use a password that is non-public without the user having to change etc/realm.properties?
- Which solr request handlers should remain in solr/conf/solrconfig.xml? (E.g. disable the admin/analysis/debug/example request handlers in solrconfig.xml.)
- As an additional security layer: Add a warning to the plug-in that the server should be firewalled before starting Jetty and warn admins and end users about CSRF risk.
- connect to a Jetty server deployed somewhere in the network
- provide a set of solr configuration recommendations together with the necessary binaries
- provide a default set-up based on a well-known Linux distribution e.g. Debian/Ubuntu
OJS/solr Protocol Specification
Index maintenance protocol
Adding an article to the OJS/solr index is done in three steps:
- First an XML document with all article meta-data, including the corresponding galley and supplementary file meta-data, is sent asynchronously (via the OJS background processing framework) over HTTP POST to the well-known DIH endpoint .../dih-ojs of the solr target instance.
- Then DIH will asynchronously pull full text documents from their normal OJS locations and tokenize them one by one.
- OJS will mark an article as "indexed" if the response given by solr indicates indexing success. The processing request will not be deleted until the indexing has successfully completed.
This allows us to benefit from the advantages of push communication while still keeping the DIH interface responsive enough to trigger indexing of documents within the OJS article editing process.
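The third step can be sketched as follows. The sketch assumes that solr answers with its standard XML response header, where a status of 0 signals success, and it replaces the actual HTTP POST with an injectable `send` callable. The queue structure and function names are hypothetical and not part of OJS.

```python
import xml.etree.ElementTree as ET


def indexing_succeeded(response_xml):
    """Return True if the solr response header reports status 0."""
    try:
        root = ET.fromstring(response_xml)
    except ET.ParseError:
        return False
    status = root.find("./lst[@name='responseHeader']/int[@name='status']")
    return status is not None and (status.text or "").strip() == "0"


def process_queue(pending, send):
    """Push every pending article and return the ids that were indexed.

    pending: dict mapping article id -> XML payload (hypothetical queue)
    send:    callable that posts a payload and returns the response body
    A request is only removed from the queue after a successful response.
    """
    indexed = []
    for article_id in list(pending):
        try:
            response = send(pending[article_id])
        except OSError:
            continue  # communication error: keep the request for a later retry
        if indexing_succeeded(response):
            indexed.append(article_id)
            del pending[article_id]
    return indexed


OK = '<response><lst name="responseHeader"><int name="status">0</int></lst></response>'
FAILED = '<response><lst name="responseHeader"><int name="status">500</int></lst></response>'

queue = {1: "<article-1/>", 2: "<article-2/>"}
done = process_queue(queue, lambda payload: OK if payload == "<article-1/>" else FAILED)
print(done, sorted(queue))
```

Keeping failed requests in the queue means that a retry simply re-sends the same payload, which is safe as long as re-adding a document is harmless.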
As laid out in the pre-processing section of the present document, we prefer using native solr plug-ins for data extraction. In our case we have chosen the Data Import Handler (DIH) for document extraction and pre-processing. The OJS/solr interface therefore is subject to certain limitations imposed by DIH:
- DIH's XPath implementation is not complete. Only a subset of the XPath specification is supported. For example, XPath queries that qualify on several attributes cannot be used, which rules out the OJS native export format. We have to provide a simple XML format that can be interpreted with DIH.
- DIH's Tika integration is usually restricted to a single binary document per Lucene document. In our case, however, we have to support indexing of an arbitrary number of galleys and supplementary files in different languages and formats for a single article. We found a way to work around this limitation by providing embedded CDATA-wrapped XML sub-documents for galleys and supplementary files within the main XML document. These sub-documents can be extracted separately in DIH. Together with a custom DIH ScriptTransformer, this makes DIH "believe" that it is dealing with a single binary file per document.
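The CDATA wrapping can be illustrated with a short sketch. All element and attribute names here (`article`, `galley-xml`, `galleys`, `galley`, `locale`, ...) are hypothetical placeholders and not the actual OJS/solr XML format. The point is only to show how an XML sub-document travels as opaque text inside the main document and can be parsed separately afterwards.

```python
from xml.dom.minidom import Document, parseString


def galley_subdocument(galleys):
    """Serialize the galley list as a stand-alone XML fragment."""
    inner = Document()
    root = inner.createElement("galleys")
    inner.appendChild(root)
    for locale, mimetype, url in galleys:
        galley = inner.createElement("galley")
        galley.setAttribute("locale", locale)
        galley.setAttribute("mimetype", mimetype)
        galley.setAttribute("url", url)
        root.appendChild(galley)
    return root.toxml()


def build_article_xml(article_id, title, galleys):
    """Build the main document with the galley list embedded as CDATA."""
    doc = Document()
    article = doc.createElement("article")
    article.setAttribute("id", article_id)
    doc.appendChild(article)
    title_element = doc.createElement("title")
    title_element.appendChild(doc.createTextNode(title))
    article.appendChild(title_element)
    wrapper = doc.createElement("galley-xml")
    # The sub-document is plain character data from the outer parser's view.
    wrapper.appendChild(doc.createCDATASection(galley_subdocument(galleys)))
    article.appendChild(wrapper)
    return article.toxml()


def extract_galleys(article_xml):
    """Second parsing pass: read the CDATA payload and parse it on its own."""
    outer = parseString(article_xml)
    payload = outer.getElementsByTagName("galley-xml")[0].firstChild.data
    inner = parseString(payload)
    return [(g.getAttribute("locale"), g.getAttribute("url"))
            for g in inner.getElementsByTagName("galley")]


xml = build_article_xml("ojs-1-42", "Sample Article", [
    ("en_US", "application/pdf", "http://example.org/galley/1.pdf"),
    ("de_DE", "text/html", "http://example.org/galley/2.html"),
])
print(xml)
print(extract_galleys(xml))
```

Because the outer parser sees only one text node, each article still maps to one entry at the top level, while the second pass yields one row per galley.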
The exact XML format will be described below.
Pulling an article into the index is done in three steps:
- First the solr server will send a GET request to a well-known OJS endpoint. Providers will have to configure appropriate DIH schedulers for this purpose. OJS will respond with a list of articles added since the last request. This list is in the same XML format used for the push processing to reduce extra configuration requirements.
- DIH will then loop through the document list and further request binary documents for galleys and supplementary files from their usual locations as indicated in the pulled XML document.
- Finally DIH will send a confirmation XML to OJS containing the successfully indexed documents. These documents will be marked "indexed" in OJS and not be offered for indexing again.
As indexing is an idempotent deletion/addition process in Lucene, communication or processing errors during any of the above steps will not result in an incomplete or corrupt index. In the worst case the same document will be retrieved several times.
XML format for article addition
When a user updates an OJS article, galley or supplementary file, all documents and meta-data belonging to the same article will have to be re-indexed.
Lucene does not support partial update of already indexed documents. Therefore the OJS/solr protocol does not implement a specific update syntax. Adding a document with an ID that already exists in the index will automatically delete the existing document and add the updated document.
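The delete-then-add behaviour and its consequence for retries can be modelled with a tiny in-memory stand-in. This is not Lucene code; the class `TinyIndex` and the document fields are made up for illustration.

```python
class TinyIndex:
    """In-memory stand-in for an index keyed by a unique document id."""

    def __init__(self):
        self._docs = {}

    def add(self, doc):
        # Adding a document whose id already exists replaces the old version
        # as a whole; there is no way to update a single field in place.
        self._docs[doc["id"]] = dict(doc)

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def __len__(self):
        return len(self._docs)


index = TinyIndex()
index.add({"id": "ojs-1-42", "title": "Draft title", "abstract": "First version"})

# An update re-sends the complete article, not only the changed field.
updated = {"id": "ojs-1-42", "title": "Final title", "abstract": "First version"}
index.add(updated)

# Sending the same document again, e.g. after a lost confirmation, changes nothing.
index.add(updated)

print(len(index), index.get("ojs-1-42")["title"])
```

This is why OJS has to re-send all meta-data of an article whenever any part of it changes, and why a repeated transmission after an error does not corrupt the index.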
See the protocol for document addition for more details.
The OJS/solr search protocol is simply the default solr query and result format. Please consult the solr documentation for details.
Administration and Configuration Interface
Configuring the Deployment Scenario
Partial or full re-indexing
Index optimization is most likely not relevant to the embedded scenario. To keep the OJS interface simple and easy to use, we do not support index optimization from OJS. Providers that work with large multi-installation indexes can use the usual solr interface to optimize their index if required.
The search syntax of the solr/Lucene-driven search will be a super-set of the syntax currently provided by OJS. This means that all queries that work in the current OJS search will be supported in the same way by the solr/Lucene back end. We do allow additional advanced search options, though, that are supported by Lucene only.
TODO: Describe chosen query parser.
e.g. using WordNet or user-defined synonym lists
- Highlighting (requires term-vectors and stored fields)
Result Manipulation (After-Search)
- "More like this" feature