OJSdeStatisticsConcept
As the log expiry time will be relatively low, we do not see the necessity to rotate the log table or otherwise improve scalability.
  
= Log Transfer (II) =


=== Transformation into Context Objects ===


=== OAI interface ===
  
A few properties distinguish the OAI interface required by OA-S from the default OJS OAI implementation:
* We have to protect the interface with BASIC HTTP authentication.
* We do not export meta-data about publication objects (articles, etc.) but about usage events. This also implies that data cannot be retrieved via the usual OJS database tables but must be read from the event log.

The [http://sourceforge.net/p/openaccessstati/code-0/3/tree/trunk/data-provider/ existing data provider for OA-S context objects] serves as a practical example and guideline for our work. While exact implementation details must be decided at implementation time, it currently looks as if we could fully re-use the OAI base classes provided by PKP core.

We cannot re-use the JournalOAI and OAIDAO classes as they assume publication objects to be published. We also cannot reuse the existing OAIHandler as we have to provide an authenticated interface.

We therefore have to implement the following classes and methods to support the required OAI interface (a sketch follows the list):
* An OA-S specific handler method which I propose to integrate into the existing plug-in handler class OasHandler.
* A subclass of the OAI class that provides the connection between the event log DAO and the OAI interface.
* An OA-S specific OAI format that converts event log data to XML context objects.
* The event log DAO will inherit from PKPOAIDAO so that a specific OAI DAO will not be necessary.
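For illustration, a minimal sketch of how these classes could fit together, assuming the PKP core OAI base classes can be re-used as described above. All class and method names except the core base classes are assumptions that must be checked against the actual base-class API at implementation time:

    import('lib.pkp.classes.oai.OAI');

    // Hypothetical OAI server subclass connecting the usage event log to the
    // OAI interface (all names below are illustrative only).
    class OasOAI extends OAI {
        function OasOAI(&$config) {
            parent::OAI($config);
        }

        // Records are usage events read from the event log,
        // not publication objects.
        function &records($metadataPrefix, $from, $until, $set, $offset, $limit, &$total) {
            $eventLogDao =& DAORegistry::getDAO('OasEventLogDAO'); // inherits from PKPOAIDAO
            $records =& $eventLogDao->getRecords($from, $until, $offset, $limit, $total);
            return $records;
        }
    }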
Once the OAI interface has been implemented it must be validated. OA-S provides a [https://transfer.cms.hu-berlin.de/oas_validator/index.php validator service] for the data provider OAI interface.


Further information:
* The exact OAI context object format has been [http://www.dini.de/fileadmin/oa-statistik/projektergebnisse/Specification_V5.pdf sufficiently specified] by the OA-S project and does not have to be documented here.
* A protocol to confirm successful reception of raw data has not yet been implemented (see email Julika, 31.10.2012). This is necessary to confirm log deletion. We are currently implementing a timeout to delete log events (and to conform to privacy regulations). See the details above.
  
 
=== Authentication ===
 
To make our implementation as configuration-less as possible we recommend implementing HTTP BASIC authentication in PHP rather than relying on a web server implementation (like Apache's). See http://php.net/manual/en/features.http-auth.php for details on how this can be done in PHP.

To simplify configuration we recommend using a standard username "ojs-oas" that will be the same for all OJS installations.
  
 
We recommend adding a single setting to the plugin for the password. This must be set by the end user. We preset the password with a random string so that the OAI interface is not exposed inadvertently when the OA-S plug-in is first activated.

Whenever a request to the OAI interface comes in we'll check the authentication credentials coming with the request. If they are missing we'll challenge the client with an HTTP BASIC authentication response.
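The following is a minimal sketch of the PHP-level challenge, following the pattern documented on the php.net page referenced above (the setting name 'oasPassword' is an assumption):

    $password = $plugin->getSetting($journalId, 'oasPassword');
    if (!isset($_SERVER['PHP_AUTH_USER'])
            || $_SERVER['PHP_AUTH_USER'] !== 'ojs-oas'
            || $_SERVER['PHP_AUTH_PW'] !== $password) {
        // Challenge the client, then abort the request.
        header('WWW-Authenticate: Basic realm="OA-S statistics harvesting"');
        header('HTTP/1.0 401 Unauthorized');
        exit;
    }
    // Authentication succeeded: continue serving the OAI request.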
  
= Retrieving metrics from the Service Provider (II) =

The specification of the return format has not yet been officially documented by OA-S. We do have samples of the possible return formats, though.

Here is a sample JSON response:
  
 
    {
    ...
    }
    Source: Email Julika Mimkes, 31.10.2012

And here a CSV sample:

    date;identifier;counter;counter_abstract;robots
    2012-11-10;oai:abc.de:00001;5;8;1
    2012-11-10;oai:abc.de:02444;6;2;7
    2012-11-10;oai:abc.de:05555;12;9;1
    Source: Email Matthias Hitzler, 26.02.2013
  
According to information provided by the OA-S project (Source: Email Matthias Hitzler, 26.02.2013), the following details apply to the return interface:
* Data is made available as plain text files on a web server.
* Data currently arrives with a lag of three days. This enables the service provider to make sure that data is reliable and does not have to be revised later on.
* A SOAP/SUSHI interface is in preparation but has not yet been implemented.
* The URL of the return file server is esx-143.gbv.de/data_provider_name (test environment) and esx-144.gbv.de/data_provider_name (production environment).
* A cronjob runs on a daily, weekly or monthly basis, depending on the return interval solicited during the registration phase. The cronjob does not currently run at a specific time.
* The server provides a folder structure that makes it easy to identify available data. The path pattern is /data_provider_name/YYYY/MM/startdate_enddate.{csv,json}. Dates are given in the format 'Y-m-d'. Example: 2012-01-01_2012-01-07.csv for weekly generation or 2012-01-01_2012-01-01.csv for daily retrieval.
* Files that have been written will not change. It is therefore sufficient to poll for new files and load these.
  
JSON parsing support has been integrated into PHP [http://www.php.net/manual/en/json.installation.php from version 5.2 onwards]. This is a rather recent version and it would be better if we could also support earlier 5.x versions.

Reading CSV files is [http://www.php.net/manual/en/function.fgetcsv.php supported in PHP 4 and 5] as long as the CSV conforms to basic escaping protocols. To achieve a full audit trail and allow for easy re-import of files we recommend staging statistics files locally first and loading them asynchronously into the database. This combines well with the fact that OA-S provides plain text files on their server. Scanning and downloading these files will be easy and fast.
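A minimal parsing sketch for the CSV format shown above (the staging path is an assumption):

    // Parse a staged OA-S return file with the column layout
    // date;identifier;counter;counter_abstract;robots.
    $handle = fopen($stagingDir . '/2012-01-01_2012-01-07.csv', 'r');
    $header = fgetcsv($handle, 0, ';'); // skip the header row
    while (($row = fgetcsv($handle, 0, ';')) !== false) {
        list($date, $identifier, $counter, $counterAbstract, $robots) = $row;
        // Insert/update the metrics table here ("last write wins", see below).
    }
    fclose($handle);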
  
 
We therefore propose a load protocol that implements the following steps (see the sketch after this list):
# Regularly poll the OA-S server for new data. This should be done on a daily basis and will be triggered via OJS' scheduled tasks. We'll remember the last file successfully loaded and then scan the well-known folder structure of the OA-S server for new files until we hit a 404 "not found" response.
# Whenever new data is available: Download the new metrics file to a well-known staging folder in the OJS files directory.
# Regularly poll the staging folder for new files. The scheduling for this will be done in the same way as for OA-S server polling.
# When a new file is present then try to parse and load it. Once we "claim" a file, we immediately remove it from the "hot" staging folder to avoid race conditions. When the parsing is successful then move the file to the file archive. Otherwise move it to the rejection folder where it can be analyzed, corrected and manually moved to the staging folder again.
# When a file contains data that has been loaded before then the last write wins. This guarantees that the loading process will be fully idempotent.
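A sketch of the claim/parse/archive step under the stated assumptions (folder names and the loadFile() helper are illustrative only):

    $stagingDir    = $filesDir . '/usageStats/stage';
    $processingDir = $filesDir . '/usageStats/processing';
    $archiveDir    = $filesDir . '/usageStats/archive';
    $rejectDir     = $filesDir . '/usageStats/reject';

    $files = glob($stagingDir . '/*.csv');
    if (!is_array($files)) $files = array();
    foreach ($files as $file) {
        // "Claim" the file: move it out of the hot staging folder first so
        // that concurrent processes cannot pick it up again.
        $claimed = $processingDir . '/' . basename($file);
        if (!rename($file, $claimed)) continue; // another process claimed it
        if (loadFile($claimed)) { // parse + "last write wins" load
            rename($claimed, $archiveDir . '/' . basename($claimed));
        } else {
            rename($claimed, $rejectDir . '/' . basename($claimed));
        }
    }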
If data is lost in the database then files can be moved from the archive back to the staging folder where they'll be discovered and automatically loaded again.

The cronjob configuration requirements will be documented in a README and on the plugin settings page. Polling and loading can also be triggered by user action. This is implemented through an "update statistics data" button on the plugin settings page and is meant as a fallback for those who are unable or unwilling to configure a cron job.

See the specification of the metrics table below for a detailed description of the load target.
  
= Value Added Services (II / III) =


== Common API for Value Added Services (II) ==

Some of the use cases for value added services require us to integrate various usage statistics in the same place (e.g. OAS, COUNTER/OJS, ALM). Examples of this would be:
* The exact meaning of the input variables will be explained in a later section (see below).

Furthermore we recommend the following specific API for OJS objects (application, journal, issue, issue and article galleys, articles, supplementary files):
* <code>getMetrics($metricType = null, $columns = null, $filters = null, $orderBy = null, $range = null)</code> to return a report with the given publication object pre-filtered.
* This is for convenience only as we have quite a few use cases that filter on a single publication object. These methods should only be implemented when actually needed.
 
* These methods return <code>null</code> in case metrics are not defined for the given object or filter.
* <code>Article/ArticleGalley/IssueGalley/SuppFile::getViews()</code> would be renamed to <code>...::getMetrics()</code> with the above signature throughout OJS.
* If no <code>$metricType</code> is given for <code>getMetric()</code> then the main/primary metric type will be used.
* This API can be extended to issues and journals for aggregate metrics retrieval.
* The exact meaning of the input variables will be explained in a later section (see below).


The journal and application objects will be extended to return the information required to configure a main metric for the respective context:
* Both objects contain a <code>getDefaultMetricType()</code> method that will return the currently configured main metric type (or <code>null</code> if no default metric can be found).
* Both objects contain a <code>getMetricTypes()</code> method that returns the metric types available for selection as potential main metric in the respective context.
* See the interface specification for main metric selection below.
  
 
Plugins can internally decide whether to actually retrieve metrics from the <code>MetricsDAO</code> (see next paragraph) or whether to retrieve metrics from an external location (e.g. a web service).

The OLAP data model contains the following dimensions:
* publication object (represented by assocId + assocType)
* time (day)
* metric ("OJS/COUNTER", "OA-S/COUNTER", "ALM-Facebook", ...)

Aggregation hierarchies over these dimensions:
* publication object: assocType
* publication object: article -> issue -> journal (as aggregation hierarchy, not the objects themselves)
* time: month -> year
We therefore propose a single new database table 'metrics' with the following columns:
* assoc_id: foreign key to article, article or issue galley, supplementary file
* assoc_type: determining the type of the assoc id (article, article or issue galley, supplementary file)
* article_id, issue_id, journal_id: denormalized entries for the publication dimension for improved performance and simplified aggregation
* day: the lowest aggregate level of the time dimension
* month (optional): required only if we want to support month-only aggregation, either because some metrics cannot be provided on a daily basis or to compress historical data
* metric_type: e.g. "oas::counter", "ojs::counter", "alm::facebook", etc.
* load_id (optional): a plug-in specific load identifier used to implement a restartable and scalable load process, e.g. a file name, file id or run id
* country_id (optional): the lowest aggregate level of the source geography dimension
* metric: an integer column that represents the aggregate metric

Indexes should only be placed when actually needed by real reports so that load performance can be optimized.
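For illustration, the proposed table as plain SQL (the authoritative definition would go into the OJS XML schema descriptor; column types are assumptions):

    CREATE TABLE metrics (
        assoc_type  BIGINT NOT NULL,
        assoc_id    BIGINT NOT NULL,
        journal_id  BIGINT NOT NULL,
        issue_id    BIGINT,                 -- null where not applicable
        article_id  BIGINT,                 -- null where not applicable
        day         DATE,                   -- lowest time aggregation level
        month       VARCHAR(7),             -- optional, e.g. '2013-05'
        country_id  VARCHAR(2),             -- optional ISO country code
        metric_type VARCHAR(255) NOT NULL,  -- e.g. 'oas::counter'
        load_id     VARCHAR(255),           -- optional, for restartable loads
        metric      INT NOT NULL
    );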
  
 
The number of data items in the proposed table will be considerably less than what we currently need to store raw event data. I therefore do not believe that such a table will reach the underlying database's scalability limit. If such a thing would happen we could still purge old metric data or reduce the granularity of historic data (e.g. from per-day to per-month).
An optional report-level '''column specification''' (<code>$columns</code>):
* The <code>$columns</code> variable is an unordered (non-hashed) array.
* The array contains a list of identifiers (PHP constants) for the lowest hierarchy aggregation level to be included in the report.
* The available constants are <code>STATISTICS_DIMENSION_{JOURNAL_ID,ISSUE_ID,ARTICLE_ID,ASSOC_TYPE,ASSOC_ID,DAY,MONTH,COUNTRY,METRIC_TYPE}</code>.
* NB: The <code>STATISTICS_DIMENSION_{JOURNAL_ID,ISSUE_ID,ARTICLE_ID}</code> constants are aggregation-level columns! Use them to include the article/issue/journal IDs in addition to the publication object's own ID. When an ID on a given level is not available (e.g. an <code>ARTICLE_ID</code> for issue galley statistics) then the value of that column will be <code>null</code>.
* Precisely zero or one value can be defined for each aggregation hierarchy. There may be several values per dimension if, and only if, there are several aggregation hierarchies for that dimension (e.g. authors and assocType for the publication object dimension). The dimension hierarchy specification will be checked for consistency with respect to this requirement before executing the report.
* A single scalar value can be given in case a single column should be specified.
* Dimension hierarchies not included in this specification will be aggregated over (implicit selection of the top hierarchy level).
* The input variable could probably be named <code>$aggregation</code> or something with <code>$dimension...</code> in it for better conceptual consistency. I chose to name it <code>$columns</code>, though, so that those who do not have a firm grip of OLAP concepts immediately understand what this actually means.
* If no value is given (empty array) then the lowest available granularity (all dimensions) will be used.
  
 
An optional report-level '''filter specification''' (<code>$filter</code>):
* The <code>$filter</code> variable is a hashed array.
* It contains identifiers (PHP constants) for the filtered hierarchy aggregation level as keys. The constants are the same as for the column specification above.
* As values it either contains a hashed array with from/to entries with two specific hierarchy element IDs in case of a range filter, or an unordered array of one or more hierarchy element IDs in case of an element selection filter.
* If only a single element is to be filtered then the value can be given as a single ID rather than an array.
An optional '''result order specification''' (<code>$orderBy</code>):
* The <code>$orderBy</code> variable is a hashed array.
* It contains identifiers (PHP constants) for the ordered hierarchy aggregation level as keys. The constants are the same as for the column specification above. If you want to order by the metric value then you can use the constant <code>STATISTICS_METRIC</code>.
* For each identifier you can specify one of the two values "asc" (<code>STATISTICS_ORDER_ASC</code>) or "desc" (<code>STATISTICS_ORDER_DESC</code>).
* It can only include hierarchy aggregation levels that have also been specified in <code>$columns</code>. The order specification will be checked for consistency with respect to this requirement before executing the report.
  
An optional result '''range specification''' (<code>$range</code>): This is the usual PKP paging object <code>DBResultRange</code>. If no range is given then a maximum of <code>STATISTICS_MAX_ROWS</code> rows will be returned.
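To illustrate the complete input format, here is a hypothetical call requesting monthly OA-S COUNTER metrics per article for issue 12, most-used first (all parameter values are examples; we assume the <code>MetricsDAO</code> exposes the same signature as the object-level API):

    $metricsDao =& DAORegistry::getDAO('MetricsDAO');
    $report = $metricsDao->getMetrics(
        'oas::counter',
        array(STATISTICS_DIMENSION_ARTICLE_ID, STATISTICS_DIMENSION_MONTH), // $columns
        array(
            STATISTICS_DIMENSION_ISSUE_ID => 12, // element selection filter
            STATISTICS_DIMENSION_MONTH => array('from' => '2012-01', 'to' => '2012-12') // range filter
        ),
        array(STATISTICS_METRIC => STATISTICS_ORDER_DESC) // $orderBy
    );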
  
 
With this input design in mind we can make '''usage recommendations for access APIs''' specified in previous paragraphs (access through report plugins, publication objects and the <code>MetricsDAO</code>):
* Specific reporting use cases can be tuned by introducing further aggregates (for better "dicing" support) or indexes (for "slices" that select at most a one-digit percentage of the cube). We will not implement such support unless it turns out in practice that certain use cases cannot be implemented otherwise.
  
== Value-Added Features (II / III / PKP) ==

Most of the following features are publicly available to all OJS users (readers). Plugin activation and configuration is always a journal manager's task. We'll only indicate role permissions when deviating from this default.

=== Selecting a "Main Metric" (II) ===

Based on the metrics API it will be possible to choose a "main metric" in the site and journal settings. Site-level configuration will be available to administrators and journal-level configuration to journal managers. Users with these roles will be presented with a list of all currently available metrics and may choose the one that should be used for features that need a single metric to be defined (e.g. the most-viewed articles feature).

The list of metrics in the site settings will contain only site-level plugins while the list in the journal settings will contain site-level plugins plus statistics plugins active in the journal context. The site-level default metric defaults to the first available metrics plug-in. If no metric has been chosen for a journal then the site-level main metric will be used for that journal.

According to Alec (Source: Email, 26.02.2013), the journal-level setting will temporarily be placed on the manager's "Stats & Reports" page of the journal setup and will later have to be migrated into the new tabbed OJS setup pages.

All features specified in this document can be implemented via batch operations through a single (and in one case three) calls of the statistics API. In principle such batch operations could be provided via external data sources. But for performance reasons, some of the value-added features may require the "main metric" to be cached locally (i.e. in the metrics table). In this case we propose that users receive a specific warning when selecting such a metric as "main metric", indicating which of the features will not be available. As we don't know right now whether this is necessary (in principle it is not), such a warning message will not be implemented right away and should be added as needed.
  
=== Most-Viewed Articles (II) ===

The most-viewed articles plug-in will show a list of the 10 articles ranking highest for the selected "main metric" throughout a journal. To identify an article's rank we'll sum up the metrics of its galleys, excluding the article's abstract (Source: Email Bozana, 06.03.2013). The articles will be presented as title links in a block plug-in.

Journal readers will be able to choose from three time settings: previous month, previous year and "all times". This will define the time span from which statistics will be read. By default "previous month" will be selected.

The most-viewed articles block will not be available at site level as the sidebar cannot be easily configured at site level (Source: Email Bozana, 05.03.2013).

This feature uses three separate batch requests to the metrics API. We currently assume that it can therefore be provided for all metric providers, even those that do not use the metrics table. It may be, though, that this assumption is not correct. In case it turns out that the feature cannot be used unless we have locally cached data, it must be adapted to check metrics availability first.
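For illustration, the main batch request behind the block could look like this (the call site on the journal object and the month value are assumptions):

    // Top 10 articles of the journal by main metric for the previous month.
    $range = new DBResultRange(10, 1); // 10 rows, first page
    $mostViewed = $journal->getMetrics(
        $journal->getDefaultMetricType(),
        array(STATISTICS_DIMENSION_ARTICLE_ID),
        array(STATISTICS_DIMENSION_MONTH => '2013-04'),
        array(STATISTICS_METRIC => STATISTICS_ORDER_DESC),
        $range
    );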
  
=== Search Result Ranking (III) ===

The "main metric" mentioned above can be used as a ranking factor. The "classic" SQL-based OJS search feature does not support ranking. The new Lucene plug-in can be configured to use external data to rank search results, though.
In principle there are two different ways to make Lucene aware of statistics data for ranking purposes:
# The metric data could be submitted to Lucene at indexing time.
# Alternatively metric data can be provided as an [http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes "external file field"] without re-indexing articles.

As statistics data can change frequently we'd like to avoid having to re-index articles whenever their usage data changes. Having to do so would mean a considerable performance impact. We therefore recommend the second solution.

The recommended solution requires regular generation of a metrics report for all articles. We propose generating such a report on a daily basis via cron job. The cron job would have to execute the following steps:
# Generate a customized metrics report for all articles.
# Save the metrics report as an external solr index file (or copy and update the existing file if it already exists).
# Trigger a "commit" operation on the index so that the new file will be recognized and used.
# Delete the previous file.

In the embedded configuration we'll need a separate cron job to create the file. As this cron job runs locally, we can use the existing rebuildSearchIndex.php script for it. In the central server configuration we can extend the existing "pull" cron job for the same task.

To support the central indexing server use case, the cron job will have to (partially) update rather than overwrite the existing file. We cannot update the file "in place" as it will be locked on Windows machines. We'll rather copy and update the file and delete the previously active file once the commit operation has been issued to Solr, which should unlock the previously used file. The file extension will have to be a running number as Solr will always use the last file in alphabetical order.

We recommend implementing this feature as an additional, optional search feature of the Lucene plug-in. This means that another search feature option will be added to the Lucene plug-in's configuration page. Ticking the corresponding configuration check box will activate the feature. If we implement the feature like this then it is probably not necessary to change OJS core. As soon as the feature has been enabled, the Lucene plugin will amend all search queries with an additional boost query that refers to the external field.

The customized metrics report can be easily generated via the Statistics API and provided over HTTP so that it becomes available to remote indexing servers. It contains a list of all unique index object IDs (instId + '-' + articleId) and the corresponding metric boost value, normalized to values between 1.0 (no usage = no boost) and 2.0 (highest usage); an illustrative sample follows below. The metrics report can be implemented as an operation of the LuceneHandler class and will only be available when the corresponding search feature has been enabled in the plugin.
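For illustration, such an external file could look as follows (the Solr field name "usageMetric" and the file numbering are assumptions):

    # external_usageMetric.0001 -- one "uniqueId=boost" pair per line
    test-inst-12=1.0
    test-inst-17=1.83
    test-inst-23=2.0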
It has to be decided whether we'd like to use all-time metric data or whether we'd rather restrict usage data to that accrued over the last month or last year. In principle, this decision could be delegated to the journal manager by providing an additional configuration option in the plugin configuration.

An additional option would complicate the user interface, though, and it might not be obvious to the journal manager what this option actually means. We therefore recommend using a reasonable default and adding such an option only if it turns out in practice that there is a real need for it. We recommend using "all-time" metric data for ranking as monthly or yearly data may be systematically distorted in favor of the current edition of a journal, which does not seem reasonable.

We currently assume that this feature can be provided through a single batch request to the metrics API. This means that it should be available for all metrics, even those that do not cache metric data locally in the metrics table.
=== Search Result Ordering (III) ===

Lucene search supports sorting search results by several sort criteria. We'd like to provide the same functionality for the SQL-based search implementation and include the main metric as one of the sort criteria.

'''Porting the "sort search results" feature:'''

As the "classic" OJS search feature is implemented in OJS core, this means that we'll have to port code from the Lucene plugin to the <code>ArticleSearch</code> class in OJS core. Secondly we'll have to extend the <code>ArticleSearchDAO</code> to support sorting. Sorting must be done before limiting/offsetting results so it cannot be done in <code>ArticleSearch</code> but must be done in SQL. The SQL in <code>ArticleSearchDAO::getPhraseResults()</code> already joins published article objects and orders results, so we probably won't see any performance deterioration due to sorting by different criteria.
'''Sorting by metrics data:'''

Sorting by metrics data can either be implemented by joining the metrics table directly or by preparing a separate metrics report through the metrics API. The first option would be preferable from a performance point of view. Unfortunately some statistics plug-ins will probably not use the metrics table to store aggregate metrics data. To support such plug-ins we'll instead retrieve metrics data in a batch request through the metrics API. This can still be a performance problem as we have to provide all IDs of articles for which we require metrics. If the remote statistics protocol does not support such a request it would have to retrieve metrics one by one. In this case this use case could only be made available for metrics cached in the local database (i.e. in the metrics table).

In principle sorting could be offered for all locally cached metrics and time windows ("all-time", "last month", "last year"). This would lead to an explosion of sort criteria, though, which would compromise the usability of the search interface: if we had three eligible metrics we would get nine (3 times 3) extra sort criteria, etc. We therefore recommend supporting only the "main metric" (if it is cached locally) and letting the user choose between two time windows ("all-time" and "current month"). A third time window ("last year") could be added easily if actually required by end users.

All changes required to provide search result ordering by metrics data must be implemented in OJS core as the feature should be available for the "classic" SQL-based search, which is implemented in core.
=== Most-Read Articles of the Same Author (III) ===

The article (abstract) page should include optionally configurable lists of articles relevant to the currently viewed article. One of these should be a list of all articles by the same author, ordered by the "main metric" in descending order. The ordering should be done based on "all-time" metric data; see the reasoning in the search ranking section above.

This feature can be implemented as a generic plug-in via a template hook on the article page. Internally the plugin would be called with the article ID and could then retrieve the corresponding author information and all articles published by the same author.

The author is not a dimension of its own in the metrics table as an article may be written by more than one author, so the author dimension is not additive. This means that we cannot query the author directly through the metrics DAO. As there will usually be a relatively low number of articles per author we can make a single call to the metrics API, though, with an explicit filter on the author's article IDs, which we fetch in a separate database query (see the sketch below).

As we only need a single batch request to the statistics API we assume that such a request can be fulfilled by all implementors of the API, including those not storing metrics data in the local metrics table.
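A sketch of this single call, assuming the author's article IDs have already been fetched and the <code>MetricsDAO</code> follows the API proposed above:

    $authorArticleIds = array(3, 17, 42); // fetched in a separate database query
    $metricsDao =& DAORegistry::getDAO('MetricsDAO');
    $mostRead = $metricsDao->getMetrics(
        $journal->getDefaultMetricType(),
        array(STATISTICS_DIMENSION_ARTICLE_ID),
        array(STATISTICS_DIMENSION_ARTICLE_ID => $authorArticleIds), // element selection filter
        array(STATISTICS_METRIC => STATISTICS_ORDER_DESC)
    );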
=== "Similar" Articles (III) ===

The article abstract page should (optionally) recommend a list of "similar" articles. We propose implementing this feature as a separate generic plugin hooking into the article page template.

To support this use case we recommend implementing the "similar articles" search as part of the core OJS search API. Currently such a feature is only available for the Lucene plug-in. This allows us to implement the "similar articles" plugin without having to make it directly aware of specific search implementations. The plugin would call the search API with an article ID which would retrieve similar articles based on the given ID.

To implement similarity search for the current SQL-based search implementation we recommend finding similar articles based on subject keywords. We could construct an OR-joined search request with the subject keywords of the article against the subject keyword field of all articles of the same journal (or site-wide if the search query is not restricted to a journal). If the article does not have any subject keywords assigned then an empty result set will be returned.

We recommend ordering the result set by the default ranking score of the search implementation and displaying only the first 10 results directly on the article page. In the case of a Lucene search this means that results can be ranked (among other factors) by the "main metric" as described above. In the case of the SQL-based search, results could still be pre-ordered by the "main metric". This could however place less relevant articles at the top of the list just because they have been used more. This does not seem appropriate in this specific case where "similarity" should be the main ranking factor. We therefore propose adding a "see more" link to the bottom of the list which will take the user to the default search interface where more than 10 results can be displayed and (among other search features) sorting by metric will be available.
=== Integration with the Tag Cloud Plugin (III) ===

The existing tag cloud plug-in already uses the search API to retrieve articles matching a given keyword. It will therefore automatically rank search results by "main metric" if the Lucene search is activated and ranking by metric has been enabled.

In the case of SQL-based search, pre-ordering of results by the "main metric" could be implemented without extra implementation cost via the search API but is not recommended for the same reasons laid out in the case of the "similar articles" feature. Users who wish to re-order results by metric can do so immediately in the search UI after clicking on one of the keywords.
=== Statistics Reports (PKP/Bruno) ===

NB: As agreed with Juan and Bruno, this functionality will be implemented by PKP.

Proposal for implementation: On the basis of the PHP input format specification of the front-end API (see above) we can easily define an HTTP GET version of the input protocol (see the sketch at the end of this section). Translating such a protocol one-to-one to a call on the internal API is trivial.

This will allow us to easily define reports which can then be presented in various formats (XML, CSV, HTML, PDF) without having to implement, or even think about, a heavy report generator front-end. Such a front-end can be implemented at any time if we like but I'd rather not try to re-invent Excel Pivot Tables in OJS.

With an HTTP GET protocol, exporting live OLAP data to Excel or other OLAP tools will be extremely easy. Thus data can be flexibly analyzed outside OJS while common predefined reports with dynamic choices (e.g. time) can be implemented in OJS with a lightweight front-end or on any page with a simple download/page link.

This gives advanced OJS users nice opportunities for very easy but still powerful definition of custom reports in any supported output format everywhere in OJS, or even for inclusion on external web pages. It will also make it trivial to replace existing statistics reports with links to the generic report generator. Loads of duplicate code can thereby be deleted from the code base and cross-metric reports will be available for the first time.
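An illustrative GET request (URL pattern and parameter names are assumptions, not a final protocol):

    .../statistics/report?metricType=oas::counter&columns=articleId,month&filter[issueId]=12&orderBy[metric]=desc&format=csv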
  
=== Editors, Authors and Readers Pages (PKP/Juan) ===

NB: As agreed with Juan, this functionality will be implemented by PKP based on the statistics API.


= Overview =

== The OA-S Project ==

The German Open Access Statistics (OA-S) project intends to provide a framework and infrastructure that enables OJS users to establish alternative usage statistics for Open Access content and build value added services on top of such statistics.

This specification describes the use cases, requirements and implementation recommendations for the integration of OA-S into OJS.

The basic idea of OA-S can be described as a co-operation of two institutions:

# The data provider (in this case operators of OJS installations) tracks access to documents (in our case article and issue galleys, abstracts and supplementary files).
# Access data will then be made available through a protected OAI interface and harvested by the OA-S service provider.
# The service provider will clean the raw data and produce aggregate metrics based on the COUNTER standard. At a later stage LogEc and IFABC statistics may also be supported.
# The data provider retrieves metrics from the service provider on a daily basis and stores them in OJS.

These metrics can then be used in different ways in OJS:

* They can be displayed to editors, authors and readers.
* They can be used in search for result ranking.
* Editors could produce statistics reports.
* We could provide a "most viewed articles" feature.
* We could implement a feature that displays "other (more viewed) articles of the same author".
* We could display "similar (more viewed) articles" on an article's page.

We'll use the terms "data provider" and "service provider" from here on without further explanation. "Data provider" and "OJS user" can be used synonymously. We use the term "end user" to refer to users accessing an OJS site.

== Requirements for OJS OA-S Integration ==

The requirements for OJS OA-S integration can be divided into four areas:
# log data extraction and storage (OJS)
# log transfer (OJS -> OA-S)
# metrics retrieval (OA-S -> OJS)
# value added services (OJS)

The requirements in these areas are:

Log data extraction and storage:
* We log usage events for access to issue and article galleys, article abstracts and supplementary files.
* Logged data must be pseudonymized immediately.
* Logged data must be deleted immediately after it has been successfully transferred to the service provider or after it expires.

Log transfer:
* Log data must be transformed into context objects.
* We then have to provide an HTTP BASIC authenticated OAI interface from which the service provider will harvest raw log data.

Metrics retrieval:
* We retrieve final metrics data from the service provider via a JSON web service.
* Metrics data will then be stored into an OLAP data model which allows easy data access, both granular and aggregate.

Value added services:
* Display granular (per-object) metrics to readers, authors and editors.
* Use metrics as a search ranking criterion together with the Lucene plugin.
* Define reports on OA-S metrics similarly to already existing OJS reports.
* Implement a "most viewed articles" feature.
* Implement an "other (more viewed) articles of the same author" feature.
* Implement a "similar (more viewed) articles" feature.

The following sections will provide analysis and implementation recommendations for these requirement areas. Main section titles contain Roman numerals in brackets that indicate the project phase in which this part will be specified and implemented (I, II or III). Sections with "PKP" in their title describe features that will be implemented by PKP.

= Data Extraction and Storage (I) =

This section defines how we intend to log, pseudonymize and store usage events.

== Log Events ==

We consider access to the following URLs in OJS as usage events:

article abstracts:
* .../article/view(Article)/<article-id>
* Access to .../article/view(Article)/<article-id>/<galley-id> or any other article page will NOT be counted.

article galleys:
* .../article/viewFile/<article-id>/<article-galley-id>
* .../article/download/<article-id>/<article-galley-id>
* Access to .../article/view(Article)/... and .../article/viewDownloadInterstitial/... or any other article or article galley page will NOT be counted unless the galley is a remote or HTML galley.
* NB: This differs from the usage event definition used for the current COUNTER plug-in!

supplementary files:
* .../article/downloadSuppFile/<article-id>/<supp-file-id>
* Access to .../rt/suppFileMetadata/... and .../rt/suppFiles/... or any other supp file page will NOT be counted.

issue galleys:
* .../issue/viewFile/<issue-id>/<issue-galley-id>
* .../issue/download/<issue-id>/<issue-galley-id>
* Access to .../issue/viewIssue/... and .../issue/viewDownloadInterstitial/... or any other issue page will NOT be counted to avoid double counting.
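A sketch of how these rules could be expressed in code (the helper and the galley accessor methods are assumptions):

    // Decide whether a request represents a countable usage event,
    // based on the requested page and operation.
    function isUsageEvent($page, $op, $galley = null) {
        if ($page == 'article' && $op == 'view' && !$galley) return true; // abstract view
        // Galley views only count for remote or HTML galleys (no separate download request).
        if ($page == 'article' && $op == 'view' && $galley &&
                ($galley->isHTMLGalley() || $galley->getRemoteURL())) return true;
        if ($page == 'article' && in_array($op, array('viewFile', 'download'))) return true; // article galleys
        if ($page == 'article' && $op == 'downloadSuppFile') return true; // supplementary files
        if ($page == 'issue' && in_array($op, array('viewFile', 'download'))) return true; // issue galleys
        return false; // everything else is not counted
    }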


OA-S expects us to deliver certain usage event data. OA-S provides sample code for DSpace log extraction. The same code is also provided via SVN. We analyzed the specification as well as the sample code to define the following list of all required or optional data items. The corresponding proposed OJS/PHP data source is given in parentheses:

* usage event timestamp (PHP's <code>time()</code> function +/- local time offset)
* administration section
** HTTP status code (will always be 200 in our case as we won't produce view events for non-200 responses)
** downloaded document size (difficult from PHP; we may use <code>connection_aborted()</code> to identify some incomplete downloads but this won't be reliable as PHP may end before the download actually finishes when the web server buffers (part of) the response. Stackoverflow agrees with me here.)
** actual document size (<code>PKPFile::getFileSize()</code> for galleys and supplementary files; I propose 0 in the case of article abstracts. NB: not implemented in the OA-S sample code!)
** document format (MIME type, <code>PKPFile::getFileType()</code>)
** URI of the service (e.g. OJS journal URL, <code>Config::getVar('general', 'base_url')</code>)
* referent section
** document URL (This will be the canonical "best URL" produced by <code>PKPRouter::url() + ...::getBest...Id()</code>)
** internal document ID (This will be an OJS-specific ID similar to the default DOI that can be produced under all circumstances and will help us to uniquely identify a document in the return data)
** optional: one or more unique document IDs (e.g. DOI, URN, etc.) (can be easily retrieved from the pubId-plugins for objects where IDs are defined)
* referring entity section
** HTTP referrer (if available, <code>$_SERVER['HTTP_REFERER']</code>)
** optional: additional identifiers of the referring entity (e.g. DOI, ...) if available (not implemented in OJS)
* requester section
** hashed + salted IP (<code>PKPRequest::getRemoteAddr()</code> + hashing)
** hashed C class (<code>PKPRequest::getRemoteAddr()</code> + truncation + hashing)
** hostname of the requesting entity (if available), truncated to the second level domain (recommendation: do not implement in OJS as this would require one DNS request per view event, which would be very expensive; exception: use the hostname if it is present in <code>$_SERVER</code> "for free"; use the algorithm from the sample code in logfile-parser/lib/oasparser.php, get_first_level_domain())
** optional: classification, see the sample code, logfile-parser/lib/oasparser-webserver-dspace.php for an example (seems to be a temporary implementation, too)
*** internal: Usage events due to internal requirements, e.g. automated integrity checks, availability checks, etc.
*** administrative: Usage events that happen due to administrative decisions, e.g. for quality assurance. (proposal: use this category for accesses by logged-in editors, section editors, authors, etc.)
*** institutional: Usage events triggered from within the institution running the service for which usage events are being collected.
** optional: hashed session ID or session (recommendation: do not send)
** HTTP user agent (if available, use <code>PKPRequest::getUserAgent()</code>)
* [http://www.openurl.info/registry/docs/xsd/info:ofi/fmt:xml:xsd:sch_svc service type] section
** We'll designate statistics for abstracts with the "abstract" flag.
** Galleys will be flagged as "fulltext".
** Other publication objects will not be flagged.
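For the two hashed requester entries, a minimal sketch (the concrete hash function is an assumption; the OA-S specification prescribes the exact algorithm to use):

    $ip = $request->getRemoteAddr();                              // e.g. "192.168.1.17"
    $cClass = implode('.', array_slice(explode('.', $ip), 0, 3)); // "192.168.1"
    $hashedIp = hash_hmac('sha256', $ip, $salt);
    $hashedCClass = hash_hmac('sha256', $cClass, $salt);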


According to OA-S information, these data items will be used for the following purposes:
* IP address and timestamp are used to recognize double downloads as defined by the COUNTER standard. Such "double clicks" will be counted as a single usage event.
* The C class of the IP address will furthermore be used to recognize robots and exclude their usage from the statistics.
* The file information (URL, name, document ID, URL parameters, etc.) is used to uniquely identify the document which has been accessed.
* The HTTP status code will be used as only successful access may be counted.
* The size of the document is used to identify full downloads (e.g. 95% of the file downloaded). Partial or aborted downloads will not be counted as usage events.
* The HTTP user agent will be used to identify robots and to remove their usage from the statistics.
* The referrer information is used to analyze how users found the service and can be used to improve the service (potential sources: search engines, organizational web portal).


To capture the required information, I recommend implementing a specialized view event hook that all statistics plug-ins can subscribe to. This allows us to better standardize OJS view events (e.g. to simulate Apache log events as much as possible) and keeps code overhead to a minimum. If an additional OJS hook should be avoided we can hook into the existing <code>TemplateManager::display</code> and <code>FileManager::downloadFile</code> hooks and filter these generic events to identify the view events we are interested in.

If we implement a statistics event hook then I recommend that such a hook provide data items similar to the variables available in Apache's log component. This enables us to easily switch the statistics hook implementation later, e.g. to one based on Apache logs or shared storage as has been proposed.
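A minimal sketch of how a statistics plug-in could subscribe to such a hook; the hook name 'UsageEvent' and the payload layout are assumptions, only <code>HookRegistry</code> itself is existing PKP infrastructure:

    // Inside the statistics plug-in's register() method:
    HookRegistry::register('UsageEvent', array(&$this, 'logUsageEvent'));

    // The callback receives Apache-log-like data items.
    function logUsageEvent($hookName, $args) {
        $usageEvent = $args[0]; // assumed payload: URL, timestamp, IP, user agent, ...
        // Pseudonymize and write to the event log here.
        return false; // let other statistics plug-ins process the event, too
    }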

== Privacy Protection ==

We assume that many OJS providers using the OA-S extension will be liable to German privacy laws. While OJS users will have to evaluate their legal situation on a per-case basis and we cannot guarantee in any way that OJS conforms to all legal requirements in individual cases, we provide basic technical infrastructure that may make it easier for OJS users to comply with German privacy law.

=== Legal Requirements ===

The OA-S project commissioned two legal case studies with respect to German privacy law: one describes the legal context of OA-S application at the University of Stuttgart, the other focuses more generally on OA-S users, especially project members. The first report was produced during an earlier phase of the OA-S project when privacy-enhancing measures, like the use of a SALT for IP hashing, were not yet specified. The second report is more recent. It assumes an enhanced privacy infrastructure, i.e. the use of a SALT to pseudonymize IP addresses. We therefore base our implementation recommendations on the results of the second report.

The report recommends that data providers liable to German privacy law implement the following infrastructure:

  • All personal data must be pseudonymized immediately (within a few minutes) after being stored. This can be achieved by hashing IP addresses with a secret salt. The salt must have a length of at least 128 bits and must be cryptographically secure. The salt must be renewed about once a month and may not be known to the OA-S service provider. The salt will be distributed through a central agent to all data providers. A single salt can be used for all data providers if they do not share pseudonymized data. Pseudonymized data must be transferred to the service provider and thereafter deleted by the data provider at short intervals, e.g. every five minutes.
  • Data providers have to provide the means for end users to deny data collection ("opt-out"). The cited report comes to the conclusion that an active "opt-in" of end users is not necessary if data will be reliably pseudonymized. It recommends an "opt-out" button which, if clicked, could result in a temporary cookie being set in the end user's browser. Whenever such a cookie is present, usage data may not be collected. The report recommends against setting a permanent cookie as this may now or in the future require active "opt-in" on the part of the end user. Alternatively the user's IP address could be blacklisted while using the service, i.e. entered into a table and all data from that IP would then not be stored. The blacklist entry would have to be deleted after the user session expires.
  • Data providers have to inform end users about their right to opt out of data collection before they start using the service. They also have to inform the end user that opting out of the service will result in a temporary cookie being set in their browsers. This information must be available not only once when the user starts using the service but permanently, e.g. through a link.
  • Data providers will have to implement further organizational measures (registration of data processing organizations, reporting data usage to end users on demand).

Salt Management Interface

As pointed out in the previous section, we'll have to salt our pseudonymization hash function. Within the OA-S project, University Library Saarbrücken (SULB) provides a central SALT distribution mechanism as described in the OA-S technical manual for new repositories. SALTs will be provided on a monthly basis and have to be downloaded to OJS. SULB provides a Linux shell script (alternative unprotected link) to download SALTs. We rather recommend downloading the SALT from within OJS directly, to avoid the additional complexity of calling a shell script and to better support Windows users. The SALT can be downloaded from an HTTP BASIC protected location.

A new salt is usually provided at the beginning of each month. We recommend implementing the following salt management algorithm in OJS:

   Whenever we receive a log event:
       IF the download timestamp of the current SALT is within the current month THEN
           use the current SALT to pseudonymize log data
       ELSE
           IF the "last download time" lies within the last fifteen minutes THEN
               use the current SALT
           ELSE
               authenticate to "oas.sulb.uni-saarland.de"
               download "salt_value.txt"
               set the "last download time" to the current time() value
               IF the downloaded SALT is different from the current SALT THEN
                   replace the current SALT with the downloaded SALT
                   set the timestamp of the SALT to the current time() value
                    use the new SALT to pseudonymize log data
               ELSE
                   use the current SALT
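
A rough PHP sketch of this algorithm, assuming hypothetical plug-in setting names and ignoring error handling:

   // Illustrative sketch only: the setting names, the exact URL and the
   // credential handling are assumptions.
   function getSalt(&$plugin) {
       $salt = $plugin->getSetting(0, 'salt');
       if (date('Y-m', (int)$plugin->getSetting(0, 'saltTimestamp')) == date('Y-m')) {
           return $salt; // current SALT was downloaded within the current month
       }
       if (time() - (int)$plugin->getSetting(0, 'lastSaltDownload') < 15 * 60) {
           return $salt; // last download attempt was less than fifteen minutes ago
       }
       // Authenticate to oas.sulb.uni-saarland.de and fetch salt_value.txt.
       $context = stream_context_create(array('http' => array(
           'header' => 'Authorization: Basic ' . base64_encode('user:password')
       )));
       $newSalt = trim(@file_get_contents(
           'https://oas.sulb.uni-saarland.de/salt_value.txt', false, $context));
       $plugin->updateSetting(0, 'lastSaltDownload', time());
       if ($newSalt && $newSalt != $salt) {
           $plugin->updateSetting(0, 'salt', $newSalt);
           $plugin->updateSetting(0, 'saltTimestamp', time());
           return $newSalt;
       }
       return $salt;
   }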

Pseudonymization

The OA-S sample application provides an algorithm for IP pseudonymization based on the SALT value retrieved from SULB.

We recommend using this exact PHP function to pseudonymize IPs in OJS.
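
For illustration only (the production code should copy the function from the OA-S sample application verbatim), a salted IP hash could look like this:

   // Illustration, not the official OA-S function: pseudonymize an IP
   // address by hashing it together with the monthly SALT.
   function pseudonymizeIp($ip, $salt) {
       return hash('sha256', $salt . $ip);
   }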

Opt-Out and Privacy Information

We propose to implement a small block plug-in that allows for opt-out and privacy information display. The block plug-in will provide a single "privacy" link in the sidebar.

The block plug-in will only appear on pages that may trigger a usage event, i.e. the article abstract page, the issue galley page, the article galley pages and the supplementary file page.

Clicking on the privacy link will open up a plug-in-specific page that contains the privacy information as well as an opt-out button. Clicking on the opt-out button will set a temporary cookie with a validity of one year.

If the opt-out cookie is present in the request then no OA-S statistics events will be logged at all. The cookie will be renewed whenever the user accesses OJS.
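
A minimal sketch of the opt-out check that would guard the event logging (the cookie name is an assumption):

   // Skip statistics logging if the hypothetical 'oas-opt-out' cookie is
   // set; renew the cookie for another year on every access.
   if (isset($_COOKIE['oas-opt-out'])) {
       setcookie('oas-opt-out', '1', time() + 365 * 24 * 60 * 60, '/');
       return false; // do not log this usage event
   }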

Data Storage

Statistics events have to be stored temporarily between the time they arrive and the time they are harvested by the service provider. For intermediate data storage I recommend a plug-in-specific internal database table that contains the fields mentioned in #Log Events above. Personal data (IP) must be stored in its pseudonymized form (see #Privacy Protection).

Due to privacy restrictions and to avoid scalability problems, we should delete log data as soon as it has been successfully transferred to the service provider. Unfortunately, OA-S has not yet specified a protocol that allows us to determine when data has been received by the service provider's server (Source: Email Julika 30.10.2012). OA-S uses the OAI protocol to harvest statistics data. This protocol does not support success/failure messages by itself. Harvesters usually retry access when there was a communications failure. Although access to the OAI interface is authenticated, this means that we cannot delete data whenever we receive an OAI request. We rather have to define a fixed maximum time that log data may be kept in the log and then delete it, independently of its transfer status.

We therefore recommend saving log data indexed by its timestamp. Whenever we manipulate the log table (i.e. when receiving a log event) we'll automatically delete all expired log data.

As the log expiry time will be relatively low, we do not see the necessity to rotate the log table or otherwise improve scalability.
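
A sketch of this expiry logic, assuming a hypothetical event log DAO and expiry constant:

   // Illustrative only: the DAO, its method names and the expiry constant
   // are assumptions.
   define('OAS_LOG_EXPIRY_SECONDS', 7 * 24 * 60 * 60); // exact value tbd.

   function logEvent(&$eventLogDao, $entry) {
       $eventLogDao->insertEntry($entry);
       // Purge all entries older than the fixed maximum retention time,
       // independently of their transfer status.
       $eventLogDao->deleteEntriesBefore(time() - OAS_LOG_EXPIRY_SECONDS);
   }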

Log Transfer (II)

Transformation into Context Objects

The chosen log format for statistics events has been kept as close as possible to the required context object specification. We'll use PHP's XML DOM to build context objects from records. This will be implemented as a simple filter class that takes a log record array as input and returns the XML DOM object as output.
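
A sketch of such a filter using PHP's DOM extension; the element names and namespace URI below are placeholders for the real OA-S context object schema:

   // Illustrative record-to-context-object filter; the actual element
   // structure must follow the OA-S context object specification.
   function logRecordToContextObject($record) {
       $doc = new DOMDocument('1.0', 'utf-8');
       $ctx = $doc->createElementNS('info:ofi/fmt:xml:xsd:ctx', 'context-object');
       $ctx->setAttribute('timestamp', $record['timestamp']);
       $referent = $doc->createElement('referent');
       $referent->appendChild($doc->createElement('identifier', $record['documentUrl']));
       $ctx->appendChild($referent);
       $doc->appendChild($ctx);
       return $doc;
   }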

OAI interface

A few properties distinguish the OAI interface required by OA-S from the default OJS OAI implementation:

  • We have to protect the interface with BASIC HTTP authentication.
  • We do not export meta-data about publication objects (articles, etc.) but about usage events. This also implies that data cannot be retrieved via the usual OJS database tables but must be read from the event log.

The existing data provider for OA-S context objects serves as a practical example and guideline for our work. While exact implementation details must be decided at implementation time it currently looks as if we could fully re-use the OAI base classes provided by PKP core.

We cannot re-use the JournalOAI and OAIDAO classes as they assume that the exported records are published publication objects. We also cannot re-use the existing OAIHandler as we have to provide an authenticated interface.

We therefore have to implement the following classes and methods to support the required OAI interface:

  • An OA-S specific handler method which I propose to integrate in the existing plug-in handler class OasHandler.
  • A subclass of the OAI class that provides the connection between the event log DAO and the OAI interface.
  • An OA-S specific OAI format that converts event log data to XML context objects.
  • The event log DAO will inherit from PKPOAIDAO so that a specific OAI DAO will not be necessary.

Once the OAI interface has been implemented it must be validated. OA-S provides a validator service for the data provider OAI interface.


Further information:

  • The exact OAI context object format has been sufficiently specified by the OA-S project and does not have to be documented here.
  • A protocol to confirm successful reception of raw data has not yet been implemented (see email Julika, 31.10.2012). Such a protocol would be necessary to confirm that log data can safely be deleted. For now we implement a fixed timeout to delete log events (and to conform to privacy regulations). See the details above.

Authentication

The OA-S specification demands protecting the OA-S OAI interface with HTTP BASIC authentication. Unfortunately a more secure authentication protocol is not supported by OA-S.

To make our implementation as configuration-less as possible we recommend implementing HTTP BASIC authentication in PHP rather than relying on a web server implementation (like Apache's). See http://php.net/manual/en/features.http-auth.php for details on how this can be done in PHP.

To simplify configuration we recommend using a standard username "ojs-oas" that will be the same for all OJS installations.

We recommend adding a single setting to the plugin for the password. This must be set by the end user. We preset the password with a random string so that the OAI interface will not be exposed inadvertently when first activating the OA-S plug-in.

Whenever a request to the OAI interface comes in we'll check the authentication credentials coming with the request. If they are missing we'll challenge the client with an HTTP BASIC authentication answer.
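
Following the approach documented on php.net (see the link above), the check could look like this; only the way the configured password is retrieved is an assumption:

   // HTTP BASIC authentication in plain PHP. $password holds the plugin
   // setting described above (how it is retrieved is an assumption).
   if (!isset($_SERVER['PHP_AUTH_USER'])
           || $_SERVER['PHP_AUTH_USER'] != 'ojs-oas'
           || $_SERVER['PHP_AUTH_PW'] != $password) {
       header('WWW-Authenticate: Basic realm="OJS OA-S statistics"');
       header('HTTP/1.0 401 Unauthorized');
       exit;
   }
   // ...authenticated: handle the OAI request...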

Retrieving metrics from the Service Provider (II)

The specification of the return format has not yet been officially documented by OA-S.

We do have samples of the possible return formats, though.

Here a sample JSON response:

   {
     "from": "2012-10-01",
     "to": "2012-10-30",
     "entrydef": ["identifier", "date", "counter", "LogEc", "IFABC", "RobotsCounter", "RobotsLogEc", "RobotsIFABC"],
     "entries": [
       {"identifier": "oai:DeinDienst.de:00001", "date": "2012-10-10", "counter": 0, "LogEc": 0, "IFABC": 1, "RobotsCounter": 0, "RobotsLogEc": 0, "RobotsIFABC": 0},
       {"identifier": "oai:DeinDienst.de:00037", "date": "2012-10-11", "counter": 0, "LogEc": 0, "IFABC": 1, "RobotsCounter": 0, "RobotsLogEc": 0, "RobotsIFABC": 0},
       ...
     ]
   }
   Source: Email Julika Mimkes, 31.10.2012

And here a CSV sample:

   date;identifier;counter;counter_abstract;robots
   2012-11-10;oai:abc.de:00001;5;8;1
   2012-11-10;oai:abc.de:02444;6;2;7
   2012-11-10;oai:abc.de:05555;12;9;1
   Source: Email Matthias Hitzler, 26.02.2013

According to information provided by the OA-S project (Source: Email Matthias Hitzler, 26.02.2013), the following details apply to the return interface:

  • Data is made available as plain text files on a web server.
  • Data currently arrives with a lag of three days. This enables the service provider to make sure that data is reliable and does not have to be revised later on.
  • A SOAP/SUSHI interface is in preparation but has not yet been implemented.
  • The URL of the return file server is esx-143.gbv.de/data_provider_name (test environment) and esx-144.gbv.de/data_provider_name (production environment).
  • A cronjob runs on a daily, weekly or monthly basis, depending on the return interval requested during the registration phase. The cronjob does not currently run at a specific time.
  • The server provides a folder structure that makes it easy to identify available data. The path pattern is /data_provider_name/YYYY/MM/startdate_enddate.{csv,json}. Dates are given in the format 'Y-m-d'. Example: 2012-01-01_2012-01-07.csv for weekly generation or 2012-01-01_2012-01-01.csv for daily retrieval.
  • Files that have been written will not change. It is therefore sufficient to poll for new files and load these.

JSON parsing support has been integrated into PHP from version 5.2 onwards. This is a rather recent version and it would be better if we could support earlier 5.x versions, too.

Reading CSV files is supported in PHP 4 and 5 as long as the CSV conforms to basic escaping conventions. To achieve a full audit trail and allow for easy re-import of files we recommend staging statistics files locally first and loading them asynchronously into the database. This combines well with the fact that OA-S provides plain text files on their server. Scanning and downloading these files will be easy and fast.

We therefore propose a load protocol that implements the following steps:

  1. Regularly poll the OA-S server for new data. This should be done on a daily basis and will be triggered via OJS' scheduled tasks. We'll remember the last file successfully loaded and then scan the well-known folder structure of the OA-S server for new files until we hit a 404 not found response.
  2. Whenever new data is available: Download the new metrics file to a well-known staging folder in the OJS files directory.
  3. Regularly poll the staging folder for new files. The scheduling for this will be done in the same way as for OA-S server polling.
  4. When a new file is present then try to parse and load it. Once we "claim" a file, we immediately remove it from the "hot" staging folder to avoid race conditions. When the parsing is successful then move the file to the file archive. Otherwise move it to the rejection folder where it can be analyzed, corrected and manually moved to the staging folder again.
  5. When a file contains data that has been loaded before then the last write wins. This guarantees that the loading process will be fully idempotent.

If data is lost in the database then files can be moved from the archive back to the staging folder where they'll be discovered and automatically loaded again.
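
A sketch of the polling step (step 1), assuming daily files and the folder structure described above; function and parameter names are illustrative:

   // Scan the well-known OA-S folder structure for files newer than the
   // last file loaded and stage them locally. Authentication and error
   // handling are omitted; all names here are assumptions.
   function pollOasServer($baseUrl, $lastLoadedDay, $stagingFolder) {
       $day = strtotime($lastLoadedDay . ' +1 day');
       while (true) {
           $name = date('Y-m-d', $day) . '_' . date('Y-m-d', $day) . '.csv';
           $data = @file_get_contents($baseUrl . date('/Y/m/', $day) . $name);
           if ($data === false) break; // 404: no newer file available yet
           file_put_contents($stagingFolder . '/' . $name, $data);
           $day = strtotime('+1 day', $day);
       }
   }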

The cronjob configuration requirements will be documented in a README and on the plugin settings page. Polling and loading can also be triggered by user action. This is implemented through an "update statistics data" button on the plugin settings page and is meant as a fallback for those who are unable to or do not want to configure a cron job.

See the specification of the metrics table below for a detailed description of the load target.

Value Added Services (II / III)

Common API for Value Added Services (II)

Some of the use cases for value added services require us to integrate various usage statistics in the same place (e.g. OAS, COUNTER/OJS, ALM). Examples of this would be:

  • display of all article-specific metrics to end readers on the article abstract page
  • selection of a metric for search ranking
  • cross-metric reports for OJS editors.

To implement such use cases, we need an implementation-agnostic cross-plugin API that allows us to treat statistics from various sources uniformly. I therefore recommend two additions to core OJS:

  1. A specialized plug-in API similar to the API for public identifiers to be implemented by all ReportPlugin classes.
  2. A multi-dimensional online analytical processing (OLAP) database table for metric storage.

The following two sections will describe our recommendations for these additions in more detail.

Plug-In API

The proposed metrics API allows granular access to any number of metrics provided by different plug-ins. If there are several metric types, we have to define a site- or journal-specific primary metric type which will then be used where a single "most important" metric is required (e.g. for search ranking). We propose that this selection be made in the site or journal settings by the OJS administrator or journal manager, respectively.

I propose a change to the ReportPlugin base class. Similarly to PubIdPlugin, this class could serve as a plug-in agnostic metric provider:

  • Statistics plug-ins (e.g. COUNTER, OA-S, ALM) should provide a ReportPlugin.
  • If a plug-in needs hooks (e.g. to track usage events) it should be nested in a GenericPlugin.

We recommend the following specific API for ReportPlugin:

  • getMetrics($metricType = null, $columns = null, $filters = null, $orderBy = null, $range = null)
  • getMetricTypes()
  • getMetricDisplayType($metricType)
  • getMetricFullName($metricType)
  • These methods should return null for plug-ins that do not wish to provide metrics, e.g. the current articles, reviews and subscriptions reports.
  • The exact meaning of the input variables will be explained in a later section (see below).
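
As a sketch, the ReportPlugin base class could declare this API as follows (default implementations returning null, to be overridden by plug-ins that provide metrics):

   // Sketch of the proposed additions to the ReportPlugin base class.
   class ReportPlugin extends Plugin {
       // ...existing report plug-in API...

       // Return report data (or a scalar metric value); null if unsupported.
       function getMetrics($metricType = null, $columns = null, $filters = null,
               $orderBy = null, $range = null) { return null; }

       // Return the metric types served by this plug-in, e.g. array('oas::counter').
       function getMetricTypes() { return null; }

       function getMetricDisplayType($metricType) { return null; }
       function getMetricFullName($metricType) { return null; }
   }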

Furthermore, we recommend the following specific API for OJS objects (application, journal, issue, issue and article galleys, articles, supplementary files):

  • getMetrics($metricType = null, $columns = null, $filters = null, $orderBy = null, $range = null) to return a report with the given publication object pre-filtered.
  • This is for convenience only as we have quite a few use cases that filter on a single publication object. These methods should only be implemented when actually needed.
  • These methods return null in case metrics are not defined for the given object or filter.
  • Article/ArticleGalley/IssueGalley/SuppFile::getViews() would be renamed to ...::getMetrics() with the above signature throughout OJS.
  • If no $metricType is given for getMetrics() then the main/primary metric type will be used.
  • This API can be extended to issues and journals for aggregate metrics retrieval.
  • The exact meaning of the input variables will be explained in a later section (see below).

The journal and application object will be extended to return information required to configure a main metric for the respective context:

  • Both objects contain a getDefaultMetricType() method that will return the currently configured main metric type (or null if no default metric can be found).
  • Both objects contain a getMetricTypes() method that returns the metric types available for selection as potential main metric in the respective context.
  • See the interface specification for main metric selection below.

Plugins can internally decide whether to actually retrieve metrics from the MetricsDAO (see next paragraph) or whether to retrieve metrics from an external location (e.g. a web service).

Metrics Table

It is entirely possible to specify a common data model for aggregate metrics storage in OJS as front-end statistics use cases are common to all statistics plug-ins. We therefore recommend consolidating the current plug-in-specific metrics stores into a single metrics table.

While saving development time and complexity, such a table is also crucial for implementing responsive and flexible cross-metrics reporting. Among other front-end use cases, such a table will help us reach our goal of replacing plug-in-specific reports with a simple report generator for all metrics. It would also help us implement requirements such as "time-based reporting" or any other use case that requires access to aggregate metrics data (which is most of them!) in a simple and efficient way across all metrics plugins.

Having local aggregate metrics data is necessary to provide speedy reports. I do not believe that building cross-metric aggregate reports based on on-the-fly access to a remote web service can be done reliably and with acceptable response times. Plugins like ALM that choose to implement a plug-in specific metrics storage or wish to retrieve metrics from a web service on the fly will still be able to integrate with all OJS front-end features through the API described in the previous paragraph.

Conceptually, the proposed table represents a multi-dimensional OLAP cube. It allows both granular (drill-down) and aggregate (slice-and-dice) access to metrics data.

The proposed cube has the following properties...

Dimensions:

  • publication object (represented by assocId + assocType)
  • time (day)
  • metric ("OJS/COUNTER", "OA-S/COUNTER", "ALM-Facebook", ...)

Aggregation hierarchies over these dimensions:

  • publication object: assocType
  • publication object: article -> issue -> journal (as aggregation hierarchy, not the objects themselves)
  • time: month -> year

Additional dimensions may have to be defined to cater to special administrative or front-end use cases, such as

  • geography: Where did the usage events originate from?
  • source file: This enables us to implement a scalable file based, restartable load process. Details of such a process are not in scope of this document.
  • ...

Facts:

  • Facts would be represented as a single integer or float data type dimensioned as just outlined.
  • Facts should be modeled additively over all dimensions so that we can use them in reports that "slice" and "dice" the conceptual data cube. This excludes a "from-to" notation for the date.
  • Monthly data that is not available on a daily basis can be modeled additively in such a table by leaving the day-aggregation level empty.

This is pretty much a standard OLAP design and should therefore serve all aggregate data needs that may come up in reports or elsewhere.

While the conceptual level seems rather complicated, we can implement such a conceptual model with a single additional table in core OJS as all dimension tables either already exist in the database or could be implemented "virtually" by on-the-fly aggregation, e.g. for the date hierarchy.

We therefore propose a single new database table 'metrics' with the following columns:

  • assoc_id: foreign key to article, article or issue galley, supplementary file
  • assoc_type: determining the type of the assoc id (article, article or issue galley, supplementary file)
  • article_id, issue_id, journal_id: denormalized entries for the publication dimension for improved performance and simplified aggregation
  • day: the lowest aggregate level of the time dimension
  • month (optional): required only if we want to support month-only aggregation, either because some metrics cannot be provided on a daily basis or to compress historical data
  • metric_type: e.g. "oas::counter", "ojs::counter", "alm::facebook", etc.
  • load_id (optional): a plug-in specific load identifier used to implement a restartable and scalable load process, e.g. a file name, file id or run id
  • country_id (optional): the lowest aggregate level of the source geography dimension
  • metric: an integer column that represents the aggregate metric
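
For illustration, the corresponding DDL could look roughly like this; the column types are assumptions and the actual schema would be maintained through the usual OJS XML schema descriptors:

   // Illustrative MySQL DDL for the proposed 'metrics' table.
   $sql = "CREATE TABLE metrics (
       assoc_id     BIGINT NOT NULL,
       assoc_type   BIGINT NOT NULL,
       article_id   BIGINT,
       issue_id     BIGINT,
       journal_id   BIGINT,
       day          DATE,
       month        CHAR(7),      -- optional, e.g. '2012-10'
       metric_type  VARCHAR(255) NOT NULL,
       load_id      VARCHAR(255), -- optional
       country_id   CHAR(2),      -- optional
       metric       INT NOT NULL
   )";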

Indexes should be placed only when really needed by actual reports so that load performance can be optimized.

The number of data items in the proposed table will be considerably lower than what we currently need to store raw event data. I therefore do not believe that such a table will reach the underlying database's scalability limits. If it ever did, we could still purge old metric data or reduce the granularity of historical data (e.g. from per-day to per-month).

For a simple estimate we assume the following maximum dimension cardinality:

  • assocId/assocType: 3000
  • day: 365 * 10 = 3650
  • metricType: 5
  • loadId is not included in the calculation as we assume that there will be at most a single load ID per day which makes loadId isomorphic to the day dimension
  • country is not included as we do not intend to implement this dimension for the time being

We further assume completely dense data (which gives us an upper bound way above the probable data distribution).

Under these pessimistic assumptions we get a maximum of 3000 * 3650 * 5 = 54.75 * 10^6 rows, which is comfortably within the range supported by current MySQL versions and probably way above what even large OJS installations will ever encounter. Aggregating historical per-day data to per-month granularity would provide a further theoretical compression ratio of 12:365 (about 96%) for completely dense data. In practice compression ratios will be lower with non-dense data but probably still well above 50%.

Access to the metrics table will be mediated through a MetricsDAO. The MetricsDAO should provide methods to insert/update metrics and to administer scalable load processes. It should never be used for direct metrics access, except by statistics plugins that route data through their getReport() API. This allows us to support common front-end features for plug-ins like ALM even if those do not support access through the MetricsDAO.

Input and Output Formats (Aggregation, Filters, Metrics Data)

To support our use cases through a common API, our input format should support the following filter and aggregation requirements:

  • specify dimension hierarchy elements for columns ("dicing") to define the aggregation level
  • specify report-level filters ("slicing") through one of...
    • dimension element ranges (for ordered dimensions)
    • dimension element selection (for discrete dimensions)
  • specify the result order
  • specify result ranges (for use cases like paging, "top 10", etc.)

More specifically the input format will consist of...

Metrics selection ($metricType):

  • The metrics dimension is special for two reasons: It is not additive and it cannot be sensibly ordered.
  • This means that...
    • It must either be included on column level or a single metric must be chosen as a report-level filter.
    • It does not make sense to filter it by range.
  • Advantages of implementing metrics selection as a separate, mandatory input parameter are...
    • We automatically enforce consistency with the additional restrictions of the dimension.
    • The common scenario that a single metric value or a list of metrics should be retrieved for a publication object can be conveniently supported without a complex column/filter specification.
    • The default metrics API becomes very similar to the API for public IDs which makes it easier to understand and use in most cases.
  • It has to be kept in mind, though, that conceptually, metrics are just another dimension. Future extensions to the API should not repeat the same pattern for other dimensions unless there are similarly compelling reasons for it. Additional dimensions should usually be supported through the $filter/$columns/... variables as specified below!
  • The semantics of the $metricType variable are as follows:
    • Setting $metricType to null or not setting it at all will select the main/primary metric and the metric column will not be included in the report. If no further columns are specified then a single scalar value will be returned.

    • Setting $metricType to a scalar value will select that metric as a report-level filter without selecting it as a column. If no further columns are specified then a single scalar value will be returned.
    • Setting $metricType to an unordered array of several metric dimension elements will select those metrics as report-level filters and at the same time include the metrics column in the report.
    • Setting $metricType to "*" will select all metrics available in the given context (i.e. publication object, plug-in etc.) and at the same time include the metrics column in the report.

An optional dimension hierarchy specification ($columns):

  • The $columns variable is an unordered (non-hashed) array.
  • The array contains a list of identifiers (PHP constants) for the lowest hierarchy aggregation level to be included in the report.
  • The available constants are STATISTICS_DIMENSION_{JOURNAL_ID,ISSUE_ID,ARTICLE_ID,ASSOC_TYPE,ASSOC_ID,DAY,MONTH,COUNTRY,METRIC_TYPE}.
  • NB: The STATISTICS_DIMENSION_{JOURNAL_ID,ISSUE_ID,ARTICLE_ID} constants are aggregation-level columns! Use them to include the article/issue/journal IDs in addition to the publication object's own ID. When an ID on a given level is not available (e.g. an ARTICLE_ID for issue galley statistics) then the value of that column will be null.
  • Precisely zero or one value can be defined for each aggregation hierarchy. There may be several values per dimension if, and only if, there are several aggregation hierarchies for that dimension (e.g. authors and assocType for the publication object dimension). The dimension hierarchy specification will be checked for consistency with respect to this requirement before executing the report.
  • A single scalar value can be given in case a single column should be specified.
  • Dimension hierarchies not included in this specification will be aggregated over (implicit selection of the top hierarchy level).
  • The input variable could probably be named $aggregation or something with $dimension... in it for better conceptual consistency. I chose to name it $columns, though, so that those who do not have a firm grasp of OLAP concepts immediately understand what it actually means.
  • If no value is given (empty array) then the lowest available granularity (all dimensions) should be used.

An optional report-level filter specification ($filter):

  • The $filter variable is a hashed array.
  • It contains identifiers (PHP constants) for the filtered hierarchy aggregation level as keys. The constants are the same as for the dimension hierarchy specification above, see there.
  • As values it either contains a hashed array with from/to entries with two specific hierarchy element IDs in case of a range filter or an unordered array of one or more hierarchy element IDs in case of an element selection filter.
  • If only a single element is to be filtered then the value can be given as a single ID rather than an array.
  • If no filter is given for a dimension then it is assumed that data should be aggregated over all dimension elements (implicit selection of the top hierarchy element).
  • If no filter is given at all then all dimensions will be aggregated over.

An optional result order specification ($orderBy):

  • The $orderBy variable is a hashed array.
  • It contains identifiers (PHP constants) for the hierarchy aggregation levels to be ordered by as keys. The constants are the same as for the dimension hierarchy specification above. If you want to order by the metric value then you can use the constant STATISTICS_METRIC.
  • For each identifier you can specify one of the two values "asc" (STATISTICS_ORDER_ASC) or "desc" (STATISTICS_ORDER_DESC).
  • It can only include hierarchy aggregation levels that have also been specified in $columns. The order specification will be checked for consistency with respect to this requirement before executing the report.

An optional result range specification ($range): This is the usual PKP paging object DBResultRange. If no range is given then a maximum of STATISTICS_MAX_ROWS rows will be returned.
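
To illustrate the input format, a "top 10 articles of October 2012" request against a hypothetical "oas::counter" metric could look like this (the month filter format is an assumption):

   // Illustrative getMetrics() call: the ten articles with the highest
   // 'oas::counter' metric in October 2012.
   import('lib.pkp.classes.db.DBResultRange');
   $result = $reportPlugin->getMetrics(
       'oas::counter',                            // report-level metric filter
       array(STATISTICS_DIMENSION_ARTICLE_ID),    // $columns: aggregate per article
       array(STATISTICS_DIMENSION_MONTH =>        // $filter: a date range
           array('from' => '2012-10', 'to' => '2012-10')),
       array(STATISTICS_METRIC => STATISTICS_ORDER_DESC), // $orderBy: by metric value
       new DBResultRange(10, 1)                   // $range: first page, 10 rows
   );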

With this input design in mind we can make usage recommendations for access APIs specified in previous paragraphs (access through report plugins, publication objects and the MetricsDAO):

  • Cross-metrics (plugin-agnostic) reports (e.g. on site, journal or article level) can best be supported by direct access to the MetricsDAO. Otherwise we'd have to implement ordering and paging in PHP code which would be unnecessarily complex and slow. We therefore propose that plug-ins that do not wish to store their data in the metrics table will hook into calls to the MetricsDAO and insert their own results into the intermediate result from the MetricsDAO before returning it to the client.
  • Access to specific metrics or metrics that can be served from a single plug-in should always be done through the plug-in API so that plug-ins that do not want to hook into the MetricsDAO will be supported out of the box for these use cases.

Output format:

  • We return report data in flat tabular format through the usual db results iterator (DAOResultFactory) from the various getReport() versions.
  • We include all hierarchy levels above the specified dimension hierarchy levels as columns by default. This adds no relevant additional cost on the database side and considerably simplifies our input format.
  • Dimension hierarchy columns will appear in a pre-defined order. Clients will have to re-order the columns if needed.

Further implementation details:

  • To keep our API as simple as possible, we do not provide a full cross-table implementation but rather place all dimensions into columns followed by a single aggregate fact column. If it turns out that more advanced cross-table requirements exist somewhere in the front-end then flat tabular data will have to be "pivoted" on the fly. This can be done through a standard transformation function somewhere in the OJS core. As not all supported database versions support pivoting, we have to do this in PHP anyway.
  • Currently all our dimensions are discrete and can be ordered somehow (except for the metric type, see above). So we support dimension element ranges and selection without further restrictions. Some of the possible filter settings may not make a lot of sense, though (e.g. summing up metrics for countries from 'D' to 'F' or selecting completely disparate dates). I don't think it's necessary to implement further restrictions, though. Users get what they ask for. ;-)
  • When calling the API from the context of a publication object then the publication object may be included neither as a column nor as a filter.
  • When calling the API from the context of a plug-in then only metrics available from that plug-in may be selected.
  • We do not support filtering based on aggregate metric data (e.g. "all articles with accesses > 100 in the last year"). That said: most requirements like this can be simulated by combining ranges, ordering and dimension selection anyway.
  • Support for this API can easily be implemented within the MetricsDAO due to the design of the proposed metrics table:
    • Dimension selection corresponds precisely to a SELECT statement on that table.
    • Dimension filtering corresponds precisely to a WHERE clause on the same table or on one of the (potentially virtual) dimension hierarchy columns.
    • Ordering can easily be implemented with an ORDER BY clause.
    • Result ranges can be defined through the standard paging mechanism.
  • Specific reporting use cases can be tuned by introducing further aggregates (for better "dicing" support) or indexes (for "slices" that select at most a one-digit percentage of the cube). We do not implement such support unless it turns out in practice that certain use cases could not be implemented otherwise.

Value-Added Features (II / III / PKP)

Most of the following features are publicly available to all OJS users (readers). Plugin activation and configuration is always a journal manager's task. We'll only indicate role permissions when deviating from this default.

Selecting a "Main Metric" (II)

Based on the metrics API it will be possible to choose a "main metric" in the site and journal settings. Site-level configuration will be available to administrators and journal-level configuration to journal managers. Users with these roles will be presented with a list of all currently available metrics and may choose the one that should be used for features that need a single metric to be defined (e.g. the most-viewed articles feature).

The list of metrics in the site settings will contain only site-level plugins while the list in the journal settings will contain site-level plugins plus statistics plugins active in the journal context. The site-level default metric defaults to the first available metrics plug-in. If no metric has been chosen for a journal then the site-level main metric will be used for that journal.

According to Alec (Source: Email, 26.02.2013), the journal-level setting will be temporarily placed on the manager's "Stats & Reports" page of the journal setup and will later have to be migrated into the new tabbed OJS setup pages.

All features specified in this document can be implemented via batch operations through a single call (in one case three calls) to the statistics API. In principle such batch operations could be provided via external data sources. But for performance reasons, some of the value-added features may require the "main metric" to be cached locally (i.e. in the metrics table). In this case we propose that users receive a specific warning when selecting such a metric as "main metric", indicating which of the features will not be available. As we don't know right now whether this is necessary (in principle it is not...), such a warning message will not be implemented right now and should be added as needed.

Most-Viewed Articles (II)

The most-viewed articles plug-in will show a list of the 10 articles ranking highest for the selected "main metric" throughout a journal. To identify an article's rank we'll sum up the metrics of its galleys, excluding the article's abstract (Source: Email Bozana, 06.03.2013). The articles will be presented as title links in a block plug-in.

Journal readers will be able to choose from three time settings: previous month, previous year and "all times". This defines the time span from which statistics will be read. By default "previous month" will be selected.

The most-viewed article block will not be available on site level as the sidebar cannot be easily configured at site level (Source: Email Bozana, 05.03.2013).

This feature uses three separate batch requests to the metrics API. We currently assume that it can therefore be provided by all metric providers, even those that do not use the metrics table. This assumption may turn out to be wrong, though. If the feature cannot be used without locally cached data, it must be adapted to check metrics availability first.

Search Result Ranking (III)

The "main metric" mentioned above can be used as a ranking factor. The "classic" SQL-based OJS search feature does not support ranking. The new Lucene plug-in can be configured to use external data to rank search results, though.

In principle there are two different ways to make Lucene aware of statistics data for ranking purposes:

  1. The metric data could be submitted to Lucene at indexing time.
  2. Alternatively metric data can be provided as an "external file field" without re-indexing articles.

As statistics data can change frequently we'd like to avoid having to re-index articles whenever their usage data changes. Having to do so would mean a considerable performance impact. We therefore recommend the second solution.

The recommended solution requires regular generation of a metrics report for all articles. We propose generating such a report on a daily basis via cron job. The cron job would have to execute the following steps:

  1. Generate a customized metrics report for all articles.
  2. Save the metrics report as an external solr index file (or copy and update the existing file if it already exists).
  3. Trigger a "commit" operation to the index so that the new file will be recognized and used.
  4. Delete the previous file.

In the embedded configuration we'll need a separate cron job to create the file. As this cron job runs locally, we can use the existing rebuildSearchIndex.php script for it. In the central server configuration we can extend the existing "pull" cron job for the same task.

To support the central indexing server use case, the cron job will have to (partially) update rather than overwrite the existing file. We cannot update the file "in place" as it will be locked on Windows machines. We'll rather copy and update the file and delete the previously active file once the commit operation has been issued to Solr which should unlock the previously used file. The file extension will have to be a running number as Solr will always use the last file in alphabetical order.

We recommend implementing this feature as an additional, optional search feature of the Lucene plug-in. This means that another search feature option will be added to the Lucene plug-in's configuration page. Ticking the corresponding configuration check box would activate the feature. If we implement the feature like this then it is probably not necessary to change OJS core. As soon as the feature has been enabled, the Lucene plugin will amend all search queries with an additional boost query that refers to the external field.

The customized metrics report can be easily generated via the Statistics API and provided over HTTP so that it becomes available to remote indexing servers. It contains a list of all unique index object IDs (instId + '-' + articleId) and the corresponding metric boost value, normalized to values between 1.0 (no usage = no boost) and 2.0 (highest usage). The metrics report can be implemented as an operation of the LuceneHandler class and will only be available when the corresponding search feature has been enabled in the plugin.
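
For illustration, Solr's external file field format consists of one "key=value" line per document, so the generated file could look like this (the instId and the values are made up):

   test-inst-23=1.00
   test-inst-57=1.73
   test-inst-102=2.00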

It has to be decided whether we'd like to use all-time metric data or whether we'd rather restrict usage data to those accrued over the last month or last year. In principle, this decision could be delegated to the journal manager by providing an additional configuration option in the plugin configuration.

An additional option would complicate the user interface, though, and it might not be obvious to the journal manager what this option actually means. We therefore recommend using a reasonable default and adding such an option only if it turns out in practice that there is a real need for it. We recommend using "all-time" metric data for ranking as monthly or yearly data may be systematically distorted in favor of the current edition of a journal, which does not seem reasonable.

We currently assume that this feature can be provided through a single batch request to the metrics API. This means that it should be available for all metrics, even those that do not cache metric data locally in the metrics table.

Search Result Ordering (III)

Lucene search supports sorting search results by several sort criteria. We'd like to provide the same functionality for the SQL-based search implementation and include the main metric as one of the sort criteria.

Porting the "sort search results" feature:

As the "classic" OJS search feature is implemented in OJS core this means that we'll have to port code from the Lucene plugin to the ArticleSearch class in OJS core. Secondly we'll have to extend the ArticleSearchDAO to support sorting. Sorting must be done before limiting/offsetting results so it cannot be done in ArticleSearch but must be done in SQL. The SQL in ArticleSearchDAO::getPhraseResults() already joins published article objects and orders results so that we probably won't see any performance deterioration due to sorting by different criteria.

Sorting by metrics data:

Sorting by metrics data can either be implemented by joining the metrics table directly or by preparing a separate metrics report through the metrics API. The first option would be preferable from a performance point of view. Unfortunately, some statistics plug-ins will probably not use the metrics table to store aggregate metrics data. To support such plug-ins we'll instead retrieve metrics data in a batch request through the metrics API. This can still be a performance problem as we have to provide the IDs of all articles for which we require metrics. If the remote statistics protocol does not support such a request it would have to retrieve metrics one by one. In that case this use case could only be made available for metrics cached in the local database (i.e. in the metrics table).

In principle, sorting could be offered for all locally cached metrics and time windows ("all-time", "last month", "last year"). This would lead to an explosion of sort criteria, though, which would compromise the usability of the search interface: if we had three eligible metrics we would get nine (3 times 3) extra sort criteria, and so on. We therefore recommend supporting only the "main metric" (if it is cached locally) and letting the user choose between two time windows ("all-time" and "current month"). A third time window ("last year") could be added easily if actually required by end users.

All changes required to provide search result ordering by metrics data must be implemented in OJS core as the feature should be available for the "classic" SQL-based search, which is implemented in core.

Most-Read Articles of the same Author (III)

The article (abstract) page should include optionally configurable lists of articles relevant to the currently viewed article. One of these should be a list of all articles by the same author, ordered by the "main metric" in descending order. The ordering should be based on "all-time" metric data; see the reasoning in the search ranking section above.

This feature can be implemented as a generic plug-in via a template hook on the article page. Internally the plugin would be called with the article ID, could then retrieve the corresponding author information and all articles published by the same author.

The author is not a dimension of its own in the metrics table: an article may be written by more than one author, so the author dimension would not be additive. This means that we cannot query by author directly through the metrics DAO. As there will usually be a relatively low number of articles per author, we can nevertheless make a single call to the metrics API with an explicit filter on the author's article IDs, which we fetch in a separate database query.

As we only need a single batch request to the statistics API we assume that such a request can be fulfilled by all implementors of the API including those not storing metrics data in the local metrics table.

"Similar" Articles (III)

The article abstract page should (optionally) recommend a list of "similar" articles. We propose implementing this feature as a separate generic plugin hooking into the article page template.

To support this use case we recommend implementing the "similar articles" search as part of the core OJS search API. Currently such a feature is only available in the Lucene plug-in. Implementing it in core allows us to write the "similar articles" plugin without making it directly aware of specific search implementations. The plugin would call the search API with an article ID and retrieve similar articles based on the given ID.

To implement similarity search for the current SQL based search implementation we recommend finding similar articles based on subject keywords. We could construct an OR-joined search request with the subject keywords of the article against the subject keyword field of all articles of the same journal (or site-wide if the search query is not restricted to a journal). If the article does not have any subject keywords assigned then an empty result set will be returned.

We recommend ordering the result set by the default ranking score of the search implementation and displaying only the first 10 results directly on the article page. In the case of a Lucene search this means that results can be ranked (among other factors) by the "main metric" as described above. In the case of the SQL-based search, results could still be pre-ordered by the "main metric". This could however place less relevant articles to the top of the list just because they have been used more. This doesn't seem appropriate in this specific case where "similarity" should be the main ranking factor. We therefore propose adding a "see more" link to the bottom of the list which will take the user to the default search interface where more than 10 results can be displayed and (among other search features) sorting by metric will be available.

Integration with the Tag Cloud Plugin (III)

The existing tag cloud plug-in already uses the search API to retrieve articles matching a given keyword. It will therefore automatically rank search results by the "main metric" if the Lucene search is activated and ranking by metric has been enabled.

In the case of the SQL-based search, pre-ordering of results by the "main metric" could be implemented without extra implementation cost via the search API but is not recommended for the same reasons laid out for the "similar articles" feature. Users who wish to re-order results by metric can do so immediately in the search UI after clicking on one of the keywords.

Statistics Reports (PKP/Bruno)

NB: As agreed with Juan and Bruno, this functionality will be implemented by PKP.

Proposal for implementation: On the basis of the PHP input format specification of the front-end API (see above) we can easily define an HTTP GET version of the input protocol. Translating such a protocol one-to-one to a call on the internal API is trivial.
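
A purely hypothetical example of such a GET request, mirroring the getMetrics() parameters one-to-one (the handler name and parameter encoding are assumptions):

   http://.../index.php/myjournal/reports/report?metricType=oas::counter
       &columns[]=articleId&filter[month][from]=2012-10&filter[month][to]=2012-10
       &orderBy[metric]=desc&range=10&format=csv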

This will allow us to easily define reports which can then be presented in various formats (XML, CSV, HTML, PDF) without having to implement or even think about a heavyweight report generator front-end. Such a front-end can be implemented at any time if we like, but I'd rather not try to re-invent Excel pivot tables in OJS.

With an HTTP GET protocol, exporting live OLAP data to Excel or other OLAP tools will be extremely easy. Data can thus be flexibly analyzed outside OJS, while common predefined reports with dynamic choices (e.g. time) can be implemented in OJS with a lightweight front-end or on any page with a simple download/page link.

This gives advanced OJS users nice opportunities for very easy but still powerful definition of custom reports in any supported output format, anywhere in OJS or even for inclusion on external web pages. It will also make it trivial to replace existing statistics reports with links to the generic report generator. Loads of duplicate code can thereby be deleted from the code base and cross-metric reports will be available for the first time.

Editors, Authors and Readers Pages (PKP/Juan)

NB: As agreed with Juan, this functionality will be implemented by PKP based on the statistics API.