OJSdeStatisticsConcept

Revision as of 17:23, 25 February 2013

Overview

The OA-S Project

The German Open Access Statistics (OA-S) project intends to provide a framework and infrastructure that enables OJS users to establish alternative usage statistics for Open Access content and build value added services on top of such statistics.

This specification describes the use cases, requirements and implementation recommendations for the integration of OA-S into OJS.

The basic idea of OA-S can be described as a cooperation between two institutions:

  1. The data provider (in this case operators of OJS installations) tracks access to documents (in our case article and issue galleys, abstracts and supplementary files).
  2. Access data will then be made available through a protected OAI interface and harvested by the OA-S service provider.
  3. The service provider will clean the raw data and produce aggregate metrics based on the COUNTER standard. At a later stage LogEc and IFABC statistics may also be supported.
  4. The data provider retrieves metrics from the service provider on a daily basis and stores them in OJS.

These metrics can then be used in different ways in OJS:

  • They can be displayed to the editors, authors and readers.
  • They can be used in search for ranking.
  • Editors could produce statistics reports.
  • We could provide a "most viewed article" feature.
  • We could implement a feature that displays "other (more viewed) articles of the same author".
  • We could display "similar (more viewed) articles" on an article's page.

We'll use the terms "data provider" and "service provider" from here on without further explanation. "Data provider" and "OJS user" can be used synonymously. We use the term "end user" to refer to users accessing an OJS site.

Requirements for OJS OA-S Integration

The requirements for OJS OA-S integration can be divided into four areas:

  1. log data extraction and storage (OJS)
  2. log transfer (OJS -> OA-S)
  3. metrics retrieval (OA-S -> OJS)
  4. value added services (OJS)

The requirements in these areas are...

Log data extraction and storage:

  • We log usage events for access to issue and article galleys, article abstracts and supplementary files.
  • Logged data must be pseudonymized immediately.
  • Logged data must be deleted immediately after it has been successfully transferred to the service provider or after it expires.

Log transfer:

  • Log data must be transformed into context objects.
  • We then have to provide an HTTP BASIC authenticated OAI interface from which the service provider will harvest raw log data.

Metrics retrieval:

  • We retrieve final metrics data from the service provider via a JSON web service.
  • Metrics data will then be stored in an OLAP data model which gives us easy access to both granular and aggregate data.

Value added services:

  • Display granular (per-object) metrics to the readers, authors and editors.
  • Use metrics as a search ranking criterion together with the Lucene plugin.
  • Define reports on OA-S metrics similarly to already existing OJS reports.
  • Implement a "most viewed articles" feature.
  • Implement an "another (more viewed) articles of the same author" feature.
  • Implement a feature "similar (more viewed) articles".

The following sections provide analysis and implementation recommendations for these requirement areas. Main section titles contain Roman numerals in parentheses that indicate the project phase in which that part will be specified and implemented (I, II or III). Sections with "PKP" in their title describe features that will be implemented by PKP.

Data Extraction and Storage (I)

This section defines how we intend to log, pseudonymize and store usage events.

Log Events

We consider access to the following URLs in OJS as usage events:

article abstracts:

  • ../article/view(Article)/<article-id>
  • Access to ../article/view(Article)/<article-id>/<galley-id> or any other article page will NOT be counted.

article galleys:

  • .../article/viewFile/<article-id>/<article-galley-id>
  • .../article/download/<article-id>/<article-galley-id>
  • Access to .../article/view(Article)/... and .../article/viewDownloadInterstitial/... or any other article or article galley page will NOT be counted unless the galley is a remote or HTML galley.
  • NB: This differs from the usage event definition used for the current COUNTER plug-in!

supplementary files:

  • .../article/downloadSuppFile/<article-id>/<supp-file-id>
  • Access to .../rt/suppFileMetadata/... and .../rt/suppFiles/... or any other supp file page will NOT be counted.

issue galleys:

  • .../issue/viewFile/<issue-id>/<issue-galley-id>
  • .../issue/download/<issue-id>/<issue-galley-id>
  • Access to .../issue/viewIssue/... and .../issue/viewDownloadInterstitial/... or any other issue page will NOT be counted to avoid double counting.


OA-S expects us to deliver certain usage event data. OA-S provides sample code for DSpace log extraction. The same code is also provided via SVN. We analyzed the specification as well as the sample code to derive the following list of all required or optional data items. The corresponding proposed OJS/PHP data source is given in parentheses:

  • usage event timestamp (PHP's time() function +/- local time offset)
  • administration section
    • HTTP status code (will be always 200 in our case as we won't produce view events for non-200 responses)
    • downloaded document size (difficult from PHP; we may use connection_aborted() to identify some incomplete downloads, but this won't be reliable, as PHP may end before the download actually finishes when the web server buffers (part of) the response. Stackoverflow agrees with me here.)
    • actual document size (PKPFile::getFileSize() for galleys and supplementary files, I propose 0 in the case of article abstracts, NB: not implemented in the OA-S sample code!)
    • document format (MIME type, PKPFile::getFileType())
    • URI of the service (e.g. OJS journal URL, Config::getVar('general', 'base_url'))
  • referent section
    • document URL (This will be the canonical "best URL" produced by PKPRouter::url() + ...::getBest...Id())
    • optional: one or more unique document IDs (e.g. DOI, URN, etc.) (can be easily retrieved from the pubId-plugins for objects where IDs are defined)
  • referring entity section
    • HTTP referrer (if available, $_SERVER['HTTP_REFERER'])
    • optional: additional identifiers of the referring entity (e.g. DOI, ...) if available (not implemented in OJS)
  • requester section
    • hashed + salted IP (PKPRequest::getRemoteAddr() + hashing)
    • hashed C class (PKPRequest::getRemoteAddr() + truncation + hashing)
    • hostname of the requesting entity (if available), truncated to the second level domain (recommendation: do not implement in OJS as this would require one DNS request per view event, which would be very expensive; exception: use the hostname if it is present in $_SERVER "for free", using the algorithm from the sample code in logfile-parser/lib/oasparser.php, get_first_level_domain())
    • optional: classification, see the sample code, logfile-parser/lib/oasparser-webserver-dspace.php for an example (seems to be a temporary implementation, too)
      • internal: Usage events due to internal requirements, e.g. automated integrity checks, availability checks, etc.
      • administrative: Usage events that happen due to administrative decisions, e.g. for quality assurance. (proposal: use this category for accesses by logged in editors, section editors, authors, etc.)
      • institutional: Usage events triggered from within the institution running the service for which usage events are being collected.
    • optional: hashed session ID or session (recommendation: do not send)
    • HTTP user agent (if available, use PKPRequest::getUserAgent())
  • service type section (omitted, only relevant for link resolvers)


According to OA-S information, these data items will be used for the following purposes:

  • IP address and timestamp are used to recognize double downloads as defined by the COUNTER standard. Such "double clicks" will be counted as a single usage event.
  • The C class of the IP address will furthermore be used to recognize robots and exclude their usage from the statistics.
  • The file information (URL, name, document ID, URL parameters, etc.) is used to uniquely identify the document which has been accessed.
  • The HTTP status code will be used, as only successful accesses may be counted.
  • The size of the document is used to identify full downloads (e.g. 95% of the file downloaded). Partial or aborted downloads will not be counted as a usage event.
  • The HTTP user agent will be used to identify robots and to remove their usage from the statistics.
  • The referrer information is used to analyze how users found the service and can be used to improve the service (potential sources: search engines, organizational web portal).


To capture the required information, I recommend implementing a specialized view event hook that all statistics plug-ins can subscribe to. This allows us to better standardize OJS view events (e.g. to simulate Apache log events as much as possible) and keeps code overhead to a minimum. If an additional OJS hook should be avoided we can hook into the existing TemplateManager::display and FileManager::downloadFile hooks and filter these generic events to identify view events we are interested in.

If we implement a statistics event hook then I recommend that it provide data items similar to the variables available in Apache's log component. This enables us to easily switch the hook's implementation later, e.g. to one based on Apache logs or on shared storage, as has been proposed.
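
As an illustration, here is a sketch of how a generic plug-in might subscribe to such a hook; the hook name and event payload below are assumptions, not a final API:

    <?php
    // Sketch only: the hook name and payload are assumptions, not a final API.
    class OasPlugin extends GenericPlugin {
        function register($category, $path) {
            $success = parent::register($category, $path);
            if ($success && $this->getEnabled()) {
                // Subscribe to a (hypothetical) central usage event hook.
                HookRegistry::register('UsageEvent::logRequest', array($this, 'logUsageEvent'));
            }
            return $success;
        }

        function logUsageEvent($hookName, $args) {
            // The payload is assumed to mimic Apache log variables:
            // timestamp, IP, user agent, referrer, requested object, status, size.
            list($usageEvent) = $args;
            // ... pseudonymize and store the event (see the following sections) ...
            return false; // Returning false lets other statistics plug-ins see the event too.
        }
    }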

Privacy Protection

We assume that many OJS providers using the OA-S extension will be liable to German privacy laws. While OJS users will have to evaluate their legal situation on a per-case basis and we cannot guarantee in any way that OJS conforms to all legal requirements in individual cases, we provide basic technical infrastructure that may make it easier for OJS users to comply with German privacy law.

Legal Requirements

The OA-S project commissioned two legal case studies with respect to German privacy law: one describes the legal context of OA-S application at the University of Stuttgart, the other focuses more generally on OA-S users, especially project members. The first report was prepared during an earlier phase of the OA-S project when privacy-enhancing measures, like the use of a SALT for IP hashing, were not yet specified. The second report is more recent. It assumes an enhanced privacy infrastructure, i.e. the use of a SALT to pseudonymize IP addresses. We therefore base our implementation recommendations on the results of the second report.

The report recommends that data providers liable to German privacy law implement the following infrastructure:

  • All personal data must be pseudonymized immediately (within a few minutes) after being stored. This can be achieved by hashing IP addresses with a secret salt. The salt must have a length of at least 128 bits and must be cryptographically secure. The salt must be renewed about once a month and may not be known to the OA-S service provider. The salt will be distributed through a central agent to all data providers. A single salt can be used for all data providers if they do not share pseudonymized data. Pseudonymized data must be immediately transferred to the service provider and thereafter deleted by the data provider, i.e. every five minutes.
  • Data providers have to provide the means for end users to deny data collection ("opt-out"). The cited report comes to the conclusion that an active "opt-in" of end users is not necessary if data will be reliably pseudonymized. It recommends an "opt-out" button which, if clicked, could result in a temporary cookie being set in the end user's browser. Whenever such a cookie is present, usage data may not be collected. The report recommends against setting a permanent cookie as this may now or in the future require active "opt-in" on the part of the end user. Alternatively the user's IP address could be blacklisted while using the service, i.e. entered into a table and all data from that IP would then not be stored. The blacklist entry would have to be deleted after the user session expires.
  • Data providers have to inform end users about their right to opt out of data collection before they start using the service. They also have to inform the end user that opting out of the service will result in a temporary cookie being set in their browsers. This information must be available not only once when the user starts using the service but permanently, e.g. through a link.
  • Data providers will have to implement further organizational measures (registration of data processing organizations, reporting data usage to end users on demand)

Salt Management Interface

As pointed out in the previous section, we'll have to salt our pseudonymization hash function. Within the OA-S project, University Library Saarbrücken (SULB) provides a central SALT distribution mechanism as described in the OA-S technical manual for new repositories. SALTs will be provided on a monthly basis and have to be downloaded to OJS. SULB provides a Linux shell script (alternative unprotected link) to download SALTs. We recommend instead downloading the SALT from within OJS directly, to avoid the additional complexity of calling a shell script and to better support Windows users. The SALT can be downloaded from an HTTP BASIC protected location.

A new salt is usually being provided at the beginning of each month. We recommend the following algorithm for salt management to be implemented in OJS:

   Whenever we receive a log event:
       IF the download timestamp of the current SALT is within the current month THEN
            use the current SALT to pseudonymize log data
       ELSE
           IF the "last download time" lies within the last fifteen minutes THEN
               use the current SALT
           ELSE
               authenticate to "oas.sulb.uni-saarland.de"
               download "salt_value.txt"
               set the "last download time" to the current time() value
               IF the downloaded SALT is different from the current SALT THEN
                   replace the current SALT with the downloaded SALT
                   set the timestamp of the SALT to the current time() value
                    use the new SALT to pseudonymize log data
               ELSE
                   use the current SALT

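A minimal PHP sketch of this algorithm follows; the host and file name come from the pseudocode above, while the setting names and credential handling are illustrative only:

    <?php
    // Minimal sketch of the salt management algorithm above.
    // Setting names and credential handling are illustrative only.
    function getCurrentSalt($plugin, $journalId) {
        $now = time();
        $saltTime = (int)$plugin->getSetting($journalId, 'saltDownloadTime');
        if (date('Ym', $saltTime) == date('Ym', $now)) {
            return $plugin->getSetting($journalId, 'salt'); // Still valid this month.
        }
        $lastTry = (int)$plugin->getSetting($journalId, 'lastDownloadTime');
        if ($now - $lastTry < 15 * 60) {
            return $plugin->getSetting($journalId, 'salt'); // Retried recently; back off.
        }
        $plugin->updateSetting($journalId, 'lastDownloadTime', $now);
        // Authenticate to oas.sulb.uni-saarland.de and download salt_value.txt.
        $ch = curl_init('https://oas.sulb.uni-saarland.de/salt_value.txt');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERPWD, $plugin->getSetting($journalId, 'saltCredentials'));
        $newSalt = trim((string)curl_exec($ch));
        curl_close($ch);
        if ($newSalt !== '' && $newSalt !== $plugin->getSetting($journalId, 'salt')) {
            $plugin->updateSetting($journalId, 'salt', $newSalt);
            $plugin->updateSetting($journalId, 'saltDownloadTime', $now);
        }
        return $plugin->getSetting($journalId, 'salt');
    }
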
Pseudonymization

The OA-S sample application provides an algorithm for IP pseudonymization based on the SALT value retrieved from SULB.

We recommend using this exact PHP function to pseudonymize IPs in OJS.
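
Conceptually the function is a salted hash along the following lines; the sketch assumes SHA-256, but the exact construction should be copied from the sample code:

    <?php
    // Conceptual sketch only; the exact hash construction should be copied
    // from the OA-S sample code (the SHA-256 choice here is an assumption).
    function pseudonymizeIp($ip, $salt) {
        return hash('sha256', $salt . $ip);
    }

    $hashedIp = pseudonymizeIp($_SERVER['REMOTE_ADDR'], $salt);
    // Truncate to the C class (e.g. "192.168.1") before hashing it as well:
    $cClass = preg_replace('/\.\d+$/', '', $_SERVER['REMOTE_ADDR']);
    $hashedCClass = pseudonymizeIp($cClass, $salt);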

Opt-Out and Privacy Information

We propose to implement a small block plug-in that allows for opt-out and privacy information display. The block plug-in will provide a single "privacy" link in the sidebar.

The block plug-in will only appear on pages that may trigger a usage event, i.e. the article abstract page, the issue galley page, the article galley pages and the supplementary file page.

Clicking on the privacy link will open up a plug-in-specific page that contains the privacy information as well as an opt-out button. Clicking on the opt-out button will set a temporary cookie with a validity of one year.

If the opt-out cookie is present in the request then no OA-S statistics events will be logged at all. The cookie will be renewed whenever the user accesses OJS.
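
A short sketch of this check (the cookie name is an assumption):

    <?php
    // Sketch of the opt-out check; the cookie name is an assumption.
    $optOutCookie = 'oas_optout';
    if (isset($_COOKIE[$optOutCookie])) {
        // Opted out: renew the cookie for another year and skip logging.
        setcookie($optOutCookie, '1', time() + 365 * 24 * 60 * 60, '/');
        return;
    }
    // ... otherwise log the OA-S usage event ...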

Data Storage

Statistics events have to be temporarily stored between the time they arrive and the time they're harvested by the service provider. For intermediate data storage I recommend a plug-in specific internal database table that contains the fields mentioned in #Log Events above. Personal data (IP) must be stored in its pseudonymized form (see #Privacy Protection).

Due to privacy restrictions and to avoid scalability problems, we should delete log data as soon as it has been successfully transferred to the service provider. Unfortunately, OA-S has not yet specified a protocol that allows us to determine when data has been received by the service provider's server (Source: Email Julika 30.10.2012). OA-S uses the OAI protocol to harvest statistics data. This protocol does not support success/failure messages by itself. Harvesters usually retry access when there was a communications failure. Although access to the OAI interface is authenticated, this means that we cannot delete data whenever we receive an OAI request. Instead we have to define a fixed maximum time that log data may be kept in the log and then delete it, independently of its transfer status.

We therefore recommend saving log data indexed by its timestamp. Whenever we manipulate the log table (i.e. when receiving a log event) we'll automatically delete all expired log data.

As the log expiry time will be relatively low, we do not see the necessity to rotate the log table or otherwise improve scalability.

Log Transfer (II)

Transformation into Context Objects

The chosen log format for statistics events has been kept as close as possible to the required context object specification. We'll use PHP's XML DOM to build context objects from records. This will be implemented as a simple filter class that takes a log record array as input and returns the XML DOM object as output.
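
A rough sketch of such a filter class follows; the element names are placeholders for the structure defined in the OA-S context object specification referenced below:

    <?php
    // Sketch of the proposed filter class. The element names below are
    // placeholders; the real structure must follow the OA-S context object
    // specification referenced in the todo list below.
    class LogRecordToContextObjectFilter {
        function &execute(&$logRecord) {
            $doc = new DOMDocument('1.0', 'UTF-8');
            $contextObject = $doc->createElement('context-object');
            $contextObject->setAttribute('timestamp', date('c', $logRecord['time']));
            $referent = $doc->createElement('referent');
            $referent->appendChild($doc->createElement('identifier', $logRecord['url']));
            $contextObject->appendChild($referent);
            // ... referring-entity, requester and administration sections ...
            $doc->appendChild($contextObject);
            return $doc;
        }
    }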

OAI interface

A few properties distinguish the OAI interface required by OA-S from the default OJS OAI implementation:

  • We have to protect the interface with BASIC HTTP authentication.
  • We do not export meta-data about publication objects (articles, etc.) but about usage events. This also implies that data cannot be retrieved via the usual OJS database tables but must be read from the event log.

The existing data provider for OA-S context objects serves as a practical example and guideline for our work. While exact implementation details must be decided at implementation time, it currently looks as if we could fully re-use the OAI base classes provided by PKP core.

We cannot re-use the JournalOAI and OAIDAO classes as they assume publication objects to be published. We also cannot reuse the existing OAIHandler as we have to provide an authenticated interface.

We therefore have to implement the following classes and methods to support the required OAI interface:

  • An OA-S specific handler method which I propose to integrate in the existing plug-in handler class OasHandler.
  • A subclass of the OAI class that provides the connection between the event log DAO and the OAI interface.
  • I don't think that a plugin specific OAI DAO will be necessary. All data access can be done through the event log DAO.

Further information and todos:

  • The exact OAI context object format has been sufficiently specified by the OA-S project and does not have to be documented here.
  • OA-S provides a validator service for the data provider OAI interface. This validator is currently offline.
  • A protocol to confirm successful reception of raw data has not yet been implemented (see email Julika, 31.10.2012)

Authentication

The OA-S specification demands protecting the OA-S OAI interface with HTTP BASIC authentication. Unfortunately a more secure authentication protocol is not supported by OA-S.

To make our implementation as configuration-less as possible we recommend implementing HTTP BASIC authentication in PHP rather than relying on a web server implementation (like Apache's). See http://php.net/manual/en/features.http-auth.php for details on how this can be done in PHP.

To simplify configuration we recommend using a standard username "oas" that will be the same for all OJS installations.

We recommend adding a single setting to the plugin for the password. This must be set by the end user. We preset the password with a random string so that the OAI interface will not be exposed inadvertently when first activating the OA-S plug-in.

Whenever a request to the OAI interface comes in we'll check the authentication credentials coming with the request. If they are missing we'll challenge the client with an HTTP BASIC authentication answer.
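
A sketch of this check, using PHP's PHP_AUTH_* server variables and the fixed "oas" username described above:

    <?php
    // Sketch of the PHP-side HTTP BASIC check for the OAI handler.
    // The fixed username "oas" and a plugin password setting are assumed
    // as described above.
    function authenticateOaiRequest($expectedPassword) {
        if (!isset($_SERVER['PHP_AUTH_USER'])
                || $_SERVER['PHP_AUTH_USER'] !== 'oas'
                || $_SERVER['PHP_AUTH_PW'] !== $expectedPassword) {
            // Challenge the client with an HTTP BASIC authentication answer.
            header('WWW-Authenticate: Basic realm="OA-S statistics"');
            header('HTTP/1.0 401 Unauthorized');
            exit;
        }
    }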

Retrieving metrics from the Service Provider

The specification of the return format has not yet been officially documented by OA-S and is currently unknown to us. We only have an example of the possible JSON return format:

   {
     "from": "2012-10-01",
     "to": "2012-10-30",
     "entrydef": ["identifier", "date", "counter", "LogEc", "IFABC", "RobotsCounter", "RobotsLogEc", "RobotsIFABC"],
     "entries": [
       {"identifier": "oai:DeinDienst.de:00001", "date": "2012-10-10", "counter": 0, "LogEc": 0, "IFABC": 1, "RobotsCounter": 0, "RobotsLogEc": 0, "RobotsIFABC": 0},
       {"identifier": "oai:DeinDienst.de:00037", "date": "2012-10-11", "counter": 0, "LogEc": 0, "IFABC": 1, "RobotsCounter": 0, "RobotsLogEc": 0, "RobotsIFABC": 0},
        ...
      ]
    }
   Source: Email Julika, 31.10.2012

JSON parsing support has been integrated into PHP from version 5.2 onwards. This is a rather recent version and it would be better if we could support earlier 5.x versions, too.

OA-S informed us that they also support a CSV format. Reading CSV files is supported in PHP 4 and 5 as long as the CSV conforms to basic escaping rules. To achieve a full audit trail and allow for re-import of files, it would be a good idea anyway to save downloaded statistics information to files first and load them asynchronously into the database.

We therefore propose a load protocol that implements the following steps:

  1. Regularly poll the OA-S server for new data. This should be done on a daily basis and will be triggered via OJS' scheduled tasks.
  2. Whenever new data is available: Download all new metrics data into a well-known staging folder in the OJS files directory.
  3. Regularly poll the staging folder for new files. The scheduling for this will be done in the same way as for OA-S server polling.
  4. When a new file is present then try to parse and load it. Once we "claim" a file, we immediately remove it from the "hot" staging folder to avoid race conditions. When the parsing is successful then move the file to the file archive. Otherwise move it to the rejection folder where it can be analyzed, corrected and manually moved to the staging folder again.
  5. When a file contains data that has been loaded before then the last write wins. This guarantees that the loading process will be fully idempotent.

If data is lost in the database then files can be moved from the archive back to the staging folder where they'll be discovered and automatically loaded again.
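
A sketch of steps 3 and 4 of this protocol; folder names and the loader function are illustrative:

    <?php
    // Sketch of steps 3 and 4 of the load protocol; folder names and the
    // parseAndLoadMetrics() loader (e.g. based on fgetcsv()) are illustrative.
    function processStagingFolder($baseDir) {
        foreach (glob($baseDir . '/stage/*.csv') as $stagedFile) {
            // "Claim" the file by moving it out of the hot staging folder
            // to avoid race conditions with concurrent runs.
            $claimedFile = $baseDir . '/processing/' . basename($stagedFile);
            if (!rename($stagedFile, $claimedFile)) continue;
            if (parseAndLoadMetrics($claimedFile)) {
                rename($claimedFile, $baseDir . '/archive/' . basename($claimedFile));
            } else {
                rename($claimedFile, $baseDir . '/reject/' . basename($claimedFile));
            }
        }
    }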

The cronjob configuration requirements will be documented in a README and on the plugin settings page. Polling and loading can also be triggered by user action. This is implemented through an "update statistics data" button on the plugin settings page and is meant as a fallback for those who are unable or do not want to configure a cron job.

Value Added Services (II / III)

NB: The specification and implementation of the OA-S value added services are not part of the project phase OA-S I. We collect unstructured material for use in later project phases, with the exception of the common API for value added services, which must be fully specified now for timeline synchronization with other project members and to clearly specify data requirements on our part.

Common API for Value Added Services (II)

Some of the use cases for value added services require us to integrate various usage statistics in the same place (e.g. OA-S, COUNTER/OJS, ALM). Examples of this would be:

  • display of all article-specific metrics to end readers on the article abstract page
  • selection of a metric for search ranking
  • cross-metric reports for OJS editors.

To implement such use cases, we need an implementation-agnostic cross-plugin API that allows us to treat statistics from various sources uniformly. Thus I recommend two additions to core OJS:

  1. A specialized plug-in API similar to the API for public identifiers to be implemented by all ReportPlugin classes.
  2. A multi-dimensional online analytical processing (OLAP) database table for metric storage.

The following two sections will describe our recommendations for these additions in more detail.

Plug-In API

The proposed metrics API allows granular access to any number of metrics provided by different plug-ins. If there are several metric types, we have to define a site- or journal-specific primary metric type which will then be used wherever a single "most important" metric is required (e.g. for search ranking). We propose that such a selection be made in the site or journal settings by the OJS admin or journal manager, respectively.

I propose a change to the ReportPlugin base class. Similarly to PubIdPlugin, this class could serve as a plug-in agnostic metric provider:

  • Statistics plug-ins (e.g. COUNTER, OA-S, ALM) should provide a ReportPlugin.
  • If a plug-in needs hooks (e.g. to track usage events) it should be nested in a GenericPlugin.

We recommend the following specific API for ReportPlugin:

  • getMetrics($metricType = null, $columns = null, $filters = null, $orderBy = null, $range = null)
  • getMetricTypes()
  • getMetricDisplayType($metricType)
  • getMetricFullName($metricType)
  • These methods should return null in case of plug-ins that do not wish to provide metrics, e.g. the current articles, reviews and subscriptions reports
  • The exact meaning of the input variables will be explained in a later section (see below).

Furthermore we recommend the following specific API for publication objects (issue and article galleys, articles, supplementary files).

  • getMetrics($metricType = null, $columns = null, $filters = null, $orderBy = null, $range = null) to return a report with the given publication object pre-filtered.
  • This is for convenience only as we have quite a few use cases with that filter on a single publication object.
  • These methods return null in case metrics are not defined for the given object or filter.
  • Article/ArticleGalley/IssueGalley/SuppFile::getViews() would be renamed to ...::getReport() with the above signature throughout OJS.
  • If no $metricType is given for getMetrics() then the main/primary metric type will be used.
  • This API can be extended to issues and journals for aggregate metrics retrieval.
  • The exact meaning of the input variables will be explained in a later section (see below).

Plugins can internally decide whether to actually retrieve metrics from the MetricsDAO (see next paragraph) or whether to retrieve metrics from an external location (e.g. a web service).
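
As an illustration, here is a hedged sketch of a statistics plug-in implementing this API; the class name and metric type strings are assumptions:

    <?php
    // Sketch of a statistics plug-in implementing the proposed API.
    // Class name and metric type strings are assumptions.
    class OasReportPlugin extends ReportPlugin {
        function getMetricTypes() {
            return array('OA-S-Counter', 'OA-S-LogEc', 'OA-S-IFABC');
        }

        function getMetricFullName($metricType) {
            $names = array('OA-S-Counter' => 'OA-S COUNTER statistics'); // etc.
            return isset($names[$metricType]) ? $names[$metricType] : null;
        }

        function getMetrics($metricType = null, $columns = null, $filters = null,
                $orderBy = null, $range = null) {
            // This plug-in stores its metrics in the proposed metrics table,
            // so it can delegate directly to the MetricsDAO.
            $metricsDao =& DAORegistry::getDAO('MetricsDAO');
            return $metricsDao->getMetrics($metricType, $columns, $filters, $orderBy, $range);
        }
    }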

Metrics Table

It is entirely possible to specify a common data model for aggregate metrics storage in OJS, as OJS front-end statistics use cases are common to all statistics plug-ins. We therefore recommend consolidating the current plug-in specific metrics stores into a single metrics table.

Besides saving development time and complexity, such a table is crucial for implementing responsive and flexible cross-metrics reporting. Among other front-end use cases, such a table will help us in our goal to replace plug-in specific reports with a simple report generator for all metrics. It would also help us to implement requirements such as "time-based reporting" or any other use case that requires access to aggregate metrics data (which is most of them!) in a simple and efficient way across all metrics plugins.

Having local aggregate metrics data is necessary to provide speedy reports. I do not believe that building cross-metric aggregate reports based on on-the-fly access to a remote web service can be done reliably and with acceptable response times. Plugins like ALM that choose to implement a plug-in specific metrics storage or wish to retrieve metrics from a web service on the fly will still be able to integrate with all OJS front-end features through the API described in the previous paragraph.

Conceptually, the proposed table represents a multi-dimensional OLAP cube. It allows both granular (drill-down) and aggregate (slice-and-dice) access to metrics data.

The proposed cube has the following properties...

Dimensions:

  • publication object (represented by assocId + assocType)
  • time (day)
  • metric ("COUNTER", "OA-S", "ALM-Facebook", ...)

Aggregation hierarchies over these dimensions:

  • publication object: assocType
  • publication object: author
  • publication object: article -> issue -> journal (as aggregation hierarchy, not the objects themselves)
  • time: month -> year

Additional dimensions may have to be defined to cater to special administrative or front-end use cases, such as

  • geography: Where did the usage events originate from?
  • source file: This enables us to implement a scalable file based, restartable load process. Details of such a process are not in scope of this document.
  • ...

Facts:

  • Facts would be represented as a single integer or float data type dimensioned as just outlined.
  • Facts should be additive over all dimensions so that we can use them in reports that "slice" and "dice" the conceptual data cube. This excludes "from-to" notation for the date.
  • Monthly data that is not available on a daily basis can be modeled additively in such a table by leaving the day-aggregation level empty.

This is pretty much a standard OLAP design and should therefore serve all aggregate data needs that may come up in reports or elsewhere.

While the conceptual level seems rather complicated, we can implement such a conceptual model with a single additional table in core OJS as all dimension tables either already exist in the database or could be implemented "virtually" by on-the-fly aggregation, e.g. for the date hierarchy.

We therefore propose a single new database table 'metrics' with the following columns:

  • assocId: foreign key to article, article or issue galley, supplementary file
  • assocType: determining the type of the assoc id (article, article or issue galley, supplementary file)
  • day: the lowest aggregate level of the time dimension
  • month (optional): required only if we want to support month-only aggregation, either because some metrics cannot be provided on a daily basis or to compress historical data
  • metricType: e.g. "OA-S-Counter", "OJS-Counter", "ALM-Facebook", etc.
  • loadId (optional): a plug-in specific load identifier used to implement a restartable and scalable load process, e.g. a file name, file id or run id
  • country (optional): the lowest aggregate level of the source geography dimension
  • metric: an integer or float column that represents the aggregate metric

The dimension columns (all except 'metric') should have a multi-column unique index placed on them to enforce data consistency.

The number of data items in the proposed table will be considerably less than what we currently need to store raw event data. I therefore do not believe that such a table will reach the underlying database's scalability limit. If it ever did, we could still purge old metric data or reduce the granularity of historical data (e.g. from per-day to per-month).

For a simple estimate we assume the following maximum dimension cardinality:

  • assocId/assocType: 3000
  • day: 365 * 10 = 3650
  • metricType: 5
  • loadId is not included in the calculation as we assume that there will be at most a single load ID per day which makes loadId isomorphic to the day dimension
  • country is not included as we do not intend to implement this dimension for the time being

We further assume completely dense data (which gives us an upper bound way above the probable data distribution).

Under these pessimistic assumptions we get a maximum of 3000 * 3650 * 5 = 54.75 * 10^6 rows which is comfortably within the range supported by current MySQL versions and probably way above what even large OJS installations will ever encounter. Compressing historical data per day will provide a further theoretical compression ratio of 12:365 (about 96%) for completely dense data. In practice compression ratios will be lower with non-dense data but probably still well above 50%.

Access to the metrics table would be mediated through a MetricsDAO. The MetricsDAO should provide methods to insert/update metrics and to administer scalable load processes. The MetricsDAO should never be used for direct metrics access, except from statistics plugins that route data through their getReport() API. This allows us to support common front-end features for plug-ins like ALM even if those do not support access through the MetricsDAO.

Input and Output Formats (Aggregation, Filters, Metrics Data)

To support our use cases through a common API, our input format should support the following filter and aggregation requirements:

  • specify dimension hierarchy elements for columns ("dicing") to define the aggregation level
  • specify report-level filters ("slicing") through one of...
    • dimension element ranges (for ordered dimensions)
    • dimension element selection (for discrete dimensions)
  • specify the result order
  • specify result ranges (for use cases like paging, "top 10", etc.)

More specifically the input format will consist of...

Metrics selection ($metricType):

  • The metrics dimension is special for two reasons: It is not additive and it cannot be sensibly ordered.
  • This means that...
    • It must either be included on column level or a single metric must be chosen as a report-level filter.
    • It does not make sense to filter it by range.
  • Advantages of implementing metrics selection as a separate, mandatory input parameter are...
    • We automatically enforce consistency with the additional restrictions of the dimension.
    • The common scenario that a single metric value or a list of metrics should be retrieved for a publication object can be conveniently supported without a complex column/filter specification.
    • The default metrics API becomes very similar to the API for public IDs which makes it easier to understand and use in most cases.
  • It has to be kept in mind, though, that conceptually, metrics are just another dimension. Future extensions to the API should not repeat the same pattern for other dimensions unless there are similarly compelling reasons for it. Additional dimensions should usually be supported through the $filter/$columns/... variables as specified below!
  • The semantics of the $metricType variable are as follows...
    • Setting $metricType to null or not setting it at all will select the main/primary metric; the metric column will not be included in the report. If no further columns are specified then a single scalar value will be returned.

    • Setting $metricType to a scalar value will select that metric as a report-level filter without selecting it as a column. If no further columns are specified then a single scalar value will be returned.
    • Setting $metricType to an unordered array of several metric dimension elements will select those metrics as report-level filters and at the same time include the metrics column in the report.
    • Setting $metricType to "*" will select all metrics available in the given context (i.e. publication object, plug-in etc.) and at the same time include the metrics column in the report.

An optional dimension hierarchy specification ($columns):

  • The $columns variable is an unordered (non-hashed) array.
  • The array contains a list of identifiers (PHP constants) for the lowest hierarchy aggregation level to be included in the report.
  • Precisely zero or one value can be defined for each aggregation hierarchy. There may be several values per dimension if, and only if, there are several aggregation hierarchies for that dimension (e.g. authors and assocType for the publication object dimension). The dimension hierarchy specification will be checked for consistency with respect to this requirement before executing the report.
  • A single scalar value can be given in case a single column should be specified.
  • Dimension hierarchies not included in this specification will be aggregated over (implicit selection of the top hierarchy level).
  • The input variable could probably be named $aggregation or something with $dimension... in it for better conceptual consistency. I chose to name it $columns, though, so that those who do not have a firm grasp of OLAP concepts immediately understand what it actually means.

An optional report-level filter specification ($filter):

  • The $filter variable is a hashed array.
  • It contains identifiers (PHP constants) for the filtered hierarchy aggregation level as keys.
  • As values it either contains a hashed array with from/to entries with two specific hierarchy element IDs in case of a range filter or an unordered array of one or more hierarchy element IDs in case of an element selection filter.
  • If only a single element is to be filtered then the value can be given as a single ID rather than an array.
  • If no filter is given for a dimension then it is assumed that data should be aggregated over all dimension elements (implicit selection of the top hierarchy element).
  • If no filter is given at all then all dimensions will be aggregated over.

An optional result order specification ($orderBy):

  • The $orderBy variable is a hashed array.
  • It contains identifiers (PHP constants) for the hierarchy aggregation levels to be ordered as keys.
  • For each identifier you can specify one of the two values "asc" or "desc".
  • It can only include hierarchy aggregation levels that have also been specified in $columns. The order specification will be checked for consistency with respect to this requirement before executing the report.

An optional result range specification ($range): This is the usual PKP paging object DBResultRange.
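
To illustrate the complete input format, a monthly per-article report for 2012 on the primary metric might be requested as follows; all constant names are hypothetical:

    <?php
    // Illustration of the input format; all dimension/constant names are
    // hypothetical. Requests a monthly per-article report for 2012 in the
    // primary metric ($metricType = null).
    $columns = array(STATISTICS_DIMENSION_ASSOC_ID, STATISTICS_DIMENSION_MONTH);
    $filters = array(
        STATISTICS_DIMENSION_ASSOC_TYPE => ASSOC_TYPE_ARTICLE, // element selection
        STATISTICS_DIMENSION_MONTH => array('from' => '201201', 'to' => '201212') // range
    );
    $orderBy = array(STATISTICS_DIMENSION_MONTH => 'asc');
    $range = new DBResultRange(25, 1); // standard PKP paging: 25 rows, first page

    $report = $reportPlugin->getMetrics(null, $columns, $filters, $orderBy, $range);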

With this input design in mind we can make usage recommendations for the access APIs specified in previous paragraphs (access through report plugins, publication objects and the MetricsDAO):

  • Cross-metrics (plugin-agnostic) reports (e.g. on site, journal or article level) can best be supported by direct access to the MetricsDAO. Otherwise we'd have to implement ordering and paging in PHP code which would be unnecessarily complex and slow. We therefore propose that plug-ins that do not wish to store their data in the metrics table will hook into calls to the MetricsDAO and insert their own results into the intermediate result from the MetricsDAO before returning it to the client.
  • Access to specific metrics, or metrics that can be served from a single plug-in, should always go through the plug-in API so that plug-ins that do not want to hook into the MetricsDAO will be supported out of the box for these use cases.

Output format:

  • We return report data in flat tabular format through the usual db results iterator (DAOResultFactory) from the various getReport() versions.
  • We include all hierarchy levels above the specified dimension hierarchy levels as columns by default. This adds no relevant additional cost on the database side and considerably simplifies our input format.
  • Dimension hierarchy columns will appear in a pre-defined order. Clients will have to re-order the columns if needed.

Further implementation details:

  • To keep our API as simple as possible, we do not provide a full cross-table implementation but rather place all dimensions into columns followed by a single aggregate fact column. If it turns out that more advanced cross-table requirements exist somewhere in the front-end then flat tabular data will have to be "pivoted" on the fly. This can be done through a standard transformation function somewhere in the OJS core. As not all supported database versions support pivoting, we have to do this in PHP anyway.
  • Currently all our dimensions are discrete and can be ordered somehow (except for the metric type, see above). So we support dimension element ranges and selection without further restrictions. Some of the possible filter settings may not make a lot of sense, though (e.g. summing up metrics for countries from 'D' to 'F' or selecting completely disparate dates). I don't think it's necessary to implement further restrictions, though. Users get what they ask for. ;-)
  • When calling the API from the context of a publication object, the publication object may not be included as a column or as a filter.
  • When calling the API from the context of a plug-in then only metrics available from that plug-in may be selected.
  • We do not support filtering based on aggregate metric data (e.g. "all articles with accesses > 100 in the last year"). That said: most requirements like this can be simulated by combining ranges, ordering and dimension selection anyway.
  • Support for this API can easily be implemented within the MetricsDAO due to the design of the proposed metrics table (see the sketch after this list):
    • Dimension selection corresponds precisely to a SELECT statement on that table.
    • Dimension filtering corresponds precisely to a WHERE clause on the same table or on one of the (potentially virtual) dimension hierarchy columns.
    • Ordering can be easily implemented with an ORDER BY clause.
    • Result ranges can be defined through the standard paging mechanism.
  • Specific reporting use cases can be tuned by introducing further aggregates (for better "dicing" support) or indexes (for "slices" that select at most a one-digit percentage of the cube). We do not implement such support unless it turns out in practice that certain use cases could not be implemented otherwise.
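
As referenced in the list above, here is a simplified sketch of how the MetricsDAO could translate a report specification into SQL, assuming dimension constants have already been mapped to column names:

    <?php
    // Simplified sketch of the mapping inside the MetricsDAO. It assumes the
    // dimension constants have already been translated into column names;
    // escaping, validation and hierarchy columns are omitted.
    function buildReportSql($columns, $filters, $orderBy) {
        $select = implode(', ', $columns);
        $where = array();
        foreach ($filters as $column => $values) {
            if (is_array($values) && isset($values['from'])) {
                $where[] = "$column BETWEEN '{$values['from']}' AND '{$values['to']}'"; // range filter
            } else {
                $where[] = "$column IN ('" . implode("', '", (array)$values) . "')"; // element selection
            }
        }
        $order = array();
        foreach ((array)$orderBy as $column => $direction) {
            $order[] = "$column $direction";
        }
        return "SELECT $select, SUM(metric) AS metric FROM metrics"
            . ($where ? ' WHERE ' . implode(' AND ', $where) : '')
            . " GROUP BY $select"
            . ($order ? ' ORDER BY ' . implode(', ', $order) : '');
    }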

Value-Added Features (II / III)

Most-Viewed Articles (II)

tbd.

Search Ranking (III)

tbd.

Statistics Reports (III)

tbd.

Similar Articles (III)

tbd.

Editors and Authors (III?, PKP/Juan?)

tbd.

Readers (PKP/Juan)

tbd.

A Simple Cross-Plugin Report Generator (PKP/Bruno)

On the basis of the PHP input format specification of the front-end API (see above) we can easily define an HTTP GET version of the input protocol. Translating such a protocol one-to-one to a call on the internal API is trivial.
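
As a hedged illustration, such a GET request and its translation might look like this (parameter names are hypothetical):

    <?php
    // Hypothetical GET request, e.g.:
    //   .../reports?metricType=OA-S-Counter&columns[]=month&filter[assocType]=article
    // The translation onto the internal API is one-to-one:
    $report = $reportPlugin->getMetrics(
        Request::getUserVar('metricType'),
        Request::getUserVar('columns'),
        Request::getUserVar('filter'),
        Request::getUserVar('orderBy')
    );
    // ... then render $report as XML, CSV, HTML or PDF ...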

This will allow us to easily define reports which then can be presented in various formats (XML, CSV, HTML, PDF) without having to implement or even think about a heavy report generator front-end. Such a front-end can be implemented at any time if we like but I'd rather not try to re-invent Excel Pivot Tables in OJS.

With an HTTP GET protocol, exporting live OLAP data to Excel or other OLAP tools will be extremely easy. Thus data can be flexibly analyzed outside OJS, while common predefined reports with dynamic choices (e.g. time) can be implemented in OJS with a lightweight front-end, or on any page with a simple download/page link.

This gives advanced OJS users nice opportunities for very easy but still powerful definition of custom reports in any supported output format everywhere in OJS or even for inclusion on external web pages!