Difference between revisions of "OJSdeStatisticsConcept"

From PKP Wiki

Revision as of 11:08, 15 January 2013


The OA-S Project

The German Open Access Statistics (OA-S) project intends to provide a framework and infrastructure that enables OJS users to establish alternative usage statistics for Open Access content and build value added services on top of such statistics.

This specification describes the use cases, requirements and implementation recommendations for the integration of OA-S into OJS.

The basic idea of OA-S can be described as a co-operation of two institutions:

  1. The data provider (in this case operators of OJS installations) tracks access to documents (in our case article and issue galleys, abstracts and supplementary files).
  2. Access data will then be made available through a protected OAI interface and harvested by the OA-S service provider.
  3. The service provider will clean the raw data and produce aggregate metrics based on the COUNTER standard. At a later stage LogEc and IFABC statistics may also be supported.
  4. The data provider retrieves metrics from the service provider on a daily basis and stores them in OJS.

These metrics can then be used in different ways in OJS:

  • They can be displayed to the editors, authors and readers.
  • They can be used in search for ranking.
  • Editors could produce statistics reports.
  • We could provide a "most viewed article" feature.
  • We could implement a feature that displays "other (more viewed) articles of the same author".
  • We could display "similar (more viewed) articles" on an article's page.

We'll use the terms "data provider" and "service provider" from here on without further explanation. "Data provider" and "OJS user" can be used synonymously. We use the term "end user" to refer to users accessing an OJS site.

Requirements for OJS OA-S Integration

The requirements for OJS OA-S integration can be divided into four areas:

  1. log data extraction and storage (OJS)
  2. log transfer (OJS -> OA-S)
  3. metrics retrieval (OA-S -> OJS)
  4. value added services (OJS)

The requirements in these areas are...

Log data extraction and storage:

  • We log usage events for access to issue and article galleys, article abstracts and supplementary files.
  • Logged data must be pseudonymized immediately.
  • Logged data must be deleted immediately after it has been successfully transferred to the service provider or after it expires.

Log transfer:

  • Log data must be transformed into context objects.
  • We then have to provide an HTTP BASIC authenticated OAI interface from which the service provider will harvest raw log data.

Metrics retrieval:

  • We retrieve final metrics data from the service provider via a JSON web service.
  • Metrics data will then be stored into an OLAP data model which allows us easy data access, both granular and aggregate.

Value added services:

  • Display granular (per-object) metrics to the readers, authors and editors.
  • Use metrics as a search ranking criterion together with the Lucene plugin.
  • Define reports on OA-S metrics similarly to already existing OJS reports.
  • Implement a "most viewed articles" feature.
  • Implement an "other (more viewed) articles of the same author" feature.
  • Implement a "similar (more viewed) articles" feature.

The following sections will provide analysis and implementation recommendations for these requirement areas.

Data extraction and storage

Log Events

OA-S expects us to deliver certain usage event data. OA-S provides sample code for DSpace log extraction. The same code is also provided via SVN. We analyzed the specification as well as the sample code to define the following list of all required or optional data items. The proposed OJS/PHP data source for each item is given in parentheses:

  • usage event timestamp (PHP's time() function +/- local time offset)
  • administration section
    • HTTP status code (will always be 200 in our case as we won't produce view events for non-200 responses)
    • downloaded document size (difficult to determine from PHP; we may use connection_aborted() to identify incomplete downloads and set the download size to 0 in that case, and to the actual document size otherwise, which would at least approximate the original behavior)
    • actual document size (PKPFile::getFileSize() for galleys and supplementary files, I propose 0 in the case of article abstracts)
    • document format (MIME type, PKPFile::getFileType())
    • URI of the service (e.g. OJS journal URL, Config::getVar('general', 'base_url'))
  • referent section
    • document URL (This will be the canonical "best URL" produced by PKPRouter::url() + ...::getBest...Id() )
    • optional: one or more unique document IDs (e.g. DOI, URN, etc.) (can be easily retrieved from the PubId-plugins for objects where IDs are defined)
  • referring entity section
    • HTTP referrer (if available, $_SERVER['HTTP_REFERER'])
    • optional: additional identifiers of the referring entity (e.g. DOI, ...) if available (not implemented in OJS)
  • requester section
    • hashed + salted IP (PKPRequest::getRemoteAddr() + hashing)
    • hashed C class (PKPRequest::getRemoteAddr() + truncation + hashing)
    • hostname of the requesting entity (if available), truncated to the second level domain (recommendation: do not implement in OJS as this would require one DNS request per view event, which would be very expensive)
    • optional: classification
      • internal: Usage events due to internal requirements, e.g. automated integrity checks, availability checks, etc.
      • administrative: Usage events that happen due to administrative decisions, e.g. for quality assurance. (proposal: use this category for accesses by logged in editors, section editors, authors, etc.)
      • institutional: Usage events triggered from within the institution running the service for which usage events are being collected.
    • optional: hashed session ID or session (recommendation: do not send)
    • HTTP user agent (if available, use PKPRequest::getUserAgent())
  • service type section (omitted, only relevant for link resolvers)
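
The data items above can be collected into a single record per view event. The following is a minimal sketch in Python, chosen for brevity (an OJS implementation would be PHP). All field names and the helper emulated_download_size() are hypothetical and are not taken from the OA-S specification; the helper only mirrors the proposed rule for the downloaded document size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageEvent:
    """One view event. Field names are illustrative, not an OA-S schema."""
    timestamp: int                        # Unix time of the request
    status_code: int                      # always 200, other responses are not logged
    document_size: int                    # actual size in bytes, 0 for abstracts
    downloaded_size: int                  # emulated, see emulated_download_size()
    mime_type: str                        # document format
    service_uri: str                      # e.g. the OJS base URL
    document_url: str                     # canonical URL of the accessed object
    hashed_ip: str                        # pseudonymized, never the raw address
    hashed_c_class: str                   # pseudonymized C class
    referrer: Optional[str] = None        # HTTP referrer, if available
    user_agent: Optional[str] = None      # HTTP user agent, if available
    classification: Optional[str] = None  # internal, administrative or institutional


def emulated_download_size(document_size: int, aborted: bool) -> int:
    """Report 0 for aborted transfers and the full document size otherwise."""
    return 0 if aborted else document_size
```

An abstract view would be recorded with document_size=0, which also yields downloaded_size=0 under this rule.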

According to OA-S information, these data items will be used for the following purposes:

  • IP address and timestamp are used to recognize double downloads as defined by the COUNTER standard. Such "double clicks" will be counted as a single usage event.
  • The C class of the IP address will furthermore be used to recognize robots and exclude their usage from the statistics.
  • The file information (url, name, document id, url parameters, etc.) is used to uniquely identify the document which has been accessed.
  • The HTTP status code will be used as only successful accesses may be counted.
  • The size of the document is used to identify full downloads (e.g. 95% of the file downloaded). Partial or aborted downloads will not be counted as usage events.
  • The HTTP user agent will be used to identify robots and to remove their usage from the statistics.
  • The referrer information is used to analyze how users found the service and can be used to improve the service (potential sources: search engines, organizational web portal).

To capture the required information, I recommend implementing a specialized view event hook that all statistics plug-ins can subscribe to. This allows us to better standardize OJS view events (e.g. to simulate Apache log events as much as possible) and keeps code overhead to a minimum. If an additional OJS hook should be avoided we can hook into the existing TemplateManager::display and FileManager::downloadFile hooks and filter these generic events to identify view events we are interested in.

If we implement a statistics event hook then I recommend that such a hook provide data items similar to the variables available in Apache's log component. This enables us to easily switch the statistics hook later, e.g. based on Apache logs or shared storage as has been proposed.

Privacy Protection

We assume that many OJS providers using the OA-S extension will be liable to German privacy laws. While OJS users will have to evaluate their legal situation on a per-case basis and we cannot guarantee in any way that OJS conforms to all legal requirements in individual cases, we provide basic technical infrastructure that may make it easier for OJS users to comply with German privacy law.

Legal Requirements

The OA-S project commissioned two legal case studies with respect to German privacy law: one describes the legal context of OA-S application at University Stuttgart, the other focuses more generally on OA-S users, especially project members. The first report was written during an earlier phase of the OA-S project when privacy-enhancing measures, like the use of a SALT for IP hashing, were not yet specified. The second report is more recent. It assumes an enhanced privacy infrastructure, i.e. the use of a SALT to pseudonymize IP addresses. We therefore base our implementation recommendations on the results of the second report.

The report recommends that data providers liable to German privacy law implement the following infrastructure:

  • All personal data must be pseudonymized immediately (within a few minutes) after being stored. This can be achieved by hashing IP addresses with a secret salt. The salt must have a length of at least 128 bits and must be cryptographically secure. The salt must be renewed about once a month and may not be known to the OA-S service provider. The salt will be distributed through a central agent to all data providers. A single salt can be used for all data providers if they do not share pseudonymized data. Pseudonymized data must be transferred to the service provider immediately, i.e. every five minutes, and deleted by the data provider thereafter.
  • Data providers have to provide the means for end users to deny data collection ("opt-out"). The cited report comes to the conclusion that an active "opt-in" of end users is not necessary if data will be reliably pseudonymized. It recommends an "opt-out" button which, if clicked, could result in a temporary cookie being set in the end user's browser. Whenever such a cookie is present, usage data may not be collected. The report recommends against setting a permanent cookie as this may now or in the future require active "opt-in" on the part of the end user. Alternatively the user's IP address could be blacklisted while using the service, i.e. entered into a table and all data from that IP would then not be stored. The blacklist entry would have to be deleted after the user session expires.
  • Data providers have to inform end users about their right to opt out of data collection before they start using the service. They also have to inform the end user that opting out of the service will result in a temporary cookie being set in their browsers. This information must be available not only once when the user starts using the service but permanently, e.g. through a link.
  • Data providers will have to implement further organizational measures (registration of data processing organizations, reporting data usage to end users on demand)
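
The first requirement (salted hashing of the IP address and of its C class) can be illustrated with a short sketch. This is explicitly not the OA-S algorithm: the use of HMAC-SHA256, the function names, and the restriction to dotted IPv4 addresses are assumptions made for illustration only. In practice the function shipped with the OA-S sample application should be used so that all data providers produce compatible pseudonyms.

```python
import hashlib
import hmac

MIN_SALT_BYTES = 16  # 128 bits, the minimum salt length required by the report


def c_class(ipv4: str) -> str:
    """Truncate a dotted IPv4 address to its first three octets."""
    octets = ipv4.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address")
    return ".".join(octets[:3])


def pseudonymize(value: str, salt: bytes) -> str:
    """Return a keyed hash of value.

    HMAC-SHA256 is an assumption for this sketch, not the OA-S algorithm.
    """
    if len(salt) < MIN_SALT_BYTES:
        raise ValueError("salt must be at least 128 bits long")
    return hmac.new(salt, value.encode("ascii"), hashlib.sha256).hexdigest()
```

With the same salt, the same address always maps to the same pseudonym, which is what allows the service provider to recognize double clicks without ever seeing the raw address. Once the salt is renewed, pseudonyms from different months can no longer be linked.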

Salt Management Interface

As pointed out in the previous section, we'll have to salt our pseudonymization hash function. Within the OA-S project, University Library Saarbrücken (SULB) provides a central SALT distribution mechanism as described in the OA-S technical manual for new repositories. SALTs will be provided on a monthly basis and have to be downloaded to OJS. SULB provides a Linux shell script (alternative unprotected link) to download SALTs. We recommend downloading the SALT from within OJS directly instead, to avoid the additional complexity of calling a shell script and to better support Windows users. The SALT can be downloaded from an HTTP BASIC protected location.

A new salt is usually provided at the beginning of each month. We recommend the following algorithm for salt management to be implemented in OJS:

   Whenever we receive a log event:
       IF the download timestamp of the current SALT is within the current month THEN
           use the current SALT to pseudonymize log data
       ELSE
           IF the "last download time" lies within the last fifteen minutes THEN
               use the current SALT
           ELSE
               authenticate to "oas.sulb.uni-saarland.de"
               download "salt_value.txt"
               set the "last download time" to the current time() value
               IF the downloaded SALT is different from the current SALT THEN
                   replace the current SALT with the downloaded SALT
                   set the timestamp of the SALT to the current time() value
                   use the new SALT to pseudonymize log data
               ELSE
                   use the current SALT
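
The algorithm can be sketched in Python as follows, reading the nested conditions as if/else branches. The names SaltState and current_salt() are hypothetical, the month comparison uses UTC as an assumption, and the authenticated HTTP download is passed in as a function so that the sketch makes no claim about the actual SULB interface.

```python
import time
from dataclasses import dataclass
from typing import Callable

RETRY_INTERVAL = 15 * 60  # seconds to wait between two download attempts


@dataclass
class SaltState:
    salt: str
    salt_timestamp: float      # when the current salt was stored
    last_download_time: float  # when the salt server was last contacted


def same_month(a: float, b: float) -> bool:
    """True if both Unix timestamps fall into the same calendar month (UTC)."""
    ta, tb = time.gmtime(a), time.gmtime(b)
    return (ta.tm_year, ta.tm_mon) == (tb.tm_year, tb.tm_mon)


def current_salt(state: SaltState, download: Callable[[], str], now: float) -> str:
    """Return the salt to use for a log event received at time now."""
    if same_month(state.salt_timestamp, now):
        return state.salt  # salt is from this month, nothing to do
    if now - state.last_download_time < RETRY_INTERVAL:
        return state.salt  # tried recently, do not contact the server again
    downloaded = download()
    state.last_download_time = now
    if downloaded != state.salt:
        state.salt = downloaded
        state.salt_timestamp = now
    return state.salt
```

The fifteen-minute throttle matters at the start of a month: until the new salt has been published, every log event would otherwise trigger a download attempt.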


The OA-S sample application provides an algorithm for IP pseudonymization based on the SALT value retrieved from SULB.

We recommend using this exact PHP function to pseudonymize IPs in OJS.

Data Storage

Statistics events have to be stored temporarily between the time they arrive and the time they are harvested by the service provider. For intermediate data storage I recommend a plug-in specific internal database table that contains the fields mentioned in #Data Extraction above. Personal data (IP) must be stored in its pseudonymized form (see #Privacy Protection).

Due to privacy restrictions and to avoid scalability problems, we should delete log data as soon as it has been successfully transferred to the service provider. Unfortunately, OA-S has not yet specified a protocol that allows us to determine when data has been received by the service provider's server (Source: Email Julika 30.10.2012). OA-S uses the OAI protocol to harvest statistics data. This protocol does not support success/failure messages by itself. Harvesters usually retry access when there was a communications failure. Although access to the OAI interface is authenticated, this means that we cannot delete data whenever we receive an OAI request. We rather have to define a fixed maximum time that log data may be kept in the log and then delete it, independently of its transfer status.

We therefore recommend saving log data indexed by its time stamp. Whenever we manipulate the log table (i.e. when receiving a log event) we'll automatically delete all expired log data.

As the log expiry time will be relatively short, we do not see the necessity to rotate the log table or otherwise improve scalability.
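
A minimal sketch of this purge-on-write approach, using Python's sqlite3 module as a stand-in for the OJS database layer. The table name usage_log, its columns, and the expiry time of 300 seconds are assumptions for illustration; the actual expiry time would have to be agreed with the service provider.

```python
import sqlite3

MAX_AGE = 300  # assumed expiry time in seconds


def open_log(conn: sqlite3.Connection) -> None:
    """Create the log table and the time stamp index if they do not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS usage_log ("
        "logged_at INTEGER NOT NULL, payload TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS usage_log_time ON usage_log (logged_at)"
    )


def log_event(conn: sqlite3.Connection, payload: str, now: int,
              max_age: int = MAX_AGE) -> None:
    """Delete expired rows, then store the new event."""
    conn.execute("DELETE FROM usage_log WHERE logged_at < ?", (now - max_age,))
    conn.execute(
        "INSERT INTO usage_log (logged_at, payload) VALUES (?, ?)",
        (now, payload),
    )
    conn.commit()
```

Because the delete runs on the indexed time stamp column, each write only touches the expired rows, and the table never grows beyond the events of one expiry window plus whatever has accumulated since the last write.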

Log Transfer

NB: The specification and implementation of log transfer are not part of the project phase OA-S I. We collect unstructured material for use in later project phases.

Transformation into Context Objects


OAI interface


  • OA-S provides a validator service for the data provider OAI interface.
  • We use HTTP Basic authentication to protect the OAI end point (see email Julika, 31.10.2012)
  • A protocol to confirm successful reception of raw data has not yet been implemented (see email Julika, 31.10.2012)

Retrieving metrics from the Service Provider

NB: The specification and implementation of the OA-S statistics retrieval interface are not part of the project phase OA-S I. We collect unstructured material for use in later project phases.

The JSON retrieved from OA-S looks something like this:

   {
     "from": "2012-10-01",
     "to": "2012-10-30",
     "entrydef": ["identifier", "date", "counter", "LogEc", "IFABC", "RobotsCounter", "RobotsLogEc", "RobotsIFABC"],
     "entries": [
       {"identifier": "oai:DeinDienst.de:00001", "date": "2012-10-10", "counter": 0, "LogEc": 0, "IFABC": 1, "RobotsCounter": 0, "RobotsLogEc": 0, "RobotsIFABC": 0},
       {"identifier": "oai:DeinDienst.de:00037", "date": "2012-10-11", "counter": 0, "LogEc": 0, "IFABC": 1, "RobotsCounter": 0, "RobotsLogEc": 0, "RobotsIFABC": 0},
       ...
     ]
   }

Source: Email Julika, 31.10.2012
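
A response of this shape could be flattened into one row per object, day and metric type before it is stored. The sketch below assumes that the complete response is a single JSON object with the keys shown in the sample and that every name in "entrydef" other than "identifier" and "date" denotes a metric; the function name metric_rows() is hypothetical.

```python
import json
from typing import List, Tuple

KEY_FIELDS = ("identifier", "date")  # everything else is treated as a metric


def metric_rows(document: str) -> List[Tuple[str, str, str, int]]:
    """Return (identifier, date, metric name, value) tuples for all entries."""
    data = json.loads(document)
    metric_names = [name for name in data["entrydef"] if name not in KEY_FIELDS]
    rows = []
    for entry in data["entries"]:
        for name in metric_names:
            rows.append((entry["identifier"], entry["date"], name, entry[name]))
    return rows
```

Each tuple then corresponds to one fact in a table that is dimensioned by publication object, day and metric type.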

Value Added Services

NB: The specification and implementation of the OA-S value added services are not part of the project phase OA-S I. We collect unstructured material for use in later project phases. The exception is the common API for value added services, which must be fully specified now to synchronize timelines with other project members and to clearly specify data requirements on our part.

Common API for Value Added Services

Some of the use cases for value added services require us to integrate various usage statistics in the same place (e.g. OAS, COUNTER/OJS, ALM). Examples of this would be:

  • display of all article-specific metrics to end readers on the article abstract page
  • selection of a metric for search ranking
  • cross-metric reports for OJS editors.

To implement such use cases, we need an implementation-agnostic cross-plugin API that allows us to treat statistics from various sources uniformly. Thus I recommend two additions to core OJS:

  1. A specialized plug-in API similar to the API for public identifiers to be implemented by all ReportPlugin classes.
  2. A multi-dimensional online analytical processing (OLAP) database table for metric storage.

The following two sections will describe our recommendations for these additions in more detail.

Plug-In API

The proposed metrics API allows granular access to any number of metrics provided by different plug-ins. If there are several metric types, we have to define a site- or journal-specific primary metric type which will then be used where a single "most important" metric is required (e.g. for search ranking). We propose that this selection is made in the site or journal settings by the OJS admin or journal manager respectively.

I propose a change to the ReportPlugin base class. Similarly to PubIdPlugin, this class could serve as a plug-in agnostic metric provider:

  • Statistics plug-ins (e.g. COUNTER, OA-S, ALM) should provide a ReportPlugin.
  • If a plug-in needs hooks (e.g. to track usage events) it should be nested in a GenericPlugin.

We recommend the following specific API for ReportPlugin:

  • getMetric($metricType, $pubObject, ...optional filter...)
  • getMetrics($pubObject, ...optional filter...)
  • getMetricTypes()
  • getMetricDisplayType($metricType)
  • getMetricFullName($metricType)
  • These methods should return NULL in case of plug-ins that do not wish to provide metrics, e.g. the current articles, reviews and subscriptions reports
  • filter criteria could be specified for any of the metric dimensions defined in the metrics table (see next section)

Furthermore we recommend the following specific API for publication objects (issue and article galleys, articles, supplementary files).

  • getMetric($metricType = null, ...optional filter criteria...) to return a single metric for the given publication object
  • getMetrics(...optional filter criteria...) to return an array with all metrics for the given publication object
  • These methods return NULL in case metrics are not defined for the given object or filter.
  • Article/ArticleGalley/IssueGalley/SuppFile::getViews() would be renamed to ...::getMetric() with the above signature throughout OJS.
  • If no $metricType is given for getMetric() then the main/primary metric type will be used.
  • This API can be extended to issues and journals for aggregate metrics retrieval.

Plugins can internally decide whether to actually retrieve metrics from the MetricsDAO (see next paragraph) or whether to retrieve metrics from an external location (e.g. a web service). The proposed API is NOT meant for aggregate data access as required by reporting. Such requirements should be served via the MetricsDAO which will be described next.
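
The interplay between the two APIs can be sketched as follows. The sketch is written in Python for brevity and is not OJS code: MetricProvider stands in for the proposed ReportPlugin methods, PublicationObject for an article or galley, InMemoryProvider is a toy plug-in, and the metric type strings are made-up examples. Optional filter criteria are left out.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Sequence, Tuple


class MetricProvider(ABC):
    """Stand-in for the metric methods proposed for ReportPlugin."""

    @abstractmethod
    def get_metric_types(self) -> List[str]:
        """Return the metric types this plug-in provides (may be empty)."""

    @abstractmethod
    def get_metric(self, metric_type: str, object_id: int) -> Optional[int]:
        """Return one metric value, or None if it is not defined."""

    def get_metrics(self, object_id: int) -> Optional[Dict[str, int]]:
        """Return all defined metrics for the object, or None if there are none."""
        found = {}
        for metric_type in self.get_metric_types():
            value = self.get_metric(metric_type, object_id)
            if value is not None:
                found[metric_type] = value
        return found or None


class InMemoryProvider(MetricProvider):
    """Toy plug-in backed by a dict keyed by (metric type, object id)."""

    def __init__(self, values: Dict[Tuple[str, int], int]):
        self._values = values

    def get_metric_types(self) -> List[str]:
        return sorted({metric_type for metric_type, _ in self._values})

    def get_metric(self, metric_type: str, object_id: int) -> Optional[int]:
        return self._values.get((metric_type, object_id))


class PublicationObject:
    """Stand-in for an article, galley or supplementary file."""

    def __init__(self, object_id: int, providers: Sequence[MetricProvider],
                 primary_metric_type: str):
        self.object_id = object_id
        self.providers = providers
        self.primary_metric_type = primary_metric_type

    def get_metric(self, metric_type: Optional[str] = None) -> Optional[int]:
        """Return one metric, falling back to the primary metric type."""
        wanted = metric_type or self.primary_metric_type
        for provider in self.providers:
            if wanted in provider.get_metric_types():
                return provider.get_metric(wanted, self.object_id)
        return None
```

The publication object never needs to know where a value comes from; a provider is free to answer from a local table or from a web service.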

Metrics Table

It is entirely possible to specify a common data model for aggregate metrics storage in OJS as OJS front-end statistics use cases are common to all statistics plug-ins. We therefore recommend consolidating the current plug-in specific metrics stores into a single metrics table.

Besides saving development time and reducing complexity, such a table is crucial for implementing cross-metrics reporting. Among other front-end use cases, such a table would enable us to replace plug-in specific reports with a simple report generator for all metrics. It would also help us to implement requirements such as "time-based reporting" in a simple and efficient way across all metrics plugins.

Having local aggregate metrics data is necessary to provide speedy reports. I do not believe that building cross-metric aggregate reports based on on-the-fly access to a remote web service can be done with acceptable response times. Metrics not stored to the metrics table would therefore not be available for standard OJS reporting. Plugins like ALM that choose to implement a plug-in specific metrics storage or wish to retrieve metrics from a web service on the fly will have to provide their own reporting infrastructure and could not use the default report generator.

Conceptually, the proposed table represents a multi-dimensional OLAP cube. It allows both granular (drill down) and aggregate (slice and dice) access to metrics data.

The proposed cube has the following properties.

Dimensions:

  • publication object (represented by assocId + assocType)
  • time (day)
  • metric ("COUNTER", "OA-S", "ALM-Facebook", ...)

Aggregation hierarchies over these dimensions:

  • publication object: assocType
  • publication object: author
  • publication object: article -> issue -> journal (as aggregation hierarchy, not the objects themselves)
  • time: month -> year

Additional dimensions may have to be defined to cater to special administrative or front-end use cases, such as

  • geography: Where did the usage events originate from?
  • source file: This enables us to implement a scalable file based, restartable load process. Details of such a process are not in scope of this document.
  • ...

Facts and modeling rules:

  • Facts would be represented as a single integer or float data type dimensioned as just outlined.
  • Dimensions should be modeled additively so that we can use them in reports that "slice" and "dice" the conceptual data cube. This excludes "from-to" notation for the date.
  • Monthly data that is not available on a daily basis can be modeled additively in such a table by leaving the day-aggregation level empty.

This is pretty much a standard OLAP design and should therefore serve all aggregate data needs that may come up in reports or elsewhere.

While the conceptual level seems rather complicated, we can implement such a conceptual model with a single additional table in core OJS as all dimension tables either already exist in the database or could be implemented "virtually" by on-the-fly aggregation, e.g. for the date hierarchy.

We therefore propose a single new database table 'metrics' with the following columns:

  • assocId: foreign key to article, article or issue galley, supplementary file
  • assocType: determining the type of the assoc id (article, article or issue galley, supplementary file)
  • day: the lowest aggregate level of the time dimension
  • month (optional): required only if we want to support month-only aggregation, either because some metrics cannot be provided on a daily basis or to compress historical data
  • metricType: e.g. "OA-S-Counter", "OJS-Counter", "ALM-Facebook", etc.
  • loadId (optional): a plug-in specific load identifier used to implement a restartable and scalable load process, e.g. a file name, file id or run id
  • country (optional): the lowest aggregate level of the source geography dimension
  • metric: an integer or float column that represents the aggregate metric

The dimension columns (all except 'metric') should have a multi-column unique index placed on them to enforce data consistency.
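
A minimal sketch of such a table and of an on-the-fly aggregation over the time hierarchy, using Python's sqlite3 module as a stand-in for the OJS database layer. Column names, the 'YYYYMMDD' day format, and the numeric assocType value in the usage example are assumptions. The optional columns (month, loadId, country) are left out, partly because NULL values are treated as distinct in unique indexes and would therefore need extra care.

```python
import sqlite3
from typing import List, Tuple

SCHEMA = """
CREATE TABLE metrics (
    assoc_type  INTEGER NOT NULL,
    assoc_id    INTEGER NOT NULL,
    day         TEXT    NOT NULL,  -- 'YYYYMMDD'
    metric_type TEXT    NOT NULL,
    metric      INTEGER NOT NULL
);
CREATE UNIQUE INDEX metrics_dimensions
    ON metrics (assoc_type, assoc_id, day, metric_type);
"""


def create_metrics_table(conn: sqlite3.Connection) -> None:
    """Create the fact table and the unique index over all dimensions."""
    conn.executescript(SCHEMA)


def monthly_totals(conn: sqlite3.Connection,
                   metric_type: str) -> List[Tuple[int, int, str, int]]:
    """Aggregate daily facts to (assoc_type, assoc_id, 'YYYYMM', total)."""
    cursor = conn.execute(
        "SELECT assoc_type, assoc_id, substr(day, 1, 6), SUM(metric) "
        "FROM metrics WHERE metric_type = ? "
        "GROUP BY assoc_type, assoc_id, substr(day, 1, 6) "
        "ORDER BY 1, 2, 3",
        (metric_type,),
    )
    return cursor.fetchall()
```

The month level needs no table of its own here: it is derived from the day column at query time, which is what "virtual" dimension tables amount to in practice.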

The number of data items in the proposed table will be considerably smaller than what we currently need to store raw event data. I therefore do not believe that such a table will reach the underlying database's scalability limit. Should this happen nevertheless, we could still purge old metric data or reduce the granularity of historic data (e.g. from per-day to per-month).

For a simple estimate we assume the following maximum dimension cardinality:

  • assocId/assocType: 3000
  • day: 365 * 10 = 3650
  • metricType: 5
  • loadId is not included in the calculation as we assume that there will be at most a single load ID per day which makes loadId isomorphic to the day dimension
  • country is not included as we do not intend to implement this dimension for the time being

We further assume completely dense data (which gives us an upper bound way above the probable data distribution).

Under these pessimistic assumptions we get a maximum of 3000 * 3650 * 5 = 54.75 * 10^6 rows, which is comfortably within the range supported by current MySQL versions and probably far above what even large OJS installations will ever encounter. Compressing historical data from per-day to per-month granularity gives a theoretical compression ratio of 12:365 (a reduction of about 97%) for completely dense data. In practice, with non-dense data, compression will be lower but still considerable.
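
The estimate can be reproduced with a few lines of arithmetic. The sketch reads the compression step as a reduction from per-day to per-month granularity over the same ten years; the variable names are made up for illustration.

```python
ASSOC_OBJECTS = 3000  # assocId/assocType combinations
YEARS = 10
DAYS = 365 * YEARS    # cardinality of the day dimension
MONTHS = 12 * YEARS   # cardinality after compressing to months
METRIC_TYPES = 5

# Upper bound on the row count for completely dense daily data.
max_rows = ASSOC_OBJECTS * DAYS * METRIC_TYPES

# Row count if the same data were kept per month only.
monthly_rows = ASSOC_OBJECTS * MONTHS * METRIC_TYPES

# Fraction of rows saved by the per-month representation.
savings = 1 - monthly_rows / max_rows

print(max_rows, monthly_rows, round(savings, 3))
```

The savings fraction equals 1 - 12/365 regardless of the number of objects or metric types, because only the time dimension shrinks.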

Access to the metrics table would be mediated through a MetricsDAO. The MetricsDAO should provide methods to insert/update metrics and to administer scalable load processes. The MetricsDAO should not be used for granular metrics access except for accesses from statistics plugins that will route data through their getMetric() API. This allows us to support common front-end features with granular metric access for plug-ins like ALM even if those do not support access through the MetricsDAO.

Editors and Authors




Search Ranking


Statistics Reports


Most-Viewed Articles


Similar Articles