OJSdeStatisticsConcept

Overview

The OA-S Project

The German Open Access Statistics (OA-S) project intends to provide a framework and infrastructure that enables OJS users to establish alternative usage statistics for Open Access content and build value added services on top of such statistics.

This specification describes the use cases, requirements and implementation recommendations for the integration of OA-S into OJS.

The basic idea of OA-S can be described as a co-operation between two institutions, the data provider and the service provider:

  1. The data provider (in this case operators of OJS installations) tracks access to documents (in our case article and issue galleys, abstracts and supplementary files).
  2. Access data will then be made available through a protected OAI interface and harvested by the OA-S service provider.
  3. The service provider will clean the raw data and produce aggregate metrics based on the COUNTER standard. At a later stage, LogEc and IFABC statistics may also be supported.
  4. The data provider retrieves metrics from the service provider on a daily basis and stores them in OJS.

These metrics can then be used in different ways in OJS:

  • They can be displayed to the editors, authors and readers.
  • They can be used in search for ranking.
  • Editors could produce statistics reports.
  • We could provide a "most viewed article" feature.
  • We could implement a feature that displays "other (more viewed) articles of the same author".
  • We could display "similar (more viewed) articles" on an article's page.

We'll use the terms "data provider" and "service provider" from here on without further explanation. "Data provider" and "OJS user" can be used synonymously. We use the term "end user" to refer to users accessing an OJS site.

Requirements for OJS OA-S Integration

The requirements for OJS OA-S integration can be divided into four areas:

  1. log data extraction and storage (OJS)
  2. log transfer (OJS -> OA-S)
  3. metrics retrieval (OA-S -> OJS)
  4. value added services (OJS)

The requirements in these areas are as follows.

Log data extraction and storage:

  • We log usage events for access to issue and article galleys, article abstracts and supplementary files.
  • Logged data must be pseudonymized immediately.
  • Logged data must be deleted immediately after it has been successfully transferred to the service provider or after it expires.

Log transfer:

  • Log data must be transformed into context objects.
  • We then have to provide an HTTP Basic-authenticated OAI interface from which the service provider will harvest raw log data.

Metrics retrieval:

  • We retrieve final metrics data from the service provider via a JSON web service.
  • Metrics data will then be stored in an OLAP data model that allows easy access to the data, both granular and aggregated.

Value added services:

  • Display granular (per-object) metrics to the readers, authors and editors.
  • Use metrics as a search ranking criterion together with the Lucene plugin.
  • Define reports on OA-S metrics similarly to already existing OJS reports.
  • Implement a "most viewed articles" feature.
  • Implement an "another (more viewed) articles of the same author" feature.
  • Implement a feature "similar (more viewed) articles".

Data Extraction

OA-S expects us to deliver certain usage event data. OA-S provides sample code for DSpace log extraction; the same code is also available via SVN. We analyzed the specification as well as the sample code to define the following list of all required or optional data items. The proposed OJS/PHP data source for each item is given in parentheses (a sketch that assembles these items into a usage event record follows the list):

  • usage event timestamp (PHP's time() function +/- local time offset)
  • administration section
    • HTTP status code (will always be 200 in our case, as we won't produce view events for non-200 responses)
    • downloaded document size (difficult to determine from PHP; we may use connection_aborted() to identify incomplete downloads and set the download size to 0 in that case, otherwise to the actual document size; this would at least roughly emulate the original behavior)
    • actual document size (PKPFile::getFileSize() for galleys and supplementary files, I propose 0 in the case of article abstracts)
    • document format (MIME type, PKPFile::getFileType())
    • URI of the service (e.g. OJS journal URL, Config::getVar('general', 'base_url'))
  • referent section
    • document URL (This will be the canonical "best URL" produced by PKPRouter::url() + ...::getBest...Id() )
    • optional: one or more unique document IDs (e.g. DOI, URN, etc.) (can be easily retrieved from the PubId-plugins for objects where IDs are defined)
  • referring entity section
    • HTTP referrer (if available, $_SERVER['HTTP_REFERER'])
    • optional: additional identifiers of the referring entity (e.g. DOI, ...) if available (not implemented in OJS)
  • requester section
    • hashed + salted IP (PKPRequest::getRemoteAddr() + hashing)
    • hashed C class (PKPRequest::getRemoteAddr() + truncation + hashing)
    • hostname of the requesting entity (if available), truncated to the second level domain (recommendation: do not implement in OJS, as this would require one DNS request per view event, which would be very expensive)
    • optional: classification
      • internal: Usage events due to internal requirements, e.g. automated integrity checks, availability checks, etc.
      • administrative: Usage events that happen due to administrative decisions, e.g. for quality assurance. (proposal: use this category for accesses by logged in editors, section editors, authors, etc.)
      • institutional: Usage events triggered from within the institution running the service for which usage events are being collected.
    • optional: hashed session ID or session (recommendation: do not send)
    • HTTP user agent (if available, use PKPRequest::getUserAgent())
  • service type section (omitted, only relevant for link resolvers)
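
As an illustration, the following PHP sketch (not existing OJS code; the function name and array keys are assumptions) assembles these items into a single usage event record from the proposed data sources:

   // Sketch only: assemble a usage event record from the data sources proposed above.
   // Assumptions: $request is the current PKPRequest; $file is a PKPFile (galley or
   // supplementary file) or null for abstract views; $canonicalUrl is the "best URL";
   // $pubIds holds public identifiers (DOI, URN, ...) from the pub-id plug-ins.
   function buildUsageEvent($request, $file, $canonicalUrl, $pubIds = array()) {
       $documentSize = $file ? $file->getFileSize() : 0;
       return array(
           'time' => time(),                       // usage event timestamp
           'httpStatus' => 200,                    // we only log successful responses
           'downloadSize' => connection_aborted() ? 0 : $documentSize,
           'documentSize' => $documentSize,
           'mimeType' => $file ? $file->getFileType() : 'text/html', // abstract MIME type is an assumption
           'serviceUri' => Config::getVar('general', 'base_url'),
           'documentUrl' => $canonicalUrl,
           'identifiers' => $pubIds,
           'referrer' => isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : null,
           'ip' => $request->getRemoteAddr(),      // to be salted and hashed, see Privacy Protection
           'userAgent' => $request->getUserAgent()
       );
   }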


According to OA-S information, these data items will be used for the following purposes:

  • IP address and timestamp are used to recognize double downloads as defined by the COUNTER standard. Such "double clicks" will be counted as a single usage event.
  • The C class of the IP address will furthermore be used to recognize robots and exclude their usage from the statistics.
  • The file information (URL, name, document ID, URL parameters, etc.) is used to uniquely identify the document that has been accessed.
  • The HTTP status code will be used, as only successful accesses may be counted.
  • The size of the document is used to identify full downloads (e.g. 95% of the file downloaded). Partial or aborted downloads will not be counted as usage events.
  • The HTTP user agent will be used to identify robots and to remove their usage from the statistics.
  • The referrer information is used to analyze how users found the service and can be used to improve the service (potential sources: search engines, organizational web portal).

To capture the required information, I recommend implementing a specialized view event hook that all statistics plug-ins can subscribe to. This allows us to better standardize OJS view events (e.g. to simulate Apache log events as much as possible) and keeps code overhead to a minimum. If an additional OJS hook should be avoided, we can hook into the existing TemplateManager::display and FileManager::downloadFile hooks and filter these generic events to identify the view events we are interested in.

If we implement a statistics event hook, then I recommend that such a hook provide data items similar to the variables available in Apache's log component. This enables us to easily switch the statistics hook implementation later, e.g. to one based on Apache logs or on shared storage, as has been proposed.
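
As a rough illustration, a statistics plug-in could subscribe to these generic hooks in its register() method. The class and callback names below are illustrative assumptions, not existing OJS code:

   // Sketch only: a statistics plug-in hooking into the generic hooks named above.
   class OasStatisticsPlugin extends GenericPlugin {
       function register($category, $path) {
           $success = parent::register($category, $path);
           if ($success && $this->getEnabled()) {
               HookRegistry::register('TemplateManager::display', array($this, 'logTemplateDisplay'));
               HookRegistry::register('FileManager::downloadFile', array($this, 'logFileDownload'));
           }
           return $success;
       }
       // logTemplateDisplay() and logFileDownload() would filter the generic events
       // down to view events (galleys, abstracts, supplementary files), build the
       // usage event record (see Data Extraction) and write it to temporary storage.
   }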

Data Transformation

Privacy Protection

We assume that many OJS providers using the OA-S extension will be subject to German privacy law. While OJS users will have to evaluate their legal situation on a case-by-case basis and we cannot guarantee in any way that OJS conforms to all legal requirements in individual cases, we provide basic technical infrastructure that may make it easier for OJS users to comply with German privacy law.

Legal Requirements

The OA-S project commissioned two legal case studies with respect to German privacy law: one describes the legal context of OA-S application at the University of Stuttgart, the other focuses more generally on OA-S users, especially project members. The first report was produced during an earlier phase of the OA-S project, when privacy-enhancing measures, like the use of a salt for IP hashing, were not yet specified. The second report is more recent. It assumes an enhanced privacy infrastructure, i.e. the use of a salt to pseudonymize IP addresses. We therefore base our implementation recommendations on the results of the second report.

The report recommends that data providers subject to German privacy law implement the following infrastructure:

  • All personal data must be pseudonymized immediately (within a few minutes) after being stored. This can be achieved by hashing IP addresses with a secret salt. The salt must have a length of at least 128 bits and must be cryptographically secure. The salt must be renewed about once a month and may not be known to the OA-S service provider. The salt will be distributed through a central agent to all data providers. A single salt can be used for all data providers if they do not share pseudonymized data. Pseudonymized data must be transferred to the service provider and deleted by the data provider immediately thereafter, i.e. every five minutes (see the sketch after this list).
  • Data providers have to provide the means for end users to deny data collection ("opt-out"). The cited report concludes that an active "opt-in" by end users is not necessary if data is reliably pseudonymized. It recommends an "opt-out" button which, if clicked, could result in a temporary cookie being set in the end user's browser. Whenever such a cookie is present, usage data may not be collected. The report recommends against setting a permanent cookie, as this may now or in the future require active "opt-in" on the part of the end user. Alternatively, the user's IP address could be blacklisted while using the service, i.e. entered into a table so that no data from that IP is stored. The blacklist entry would have to be deleted after the user session expires.
  • Data providers have to inform end users about their right to opt out of data collection before they start using the service. They also have to inform the end user that opting out of the service will result in a temporary cookie being set in their browsers. This information must be available not only once when the user starts using the service but permanently, e.g. through a link.
  • Data providers will have to implement further organizational measures (registration of data processing organizations, reporting data usage to end users on demand).
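
A minimal PHP sketch of the pseudonymization and the opt-out check follows. It assumes the current salt is already available to the plug-in; salt distribution and rotation, as well as the cookie and function names, are assumptions:

   // Sketch only: salted hashing of the IP and of its (IPv4) C class.
   function pseudonymizeIp($ip, $salt) {
       $hashedIp = hash('sha256', $salt . $ip);
       $cClass = implode('.', array_slice(explode('.', $ip), 0, 3)); // first three octets
       $hashedCClass = hash('sha256', $salt . $cClass);
       return array($hashedIp, $hashedCClass);
   }

   // Sketch only: no usage data may be collected while the temporary opt-out
   // cookie is present; the cookie name is an assumption.
   function optedOut() {
       return isset($_COOKIE['oas_optout']);
   }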

Salt Management Interface

Data Storage

Statistics events have to be temporarily stored between the time they arrive and the time they are harvested by the service provider. For intermediate data storage I recommend a plug-in-specific internal database table that contains the fields mentioned in #Data Extraction above. Personal data (IP) must be stored in its pseudonymized form (see #Privacy Protection).

Due to privacy restrictions and to avoid scalability problems, we should delete log data as soon as it has been successfully transferred to the service provider. Unfortunately, OA-S has not yet specified a protocol that allows us to determine when data has been received by the service provider's server (Source: Email Julika, 30.10.2012). OA-S uses the OAI protocol to harvest statistics data. This protocol does not support success/failure messages by itself. Harvesters usually retry access when there was a communication failure. Although access to the OAI interface is authenticated, this means that we cannot delete data whenever we receive an OAI request. Instead, we have to define a fixed maximum time that log data may be kept in the log and then delete it, independently of its transfer status.

We therefore recommend saving log data indexed by its timestamp. Whenever we manipulate the log table (i.e. when receiving a log event), we automatically delete all expired log data.
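
A minimal sketch of this expiry logic, under the assumption of a hypothetical plug-in table and an ADOdb-style database connection:

   // Sketch only: purge expired rows whenever the log table is touched, e.g.
   // before inserting a new event. Table and column names are assumptions;
   // $db is assumed to be the ADOdb connection used by OJS DAOs.
   function purgeExpiredLogEntries($db, $expirySeconds) {
       $cutoff = time() - $expirySeconds;
       $db->Execute('DELETE FROM oas_event_log WHERE event_time < ?', array($cutoff));
   }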

As the log expiry time will be relatively short, we do not see the necessity to rotate the log table or otherwise improve scalability.

OAI Interface

NB: The specification and implementation of the OA-S OAI interface are not part of the project phase OA-S I. We collect unstructured material for use in later project phases.

  • OA-S provides a validator service for the data provider OAI interface.
  • We use HTTP Basic authentication to protect the OAI end point (see email Julika, 31.10.2012); a minimal sketch of such a check follows this list.
  • A protocol to confirm successful reception of raw data has not yet been implemented (see email Julika, 31.10.2012).
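
A minimal PHP sketch of the HTTP Basic check; it assumes PHP runs as an Apache module (so that PHP_AUTH_* is populated) and that the credentials come from the plug-in settings:

   // Sketch only: deny access to the OAI end point unless valid credentials were sent.
   function checkBasicAuth($expectedUser, $expectedPassword) {
       if (!isset($_SERVER['PHP_AUTH_USER'], $_SERVER['PHP_AUTH_PW'])
               || $_SERVER['PHP_AUTH_USER'] !== $expectedUser
               || $_SERVER['PHP_AUTH_PW'] !== $expectedPassword) {
           header('WWW-Authenticate: Basic realm="OA-S statistics OAI interface"');
           header('HTTP/1.0 401 Unauthorized');
           exit;
       }
   }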

Retrieving Metrics from the Service Provider

NB: The specification and implementation of the OA-S statistics retrieval interface are not part of the project phase OA-S I. We collect unstructured material for use in later project phases.

The JSON retrieved from OA-S looks something like this:

   {
     "from": "2012-10-01",
     "to": "2012-10-30",
     "entrydef": ["identifier", "date", "counter", "LogEc", "IFABC", "RobotsCounter", "RobotsLogEc", "RobotsIFABC"],
     "entries": [
       {"identifier": "oai:DeinDienst.de:00001", "date": "2012-10-10", "counter": 0, "LogEc": 0, "IFABC": 1, "RobotsCounter": 0, "RobotsLogEc": 0, "RobotsIFABC": 0},
       {"identifier": "oai:DeinDienst.de:00037", "date": "2012-10-11", "counter": 0, "LogEc": 0, "IFABC": 1, "RobotsCounter": 0, "RobotsLogEc": 0, "RobotsIFABC": 0},
       ...
     ]
   }

Source: Email Julika, 31.10.2012
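
A sketch of how these metrics could be retrieved and decoded on the OJS side; the end point URL, authentication and query parameter names are assumptions, as the interface is not yet finally specified:

   // Sketch only: retrieve and decode the daily metrics for a date range (YYYY-MM-DD).
   function fetchMetrics($endpointUrl, $from, $to) {
       $url = $endpointUrl . '?from=' . urlencode($from) . '&to=' . urlencode($to);
       $json = @file_get_contents($url); // authentication omitted; cURL could be used instead
       if ($json === false) return null;
       $response = json_decode($json, true);
       if (!is_array($response) || !isset($response['entries'])) return null;
       // Each entry contains one value per metric type (counter, LogEc, IFABC, ...)
       // per identifier and date, ready to be stored in the OLAP data model.
       return $response['entries'];
   }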

API for Value Added Services