PKP Statistics Framework

From PKP Wiki
Revision as of 16:58, 4 December 2013 by Jmacgreg (Talk | contribs)

Summary

With the release of 2.4.3, Open Journal Systems features a new structure for collecting statistics and generating reports. Using "metrics" (a set of rules for how system usage is measured), users can set specific criteria for statistics gathering. Prior to this change, different plugins may have collected statistics differently. Some might have counted hits from bots while others might not have. Thanks to configurable "metrics", statistics plugins can now report the stats you need using the exact same selection and filtering criteria.

Thanks to this change, it is now easier to reuse structures to read log files, store statistics, and generate reports.

This documentation covers the following topics:

  • Migrating from the previous version to the new metric-driven statistics structure.
  • Specific setup information for system administrators.
  • Generating reports using either the legacy statistics (the previous style) or the new metric.
  • Generating custom reports.
  • Aggregating statistics by a chosen unit, such as region, city, day or month.
  • Filtering reports.
  • Understanding generated reports.
  • Logging statistics.
  • Usage examples.

If you regularly use statistics reporting in OJS, this document will help you migrate to the new system and show you how to generate relevant statistics with variable levels of granularity.

Note for Upgraders

Depending on how you currently use OJS statistics, the changes to the statistics framework may require some adjustments when you upgrade to 2.4.3. If you are installing OJS fresh and starting from scratch, you can simply use the software as intended. The same applies if you are not concerned with migrating old statistics or are not currently using the existing statistics package; you should not encounter any issues.

However, users who need their statistics from before the update will have to carefully read the sections in this document on statistics migration. Since statistics will now be counted differently, these steps are essential for processing old logs.

Technical details

The technical details are not crucial for normal operation in OJS. This document should be enough to help manage usage statistics. However, if further technical documentation is needed, please access OJSdeStatisticsConcept.

Metrics

The most important new concept is a “metric.” It can be understood as a set of rules that determine how system usage is measured. Using the previous set of plugins to manage statistics, separate plugins might have had different rules to determine which accesses should be logged as valid article views. For example, the previous Timed Views plugin did not filter bots whereas the COUNTER plugin did. For that reason, even if both plugins were counting the same events (article views, for example) they would present different results. In the new structure, this corresponds to each plugin having a different metric.
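As a rough illustration of why two rule sets give different totals for the same events, here is a minimal Python sketch. It is not OJS code (OJS implements metrics in PHP report plugins); the event records, the short bot marker list and both counting functions are hypothetical.

```python
# Illustrative only: hypothetical access events for one article,
# given as (ip address, user agent) pairs.
EVENTS = [
    ("10.0.0.1", "Mozilla/5.0"),
    ("10.0.0.2", "Googlebot/2.1"),
    ("10.0.0.3", "Mozilla/5.0"),
]

# Assumed stand-in for a robots list; a real list is far longer.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_bot(user_agent):
    """Return True when the user agent contains a known bot marker."""
    agent = user_agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)


def count_all(events):
    """Rule set A: every access counts as a view (no bot filtering)."""
    return len(events)


def count_without_bots(events):
    """Rule set B: accesses from bots are discarded before counting."""
    return sum(1 for _, agent in events if not is_bot(agent))


print(count_all(EVENTS))           # 3
print(count_without_bots(EVENTS))  # 2
```

Both functions count the same kind of event, yet they disagree on the total. That disagreement is what the text above describes as two different metrics.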

OJS can now have any number of metrics. Any report plugin can implement a specific metric and serve the system with statistics. All calls for statistics inside the system now depend on which metric the system is using. So, if a journal has more than one plugin implementing different metrics, its managers can choose on the statistics settings page which metric is used to present statistics to users, on both public and private pages.

You can still access the statistics of other metrics without changing the main metric setting by using the statistics report page (manager/statistics). This is covered in more detail under Report Generation.

Default metric (OJS/COUNTER)

OJS 2.4.3 ships with only one implemented metric; that is, only one report plugin, UsageStatsReportPlugin, implements a metric. This plugin follows the processing rules of the COUNTER project (filtering out bots, ignoring double clicks, etc.), so the metric it implements is called OJS/COUNTER and is identified in the system as ojs::counter. The plugin collects statistics for the following public objects: journal, issue, article and galley. It implements time and geolocation dimensions, so reports can be generated by month, year, city, region and/or country.

Statistics migration

The current OJS upgrade process will migrate statistics data from the Timed Views plugin, the Counter plugin, and the built-in OJS view counts from previous versions. Because each of those sources took a different approach to collecting statistics, they cannot be merged; therefore each will be migrated as a different metric:

Previous data source    Migrated metric name
Counter plugin          ojs::legacyCounterPlugin
Timed Views plugin      ojs::timedViews
OJS views               ojs::legacyDefault

The only data that the migration process will adapt are the statistics from the Timed Views plugin. Originally this plugin was not filtering bots; for data compatibility, the migration process will delete all entries that come from bots identified using the Counter robots list.

Because those old sources do not continue collecting statistics after the upgrade (the only source of statistics inside the default OJS 2.4.3 installation is the UsageStatsReportPlugin, which implements the ojs::counter metric) the metrics that they implement don’t appear inside the main metric setting. They are only intended for backwards compatibility. However, as noted below, the new reporting tools are able to generate reports from both old and new data sources.

Requirements (for site administrators)

Geolocation database

To run the upgrade process and to correctly migrate the old statistics data, you will have to ask your site administrator to download the geolocation database, following these steps:

Linux

  1. open a shell prompt
  2. go into the OJS installation’s base directory
  3. wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
  4. gunzip GeoLiteCity.dat.gz
  5. mv GeoLiteCity.dat plugins/generic/usageStats

Windows

  1. download the file GeoLiteCity.dat.gz
  2. decompress it with any decompression tool into the plugins/generic/usageStats directory

In both cases the complete path to the installed database file should be plugins/generic/usageStats/GeoLiteCity.dat
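For administrators who prefer a script on either platform, the decompression step could be sketched in Python as follows. This is only an illustration: the function name and its arguments are hypothetical, and it assumes the archive has already been downloaded.

```python
import gzip
import shutil
from pathlib import Path


def install_geo_database(archive_path, ojs_base_dir):
    """Decompress a downloaded GeoLiteCity.dat.gz into the usageStats plugin directory.

    Both arguments are supplied by the caller; nothing is downloaded here.
    """
    plugin_dir = Path(ojs_base_dir) / "plugins" / "generic" / "usageStats"
    if not plugin_dir.is_dir():
        # Refuse to guess: the plugin directory must already exist.
        raise FileNotFoundError(f"plugin directory not found: {plugin_dir}")
    target = plugin_dir / "GeoLiteCity.dat"
    with gzip.open(archive_path, "rb") as source, open(target, "wb") as output:
        shutil.copyfileobj(source, output)
    return target
```

Calling `install_geo_database("GeoLiteCity.dat.gz", ".")` from the OJS base directory would leave the database at plugins/generic/usageStats/GeoLiteCity.dat, the path required above.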

Execution time

If you have a very large set of Timed Views entries, the migration may take longer than a browser-initiated process is allowed to run. To avoid an incomplete upgrade in that case, run the upgrade with the command-line upgrade tool instead of through the browser.

After migration

When the upgrade is complete, OJS will begin logging usage events into log files. If you decide to use Apache log files, you can turn this logging off (see Statistics Logging, below).

The only difference for editors and authors will be the views column shown for galleys in the editorial process. Because statistics are now unified, the system presents statistics everywhere using the default metric. As noted above, the default metric is ojs::counter, but until you start processing event log files (see Processing Log Files, below) you will have no statistics for that metric, so the views column will show 0. (You can also process old Apache log files if you want to bring historical statistics into the new metric; see below.)

Please note that this does not mean that the old galley view data is not available anymore. The data is saved and still available for reports under the metric ojs::legacyDefault. Read the next section for more details on how to generate reports using both old and new statistics data.

Report Generation

Once the migration process is finished, you can generate reports using any metric. Each old OJS statistics source is now implemented as a report plugin. This way, it is possible to access them in the Report Generator section of the Stats & Reports page, under Journal Management.

All three plugins can work with both legacy and new statistics data sources. This means that you can generate the same reports that you could before migration, with the same results. And you can also generate reports from new statistics data, collected by the new usage statistics plugin.

Please note that the new default statistics metric will not produce the same results that each of the old sources would have produced if they were still collecting statistics. The usage statistics plugin now applies different rules when retrieving usage events, in order to be more consistent with the COUNTER project. One example is bot filtering: previously this was not applied consistently to all statistics, and where it was applied it did not use the official COUNTER robots list. Another difference is the handling of double clicks, which uses different time windows for file downloads and normal page views.

The OJS "views" report is the only one that cannot generate reports from new statistics data through the interface. The following URL must be used instead:

manager/report/ViewReportPlugin?metricType=ojs::counter

That is because this report and the new usage statistics report are very similar. For statistics collected since the migration, users can use the new OJS usage statistics report, in the Report Generator area of the Stats & Reports page, under Journal Management.

There, users will find a similar format to the old view report but with more information, like geolocalization and date. Additionally, galley objects will have their own rows. Please note that you will get no data using the OJS usage statistics report if you don’t process any log files (see Processing Log Files).
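If you need the views report URL mentioned above in a script, a minimal Python sketch might build it like this. The journal base URL in the usage note is a made-up placeholder and the helper name is hypothetical; only the path and the metricType parameter come from this document.

```python
from urllib.parse import urlencode


def view_report_url(journal_base_url, metric_type="ojs::counter"):
    """Build the views report URL for the given metric type."""
    # Keep ":" unescaped so the metric name stays readable in the URL.
    query = urlencode({"metricType": metric_type}, safe=":")
    base = journal_base_url.rstrip("/")
    return f"{base}/manager/report/ViewReportPlugin?{query}"
```

For example, `view_report_url("https://journal.example.org/index.php/demo")` returns `https://journal.example.org/index.php/demo/manager/report/ViewReportPlugin?metricType=ojs::counter`.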

Custom Reports

Managers can now generate custom reports, selecting columns, filters and ordering. This can be done via the report generator user interface, accessed by clicking on the “Generate Custom Report” link, in the Report Generator part of the Stats & Reports page, under Journal Management.

After defining the columns, filters and order-by settings, submit the form by clicking “Generate Custom Report”. A URL will appear above the submit button. You can use this URL to generate the same report again without having to choose the same settings.

To filter report data by relative times, such as the current day or month, check the “Today” or “Current Month” checkbox in the “by time” filter section. A useful tip is to set up the report the way you like, copy the URL, and store it in your browser’s bookmarks for quick access.

The following section describes each of the report generator settings.

Default Report Templates

This option allows Journal Managers to select a set of predefined settings to generate a report. There are five default reports:

Report name                         Purpose
Article file downloads              Shows the number of article file downloads (HTML, PDF and other formats combined).
Article abstract page views         Shows the number of article abstract page views only.
Issue file downloads                Shows the number of issue file downloads (HTML, PDF and other formats combined).
Issue table of content page views   Shows the number of issue table of contents page views only.
Journal main page views             Shows the number of journal main page views only.

Aggregate Stats By

You can define the aggregation level of the statistics. The default for each report template is to aggregate statistics by country and month. So all statistics for the same published object will be grouped by month and country. You can define a lower level of aggregation, like region, city and/or day. The lower the aggregation level is, the more rows the same published objects will likely have. Let’s consider this example:

Type     Article title   Country   Month    Count
Galley   Article 1       Canada    201311   10

This report shows that in November 2013, the files of Article 1 were downloaded 10 times by users from Canada. As you can see, the aggregation level here is country and month.

Now consider this other report:

Type     Article title   Country   Month    Day        Count
Galley   Article 1       Canada    201311   20131105   5
Galley   Article 1       Canada    201311   20131110   5

In this report there are two rows for the same published object (the article). The difference between the two reports is the Day column, which lets us see the article file downloads for each day of the month. That is the day aggregation level.
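The relationship between the two example reports can be reproduced with a short Python sketch. It is a simplified model rather than the OJS implementation: the record layout and the function name are assumptions, and the data is taken from the example tables.

```python
from collections import Counter

# Daily records from the example: (type, title, country, day as YYYYMMDD, count).
RECORDS = [
    ("Galley", "Article 1", "Canada", "20131105", 5),
    ("Galley", "Article 1", "Canada", "20131110", 5),
]


def aggregate(records, by_day):
    """Sum counts per (type, title, country, month), adding the day when by_day is True."""
    totals = Counter()
    for obj_type, title, country, day, count in records:
        key = (obj_type, title, country, day[:6])  # first six digits are YYYYMM
        if by_day:
            key += (day,)
        totals[key] += count
    return dict(totals)


print(aggregate(RECORDS, by_day=False))  # one row, count 10
print(aggregate(RECORDS, by_day=True))   # two rows, count 5 each
```

Dropping the day from the grouping key merges the two daily rows into the single monthly row of 10 downloads.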

Select Report Range

This option allows you to define the report time range. The options are:

Time range option   Description
Today               Shows only statistics for the day on which the report is generated. If you save a report URL for later use, the report will cover the day on which you open the URL, not the day on which you first generated it.
Current month       Same as the Today option, but shows statistics for the current month.
Range by day        Define start and end dates (year, month and day).
Range by month      Define start and end months (year and month).

None of these settings affects the aggregation level, only the report range. For example, you can generate a report for the current month at the day aggregation level.
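The relative options (Today and Current month) are resolved when the report is run, not when its URL is saved. A minimal Python sketch of that behaviour follows; the option names and the function are hypothetical, and treating the month as running to its last day is an assumption.

```python
from datetime import date, timedelta


def resolve_range(option, run_date):
    """Return (start, end) dates for a relative range, evaluated on run_date."""
    if option == "today":
        return run_date, run_date
    if option == "current_month":
        start = run_date.replace(day=1)
        # Jump safely into the next month, snap to its first day, step back one day.
        next_month = (start + timedelta(days=32)).replace(day=1)
        return start, next_month - timedelta(days=1)
    raise ValueError(f"unknown range option: {option}")
```

Running the same saved report on 5 November 2013 and again on 10 December 2013 would therefore cover different periods, because `run_date` differs.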

Public Object Types

All statistics in OJS are now associated with a type of published object. Each object type has its own usage event that triggers the collection of a statistic, as described in the following table.

Object type       Usage event
Journal           Journal index page view.
Issue             Issue table of contents page view.
Issue galleys     Issue file downloads.
Article           Article abstract page view.
Article galleys   Article file downloads.

The default report templates in the report generator form always include the object type column, so you can check what the values in the count column mean.

Advanced Options

These options, together with the time range, really define the reports. What the default template options do is simply preselect options here. If you select a default report template and change any option here, you will change the report, regardless of whether you selected a report template.

Columns

Define the columns that will be used to build the report. The columns you select here not only define which data will be presented inside the reports, but also the aggregation level of the statistics. For example, if ID, Type and Month columns are chosen, the report will sum all views from an object from one month and will present that data in one row. If you choose Day instead of Month, each object will have a row for each day with statistics.

Filters

Define the data used to filter statistics for the report. Statistics can be generated only for the articles in a specific issue, or for a specific article. They can also be restricted to certain kinds of published objects (articles and galleys, for example). It is also possible to filter by geolocation, selecting the country, region or even city for which you want to generate statistics.

Order by

You can define which columns are used to order the statistics in the report. Note that even if you choose a column that you did not select for display (in the Columns section), the report will still be ordered by that column. Several columns can be defined: if rows have equal values in the first column, the second one is used to order them, and so on.
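Ordering by several columns behaves like sorting on a tuple of values. The following Python sketch shows the idea with made-up rows; the column names and the function are hypothetical, and only ascending order is modelled.

```python
# Hypothetical report rows.
ROWS = [
    {"country": "Canada", "month": "201311", "count": 10},
    {"country": "Brazil", "month": "201311", "count": 4},
    {"country": "Canada", "month": "201310", "count": 7},
]


def order_rows(rows, order_by):
    """Sort rows by several columns; later columns break ties in earlier ones."""
    return sorted(rows, key=lambda row: tuple(row[column] for column in order_by))


for row in order_rows(ROWS, ["country", "month"]):
    print(row)
```

Here the two Canada rows tie on the first column, so the month column decides their relative order.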

Report templates explanation

Now that we better understand the concepts behind reports, let’s take the default report templates as examples to see which settings they define (columns, filters) and why those settings produce the expected result.

Report*   Columns                                Filters
1         Type, Article, Issue, Country, Month   Object type equals Galley
2         Type, Article, Issue, Country, Month   Object type equals Article
3         Type, Issue, Country, Month            Object type equals Issue Galley
4         Type, Issue, Country, Month            Object type equals Issue
5         Type, Journal, Country, Month          Object type equals Journal

* The report numbers follow the order of the table under Default Report Templates; see that table for the description of each report.

The most important setting is the filter. For example, without the filter that restricts report 1 to article galleys, the report would also present statistics for all the other published objects. The columns define the aggregation level, as mentioned above, and they are adjustable. Here is an example:

Type     Article title   Issue     Country   Month    Count
Galley   Article 1       Issue 1   Canada    201311   10

This is a result of default report 1. Let’s add the File Type column (to do that, select the article file downloads report template, click on advanced options, then hold the shift key and click on File Type inside the columns select box).

Type     Article title   Issue     Country   Month    File Type   Count
Galley   Article 1       Issue 1   Canada    201311   PDF         5
Galley   Article 1       Issue 1   Canada    201311   HTML        5

Now we have a lower aggregation level, which also shows the file type for the monthly article file downloads. The main objective of the report is the same, but the aggregation level is different.

We can also add the ID column. If you add it instead of the file type, you should get something like this:

ID   Title       Type     Article title   Issue     Country   Month    Count
1    File.pdf    Galley   Article 1       Issue 1   Canada    201311   3
2    File2.pdf   Galley   Article 1       Issue 1   Canada    201311   2
3    File.html   Galley   Article 1       Issue 1   Canada    201311   5

The aggregation level is now very low: the object ID level. In this example, each galley has its own row, even when galleys belong to the same article and have the same file type.

In general, for those default report templates, changing the columns preserves the purpose of the report. Column adjustments only switch between higher and lower aggregation levels. Changing the object type filter, however, will change the purpose of the report.

At the same time, you can use other filters, such as the geolocation or context filters, to make a report more specific.
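The different roles of columns and filters can be sketched in a few lines of Python. This is a simplified model, not the OJS query code: the record fields, the function name and the extra Article record are assumptions, while the galley counts mirror the example tables above.

```python
from collections import Counter

# Hypothetical stored statistics, one dict per record.
RECORDS = [
    {"id": 1, "type": "Galley", "file_type": "PDF", "month": "201311", "count": 3},
    {"id": 2, "type": "Galley", "file_type": "PDF", "month": "201311", "count": 2},
    {"id": 3, "type": "Galley", "file_type": "HTML", "month": "201311", "count": 5},
    {"id": 7, "type": "Article", "file_type": None, "month": "201311", "count": 8},
]


def build_report(records, columns, filters):
    """Keep records matching every filter, then sum counts per selected columns."""
    totals = Counter()
    for record in records:
        if all(record[name] == value for name, value in filters.items()):
            totals[tuple(record[name] for name in columns)] += record["count"]
    return dict(totals)


galleys_only = {"type": "Galley"}
print(build_report(RECORDS, ["type", "month"], galleys_only))
print(build_report(RECORDS, ["type", "month", "file_type"], galleys_only))
print(build_report(RECORDS, ["type", "month"], {}))
```

Adding `file_type` to the columns splits the 10 galley downloads into two rows of 5 without changing what is being measured. Removing the filter brings abstract views into the report, which changes its purpose.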

Statistics Logging

After installing or upgrading OJS, the system will already be logging accesses for all installed journals into log files, inside the OJS files directory, in usageStats/usageEventLogs. There will be one file for each day. Not all system requests are logged into those files, only the ones that matter for usage statistics (submission views, file downloads, etc).

If you don’t want to log accesses into files, you can turn this option off in the generic Usage Statistics plugin. Go to the Plugins management page, open the Generic Plugins category and find the Usage Statistics plugin. Click on settings to open the plugin’s settings page (manager/plugin/generic/usagestatsplugin/settings). There you will find the “Create log files” checkbox; uncheck it and save.

External access log files alternative

The new usage statistics plugin can also read external log files and retrieve usage events from them in order to collect usage statistics. You can use your Apache access logs, for example. The default expected format for external access log files is the Apache combined log format.

If you wish to use log files in another format, you will have to create a regular expression that can parse the log entries. See the plugin settings page for more information.
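To give an idea of what such a regular expression involves, here is a Python sketch that parses one entry in the Apache combined format. It is not the expression OJS uses and its syntax may differ from what the plugin settings page expects; the group names and the sample line are illustrative assumptions.

```python
import re

# Apache combined format:
# host ident user [time] "request" status bytes "referer" "user agent"
COMBINED_LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_log_line(line):
    """Return the fields of one combined-format entry, or None if it does not match."""
    match = COMBINED_LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Made-up sample entry.
SAMPLE = (
    '192.0.2.10 - - [05/Nov/2013:10:15:32 -0500] '
    '"GET /index.php/demo/article/view/1 HTTP/1.1" 200 5123 '
    '"-" "Mozilla/5.0"'
)

print(parse_log_line(SAMPLE))
```

A log file in another format would need a different pattern that still captures the same pieces of information, such as the requester, the time, the requested URL and the user agent.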

The advantage of sharing logs with the web server is that logs are kept in one place, and the existing operating system policies for log file management do not need to be extended to cover additional OJS logs. In that case you can turn off the “Create log files” option of the usage statistics plugin so that logging data is not duplicated (see the Statistics Logging section).

Processing Log Files

Basic (default)

By default, OJS automatically starts logging usage events into its own access log files. It also automatically processes those files on a daily basis. So, for example, if you installed OJS today at 2:00 pm, the processing of today’s usage events would start tomorrow at 2:00 pm. If you don’t change any settings, this happens without any input from you.

Advanced

If you want full control over statistics processing, go to the Plugins management page, open the Generic Plugins category and find the Acron plugin. If you disable it, automatic processing will stop. The following steps explain how to trigger the processing manually.

OJS needs to process the files to be able to retrieve statistics data from them. This process can be done using the File Loader task tool which comes with the Usage Statistics plugin.

File loader

The file loader task implements a process to reliably handle all file processing. It works with four folders: stage, processing, archive and reject. Every time the file loader is run, it searches for files inside the stage folder. If there are any, it moves one file to the processing folder and starts working on it. If anything goes wrong, it sends an email to the system administrator with information about what went wrong and moves the file to the reject folder. If the processing goes well, it moves the file to the archive folder. It continues this process until there are no more files inside the stage folder.
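The folder workflow described above can be modelled in Python as a small loop. This is a sketch of the idea only, not the OJS file loader (which is written in PHP): the function name is hypothetical, `process_file` stands for whatever parses a log file, and `notify_admin` stands in for the email to the system administrator.

```python
import shutil
from pathlib import Path

FOLDERS = ("stage", "processing", "archive", "reject")


def run_file_loader(base_dir, process_file, notify_admin):
    """Process staged files one at a time, archiving or rejecting each."""
    base = Path(base_dir)
    for folder in FOLDERS:
        (base / folder).mkdir(parents=True, exist_ok=True)

    while True:
        # Re-list the stage folder on every pass so files added mid-run are seen.
        staged = sorted(p for p in (base / "stage").iterdir() if p.is_file())
        if not staged:
            break
        current = base / "processing" / staged[0].name
        shutil.move(str(staged[0]), str(current))
        try:
            process_file(current)
        except Exception as error:
            # Stand-in for the email sent to the system administrator.
            notify_admin(f"{current.name}: {error}")
            shutil.move(str(current), str(base / "reject" / current.name))
        else:
            shutil.move(str(current), str(base / "archive" / current.name))
```

Moving each file into the processing folder before working on it means a file is never picked up twice in the same run, and every file ends up in exactly one of the archive or reject folders.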

Processes

The file loader is a scheduled task tool, which can be invoked periodically using cron or another system task management tool. The command to run this task depends on which process you want to use:

File loader task processes:

#   Process                                        Command
1   OJS log files                                  php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasks.xml
2   OJS log files with automated staging process   php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasksAutoStage.xml
3   External log files                             php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasksExternalLogFiles.xml

These commands should be executed from the OJS installation base directory.

Processes 1 and 3 require files to be moved into the stage folder (regardless of whether they are Apache log files or OJS log files). Process 2 automatically moves OJS log files into the stage folder, so the command only needs to be executed.

Everything else is handled by the file loader task. If this task isn't scheduled for automatic execution using cron or a similar mechanism, it will have to be run manually.

Note that whichever process you choose, you can move files into the stage folder at any time, even while the scheduled task is running, and you can move any number of files at once. How often you move files into the stage folder depends mainly on how up to date you need your statistics to be.

The only thing that is not recommended is moving a log file that the system (OJS or Apache) is still using to log accesses. OJS already takes care of that for process 2, but for processes 1 and 3 you have to be careful. It is not recommended because a file staged before it is complete has to be staged and processed again later, which wastes processing time. You can avoid this by paying attention to the Apache log file rotation (moving only files that have already been rotated) or, if you use OJS log files, by not moving the file whose filename contains the current day.

If you really need up-to-date statistics for the current day, you can copy the file (instead of moving it), so the system can continue logging accesses into the original while you process what has already been logged. The system can handle the reprocessing of files (see Reprocessing log files, below), as long as the filename is unchanged and the final copied version of the file contains all the logged accesses.

Usage examples

Let’s imagine that we have the following scenarios, all three with a cron job calling the file loader task tool on a daily basis:

  1. A Journal Manager (JM) uses Apache log files that rotate each week; the log files are named using the day on which the file rotation occurs (e.g. 20130929-access.log); the JM uses file loader task process 3.
  2. A JM uses OJS log files that rotate each day, with filenames containing the day on which they were used to log accesses; the JM uses file loader task process 1.
  3. Same as scenario 2, but the JM uses file loader task process 2.

In scenario 1, if the JM can’t configure Apache to rotate log files more frequently, they can copy the 20130929-access.log file to the usageStats/stage directory every time they want new statistics to be processed. Each time, the system will delete all data previously processed from that file and reprocess the whole file. That is why this process is not as efficient as a shorter log rotation period. On 20131006 the log rotation will occur, and the JM can finally move or copy the 20130929-access.log file to the stage directory so that the accesses logged between the last staging of the file and the log rotation are also processed.

In scenario 2, the JM can move one file per day, always moving the file from the previous day. On Monday they can move two files at once (from Saturday and Sunday). The delay between an access and the availability of its statistics is short (only one day, for most of the week), and no log entries are reprocessed.

In scenario 3, the JM does not have to move any files. Each time the file loader task is executed, the system automatically takes the log files that have not been processed yet, except the one for the current day, stages them and processes each one.

In scenarios 1 and 2, if for some reason the JM can’t move files for a whole week, they can move all of them at once at the next opportunity. The scheduled task will then process them one by one until the stage directory is empty again.
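The selection rule behind scenario 3 (stage everything not yet processed, except the file still in use today) can be sketched in Python. The function name and the example filenames are hypothetical; the only property taken from this document is that each OJS log filename contains its day.

```python
from datetime import date


def files_to_stage(log_filenames, processed, today):
    """Pick log files that are neither already processed nor still in use today."""
    current_day = today.strftime("%Y%m%d")
    return sorted(
        name
        for name in log_filenames
        if name not in processed and current_day not in name
    )


# Hypothetical filenames: only the embedded YYYYMMDD day matters here.
logs = [
    "usage_events_20131103.log",
    "usage_events_20131104.log",
    "usage_events_20131105.log",
]
print(files_to_stage(logs, {"usage_events_20131103.log"}, date(2013, 11, 5)))
```

A JM following scenario 2 by hand applies the same rule when moving only the file from the previous day.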

Reprocessing log files

If you need to reprocess a log file (because a new robot list or geolocation database was introduced, or because you fixed problems in rejected files), you can move it back to the stage folder, even if it is in the reject or archive folder. It will be processed again, and any existing data from that log will be replaced by the reprocessed data.

The important thing is that the filename is preserved after a file is successfully processed and moved to the archive folder. The filename is used to link statistics data to the file it was retrieved from. For the same reason, it’s important that all log files have unique filenames. You don’t have to worry about this if you are using OJS log files, but if you are using Apache access log files, make sure the filenames contain something unique, such as dates.
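Why the filename matters can be seen in a toy Python model in which statistics are stored per source filename. The class and its methods are hypothetical and much simpler than the real storage; they only illustrate the replace-on-reprocess behaviour described above.

```python
class StatsStore:
    """Toy in-memory store that tracks counts by source log filename."""

    def __init__(self):
        self._counts_by_file = {}

    def load(self, filename, counts):
        """Store counts for a file, replacing anything loaded from it before."""
        self._counts_by_file[filename] = list(counts)

    def total(self):
        """Sum all counts across every loaded file."""
        return sum(sum(counts) for counts in self._counts_by_file.values())


store = StatsStore()
store.load("20130929-access.log", [1, 1])     # first, partial copy of the file
store.load("20130929-access.log", [1, 1, 1])  # reprocessed, complete copy
print(store.total())  # 3, not 5: the earlier data was replaced
```

In this model, two different log files sharing one filename would overwrite each other's data, which is why unique filenames are required.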