Difference between revisions of "PKP Statistics Framework"
Revision as of 19:28, 26 November 2013
- 1 The framework
- 2 Statistics migration
- 3 Report Generation
- 4 Statistics Logging
- 5 Processing Log Files
- 6 Reprocessing log files
The new statistics structure implemented by OJS 2.4.3 is more general and can serve many ways of collecting stats and generating reports. For example, it’s easier to reuse structures to read log files, store statistics and generate reports.
The technical details are not important for normal operation, and this document should be enough to help you manage statistics in OJS. If you want to learn more, see OJSdeStatisticsConcept.
The most important new concept is a “metric.” It can be understood as a set of rules that determine how we measure system usage. Using the previous set of plugins to manage statistics, the Counter plugin had different rules to determine which accesses should be logged as valid article views than the Timed Views plugin; for example, the Timed Views plugin did not filter bots. For that reason, even if both plugins were counting the same events (article views, for example) they would present different results. In the new structure, this corresponds to each plugin having a different metric.
OJS can now have any number of metrics. Any report plugin can implement one, and serve the system with statistics. All calls for statistics inside the system are now dependent on which metric the system is using. So, if you have more than one plugin implementing different metrics, you can choose in the statistics settings page which metric will be used to present statistics to users, both in public and private pages.
You can still access statistics for other metrics without changing the main metric setting, using the statistics report page (manager/statistics). We will see more about this in Report Generation.
Default metric (OJS/COUNTER)
OJS 2.4.3 ships with only one implemented metric. That means that only one report plugin implements a metric, specifically UsageStatsReportPlugin. This plugin follows all processing rules (avoiding bots, double clicking, etc) from the COUNTER project; therefore the metric that it implements is called OJS/COUNTER. This plugin collects statistics from the following public objects: journal, issue, article and galley. It implements time and geo localization dimensions, so reports can be generated by month, year, city, region and/or country.
The current OJS upgrade process will migrate statistics data from the Timed Views plugin, the Counter plugin, and the built-in OJS view counts from previous versions. Because each of those sources took a different approach to collecting statistics, we can’t merge them; therefore each will be migrated as a different metric:
|Previous data source||Migrated metric name|
|Timed Views plugin||ojs::timedViews|
The only data that the migration process will adapt are the statistics from the Timed Views plugin. Originally this plugin was not filtering bots; for data compatibility, the migration process will delete all entries that come from bots identified using the Counter robots list.
Because those old sources do not continue collecting statistics after the upgrade (the only source of statistics inside the default OJS 2.4.3 installation is the UsageStatsReportPlugin, which implements the ojs::counter metric) the metrics that they implement don’t appear inside the main metric setting. They are only intended for backwards compatibility. However, as noted below, the new reporting tools are able to generate reports from both old and new data sources.
Requirements (for site administrators)
Geo location database
To run the upgrade process and to correctly migrate the old statistics data, you will have to ask your site administrator to download the geo location database, following these steps:
Using a shell prompt:
- open a shell prompt
- go into the OJS installation’s base directory
- wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
- gunzip GeoLiteCity.dat.gz
- mv GeoLiteCity.dat plugins/generic/usageStats
Or manually:
- download the file GeoLiteCity.dat.gz
- decompress it using any decompression tool into the plugins/generic/usageStats directory
In both cases the complete path to the installed database file should be plugins/generic/usageStats/GeoLiteCity.dat
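Before starting the upgrade, it can help to confirm that the database file really is where it is expected. The following is a minimal sketch; `check_geo_db` is a hypothetical helper name, not something shipped with OJS, and it only checks that the file exists and is not empty.

```shell
# Sketch only: check_geo_db is a hypothetical helper, not part of OJS.
# $1 = OJS installation's base directory
check_geo_db() {
  db="$1/plugins/generic/usageStats/GeoLiteCity.dat"
  if [ -s "$db" ]; then
    # -s is true when the file exists and has a size greater than zero
    echo "found: $db"
  else
    echo "missing or empty: $db" >&2
    return 1
  fi
}
```

For example, `check_geo_db .` can be run from the OJS installation’s base directory.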
If you have a very large set of Timed Views entries, the migration may take longer than a process started from the browser is allowed to run. To avoid an incomplete upgrade in that case, use the command-line upgrade tool instead.
When the upgrade is complete, OJS will begin logging usage events into log files. If you decide to use Apache log files, you can turn this logging off (see OJS Statistics Logging, below).
For editors and authors, the only visible difference is in the views column shown for galleys in the editorial process. Because statistics are now unified, the system presents statistics everywhere using the default metric. As noted above, the default metric is ojs::counter, but unless you start to process event log files (see Processing Log Files, below) you will have no statistics for that metric. Therefore the views column will show 0 until some statistics are retrieved. (You can also process old Apache log files if you want to get historical statistics into the new metric; see below.)
Note that this does not mean that the old galley view data is no longer available. The data is saved and still available for reports under the metric ojs::legacyDefault. Read the next section for more details on how to generate reports using both old and new statistics data.
Once the migration process is finished, you can generate reports using any metric. Each old OJS statistics source is now implemented as a report plugin, so it’s possible to access them in the Report Generator part of the Stats & Reports page, under Journal Management.
All three plugins can work with both legacy and new statistics data sources. This means that you can generate the same reports that you could before migration, with the same results. And you can also generate reports from new statistics data, collected by the new usage statistics plugin.
Note that the new default statistic metric will not produce the same results that each one of those old sources would produce if they were still collecting statistics. This happens because now the usage statistics plugin implements rules for retrieving usage events in a different way, in order to be more consistent with the COUNTER project. One example is bot filtering, previously not executed consistently for all stats, and not using the official COUNTER robots list when it was being performed. Another difference is the process to avoid double-clicking, with different timing for file downloads and normal page views.
The OJS views data is the only element that doesn’t have a way to generate reports from new statistics data using the interface. You have to use the following url:
That’s because this report and the new usage statistics report are very similar. For statistics collected since the migration, you can use the new OJS usage statistics report, in the Report Generator area of the Stats & Reports page, under Journal Management.
There you will get a similar format to the old view report, with the difference that you will receive more information, like geolocalization and date; also, galley objects will have their own rows. Note that you will get no data using the OJS usage statistics report if you don’t process any log files (see Processing Log Files).
Managers can now generate custom reports, selecting columns, filters and ordering. This can be done via the report generator user interface, accessed by clicking on the “Generate Custom Report” link, in the Report Generator part of the Stats & Reports page, under Journal Management.
After defining the columns, filters and order-by settings, you can submit the form by clicking on “Generate Custom Report”. You will also notice that a URL appears above the form submit button. You can use this URL to regenerate the same report without having to choose the same settings again.
If you want to generate reports filtered by time, such as the current day or month, you can click the “Current Month” or “Today” checkboxes in the “by time” filter section. A tip is to generate the report the way you like, copy the URL, and store it in your browser’s Favorites for quick access.
The following section describes each of the report generator settings.
Default Report Templates
This option allows Journal Managers to select a set of predefined settings to generate a report. There are five default reports:
|Article file downloads||Shows the total number of article file downloads (HTML, PDF and other formats combined).|
|Article abstract page views||Shows the number of article abstract page views only.|
|Issue file downloads||Shows the total number of issue file downloads (HTML, PDF and other formats combined).|
|Issue table of contents page views||Shows the number of table of contents page views only.|
|Journal main page views||Shows the number of journal main page views only.|
Aggregate Stats By
You can define the aggregation level of the statistics. The default for each report template is to aggregate statistics by country and month. So all statistics for the same published object will be grouped by month and country. You can define a lower level of aggregation, like region, city and/or day. The lower the aggregation level is, the more rows the same published objects will likely have. Let’s consider this example:
This report shows that for November, the article 1 files were downloaded 10 times by users from Canada. As you can see, the aggregation level here is by country and month.
Let’s see this other report now.
With this new report, you can see two lines for the same published object (article). The difference between the two reports is that now we can see the Day column, so we can check the article file downloads for each day of the month. That’s the day aggregation level.
Select Report Range
This option allows you to define the report time range. The options are:
|Time range option||Description|
|Today||Will show only statistics for the current day of the report generation. If you generate a report url and save it for later use, when you use it, the day used will be that current day, and not the day you first generated the report url.|
|Current month||Same as the Today option, but will show statistics for the current month.|
|Range by day||Define start and end dates (year, month and day).|
|Range by month||Define start and end months (year and month).|
None of those settings involve aggregation level, only the report range. You can generate a report for the current month in a day aggregation level, for example.
Public Object Types
All statistics in OJS are now associated with a type of published object. Each object type has its own usage event that triggered the statistic collection. The following table describes that.
|Object type||Usage event|
|Journal||Journal index page view.|
|Issue||Issue table of contents page view.|
|Issue galleys||Issue file downloads.|
|Article||Article abstract page view.|
|Article galleys||Article file downloads.|
The default report templates inside the report generator form will always generate reports with the object type column, so you can check what the values in the count column refer to.
These options, together with the time range, are what really define a report. The default template options simply preselect options here. If you select a default report template and change any option here, you will change the report, regardless of whether you selected a report template.
Define the columns that will be used to build the report. The columns you select here not only define which data will be presented inside the reports, but also define the aggregation level of the statistics. For example, if you choose the ID, Type and Month columns, the report will sum all views of an object in one month and present that data in one row. If you choose Day instead of Month, each object will have a row for each day with statistics.
Define the data that will be used to filter statistics for the reports. You can choose to generate statistics only for articles inside a specific issue, or only for a specific article. You can also filter statistics by type of published object (articles and galleys, for example). And you can filter by geo location, selecting the country, region or even city for which you want to generate statistics.
You can define which columns will be used to order the statistics in the report. Note that even if you choose a column that you didn’t select for display (using the columns section), the report will still be ordered by that column. You can define several columns, so that if the first column has many equal values, the second one is used to break ties, and so on for the remaining ones.
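The effect of choosing Month or Day can be illustrated outside OJS with a few lines of awk. This is only a sketch of the idea of aggregation levels, not how the report generator is implemented; the rows and the helper names `daily_rows` and `aggregate` are made up for the illustration.

```shell
# Hypothetical sample data: type,article,country,day (YYYYMMDD),count
daily_rows() {
  cat <<'EOF'
Galley,Article 1,Canada,20131105,4
Galley,Article 1,Canada,20131112,6
EOF
}

# Sum the counts after truncating the date to its first $1 characters:
# 6 keeps YYYYMM (month level), 8 keeps YYYYMMDD (day level).
aggregate() {
  awk -F, -v n="$1" '
    { key = $1 FS $2 FS $3 FS substr($4, 1, n); sum[key] += $5 }
    END { for (k in sum) print k FS sum[k] }
  ' | sort
}
```

Running `daily_rows | aggregate 6` yields a single monthly row with a count of 10, while `daily_rows | aggregate 8` keeps one row per day.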
Report templates explanation
Now that we understand the concepts behind reports better, let’s take the default report templates as examples to see which settings they define (columns, filters) and why those settings produce the expected result.
|1||Type, Article, Issue, Country, Month||Object type equals to Galley|
|2||Type, Article, Issue, Country, Month||Object type equals to Article|
|3||Type, Issue, Country, Month||Object type equals to Issue Galley|
|4||Type, Issue, Country, Month||Object type equals to Issue|
|5||Type, Journal, Country, Month||Object type equals to Journal|
- Take a look at the reports table to see its description.
The most important setting is the filter. If we don’t use the filter to select only stats related to article galleys (in case of report 1, for example), then all the other published objects would have stats presented in the report. The columns define the aggregation level, as said before, and you can play with them. Let’s give an example:
|Type||Article title||Issue||Country||Month||Count|
|Galley||Article 1||Issue 1||Canada||201311||10|
This is a result from default report 1. Let’s add the File Type column (you can do that by selecting the article file downloads report template, then clicking on advanced options, holding the shift key and clicking on File Type inside the columns select box).
|Type||Article title||Issue||Country||Month||File Type||Count|
|Galley||Article 1||Issue 1||Canada||201311||PDF||5|
|Galley||Article 1||Issue 1||Canada||201311||HTML||5|
Now we have a lower aggregation level, which also shows the file type for the monthly article file downloads. The main objective of the report is the same, but the aggregation level is different.
We can also play with the ID column. If you add that instead of the file type, you should have something like this.
|1||File.pdf||Galley||Article 1||Issue 1||Canada||201311||3|
|2||File2.pdf||Galley||Article 1||Issue 1||Canada||201311||2|
|3||File.html||Galley||Article 1||Issue 1||Canada||201311||5|
The aggregation level is now really low, on the object id level. Each galley will have its own rows, even if they are galleys for the same article and with the same type.
In general, for those default report templates, changing the columns keeps the purpose of the report intact; you only get a higher or lower aggregation level. If you change the object type filter settings, then you change the purpose of the report.
At the same time, you can use filters like geo location or the context one to be more specific while generating reports.
After installing or upgrading OJS, the system will already be logging accesses for all installed journals into log files, inside the OJS files directory, in usageStats/usageEventLogs. There will be one file for each day. Not all system requests are logged into those files, only the ones that matter for usage statistics (submission views, file downloads, etc).
If you don’t want to log accesses into files, you can turn this option off inside the generic usage statistics plugin. Go to the Plugins management page, open the Generic Plugins category and look for the Usage Statistics plugin. Click on settings and you will be directed to the plugin’s settings page (manager/plugin/generic/usagestatsplugin/settings). On the settings page you will find the “Create log files” checkbox; uncheck it and save.
External access log files alternative
The new usage statistics plugin can also read external log files and retrieve usage events from them in order to collect usage statistics. You can use your Apache access logs, for example. The default expected format for external access log files is the Apache combined format.
If you wish to use log files in another format, you will have to create a regular expression that can parse the log entries. See the plugin settings page for more information.
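To give an idea of what such an expression involves, here is a sketch of a regular expression that splits one line in the Apache combined format into its fields, using sed. It is an illustration only: `parse_combined` is a hypothetical helper, and the expression and capture groups the plugin actually expects are described on its settings page, not here.

```shell
# Sketch only: not the plugin's own expression.
# Combined format: host ident user [time] "request" status bytes "referer" "user-agent"
# Prints host|time|request|status, or returns 1 if the line does not match.
parse_combined() {
  out=$(printf '%s\n' "$1" | sed -n -E \
    's/^([^ ]+) [^ ]+ [^ ]+ \[([^]]+)\] "([^"]*)" ([0-9]{3}) ([^ ]+) "([^"]*)" "([^"]*)"$/\1|\2|\3|\4/p')
  [ -n "$out" ] && printf '%s\n' "$out"
}
```

A log format with extra or reordered fields would need the expression adjusted accordingly.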
The advantage to sharing logs with the web server is that logs will be kept in one unified place, and the existing operating system policies around log file management will not need to be extended to cover additional logs for OJS. You can turn off the “Create log files” usage statistics plugin option, so you don’t duplicate logging data (see OJS Statistics logging section).
Processing Log Files
By default, OJS automatically starts logging usage events into its own access log files. It also automatically processes those files on a daily basis. So, for example, if you install OJS today at 2 pm, the processing of today’s usage events will start tomorrow at the same time. If you don’t change any setting, this will go on without the need to do anything else.
If you want to control statistics processing yourself, go to the Plugins management page, open the Generic Plugins category and look for the Acron plugin. Disable it and the automatic processing will stop. Read the following steps to understand how to trigger the processing manually.
OJS needs to process the files to be able to retrieve statistics data from them. This process can be done using the File Loader task tool which comes with the Usage Statistics plugin.
The file loader task implements a process to reliably handle all file processing. It works with four folders: stage, processing, archive and reject. Every time you run the file loader, it searches for files inside the stage folder. If there are any, it moves one file to the processing folder and starts working on it. If anything goes wrong, it sends an email to the system administrator with information about what went wrong and moves the file to the reject folder. If the processing goes well, it moves the file to the archive folder. It continues this process until there are no more files inside the stage folder.
The file loader is a scheduled task tool, which can be periodically invoked using cron or another system task management tool. The command to run this task depends on which process you want to use:
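The folder flow described above can be mimicked with a short shell function. This is a simplified simulation meant only to make the stage, processing, archive and reject steps concrete; it is not the plugin’s code, `run_file_loader` is a hypothetical name, and the processing step is passed in as a command that exits non-zero on failure.

```shell
# Simulation only. $1 = directory holding the four folders,
# $2 = command that processes one file and fails with a non-zero status.
run_file_loader() {
  base=$1
  processor=$2
  mkdir -p "$base/stage" "$base/processing" "$base/archive" "$base/reject"
  while :; do
    set -- "$base"/stage/*             # look at the stage folder again each round
    [ -f "$1" ] || break               # nothing left to process
    name=$(basename "$1")
    mv "$1" "$base/processing/$name"   # take one file at a time
    if "$processor" "$base/processing/$name"; then
      mv "$base/processing/$name" "$base/archive/$name"
    else
      # the real task emails the administrator; here we only print a message
      echo "rejected: $name" >&2
      mv "$base/processing/$name" "$base/reject/$name"
    fi
  done
}
```

Because the stage folder is re-read on every round, files dropped in while the loop is running are picked up as well.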
File loader task processes:
|1||OJS log files||php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasks.xml|
|2||OJS log files with automated staging process||php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasksAutoStage.xml|
|3||External log files||php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasksExternalLogFiles.xml|
This should be executed from the OJS installation’s base directory.
Processes 1 and 3 require you to move files into the stage folder yourself (regardless of whether it’s an Apache log file or an OJS log file). Process 2 automatically moves OJS log files into the stage folder, so you only have to execute the command.
Everything else is handled by the file loader task. If you don’t schedule this for automatic execution using cron or a similar mechanism, you will have to run the file loader task manually.
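One possible way to schedule the task is a small wrapper that changes into the OJS base directory before calling the command from the table above. The sketch below assumes process 2 as the default; `run_usage_stats_loader`, the script path and the schedule in the comment are illustrative assumptions to adapt to your installation.

```shell
# Hypothetical wrapper around the documented command.
# $1 = OJS installation's base directory
# $2 = task file (optional; defaults to the process 2 file from the table)
run_usage_stats_loader() {
  ojs_base=$1
  task_file=${2:-plugins/generic/usageStats/scheduledTasksAutoStage.xml}
  # The command has to be executed from the OJS base directory.
  ( cd "$ojs_base" && php tools/runScheduledTasks.php "$task_file" )
}

# Example crontab entry (daily at 02:00), assuming this function is saved in an
# executable script that calls it with your own OJS path:
#   0 2 * * * /usr/local/bin/ojs-usage-stats.sh
```

The subshell keeps the directory change from affecting whatever calls the function.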
Note that, whichever process you choose, you can move files into the stage folder at any time, even while the scheduled task is running, and you can move any number of files at once. How often you stage files depends mainly on how up to date you need your statistics to be.
The only thing that is not recommended is moving a log file that is still being used by the system (OJS or Apache) to log accesses. OJS already takes care of that for process 2, but for processes 1 and 3 you have to be careful. It’s not recommended because processing time is wasted when the same file has to be reprocessed. You can avoid this by paying attention to the Apache log file rotation (moving only files that have already been rotated) or, if you use OJS log files, by not moving the file whose filename contains the current day.
If you really need up-to-date statistics for the current day, you can copy the file instead of moving it, so the system can continue logging accesses into the same file while you process what has already been logged. The system can handle file reprocessing (see next topic), as long as you don’t change the filename and you make sure that the final copied version of the file contains all logged accesses.
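For process 1, the rule of leaving the current day’s file alone can be sketched as a small staging helper. This assumes that OJS log filenames contain the day as YYYYMMDD, which this document does not spell out, so check your actual filenames first; `stage_finished_logs` is a hypothetical name.

```shell
# Sketch only. $1 = log directory (e.g. usageStats/usageEventLogs in the files dir)
# $2 = stage directory, $3 = today's date as YYYYMMDD (optional)
stage_finished_logs() {
  log_dir=$1
  stage_dir=$2
  today=${3:-$(date +%Y%m%d)}
  for f in "$log_dir"/*; do
    [ -f "$f" ] || continue
    case $(basename "$f") in
      *"$today"*) continue ;;   # still being written to; leave it in place
    esac
    mv "$f" "$stage_dir/"       # the filename is kept unchanged
  done
}
```

To process the current day’s file anyway, copy it with `cp` instead, as described above.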
Let’s imagine that we have the following scenarios, all three with a cron job calling the file loader task tool on a daily basis:
- Journal Manager using Apache log files that rotate each week; the log files are named after the day on which the rotation occurs (e.g. 20130929-access.log); JM uses file loader task process 3.
- Journal Manager using OJS log files that rotate each day, with filenames containing the day on which they were used to log accesses; JM uses file loader task process 1.
- Same as the previous scenario, but JM uses file loader task process 2.
In scenario 1, if JM can’t configure the Apache log file rotation to use a shorter period, he can copy the 20130929-access.log file to the usageStats/stage directory every time he wants new statistics to be processed. Each time, the system will delete all data previously processed for that file and reprocess everything. That’s why this process is not as efficient as a shorter log rotation period. On 20131006 the log rotation will occur and JM can finally move or copy the 20130929-access.log file to the stage directory for the last time, so that the accesses logged between the last time he staged the file and the log rotation are processed as well.
In scenario 2, JM can move one file per day, always moving the file from the previous day. On Monday he can move two files at the same time (from Saturday and Sunday). The delay between an access and the available statistics will be short (only one day for most of the week), and he will avoid reprocessing the same access log entries.
In scenario 3, JM doesn’t have to move the files. Each time the file loader task is executed, the system automatically picks up the log files that have not been processed yet, skipping the one for the current day, stages them and processes them one by one.
In scenarios 1 and 2, if for some reason JM can’t move files for a whole week, he can move all of them at once at the next opportunity. The scheduled task will then process them one by one until the stage directory is empty again.
Reprocessing log files
If you need to reprocess any log file (because a new robot list or geolocation database has been introduced, or because you fixed problems inside rejected files), you can move it back to the stage folder, even if it was inside the reject or archive folder. The file will be processed and any existing data from it will be replaced by the reprocessed data.
The important thing is that you preserve the name of a file after it has been successfully processed and moved to the archive folder. The filename is used to keep track of which file each piece of statistics data was retrieved from. For the same reason, it’s important that all log files have unique filenames. You don’t have to worry about this if you are using OJS log files. But if you are using Apache access log files, make sure the filenames contain something unique, such as dates.
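Moving a file back for reprocessing could look like the following sketch, which looks for the file in the archive and reject folders and returns it to the stage folder under the same name. `restage_log` is a hypothetical helper, and the directory argument is whatever directory holds the stage, archive and reject folders on your system.

```shell
# Sketch only. $1 = directory holding stage/ archive/ reject/, $2 = log file name
restage_log() {
  base=$1
  name=$2
  for dir in archive reject; do
    if [ -f "$base/$dir/$name" ]; then
      mv "$base/$dir/$name" "$base/stage/$name"   # keep the name unchanged
      return 0
    fi
  done
  echo "not found in archive/ or reject/: $name" >&2
  return 1
}
```

The next run of the file loader task will then pick the file up from the stage folder.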