PKP Bugzilla – Bug 6263
Some "Select Reviewer" sort options don't sort correctly
Last modified: 2014-05-28 03:56:11 PDT
We are moving to Git Issues for bug tracking in future releases. During transition, content will be in both tools. If you'd like to file a new bug, please create an issue.
(Courtesy of John Grobler: http://pkp.sfu.ca/support/forum/viewtopic.php?f=8&t=6862&p=26688#p26599.)
Some of the column sort options on the Select Reviewer page (eg. http://git/ojs/index.php/citation/editor/selectReviewer/26) don't appear to be working correctly. John is having problems with the Weeks, Active and Latest columns; I can only replicate in the Latest and Active columns myself, but that may be just because of a lack of data on my end. Basically, for any of those columns, clicking the header sorts the listings, but doesn't sort correctly by number. This should be easy to replicate if you have a number of reviewers who have already completed reviews in your system, but I'll include a couple of screenshots shortly.
Created attachment 3376 [details]
After clicking twice on the weeks column
Some screenshots from John. I've cropped out any potentially identifying information. This one shows what happens when the "Weeks" column is clicked twice.
Created attachment 3377 [details]
After clicking twice on the active column
Created attachment 3378 [details]
After clicking twice on the Latest column
I appears that the data gets sorted on the database field "latest", but the data used for presentation in the column marked "latest" is actually the article's "last notified" date. There are several ways to tackle this, but before applying a patch it would be helpful to know what the "latest" column is supposed to represent. If it really is supposed to be the value "latest", and it's unique, then the SQL retrieval will need to be modified a bit so that the SectionEditorSubmissionDAO->getReviewerStatistics function makes use of that field too (right now it does not).
Mike, this should indeed be tracking the date_notified column of the review_assignments table (not articles table) -- Both the getReviewersForArticle() and getReviewerStatistics() method should both be selecting this.
Alright,that makes sense. Question about the sorting then: should this be kept "latest", with getSortMapping() modified so that "latest" maps to "date_modified" instead? (I see something like that for "submitDate", for instance, but that might have been an earlier shortcut).
AFAICR, the 'latest' in getSortMapping() is only used for the getReviewersForArticle() where we select 'latest', not ac.date_notified, as an alias to MAX(ac.date_notified). So it should remain as it is. But if the sorting isn't working, I could be wrong.
If I integrally change last_notified to last_modified for this (modify the getReviewersForArticle function to use "MAX(ac.last_modified) as latest", modify the getReviewerStatistics function to use "SELECT r.reviewer_id, MAX(r.last_modified) AS latest", add the field value using $statistics[$row['reviewer_id']]['latest'] = $this->datetimeFromDB($row['latest']), and then modify selectReviewer.tpl so that $reviewerState.latest is used) The sorting is still going wrong, though.
Just to make sure I have the logic right, reading the code:
- getReviewersForArticle generates a sorted list of reviewers for a particular article ID
- getReviewerStatistics generates the stats for each reviewer, based on a specific journal ID
because the first is a select constrained to a specific article, but the second for a specific journal, getReviewerStatistics is going to find a value that is not related to the article the reviewer's stats are being requested for, on this page, but will instead find whatever is the most recent "last_modified" field for the reviewer, for the journal matching the passed journal ID.
Does that sound like a reasonable analysis? Given that there is no articleid available as argument for getReviewerStatistics, I'm not sure I know how to constrain the SQL select so that it only fetches the "last_modified" field for the article in question.
I am somewhat stuck on this bug, because there are several things I can do to generate numbers that are ordered, but all of them reflect different things. Before I do that it would be helpful to know what "latest" actually means, in terms of what is to be presented to the user.
Is this column supposed to represent the last date a particular reviewer was notified about this article? Is it supposed to represent that most recent date the reviewer was active? Is it supposed to represent the most recent date the reviewer was seen?
The label is not indicative enough of what data it should present, so there's not enough to go on in order to provide a proper fix (although it is clear that the two functions responsible for getting the data actually retrieve very different things. One generates data relating to article-specific information, the other generates data relating to reviewer-for-journal-specific information).
Mike -- There are descriptions to all the column headings hidden at the bottom of the select reviewer page:
Name links to reviewer's profile.<br/>
Ratings is out of 5 (Excellent).<br/>
Weeks refers to average period of time to complete a review.<br/>
Latest is date of most recently accepted review.<br/>
Active is how many reviews are currently being considered or underway
Oh my god, this is ridiculous. I was getting confused with dates, turns out my review_assignments table simply has two entries for the reviewer whose date was wrong. One in 1965, one in 2007, and the MAX(date_notified) aggregation of course picked the higher one, while the query in getReviewerStatistics() did not because that groups on reviewer ID, and so only finds whatever is the first row (no ORDER BY is applied there).
Do we want to apply the same sorting we use for getReviewersForArticle for getReviewerStatistics?
Mike, I don't think SectionEditorSubmissionDAO::getReviewerStatistics needs sorting, if that's what you mean. The whole set for the journal is fetched and then the appropriate entry in the results is looked up by index as part of the loop in the selectReviewer.tpl template.
That will not play "correctly" with multiple rows in the review_assignments table, though. Say we have
1 | reviewer_id = 18 | submission_id = 17 | date_notified = 1960-01-01 | ...
2 | reviewer_id = 18 | submission_id = 202 | date_notified = 2011-01-01 | ...
then getReviewersForArticle(), sorted on latest (alias for date_notifed), will find this reviewer as having "2011-01-01" for that field. Quite correctly so.
However, the getReviewerStatistics() fuction, which is what we actually use to fill the $reviewer var used in the template, has a query that collapses results based on reviewer_id, without any preference on which to keep; The "GROUP BY r.reviewer_id" line (SectionEditorSubmissionDAO:1040) will simply discard everything after the first matching row, even if the select is told to get "MAX(r.date_notified)" rather than just "r.date_notified"
Consequently, when we read the $reviewer->date_notified, it will have the wrong value. If this should simply always return the most recent date then we could automatically sort on last_notified here anyway (since the first query is concerned mostly with counts, so one row stays one row). Does that sound sensible?
Mike, you're talking about the q
Sorry, hit the wrong button.
Mike, you're talking about the following query, right?
SELECT r.reviewer_id, MAX(r.date_notified) AS last_notified
FROM review_assignments r,
WHERE r.submission_id = a.article_id AND
a.journal_id = ?
GROUP BY r.reviewer_id
This should go through all reviewers for a journal, and for each, fetch the most recent r.date_modified. The GROUP BY shouldn't depend on sort ordering, and shouldn't just operate on the first row encountered -- it sets the scope under which the aggregate function MAX(...) operates. Try running through your sample data with this query, simplifying it if needed.
Created attachment 3512 [details]
getReviewersForArticle query result
query problem in getReviewersForArticle illustrated
I've attached a text file showing what data I'm using to look at this bug, and what the results of the queries are when run through MySQL. Note that the results in this file suggest that the big query we now use in getReviewersForArticle() will return values for MAX(date_notified) that are based on reviewer behaviour across all journals, rather than just the one the article is in.
Because in getReviewerStatistics() we only get "date_notified" fields for one specific journal, which query should be "fixed"?
Should getReviewersForArticle() be made to only look for the date of the most recently accepted review for the journal with the passed journal_id? (this seems to make sense to me, since information about a reviewer's behaviour in journal XYZ should not be relevant to an article in journal ABC)?
Or should getReviewerStatistics() get an extra query (or have one of the existing ones broadened) to account for the latest activity by a reviewer across all journals rather than just the one we're getting the statistics for (that seems to make a little less sense)?
Created attachment 3513 [details]
additional query results showing errors in the getReviewersForArticle query
additional query results showing errors in the getReviewersForArticle query
I've added a second attachment that also shows the getReviewersForArticle() query resulting in missed data, where a reviewer with a not-null "date_notified" field for the journal in question does not have this date show up in the results for the mega-query.
Generally speaking, we try to compartmentalize journals as much as possible. I think it makes the most sense for the stats to be consistently journal-specific.
Makes sense to me. Shall I try to rewrite the getReviewersForArticle query so that it doesn't select review activity for other journals, and doesn't miss out on activity?
Yes to excluding data from other journals -- but for the second part, what activity is it missing?
In the second attachment (http://pkp.sfu.ca/bugzilla/attachment.cgi?id=3513) you can see that even though there is a reviewer, Anderson Batista, with a "date_notified" set for an article from this journal (2005-11-07 16:23:59, for submission_id 28), the getReviewersForArticle() query doesn't actually find that value, instead resolving it to "NULL".
Created attachment 3514 [details]
new query plus results
I've attached the query I am currently trying, plus its result and some verification, for feedback. It's not the PHP one yet, with ? substitution and sort/set variables.
AVG(ra.quality) AS average_quality,
COUNT(rac.review_id) AS completed,
COUNT(rai.review_id) AS incomplete,
MAX(ra.date_notified) AS latest
FROM roles r
LEFT JOIN articles a USING (journal_id)
LEFT JOIN users u ON (r.user_id = u.user_id AND r.journal_id = 1)
LEFT JOIN review_assignments ra ON (ra.reviewer_id = u.user_id AND ra.submission_id = a.article_id)
LEFT JOIN review_assignments rac ON (ra.review_id = rac.review_id AND rac.date_completed IS NOT NULL)
LEFT JOIN review_assignments rai ON (ra.review_id = rai.review_id AND rai.date_completed IS NULL)
WHERE r.journal_id = 1 AND r.role_id = 4096
GROUP BY u.user_id
ORDER BY latest DESC;
Looks good -- but for consistency, better to stick with:
LEFT JOIN xxx ON (... = ...)
LEFT JOIN xxx USING (...)
I guess you're working on the query? If you're still stuck on e.g. the missing data and want some help, it might be easiest to send me a .sql.gz of your database so I can work with it directly.
(In reply to comment #25)
> Looks good -- but for consistency, better to stick with:
> LEFT JOIN xxx ON (... = ...)
> ...rather than...
> LEFT JOIN xxx USING (...)
Actually, if we use ON then it breaks, finding dates from wrong journals, too, which is why I went with USING instead.
I'm doing this based on the "present" demo database you sent me, do you still have that or should I send you the copy I have sitting here back?
Hmm, the following:
... FROM roles r LEFT JOIN articles a USING (journal_id)
should be exactly equivalent to the following:
... FROM roles r LEFT JOIN articles a ON (r.journal_id = a.journal_id)
Were you using a different condition on the ON syntax in your tests?
corrected query in SectionEditorSubmissionDAO->getReviewersForArticle:
$sql = 'SELECT DISTINCT
ar.review_id ' .
($selectQuality ? ', AVG(ac.quality) AS average_quality ' : '') .
($selectLatest ? ', MAX(ac.date_notified) AS latest ' : '') .
($selectComplete ? ', COUNT(ra.review_id) AS completed ' : '') .
($selectAverage ? ', AVG(ra.date_completed-ra.date_notified) AS average ' : '') .
($selectIncomplete ? ', COUNT(ai.review_id) AS incomplete ' : '') .
'FROM roles r
LEFT JOIN articles a ON (a.journal_id = r.journal_id)
LEFT JOIN users u ON (u.user_id = r.user_id)
LEFT JOIN review_assignments ar ON (ar.reviewer_id = u.user_id AND ar.cancelled = 0 AND ar.submission_id = ? AND ar.round = ?) ' .
($joinInterests ? 'LEFT JOIN controlled_vocabs cv ON (cv.assoc_type = ? AND cv.assoc_id = u.user_id AND cv.symbolic = ?)
LEFT JOIN controlled_vocab_entries cve ON (cve.controlled_vocab_id = cv.controlled_vocab_id)
LEFT JOIN controlled_vocab_entry_settings cves ON (cves.controlled_vocab_entry_id = cve.controlled_vocab_entry_id) ':'') .
($joinAll ? 'LEFT JOIN review_assignments ac ON (ac.reviewer_id = u.user_id AND ac.submission_id = a.article_id) ':'') .
($joinComplete ? 'LEFT JOIN review_assignments ra ON (ra.reviewer_id = u.user_id AND ra.date_completed IS NOT NULL) ':'') .
($joinIncomplete ? 'LEFT JOIN review_assignments ai ON (ai.reviewer_id = u.user_id AND ai.date_notified IS NOT NULL AND ai.cancelled = 0 AND ai.date_completed IS NULL) ':'') .
'WHERE r.journal_id = ? AND
r.role_id = ? ' . $searchSql . ' GROUP BY u.user_id, u.last_name, ar.review_id' .
($sortBy?(' ORDER BY ' . $this->getSortMapping($sortBy) . ' ' . $this->getDirectionMapping($sortDirection)) : '');
This restricts reviewer information in the "ac" aliassed "review_assignemnts" left join to only submissions/articles whose journal identifier is the one relevant to the retrieval operation. In addition, linking in the articles table contrained on journal_id and contraining "ac" by using "ac.submission_id = a.article_id" also solves the problem of "missing" date_notified dates for authors in the original query.
The explicit join on the users table also means that the "u.user_id=r.user_id" constraint could be moved from the WHERE clause into the LEFT JOIN's ON... section.
Thanks, Mike -- would you mind uploading the modification as a unified .diff?
Created attachment 3519 [details]
patch for classes/sumission/sectioneditor/SectionEditorSubmissionDAO.inc.php
Patch against OJS from github
Thanks, Mike. Some minor comments:
- It looks like you're using spaces for indentation; try to keep to tabs.
- I don't know what the purpose of the change to the "u" join is, but I don't think OTOH it does anything differently this way.
- The articles table is brought in to limit the "ac" join to submissions in the current journal, correct? If so, it should be moved nearer to that join to make that clearer; having done that for the "ac" join, there are other stats that need to be made journal-specific too, I think.
Created attachment 3521 [details]
> It looks like you're using spaces for indentation; try to keep to tabs.
Forgot my texteditor was still set to procesingjs mode. I'll make sure to change that.
> I don't know what the purpose of the change to the "u" join is, but I don't
> think OTOH it does anything differently this way.
I did this mostly because the fact that r, u and a are now used in join contraints, I kept getting an error "unknown column u.user_id" when I kept it as a straight selection table (SELECT .. FROM users u, roles r LEFT JOIN ...) with the resolution as a WHERE constraint.
> The articles table is brought in to limit the "ac" join to submissions in the
> current journal, correct? If so, it should be moved nearer to that join to make
> that clearer; having done that for the "ac" join, there are other stats that
> need to be made journal-specific too, I think.
Good point. I've reordered the left joins so that they're closer to where the relevant information is joined in. Is this better?
Also, the "ar" table is used to grab all articles currently in round [...], but resolves its submission_id to a php-passed value. Should this also become ar.submission_id = a.article_id? (if so, how would one go about making sure the removal of a ? is correctly handled?)
Mike, it looks like the most recent patch has an accidental change in indentation; it should just include the changed lines but it includes the whole query...
You can make the roles table accessible in the LEFT JOINs by reversing the order they're specified in, i.e.:
... FROM users u, roles r LEFT JOIN...
... FROM roles r, users u LEFT JOIN ...
and changing the later references inside the join conditions from "u.user_id" to "r.user_id" instead (since the two are equivalent in the WHERE conditions).
This will keep the cardinality a little lower, since doing a "roles LEFT JOIN users" will bring in a lot of unnecessary rows (which will later be knocked out by the DISTINCT and WHERE conditions, but will probably slow things down on both sides).
Have another look at the "ar" join -- it's used to check whether or not each reviewer returned is already assigned to the given submission and submission round, so the "Assign" link can be hidden in that case. It doesn't need to change.
Created attachment 3522 [details]
fixed the spacing (let me know if it's still oddly spaced, if so I'll have to start looking at a different text editor because this one's too unpredictable), and I've updated the patch to use the suggested "FROM users u, roles r LEFT JOIN ..." with u.user_id replaced by r.user_id in the LEFT JOIN constraints, and the "u.user_id=r.user_id" constraint placed back int the WHERE clause.
OK, looks good! OTOH, could you not also convert the "articles a" join from a LEFT JOIN to a JOIN? That would decrease the cardinality again.
Looking at this query some more, it turns out there were still some problems with it; ac.submission_id is used a few times in joins that don't know "ac" (ar, ai), but more interestingly everything sorts correctly except for "done" and "weeks". Further work is required.
The ordering that I get from the getReviewersForArticle query, when sorting on "done" (=date_completed) seems to be different from the order that is then used by the database result iterator through calls to &_returnReviewerUserFromRow(&$row).
For example,the query generates the following result set (verified by running in MySQL manually):
| user_id | last_name | review_id | completed |
| 1 | Willinsky | NULL | 8 |
| 35 | Ashton | NULL | 3 |
| 16 | Flores Magon | NULL | 3 |
| 12 | Jordan | NULL | 3 |
| 15 | Korteweg | NULL | 2 |
| 220 | Anderson | NULL | 1 |
| 125 | Kincheloe | NULL | 1 |
| 18 | Rogers | NULL | 1 |
| 149 | Adzhubey | NULL | 1 |
| 14 | Klinger | NULL | 1 |
| 9 | Rodriguez | NULL | 1 |
| 67 | Blucher | NULL | 1 |
| 194 | Bressan | NULL | 1 |
| 26 | Marx | NULL | 1 |
| 19 | Wollstonecraft | NULL | 1 |
| 17 | Reclus | NULL | 1 |
| 13 | Morrison | NULL | 1 |
| 10 | Li | NULL | 1 |
| 104 | Leandro | NULL | 0 |
| 168 | smith | NULL | 0 |
| 238 | Foerster | NULL | 0 |
| 5 | Inglis | NULL | 0 |
| 69 | Thakurdesai | NULL | 0 |
| 137 | Cornelio | NULL | 0 |
| 200 | Atif | NULL | 0 |
Which has the following user_id ordering:
1, 35, 16, 12, 15, 220, 125, 18, 149, 14, 9, 67, 194,
26, 19, 17, 13, 10, 104, 168, 238, 5, 69, 137, 200
However, when examining which order the result iterator runs through this result set, the order is in fact:
1, 35, 16, 12, 15, 67, 194, 26, 19, 17, 13, 10, 220,
125, 18, 149, 14, 9, 136, 196, 270, 32, 97, 159, 224
Is it possible for retrieveRange() to change the ordering of items in a set?
No, retrieveRange won't change the ordering -- you can trace it through the DAO class yourself to see where it goes.
If you like, write up a specific step-by-step description showing the discrepancy and I can try to track it down.
It's possible that there's some kind of caching going on outside of ojs' control - even though I have caching set to "none", checking the result today gives a different ordering than before... something to keep in the back of my mind while testing, I suppose.
At any rate, for the "done" column the issue appears to be that when sorting on "done", the "total span" and "completed count" values for the reviewer statistics are based on [review_assignments.date_completed - review_assignments.date_notified] for entries in the review_assignemnts table that have not been declined, i.e. r.declined = 0.
However, our getReviewersForArticle() query does not filter out declined entries, so the ordering based on completions will differ between the two queries.
Should the big query be made to take declined=0 into account, or should the reviewer statistics query be modified to also count declined reviews (I would imagine the first, rather than the second)?
Additionally, the templates/section editor/selectreviewer.tpl seems to be using the wrong smarty variable for displaying the "completed" count. It currently uses:
I'm not sure where that variable comes from, but it does not seem to contain the correct values based on the SQL used in getReviewerStatistics. Using:
instead of what is currently used does list the value that is abstracted via row tallying in PHP in the getReviewStatistics function at least.
(Also: it is possible to change the getReviewerStatistics query so that the total span and completion count is done in the query itself, rather than php tallying. This might be cleaner.)
Declined reviews shouldn't be included in the counts.
The completedReviewCounts variable comes from ReviewAssignmentDAO::getCompletedReviewCounts; if you want to try removing that function entirely, go for it. (The getCompletedReviewCountsForReviewForms function doesn't ever appear to be used and can be removed.)
Created attachment 3533 [details]
disregarding 'declined' articles in completion count
This patch is a fix of previous patch, fixed so that left joins are on their own table's submission id, review_assignments.declined=0 is used as constraint for the complete and incomplete joins, and the selectReviewer template has been updated to use the correct reviewer statistics key for the values presented in the "done" column.
I suggest opening a new ticket for the cleanup of ReviewAssignmentDAO::getCompletedReviewCounts and getCompletedReviewCountsForReviewForms functions, let me know if that's okay or whether you'd like that worked into this patch too, instead.
Have you looked at my suggestion in comment #35?
I've run the two queries through the MySQL profiler, which reports some counter-intuitive statistics. When using LEFT JOIN articles a ON (a.journal_id=r.journal_id):
mysql> select sum(duration) from information_schema.profiling where query_id=1;
| sum(duration) |
| 0.167724 |
And when using JOIN articles a ON (a.journal_id=r.journal_id):
mysql> select sum(duration) from information_schema.profiling where query_id=2;
| sum(duration) |
| 0.208414 |
Restarting MySQL and running the queries in reverse order, just to rule out caching shows the same disparity, with the normal JOIN actually taking longer to perform than the LEFT JOIN.
Attachment #3533 [details] leaves the unused $completedReviewCounts variable lying around; you'll save a big query by getting rid of that.
You're using a small dataset; looking at something more real-world, I get a considerable improvement using a straight-up join as opposed to a left join (~2.75 seconds drops to ~0.85 on average).
That's true. several seconds gained on large datasets definitely warrants the replacement. I shall update the patch.
Created attachment 3534 [details]
replaced LEFT JOIN with JOIN
Updated query with a straight JOIN on articles, and LEFT JOINs for additional constraint.
To clarify on the first part of comment #44 (and the end of comment #41): you can leave the cleanup of the ReviewAssignmentDAO::getCompletedReviewCounts and
getCompletedReviewCountsForReviewForms functions themselves to another entry (and please do file a bug), but because you've removed the need for the $completedReviewCounts variable in your patch here, it's best to clean up the variable assignment here as well.
Created attachment 3538 [details]
query rewrite, template update, and obsolete template var removal
This patch is a rewrite for the getReviewersForArticle SQL query, updates the selectReviewer template to use the correct values for the "done" column (=date completed), and removes the assignment of the incorrect variable in SubmissionEditHandler.
This turns classes/submission/reviewAssignment/ReviewAssignmentDAO->getCompletedReviewCounts() into an orphaned function; I will open a new ticket for removal of this orphaned function.
Looks good, Mike. I think it should definitely get a test in PostgreSQL, and then we'll need to look at porting to OCS.
Created attachment 3547 [details]
postgres minimal database
I tested the code on PostgreSQL 8.3 (7.x is no longer on offer) and verified that there were no errors, and that the behaviour seems in line with MySQL.
Great, Mike. Before this can be back-ported to OCS, you'll need to back-port bug #6573. Mind taking a look?
I'm using OJS 220.127.116.11 and this bug is still present. I've tried to fix the query in SectionEditorSubmissionDAO.inc.php with Mike's attachment, but the code is different now as it was in 2011.
I'm seeing the same incomplete sorting behaviour when selecting reviewers as John Grobler from the original forum post (link in the description of the bug).
Would it be possible to set the default sorting order to descending, eg. so that the reviewers with active reviews (or with most reviews done) would be immediately positioned on top when clicking on the respective column heading.