Bug 8248

Summary: Unable to create index on article_settings with postgres
Product: OJS Reporter: Tom Christensen <tgc>
Component: InstallerAssignee: PKP Support <pkp-support>
Status: RESOLVED FIXED    
Severity: normal CC: alec
Priority: P3    
Version: 2.4.3   
Hardware: All   
OS: All   
Version Reported In: Also Affects:
Attachments: Patch against OJS 2.4.2

Description Tom Christensen 2013-06-06 04:00:20 PDT
While performing an upgrade from OJS 2.3.7 to 2.4.2 using a PostgreSQL database I ran into this error:
Query failed: ERROR:  index row requires 14368 bytes, maximum size is
    8191

The failing SQL is:
CREATE INDEX article_settings_name_value ON article_settings (setting_name, setting_value);

The error is triggered by abstracts in setting_value that apparently exceed 8k in size. Unfortuntately PostgreSQL b-tree indexes will not allow indexing fields with more than 8k data.
This is a hard limit for btree based indexes and I could not see any obvious alternative index type that could be used instead.

I see in dbscripts/xml/indexes.xml that for mysql create index adds a limit on the fields (setting_name(50), setting_value(150)), unfortunately there is no direct equivalent in PostgreSQL.
It is however possible to limit the amount of data in a field that is indexed using an expression which could be a builtin function like substring.
Unfortunately this would require any lookups to use the same function or the index is not used, this is probably not a workable solution.

My current workaround is therefore to instead create a partial index which excludes setting_name = 'abstract' from the index.
I'll attach a patch that does this.

More information on PostgreSQL indexes here:
http://www.postgresql.org/docs/8.4/static/indexes.html
Comment 1 Tom Christensen 2013-06-06 04:00:42 PDT
Created attachment 3934 [details]
Patch against OJS 2.4.2
Comment 2 Alec Smecher 2013-06-06 10:52:30 PDT
Tom, thanks for investigating this. That particular index is only currently used by the PublishedArticleDAO::getBySetting function, which looks up an article based on a setting name and value. In turn this is currently only used by the public identifier code to find an article by public identifier.

The change you're proposing will result in the index not being used for this query, since (as you point out) the query won't use the same function as the index creation.

One option to optimize this in a database-independent and unobtrusive way would be to add a value_hash column that is set to e.g. the SHA1 hash of the setting_value when values are inserted or updated, then index (and query) that rather than the full setting_value.

However, for the moment I suspect that's over-optimizing. Until we run into performance problems with this approach, I'd suggest we simply create the following two indexes instead:

CREATE INDEX article_settings_name ON article_settings (setting_name);

OTOH this should change use cases for this query from a full scan to a partial scan based on this index. Note that we'll probably have to make the same change for issue_settings.

Does that sound workable?
Comment 3 Tom Christensen 2013-06-07 04:54:21 PDT
I'm not sure I understand exactly.

The change I've proposed is to exclude rows from the index where setting_name = 'abstract', this is what is known as a partial index in PostgreSQL.

With such a partial index all queries with 'where setting_name =' will use the index *except* for where setting_name = 'abstract'.
I did this thinking that a query such as where setting_name = 'abstract' and setting_value = 'some text from the abstract' was unlikely.

With the addition of the index you propose this ensures as far as I can see (using the psql EXPLAIN command to examine query plans) that no table scan is made on queries even with setting_name = 'abstract' and setting_value = 'some text from the abstract'.

You mentioned "The following two indexes" but only proposed one, did you mean to refer to the partial index I've proposed as the second?

The issue_settings table did not exhibit this issue in our current installation. If a similar solution with a partial index where to be implemented then it looks like the 'description' field is the prime candidate for exclusion when looking at the current contents of our installation.
Comment 4 Alec Smecher 2013-06-17 14:18:35 PDT
Thanks for the additional info, Tom. Can you confirm that the query does appear to use a partial index based on an execution plan? I'm speaking OTOH without trying this in PostgreSQL directly so I could well be wrong.

("Following two indexes" was a typo -- I was proposing just the one.)
Comment 5 Tom Christensen 2013-06-18 02:48:31 PDT
I'm using PostgreSQL 8.4(.13).

If I drop the partial index I created and use only the index you proposed on setting_name then the query plan looks like this:
ojsdev=# explain select * from article_settings where setting_name = 'title' and setting_value='mytitle';
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Bitmap Heap Scan on article_settings  (cost=91.98..672.43 rows=1 width=112)
   Recheck Cond: ((setting_name)::text = 'title'::text)
   Filter: (setting_value = 'mytitle'::text)
   ->  Bitmap Index Scan on article_settings_name  (cost=0.00..91.98 rows=4230 width=0)
         Index Cond: ((setting_name)::text = 'title'::text)
(5 rows)

If I add my partial index that excludes rows with setting_name = 'abstract', the query plan looks like this:
ojsdev=# CREATE INDEX article_settings_name_value ON article_settings (setting_name, setting_value) where setting_name != 'abstract';
CREATE INDEX
ojsdev=# explain select * from article_settings where setting_name = 'title' and setting_value='mytitle';
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Index Scan using article_settings_name_value on article_settings  (cost=0.00..8.28 rows=1 width=112)
   Index Cond: (((setting_name)::text = 'title'::text) AND (setting_value = 'mytitle'::text))
(2 rows)

A notable speedup.
With this additional index the cost of using index on setting_name only is incurred only when querying setting_name = 'abstract';

ojsdev=# explain select * from article_settings where setting_name = 'abstract' and setting_value = 'myabstract';
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Bitmap Heap Scan on article_settings  (cost=79.37..650.59 rows=1 width=112)
   Recheck Cond: ((setting_name)::text = 'abstract'::text)
   Filter: (setting_value = 'myabstract'::text)
   ->  Bitmap Index Scan on article_settings_name  (cost=0.00..79.37 rows=3615 width=0)
         Index Cond: ((setting_name)::text = 'abstract'::text)
(5 rows)

Dropping your proposed index and keeping only the partial index created above will make this query fall back to tablescan:
ojsdev=# DROP INDEX article_settings_name;
DROP INDEX
ojsdev=# explain select * from article_settings where setting_name = 'abstract' and setting_value = 'myabstract';
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Seq Scan on article_settings  (cost=0.00..982.66 rows=1 width=112)
   Filter: (((setting_name)::text = 'abstract'::text) AND (setting_value = 'myabstract'::text))
(2 rows)

However this is an unlikly query, a more likely query would be something like this:
ojsdev=# explain select article_id,setting_value from article_settings where setting_name = 'abstract' and article_id = '35';
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Index Scan using article_settings_pkey on article_settings  (cost=0.00..8.34 rows=1 width=90)
   Index Cond: ((article_id = 35::bigint) AND ((setting_name)::text = 'abstract'::text))
(2 rows)

Much better and would not be helped by an index on setting_name.
Comment 6 Alec Smecher 2013-06-18 13:25:56 PDT
Tom, thanks. As the article_settings table is meant to be a fairly extensible facility, I do worry that other settings may spill over the size limit. What about whitelisting the settings that we want indexed as opposed to blacklisting just abstracts? Looking at the code, we have...

indexingState (used in SolrWebService)
medra::registeredDoi (used in DOIExportPlugin / MedraExportPlugin)
datacite::registeredDoi (used in DOIExportPlugin / DataciteExportPlugin)

This would mean something like...

CREATE INDEX article_settings_name_value ON article_settings (setting_name, setting_value) where setting_name IN ('indexingState', 'medra::registeredDoi', 'datacite::registeredDoi');

This has the added bonus of being much more greppable, i.e. if someone's wondering what the index is for, it's much clearer.

Would you mind double-checking that this still plays nicely with the execution plan?
Comment 7 Tom Christensen 2013-06-19 05:46:27 PDT
That will work fine as long as the right values for setting_name are indexed.
Some vigilance will be required to make sure that list of values stay updated or performance will suffer. Given the obvious problems with postgres support I am somewhat worried this will be overlooked.

ojsdev=# DROP INDEX article_settings_name;
ERROR:  index "article_settings_name" does not exist
ojsdev=# DROP INDEX article_settings_name_value;
DROP INDEX
ojsdev=# CREATE INDEX article_settings_name_value ON article_settings (setting_name, setting_value) where setting_name IN ('indexingState', 'medra::registeredDoi', 'datacite::registeredDoi');
CREATE INDEX
ojsdev=# explain select setting_value from article_settings where setting_name = 'pub-id::publisher-id';
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Seq Scan on article_settings  (cost=0.00..1551.71 rows=5251 width=97)
   Filter: ((setting_name)::text = 'pub-id::publisher-id'::text)
(2 rows)

ojsdev=# explain select setting_value from article_settings where setting_name = 'pub-id::publisher-id' and setting_value = 'my pub-id';
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on article_settings  (cost=0.00..1678.06 rows=1 width=97)
   Filter: (((setting_name)::text = 'pub-id::publisher-id'::text) AND (setting_value = 'my pub-id'::text))
(2 rows)

ojsdev=# explain select setting_value from article_settings where setting_name = 'indexingState';
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Index Scan using article_settings_name_value on article_settings  (cost=0.00..8.27 rows=1 width=97)
   Index Cond: ((setting_name)::text = 'indexingState'::text)
(2 rows)
Comment 8 Alec Smecher 2013-06-19 14:23:02 PDT
Fix index creation with PostgreSQL (patch against OJS pre-3.0)
https://github.com/pkp/ojs/commit/e16c4600fbd0d49c76bc89139a2386d913d9dfe5
Comment 9 Alec Smecher 2013-06-19 14:25:02 PDT
Fix index creation with PostgreSQL (patch against OJS 2.4.x)
https://github.com/pkp/ojs/commit/bc8bb2ee5db01cc6db23c0dcfd2662afc0517b5b
Comment 11 Alec Smecher 2013-06-19 14:29:21 PDT
Thanks, Tom. I do prefer a whitelist approach rather than a blacklist approach and eventually an implementation using hashes would be more generic. We haven't done a particularly good job ensuring PostgreSQL compatibility and most of that is due to our scattershot testing approach. We are tinkering with automated regression testing and PostgreSQL would be a good candidate for that test suite as well; the development team and most of the user community focuses on MySQL. Meanwhile, I appreciate your diligence in helping us catch up.

I've also added a warning to the code self-documentation for getBySetting that should remind users to ensure indexing is kept correct. This is low-risk if forgotten: it'll result in lower performance for certain queries.

I'll add this to our recommended patches list shortly.