Bug 6635 - Review search index rebuild tool's memory requirements
Status: NEW
Product: OJS
Classification: Unclassified
Component: General
Version: 2.4.x
Hardware: All All
Importance: P3 normal
Assigned To: PKP Support
Duplicates: 8181
Depends on:
Blocks:
Reported: 2011-05-06 09:51 PDT by Alec Smecher
Modified: 2013-08-28 12:11 PDT
Description Alec Smecher 2011-05-06 09:51:51 PDT
Review search index rebuild tool's memory requirements. It appears that an in-memory cache might be causing it to require a lot of memory. (See http://pkp.sfu.ca/support/forum/viewtopic.php?f=8&t=3816)
Comment 1 Alec Smecher 2012-09-21 14:32:44 PDT
Additional investigation from Florian:

Hi Alec,

I gave up when I realized that virtually every call to the DAO layer caused memory leaks that cannot be traced back to a single function or even a few functions. There must be more than one leak, and I cannot say exactly where they come from (our code, ADOdb, PHP, or the mysql extension). You'd have to measure memory usage after every single step to analyze this. I didn't go that far.

If you'd like to pursue this further, you can start where I stopped: simply dump memory usage before and after every DAO statement in LucenePlugin and SolrWebService and you'll see memory usage increase continually. The critical functions here are LucenePlugin::callbackRebuildIndex() and SolrWebService::_getArticleXml().

Florian

On 08/03/2012 01:36 PM, Alec Smecher wrote:
> Hi Florian,
>
> How far did you get in simplifying this? If you're able to come up with
> a compact proof of the leak then I might be able to help by pursuing it
> further with the PHP folks. (Though you'll note that this didn't get us
> anywhere with <https://bugs.php.net/bug.php?id=46408>.)
>
> I'd like to see if we can pursue an option in PHP before moving to a
> batch script.  (Another ugly but cross-platform solution would be to
> write the loop script in PHP as well, and have it execute the PHP
> interpreter to launch the re-indexing activity.)
>
> Thanks for all your work tracking this down.
>
> Regards,
> Alec
>
> On 28/07/12 02:43 PM, Florian Grandel wrote:
>> Hi everybody,
>>
>> after a whole day spent with profiling, I tracked this down to many
>> small memory leaks in the database access layer. When I remove all
>> database access then I get no memory leak.
>>
>> This seems not only to be ADOdb's fault. I think there are also memory
>> leaks in PHP itself or in the mysql extension. E.g.,
>> mysql_free_result() does not return to the original amount of memory
>> consumed.
>>
>> I'm wondering whether large OJS installations with several journals
>> never had memory problems when rebuilding their index. The current
>> implementation must use even more memory. Anyone heard of difficulties
>> there?
>>
>> The only thing I can now think of to limit memory consumption when
>> rebuilding an index is calling PHP once per 100-article batch via a
>> shell script loop.
>>
>> @Alec: Batch indexing can be done with the "dirty"-markers that I
>> wrote you about recently. I could simply mark all articles dirty first
>> and then select and index a fixed amount of "dirty" articles whenever
>> the script is called. Did you know that Wikipedia does batch job
>> handling like this whenever someone makes a request, even if it's just
>> to read an article?
>>
>> @Bozana: Indexing the database with FQS/QN together exceeded the
>> current memory limit after about two thirds of the articles were
>> indexed. The "dirty" mechanism will only be part of the next project
>> phase. So to index the test corpus we'll have to raise the PHP CLI
>> memory limit on the test server for now. Is this possible?
>>
>> Maybe one of you still has some better ideas?
>>
>> Thanks!!
>>
>> Florian
>>
>> On 07/28/2012 04:41 PM, Alec Smecher wrote:
>>> Hi all,
>>>
>>> Thanks for diving into this, Florian. We've had a few reports on the
>>> forum but I haven't had time to investigate them. I'll spend some time
>>> early next week to see if I can add to the information you've posted.
>>>
>>> Cheers,
>>> Alec
>>>
>>> On 28/07/12 10:49 AM, Florian Grandel wrote:
>>>> Hi all,
>>>>
>>>> I just posted this problem to Stack Overflow, as everything I have tried
>>>> so far was to no avail.
>>>>
>>>> See
>>>> http://stackoverflow.com/questions/11703164/php-domdocument-unset-does-not-release-resources
>>>>
>>>> There you can also see the list of things I tried out. I'm just out of
>>>> ideas right now...
>>>>
>>>> Florian
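
A minimal sketch of the batch approach discussed above (one PHP invocation per fixed-size batch, driven by a shell loop). It assumes a hypothetical batch command that indexes at most one batch of "dirty" articles and prints the number of articles it indexed. No such command exists in this thread, and `run_batches` is an illustrative name only.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: the batch command passed as "$@" is hypothetical.
# It is expected to index at most one batch of "dirty" articles, print the
# number of articles it indexed on stdout, and exit 0 on success.

# Usage: run_batches MAX_RUNS COMMAND [ARGS...]
# Prints the total number of articles indexed across all runs.
run_batches() {
    max_runs=$1
    shift
    total=0
    runs=0
    while [ "$runs" -lt "$max_runs" ]; do
        # When the command is an external program such as php, each run is a
        # separate process, so memory leaked while indexing one batch is
        # released when that process exits.
        count=$("$@") || return 1
        # Stop as soon as a run reports that nothing was left to index
        # (or prints something that is not a positive number).
        [ "$count" -gt 0 ] 2>/dev/null || break
        total=$((total + count))
        runs=$((runs + 1))
    done
    echo "$total"
}
```

For example, `run_batches 50 ./index_one_batch.sh` would call the (hypothetical) `./index_one_batch.sh` up to 50 times. The run cap is there so that a batch command that never reports zero cannot loop forever.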
Comment 2 Jacob Sanford 2013-02-20 05:59:22 PST
>> I'm wondering whether large OJS installations with several journals
>> never had memory problems when rebuilding their index. The current
>> implementation must use even more memory. Anyone heard of difficulties
>> there?

That's us! 

We're on OJS 2.3.8 and have just completed a server migration which required flushing the search index tables before doing the SQL dump. Rebuilding on the new server was challenging due to memory issues.

JS
Comment 3 Jacob Sanford 2013-02-20 11:31:49 PST
To note, however, it wasn't insurmountable or anything. The memory usage maxed out at 253MB on the rebuild.
Comment 4 Alec Smecher 2013-02-20 11:43:08 PST
Thanks, Jake, I was about to ask how much you needed to allocate. I'd say 256MB is high but not impossible -- ergo this bug remains open but is not a top priority.
Comment 5 Alec Smecher 2013-03-25 08:51:48 PDT
*** Bug 8181 has been marked as a duplicate of this bug. ***
Comment 6 Shubhash Wasti 2013-03-25 09:11:46 PDT
Alec, some workarounds to at least complete the indexing would be nice, e.g. maybe indexing in chunks rather than all articles at once. I cannot currently afford to allocate more than 6GB for PHP to complete the indexing and without it the search function is basically broken.
Comment 7 Alec Smecher 2013-03-25 09:17:51 PDT
Shubhash, does your installation contain many journals? You can specify a particular journal path to index on the command line; if you index each journal in turn then you could well keep the memory usage low enough.
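
A minimal sketch of indexing each journal in turn, as suggested above. It assumes the journal path is passed as a positional argument to tools/rebuildSearchIndex.php and that the script is run from the OJS installation directory; check the tool's usage message for your version. The function name and the `PHP_BIN` variable are illustrative, not part of OJS.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: tools/rebuildSearchIndex.php accepts a journal
# path as its positional argument; verify this against your OJS version.
# PHP_BIN is an illustrative variable naming the PHP binary (default: php).

# Usage: rebuild_each_journal JOURNAL_PATH...
rebuild_each_journal() {
    for journal_path in "$@"; do
        echo "Rebuilding index for journal: $journal_path"
        # One PHP process per journal, so memory used while indexing one
        # journal is freed before the next journal starts.
        "${PHP_BIN:-php}" tools/rebuildSearchIndex.php "$journal_path" || return 1
    done
}
```

For example, `rebuild_each_journal journal-a journal-b`, where the two journal paths are placeholders. This only helps installations with several journals; with a single journal it is equivalent to one full rebuild.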
Comment 8 Shubhash Wasti 2013-03-25 09:25:51 PDT
We only have one journal in this OJS instance, and this journal has 742 published articles. I don't think this number is unreasonably high.

Is there a way to specify something like "index article numbers from 1 to 50", or something like that? That way, I could repeat the process 15 times, and be done with it.
Comment 9 Alec Smecher 2013-03-25 10:07:40 PDT
Shubhash, another alternative would be to try a newer PHP -- it's pretty amazing that 742 articles tried to eat >6GB. You don't need to upgrade your server's main PHP to do this, either: download a source tarball, unpack, compile, and then explicitly call that PHP (e.g. from wherever you compiled it in your home directory) when you're running the rebuild from the command line. It would be an interesting data point for us to add to the problem.
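
A minimal sketch of calling a separately built PHP binary explicitly for the rebuild, assuming it is run from the OJS installation directory. The function name, the binary location, and the memory value in the usage line are placeholders; `-d` is the PHP CLI option for setting an INI directive for a single run.

```shell
#!/bin/sh
# Sketch only. The binary path and memory limit are supplied by the caller;
# any concrete values shown in the usage note are placeholders.

# Usage: rebuild_with_php PHP_BINARY MEMORY_LIMIT [EXTRA_ARGS...]
rebuild_with_php() {
    php_bin=$1
    mem_limit=$2
    shift 2
    # Refuse to continue if the given binary is missing or not executable.
    if [ ! -x "$php_bin" ]; then
        echo "not executable: $php_bin" >&2
        return 1
    fi
    # -d overrides memory_limit for this invocation only; the server's main
    # PHP configuration is left untouched.
    "$php_bin" -d memory_limit="$mem_limit" tools/rebuildSearchIndex.php "$@"
}
```

For example, `rebuild_with_php "$HOME/php-build/bin/php" 512M`, where `$HOME/php-build` stands for wherever the newer PHP was compiled.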
Comment 10 Shubhash Wasti 2013-03-25 14:24:32 PDT
For now, as I couldn't spend a lot of time compiling and trying a new version of PHP, I simply disabled the indexing of PDF files. Pdftotext was generating lots of errors anyway, and the PDF copies are supposed to be identical copies of the HTML versions, so I believe nothing is lost in doing so.

After disabling PDF indexing, rebuildSearchIndex completed with a peak memory usage of 3.2GB.
Comment 11 Alec Smecher 2013-07-19 14:32:57 PDT
FYI, this may be resolved by the patch at:
http://pkp.sfu.ca/support/forum/viewtopic.php?f=8&t=10058
Comment 12 Renato Mendes 2013-08-19 20:00:10 PDT
(In reply to comment #7)
> Shubhash, does your installation contain many journals? You can specify a
> particular journal path to index on the command line; if you index each
> journal in turn then you could well keep the memory usage low enough.

Alec,

What is the command for doing what you mentioned? Is it possible to index a specific conference in the case of OCS?

Thanks,
Renato
Comment 13 Alec Smecher 2013-08-26 13:48:14 PDT
Renato, unfortunately OCS doesn't currently support that option. Have you tried the patch to resolve the memory leak instead?
Comment 14 Renato Mendes 2013-08-26 14:05:38 PDT
Alec, do you mean to apply this patch https://github.com/pkp/pkp-lib/commit/06bc11f5ecfb25ceffd612a3e7c8aef193c21903.diff ?
Comment 15 beghelli 2013-08-26 14:39:26 PDT
(In reply to comment #14)
> Alec, do you mean to apply this patch
> https://github.com/pkp/pkp-lib/commit/06bc11f5ecfb25ceffd612a3e7c8aef193c21903.diff ?

Hi Renato,

I am from the development team also, and yes, this patch should fix the memory leak.
Comment 16 Renato Mendes 2013-08-26 15:13:02 PDT
(In reply to comment #15)
> I am from the development team also, and yes, this patch should fix the
> memory leak.

I am not sure whether this patch applies to OCS. I added the lines to

ocs/lib/pkp/classes/plugins/HookRegistry.inc.php

and the server then returned these errors:

$ php rebuildSearchIndex.php 
PHP Parse error:  syntax error, unexpected T_FUNCTION, expecting T_VARIABLE in ..ocs/lib/pkp/classes/plugins/HookRegistry.inc.php on line 109

PHP Parse error:  syntax error, unexpected T_FUNCTION, expecting T_VARIABLE in ..ocs/lib/pkp/classes/plugins/HookRegistry.inc.php on line 116

Also, I couldn't find the setUp() or tearDown() functions in

classes/core/PKPRequest.php and
classes/core/PKPRouterCase.inc.php
Comment 17 beghelli 2013-08-27 08:02:21 PDT
Renato, what's your OCS version? Thanks.
Comment 18 Renato Mendes 2013-08-27 18:25:31 PDT
(In reply to comment #17)
> Renato, what's your OCS version? Thanks.

OCS 2.3.5.0 running on PHP 5.2.17 and MySQL 5.1.54-rel12.6-log
Comment 19 beghelli 2013-08-28 08:11:01 PDT
Renato, I double-checked and OCS 2.3.5 does not have the memory leak problem, at least not the one that this patch fixes. So you are correct, this patch will not apply to OCS.

Do you have memory leak problems in OCS? If so, when?
Comment 20 Renato Mendes 2013-08-28 12:11:36 PDT
(In reply to comment #19)
> Do you have memory leak problems in OCS? If so, when?

Yes, when trying to rebuild the search index from the CLI: $ php tools/rebuildSearchIndex.php

My OCS hosts about 19 conferences and an author complained about incomplete search results.