Searching full text - pdf


Postby nef » Mon Mar 14, 2011 4:51 am

Hi
For several months now we have not been able to run a full-text search of our PDF files. Do you have any suggestions for what to do?
Thank you in advance.
Niels Erik
nef
 
Posts: 235
Joined: Fri Jun 01, 2007 2:56 am
Location: Aarhus, Denmark

Re: Searching full text - pdf

Postby asmecher » Mon Mar 14, 2011 8:36 am

Hi Niels Erik,

What do you have in your config.inc.php in your [search] section for PDF indexing? If you have a tool configured there, but it's not working, you could try running a few articles through the same tool on the command line to see whether or not the tool is the problem or OJS isn't indexing the results successfully.
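The command-line check suggested above could be sketched roughly as follows. This is only an illustration: `run_index_command` and `extracted_word_count` are hypothetical helper names, and the sketch assumes that `%s` in the configured command stands for the path of the file to be indexed.

```shell
#!/bin/sh
# Hypothetical helper (not part of OJS): run an index command template,
# written the way it appears in the [search] section, against one file.
# $1 = command template containing %s, $2 = path of the file to test.
run_index_command() {
    template=$1
    file=$2
    case $template in
        *%s*) ;;
        *) echo "template has no %s placeholder" >&2; return 2 ;;
    esac
    # Split the template around the first %s.
    prefix=${template%%\%s*}
    suffix=${template#*%s}
    # Pass the file as a positional argument so odd file names stay intact.
    sh -c "${prefix}\"\$1\"${suffix}" sh "$file"
}

# Hypothetical helper: count the words the tool extracted; 0 suggests
# the tool itself is the problem rather than the indexer.
extracted_word_count() {
    run_index_command "$1" "$2" | wc -w | tr -d ' '
}
```

For example, `extracted_word_count "/usr/bin/pdftotext %s -" article.pdf` (with `article.pdf` standing in for a real galley file) would print the number of words extracted. A non-zero count suggests the tool works and the problem lies in how or when the indexer is called.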

Regards,
Alec Smecher
Public Knowledge Project Team
asmecher
 
Posts: 8674
Joined: Wed Aug 10, 2005 12:56 pm

Re: Searching full text - pdf

Postby tgc99 » Tue Mar 15, 2011 2:08 am

asmecher wrote:Hi Niels Erik,
What do you have in your config.inc.php in your [search] section for PDF indexing? If you have a tool configured there, but it's not working, you could try running a few articles through the same tool on the command line to see whether or not the tool is the problem or OJS isn't indexing the results successfully.

config.inc.php contains:
Code:
index[application/pdf] = "/usr/bin/pdftotext %s -"

pdftotext works, and running tools/rebuildSearchIndex.php works: all articles are indexed.
The problem is that new articles are not indexed when uploaded.
I added a wrapper around pdftotext that writes a log entry every time it's called, but no logs are generated, which indicates that the indexer is never called when an article is uploaded.
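A logging wrapper of the kind described here might look something like the sketch below. The actual wrapper was not posted, so the function name `logged_run`, the log format, and the paths in the usage note are all assumptions.

```shell
#!/bin/sh
# Hypothetical wrapper (the original was not posted): record each
# invocation in a log file, then run the real tool unchanged.
# $1 = log file, $2 = real tool, remaining arguments are passed through.
logged_run() {
    log=$1
    tool=$2
    shift 2
    # One line per call: timestamp, tool, and the arguments it received.
    printf '%s called: %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$tool" "$*" >> "$log"
    "$tool" "$@"
}
```

To use it, a small script whose body is this function followed by `logged_run /tmp/pdftotext-calls.log /usr/bin/pdftotext "$@"` could be named in the `index[application/pdf]` setting in place of pdftotext (the log path is an arbitrary choice). An empty log after an upload would then indicate that the indexer never invoked the tool.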

We are testing this on our 2.3.4 test/translation instance but our current 2.2.4 production instance has the same issue. Both are running on RHEL 5.

-tgc
tgc99
 
Posts: 56
Joined: Thu Oct 18, 2007 3:50 am
Location: Aarhus, Denmark

Re: Searching full text - pdf

Postby asmecher » Tue Mar 15, 2011 8:41 am

Hi tgc,

Looks good so far. The indexing should occur when Step 5 of the submission process is confirmed. Look in classes/author/form/submit/AuthorSubmitStep5Form.inc.php for the following lines:
Code:
                // Update search index
                import('classes.search.ArticleSearchIndex');
                ArticleSearchIndex::indexArticleMetadata($article);
                ArticleSearchIndex::indexArticleFiles($article);
If you suspect that's not being executed, try throwing in some debugging output to double-check.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: Searching full text - pdf

Postby tgc99 » Wed Jul 13, 2011 6:44 am

asmecher wrote:Hi tgc,

Looks good so far. The indexing should occur when Step 5 of the submission process is confirmed. Look in classes/author/form/submit/AuthorSubmitStep5Form.inc.php for the following lines:
Code:
                // Update search index
                import('classes.search.ArticleSearchIndex');
                ArticleSearchIndex::indexArticleMetadata($article);
                ArticleSearchIndex::indexArticleFiles($article);

If you suspect that's not being executed, try throwing in some debugging output to double-check.

I'm only now digging into this but the result is surprising.

Those functions you've pointed out are indeed called when step 5 of the submission process is confirmed.
Looking at the inner workings of the indexArticleFiles function reveals a call to ArticleSearchIndex::updateFileIndex for each galley file.
To get the list of galley files this code is executed in indexArticleFiles:
Code:
// Index galley files
$fileDao =& DAORegistry::getDAO('ArticleGalleyDAO');
$files =& $fileDao->getGalleysByArticle($article->getId());

This is where things start to go wrong since $files comes out empty and no call is ever made to ArticleSearchIndex::updateFileIndex.

Looking at getGalleysByArticle in classes/article/ArticleGalleyDAO.inc.php this code extracts the list of galleys:
Code:
               $result =& $this->retrieve(
                        'SELECT g.*,
                        a.file_name, a.original_file_name, a.type, a.file_type, a.file_size, a.date_uploaded, a.date_modified
                        FROM article_galleys g
                        LEFT JOIN article_files a ON (g.file_id = a.file_id)
                        WHERE g.article_id = ? ORDER BY g.seq',
                        $articleId
                );

If I'm reading this right, that can only work if the article exists in the 'article_galleys' table, but in my experiments that did not happen when the article was submitted.
For me, the row only appears when the article is published by going into Editor mode and adding it to a journal issue (I did not try other workflows; this is about all I know how to do).
I added debugging inside ArticleSearchIndex::updateFileIndex to catch it being called from somewhere else, but it is not triggered when I go into Editor mode and publish the article.

Any advice on how to proceed would be appreciated.

-tgc

Re: Searching full text - pdf

Postby asmecher » Wed Jul 13, 2011 9:56 am

Hi tgc,

Just to make sure I know where to start looking -- the problem is that PDF fulltext isn't being indexed as it's added, correct? Running the indexing tool (tools/rebuildSearchIndex.php) does correct the problem temporarily (i.e. until new content is added), correct?

Regards,
Alec Smecher
Public Knowledge Project Team

Re: Searching full text - pdf

Postby tgc99 » Wed Jul 13, 2011 11:16 pm

asmecher wrote:Hi tgc,
Just to make sure I know where to start looking -- the problem is that PDF fulltext isn't being indexed as it's added, correct? Running the indexing tool (tools/rebuildSearchIndex.php) does correct the problem temporarily (i.e. until new content is added), correct?

Exactly.

My debug session was done with OJS 2.3.6 but it is also a problem for our OJS 2.2.4 instance (soon to be upgraded, finally!).
I've attached my debug patch so you can see where I've looked. I had show_stacktrace on and display_errors on.

-tgc
Attachment: debug.diff (4 KiB)

Re: Searching full text - pdf

Postby asmecher » Thu Jul 14, 2011 9:03 am

Hi tgc,

The call you're looking for is probably the call to ArticleSearchIndex::updateFileIndex in classes/submission/form/ArticleGalleyForm.inc.php -- I see a case or two where it's not being called and should be. Could you try the patch at http://pkp.sfu.ca/bugzilla/show_bug.cgi?id=6765 and confirm whether it solves the problem?

Thanks,
Alec Smecher
Public Knowledge Project Team

Re: Searching full text - pdf

Postby tgc99 » Fri Jul 15, 2011 1:16 am

It does not solve the problem.
I do not know how or where in the UI this form is activated, but I do not believe I ever touch it in my test workflow.

What I do is complete the submission as author, then follow the link given into Home > User > Editor > Submissions > #YY > Editing, and in the 'Scheduling' part I 'post' it to a journal issue.
As soon as that is done, the article turns up in the article_galleys table and in the index for the journal, and will be indexed by tools/rebuildSearchIndex.php.

-tgc

Re: Searching full text - pdf

Postby asmecher » Fri Jul 15, 2011 9:10 am

Hi tgc,

Ah, I see. How are you uploading the PDF? Are you using the "expedite" link, which is available at the end of the submission process to Authors who are also Editors? The modification I linked above applies to the Upload Galley form, which is available to Editors and Layout Editors when they upload a Galley under the Editing section.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: Searching full text - pdf

Postby tgc99 » Tue Jul 26, 2011 12:41 am

My test workflow is as already described, and the user I'm logged in as is the equivalent of a site admin.
I follow the 'Click Here' link at the end of the submission process ("I follow the link given"), which I assume is what you mean by the "expedite" link.

As I said earlier this is about all I know how to do and after poking around a bit I'm not able to figure out where and how to try out what you describe.

-tgc

Re: Searching full text - pdf

Postby asmecher » Tue Aug 02, 2011 10:16 am

Hi tgc,

Yes, that answers my question -- the "expedite" process is the same as the "click here" button you're talking about. See the patch at http://pkp.sfu.ca/bugzilla/show_bug.cgi?id=6765 mentioned in comment #3 -- that should add the indexing that you're missing. Please give it a try and report back here.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: Searching full text - pdf

Postby tgc99 » Tue Aug 30, 2011 6:16 am

asmecher wrote:Yes, that answers my question -- the "expedite" process is the same as the "click here" button you're talking about. See the patch at http://pkp.sfu.ca/bugzilla/show_bug.cgi?id=6765 mentioned in comment #3 -- that should add the indexing that you're missing. Please give it a try and report back here.

I've finally had time to look at this again and I can verify that it does what I expect.
I cannot test your first patch, but I've asked Niels Erik to have a look at the workflow you describe, and hopefully he will also be able to verify that some of the other possible workflows are okay.

-tgc

Re: Searching full text - pdf

Postby nevermind182004 » Tue Sep 06, 2011 7:35 pm

Hello Alec,

I have a similar problem to the thread starter: I'm unable to search the full text of articles.

This is also set in our config file:
index[application/pdf] = "/usr/bin/pdftotext %s -"

Given the tests and solutions above, can I apply the same patches to our version of OJS? Our OJS version is 2.3.1.

Thanks!
Rye
nevermind182004
 
Posts: 86
Joined: Mon Apr 20, 2009 6:02 pm

Re: Searching full text - pdf

Postby asmecher » Wed Sep 07, 2011 10:11 am

Hi Rye,

Dumb question, but have you checked that the referenced command line tool is installed and in the right place on your server?

You might want to test out that part of the configuration by running tools/rebuildSearchIndex.php; this will cause the index to be fully rebuilt and should pick up your PDFs. If that works, then the patch is the next thing to try, as it'll cause PDFs to be indexed upon upload in one place where that call was omitted. If the rebuild script doesn't work, you'll have to look more closely at your PDF indexing tools.
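The first of those checks, whether the configured tool is installed where the config says it is, could be sketched as below. `index_tool_is_executable` is a hypothetical helper name, and the sketch assumes the tool path is the first word of the configured command and contains no spaces.

```shell
#!/bin/sh
# Hypothetical check (not part of OJS): is the first word of the
# configured index command an existing, executable file?
# $1 = the command as configured, e.g. "/usr/bin/pdftotext %s -".
index_tool_is_executable() {
    tool=${1%% *}   # keep everything before the first space
    [ -f "$tool" ] && [ -x "$tool" ]
}
```

If `index_tool_is_executable "/usr/bin/pdftotext %s -"` succeeds, the next step would be to run `php tools/rebuildSearchIndex.php` from the OJS installation directory and then search for a phrase that appears only in a PDF.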

Regards,
Alec Smecher
Public Knowledge Project Team
