indexing PDFs and a "hidden" galley

Post by shimrah » Mon Jul 30, 2007 9:39 am

Hey there...

We have journal archives that need to be added to the OJS site: we have PDFs for newer issues, and image-only (scanned) PDFs for older ones. However, we need those articles to be full-text searchable. So I have two questions:

1) We are now trying to get PDF search working, as per viewtopic.php?t=1119:
- pdftotext.exe was added to C:\PHP\extras, and we tested that it works fine;
- this line was added to the [search] section of config.inc.php: index[application/pdf] = "C:/PHP/extras/pdftotext.exe %s -"

Now we are trying to run
php tools/rebuildSearchIndex.php
We receive no error messages; it just flashes by in a second. Searching the PDFs does not work.
How can we debug this script? Where can we look at the results of the text indexing?

php -v gives this info:
PHP 5.1.4 (cli) (built: May 4 2006 10:35:22)
Copyright (c) 1997-2006 The PHP Group
Zend Engine v2.1.0, Copyright (c) 1998-2006 Zend Technologies

2) We would like to index our image-only PDFs without going through the trouble of creating HTML versions. Is it possible to either a) add an OCRed plain text file for the purpose of creating an index, but flag it so that it is hidden from users (a "hidden" type for this text galley), or b) generate an index for the article and insert it manually?
Another possibility... if a text galley is added and indexed, then the galley is deleted, will the index also be deleted? Is there a way to force it to persist?

Thanks!
Shimrah

Re: indexing PDFs and a "hidden" galley

Post by asmecher » Mon Jul 30, 2007 10:10 am

Hi Shimrah,

First, make sure that your PHP CLI is configured to display errors if any occur. You might need to run a <?php phpinfo() ?> script to find out which php.ini configuration file is being used, and it's often a different file than the web server PHP module uses. Make sure display_errors, error_reporting and display_startup_errors are configured to show all error messages.
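A quick way to check these settings from the command line is a small script like the one below. It is only a sketch: the file name cli_check.php and the helper describe_cli_settings() are made up for illustration. Note that php_ini_loaded_file() needs PHP 5.2.4 or later; on older versions, the output of phpinfo() (or php -i) shows the same information.

```php
<?php
// cli_check.php (hypothetical name). Run it with the same php binary and the
// same -c option that you use for the OJS tools, e.g.:  php cli_check.php

// Collect the settings discussed above so they are easy to print and compare.
function describe_cli_settings() {
    return array(
        // Path of the php.ini that was actually loaded (false if none).
        'php.ini'                => function_exists('php_ini_loaded_file')
                                        ? php_ini_loaded_file()
                                        : 'unknown (needs PHP 5.2.4+)',
        'display_errors'         => ini_get('display_errors'),
        'display_startup_errors' => ini_get('display_startup_errors'),
        // Current error_reporting level as an integer bitmask.
        'error_reporting'        => error_reporting(),
    );
}

foreach (describe_cli_settings() as $name => $value) {
    echo str_pad($name, 24) . var_export($value, true) . "\n";
}
```

If the reported php.ini is not the file you edited, that would explain why errors stay hidden.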

If the indexing tool is still running without any signs of errors but your index still isn't getting created, try running the PDF text extraction tool (e.g. pdftotext.exe) by hand on one of your PDFs to make sure e.g. that it's compatible with the version of PDF files you're using.
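The same check can be scripted so that it uses the command template from config.inc.php. This is a rough sketch, not the code OJS itself runs: build_extract_command() and extract_text() are hypothetical helpers, and it assumes that %s is simply replaced by the quoted file path.

```php
<?php
// Build the extraction command from a config-style template.
// ASSUMPTION: %s is replaced by the shell-quoted file path; the real OJS
// indexing code may substitute it differently.
function build_extract_command($template, $pdfPath) {
    return sprintf($template, escapeshellarg($pdfPath));
}

// Run the command and return the extracted text, or false if it failed.
function extract_text($template, $pdfPath) {
    $output = array();
    $status = 0;
    exec(build_extract_command($template, $pdfPath), $output, $status);
    return $status === 0 ? implode("\n", $output) : false;
}

// Usage from the command line:  php extract_check.php some-article.pdf
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    // Template copied from the [search] section quoted earlier in this thread.
    $text = extract_text('C:/PHP/extras/pdftotext.exe %s -', $argv[1]);
    if ($text === false) {
        echo "Extraction command failed\n";
    } else {
        echo strlen($text) . " bytes of text extracted\n";
    }
}
```

If this reports zero bytes for a scanned PDF, the file probably contains images only and has no text layer to index.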

As for making scanned PDFs searchable, we've had success with Adobe Acrobat Professional's OCR technology. It creates OCR'd text within the PDF that can be used to perform searching and highlighting within Acrobat Reader, and OJS can use this to index the PDFs as well via text extraction tools like pdftotext.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: indexing PDFs and a "hidden" galley

Post by GennadyK » Tue Jul 31, 2007 12:51 pm

Thank you, Alec

>First, make sure that your PHP CLI is configured to display errors if any occur. You might need to run a <?php phpinfo() ?> script to find out which php.ini configuration file is being used, and it's often a different file than the web server PHP module uses. Make sure display_errors, error_reporting and display_startup_errors are configured to show all error messages.

All of this was done, but we could not get any error message until we started inserting debug code (e.g. echo "rebuildSearchIndex.php 0\n";).
Then we found that the problem was in this line of tools\includes\cliTool.inc.php:
require('includes/driver.inc.php');

It doesn't like the relative path. Everything started working when it was changed to a full path:
require(dirname(dirname(dirname(__FILE__))) . '/includes/driver.inc.php');

Please fix this bug so that others do not run into it.
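To show what the nested dirname() calls compute, here is a small sketch. The install path /srv/ojs and the helper base_dir_for() are made up for illustration; the point is only that the result depends on the location of the file, not on the directory the tool is started from.

```php
<?php
// Return the directory three levels above the given file. For
// <base>/tools/includes/cliTool.inc.php this is <base>.
function base_dir_for($file) {
    return dirname(dirname(dirname($file)));
}

// Hypothetical install location, used only to show the path arithmetic.
$tool = '/srv/ojs/tools/includes/cliTool.inc.php';

// In the real file, __FILE__ takes the place of $tool.
echo base_dir_for($tool) . "/includes/driver.inc.php\n";
// prints "/srv/ojs/includes/driver.inc.php"
```

A relative path such as 'includes/driver.inc.php' is instead looked up through include_path and the current working directory, so it can fail when either of those is not what the script expects.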

Also, thank you for the tip about Adobe Acrobat Professional's OCR technology. It works great!

Regards,
Gennady

Re: indexing PDFs and a "hidden" galley

Post by asmecher » Tue Jul 31, 2007 1:48 pm

Hi Gennady,

Odd -- the chdir code in the top of cliTool.inc.php should've corrected any problems with relative paths. Are you configured to use safe mode or open_basedir? What directory were you running the tools from? open_basedir may interfere with the command-line tools.
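A short script along these lines can report the values asked about. It is a sketch with hypothetical helper names (path_environment(), visible_from_cwd()). It leaves out safe_mode, which was removed in PHP 5.4 and can be read with ini_get('safe_mode') on older versions.

```php
<?php
// Report the settings that affect how a relative require() is resolved.
function path_environment() {
    return array(
        'cwd'          => getcwd(),
        'include_path' => get_include_path(),
        // Empty string means no open_basedir restriction is configured.
        'open_basedir' => ini_get('open_basedir'),
    );
}

// Check whether a relative path exists under the current working directory.
function visible_from_cwd($relativePath) {
    return file_exists(getcwd() . DIRECTORY_SEPARATOR . $relativePath);
}

foreach (path_environment() as $name => $value) {
    echo str_pad($name, 14) . var_export($value, true) . "\n";
}

// The file named in this thread; expected to be found when run from the OJS base directory.
echo 'driver found: ' . var_export(visible_from_cwd('includes/driver.inc.php'), true) . "\n";
```

Running it once from the OJS base directory and once from tools/ shows whether the working directory makes a difference.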

Regards,
Alec Smecher
Public Knowledge Project Team

Re: indexing PDFs and a "hidden" galley

Post by GennadyK » Tue Jul 31, 2007 2:05 pm

php_cli.ini:
safe_mode = Off
;open_basedir =

C:\OJS> c:\php\php -c c:\php\php_cli.ini -f tools/rebuildSearchIndex.php

I found this and then tried an absolute path:
http://lists.evolt.org/archive/Week-of- ... 72150.html

Re: indexing PDFs and a "hidden" galley

Post by asmecher » Tue Jul 31, 2007 2:20 pm

Hi GennadyK,

If you have a chance to investigate further, I'd be interested in finding out the cause. Relative paths are used almost exclusively in OJS -- e.g. the "import" function uses relative paths -- and even if the one problematic one is fixed, many more remain. If one works, all should work.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: indexing PDFs and a "hidden" galley

Post by GennadyK » Tue Jul 31, 2007 2:29 pm

We will look into it if a new problem comes up. As for this particular case, it could be specific to PHP CLI.
I think the proposed change will not break the code and should work for everyone.

Thank you

