Problem with Indexing PDFs

Forum rules
The Public Knowledge Project Support Forum is moving to http://forum.pkp.sfu.ca

This forum will be maintained permanently as an archived historical resource, but all new questions should be added to the new forum. Questions will no longer be monitored on this old forum after March 30, 2015.

Postby jamilj » Mon Jul 21, 2014 8:40 pm

Hello,

I would like to make sure that my PDFs are being indexed. I thought they were, because I had the following line uncommented in my config file:

Code:
index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"


However, full-text searching has definitely not been working (my site is several months old at this point). I tried rebuilding my search index and got the following errors:

Code:
Indexing "Monthly Review" ... Error: XObject 'Im0' is unknown
Error: Expected the optional content group list, but wasn't able to find it, or it isn't an Array
Error: Illegal entry in ToUnicode CMap
Error: Invalid Font Weight


There were about 100 errors. Most were "Invalid Font Weight", and about 20 percent were the "optional content group list" error. Full-text searching does appear to be working now. However, should I be concerned about the errors? Do they mean that certain PDFs were not indexed?

Finally, and most importantly, does having the PDFs indexed allow the PDF contents to be searched via the web? For example, when you reach a JSTOR PDF from a Google search, the search terms are highlighted. If this is not possible, does the Lucene plugin provide more options on this front? We want the text to be fully indexed by search engines, but we obviously do not want the PDFs to be directly downloadable.

Many thanks.
jamilj
 

Re: Problem with Indexing PDFs

Postby asmecher » Tue Jul 22, 2014 8:18 am

Hi jamilj,

Those messages come from the pdftotext tool; try running your PDFs through that tool manually to see what you get. Depending on your PDF creation toolchain the results will vary considerably.
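One way to do that manual check in bulk is sketched below. It is only an illustration, not part of OJS: it runs pdftotext with the same flags as the config line above, counts the words that came out, and tallies the messages printed on stderr. The helper names (summarize, check_pdf) are made up for this example, and the pdftotext path is assumed to match the one in the config.

```python
import subprocess
from collections import Counter


def summarize(stdout_text, stderr_text):
    """Summarize one pdftotext run: word count plus a tally of messages."""
    words = len(stdout_text.split())
    messages = Counter(
        line.strip() for line in stderr_text.splitlines() if line.strip()
    )
    return {"words": words, "messages": messages}


def check_pdf(path, pdftotext="/usr/bin/pdftotext"):
    """Run pdftotext on one file with the flags from the config line.

    The binary path is an assumption; adjust it to your system.
    """
    proc = subprocess.run(
        [pdftotext, "-enc", "UTF-8", "-nopgbrk", path, "-"],
        capture_output=True,
    )
    return summarize(
        proc.stdout.decode("utf-8", errors="replace"),
        proc.stderr.decode("utf-8", errors="replace"),
    )


# Example use (the file name is hypothetical):
#   report = check_pdf("article.pdf")
#   print(report["words"], dict(report["messages"]))
```

A file that prints warnings but still yields a healthy word count was most likely indexed; a file that yields zero words (for example a scanned, image-only PDF) is the one to look at more closely.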

OJS uses the text extracted by these tools to serve its own built-in search engine (or Lucene, if that is configured). Google uses its own tools to extract text from PDFs, so you'll have to ensure that your PDFs are accessible to the Google indexing service.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: Problem with Indexing PDFs

Postby jamilj » Wed Jul 23, 2014 9:55 am

Dear Alec,

That last point is my main question: how do you make sure that the text is available to Google's indexing service? I know that there is an option to use Google Docs to display PDFs. Is that one way?
jamilj
 

Re: Problem with Indexing PDFs

Postby asmecher » Wed Jul 23, 2014 9:59 am

Hi jamilj,

Google's workings are a little obscure outside of the machine itself, I'm afraid. As far as I know, it's sufficient just to make sure that your PDFs are accessible to the Google indexing service (i.e. not locked behind a password); beyond that you'd have to check with Google or someone who specializes in SEO. There may be interactions between normal Google searches and Google Scholar as well, though I don't know the details.
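As a rough self-check of the "accessible" part, something like the sketch below may help. It is not an OJS feature and says nothing about how Google actually ranks or displays PDFs. It does two things: asks for a URL with no login or cookies to see what an anonymous visitor gets, and tests the URL against robots.txt rules. The robots.txt check is an extra assumption on top of the password point above, and the function names and URLs are hypothetical.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def anonymous_status(url, timeout=10):
    """Request url without credentials; return (HTTP status, Content-Type).

    A 401 or 403 suggests the file is locked. A 200 with an HTML
    Content-Type may be a login page rather than the PDF itself.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Content-Type")


# Example use (the journal URL is hypothetical):
#   rules = "User-agent: *\nDisallow: /cache/\n"
#   print(crawler_allowed(rules, "http://example.org/index.php/myjournal/article/view/1/2"))
#   print(anonymous_status("http://example.org/index.php/myjournal/article/view/1/2"))
```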

Regards,
Alec Smecher
Public Knowledge Project Team

