Problem with Indexing PDFs

Post by jamilj » Mon Jul 21, 2014 8:40 pm

Hello,

I would like to make sure that my PDFs are being indexed and I thought that they were because I had the following line uncommented in my config file:

Code:
index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"


However, full text searching has definitely not been working (my site is several months old at this point). I went ahead and tried to rebuild my search index, and proceeded to get the following errors:

Code:
Indexing "Monthly Review" ... Error: XObject 'Im0' is unknown
Error: Expected the optional content group list, but wasn't able to find it, or it isn't an Array
Error: Illegal entry in ToUnicode CMap
Error: Invalid Font Weight


There were about 100 errors, mostly "Invalid Font Weight", with about 20 percent being the optional content group list error. Full text searching does appear to be working now. However, should I be concerned about the errors? Do they mean that certain PDFs were not indexed?

Finally, and most importantly, does having the PDFs indexed allow the PDF contents to be searched via the web, for example the way search terms are highlighted when you reach a JSTOR PDF from a Google search? If this is not possible, does the Lucene plugin provide more options on this front? We want the text to be fully indexed by search engines, but we obviously do not want the PDFs to be downloadable directly.

Many thanks.

Re: Problem with Indexing PDFs

Post by asmecher » Tue Jul 22, 2014 8:18 am

Hi jamilj,

Those messages come from the pdftotext tool; try running your PDFs through that tool manually to see what you get. Depending on your PDF creation toolchain the results will vary considerably.
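
As a rough illustration of running the tool by hand, here is a minimal Python sketch that calls pdftotext with the same options as the config line in the first post and tallies whatever it prints to stderr. The function names (`build_command`, `tally_messages`, `check_pdf`) are made up for this example and are not part of OJS. The control-character cleanup only approximates the `tr` step. A non-zero character count alongside warnings would suggest that text was still extracted, but that is something to confirm per file rather than assume.

```python
import subprocess
from collections import Counter

# Path copied from the config line in the first post; adjust for your system.
PDFTOTEXT = "/usr/bin/pdftotext"


def build_command(pdf_path):
    """Return the pdftotext argument list matching the config line (without the tr step)."""
    return [PDFTOTEXT, "-enc", "UTF-8", "-nopgbrk", pdf_path, "-"]


def tally_messages(stderr_text):
    """Count identical diagnostic lines so repeated messages are easy to spot."""
    lines = [line.strip() for line in stderr_text.splitlines() if line.strip()]
    return Counter(lines)


def check_pdf(pdf_path):
    """Run pdftotext on one file; return (characters extracted, message tally)."""
    result = subprocess.run(build_command(pdf_path), capture_output=True)
    text = result.stdout.decode("utf-8", errors="replace")
    # Approximation of tr '[:cntrl:]' ' ': swap ASCII control characters for spaces.
    cleaned = "".join(" " if ord(ch) < 32 or ord(ch) == 127 else ch for ch in text)
    messages = tally_messages(result.stderr.decode("utf-8", errors="replace"))
    return len(cleaned.strip()), messages
```

Calling `check_pdf("article.pdf")` (a hypothetical file name) on a few of the PDFs that produced messages during the rebuild gives a per-file view of how much text came out and which messages were repeated.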

OJS uses the text extracted by these tools to serve its own built-in search engine (or Lucene, if that is configured). Google uses its own tools to extract text from PDFs, so you'll have to ensure that your PDFs are accessible to the Google indexing service.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: Problem with Indexing PDFs

Post by jamilj » Wed Jul 23, 2014 9:55 am

Dear Alec,

That last point is my main question: how do you make sure that the text is available to Google's indexing service? I know that there is an option to use Google Docs to display PDFs. Is that one way?

Re: Problem with Indexing PDFs

Post by asmecher » Wed Jul 23, 2014 9:59 am

Hi jamilj,

Google's workings are a little obscure outside of the machine itself, I'm afraid. As far as I know, it's sufficient just to make sure that your PDFs are accessible to the Google indexing service (i.e. not locked behind a password); beyond that you'd have to check with Google or someone who specializes in SEO. There may be interactions between normal Google searches and Google Scholar as well, though I don't know the details.
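
One way to sanity-check the "not locked behind a password" condition is to request a PDF URL anonymously and look at what comes back. The sketch below is only that: a minimal check under the assumption that a 200 response declared as `application/pdf` means an unauthenticated client can reach the file. The function names are invented for this example, the URL you pass in is whatever your journal actually serves, and the result says nothing about how Google itself decides what to index.

```python
import urllib.error
import urllib.request


def is_publicly_fetchable(status, content_type):
    """True when an anonymous request got a 200 response declared as a PDF."""
    media_type = content_type.split(";")[0].strip().lower()
    return status == 200 and media_type == "application/pdf"


def probe(url, timeout=10):
    """Request headers for url without credentials; return (status, content type)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type", "")
    except urllib.error.HTTPError as err:
        # 401 and 403 responses arrive here: the "locked behind a password" case.
        return err.code, (err.headers or {}).get("Content-Type", "")
```

For example, `is_publicly_fetchable(*probe(pdf_url))` with one of your own galley URLs. A login page that answers with status 200 but an HTML content type is reported as not fetchable, which is the situation this check is meant to catch.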

Regards,
Alec Smecher
Public Knowledge Project Team

