Source mods to index pdfs on shared host

Forum rules
The Public Knowledge Project Support Forum is moving to http://forum.pkp.sfu.ca

This forum will be maintained permanently as an archived historical resource, but all new questions should be added to the new forum. Questions will no longer be monitored on this old forum after March 30, 2015.

Postby smitech » Mon Jan 15, 2007 10:46 am

Our journal is hosted on a shared host which has no provision for execution of native programs on the server machine. Articles are published in PDF format in the journal. The OJS software builds an index of uploaded galley files for the purpose of full text searching. It does this using native facilities for HTML galleys, but for PDFs and other file types it relies on the execution of native "helper" programs to extract the text from the galley for indexing. Since our shared host does not permit the execution of the helper programs our PDF galleys do not get indexed.

Our original solution to this problem was to include both HTML and PDF formats for each article. But preparing the HTML galleys proved too time-consuming, and our scientist authors were loath to have their articles available in any editable form; they wanted them always to appear exactly as written.

A possible but messy solution was to post the galleys to the journal as PDFs and then run the program pdftotext on the editor's (Windows) machine, using the -htmlmeta option to extract the text of the PDF into an HTML file. This HTML file would also be posted to the journal so that a full text index could be generated. This is all easy and quick to do. The messy part is that we would need explanatory text all over the site telling users not to use the HTML versions of the articles, as they are not meant for viewing. We deemed this unworkable.

Our final solution was an expedient modification of the OJS sources (i.e. a hack) to prohibit the display of any HTML galley with the original filename of NOVIEW.html. These HTML galleys are built on the editor's machine as described above, using pdftotext to extract the text from the PDF file into an HTML file named NOVIEW.html. This galley is uploaded to the journal to provide indexing for the article.
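For anyone who wants to script the extraction step on the editor's machine, here is a rough sketch in Python. It is not part of our setup, just an illustration: the function names and the output directory are made up for the example, and it assumes pdftotext is installed and on the PATH. The only things taken from the description above are the pdftotext program, its -htmlmeta option, and the NOVIEW.html filename.

```python
import shutil
import subprocess
from pathlib import Path

# The special filename that the template modification checks for.
NOVIEW_NAME = "NOVIEW.html"


def build_command(pdf_path, out_dir):
    """Return the pdftotext argument list (hypothetical helper).

    -htmlmeta makes pdftotext wrap the extracted text in a simple HTML
    document, which OJS can then index as an HTML galley.
    """
    out_file = Path(out_dir) / NOVIEW_NAME
    return ["pdftotext", "-htmlmeta", str(pdf_path), str(out_file)]


def extract_index_galley(pdf_path, out_dir):
    """Run pdftotext and return the path of the generated NOVIEW.html."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_command(pdf_path, out_dir)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[-1])
```

The resulting file still has to be uploaded by hand as an HTML galley; the sketch only covers the local conversion. Writing each article's output to its own directory keeps the fixed NOVIEW.html name from overwriting a previous article's file.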

We are aware that local modifications to the sources make it harder to take updates. These modifications are well contained and apply to three files: templates\issue\issue.tpl, templates\search\searchResults.tpl, and templates\article\article.tpl. The change is identical in all three files. The original code fragment looks like this:
Code:
{if $hasAccess}
      {foreach from=$article->getGalleys() item=galley name=galleyList}
         <a href="{url page="article" op="view" path=$article->getBestArticleId($currentJournal)|to_array:$galley->getGalleyId()}" class="file">{$galley->getLabel()|escape}</a>

It is replaced with this:
Code:
{if $hasAccess}
      {foreach from=$article->getGalleys() item=galley name=galleyList}
         {* SMITech - Don't allow display of HTML file used only to build an index. *}
         {if "NOVIEW.html" == $galley->getOriginalFileName()}
            {* Index-only galley: show a greyed-out label instead of a link. *}
            <span class="disabled"><u>{$galley->getLabel()|escape}</u></span>
         {else}
            <a href="{url page="article" op="view" path=$article->getBestArticleId($currentJournal)|to_array:$galley->getGalleyId()}" class="file">{$galley->getLabel()|escape}</a>
         {/if}

Note that this results in the HTML galley link being greyed out instead of just not appearing at all. This was done just so the editor would know that this article had an HTML galley uploaded just for the purpose of generating an index. You can see how this works on our site: http://rr.smitech.org.

This is not an ideal solution to the problem. I'm never really fond of "special filename" implementations, but it does meet all of our functional criteria. I have no idea how common PDF-only journals on shared hosts are, but if this is not a marginal case then I would suggest that a better integrated solution is worth considering.

Re: Source mods to index pdfs on shared host

Postby pashton » Sun Sep 02, 2007 6:40 am

This is a really good mod that solves a huge problem for me. However, the code you quote is not quite the same as what I have in these two pages, templates\search\searchResults.tpl and templates\article\article.tpl, in my version, which I believe is the latest. Have you updated this mod since the original post? On the page where the code was the same, it worked well. I also found that with CSS it was quite easy to make the 'HTML' link disappear altogether.

Alternatively, does anyone have any suggestions for people on shared hosts who cannot use pdftotext?

Any help appreciated.

