Lucene integration (text-extraction and stemming)


Post by libranto » Sat May 29, 2010 1:05 am

Hi dear contributors,

I need to integrate Lucene search capabilities into my journal management system. As you know, Lucene has a PHP port, but it is not enough to build a capable IR system; I think some core components are missing on the PHP side :( First, you need to extract text from various media formats (.doc, PDF, HTML, etc.). The second and very important issue is stemming for many languages (it improves retrieval performance by roughly 40% for small repositories). For these reasons you need either to use third-party tools or to port those tools to PHP, and I think the first solution is faster and easier to implement.

I want to develop a JAR file that takes some parameters (file path, language, status (insert, update or delete), journal ID, DC contents, etc.) from the console. First, the JAR will use Apache Tika (a text-extraction library that bundles many text and metadata extraction libraries, such as PDFBox) to extract text from the various media formats. Then it will integrate hard stemming algorithms for various languages. The JAR will also create and update the Lucene index. The programming language is not important: you just create an index and can then search it with other Lucene ports. I want to use the Lucene PHP port on the search side (again, the JAR is needed to stem the search words/tokens on the PHP side).
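A minimal sketch of the console interface and the index-time/query-time stemming symmetry described above, written in Python only to keep it short and dependency-free. The flag names, `build_parser`, `toy_stem`, and `analyze` are hypothetical, and the suffix table is a toy stand-in for a real per-language stemmer; Tika extraction and the Lucene index itself are not shown.

```python
import argparse

# Toy suffix list standing in for a real per-language stemmer (assumption).
TOY_SUFFIXES = {"en": ("ing", "ed", "es", "s")}


def toy_stem(token, language):
    """Strip the first matching suffix. A placeholder, not a real stemming algorithm."""
    token = token.lower()
    for suffix in TOY_SUFFIXES.get(language, ()):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text, language):
    """Tokenize on whitespace and stem each token."""
    return [toy_stem(token, language) for token in text.split()]


def build_parser():
    """Hypothetical console interface mirroring the parameters listed in the post."""
    parser = argparse.ArgumentParser(description="Hypothetical indexer console interface")
    parser.add_argument("--file", required=True, help="path of the file to index")
    parser.add_argument("--language", default="en", help="language used for stemming")
    parser.add_argument("--status", required=True, choices=["insert", "update", "delete"])
    parser.add_argument("--journal-id", type=int, required=True)
    return parser
```

The point of routing both documents and search words through the same `analyze` function is that a query only matches if its tokens are reduced to the same stems that were written to the index.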

But I have some problems. If I develop the core IR system in Java, some application hosting organizations (for example, some universities) will ban the exec function or JAR execution for security reasons! So I will write once, but not run everywhere :( I'm not a PHP expert; this will be my first experiment, so I need your knowledge. Are there other ways in PHP to extract text from various media formats, or a PHP library of stemming algorithms covering different languages? Or is there a free web service that handles these issues? Which tools do PKP developers use to avoid these problems?

Thanks for your time and consideration,


Re: Lucene integration (text-extraction and stemming)

Post by asmecher » Sat May 29, 2010 6:43 am

Hi libranto,

We do have plans to integrate with Lucene -- they are still in planning, but this is a definite goal. We do have some experience with the Zend Framework port of Lucene to PHP, but as you note, it's not currently complete enough to offer a full replacement for Lucene/SOLR.

Text extraction from media formats is already supported by OJS in the form of external tools. These are configured in the file by MIME type and generally work quite well.
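As a rough illustration of dispatching to an external tool by MIME type, here is a small Python sketch. The table, the `%s` placeholder convention, the tool choices, and `build_command` are assumptions made for the example; they are not the actual OJS configuration format or its defaults.

```python
import shlex

# Hypothetical MIME-type -> command template table; "%s" marks the file path.
# The tools named here are illustrative assumptions, not OJS defaults.
EXTRACTORS = {
    "application/pdf": "pdftotext %s -",
    "text/html": "html2text %s",
}


def build_command(mime_type, path):
    """Return the argument list for the configured extractor, or None if none is set."""
    template = EXTRACTORS.get(mime_type)
    if template is None:
        return None
    # Substitute the path as a single argument so spaces in file names are safe.
    return [path if part == "%s" else part for part in shlex.split(template)]
```

The resulting list could be handed to `subprocess.run(..., capture_output=True)` to collect the extracted text; building an argument list rather than a shell string avoids quoting problems with unusual file names.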

I'm not familiar with a good implementation of the more complex text manipulation operations such as stemming in PHP; I suspect Java-land libraries are the best way to go for these features.

Because of the difficulty in finding a host that will support both PHP and Java, we'll support these features as options for those who need them, but will also continue to support PHP-only basic indexing and searching features.
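One way to picture this optional-feature approach is a capability check with a fallback, sketched below in Python. The backend names and `choose_search_backend` are hypothetical; the only assumption about the host is that a usable Java runtime, if present, is discoverable on the executable search path.

```python
import shutil


def choose_search_backend(find=shutil.which):
    """Pick the Java-based backend when a Java runtime is found, else basic indexing."""
    return "lucene-java" if find("java") else "basic"
```

Passing the lookup function as a parameter keeps the decision easy to test on machines with or without Java installed.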

Alec Smecher
Public Knowledge Project Team
