Hi dear contributors,
I need to integrate Lucene search capabilities into my journal management system. As you know, Lucene has a PHP port, but on its own it is not enough to build an adequate IR system. I think some core components are missing on the PHP side.
First, you need to extract text from various media formats (.doc, .pdf, .html, etc.). The second and very important issue is stemming support for many languages (it improves retrieval performance by roughly 40% for small repositories). For these reasons you need either to use third-party tools or to port third-party tools to PHP, and I think the first option is faster and easier to implement.
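To show what I mean by stemming, here is a toy sketch in Java. It is not a real algorithm such as Porter or Snowball, it only strips a few English suffixes so that different forms of a word end up as the same index term. The class name ToyStemmer and the suffix list are my own assumptions for illustration.

```java
import java.util.Locale;

// Toy suffix stripper, for illustration only (NOT a real stemming algorithm).
final class ToyStemmer {

    // Hypothetical, English-only suffix list; the first matching entry wins.
    private static final String[] SUFFIXES = {"ing", "es", "ed", "s"};

    // Lower-cases the word and removes the first matching suffix,
    // as long as at least three characters of the word remain.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            if (w.endsWith(suffix) && stemLength >= 3) {
                return w.substring(0, stemLength);
            }
        }
        return w;
    }
}
```

With this sketch, "journals" and "journal" are both reduced to "journal", so a query for one form can match documents containing the other. A real system would need a proper stemmer for each language.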
I want to develop a jar file that takes some parameters (file path, language, status (insert, update or delete), journalid, DC contents, etc.) from the console. First, the jar file will use Apache Tika (a text-extraction library that bundles many text and metadata extraction libraries, such as PDFBox) to extract text from various media formats. Then it will integrate strong stemming algorithms for various languages. The jar file will also create and update the Lucene index. The programming language does not matter: you just create an index and can then search it with other Lucene ports. I want to use the Lucene PHP port on the search side (again, the jar file will be needed to stem the search words/tokens on the PHP side).
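As a rough sketch of the console interface I have in mind, the entry point of the jar could look something like this. The class name IndexerCli, the argument order and the Request holder are only my assumptions, nothing is implemented yet. The DC contents are left out to keep it short, and the Tika, stemming and Lucene steps are only marked with a comment.

```java
import java.util.Locale;

// Hypothetical entry point for the indexing jar (names and argument order are assumptions).
// Assumed usage: java -jar indexer.jar <filePath> <language> <status> <journalId>
final class IndexerCli {

    // The three index operations mentioned in the post.
    enum Status { INSERT, UPDATE, DELETE }

    // Immutable holder for the parsed command-line arguments.
    static final class Request {
        final String filePath;
        final String language;
        final Status status;
        final int journalId;

        Request(String filePath, String language, Status status, int journalId) {
            this.filePath = filePath;
            this.language = language;
            this.status = status;
            this.journalId = journalId;
        }
    }

    // Validates the raw arguments and converts them into a Request.
    // Throws IllegalArgumentException for a wrong argument count, an unknown
    // status, or a journal id that is not a number.
    static Request parse(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                    "usage: <filePath> <language> <status> <journalId>");
        }
        Status status = Status.valueOf(args[2].toUpperCase(Locale.ROOT));
        int journalId = Integer.parseInt(args[3]);
        return new Request(args[0], args[1].toLowerCase(Locale.ROOT), status, journalId);
    }

    public static void main(String[] args) {
        try {
            Request r = parse(args);
            // Placeholder: text extraction, stemming and the Lucene index
            // update for this request would be called from here.
            System.out.println(r.status + " " + r.filePath
                    + " [" + r.language + "] journal " + r.journalId);
        } catch (IllegalArgumentException e) {
            System.err.println("Invalid arguments: " + e.getMessage());
            System.exit(2);
        }
    }
}
```

Keeping the parsing in its own method means the same validation could be reused if the jar is later called from something other than the console.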
But I have a problem. If I develop the core IR system in Java, some hosting organizations (for example, some universities) will disable the exec function or forbid jar execution for security reasons! So I will write once, but not run everywhere.
I'm not a PHP expert; this will be my first attempt, so I need your knowledge. Are there other ways in PHP to extract text from various media formats, and is there a stemming library for PHP that covers different languages? Or is there a free web service that handles these issues? Which tools do PKP developers use to avoid these problems?
Thanks for your time and consideration,