
Diacritics and searching

Open Harvester Systems support questions and answers, bug reports, and development issues.


Postby Harold » Wed Feb 06, 2008 9:17 am


We have a lot of accented characters in our Harvester2 database. We notice that search is not "diacritics-neutral" - for example, if you search for Análisis you don't get hits for Analisis, and vice-versa. Where do we start to look to neutralise the special characters? Is it an indexing property, or can we do something to the search?

Posts: 17
Joined: Fri May 18, 2007 6:34 am

Re: Diacritics and searching

Postby asmecher » Wed Feb 06, 2008 4:38 pm

Hi Harold,

The easiest way to do this would be to use a postprocessor plugin that passes the data to be indexed through the iconv function (which is probably already installed on your system, but it's worth double-checking first), e.g.:
Code:
$noDiacritics = iconv("UTF-8", "ASCII//TRANSLIT", $withDiacritics);
There are no example postprocessor plugins that ship with the Harvester, but they are basically the same as preprocessor plugins, for which several examples exist; have a look in plugins/preprocessors.
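
A minimal sketch of what such a processing step might do with each field, written as a standalone function rather than as an actual Harvester plugin (the plugin interface is not shown in this thread, and the name foldWithIconv is hypothetical). It assumes the iconv extension is available. What TRANSLIT produces for a given character depends on the platform's iconv implementation and locale, so the output should be checked on the target system:

```php
<?php
// Hypothetical helper, not part of the Harvester code base.
// Tries to reduce UTF-8 text to plain ASCII before it is indexed.
function foldWithIconv(string $text): string
{
    // iconv() returns false when it cannot convert the input; the @
    // suppresses the notice some implementations raise in that case.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    // Assumption: if conversion fails, indexing the original text is
    // better than indexing nothing.
    return $ascii === false ? $text : $ascii;
}

// The same folding would need to be applied to the search term as well,
// so that the query and the indexed data are compared in the same form.
$indexedTerm = foldWithIconv('Análisis');
$queryTerm   = foldWithIconv('Analisis');
```

Plain ASCII input passes through unchanged, so applying the function to data that has no accents is harmless.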

Alec Smecher
Public Knowledge Project Team
Posts: 10015
Joined: Wed Aug 10, 2005 12:56 pm

Re: Diacritics and searching

Postby Harold » Tue Mar 11, 2008 5:19 am

Hi Alec,

Thanks for the reply. We do have iconv. What we don't have are sufficient developer skills to take this any further (...which is how we ended up using an off-the-shelf harvester - very good though Harvester2 is!)

This seems like a fundamental problem which will be faced by a number of harvesting applications, so I am hopeful that it has already been addressed. If anybody out there has a solution, or can get us any nearer to what we need, that would be very helpful.

Posts: 17
Joined: Fri May 18, 2007 6:34 am

Re: Diacritics and searching

Postby mj » Tue Mar 11, 2008 10:47 am

Hi Harold,

The issue you're encountering has become increasingly important as the PKP suite gains users who work with non-Latin character sets. We have gone through some effort to improve character handling in OJS and OCS, but due to the nature and purpose of indexing, it's arguably even more important in the Harvester. The issue comes down to the difference between how humans and computers see characters (and "words"), and the intrinsic challenges with transliteration.

For example, to a human, the words "Análisis" and "Analisis" seem essentially the same, but to a machine they are completely different strings. Moreover, to a machine they are no more alike than "Análisis" is to "Ξανά", "計算機科学", or "インポート". Further, while Western Latin characters can (relatively) easily be transliterated into unaccented characters and case-folded, this approach does not generalise to other scripts. For example, trying to transliterate the four strings above using iconv yields "An'alisis" for the first and empty strings for the other three. Romanization complicates the issue even further.
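
One way to sidestep the platform-dependent output is to replace iconv with an explicit lookup table. The sketch below is that swapped-in technique, not something the Harvester ships: the function name and the table are hypothetical, the table only covers a handful of precomposed Spanish letters, and text in other scripts is passed through untouched rather than emptied.

```php
<?php
// Hypothetical, deliberately incomplete table of precomposed accented
// letters and their unaccented equivalents. Letters written with separate
// combining marks are not covered.
const DIACRITIC_MAP = [
    'Á' => 'A', 'á' => 'a', 'É' => 'E', 'é' => 'e',
    'Í' => 'I', 'í' => 'i', 'Ó' => 'O', 'ó' => 'o',
    'Ú' => 'U', 'ú' => 'u', 'Ü' => 'U', 'ü' => 'u',
    'Ñ' => 'N', 'ñ' => 'n',
];

function foldWithMap(string $text): string
{
    // strtr() with an array replaces each key it finds and leaves every
    // other byte sequence alone, so Greek or Japanese text is unchanged.
    return strtr($text, DIACRITIC_MAP);
}

// Accented and unaccented spellings fold to the same string.
var_dump(foldWithMap('Análisis') === foldWithMap('Analisis')); // bool(true)

// Text outside the table comes back as it went in.
var_dump(foldWithMap('Ξανά') === 'Ξανά'); // bool(true)
```

The trade-off is that every language needs its own entries, and a table like this does nothing for searches that cross scripts, which is the harder part of the problem described here.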

We are constantly thinking about this issue and how to enhance the PKP suite to handle various character sets and provide better search capabilities. The short answer is that it's an extremely complicated problem: the main reason it's faced by so many applications (not just harvesters) is that it's very difficult to solve effectively. Rest assured that we are working on it as best we can.

Best regards,
Site Admin
Posts: 304
Joined: Fri Mar 26, 2004 9:32 am
Location: Toronto, Canada
