Diacritics and searching

Postby Harold » Wed Feb 06, 2008 9:17 am

Hi,

We have a lot of accented characters in our Harvester2 database. We notice that search is not "diacritics-neutral" - for example, if you search for Análisis you don't get hits for Analisis, and vice-versa. Where do we start to look to neutralise the special characters? Is it an indexing property, or can we do something to the search?

Thanks
Harold
 
Posts: 17
Joined: Fri May 18, 2007 6:34 am

Re: Diacritics and searching

Postby asmecher » Wed Feb 06, 2008 4:38 pm

Hi Harold,

The easiest way to do this would be to use a postprocessor plugin that passes the data to be indexed through the iconv function (which is probably already installed on your system, but it's worth double-checking first), e.g.:
Code:
$noDiacritics = iconv("UTF-8", "ASCII//TRANSLIT", $withDiacritics);
There are no example postprocessor plugins that ship with the Harvester, but they are basically the same as preprocessor plugins, for which several examples exist; have a look in plugins/preprocessors.
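As a rough sketch of the folding step on its own (this is not a Harvester plugin, and the function name foldDiacritics is hypothetical), the idea is to reduce accented Latin letters to their base letters before indexing. The sketch assumes PHP's intl extension for the Normalizer class and falls back to a deliberately small hand-written map when it is missing; how it would be wired into a postprocessor plugin is not shown here.

```php
<?php
// Hypothetical helper, not part of the Harvester: folds Latin diacritics
// so that accented and unaccented spellings index to the same term.
function foldDiacritics(string $text): string
{
    // Preferred path (assumes the intl extension is installed):
    // decompose, drop combining diacritical marks (U+0300 to U+036F),
    // then recompose whatever is left.
    if (class_exists('Normalizer')) {
        $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
        if (is_string($decomposed)) {
            $stripped = preg_replace('/[\x{0300}-\x{036F}]+/u', '', $decomposed);
            if (is_string($stripped)) {
                $recomposed = Normalizer::normalize($stripped, Normalizer::FORM_C);
                return is_string($recomposed) ? $recomposed : $stripped;
            }
        }
    }

    // Fallback: a small illustrative map only; extend it for your data.
    static $map = [
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'ñ' => 'n', 'ü' => 'u',
        'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U',
        'Ñ' => 'N', 'Ü' => 'U',
    ];
    return strtr($text, $map);
}
```

For searches to match, the same folding would need to be applied to both the indexed text and the user's query.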

Regards,
Alec Smecher
Public Knowledge Project Team
asmecher
 
Posts: 9205
Joined: Wed Aug 10, 2005 12:56 pm

Re: Diacritics and searching

Postby Harold » Tue Mar 11, 2008 5:19 am

Hi Alec,

Thanks for the reply. We do have iconv. What we don't have are sufficient developer skills to take this any further (...which is how we ended up using an off-the-shelf harvester - very good though Harvester2 is!)

This seems like a fundamental problem which will be faced by a number of harvesting applications, so I am hopeful that it has already been addressed. If anybody out there has a solution, or can get us any nearer to what we need, that would be very helpful.

Thanks.
Harold
 
Posts: 17
Joined: Fri May 18, 2007 6:34 am

Re: Diacritics and searching

Postby mj » Tue Mar 11, 2008 10:47 am

Hi Harold,

The issue you're encountering has become increasingly important as the PKP suite gains users who work with non-Latin character sets. We have gone to some effort to improve character handling in OJS and OCS, but given the nature and purpose of indexing, it's arguably even more important in the Harvester. The issue comes down to the difference between how humans and computers see characters (and "words"), and the intrinsic challenges of transliteration.

For example, to a human, the words "Análisis" and "Analisis" seem essentially the same, but to a machine, they are completely different. Moreover, to the machine they are no more or less different from each other than "Análisis" is from "Ξανά", "計算機科学", or "インポート". Further, while Western Latin characters can (relatively) easily be transliterated into non-accented characters and case-modified, this does not generalize to other scripts. For example, trying to transliterate the four strings above using iconv yields "An'alisis" for the first and empty strings for the other three. Romanization complicates the issue even further.
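One way to avoid losing non-Latin text to a failed transliteration is to attempt it only on strings that are entirely Latin script. The following is a minimal sketch under that assumption; the function names isLatinText and foldForIndex are hypothetical and not part of the Harvester, and the exact iconv output for accented input varies between systems.

```php
<?php
// Hypothetical guard: true only if every character is Latin script or
// script-neutral (digits, punctuation, spaces, combining marks).
function isLatinText(string $text): bool
{
    return preg_match('/^[\p{Latin}\p{Common}\p{Inherited}]*$/u', $text) === 1;
}

// Hypothetical wrapper: transliterate Latin text, pass everything else through.
function foldForIndex(string $text): string
{
    if (!isLatinText($text) || !function_exists('iconv')) {
        return $text; // leave Greek, CJK, kana, etc. untouched
    }
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    // Keep the original if iconv failed or returned nothing.
    return ($ascii === false || $ascii === '') ? $text : $ascii;
}
```

This only prevents data loss; it does nothing to make non-Latin text searchable across spelling variants, which is the harder problem.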

We are constantly thinking about how to enhance the PKP suite to handle various character sets and provide better search capabilities, but the short answer is that it's an extremely complicated problem. The main reason it's faced by so many applications (not just harvesters) is that it's very difficult to solve effectively. Rest assured that we are working on it as best we can.

Best regards,
mj
Site Admin
 
Posts: 304
Joined: Fri Mar 26, 2004 9:32 am
Location: Toronto, Canada

