by mj » Tue Mar 11, 2008 10:47 am
Hi Harold,
The issue you're encountering has become increasingly important as the PKP suite gains users who work with non-Latin character sets. We have put some effort into improving character handling in OJS and OCS, but given the nature and purpose of indexing, it is arguably even more important in the Harvester. It comes down to the difference between how humans and computers see characters (and "words"), and to the intrinsic challenges of transliteration.
For example, to a human, the words "Análisis" and "Analisis" seem essentially the same, but to a machine they are completely different strings. Moreover, to the machine they are essentially no more alike than any two of "Análisis", "Ξανά", "計算機科学", and "インポート". Further, while Western Latin characters can (relatively) easily be transliterated into unaccented characters and case-folded, this does not generalize to other scripts. For example, transliterating the four words above with iconv yields "An'alisis" for the first and empty strings for the other three. Romanization complicates the issue even further.
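To make the point concrete, here is a rough sketch in Python (not Harvester code, and not how the PKP suite handles this) of the usual accent-folding trick: decompose each character, drop the combining marks, and case-fold. The helper name `fold` is made up for illustration.

```python
import unicodedata

def fold(text: str) -> str:
    """Hypothetical helper: strip combining marks, then case-fold.

    This is only an illustration of accent folding; it is not a
    transliteration and it is not what the Harvester actually does.
    """
    # NFD splits "á" into "a" plus a separate combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks, so this drops them.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left and normalize case for comparison.
    return unicodedata.normalize("NFC", stripped).casefold()

samples = ["Análisis", "Analisis", "Ξανά", "計算機科学", "インポート"]
for word in samples:
    print(repr(word), "->", repr(fold(word)))
```

With this, "Análisis" and "Analisis" fold to the same string, so they would match in an index. The Greek word only loses its accent and is still Greek, and the kanji come through unchanged, so nothing here helps someone searching for those terms in Latin characters.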
We are constantly thinking about how to improve the PKP suite's handling of various character sets and its search capabilities. The short answer, though, is that this is an extremely complicated problem. So many applications (not just harvesters) face it precisely because it is very difficult to solve effectively. Rest assured that we are working on it as best we can.
Best regards,