chinese full text search


Postby swing » Mon Mar 23, 2009 4:03 am

hi,
i have a problem with the full text search in my chinese demo journal -- i have to admit that i don't know chinese, so i copy a word (i.e. the text between two spaces) from a journal article, but don't get any results :-( i am using ojs-2.2.2 and everything is set to utf-8.
i similarly tried the full text search of the journal 'Taiwanese Philosophical Investigation' (http://academic.nthu.edu.tw/eHSSTW/ojs/index.php), but without success :-(
any idea/hint/help?
thanks a lot!
swing
 

Re: chinese full text search

Postby mj » Fri Mar 27, 2009 1:32 pm

Hi Swing,

Using the search on the TPI site seems to work fine for me when I use cut-and-paste to do a full-text search. I would strongly suggest you double-check that:

  1. your browser is submitting forms in UTF-8 (although this usually isn't a problem)
  2. your config.inc.php settings are using utf8 as the connection_charset
  3. your database collation is set to a UTF-8 collation (utf8_general_ci or utf8_unicode_ci)
  4. all of your tables are set to the same collation as the database
  5. all of the columns in those tables are likewise set to the same collation
  6. the string you're looking for actually appears in the keyword_text column of the article_search_keyword_list table

If you're not seeing any results, the database is the most likely culprit (rather than OJS), and more often than not it's either a collation/encoding issue or a keyword that isn't getting indexed properly. If you find problems in the database, I'd recommend doing an advanced forum search for some SQL statements that will help you get everything into the right encoding.
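
To illustrate why the encoding checks above matter, here is a minimal Python sketch (not OJS code) of what happens when one layer reads UTF-8 bytes as Latin-1. The keyword is taken from later in this thread; the Latin-1 mismatch is an assumed example of a misconfigured connection charset, not a diagnosis of any particular installation.

```python
# A keyword as it might appear in an article.
keyword = "葛勞士山"

# The bytes that are actually sent or stored when everything is UTF-8.
utf8_bytes = keyword.encode("utf-8")

# Assumed failure mode: some layer interprets those bytes as Latin-1.
# This never raises an error, it just silently produces different text.
misread = utf8_bytes.decode("latin-1")

# When both sides agree on UTF-8, the keyword survives the round trip.
roundtrip = utf8_bytes.decode("utf-8")

print(len(keyword), "characters,", len(utf8_bytes), "bytes")
print("misread equals keyword:  ", misread == keyword)
print("roundtrip equals keyword:", roundtrip == keyword)
```

A search term that arrives intact will never match an index entry that was stored in the misread form, which is why every layer in the list has to agree on the encoding.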

Hope this helps,
MJ

Re: chinese full text search

Postby swing » Tue Mar 31, 2009 9:14 am

Dear MJ,
thanks a lot for your response!
I checked everything. The encoding seems to be OK, but I am not sure if the keywords are indexed properly.
I used the text from the Wikipedia article about Vancouver (http://zh.wikipedia.org/wiki/%E6%BA%AB% ... 5%E8%8F%AF) in my demo journal and there is, for example, the following entry in the table article_search_keyword_list:
"place)著名的「五帆」建築、葛勞士山(grouse"
Thus, if I search for "著名的「五帆」建築、葛勞士山" (which means Grouse Mountain, I suppose) I can't find anything -- I have to search for "place)著名的「五帆」建築、葛勞士山(grouse" to find it.
It is similar if I search on the TPI site. I have, for example, to search for "莊子哲學以「化」見長,「化」又建立在「氣」的基礎上,莊子是中國氣化哲學的奠基者。本文探討的重點不在" (first line) in http://academic.nthu.edu.tw/eHSSTW/ojs/ ... le/view/22 to find it -- if I search just for the first word "莊子哲學以「化」見長" there is no result.
But, as I said, I don't know Chinese, thus maybe I am searching totally wrong (although the first example from Wikipedia seems very logical to me).
Sorry for disturbing (maybe this all is not so important) -- soon we will probably host a journal with articles in Chinese and Japanese, thus I wanted to test the search and to know if everything is OK and what's wrong and how it functions and ...
Thanks a lot and best wishes,
swing

Re: chinese full text search

Postby mj » Tue Mar 31, 2009 1:08 pm

Hi swing,

From your description, it sounds like the indexing function isn't breaking words along the whitespace boundaries you'd expect. I've had a look at the Wikipedia page you mention, and notice that the breaks between "words" aren't in fact space characters (U+0020), even though visually there appear to be breaks. You can try copy-pasting and searching in any UTF-8 compatible text editor to confirm this (try searching for a "space" from the spacebar). When text is indexed, it is broken across space characters as boundaries, since these are assumed to be the separators for "words". This would explain why you're getting larger chunks in the index than you'd expect by looking at the text.
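
The effect can be reproduced with a small Python sketch. This is not the OJS indexer, only an illustration of splitting on U+0020; the sample string is modelled on the index entry quoted above, and `non_alnum_chars` is a hypothetical helper for inspecting which separator characters a text really contains.

```python
import unicodedata

# Illustrative text: real spaces exist only around the Latin words.
sample = "a place)著名的「五帆」建築、葛勞士山(grouse mountain"

# Splitting on the space character alone, as described above.
tokens = sample.split(" ")

def non_alnum_chars(text):
    """List each distinct non-alphanumeric character with its code point and name."""
    seen = []
    for ch in text:
        if not ch.isalnum() and ch not in seen:
            seen.append(ch)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")) for ch in seen]

separators = non_alnum_chars(sample)

print(tokens)
for code_point, name in separators:
    print(code_point, name)
```

The middle token keeps the whole Chinese passage together with the neighbouring Latin fragments, so a search for 葛勞士山 alone has no matching entry.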

I don't know enough about Chinese or Japanese glyphs to understand how these "words" are broken without use of the space character; if you're able to find out from someone who does, and they can explain how we'd be able to detect these characters as word boundaries, then we could certainly address this as an improvement request for a future release. There are a number of OJS journals in Asian encodings, so I'd consider this important enough to address if we can figure out how to solve it.

Hope this helps,
MJ

Re: chinese full text search

Postby swing » Wed Apr 01, 2009 4:50 am

Hi MJ,
Thanks! It helps and it perfectly makes sense :D I didn't think of checking if it's a space character between 'words' :roll:
In the meantime I got some more information about Chinese. There are no spaces between 'words' (as I had expected); there are only punctuation marks like , or 。 Thus, I suppose, a search based on spaces between words does not work well for Chinese :-(
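
As a rough sketch of what treating such punctuation as boundaries would do, the Python snippet below splits on whitespace plus a small, assumed set of CJK punctuation marks. The character set is not exhaustive and `split_on_boundaries` is a hypothetical name, not an OJS function.

```python
import re

# Assumed boundary set: whitespace, a few ASCII marks, and some CJK punctuation
# (ideographic comma and full stop, fullwidth comma, corner brackets,
# fullwidth parentheses). A real list would need to be much longer.
BOUNDARY = re.compile(r"[\s,.()\u3001\u3002\uff0c\u300c\u300d\uff08\uff09]+")

def split_on_boundaries(text):
    """Split text on the assumed boundary characters, dropping empty pieces."""
    return [piece for piece in BOUNDARY.split(text) if piece]

entry = "place)著名的「五帆」建築、葛勞士山(grouse"
pieces = split_on_boundaries(entry)
print(pieces)
```

This makes 葛勞士山 findable on its own, but the pieces are still phrases rather than words: punctuation alone cannot separate the words inside a run such as 著名的.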
As far as I know, Lucene should be integrated into OJS soon? Maybe this is going to solve some problems? I am not familiar with Lucene, but it should have analyzers (for most languages) that could be used to index the content...
Best wishes,
swing

Re: chinese full text search

Postby mj » Thu Apr 02, 2009 12:30 pm

Hi swing,

Yes, Lucene-based search has already been built into the latest Harvester 2.3 release and is in development for porting into OJS, most likely in the OJS 2.3 release. This should hopefully address a lot of issues that tend to appear with various character encodings and languages, as well as provide some improved search/indexing functionality. It will work quite differently, so there's sure to be much testing (and probably plenty of trial-and-error). In the meantime, I'm not sure whether it's worthwhile to add word-break characters for various languages. If you have a comprehensive list of the characters and their UTF-8 codes, I'd be happy to take a look at it.
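
A list of that kind could be generated rather than typed by hand. The Python sketch below prints the code point, UTF-8 bytes, and Unicode name for a handful of common CJK punctuation marks; the selection in `CANDIDATES` is an assumption made for illustration and is far from comprehensive.

```python
import unicodedata

# Assumed starting set of CJK punctuation marks, written as escapes so the
# exact characters are unambiguous.
CANDIDATES = (
    "\u3001\u3002"              # ideographic comma, ideographic full stop
    "\uff0c\uff1b\uff1a"        # fullwidth comma, semicolon, colon
    "\uff01\uff1f"              # fullwidth exclamation mark, question mark
    "\u300c\u300d\u300e\u300f"  # corner brackets and white corner brackets
    "\uff08\uff09"              # fullwidth parentheses
)

def describe(ch):
    """Return (code point, UTF-8 bytes in hex, Unicode name) for one character."""
    utf8_hex = " ".join(f"{byte:02X}" for byte in ch.encode("utf-8"))
    return (f"U+{ord(ch):04X}", utf8_hex, unicodedata.name(ch))

table = [describe(ch) for ch in CANDIDATES]
for code_point, utf8_hex, name in table:
    print(f"{code_point}  {utf8_hex}  {name}")
```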

MJ

Re: chinese full text search

Postby swing » Tue Apr 07, 2009 4:04 am

Hi MJ,

I am afraid that the standard analyzer (where words are separated by special markers such as a space or a comma) is not well suited for Chinese :-( Let's take, for example, the following sentence: "I'll go to the market on Sunday." In Chinese it looks like this: "星期天我将要去市场。" or "星期天我要去市场。" (I = 我, will = 将要, go to = 去, the market = 市场, on Sunday = 星期天). Thus, there is no separation between 'words'. So I am not sure whether it's worthwhile to put much work into this kind of search/indexing for Chinese. I think Lucene tokenizes Chinese text into 1-grams, i.e. each Chinese character is a single token. Maybe it would be better to wait for Lucene integration and invest more work in that? Hmm...
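
The 1-gram idea can be sketched in a few lines of Python. This is not Lucene's implementation, only an illustration under simple assumptions: `is_cjk` covers just the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the 2-gram variant is included for comparison as another common way to index text without word boundaries.

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block (assumed range)."""
    return "\u4e00" <= ch <= "\u9fff"

def cjk_runs(text):
    """Collect maximal runs of consecutive CJK characters."""
    runs, current = [], []
    for ch in text:
        if is_cjk(ch):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs

def ngrams(text, n):
    """Overlapping n-character tokens from each CJK run (short runs kept whole)."""
    tokens = []
    for run in cjk_runs(text):
        if len(run) < n:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens

sentence = "星期天我要去市场。"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)

print(unigrams)
print(bigrams)
# A query is tokenized the same way and matched against the indexed tokens.
print(all(token in bigrams for token in ngrams("市场", 2)))
```

With either scheme a query for 市场 (the market) can match without any knowledge of where Chinese words begin and end, at the cost of a larger index and some false matches across word boundaries.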

If I continue with Chinese like this, I'll become an expert soon ;-) (Well, I was always interested in it and wanted to learn it -- so swing, there you go... ;-) )

Best wishes,
swing

