character encoding problems with buildSearchIndex.php

Postby geoffg » Wed Sep 26, 2012 11:19 am

When I use tools/rebuildSearchIndex.php to rebuild the search index, special characters end up garbled. For example, the word "scribner’s" in a PDF file (with a smart apostrophe) ends up in the article_search_keyword_list table as "scribner’s".
International characters such as ñ are garbled as well.

We're using MySQL with the utf8_general_ci collation. Also, in the OJS config.inc.php, all of the UTF-8-related settings appear to be enabled:

index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

locale = en_US
client_charset = utf-8
connection_charset = On
database_charset = On
charset_normalization = On

Does anyone have any idea what could be going on?

Re: character encoding problems with buildSearchIndex.php

Postby asmecher » Wed Sep 26, 2012 11:38 am

Hi geoffg,

This might be a problem internal to your PDFs, or due to the configuration of the extraction tool; I'd suggest starting by manually running your PDF extraction command on a sample PDF to see what its output looks like. Try the command configured in "index[application/pdf]", replacing %s with the filename of the PDF you want to work with.
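To make that manual test repeatable, here is a minimal Python sketch. It assumes the %s placeholder is simply replaced by the (shell-quoted) file name; the helper names and sample.pdf are hypothetical, and the command string is copied from the index[application/pdf] setting quoted earlier in the thread.

```python
import shlex
import subprocess

# Copied from the index[application/pdf] setting quoted in the first post.
PDF_INDEX_COMMAND = (
    "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"
)


def build_extract_command(template: str, pdf_path: str) -> str:
    """Substitute the shell-quoted PDF path for the %s placeholder (assumed behaviour)."""
    return template.replace("%s", shlex.quote(pdf_path))


def run_extract(template: str, pdf_path: str) -> bytes:
    """Run the extraction pipeline and return its raw output bytes, undecoded."""
    command = build_extract_command(template, pdf_path)
    result = subprocess.run(command, shell=True, check=True, capture_output=True)
    return result.stdout
```

Calling run_extract(PDF_INDEX_COMMAND, "sample.pdf") returns bytes rather than text, so nothing between the tool and your inspection gets a chance to re-encode the output.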

Regards,
Alec Smecher
Public Knowledge Project Team

Re: character encoding problems with buildSearchIndex.php

Postby geoffg » Wed Sep 26, 2012 11:56 am

Thanks Alec for responding so quickly.

I should have mentioned that I have already run the pdftotext command with the same arguments as in the OJS config file, writing the output to a text file. The text file comes through correctly; the special characters are not garbled.

I've also tried running ./tools/rebuildSearchIndex.php from the command line while specifying utf-8 as the default character set for PHP, like this:

php -d default_charset=utf-8 ./tools/rebuildSearchIndex.php

But that didn't make any difference.

I can't figure out what might be causing this. I think that in some cases the PDF itself is to blame. But in other cases, running pdftotext from the command line preserves the correct characters for a given PDF file, while running rebuildSearchIndex.php on the same file produces gremlins in the database.

Re: character encoding problems with buildSearchIndex.php

Postby asmecher » Wed Sep 26, 2012 12:22 pm

Hi geoffg,

It can be hard to tell with character encoding issues because your terminal emulator (e.g. if you're connecting to your server via SSH) may itself be hiding the gremlins. If you pipe the output from pdftotext to a file on the server side, then transfer it over to your desktop and view it there in a good programmer's editor, you might find the gremlins there too. (Alternatively, dump to a file on the server side and use hexdump or something similar to inspect the encoding at the byte level.)
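As a rough alternative to hexdump, the byte-level check could be sketched in Python like this. The function names are hypothetical; the sketch only tells you whether the bytes form valid UTF-8 and which byte sequences encode the non-ASCII characters.

```python
def find_invalid_utf8(data: bytes):
    """Return None if data is valid UTF-8, else the offset of the first bad byte."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return err.start


def non_ascii_hex(data: bytes):
    """List (character, hex bytes) for each non-ASCII character in valid UTF-8 data."""
    text = data.decode("utf-8")
    return [(ch, ch.encode("utf-8").hex(" ")) for ch in text if ord(ch) > 127]
```

Read the extracted file in binary mode (for example open("out.txt", "rb").read(), where out.txt is a placeholder name) and pass the bytes to both functions, so that no editor or terminal is involved in interpreting them.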

Ah, and I see you've mentioned smart apostrophes. Are these UTF8-valid smart apostrophes, or Microsoft's smart quote "extension"? It might help to focus on accented characters and come back to quotes and apostrophes once you're sure vanilla UTF8 is working OK.
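The two kinds of smart apostrophe differ at the byte level: U+2019 is three bytes in UTF-8 but a single byte, 0x92, in Windows-1252. A small Python sketch of how one might tell them apart (classify_apostrophe is a hypothetical helper, and the Windows-1252 verdict is only a guess based on the presence of that byte in data that is not valid UTF-8):

```python
SMART_APOSTROPHE = "\u2019"  # RIGHT SINGLE QUOTATION MARK

UTF8_FORM = SMART_APOSTROPHE.encode("utf-8")      # b"\xe2\x80\x99"
CP1252_FORM = SMART_APOSTROPHE.encode("cp1252")   # b"\x92"


def classify_apostrophe(data: bytes) -> str:
    """Guess which encoding of the smart apostrophe, if any, appears in data."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8; a lone 0x92 byte suggests Windows-1252 smart quotes.
        return "windows-1252" if CP1252_FORM in data else "unknown"
    return "utf-8" if SMART_APOSTROPHE in text else "none"
```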

Regards,
Alec Smecher
Public Knowledge Project Team

Re: character encoding problems with buildSearchIndex.php

Postby geoffg » Fri Sep 28, 2012 8:56 am

Okay. I've examined the text file output from pdftotext in a good coding editor (TextMate and also Sublime Text) and the UTF-8 characters come through okay. But when I run rebuildSearchIndex.php on the same PDF file, gremlins appear in the article_search_keyword_list table.

Here's an example. The word Österreich looks fine in the .txt file produced by pdftotext, but in the article_search_keyword_list table it appears as österreich.
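This pattern is what you get when UTF-8 bytes are decoded with a single-byte Western European code page somewhere along the way. A minimal Python sketch, assuming Windows-1252 as the wrong code page (an assumption; nothing here establishes which charset was actually applied) and using hypothetical helper names:

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Reverse the misreading: recover the original bytes and decode them as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


# ö is the two bytes c3 b6 in UTF-8; read as Windows-1252 they become "Ã" and "¶".
garbled = misread_utf8_as_cp1252("österreich")
restored = repair(garbled)
```

The same round trip turns the smart apostrophe (bytes e2 80 99) into the three characters seen in the garbled "scribner" example. Note that Python's cp1252 codec rejects a few unassigned bytes (such as 0x81), so this sketch can raise an error on some inputs.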

The PDF file I'm using to test with can be found here: http://www.i18nguy.com/unicode/unicodeexample.pdf. But none of the international characters come through in any of our real journal articles either; they all get garbled.

I really appreciate the help you've offered so far. Do you have any other ideas about what could be causing this?

Re: character encoding problems with buildSearchIndex.php

Postby geoffg » Fri Sep 28, 2012 9:34 am

I've figured out the problem. The answer is in this thread:

viewtopic.php?f=8&t=8738

The database connection character set needs to be set to "UTF8", not "utf-8" (the config documentation could be a little clearer about this).

Thanks for the help in troubleshooting this.

The settings I'm using now are:


; Default locale
locale = en_US

; Client output/input character set
client_charset = utf-8

; Database connection character set
; Must be set to "Off" if not supported by the database server
; If enabled, must be the same character set as "client_charset"
; (although the actual name may differ slightly depending on the server)
connection_charset = UTF8

; Database storage character set
; Must be set to "Off" if not supported by the database server
database_charset = UTF8

; Enable character normalization to utf-8 (recommended)
; If disabled, strings will be passed through in their native encoding
charset_normalization = On