Bug 6626

Summary: Special Characters in all reports are garbled
Product: OCS Reporter: James MacGregor <jmacgreg>
Component: GeneralAssignee: PKP Support <pkp-support>
Status: RESOLVED FIXED    
Severity: normal CC: alec, bugzilla
Priority: P3    
Version: 2.3.4   
Hardware: All   
OS: All   
Version Reported In: Also Affects:

Description James MacGregor 2011-04-29 15:14:38 PDT
Confirmed in git master. Any special characters that are placed into any OCS report (Papers, Registrant or Review) come in garbled. For example, the test text 

Direction de la rédaction : Éric Duchemin (co-directeur) et Louise Vandelac (co-directrice)
Éditeurs : Université du Québec à Montréal et Éditions en environnement VertigO
ISSN : 1492-8442 (numérique)

... is outputted as 

"Direction de la rédaction : Éric Duchemin (co-directeur) et Louise Vandelac (co-directrice)
Éditeurs : Université du Québec à Montréal et Éditions en environnement VertigO
ISSN : 1492-8442 (numérique)"
Comment 1 Alec Smecher 2011-04-29 15:43:56 PDT
James, how are you loading the CSVs? In OpenOffice, it should be possible to specify a UTF8 character set. It sounds like you're opening it as though it was Latin1-encoded.
Comment 2 James MacGregor 2011-04-29 16:24:20 PDT
(In reply to comment #1)
> James, how are you loading the CSVs? In OpenOffice, it should be possible to
> specify a UTF8 character set. It sounds like you're opening it as though it was
> Latin1-encoded.

Hi Alec, I think you're right -- by opening in OO.O and specifying UTF-8, the characters look fine. (Actually, OO.O defaults to UTF-8.) I get the incorrect character display via Excel. I've looked online and this doesn't seem to be a particularly isolated or unique problem, but some other forums recommend that the character set be set in the file header -- is this being done in OCS? See eg. https://drupal.org/node/187774. I've tried changing line 55 of plugins/reports/registrants/RegistrantReportPlugin.inc.php from 

header('content-type: text/comma-separated-values');

... to 

header('content-type: text/csv; charset=utf-8');

... or 

header('content-type: text/comma-separated-values; charset=utf-8');

... but to no effect. I haven't tried the mb_convert_encoding() method yet.
Comment 3 Trudy A. Shore 2011-07-20 11:30:55 PDT
I did a bunch of work on this and posted it all here:

http://pkp.sfu.ca/support/forum/viewtopic.php?f=3&t=7134&start=15#p28965

Unfortunately, it wasn't acknowledged by anyone - not even the original poster.
Comment 4 Alec Smecher 2011-08-24 11:38:16 PDT
Thanks for contributing, Trudy, The patches linked here have similar tweaks to three different reports.
Comment 5 Alec Smecher 2011-08-24 11:40:02 PDT
Add explicit UTF8 BOM and tweak entity decoding
https://github.com/pkp/ocs/commit/0c60806acdc13b13b46603388b3735658b7b960b
Comment 6 Alec Smecher 2011-08-24 11:40:03 PDT
Add explicit UTF8 BOM and tweak entity decoding
https://github.com/pkp/ocs/commit/d0fce7785deba74fa2d626c782dcf950584d9468
Comment 7 James MacGregor 2011-08-24 13:53:28 PDT
Hi Alec, the fixes you committed don't seem to make a difference, in git stable at least -- the demo text below is still displayed in Excel as I describe below. The one exception is if I put a special character in an author's Affiliation field -- that appears at the HTML character-encoded equivalent, eg. instead of "rédaction" or "r√©daction", it will appear as "r&eacute;daction". 

One other oddity: the firstmost cell (top left) also appears strange in Excel: 

Ôªø"Submission ID"

rather than 

"Submission ID"

... as it does in OO.O. There's nothing I can see there, like an accent or anything, to mangle that -- maybe Excel just doesn't like the fixes?
Comment 8 Alec Smecher 2011-08-24 14:08:24 PDT
Try pulling down the most recent commits. That's the UTF-8 BOM you're seeing, which is supposed to instruct Excel that it's dealing with a UTF-8 file. The newest change will hopefully at least instruct the OS that it should be saving a UTF-8 file, but from http://stackoverflow.com/questions/155097/microsoft-excel-mangles-diacritics-in-csv-files it looks like we may simply have to recommend work-arounds when loading the file.
Comment 9 James MacGregor 2011-08-24 14:35:22 PDT
(In reply to comment #8)
> Try pulling down the most recent commits. That's the UTF-8 BOM you're seeing,
> which is supposed to instruct Excel that it's dealing with a UTF-8 file. The
> newest change will hopefully at least instruct the OS that it should be saving
> a UTF-8 file, but from
> http://stackoverflow.com/questions/155097/microsoft-excel-mangles-diacritics-in-csv-files
> it looks like we may simply have to recommend work-arounds when loading the
> file.

No change, unfortunately. I concur with the stackoverflow sentiment that this really is a bug in Excel (and maybe specific to Excel on OS X, according to Trudy, which is what I'm using). Happy to say we've done our due diligence here and close the bug, if you want.
Comment 10 Alec Smecher 2011-08-31 17:25:02 PDT
Removed bad fix
https://github.com/pkp/ocs/commit/7265fed83f3d0af9413123d07b04c8c9fe020121
Comment 11 Alec Smecher 2011-08-31 17:25:02 PDT
Removed bad fix
https://github.com/pkp/ocs/commit/f9d3917191d830fa9e5deba33d49f7219afd1647
Comment 12 Trudy A. Shore 2013-03-21 10:56:57 PDT
My client has just opened up the CSV generated by papers report in Excel for windows and, guess what?  

Special characters are garbled...

After all of the work I did to fix this bug, I was surprised to find out that not all of the changes made it into OCS 2.3.5?

In particular, I just went and applied this change and now everything is OK for my client:

Add this line:

    fwrite($fp, chr(0xEF).chr(0xBB).chr(0xBF));

in plugins/reports/papers/PaperReportPlugin.inc.php, right after:

    $fp = fopen('php://output', 'wt');

I'm not sure why this was never used. In reading this bug report, it looks like it might've caused some problems?  But for my clients, they are very happy to have it fixed.
Comment 13 Alec Smecher 2013-03-21 11:02:11 PDT
Per James' comment #7, adding a UTF8 BOM will not work on all platforms; sometimes it results in the BOM appearing in the import data rather than transparently instructing Excel to use UTF8 on import.

See http://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0 for example instructions on specifying a UTF8 encoding during the Excel import.

There is lots of discussion around Excel and UTF-8 quirks here:
http://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically