MS Word derived HTML galleys


Postby delong » Thu Mar 23, 2006 2:05 am

Anybody have any hints or suggestions for getting clean HTML galleys derived from MS Word files? I get various degrees of cruft in my galleys - little question marks where tabs or other characters aren't formatting correctly.

It's just in the galley proof views and published HTML articles. If I click on the galley file link next to galley proof, it renders fine.

My charsets are set uniformly to UTF-8. Checked that already. I think it's just a problem with Word files converted to HTML, because my PDF proofs are fine.

How does everyone else convert their submissions to HTML? Hints, suggestions, exhortations are much appreciated!
delong
 
Posts: 13
Joined: Thu Dec 01, 2005 3:17 pm

re: MS Word derived HTML galleys

Postby mj » Mon Jul 31, 2006 12:52 pm

Hi delong,

One of the features we're working on for OJS 2.2 is integration with a web service that will allow document conversion from MS-Word (and others) to XML, which in turn can be rendered into (consistent, UTF-8) XHTML galleys.

I have actually done a fair amount of work in this area, mostly using OpenOffice.org to do the conversion, which I can't recommend highly enough over MS-Office. Character conversion issues often come from custom (Microsoft) fonts -- the Symbol font comes to mind. If you preview the galley file locally in IE, it may appear correct, but once it's published through the web server the character encodings go funny.

A few recommendations on converting MS-Word to HTML:

- use OpenOffice instead of MS-Word; it creates *much* better HTML (actual valid HTML 4.0 rather than proprietary Word-HTML)

- try running your HTML files through HTMLTidy and/or converting everything to HTML entities rather than UTF-8; some browsers (notably IE) don't handle UTF-8 HTML very well.
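
The entity conversion in the second recommendation can be scripted. Below is a minimal Python sketch, not part of OJS or HTMLTidy. The function name `to_numeric_entities` is hypothetical, and the default source encoding of Windows-1252 (Python's `cp1252`) is an assumption about what Word's export produced; pass `"utf-8"` if your files really are UTF-8.

```python
def to_numeric_entities(raw: bytes, source_encoding: str = "cp1252") -> str:
    """Return ASCII-only HTML with non-ASCII characters as numeric entities.

    source_encoding is an assumption: Word's HTML export is often
    Windows-1252, but check the file's own charset declaration.
    """
    # Decode the bytes into real Unicode characters first.
    text = raw.decode(source_encoding)
    # xmlcharrefreplace turns every non-ASCII character into &#NNNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # Curly quotes and an accented letter as Windows-1252 bytes.
    print(to_numeric_entities(b"\x93caf\xe9\x94"))
```

Because the result is pure ASCII, it should display the same way whatever charset the web server announces.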

There seem to be an increasing number of people who are going in this direction for their HTML galleys, so we are trying to get the XML facilities out there as quickly (and reliably) as possible.

Hope this helps,
MJ
mj
Site Admin
 
Posts: 304
Joined: Fri Mar 26, 2004 9:32 am
Location: Toronto, Canada

Re: re: MS Word derived HTML galleys

Postby delong » Thu Oct 26, 2006 8:26 pm

Thanks mj,

Unfortunately I'm stuck with MS Office, and I can't assume those that come after me will use anything other than MS Office. The web conversion service sounds great - it is a good option for me since I have to plan for future editors and their technical knowledge (or lack thereof).

I figured out long ago that the cause was Word's use of non-standard characters. My workaround has been to do laborious find-and-replace operations to get rid of the smart quotes, section and paragraph symbols, etc. I'll take a look at HTMLTidy. Belated thanks for the help!
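
For anyone doing the same find-and-replace by hand, it can be batched into one pass. This is only a sketch in Python: the `REPLACEMENTS` table and the function name `normalize_word_punctuation` are hypothetical, and the list of characters is an assumption about what typically shows up in Word output, so extend it to match your own files.

```python
# Hypothetical mapping from common Word punctuation to plain equivalents.
REPLACEMENTS = {
    "\u2018": "'",       # left single quote
    "\u2019": "'",       # right single quote / apostrophe
    "\u201c": '"',       # left double quote
    "\u201d": '"',       # right double quote
    "\u2013": "-",       # en dash
    "\u2014": "--",      # em dash
    "\u2026": "...",     # ellipsis
    "\u00a0": " ",       # non-breaking space
    "\u00a7": "&sect;",  # section symbol, kept as a named entity
    "\u00b6": "&para;",  # paragraph symbol, kept as a named entity
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(REPLACEMENTS)


def normalize_word_punctuation(text: str) -> str:
    """Replace Word's typographic characters in a single pass."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_word_punctuation("\u201cIt\u2019s\u201d \u2014 \u00a7\u00a03"))
```

The text has to be decoded to Unicode before this runs; applied to raw bytes in the wrong encoding it will simply find nothing to replace.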

delong
 
Posts: 13
Joined: Thu Dec 01, 2005 3:17 pm

Re: re: MS Word derived HTML galleys

Postby mj » Fri Oct 27, 2006 2:57 pm

Hi delong,

You're not the only one who's stuck with MS Office -- I was at the MedNet conference last week, and BioMed Central was discussing the same problem of submission format handling.

The good news is that work on the web conversion system has been progressing well, and we will be trialling the new software with a select number of journals very shortly.

The document-conversion part of this software uses the Docvert application (http://holloway.co.nz/docvert/) with OpenOffice. It is a bit complex to set up, and we are hoping to provide the service to the general public (or at least, the OJS public) in the coming months.

If you're looking for reliable conversion from Word to HTML (or XHTML), HTMLTidy (http://www.w3.org/People/Raggett/tidy/) has a word-2000 option designed specifically for stripping out the surplus markup Word adds, which should take care of much of the cruft you describe. Much less work than manual search/replace.
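
As a rough illustration of the Tidy step, here is a Python sketch that wraps the command line. It assumes the `tidy` executable is installed and on the PATH; the function names `build_tidy_command` and `clean_word_html` are hypothetical, and `-utf8` assumes the input is already UTF-8 (Word exports are sometimes Windows-1252, in which case the encoding options would need changing).

```python
import subprocess
from pathlib import Path


def build_tidy_command(src: Path, dest: Path) -> list[str]:
    """Build a tidy invocation for cleaning Word-exported HTML."""
    return [
        "tidy",
        "-asxhtml",            # write XHTML
        "-utf8",               # treat input and output as UTF-8 (assumption)
        "--word-2000", "yes",  # strip surplus Word markup
        "--clean", "yes",      # replace presentational tags with CSS rules
        "-o", str(dest),       # output file
        str(src),
    ]


def clean_word_html(src: Path, dest: Path) -> int:
    """Run tidy and return its exit status (0 = clean, 1 = warnings)."""
    result = subprocess.run(
        build_tidy_command(src, dest), capture_output=True, text=True
    )
    # Tidy exits with 2 when it hit errors it could not repair.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.returncode
```

Calling `clean_word_html(Path("article.html"), Path("article-clean.html"))` would write the cleaned galley next to the original; the file names here are placeholders.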

Hope this helps, and stay tuned!

MJ

Open Journal Systems
Development Team
mj
Site Admin
 
Posts: 304
Joined: Fri Mar 26, 2004 9:32 am
Location: Toronto, Canada

Postby soj » Tue Oct 31, 2006 7:19 pm

I am sitting down to create HTML Galley files for two long papers. They originate in MSWord - gack. My process is:

1. Remove hyphenations
2. Copy and paste the text into a plain text editor (TextPad)
3. Paste plain text into Dreamweaver and call the articleview.css, then proceed to format the text with style tags. A long process on a 22-page article.

If a more efficient method exists, I'm all ears and would be grateful to hear it.

Meanwhile, I have found one useful aspect of converting the MSWord doc to (MS)HTML within the MSWord application: it turns mathematical equations into image files and places them neatly into a directory for me. This is the sole positive I have found in MsWord's "Save as HTML" functionality. Thought I'd share...

Cheers! soj
http://www.ejssm.org
http://www.insojourn.com
soj
 
Posts: 151
Joined: Fri Oct 28, 2005 1:53 pm
Location: Norman OK USA

Postby mj » Tue Oct 31, 2006 7:37 pm

Hi soj,

Thanks for your note! There are a couple of ways to accomplish what you're doing without quite as much manual effort.

If you're comfortable exporting Word-HTML directly, you can then run it through HTMLTidy to remove all of the surplus markup (using the word-2000 option), and then begin editing the CSS/styles in something like Dreamweaver. I believe HTMLTidy can also remove hyphenation for you, but it's been a while since I've used it, so I may be mistaken.
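
If Tidy turns out not to handle the hyphenation, that step is easy to script separately. The following Python sketch is an assumption about what "removing hyphenation" means here: dropping invisible soft hyphens (U+00AD) and rejoining words split by a hyphen at a line end. The function name `remove_hyphenation` is hypothetical.

```python
import re

SOFT_HYPHEN = "\u00ad"


def remove_hyphenation(text: str) -> str:
    """Remove soft hyphens and rejoin words broken across line ends."""
    # Soft hyphens are only line-break hints and can simply be dropped.
    text = text.replace(SOFT_HYPHEN, "")
    # Join "conver-\nsion" into "conversion". Caveat: a genuine compound
    # broken at its hyphen ("well-\nknown") is joined too, so proofread.
    return re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)


if __name__ == "__main__":
    print(remove_hyphenation("conver-\nsion of hy\u00adphens"))
```

Run it on the plain text before styling, since the line-end rule does not look inside HTML tags.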

You can also load the MS-Word file in OpenOffice.org, and export to a variety of formats (including HTML 4.0 and XHTML 1.0). This produces much cleaner HTML, and more consistent styling -- however, embedded OLE objects (such as some equations, some image formats like WMF, refman/endnote references, etc.) will not be converted/exported properly as there are no filters for them.

Unfortunately, making styles consistent is a very cumbersome process at the layout level. We are putting more effort into an XML-backed process to make it easier (so styles are applied uniformly to the XML rather than manually to the HTML), but as long as authors are inconsistent in their submissions, it will always be a substantial amount of manual work for the layout editor.

MJ

Open Journal Systems
Development Team
mj
Site Admin
 
Posts: 304
Joined: Fri Mar 26, 2004 9:32 am
Location: Toronto, Canada

Postby soj » Tue Oct 31, 2006 8:08 pm

THX for your response, MJ! Dreamweaver does a pretty good job (base) coding the pasted plain txt, but it's occurred to me that I should give your option a spin with the MS HTML text containing the equation images, since that portion is certainly worth 'cleaning' the hard way with a search/replace.

Looking forward to progress with XML.

Best to you and the group! soj
soj
 
Posts: 151
Joined: Fri Oct 28, 2005 1:53 pm
Location: Norman OK USA

