Lemon8-XML: using styles and more

Forum for PKP's Lemon8-XML.

Moderators: jmacgreg, mj

Site Admin
Posts: 304
Joined: Fri Mar 26, 2004 9:32 am
Location: Toronto, Canada

Lemon8-XML: using styles and more

Postby mj » Mon Jun 23, 2008 10:40 pm

Peter Sefton has very thoughtfully made a blog posting about his recent experience trying out the new beta, and in it he raises a number of points that I suspect others are likely to run into as well. I'd like to address a few of them, and hopefully clarify some things along the way.

The bad news is that the service had significant problems with my document; It could not locate author metadata, incorrectly identified some ordinary text as being citations, and lost most of the document text, which is obviously a very major issue.

This is definitely a "major issue" in the sense that, at the moment, L8X doesn't offer much feedback as to how well the document parser was able to do its job on the provided document. It's an area that we've identified as critical to the usability of L8X, and it shows the need to be able to adjust the document parser interactively, modifying a few settings to get the best possible result. A simple example would be showing the current parse score alongside something like a slider to adjust the "laxness" of the parsing boundaries.
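To make the score-plus-slider idea a bit more concrete, here is a minimal Python sketch. It is not how L8X computes anything: the `Guess` record, the per-guess `confidence` value, and the rule that laxness lowers an acceptance threshold are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Guess:
    """One structural guess made by a hypothetical document parser."""
    label: str         # e.g. "title", "heading", "citation"
    confidence: float  # 0.0 .. 1.0, how sure the parser is (assumed to exist)


def parse_score(guesses, laxness):
    """Share of guesses accepted at a given laxness (0.0 = strict, 1.0 = lax).

    Assumption: the "slider" simply lowers the confidence threshold
    a guess must reach before it is accepted.
    """
    if not guesses:
        return 0.0
    threshold = 1.0 - laxness
    accepted = [g for g in guesses if g.confidence >= threshold]
    return len(accepted) / len(guesses)


guesses = [
    Guess("title", 0.95),
    Guess("heading", 0.70),
    Guess("citation", 0.40),
    Guess("paragraph", 0.92),
]

# Moving the slider from strict to lax and showing the resulting score.
for laxness in (0.1, 0.5, 0.8):
    print(f"laxness={laxness:.1f} score={parse_score(guesses, laxness):.0%}")
```

A real interface would presumably re-run the parser with the new setting rather than just re-filter old guesses, but the feedback loop for the user would look much the same.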

The larger problem, of course, is that L8X is encumbered, in a way, by the common expectation that it should just "magically" work on whatever format the author or user is providing -- it is an application that is designed to solve, in part, an infinitely-unsolvable problem. So, the user has to meet the application halfway. Peter is, once again, on top of this in his next comment:

If the PKP team can get a decent structure guessing application to work on arbitrary input that would be great, but even better would be to close the loop and give back documents with more structure than you put in. At the ICE project we will help however we can.

After refining L8X over dozens of documents and a number of years, we are hoping that the current version handles a reasonable range of (relatively) arbitrary documents. Of course, with only limited resources for testing, that's still a tiny sliver of the variations that can exist in document formatting. This is one of the main reasons why we're so excited to get other groups involved with developing L8X.

There are a couple of points that I think Peter may have misunderstood, and I'd like to make sure that others aren't similarly mistaken; I'd highly encourage anyone to look at the Lemon8-XML Architecture diagram to get an idea of how L8X is designed internally:

1. Build a converter that can take structured word processing documents and map them to the NLM XML format used by L8X. ICE offers one well worked out structure for generic documents, others may exist for specific formats.

This is a common misconception, but understandable given the prominence of the NLM DTD among many of those interested in L8X, and admittedly, in much of the L8X literature. L8X doesn't use any particular schema internally, although it does store hierarchical structure. All of the metadata fields (and this applies to citations as well) are dynamic, based on the parsers and the UI. The rationale is the same one Peter alludes to, and the one behind most XML-based document schemas: to work for any kind of generic document. The whole point of L8X is to help people add structure to their documents, without being tied to a specific schema.

2. Build a structure-guessing application to add structure to word processing documents (something which Ian Barnes has been chipping away at for a while).

At the risk of repeating my statements above about a "decent structure guessing application", this is exactly what the document parser and citation parser in L8X are designed to do: add structure to word processing documents.

With both of these in place you can improve documents in the wild as you go; every time someone submits a draft add styles and give it back to them, rather than trying to guess structure at the end. I would like to see this embedded in the OJS journal management system from PKP so that authors get rapid and continual feedback every time they upload a draft. This would allow some editorial and review processes to take place in an HTML interface as well – rather than via PDF on word processing files.

If you leave L8X as the final step, authors will have little feedback as to how they can improve the structure of their drafts.

Likewise, see above regarding the need for a more interactive interface into the document parser and real-time feedback. This is certainly a direction that we're going with our integration of L8X into OJS (and OMP), with the idea of being able to use L8X earlier and earlier in the journal publication process, to the point that the author can get immediate feedback on how "well-formatted" their article is, and enable rich document interaction such as automatically suggesting indexing keywords based on content, checking citation quality levels, displaying various galley formats pre-publication, or even offering submission fee incentives for articles that are sufficiently well-formatted by the author. It's good to see that we're heading in the right direction here, and considering that the earliest original purpose of L8X was purely for generating galleys, we've undoubtedly got our work cut out for us.

My two-part plan would re-ordering sections in L8X become redundant – word processors have outlining tools with which you can reorder content, so why try to do it through an HTML interface?

This is my fault for not being clear in the documentation to date. L8X does have a fairly rudimentary editor component to it, but this is not designed (at least, presently) to compete with rich-document editors or word processors. The guiding principle behind L8X is to do as much "structure guessing" as possible automatically, but provide the user with the ability to make corrections for the inevitable things that L8X has guessed incorrectly. The flip-side of this approach is that, with a sufficiently well-formed document in the first place, the amount of manual correction required should be minimal. This is one of the reasons why both the online demo and the source code of L8X include a sample document to give an example of what is currently considered "well-formatted".

On a technical note, last time I looked at L8X I concluded that Docvert is a weak link – it tries to use XSLT to guess structure; our experience with ICE was that XSLT (version one at least) was not a productive way to do this as the austere functional programming environment in XSLT made the structure-reasoning code very hard to maintain and very slow, so we moved to a more traditional parser written in Python which is much easier for typical programmers to work with.

There's definitely some confusion around the relationship between Docvert and L8X, which I'm hoping will become clearer with time and as more people try to install both for themselves. Docvert is "simply" (although it's a pretty powerful application) a wrapper tool around OpenOffice.org that "normalizes" uploaded documents into ODT so that L8X can try to parse them. All of the parsing and structure guessing is done in PHP using DOM and XPath, and this is done for exactly the reasons that Peter specifies: a traditional procedural language like PHP is essential for the kind of processing that needs to be done. Moreover, making it modular means that the parser can be replaced, re-written, or enhanced relatively easily, making L8X much more extensible and accessible to developers.
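For readers who have not seen structure guessing done with a DOM and path queries, the following is a rough sketch of the general technique. It uses Python's standard library rather than PHP, and it is not L8X's parser. The heuristic (a short paragraph whose automatic style is bold is probably a heading), the `guess_headings` function, and the sample document are all assumptions for illustration; only the ODF namespaces and element names come from the OpenDocument format itself.

```python
import xml.etree.ElementTree as ET

# OpenDocument namespaces used in an ODT content.xml file.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
}


def attr(prefix, name):
    """Build the '{namespace}name' form ElementTree uses for qualified names."""
    return f"{{{NS[prefix]}}}{name}"


def guess_headings(content_xml, max_words=12):
    """Return (text, how) pairs; 'how' is 'styled' or 'guessed'.

    Hypothetical heuristic: real headings (text:h) are trusted, and short
    paragraphs formatted in bold are guessed to be headings.
    """
    root = ET.fromstring(content_xml)

    # Collect the names of automatic styles that apply bold text.
    bold_styles = set()
    for style in root.findall(".//style:style", NS):
        props = style.find("style:text-properties", NS)
        if props is not None and props.get(attr("fo", "font-weight")) == "bold":
            bold_styles.add(style.get(attr("style", "name")))

    headings = []
    for el in root.findall(".//office:text/*", NS):
        text = "".join(el.itertext()).strip()
        if el.tag == attr("text", "h"):
            headings.append((text, "styled"))
        elif (
            el.tag == attr("text", "p")
            and el.get(attr("text", "style-name")) in bold_styles
            and 0 < len(text.split()) <= max_words
        ):
            headings.append((text, "guessed"))
    return headings


XMLNS = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
SAMPLE = f"""<office:document-content {XMLNS}>
  <office:automatic-styles>
    <style:style style:name="P1" style:family="paragraph">
      <style:text-properties fo:font-weight="bold" fo:font-size="12pt"/>
    </style:style>
  </office:automatic-styles>
  <office:body><office:text>
    <text:p text:style-name="P1">Introduction</text:p>
    <text:p text:style-name="Standard">This paragraph is ordinary body text.</text:p>
    <text:h text:outline-level="1">Methods</text:h>
  </office:text></office:body>
</office:document-content>"""

for text, how in guess_headings(SAMPLE):
    print(f"{how}: {text}")
```

Because the heuristic lives in one small function, it can be swapped out or tuned without touching the rest of the pipeline, which is the kind of modularity described above.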

L8X does use XSLT, however, in the export stage, to transform the internal "Lemon8 XML schema" (if it can be called that) into whatever export XML formats are desired, such as NLM DTD, Docbook, TEI, and so on. This is precisely the domain that XSLT was designed for, and it does a remarkable job of making it easy to create additional export formats from L8X by writing a single mapping XSL to the desired output.
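The "one mapping per export format" idea can be illustrated without XSLT. The sketch below is a stand-in written in Python with a plain lookup table, not the XSL that L8X actually uses, and the internal element names (`document`, `section`, `heading`, `paragraph`) are invented for the example. The target names are real NLM and DocBook elements, but the output is far from a valid document in either format.

```python
import xml.etree.ElementTree as ET

# One mapping table per export format (hypothetical internal names on the left).
MAPPINGS = {
    "nlm": {
        "document": "article",
        "section": "sec",
        "heading": "title",
        "paragraph": "p",
    },
    "docbook": {
        "document": "article",
        "section": "section",
        "heading": "title",
        "paragraph": "para",
    },
}


def export(internal_xml, fmt):
    """Rename every element according to the mapping chosen by fmt."""
    mapping = MAPPINGS[fmt]
    root = ET.fromstring(internal_xml)
    for el in root.iter():
        # Unknown elements are passed through unchanged.
        el.tag = mapping.get(el.tag, el.tag)
    return ET.tostring(root, encoding="unicode")


INTERNAL = (
    "<document><section><heading>Intro</heading>"
    "<paragraph>Text.</paragraph></section></document>"
)

print(export(INTERNAL, "nlm"))
print(export(INTERNAL, "docbook"))
```

Adding another export format here means adding one more table, which mirrors the appeal of writing a single mapping XSL per output format.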

I should close by thanking Peter once again for bringing these points up. Although one of the goals of the beta release was to make L8X available for people to download and work with for themselves, I consider an equal goal to be cultivating discussion of the approaches we've chosen to solve what is undoubtedly an extremely common and challenging problem. We want to be sure that the best ideas are incorporated into L8X, and at the very least, conversations like this will only improve the application for everyone.



Re: Lemon8-XML: using styles and more

Postby mj » Fri Jul 04, 2008 10:54 pm

In continuing what feels like a bit of a disjointed -- if thoughtful -- conversation, Peter has written a few words on some of the problems with user expectations.

Once again, he touches on a few areas that I should probably clarify, although this time they concern design philosophy more than technical approach. To begin:

I understand the requirement to try to understand the structure of ad hoc documents if you can, but I don’t think it’s a good idea to encourage people to keep creating them; if L8X has a version of “meet me half way” which involves direct formatting instead of styles then that will be a step backwards in my opinion. My version of meet me half way would be at least to try to get people to use headings. If they don’t then the structure guesser will step in, try to guess and give them their document back to correct when the inevitable errors occur.

It is important to help our colleagues who are authoring documents in word processors to use styles. It’s good for them. It will improve their working lives.

I agree that relying solely on a structure-guessing algorithm, as the current document parser does, is a pretty small step "halfway", as I had put it. But I also think it is a very good starting point, since it doesn't require authors to do anything more than they're used to at present. It may be good for authors to use styles, but you're asking them to do extra work (e.g. "but, I already marked that as a heading -- by making it 12 point boldface"), and if you're asking them to do extra work, there has to be some sort of reward (or punishment).

I can see a couple of sample use cases:

  1. an author fully marks their entire article correctly using a tool like ICE or a recognized style template; L8X detects these styles and returns a very high (e.g. 99%) score
  2. an author uses partial or custom styles, but omits some markup through forgetfulness, being rushed, etc.; L8X detects the styles that exist and tries to detect the rest using the document parser, and returns a high (e.g. 85%) score
  3. an author follows the recommended style guidelines as per the sample L8X document; L8X uses the document parser to discern structure based on convention, and returns a medium (e.g. 75%) score
  4. an author doesn't follow any guidelines, doesn't use any styles or tools, and submits a generally poor-quality document; L8X uses the document parser to guess structure and returns a low (e.g. 40%) score

The point I'm trying to make is that each of these cases is already happening, so instead of forcing authors to do it a particular way, we should enable L8X to take advantage of whatever extra work the author has done, and reward them with a higher document score.
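One way to picture a score that rewards whatever extra work the author has done is sketched below in Python. The three detection sources and their weights are assumptions chosen only to show the shape of the idea; they are not L8X's actual scoring, and they are not calibrated to reproduce the example percentages in the list above.

```python
# Hypothetical weights: how much we trust each way of detecting an element.
WEIGHTS = {
    "style": 1.0,        # author applied an explicit, recognized style
    "convention": 0.75,  # parser matched the recommended layout conventions
    "guess": 0.25,       # parser fell back to guessing from ad hoc formatting
}


def document_score(sources):
    """Average trust over all structural elements found in a document.

    'sources' holds one entry per element, naming how it was detected.
    """
    if not sources:
        return 0.0
    return sum(WEIGHTS[s] for s in sources) / len(sources)


fully_styled = ["style"] * 8
partly_styled = ["style"] * 6 + ["guess"] * 2
by_convention = ["convention"] * 8
ad_hoc = ["guess"] * 8

for name, doc in [
    ("fully styled", fully_styled),
    ("partly styled", partly_styled),
    ("by convention", by_convention),
    ("ad hoc", ad_hoc),
]:
    print(f"{name}: {document_score(doc):.0%}")
```

Nothing in this scheme forces authors to use styles; it only ensures that any styles they do apply raise the score above what guessing alone would give.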

I should note, for example, that Peter's sample ICE-RS document, which is based purely on ICE styles but contains very little actual content, parses very poorly with a score of about 11% -- but as soon as content is added (I just generated some lorem ipsum paragraphs) to mimic a real-world document, this score jumps to 83%. Imagine how much higher it'd be if L8X could detect both structure and ICE styles.

In this way, we're not encouraging people to keep creating ad hoc documents, but we're not really punishing them either. What we are doing is rewarding them for taking the time to manually mark structure using a tool like ICE, or the preferred style template supplied to them by the journal, ePress, library, etc. An overarching theme of the PKP development philosophy has been to give users the choice of what's appropriate for them, and this approach follows directly from that. Journals could even take this a step further by mandating minimum parse scores to qualify for submission, and so on (but that's a topic for another discussion).

Peter also identifies another philosophical difference between ICE and Lemon8-XML, that being the workflow and output design goals:

1. Styled word processing document to XML conversion, with the obvious caveat that if you’re turning a generic format into a domain specific one you’re going to be producing stuff that doesn’t use the whole of the target format and may have gaps that need to be filled in.

2. Ad hoc-formatting to styled word processing conversion using the best available heuristics to guess structure and give the document back to the author in an improved form. As far as I can tell that’s not a goal for the PKP team, but the code is out there so we could do it, using their algorithm.

The goal as I see it is a combination of both. As far as I understand, ICE works like an interactive editor designed to ultimately output well-structured XHTML or PDF (as opposed to a semantic XML schema) -- that is, the author iteratively adds WYSIWYM markup, checks the (e.g. HTML) output, and then switches back to the editor to make corrections. This makes perfect sense since it meshes well with authors' current work practices. L8X provides two possible workflows, and I'm completely open to questioning the validity of each:

  1. an author uploads a document to L8X, uses the PDF/HTML preview to identify areas for correction, modifies the original document, and re-uploads the corrected version; this is analogous to the iterative ICE workflow above, but it's unquestionably cumbersome
  2. an author uploads a document to L8X, takes cues from the scoring to identify areas for correction, and modifies them directly in L8X using the built-in editors; this is more natural and quantitative, but is highly underdeveloped at present and needs some pretty slick UI to make it usable

It would seem, then, that ICE would make a natural WYSIWYM editor for documents that are to be submitted to L8X, and perhaps, if it can be integrated into a rich web UI, an ideal built-in editor for correcting semantic markup in documents already ingested into L8X. This would facilitate both workflows above (again, providing options for the user) while building on the strengths of each application.

This is precisely why I see Lemon8 and ICE as natural companions, and why I think we need to have a clear understanding of how each complements the other's functionality.

Peter and I seem to have started with opposing views on how the two might fit together: he envisions L8X as a precursor tool to ICE, and I the other way around. Perhaps a good place to start on collaboration might be in identifying a set of common functional requirements (e.g. round-trippable modified ODT, sufficient citation markup, element mapping to the NLM DTD, etc.) and seeing where the bridges need to be built and the gaps filled in.

