The bad news is that the service had significant problems with my document; it could not locate author metadata, incorrectly identified some ordinary text as being citations, and lost most of the document text, which is obviously a very major issue.
This is definitely a "major issue", in the sense that L8X currently offers little feedback on how well the document parser was able to do its job on the provided document. We've identified this as an area that is critical to the usability of L8X. It also shows the need to be able to adjust the document parser interactively, so that the user can get the best possible result simply by modifying some settings. A simple example would be showing the current parse score alongside something like a slider to adjust the "laxness" of the parsing boundaries.
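As a rough illustration of the parse score and "laxness" slider idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Block` fields, the heading heuristic, the 0-100 scales, and the 20-point margin are hypothetical and are not how L8X actually scores a parse.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One chunk of document text plus the formatting hints a parser might see."""
    text: str
    bold: bool = False
    font_size: float = 12.0


def heading_confidence(block: Block, body_size: float = 12.0) -> int:
    """Return a 0-100 guess that this block is a heading (hypothetical heuristic)."""
    score = 0
    if block.bold:
        score += 40
    if block.font_size > body_size:
        score += 40
    # Short lines without a closing period look more like headings.
    if len(block.text.split()) <= 8 and not block.text.rstrip().endswith("."):
        score += 20
    return score


def parse(blocks, laxness: int = 50, margin: int = 20):
    """Label each block and report a parse score.

    `laxness` (0-100) is the slider: higher values lower the bar for
    calling a block a heading. The parse score is the share of blocks
    whose confidence sits at least `margin` points away from the
    threshold, i.e. the share of decisions that were not borderline.
    """
    threshold = 100 - laxness
    confidences = [heading_confidence(b) for b in blocks]
    labels = ["heading" if c >= threshold else "paragraph" for c in confidences]
    clear = sum(1 for c in confidences if abs(c - threshold) >= margin)
    score = clear / len(blocks) if blocks else 0.0
    return labels, score
```

With a score like this on screen, a user could move the slider and watch both the labels and the share of borderline decisions change, rather than only seeing the final result.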
The larger problem, of course, is that L8X is encumbered, in a way, by the common expectation that it should just "magically" work on whatever format the author or user is providing -- it is an application that is designed to solve, in part, an infinitely-unsolvable problem. So, the user has to meet the application halfway. Peter is, once again, on top of this in his next comment:
If the PKP team can get a decent structure guessing application to work on arbitrary input that would be great, but even better would be to close the loop and give back documents with more structure than you put in. At the ICE project we will help however we can.
After refining L8X over dozens of documents and a number of years, we are hoping that the current version does a decent job on a reasonable range of (relatively) arbitrary documents. Of course, with only limited resources for testing, that is still a tiny sliver of the variations that can exist in document formatting. This is one of the main reasons why we're so excited to get other groups involved with developing L8X.
There are a couple of points that I think Peter may have misunderstood, and I'd like to make sure that others aren't similarly mistaken; I'd highly encourage anyone to look at the Lemon8-XML Architecture diagram to get an idea of how L8X is designed internally:
1. Build a converter that can take structured word processing documents and map them to the NLM XML format used by L8X. ICE offers one well worked out structure for generic documents, others may exist for specific formats.
This is a common misconception, but an understandable one given the prominence of the NLM DTD among many of those interested in L8X and, admittedly, in much of the L8X literature. L8X doesn't use any particular schema internally, although it does store hierarchical structure. All of the metadata fields (and this applies to citations as well) are dynamic, based on the parsers and the UI. The rationale for this is the same one that Peter alludes to, and the one behind most XML-based document schemas: to work for any kind of generic document. The whole point of L8X is to help people add structure to their documents, without being tied to a specific schema.
2. Build a structure-guessing application to add structure to word processing documents (something which Ian Barnes has been chipping away at for a while).
At the risk of repeating my statements above about a "decent structure guessing application", this is exactly what the document parser and citation parser in L8X are designed to do: add structure to word processing documents.
With both of these in place you can improve documents in the wild as you go; every time someone submits a draft add styles and give it back to them, rather than trying to guess structure at the end. I would like to see this embedded in the OJS journal management system from PKP so that authors get rapid and continual feedback every time they upload a draft. This would allow some editorial and review processes to take place in an HTML interface as well – rather than via PDF on word processing files.
If you leave L8X as the final step, authors will have little feedback as to how they can improve the structure of their drafts.
Likewise, see above regarding the need for a more interactive interface into the document parser and real-time feedback. This is certainly the direction we're taking with our integration of L8X into OJS (and OMP): the idea is to use L8X earlier and earlier in the journal publication process, to the point that the author can get immediate feedback on how "well-formatted" their article is. That would also enable rich document interaction, such as automatically suggesting indexing keywords based on content, checking citation quality levels, displaying various galley formats pre-publication, or even offering submission fee incentives for articles that are sufficiently well-formatted by the author. It's good to see that we're heading in the right direction here, and considering that the original purpose of L8X was purely to generate galleys, we've undoubtedly got our work cut out for us.
My two-part plan would see re-ordering sections in L8X become redundant – word processors have outlining tools with which you can reorder content, so why try to do it through an HTML interface?
This is my fault for not being clear in the documentation to date. L8X does have a fairly rudimentary editor component to it, but this is not designed (at least, presently) to compete with rich-document editors or word processors. The guiding principle behind L8X is to do as much "structure guessing" as possible automatically, but provide the user with the ability to make corrections for the inevitable things that L8X has guessed incorrectly. The flip-side of this approach is that, with a sufficiently well-formed document in the first place, the amount of manual correction required should be minimal. This is one of the reasons why both the online demo and the source code of L8X include a sample document to give an example of what is currently considered "well-formatted".
On a technical note, last time I looked at L8X I concluded that Docvert is a weak link – it tries to use XSLT to guess structure; our experience with ICE was that XSLT (version one at least) was not a productive way to do this as the austere functional programming environment in XSLT made the structure-reasoning code very hard to maintain and very slow, so we moved to a more traditional parser written in Python which is much easier for typical programmers to work with.
There's definitely some confusion around the relationship between Docvert and L8X, which I'm hoping will become clearer with time and as more people try to install both for themselves. Docvert is "simply" (although it's a pretty powerful application) a wrapper tool around OpenOffice.org that "normalizes" uploaded documents into ODT so that L8X can try to parse them. All of the parsing and structure guessing is done in PHP using DOM and XPath, and this is done for exactly the reasons that Peter specifies: a procedural language like PHP is essential for the kind of processing that needs to be done. Moreover, making it modular means that the parser can be replaced, re-written, or enhanced relatively easily, making L8X much more extensible and accessible to developers.
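To give a feel for what structure guessing over normalized ODT can look like in procedural code, here is a minimal sketch. It is written in Python with the standard library's ElementTree rather than PHP's DOM and XPath, and it is not L8X's actual parser. The sample document, the reliance on a paragraph style named "Title", and the shape of the returned dictionary are all assumptions for illustration; only the ODF namespaces and the `text:h`, `text:p`, `text:outline-level` and `text:style-name` names come from the OpenDocument format itself.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
H = f"{{{TEXT_NS}}}h"
P = f"{{{TEXT_NS}}}p"
LEVEL = f"{{{TEXT_NS}}}outline-level"
STYLE = f"{{{TEXT_NS}}}style-name"

# Hypothetical, heavily trimmed body of an ODT content.xml file.
SAMPLE = """<office:text
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:p text:style-name="Title">A Study of Lemons</text:p>
  <text:h text:outline-level="1">Introduction</text:h>
  <text:p text:style-name="Text_20_body">Lemons are sour.</text:p>
  <text:h text:outline-level="2">Background</text:h>
  <text:p text:style-name="Text_20_body">They are also yellow.</text:p>
  <text:h text:outline-level="1">Methods</text:h>
  <text:p text:style-name="Text_20_body">We tasted them.</text:p>
</office:text>"""


def guess_structure(odt_xml: str) -> dict:
    """Turn a flat run of headings and paragraphs into nested sections."""
    root = ET.fromstring(odt_xml)
    doc = {"title": None, "level": 0, "paragraphs": [], "sections": []}
    stack = [doc]  # stack[-1] is the innermost section still open
    for el in root.iter():
        if el.tag == H:
            level = int(el.get(LEVEL, "1"))
            # Close every open section at the same or a deeper level.
            while len(stack) > 1 and stack[-1]["level"] >= level:
                stack.pop()
            section = {"title": "".join(el.itertext()).strip(),
                       "level": level, "paragraphs": [], "sections": []}
            stack[-1]["sections"].append(section)
            stack.append(section)
        elif el.tag == P:
            text = "".join(el.itertext()).strip()
            if el.get(STYLE) == "Title" and doc["title"] is None:
                doc["title"] = text  # assumes the author used a "Title" style
            elif text:
                stack[-1]["paragraphs"].append(text)
    return doc
```

Calling `guess_structure(SAMPLE)` yields a document whose top-level sections are "Introduction" and "Methods", with "Background" nested inside "Introduction". A real parser has to cope with documents that carry no heading markup at all, which is where the guessing (and the need for manual correction) comes in.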
L8X does use XSLT, however, in the export stage, to transform the internal "Lemon8 XML schema" (if it can be called that) into whatever export XML formats are desired, such as NLM DTD, Docbook, TEI, and so on. This is precisely the domain that XSLT was designed for, and it does a remarkable job of making it easy to create additional export formats from L8X by writing a single mapping XSL to the desired output.
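The "one mapping per export format" idea can be sketched without XSLT. The Python example below swaps the XSL stylesheet for a plain dictionary of element renames, purely because the Python standard library has no XSLT processor; it is not how L8X performs its export. The internal element names (`document`, `section`, `heading`, `para`) are hypothetical stand-ins for the internal Lemon8 XML, and the output is a simplified fragment using each target vocabulary's element names, not a schema-valid NLM, DocBook, or TEI document.

```python
import xml.etree.ElementTree as ET

# One rename table per export format. The keys are hypothetical internal
# element names; the values are element names from each target vocabulary.
EXPORT_MAPS = {
    "nlm":     {"document": "body",    "section": "sec",
                "heading": "title",    "para": "p"},
    "docbook": {"document": "article", "section": "section",
                "heading": "title",    "para": "para"},
    "tei":     {"document": "body",    "section": "div",
                "heading": "head",     "para": "p"},
}

INTERNAL = (
    "<document><section><heading>Introduction</heading>"
    "<para>Lemons are sour.</para></section></document>"
)


def export(internal_xml: str, fmt: str) -> str:
    """Rename every element according to the mapping for `fmt`."""
    mapping = EXPORT_MAPS[fmt]
    root = ET.fromstring(internal_xml)
    for el in root.iter():
        # Elements without a mapping keep their internal name.
        el.tag = mapping.get(el.tag, el.tag)
    return ET.tostring(root, encoding="unicode")
```

For example, `export(INTERNAL, "tei")` returns `<body><div><head>Introduction</head><p>Lemons are sour.</p></div></body>`. Supporting another output format means adding one more entry to `EXPORT_MAPS`, which mirrors the appeal of writing a single mapping stylesheet per format.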
I should close by thanking Peter once again for bringing these points up -- although one of the goals of the beta release was to make L8X available for people to download and work with for themselves, I consider cultivating discussion on the approaches we've chosen to be an equal goal, since the problem we're trying to solve is undoubtedly an extremely common and challenging one. We want to be sure that the best ideas are incorporated into L8X, and at the very least, conversations like this will only improve the application for everyone.
MJ
