The XML Markup Evaluation Corpus
PKP is pleased to announce the release of our automated XML markup evaluation corpus. This corpus is a major component of PKP’s Smarter Scholarly Texts for Cross-Platform Publishing, Text-mining and Indexing project. When we began development of our new automated XML markup pipeline (http://github.com/pkp/xmlps) several years ago, we realized there was no easy way to assess the accuracy of our parsing tools across a broad spectrum of documents. Publishers do not make typically formatted author drafts available at scale. Even among our publishing partners, these authentically messy documents do not correspond 1:1 with the eventual published versions, and therefore aren’t suitable for automated comparison.
This corpus was created over the past year to provide empirical metrics for the accuracy of our XML markup pipeline. It consists of two sets of approximately 850 documents each: one set containing original author manuscripts in Word format (6,134 pages) with matching professionally typeset JATS XML, and one set containing published articles in PDF format (5,794 pages) with matching machine-generated JATS front matter.
The corpus is “split” this way to reflect the challenges of capturing “real world,” differently formatted articles. The body text parsing in our stack primarily targets Word input, because body text formatting tends to vary the most across author manuscripts. The front matter parser in our stack primarily targets PDF input, because front matter and metadata formatting tend to vary the most across published articles.
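To give a rough sense of how a reference JATS file could be used to score parser output, here is a minimal sketch in Python. It is not the comparison method used in our pipeline. The function names (`front_matter_fields`, `field_accuracy`), the choice of fields, and the sample documents are all hypothetical; only the `article-title` and `surname` elements come from JATS itself.

```python
import xml.etree.ElementTree as ET


def front_matter_fields(jats_xml: str) -> dict:
    """Extract a few front matter fields from a JATS document string.

    Hypothetical helper: a real evaluation would cover many more fields.
    """
    root = ET.fromstring(jats_xml)
    title = root.find(".//article-title")
    surnames = [s.text.strip() for s in root.iter("surname") if s.text]
    return {
        # itertext() keeps text nested in inline markup such as <italic>
        "title": "".join(title.itertext()).strip() if title is not None else "",
        "surnames": surnames,
    }


def field_accuracy(gold_xml: str, candidate_xml: str) -> float:
    """Fraction of extracted fields that match the reference exactly."""
    gold = front_matter_fields(gold_xml)
    candidate = front_matter_fields(candidate_xml)
    matches = sum(1 for key in gold if gold[key] == candidate[key])
    return matches / len(gold)


# Made-up sample documents, for illustration only.
GOLD = (
    "<article><front><article-meta>"
    "<title-group><article-title>Open Access in Practice</article-title></title-group>"
    "<contrib-group><contrib><name><surname>Smith</surname>"
    "<given-names>A.</given-names></name></contrib></contrib-group>"
    "</article-meta></front></article>"
)
CANDIDATE = GOLD.replace("Smith", "Smyth")

if __name__ == "__main__":
    print(field_accuracy(GOLD, GOLD))
    print(field_accuracy(GOLD, CANDIDATE))
```

Exact string matching is the simplest possible choice here. A more forgiving evaluation might normalize whitespace and case, or use an edit-distance threshold per field.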
Many organizations contributed to the creation of this corpus. Co-Action Publishing and INASP provided source material. The Charlesworth Group provided professional typesetting. Funding for this and continuing work on our XML pipeline has been provided by the Canadian Internet Registration Authority and by mediaX at Stanford University with a generous contribution from Konica Minolta. As we announced at our 2015 AGM, it is our intent for this corpus to serve as a community resource for any other automated typesetting projects or related initiatives.