Difference between revisions of "Lemon8-XML Roadmap"

(Adapted L8X roadmap to reflect re-prioritization of OMP development.)
 
(48 intermediate revisions by 3 users not shown)
 
=Development Roadmap=
 
=Development Roadmap=
  
 +
== Q1 2009 ==
  
==Milestone 1 - Private Alpha Release (Q3/Q4 2007)==
+
This is the initial release of the 1.x line, which will shortly move into maintenance mode; we will still track and address major and security-related bugs, and you are encouraged to browse [http://pkp.sfu.ca/bugzilla our Bugzilla database].
  
'''<big>Release Objectives:</big>''' Establish a stable initial preview release.
+
== Q3 2009 ==
  
'''<big>Citation Parser</big>'''
+
As of Q3 2009, development on L8X as a stand-alone application has been halted in favor of refactoring the L8X functionality into the [[PKP WAL Roadmap|PKP Web Application Library]]. The rationale for this approach is to provide direct integration with OJS and OCS, as well as functionality for the initial release of OMP. Users can expect a major change to bring the UI in line with the rest of the PKP suite, while keeping much of the dynamic interface of the 1.x line.
  
* <s>Major XML refactor</s>
+
== Q4 2009/Q1 2010 ==
* <s>Remove NLM-specific generation</s>
+
* Port all L8X's citation parsing/lookup/editing functionality to OJS
* <s>Citation lookup based on UI metadata, not parsed text (eg. PMID)</s>
+
** citation lookup filters
* <s>Add [http://paracite.eprints.org/developers/ ParaCite] parsing</s>
+
** citation parsers
* <s>Handle embedded URLs</s> and access-dates
+
* Specify and develop supporting infrastructure
* <s>Use author list as (additional?) Pubmed lookup method</s>
+
** meta-data framework
 +
** filter framework
  
'''<big>Framework</big>'''
+
== Q2 2010 ==
 +
* Specify and implement citation assistant user interface
 +
* Implement citation output use-cases
 +
** addition of citation data in XML export (NLM/PubMed, Synergies)
 +
** Allow readers to view citations in citation output formats (APA, MLA, Vancouver)
 +
* Initial release of the citation markup assistant in OJS
  
* <s>Integrate [http://holloway.co.nz/docvert/ Docvert]</s>
+
== Not yet scheduled ==
* <s>More informative upload/parsing messages</s>
+
=== Originally Scheduled for 2010 (Pushed back in favor of OMP development) ===
 +
* Additional citation output use cases:
 +
** addition of citation data in XML export (e.g. for PubMed, Synergies, and CrossRef)
 +
** generation of COinS (Context Object in Span) from citations, including Zotero integration
 +
** Allow readers to view citations in all existing citation output formats (EndNote?, RefWorks? integration)
 +
* Add document parsing/editing capability to OJS
 +
** automatic citation data extraction from ODT in submission process
 +
** add section parser / editor to editorial process (generate and edit full semantic XML structure) in OJS
 +
* Implement XML-to-PDF and XML-to-HTML rendering
 +
* Add document conversion capability to OJS
 +
** automatic document conversion during submission process (*.*)->(*.odt) to allow automatic extraction for more formats
 +
* Add L8X's meta-data extraction to OJS
 +
** automatic metadata extraction from ODT in submission process
 +
* Market migrated parsing/lookup code as a standalone library
  
 +
=== Additional Use Cases ===
 +
* Copyediting: Author matching between the names used in the body of the text and the names used in the citations, checking both spelling and the reference links between text and bibliography (author with no reference; reference with no link to the body of the text);
 +
* Copyediting: Quotation checking, where a quote in the body of the text is checked against the web for accuracy, with candidates proposed for comparison and correction, as well as reference checking;
 +
* Plagiarism: Random check of not-quoted bits of text for matches and possible plagiarism.
  
==Milestone 2 - Public Beta Release ('''Current - Q2 2008''')==
+
=== Usability ===
 +
* Let users "lock" citations once they are in their final state. Locked citations won't be overwritten by parser or lookup results.
 +
* Introduce a "batch processing mode" for citation parsing/lookup
 +
** keep the application responsive while citation parsing is going on in the background
 +
** do citation parsing/lookup during off-hours (e.g. every night)
  
'''<big>Release Objectives:</big>''' Remove legacy code and provide a stable foundation for beta testing.
+
=== Document Parsing ===
 +
* Let users configure "content types" (document types) to improve parsing and reduce manual work for batches of similar documents
 +
# extract styles from sample document
 +
# extract sections from sample document
 +
# let user attribute semantic information to styles and sections (e.g. first section = always contains author information)
 +
# parse document (metadata, citations, structure) batch based on these specific user definitions
 +
* Additional file conversion based on plugins: XSLT, ICE, GD, ImageMagick, etc.
 +
* Integrate [http://viewer.opencalais.com/ OpenCalais] service for metadata identification and extraction.
 +
** using OpenCalais on the full-text of an article is less accurate, though it does a pretty good job of finding entities
 +
** use L8X to detect the front, body, and back matter of a document, then:
 +
**# send the front matter to Calais to be broken into metadata (more accurately than we do now)
 +
**# send the back matter to the L8X citation handling and associated parse/lookup services
 +
**# send the body to e.g. Lucene for full-text indexing and/or Calais for automatic keyword assignment (this works well, e.g. with medical terms in MeSH, etc.)
  
'''<big>Document Parser</big>'''
+
=== Citation Parsing ===
 +
* Use machine-learning approaches (e.g. data mining/classifiers) to improve parser results
  
* <s>Major XML refactor</s>
+
=== Citation Lookup ===
* <s>Remove NLM-specific metadata parsing</s>
+
* Integrate more citation lookup services: OAIster, CiteSeer, Amazon, LibraryThing, OpenLibrary, SRU/SRW, Z39.50
* <s>Extract section hierarchy</s>
+
* generic OAI-DC: maybe with a local Harvester as meta-data cache and as a search interface?
 +
* Port source adapters from Umlaut project, see http://umlaut.rubyforge.org/.
  
'''<big>Section Editor</big>'''
+
=== Citation Output ===
 +
* Implement citation output plug-ins for Chicago Manual of Style, American Medical Association, American Sociological Association and Council of Science Editors (see mails to pkp-support from Mark and John, 20/10/2009)
 +
* Auto-COinS plugin (WAL): generate COinS in HTML/abstract view for marked references in textarea
 +
* Apply reading tools to references within articles (provide additional information about cited works in RT sidebar)
  
* <s>Enable reordering sections</s>
+
=== Document Export ===
* <s>Enable delete section</s>
+
* Additional XML schemas for export
* Change section heading level
+
* Add/upload new figure
+
  
'''<big>Citation Editor</big>'''
+
=== Backporting to other Applications ===
 +
* Extend L8X functionality to OCS and OMP
 +
* Add citation support to Harvester
 +
** If a metadata element in Harvester looks like a citation, parse the citation and render it in HTML with COinS
 +
** use Harvester to retrieve additional citation meta-data that will be attached to the meta-data we already retrieve (i.e. every single harvester record may contain or point to additional citation records)
  
* <s>Enable delete/reorder citations</s>
+
=Additional Requirements=
* Add new citation
+
* No new initial installation requirements
 
+
* Maintain PHP4 compatibility for initial installation; introduce new installation requirements (additional software, PHP>4) only for optional plug-ins, a notable example being the citation editor/parser/lookup, which requires at least PHP 5.0
'''<big>Framework</big>'''
+
* Thorough documentation of additional installation / runtime environment requirements
 
+
* Make sure that L8X functionality will be easily portable to other PKP products (OMP, OCS, Harvester)
* <s>Upgrade to [http://www.cakephp.org/ CakePHP] 1.2 beta</s>
+
* Closely integrate with OMP to make sure that the GUI components will work in OMP without adaptation
* Per-user accounts and self-signup with captcha (eg. [http://recaptcha.net/ reCaptcha] ?)
+
* All contributions should be fully unit-test covered
* Convert incoming HTML entities in content into UTF-8
+
* All workflows should be fully web-test covered
* Refactor document controller using unbinding
+
* Basic installer script
+
 
+
'''<big>Support/Development</big>'''
+
 
+
* Add source code to PKP CVS system
+
* Open Bugzilla tracker for issues
+
 
+
 
+
==Milestone 3 - 1.0 Proposed (Q3 2008)==
+
 
+
'''<big>Personalization</big>'''
+
 
+
* Allow upload of custom XSL/CSS for preview/export
+
* Set default metadata values (eg. copyright statement)
+
 
+
'''<big>Document Parser</big>'''
+
 
+
* "Garbage collection" of content that is left unparsed
+
 
+
'''<big>XML Export</big>'''
+
 
+
* Move xref detection into generateOutputXML() & remove NLM schema; add superscript detection and compare to list of references
+
* XML pre-validation
+
* Add Docbook DTD export schema
+
 
+
'''<big>Framework</big>'''
+
 
+
* Refactor to add plugin classes: lookup, export, import, metadata schema
+
 
+
'''<big>Reported Bugs</big>'''
+
 
+
* <s>References with ndash don't get parsed: e.g., [1–6] will not get parsed, but [1-6] will.</s>
+
* <s>A large amount of text gets missed in some documents (Peter Sefton)</s>
+
* <s>Conflict of interest does not get sent to <back> matter</s>
+
* Parser does not find a title in a .odt when it is in the document properties (Peter Sefton)
+
* Does not gracefully deal with "et al"
+
* Adding author or affiliation goes to error screen
+
* Extraneous "Aff1" with nothing attached to it - on first author.
+
* Affiliations are numbered by their metadata ID, not by their sequence.
+
* Occasionally random "<name name-style="western">" appear when an article has many, many authors
+
* Does not properly output "et al" for citations with more than X authors (6?) (in HTML)
+
* Does not at all output <publisher-loc> or <publisher-name>
+
 
+
 
+
==Proposed for Future Release==
+
 
+
'''<big>Metadata Editor</big>'''
+
 
+
* Enable multiple article ID
+
* Add primary author selector / role-aff association
+
* Add acknowledgements, reviewers, review dates, etc.
+
* Create markup for abstract sections in XHTML
+
* Enable collapsible sections (authors, affiliations, etc.)
+
 
+
'''<big>Section Editor</big>'''
+
 
+
* Enable collapsible sections
+
* Add/paste/edit XHTML tables and sections (TinyMCE)
+
 
+
'''<big>Citation Editor</big>'''
+
 
+
* Lookup/parse UI consistency
+
* Citation types and elements mapped to NLM
+
 
+
'''<big>HTML/PDF Preview</big>'''
+
 
+
* Investigate move to [http://www.digitaljunkies.ca/dompdf/ DOMPDF] from FOP
+
* Improve PDF XSL (as per [http://pkp.sfu.ca/ojs OJS] development)
+
* Tweak XHTML stylesheets for tables/figures
+
 
+
'''<big>XML Export</big>'''
+
 
+
* Integrate Pubmed Central [http://www.pubmedcentral.nih.gov/about/PMC_Utilities.html Style Checker / Article Previewer] and feedback from Open Medicine
+
* NLM: metadata generation w/full aff linking
+
* NLM: Improve figure/abstract/list transformation
+
* NLM: Add figures/tables to xref detection
+
* Add Erudit DTD export schema
+
* Add initial [http://www.scribus.net/ Scribus] 1.5 DTD export schema
+
 
+
'''<big>Framework</big>'''
+
 
+
* Add form data validation
+
* Better error/warning messages (eg. citations, required fields, etc.)
+
* Full I18n and L10n to French, Spanish
+
 
+
 
+
=Tutorials=
+
===[[Five steps to an XML document]]===
+

Latest revision as of 17:02, 7 September 2010

Development Roadmap

Q1 2009

This is the initial release of the 1.x line, which will shortly move into maintenance mode; we will still track and address major and security-related bugs, and you are encouraged to browse our Bugzilla database.

Q3 2009

As of Q3 2009, development on L8X as a stand-alone application has been halted in favor of refactoring the L8X functionality into the PKP Web Application Library. The rationale for this approach is to provide direct integration with OJS and OCS, as well as functionality for the initial release of OMP. Users can expect a major change to bring the UI in line with the rest of the PKP suite, while keeping much of the dynamic interface of the 1.x line.

Q4 2009/Q1 2010

  • Port all L8X's citation parsing/lookup/editing functionality to OJS
    • citation lookup filters
    • citation parsers
  • Specify and develop supporting infrastructure
    • meta-data framework
    • filter framework
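
The roadmap does not specify how the filter framework is designed. As a rough illustration only, one way to think about "citation parsers" and "citation lookup filters" is as small functions that each take a citation record and return an enriched one, chained in sequence. Every name below (parse_year, make_lookup_filter, run_filters, the dictionary keys) is hypothetical and not taken from the L8X or PKP code base.

```python
import re
from typing import Callable, Dict, List

Citation = Dict[str, str]
Filter = Callable[[Citation], Citation]


def parse_year(citation: Citation) -> Citation:
    """Toy parser filter: pull a parenthesised four-digit year out of the raw text."""
    match = re.search(r"\((\d{4})\)", citation["raw"])
    if match:
        return {**citation, "year": match.group(1)}
    return citation


def make_lookup_filter(known: Dict[str, Citation]) -> Filter:
    """Toy lookup filter: enrich a citation from a local table keyed by PMID.

    A real lookup filter would query an external service; the table stands in
    for that here so the sketch stays self-contained.
    """
    def lookup(citation: Citation) -> Citation:
        extra = known.get(citation.get("pmid", ""), {})
        # Values already present on the citation win over looked-up values.
        return {**extra, **citation}
    return lookup


def run_filters(citation: Citation, filters: List[Filter]) -> Citation:
    """Apply each filter in order, feeding the output of one into the next."""
    for apply_filter in filters:
        citation = apply_filter(citation)
    return citation
```

Usage: run_filters({"raw": "Doe J. (2008). A title.", "pmid": "123"}, [parse_year, make_lookup_filter(table)]) returns the original record plus whatever the parser and the lookup table could add.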

Q2 2010

  • Specify and implement citation assistant user interface
  • Implement citation output use-cases
    • addition of citation data in XML export (NLM/PubMed, Synergies)
    • Allow readers to view citations in citation output formats (APA, MLA, Vancouver)
  • Initial release of the citation markup assistant in OJS

Not yet scheduled

Originally Scheduled for 2010 (Pushed back in favor of OMP development)

  • Additional citation output use cases:
    • addition of citation data in XML export (e.g. for PubMed, Synergies, and CrossRef)
    • generation of COinS (Context Object in Span) from citations, including Zotero integration
    • Allow readers to view citations in all existing citation output formats (EndNote?, RefWorks? integration)
  • Add document parsing/editing capability to OJS
    • automatic citation data extraction from ODT in submission process
    • add section parser / editor to editorial process (generate and edit full semantic XML structure) in OJS
  • Implement XML-to-PDF and XML-to-HTML rendering
  • Add document conversion capability to OJS
    • automatic document conversion during submission process (*.*)->(*.odt) to allow automatic extraction for more formats
  • Add L8X's meta-data extraction to OJS
    • automatic metadata extraction from ODT in submission process
  • Market migrated parsing/lookup code as a standalone library
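
For the COinS item above, a minimal sketch may help. COinS embeds an OpenURL ContextObject (Z39.88-2004) as URL-encoded key/value pairs in the title attribute of an empty span with class Z3988, which tools such as Zotero can pick up. The function name coins_span and the shape of the input dictionary are assumptions made for illustration; the sketch only handles journal-article fields and is not the planned implementation.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(citation: dict) -> str:
    """Build a COinS span for a journal article.

    `citation` maps OpenURL journal keys without the "rft." prefix
    (e.g. "atitle", "jtitle", "date", "aulast") to their values.
    """
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    pairs += [(f"rft.{key}", value) for key, value in citation.items()]
    # URL-encode the key/value pairs, then HTML-escape the result so the
    # "&" separators are valid inside an attribute value.
    title = escape(urlencode(pairs), quote=True)
    return f'<span class="Z3988" title="{title}"></span>'
```

Usage: coins_span({"atitle": "A title", "jtitle": "Example Journal", "date": "2008"}) returns a span that can be placed next to the rendered citation in the HTML view.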

Additional Use Cases

  • Copyediting: Author matching between the names used in the body of the text and the names used in the citations, checking both spelling and the reference links between text and bibliography (author with no reference; reference with no link to the body of the text);
  • Copyediting: Quotation checking, where a quote in the body of the text is checked against the web for accuracy, with candidates proposed for comparison and correction, as well as reference checking;
  • Plagiarism: Random check of not-quoted bits of text for matches and possible plagiarism.
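
The first copyediting use case above (author with no reference; reference with no link to the body) can be sketched as a set comparison. This is a simplified illustration that assumes author-year citations of the form "(Smith, 2008)" in the body and a bibliography already reduced to (surname, year) pairs; the function name cross_check and both assumptions are hypothetical, and numbered citation styles would need a different pattern.

```python
import re
from typing import List, Tuple

Ref = Tuple[str, str]  # (surname, year)


def cross_check(body: str, bibliography: List[Ref]) -> Tuple[List[Ref], List[Ref]]:
    """Compare in-text author-year citations with the bibliography.

    Returns (cited_but_not_listed, listed_but_not_cited).
    """
    # Matches "(Smith, 2008)" and "(Smith 2008)"; multi-author citations are ignored.
    cited = set(re.findall(r"\(([A-Z][\w'-]+),? (\d{4})\)", body))
    listed = set(bibliography)
    return sorted(cited - listed), sorted(listed - cited)
```

Usage: the first returned list holds citations in the text that have no bibliography entry, the second holds bibliography entries that are never cited in the text. A copyeditor would still need to review both lists, since spelling variants show up as mismatches.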

Usability

  • Let users "lock" citations once they are in their final state. Locked citations won't be overwritten by parser or lookup results.
  • Introduce a "batch processing mode" for citation parsing/lookup
    • keep the application responsive while citation parsing is going on in the background
    • do citation parsing/lookup during off-hours (e.g. every night)
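
The locking behaviour described above can be illustrated with a small sketch: a locked citation simply refuses updates from parser or lookup results. The class and function names (Citation, apply_result) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    fields: dict = field(default_factory=dict)
    locked: bool = False  # set by the user once the citation is final


def apply_result(citation: Citation, result: dict) -> bool:
    """Merge a parser or lookup result into the citation unless it is locked.

    Returns True if the citation was updated, False if it was skipped.
    """
    if citation.locked:
        return False
    citation.fields.update(result)
    return True
```

A background or nightly batch run could call apply_result for every citation and rely on the return value to report how many citations were skipped because they were locked.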

Document Parsing

  • Let users configure "content types" (document types) to improve parsing and reduce manual work for batches of similar documents
  1. extract styles from sample document
  2. extract sections from sample document
  3. let user attribute semantic information to styles and sections (e.g. first section = always contains author information)
  4. parse document (metadata, citations, structure) batch based on these specific user definitions
  • Additional file conversion based on plugins: XSLT, ICE, GD, ImageMagick, etc.
  • Integrate OpenCalais service for metadata identification and extraction.
    • using OpenCalais on the full-text of an article is less accurate, though it does a pretty good job of finding entities
    • use L8X to detect the front, body, and back matter of a document, then:
      1. send the front matter to Calais to be broken into metadata (more accurately than we do now)
      2. send the back matter to the L8X citation handling and associated parse/lookup services
      3. send the body to e.g. Lucene for full-text indexing and/or Calais for automatic keyword assignment (this works well, e.g. with medical terms in MeSH, etc.)
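
The three-way routing described above (front matter to metadata extraction, back matter to citation handling, body to indexing) can be sketched as follows. The splitting heuristic used here, which looks for "Introduction" and "References" headings, is a placeholder assumption and not how L8X detects document sections; the handler functions stand in for OpenCalais, the citation services, and Lucene, none of which are called.

```python
from typing import Callable, Dict, List

Parts = Dict[str, List[str]]


def split_matter(paragraphs: List[str]) -> Parts:
    """Split a document into front, body, and back matter.

    Placeholder heuristic: the body starts at an "Introduction" heading and
    the back matter starts at a "References" heading.
    """
    lowered = [p.strip().lower() for p in paragraphs]
    body_start = lowered.index("introduction") if "introduction" in lowered else 0
    back_start = lowered.index("references") if "references" in lowered else len(paragraphs)
    return {
        "front": paragraphs[:body_start],
        "body": paragraphs[body_start:back_start],
        "back": paragraphs[back_start:],
    }


def route(parts: Parts, handlers: Dict[str, Callable[[List[str]], object]]) -> Dict[str, object]:
    """Send each part of the document to the handler registered for it."""
    return {name: handlers[name](text) for name, text in parts.items()}
```

Usage: route(split_matter(paragraphs), {"front": extract_metadata, "body": index_text, "back": parse_citations}), where the three handlers are whatever services are configured for each part.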

Citation Parsing

  • Use machine-learning approaches (e.g. data mining/classifiers) to improve parser results

Citation Lookup

  • Integrate more citation lookup services: OAIster, CiteSeer, Amazon, LibraryThing, OpenLibrary, SRU/SRW, Z39.50
  • generic OAI-DC: maybe with a local Harvester as meta-data cache and as a search interface?
  • Port source adapters from Umlaut project, see http://umlaut.rubyforge.org/.

Citation Output

  • Implement citation output plug-ins for Chicago Manual of Style, American Medical Association, American Sociological Association and Council of Science Editors (see mails to pkp-support from Mark and John, 20/10/2009)
  • Auto-COinS plugin (WAL): generate COinS in HTML/abstract view for marked references in textarea
  • Apply reading tools to references within articles (provide additional information about cited works in RT sidebar)

Document Export

  • Additional XML schemas for export

Backporting to other Applications

  • Extend L8X functionality to OCS and OMP
  • Add citation support to Harvester
    • If a metadata element in Harvester looks like a citation, parse the citation and render it in HTML with COinS
    • use Harvester to retrieve additional citation meta-data that will be attached to the meta-data we already retrieve (i.e. every single harvester record may contain or point to additional citation records)

Additional Requirements

  • No new initial installation requirements
  • Maintain PHP4 compatibility for initial installation; introduce new installation requirements (additional software, PHP>4) only for optional plug-ins, a notable example being the citation editor/parser/lookup, which requires at least PHP 5.0
  • Thorough documentation of additional installation / runtime environment requirements
  • Make sure that L8X functionality will be easily portable to other PKP products (OMP, OCS, Harvester)
  • Closely integrate with OMP to make sure that the GUI components will work in OMP without adaptation
  • All contributions should be fully unit-test covered
  • All workflows should be fully web-test covered